Executive Summary¶

In this assignment, we have to clean and explore the StarWars dataset. The data is collected by running a poll through SurveyMonkey, there were 1,186 respondents. The description of the questions asked in the survey is given below:

Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being our least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
Which character shot first?
Are you familiar with the Expanded Universe?
Do you consider yourself to be a fan of the Expanded Universe?
Do you consider yourself to be a fan of the Star Trek franchise?
Gender
Age
Household Income
Education
Location (Census Region)

Task 1: Data Preparation¶

1. Import Pacakges¶

Import all the necessary packages: numpy, pandas and matplotlib etc.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

2. Read in Data File¶

Read in StarWars.csv into dataframe df and assign the column names as below:

column 1 - respondentId
column 2 - any6Films, corresponds to the question 'Have you seen any of the 6 films in the Star Wars franchise?'
column 3 - fanOfStarWars, corresponds to the question 'Do you consider yourself to be a fan of the Star Wars film franchise?'
column 4 to 9 - Seen_I_ThePhantomMenace, Seen_II_AttackoftheClones, Seen_III_RevengeoftheSith, Seen_IV_aNewHope, Seen_V_TheEmpireStrikesBack, Seen_VI_ReturnofTheJedi, correspnds to question 'Which of the following Star Wars films have you seen?'
column 10 to 15 - Rank_I_ThePhantomMenace, Rank_II_AttackoftheClones, Rank_III_RevengeoftheSith, Rank_IV_aNewHope, Rank_V_TheEmpireStrikesBack, Rank_VI_ReturnofTheJedi, corresponds to question 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
- column 16 to 29 - Favour_HanSolo, Favour_LukeSkywalker, Favour_PrincessLeiaOrgana, Favour_AnakinSkywalker, Favour_ObiWanKenobi, Favour_EmperorPalpatine, Favour_DarthVader, Favour_LandoCalrissian, Favour_BobaFett, Favour_C-3P0, Favour_R2D2, Favour_JarJarBinks, Favour_PadmeAmidala, Favour_Yoda, corresponds to the question 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.'
- column 30 - firstShotCharacter, corresponds to question 'Which character shot first?'
- column 31 - knowExpUniverse, corresponds to question 'Are you familiar with the Expanded Universe?'
- column 32 - fanOfExpUniverse', corresponds to question 'Do you consider yourself to be a fan of the Expanded Universe?'
- column 33 - fanOfStarTrekFranchise, corresponds to question 'Do you consider yourself to be a fan of the Star Trek franchise?'
- column 34 - Gender
- column 35 - Age
- column 36 - HHIncome, represents household income
- column 37 - Education
- column 38 - Region

df = pd.read_csv("StarWars.csv", names=['respondentId', 'any6Films', 'fanOfStarWars', 'Seen_I_ThePhantomMenace', 'Seen_II_AttackoftheClones', 'Seen_III_RevengeoftheSith', 'Seen_IV_aNewHope', 'Seen_V_TheEmpireStrikesBack','Seen_VI_ReturnofTheJedi', 'Rank_I_ThePhantomMenace', 'Rank_II_AttackoftheClones', 'Rank_III_RevengeoftheSith', 'Rank_IV_aNewHope', 'Rank_V_TheEmpireStrikesBack','Rank_VI_ReturnofTheJedi', 'Favour_HanSolo', 'Favour_LukeSkywalker', 'Favour_PrincessLeiaOrgana', 'Favour_AnakinSkywalker', 'Favour_ObiWanKenobi', 'Favour_EmperorPalpatine', 'Favour_DarthVader', 'Favour_LandoCalrissian', 'Favour_BobaFett', 'Favour_C-3P0', 'Favour_R2D2', 'Favour_JarJarBinks', 'Favour_PadmeAmidala', 'Favour_Yoda', 'firstShotCharacter', 'knowExpUniverse', 'fanOfExpUniverse', 'fanOfStarTrekFranchise', 'Gender', 'Age', 'HHIncome', 'Education', 'Region'], index_col=False, header = None, skiprows=2)

3. Check Dimensions¶

Check the size of the loaded-in dataset. (x, y): x - Number of Records, y - Number of fields.

df.shape

(1186, 38)

4. Show content¶

By displaying the first 5 rows of the datasets, as there are 38 columns in the dataset, we need to set the option display.max_columns to display all the columns.

pd.set_option('display.max_columns', 50)
df.head(5)

5. Show Data Types¶

Show the data types of each read-in columns

df.dtypes

respondentId                     int64
any6Films                       object
fanOfStarWars                   object
Seen_I_ThePhantomMenace         object
Seen_II_AttackoftheClones       object
Seen_III_RevengeoftheSith       object
Seen_IV_aNewHope                object
Seen_V_TheEmpireStrikesBack     object
Seen_VI_ReturnofTheJedi         object
Rank_I_ThePhantomMenace        float64
Rank_II_AttackoftheClones      float64
Rank_III_RevengeoftheSith      float64
Rank_IV_aNewHope               float64
Rank_V_TheEmpireStrikesBack    float64
Rank_VI_ReturnofTheJedi        float64
Favour_HanSolo                  object
Favour_LukeSkywalker            object
Favour_PrincessLeiaOrgana       object
Favour_AnakinSkywalker          object
Favour_ObiWanKenobi             object
Favour_EmperorPalpatine         object
Favour_DarthVader               object
Favour_LandoCalrissian          object
Favour_BobaFett                 object
Favour_C-3P0                    object
Favour_R2D2                     object
Favour_JarJarBinks              object
Favour_PadmeAmidala             object
Favour_Yoda                     object
firstShotCharacter              object
knowExpUniverse                 object
fanOfExpUniverse                object
fanOfStarTrekFranchise          object
Gender                          object
Age                             object
HHIncome                        object
Education                       object
Region                          object
dtype: object

6. Descriptive Statistics¶

Numerical columns¶

Only respondentID and all the ranking columns are numeric. All the Ranking columns range from 1 to 6.

df.describe(include=np.number).round(2)

Categorical columns¶

Show the unique values and counts for each categorical variables

categorical_cols = df.columns[df.dtypes==object].tolist()
for col in df.columns:
    print(f"Column Name: {col}")
    print(df[col].unique())
    print(df[col].value_counts())
    print("\n")

Column Name: respondentId
[3292879998 3292879538 3292765271 ... 3288375286 3288373068 3288372923]
3289370623    1
3290063543    1
3289539921    1
3290206907    1
3291022013    1
             ..
3289861464    1
3289481310    1
3290944858    1
3288653149    1
3289878528    1
Name: respondentId, Length: 1186, dtype: int64


Column Name: any6Films
['Yes' 'No' 'Yes ']
Yes     935
No      250
Yes       1
Name: any6Films, dtype: int64


Column Name: fanOfStarWars
['Yes' nan 'No' 'Yess' 'Noo']
Yes     551
No      283
Yess      1
Noo       1
Name: fanOfStarWars, dtype: int64


Column Name: Seen_I_ThePhantomMenace
['Star Wars: Episode I  The Phantom Menace' nan]
Star Wars: Episode I  The Phantom Menace    673
Name: Seen_I_ThePhantomMenace, dtype: int64


Column Name: Seen_II_AttackoftheClones
['Star Wars: Episode II  Attack of the Clones' nan]
Star Wars: Episode II  Attack of the Clones    571
Name: Seen_II_AttackoftheClones, dtype: int64


Column Name: Seen_III_RevengeoftheSith
['Star Wars: Episode III  Revenge of the Sith' nan]
Star Wars: Episode III  Revenge of the Sith    550
Name: Seen_III_RevengeoftheSith, dtype: int64


Column Name: Seen_IV_aNewHope
['Star Wars: Episode IV  A New Hope' nan]
Star Wars: Episode IV  A New Hope    607
Name: Seen_IV_aNewHope, dtype: int64


Column Name: Seen_V_TheEmpireStrikesBack
['Star Wars: Episode V The Empire Strikes Back' nan]
Star Wars: Episode V The Empire Strikes Back    758
Name: Seen_V_TheEmpireStrikesBack, dtype: int64


Column Name: Seen_VI_ReturnofTheJedi
['Star Wars: Episode VI Return of the Jedi' nan]
Star Wars: Episode VI Return of the Jedi    738
Name: Seen_VI_ReturnofTheJedi, dtype: int64


Column Name: Rank_I_ThePhantomMenace
[ 3. nan  1.  5.  6.  4.  2.]
4.0    237
6.0    168
3.0    130
1.0    129
5.0    100
2.0     71
Name: Rank_I_ThePhantomMenace, dtype: int64


Column Name: Rank_II_AttackoftheClones
[ 2. nan  6.  4.  5.  1.  3.]
5.0    300
4.0    183
2.0    116
3.0    103
6.0    102
1.0     32
Name: Rank_II_AttackoftheClones, dtype: int64


Column Name: Rank_III_RevengeoftheSith
[ 1. nan  3.  6.  4.  5.  2.]
6.0    217
5.0    203
4.0    182
3.0    150
2.0     47
1.0     36
Name: Rank_III_RevengeoftheSith, dtype: int64


Column Name: Rank_IV_aNewHope
[ 4. nan  2.  6.  3.  1.  5.]
1.0    204
6.0    161
2.0    135
4.0    130
3.0    127
5.0     79
Name: Rank_IV_aNewHope, dtype: int64


Column Name: Rank_V_TheEmpireStrikesBack
[ 5. nan  4.  1.  2.  3.  6.]
1.0    289
2.0    235
5.0    118
3.0    106
4.0     47
6.0     41
Name: Rank_V_TheEmpireStrikesBack, dtype: int64


Column Name: Rank_VI_ReturnofTheJedi
[ 6. nan  3.  2.  1.  4.  5.]
2.0    232
3.0    220
1.0    146
6.0    145
4.0     57
5.0     36
Name: Rank_VI_ReturnofTheJedi, dtype: int64


Column Name: Favour_HanSolo
['Very favorably' nan 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Unfamiliar (N/A)' 'Very unfavorably']
Very favorably                                 610
Somewhat favorably                             151
Neither favorably nor unfavorably (neutral)     44
Unfamiliar (N/A)                                15
Somewhat unfavorably                             8
Very unfavorably                                 1
Name: Favour_HanSolo, dtype: int64


Column Name: Favour_LukeSkywalker
['Very favorably' nan 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Very unfavorably'
 'Unfamiliar (N/A)']
Very favorably                                 552
Somewhat favorably                             219
Neither favorably nor unfavorably (neutral)     38
Somewhat unfavorably                            13
Unfamiliar (N/A)                                 6
Very unfavorably                                 3
Name: Favour_LukeSkywalker, dtype: int64


Column Name: Favour_PrincessLeiaOrgana
['Very favorably' nan 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Very unfavorably'
 'Unfamiliar (N/A)']
Very favorably                                 547
Somewhat favorably                             210
Neither favorably nor unfavorably (neutral)     48
Somewhat unfavorably                            12
Unfamiliar (N/A)                                 8
Very unfavorably                                 6
Name: Favour_PrincessLeiaOrgana, dtype: int64


Column Name: Favour_AnakinSkywalker
['Very favorably' nan 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Very unfavorably'
 'Unfamiliar (N/A)']
Somewhat favorably                             269
Very favorably                                 245
Neither favorably nor unfavorably (neutral)    135
Somewhat unfavorably                            83
Unfamiliar (N/A)                                52
Very unfavorably                                39
Name: Favour_AnakinSkywalker, dtype: int64


Column Name: Favour_ObiWanKenobi
['Very favorably' nan 'Somewhat favorably' 'Very unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Unfamiliar (N/A)']
Very favorably                                 591
Somewhat favorably                             159
Neither favorably nor unfavorably (neutral)     43
Unfamiliar (N/A)                                17
Somewhat unfavorably                             8
Very unfavorably                                 7
Name: Favour_ObiWanKenobi, dtype: int64


Column Name: Favour_EmperorPalpatine
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Very unfavorably' 'Neither favorably nor unfavorably (neutral)'
 'Somewhat unfavorably']
Neither favorably nor unfavorably (neutral)    213
Unfamiliar (N/A)                               156
Somewhat favorably                             143
Very unfavorably                               124
Very favorably                                 110
Somewhat unfavorably                            68
Name: Favour_EmperorPalpatine, dtype: int64


Column Name: Favour_DarthVader
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Somewhat unfavorably' 'Very unfavorably'
 'Neither favorably nor unfavorably (neutral)']
Very favorably                                 310
Somewhat favorably                             171
Very unfavorably                               149
Somewhat unfavorably                           102
Neither favorably nor unfavorably (neutral)     84
Unfamiliar (N/A)                                10
Name: Favour_DarthVader, dtype: int64


Column Name: Favour_LandoCalrissian
['Unfamiliar (N/A)' nan 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Very favorably'
 'Somewhat unfavorably' 'Very unfavorably']
Neither favorably nor unfavorably (neutral)    236
Somewhat favorably                             223
Unfamiliar (N/A)                               148
Very favorably                                 142
Somewhat unfavorably                            63
Very unfavorably                                 8
Name: Favour_LandoCalrissian, dtype: int64


Column Name: Favour_BobaFett
['Unfamiliar (N/A)' nan 'Somewhat unfavorably' 'Very favorably'
 'Somewhat favorably' 'Neither favorably nor unfavorably (neutral)'
 'Very unfavorably']
Neither favorably nor unfavorably (neutral)    248
Somewhat favorably                             153
Very favorably                                 138
Unfamiliar (N/A)                               132
Somewhat unfavorably                            96
Very unfavorably                                45
Name: Favour_BobaFett, dtype: int64


Column Name: Favour_C-3P0
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Very unfavorably']
Very favorably                                 474
Somewhat favorably                             229
Neither favorably nor unfavorably (neutral)     79
Somewhat unfavorably                            23
Unfamiliar (N/A)                                15
Very unfavorably                                 7
Name: Favour_C-3P0, dtype: int64


Column Name: Favour_R2D2
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Very unfavorably']
Very favorably                                 562
Somewhat favorably                             185
Neither favorably nor unfavorably (neutral)     57
Unfamiliar (N/A)                                10
Somewhat unfavorably                            10
Very unfavorably                                 6
Name: Favour_R2D2, dtype: int64


Column Name: Favour_JarJarBinks
['Very favorably' nan 'Unfamiliar (N/A)' 'Very unfavorably'
 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)']
Very unfavorably                               204
Neither favorably nor unfavorably (neutral)    164
Somewhat favorably                             130
Very favorably                                 112
Unfamiliar (N/A)                               109
Somewhat unfavorably                           102
Name: Favour_JarJarBinks, dtype: int64


Column Name: Favour_PadmeAmidala
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Very unfavorably']
Neither favorably nor unfavorably (neutral)    207
Somewhat favorably                             183
Very favorably                                 168
Unfamiliar (N/A)                               164
Somewhat unfavorably                            58
Very unfavorably                                34
Name: Favour_PadmeAmidala, dtype: int64


Column Name: Favour_Yoda
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Very unfavorably' 'Neither favorably nor unfavorably (neutral)'
 'Somewhat unfavorably']
Very favorably                                 605
Somewhat favorably                             144
Neither favorably nor unfavorably (neutral)     51
Unfamiliar (N/A)                                10
Somewhat unfavorably                             8
Very unfavorably                                 8
Name: Favour_Yoda, dtype: int64


Column Name: firstShotCharacter
["I don't understand this question" nan 'Greedo' 'Han']
Han                                 325
I don't understand this question    306
Greedo                              197
Name: firstShotCharacter, dtype: int64


Column Name: knowExpUniverse
['Yes' nan 'No']
No     615
Yes    213
Name: knowExpUniverse, dtype: int64


Column Name: fanOfExpUniverse
['No' nan 'Yes' 'Yess']
No      114
Yes      98
Yess      1
Name: fanOfExpUniverse, dtype: int64


Column Name: fanOfStarTrekFranchise
['No' 'Yes' nan 'Noo' 'yes' 'no ']
No     639
Yes    426
yes      1
Noo      1
no       1
Name: fanOfStarTrekFranchise, dtype: int64


Column Name: Gender
['Male' nan 'Female' 'F' 'female' 'male']
Female    546
Male      496
female      2
F           1
male        1
Name: Gender, dtype: int64


Column Name: Age
['18-29' nan '500' '30-44' '> 60' '45-60']
45-60    291
> 60     269
30-44    268
18-29    217
500        1
Name: Age, dtype: int64


Column Name: HHIncome
[nan '$0 - $24,999' '$100,000 - $149,999' '$25,000 - $49,999'
 '$50,000 - $99,999' '$150,000+']
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: HHIncome, dtype: int64


Column Name: Education
['High school degree' 'Bachelor degree' 'Some college or Associate degree'
 nan 'Graduate degree' 'Less than high school degree']
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64


Column Name: Region
['South Atlantic' 'West South Central' 'West North Central'
 'Middle Atlantic' 'East North Central' 'Pacific' nan 'Mountain'
 'New England' 'East South Central']
East North Central    181
Pacific               175
South Atlantic        170
Middle Atlantic       122
West South Central    110
West North Central     93
Mountain               79
New England            75
East South Central     38
Name: Region, dtype: int64

7. Inspect and Clean (Categorical columns )¶

Inspect the values of all the categorical columns carefully and handle the irregular cardinality issues, such as stripping redundant whitespace, replace the capital letter mismatch and data entry error with the majority values with similar meaning

i. white space found in 'Yes ' in the column 'any6Films'

ii. Typo 'Yess' and 'Noo' found in the column 'fanOfStarWars'

iii. Typo 'Yess' found in the column 'fanOfExpUniverse'

iv. Typo 'Noo' and lower case letters 'yes' 'no' found in the column 'fanOfStarTrekFranchise'

v. Irregular cardinality 'F' and lower case letters 'male' 'female' found in the column 'Gender'

df['any6Films']=df['any6Films'].str.strip()  
df['fanOfStarWars']=df['fanOfStarWars'].replace(['Yess'], 'Yes') 
df['fanOfStarWars']=df['fanOfStarWars'].replace(['Noo'], 'No')
df['fanOfExpUniverse']=df['fanOfExpUniverse'].replace(['Yess'], 'Yes') 
df['fanOfStarTrekFranchise']=df['fanOfStarTrekFranchise'].replace(['yes'], 'Yes')  
df['fanOfStarTrekFranchise']=df['fanOfStarTrekFranchise'].replace(['Noo', 'no '], 'No')
df['Gender']=df['Gender'].replace(['male'], 'Male') 
df['Gender']=df['Gender'].replace(['female', 'F'], 'Female')

Cast all text data to upper-case and strip any extra white space by using str.upper() and str.strip().

categorical_cols = df.columns[df.dtypes==object].tolist()
for col in categorical_cols:
    print(f"Column Name: {col}")
    df[col]=df[col].str.upper()
    df[col]=df[col].str.strip()
    print(df[col].unique())
    print(df[col].value_counts())
    print("\n")

Column Name: any6Films
['YES' 'NO']
YES    936
NO     250
Name: any6Films, dtype: int64


Column Name: fanOfStarWars
['YES' nan 'NO']
YES    552
NO     284
Name: fanOfStarWars, dtype: int64


Column Name: Seen_I_ThePhantomMenace
['STAR WARS: EPISODE I  THE PHANTOM MENACE' nan]
STAR WARS: EPISODE I  THE PHANTOM MENACE    673
Name: Seen_I_ThePhantomMenace, dtype: int64


Column Name: Seen_II_AttackoftheClones
['STAR WARS: EPISODE II  ATTACK OF THE CLONES' nan]
STAR WARS: EPISODE II  ATTACK OF THE CLONES    571
Name: Seen_II_AttackoftheClones, dtype: int64


Column Name: Seen_III_RevengeoftheSith
['STAR WARS: EPISODE III  REVENGE OF THE SITH' nan]
STAR WARS: EPISODE III  REVENGE OF THE SITH    550
Name: Seen_III_RevengeoftheSith, dtype: int64


Column Name: Seen_IV_aNewHope
['STAR WARS: EPISODE IV  A NEW HOPE' nan]
STAR WARS: EPISODE IV  A NEW HOPE    607
Name: Seen_IV_aNewHope, dtype: int64


Column Name: Seen_V_TheEmpireStrikesBack
['STAR WARS: EPISODE V THE EMPIRE STRIKES BACK' nan]
STAR WARS: EPISODE V THE EMPIRE STRIKES BACK    758
Name: Seen_V_TheEmpireStrikesBack, dtype: int64


Column Name: Seen_VI_ReturnofTheJedi
['STAR WARS: EPISODE VI RETURN OF THE JEDI' nan]
STAR WARS: EPISODE VI RETURN OF THE JEDI    738
Name: Seen_VI_ReturnofTheJedi, dtype: int64


Column Name: Favour_HanSolo
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'UNFAMILIAR (N/A)' 'VERY UNFAVORABLY']
VERY FAVORABLY                                 610
SOMEWHAT FAVORABLY                             151
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     44
UNFAMILIAR (N/A)                                15
SOMEWHAT UNFAVORABLY                             8
VERY UNFAVORABLY                                 1
Name: Favour_HanSolo, dtype: int64


Column Name: Favour_LukeSkywalker
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY UNFAVORABLY'
 'UNFAMILIAR (N/A)']
VERY FAVORABLY                                 552
SOMEWHAT FAVORABLY                             219
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     38
SOMEWHAT UNFAVORABLY                            13
UNFAMILIAR (N/A)                                 6
VERY UNFAVORABLY                                 3
Name: Favour_LukeSkywalker, dtype: int64


Column Name: Favour_PrincessLeiaOrgana
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY UNFAVORABLY'
 'UNFAMILIAR (N/A)']
VERY FAVORABLY                                 547
SOMEWHAT FAVORABLY                             210
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     48
SOMEWHAT UNFAVORABLY                            12
UNFAMILIAR (N/A)                                 8
VERY UNFAVORABLY                                 6
Name: Favour_PrincessLeiaOrgana, dtype: int64


Column Name: Favour_AnakinSkywalker
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY UNFAVORABLY'
 'UNFAMILIAR (N/A)']
SOMEWHAT FAVORABLY                             269
VERY FAVORABLY                                 245
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    135
SOMEWHAT UNFAVORABLY                            83
UNFAMILIAR (N/A)                                52
VERY UNFAVORABLY                                39
Name: Favour_AnakinSkywalker, dtype: int64


Column Name: Favour_ObiWanKenobi
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'VERY UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'UNFAMILIAR (N/A)']
VERY FAVORABLY                                 591
SOMEWHAT FAVORABLY                             159
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     43
UNFAMILIAR (N/A)                                17
SOMEWHAT UNFAVORABLY                             8
VERY UNFAVORABLY                                 7
Name: Favour_ObiWanKenobi, dtype: int64


Column Name: Favour_EmperorPalpatine
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'VERY UNFAVORABLY' 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
 'SOMEWHAT UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    213
UNFAMILIAR (N/A)                               156
SOMEWHAT FAVORABLY                             143
VERY UNFAVORABLY                               124
VERY FAVORABLY                                 110
SOMEWHAT UNFAVORABLY                            68
Name: Favour_EmperorPalpatine, dtype: int64


Column Name: Favour_DarthVader
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'SOMEWHAT UNFAVORABLY' 'VERY UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)']
VERY FAVORABLY                                 310
SOMEWHAT FAVORABLY                             171
VERY UNFAVORABLY                               149
SOMEWHAT UNFAVORABLY                           102
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     84
UNFAMILIAR (N/A)                                10
Name: Favour_DarthVader, dtype: int64


Column Name: Favour_LandoCalrissian
['UNFAMILIAR (N/A)' nan 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY FAVORABLY'
 'SOMEWHAT UNFAVORABLY' 'VERY UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    236
SOMEWHAT FAVORABLY                             223
UNFAMILIAR (N/A)                               148
VERY FAVORABLY                                 142
SOMEWHAT UNFAVORABLY                            63
VERY UNFAVORABLY                                 8
Name: Favour_LandoCalrissian, dtype: int64


Column Name: Favour_BobaFett
['UNFAMILIAR (N/A)' nan 'SOMEWHAT UNFAVORABLY' 'VERY FAVORABLY'
 'SOMEWHAT FAVORABLY' 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
 'VERY UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    248
SOMEWHAT FAVORABLY                             153
VERY FAVORABLY                                 138
UNFAMILIAR (N/A)                               132
SOMEWHAT UNFAVORABLY                            96
VERY UNFAVORABLY                                45
Name: Favour_BobaFett, dtype: int64


Column Name: Favour_C-3P0
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'VERY UNFAVORABLY']
VERY FAVORABLY                                 474
SOMEWHAT FAVORABLY                             229
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     79
SOMEWHAT UNFAVORABLY                            23
UNFAMILIAR (N/A)                                15
VERY UNFAVORABLY                                 7
Name: Favour_C-3P0, dtype: int64


Column Name: Favour_R2D2
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'VERY UNFAVORABLY']
VERY FAVORABLY                                 562
SOMEWHAT FAVORABLY                             185
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     57
SOMEWHAT UNFAVORABLY                            10
UNFAMILIAR (N/A)                                10
VERY UNFAVORABLY                                 6
Name: Favour_R2D2, dtype: int64


Column Name: Favour_JarJarBinks
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'VERY UNFAVORABLY'
 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)']
VERY UNFAVORABLY                               204
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    164
SOMEWHAT FAVORABLY                             130
VERY FAVORABLY                                 112
UNFAMILIAR (N/A)                               109
SOMEWHAT UNFAVORABLY                           102
Name: Favour_JarJarBinks, dtype: int64


Column Name: Favour_PadmeAmidala
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'VERY UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    207
SOMEWHAT FAVORABLY                             183
VERY FAVORABLY                                 168
UNFAMILIAR (N/A)                               164
SOMEWHAT UNFAVORABLY                            58
VERY UNFAVORABLY                                34
Name: Favour_PadmeAmidala, dtype: int64


Column Name: Favour_Yoda
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'VERY UNFAVORABLY' 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
 'SOMEWHAT UNFAVORABLY']
VERY FAVORABLY                                 605
SOMEWHAT FAVORABLY                             144
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     51
UNFAMILIAR (N/A)                                10
SOMEWHAT UNFAVORABLY                             8
VERY UNFAVORABLY                                 8
Name: Favour_Yoda, dtype: int64


Column Name: firstShotCharacter
["I DON'T UNDERSTAND THIS QUESTION" nan 'GREEDO' 'HAN']
HAN                                 325
I DON'T UNDERSTAND THIS QUESTION    306
GREEDO                              197
Name: firstShotCharacter, dtype: int64


Column Name: knowExpUniverse
['YES' nan 'NO']
NO     615
YES    213
Name: knowExpUniverse, dtype: int64


Column Name: fanOfExpUniverse
['NO' nan 'YES']
NO     114
YES     99
Name: fanOfExpUniverse, dtype: int64


Column Name: fanOfStarTrekFranchise
['NO' 'YES' nan]
NO     641
YES    427
Name: fanOfStarTrekFranchise, dtype: int64


Column Name: Gender
['MALE' nan 'FEMALE']
FEMALE    549
MALE      497
Name: Gender, dtype: int64


Column Name: Age
['18-29' nan '500' '30-44' '> 60' '45-60']
45-60    291
> 60     269
30-44    268
18-29    217
500        1
Name: Age, dtype: int64


Column Name: HHIncome
[nan '$0 - $24,999' '$100,000 - $149,999' '$25,000 - $49,999'
 '$50,000 - $99,999' '$150,000+']
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: HHIncome, dtype: int64


Column Name: Education
['HIGH SCHOOL DEGREE' 'BACHELOR DEGREE' 'SOME COLLEGE OR ASSOCIATE DEGREE'
 nan 'GRADUATE DEGREE' 'LESS THAN HIGH SCHOOL DEGREE']
SOME COLLEGE OR ASSOCIATE DEGREE    328
BACHELOR DEGREE                     321
GRADUATE DEGREE                     275
HIGH SCHOOL DEGREE                  105
LESS THAN HIGH SCHOOL DEGREE          7
Name: Education, dtype: int64


Column Name: Region
['SOUTH ATLANTIC' 'WEST SOUTH CENTRAL' 'WEST NORTH CENTRAL'
 'MIDDLE ATLANTIC' 'EAST NORTH CENTRAL' 'PACIFIC' nan 'MOUNTAIN'
 'NEW ENGLAND' 'EAST SOUTH CENTRAL']
EAST NORTH CENTRAL    181
PACIFIC               175
SOUTH ATLANTIC        170
MIDDLE ATLANTIC       122
WEST SOUTH CENTRAL    110
WEST NORTH CENTRAL     93
MOUNTAIN               79
NEW ENGLAND            75
EAST SOUTH CENTRAL     38
Name: Region, dtype: int64

8. Handling Missing Values¶

Check¶

Check missing values in each columns

df.isna().sum()

respondentId                     0
any6Films                        0
fanOfStarWars                  350
Seen_I_ThePhantomMenace        513
Seen_II_AttackoftheClones      615
Seen_III_RevengeoftheSith      636
Seen_IV_aNewHope               579
Seen_V_TheEmpireStrikesBack    428
Seen_VI_ReturnofTheJedi        448
Rank_I_ThePhantomMenace        351
Rank_II_AttackoftheClones      350
Rank_III_RevengeoftheSith      351
Rank_IV_aNewHope               350
Rank_V_TheEmpireStrikesBack    350
Rank_VI_ReturnofTheJedi        350
Favour_HanSolo                 357
Favour_LukeSkywalker           355
Favour_PrincessLeiaOrgana      355
Favour_AnakinSkywalker         363
Favour_ObiWanKenobi            361
Favour_EmperorPalpatine        372
Favour_DarthVader              360
Favour_LandoCalrissian         366
Favour_BobaFett                374
Favour_C-3P0                   359
Favour_R2D2                    356
Favour_JarJarBinks             365
Favour_PadmeAmidala            372
Favour_Yoda                    360
firstShotCharacter             358
knowExpUniverse                358
fanOfExpUniverse               973
fanOfStarTrekFranchise         118
Gender                         140
Age                            140
HHIncome                       328
Education                      150
Region                         143
dtype: int64

Fix Categoical columns¶

By examining the content of the dataset, we have the below findings:

i. Column fanOfStarWars, knowExpUniverse, fanOfExpUniverse, fanOfStarTrekFranchise intended for respondents to answer 'Yes' or 'No', however, some of the respondents have skipped the question and did not provide any answers.

ii. It is intended for respondents to select the respective answers for the following columns, however, beside these unique values, some of the respondents have skipped the question and did not provide any answers

`firstShotCharacter` : I DON'T UNDERSTAND THIS QUESTION, GREEDO, HAN

`Gender` : FEMALE, MALE

`Age`: '18-29', '500', '30-44', '> 60', '45-60'

`HHIncome`:  '\\$0 - \\$24,999',  '\\$100,000 - \\$149,999',  '\\$25,000 - \\$49,999', '\\$50,000 - \\$99,999', '\\$150,000+'

`Education`: HIGH SCHOOL DEGREE, BACHELOR DEGREE, SOME COLLEGE OR ASSOCIATE DEGREE, GRADUATE DEGREE, LESS THAN HIGH SCHOOL DEGREE

`Region`: SOUTH ATLANTIC, WEST SOUTH CENTRAL, WEST NORTH CENTRAL, MIDDLE ATLANTIC, EAST NORTH CENTRAL, PACIFIC, MOUNTAIN, NEW ENGLAND, EAST SOUTH CENTRAL

We would impute the value 'N/A' for all the above columns with missing values.

df.loc[(df['fanOfStarWars'].isna()), 'fanOfStarWars']='N/A'
df.loc[(df['knowExpUniverse'].isna()), 'knowExpUniverse']='N/A'
df.loc[(df['fanOfExpUniverse'].isna()), 'fanOfExpUniverse']='N/A'
df.loc[(df['fanOfStarTrekFranchise'].isna()), 'fanOfStarTrekFranchise']='N/A'
df.loc[(df['firstShotCharacter'].isna()), 'firstShotCharacter']='N/A'
df.loc[(df['Gender'].isnull()==True), 'Gender'] = 'N/A'
df.loc[(df['Age'].isnull()==True), 'Age'] = 'N/A'
df.loc[(df['HHIncome'].isnull()==True), 'HHIncome'] = 'N/A'
df.loc[(df['Education'].isnull()==True), 'Education'] = 'N/A'
df.loc[(df['Region'].isnull()==True), 'Region'] = 'N/A'

Fix the columns prefix with `Seen`¶

For all the columns with column name prefix with Seen, if the respondents have seen the movie, the movie name (only that movie name) would be listed as the values of the entries, otherwise no value is shown. It would be easier to manipulate the information, if we encode 1 for the entries which are not null, and 0 for entires which are null for all these columns.

seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
        print(f"Column Name: {col} has value of:", df[col].unique() )

Column Name: Seen_I_ThePhantomMenace has value of: ['STAR WARS: EPISODE I  THE PHANTOM MENACE' nan]
Column Name: Seen_II_AttackoftheClones has value of: ['STAR WARS: EPISODE II  ATTACK OF THE CLONES' nan]
Column Name: Seen_III_RevengeoftheSith has value of: ['STAR WARS: EPISODE III  REVENGE OF THE SITH' nan]
Column Name: Seen_IV_aNewHope has value of: ['STAR WARS: EPISODE IV  A NEW HOPE' nan]
Column Name: Seen_V_TheEmpireStrikesBack has value of: ['STAR WARS: EPISODE V THE EMPIRE STRIKES BACK' nan]
Column Name: Seen_VI_ReturnofTheJedi has value of: ['STAR WARS: EPISODE VI RETURN OF THE JEDI' nan]

seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
    df.loc[~(df[col].isna()), col]=1
    df.loc[(df[col].isna()), col]=0

Checking the unique values of these columns after encoding.

seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
    print(f"Column Name: {col}")
    print(df[col].unique())
    print(df[col].value_counts())

Column Name: Seen_I_ThePhantomMenace
[1 0]
1    673
0    513
Name: Seen_I_ThePhantomMenace, dtype: int64
Column Name: Seen_II_AttackoftheClones
[1 0]
0    615
1    571
Name: Seen_II_AttackoftheClones, dtype: int64
Column Name: Seen_III_RevengeoftheSith
[1 0]
0    636
1    550
Name: Seen_III_RevengeoftheSith, dtype: int64
Column Name: Seen_IV_aNewHope
[1 0]
1    607
0    579
Name: Seen_IV_aNewHope, dtype: int64
Column Name: Seen_V_TheEmpireStrikesBack
[1 0]
1    758
0    428
Name: Seen_V_TheEmpireStrikesBack, dtype: int64
Column Name: Seen_VI_ReturnofTheJedi
[1 0]
1    738
0    448
Name: Seen_VI_ReturnofTheJedi, dtype: int64

Fix the columns prefix with `Rank`¶

For all the columns with column name prefix with 'Rank', respondents are rating each star wars movies order of preference ( 1 being favorite, 6 being least favorite), check whether all these columns are only input with values range from 1 to 6.

def checkRank(col):
    is1to6=True
    for i in np.arange(len(col)):
        if(np.isnan(col[i])==False):
            if (col[i].is_integer()== False):
                print('             Row ', i, 'contains an non-integer')
                is1to6=False
            else:
                check = 1 <= col[i] <= 6
                if (check == False):
                    print('             Row', i, 'has value of', col[i], 'and does not range from 1 to 6')
                    is1to6=False
    return is1to6

rank_cols = df.columns[df.columns.str.find('Rank', 0, 4)!=-1].tolist()
for col in rank_cols:
    print(f"Column Name: {col}" )
    btn1and6 = checkRank(df[col])
    if btn1and6 is True:
        print("             ONLY contains values between 1 to 6")

Column Name: Rank_I_ThePhantomMenace
             ONLY contains values between 1 to 6
Column Name: Rank_II_AttackoftheClones
             ONLY contains values between 1 to 6
Column Name: Rank_III_RevengeoftheSith
             ONLY contains values between 1 to 6
Column Name: Rank_IV_aNewHope
             ONLY contains values between 1 to 6
Column Name: Rank_V_TheEmpireStrikesBack
             ONLY contains values between 1 to 6
Column Name: Rank_VI_ReturnofTheJedi
             ONLY contains values between 1 to 6

After validation in previous step we know that entires with missing values are those that without any responses on the preference of the star wars movies. As the datatypes of these columns are ordinal. We should keep these columns numerical for easier manipulations, thus it would be wiser to impute the entries without any responses to another unique value (-1).

rank_cols = df.columns[df.columns.str.find('Rank', 0, 4)!=-1].tolist()
for col in rank_cols:
    df.loc[(df[col].isna()), col]=-1
    df[col]=df[col].astype('int64')

Fix the columns prefix with `Favour`¶

For all the columns with column name prefix with 'Favour', respondents are rating each star wars character with either one of the following likings (p.s. all likings turn into upper case, since we cast all textual data into upper case):

* 'UNFAMILIAR (N/A)'
* 'VERY UNFAVORABLY'
* 'SOMEWHAT UNFAVORABLY'
* 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
* 'SOMEWHAT FAVORABLY'
* 'VERY FAVORABLY'

def checkFavourite(col):
    isLikingValid=True
    liking = ['UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'] 
    for i in np.arange(len(col)):
        if(pd.isna(col[i])==False): 
            check = (col[i] in liking)
            if (check == False):
                print('             Row', i, 'has value of', col[i], 'and is not what suppose to be in the liking list.')
                isLikingValid=False
    return isLikingValid

print("Liking list values are 'UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'\n" )
fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
for col in fav_cols:
    print(f"Column Name: {col}" )
    likingInList = checkFavourite(df[col])
    if likingInList is True:
        print("             ONLY contains values in the liking list")

Liking list values are 'UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'

Column Name: Favour_HanSolo
             ONLY contains values in the liking list
Column Name: Favour_LukeSkywalker
             ONLY contains values in the liking list
Column Name: Favour_PrincessLeiaOrgana
             ONLY contains values in the liking list
Column Name: Favour_AnakinSkywalker
             ONLY contains values in the liking list
Column Name: Favour_ObiWanKenobi
             ONLY contains values in the liking list
Column Name: Favour_EmperorPalpatine
             ONLY contains values in the liking list
Column Name: Favour_DarthVader
             ONLY contains values in the liking list
Column Name: Favour_LandoCalrissian
             ONLY contains values in the liking list
Column Name: Favour_BobaFett
             ONLY contains values in the liking list
Column Name: Favour_C-3P0
             ONLY contains values in the liking list
Column Name: Favour_R2D2
             ONLY contains values in the liking list
Column Name: Favour_JarJarBinks
             ONLY contains values in the liking list
Column Name: Favour_PadmeAmidala
             ONLY contains values in the liking list
Column Name: Favour_Yoda
             ONLY contains values in the liking list

Two findings:

1) Respondents can provide no response to the liking on any of these characters which leaves the entries with empty values. 

2) Likings are ordinal

It would be more sensible to encode these columns as follows:

* 'UNFAMILIAR (N/A)' : 99
* 'VERY UNFAVORABLY':' : 0
* 'SOMEWHAT UNFAVORABLY' : 1
* 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' :2
* 'SOMEWHAT FAVORABLY' : 3
* 'VERY FAVORABLY' : 4

and -1 for entries without values

level_mapping = {'UNFAMILIAR (N/A)': 99, 'VERY UNFAVORABLY': 0, 'SOMEWHAT UNFAVORABLY': 1, 
                 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)': 2, 'SOMEWHAT FAVORABLY': 3, 'VERY FAVORABLY': 4}


fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
for col in fav_cols:
    df[col] = df[col].replace(level_mapping)
    df.loc[(df[col].isna()), col]=-1
    df[col]=df[col].astype('int64')

9. Apply test suite¶

Run a small test-suite to test all the value in the Age columns are in the defined age bracket. '18-29', '30-44', '> 60'and '45-60', since we imputed 'N/A' as the missing values, include this into the list.

ageBracket = ['18-29', '30-44', '> 60', '45-60', 'N/A'] 
for i in np.arange(len(df['Age'])):
        check = (df['Age'][i] in ageBracket)
        if (check == False):
            print("We found values out of Age Bracket at index", i, "with value", df['Age'][i])

We found values out of Age Bracket at index 12 with value 500

Since only one entry is found and 500 is definitely out of a normal age range, we imputed this with N/A, same as what we have done for missing values.

df.loc[df['Age']=='500', 'Age'] = 'N/A'

10. Validation after Fixes¶

Data Types¶

Show Data Types of all columns, Check the data types of all the columns in the dataframe after data preprocessing

df.dtypes

respondentId                    int64
any6Films                      object
fanOfStarWars                  object
Seen_I_ThePhantomMenace        object
Seen_II_AttackoftheClones      object
Seen_III_RevengeoftheSith      object
Seen_IV_aNewHope               object
Seen_V_TheEmpireStrikesBack    object
Seen_VI_ReturnofTheJedi        object
Rank_I_ThePhantomMenace         int64
Rank_II_AttackoftheClones       int64
Rank_III_RevengeoftheSith       int64
Rank_IV_aNewHope                int64
Rank_V_TheEmpireStrikesBack     int64
Rank_VI_ReturnofTheJedi         int64
Favour_HanSolo                  int64
Favour_LukeSkywalker            int64
Favour_PrincessLeiaOrgana       int64
Favour_AnakinSkywalker          int64
Favour_ObiWanKenobi             int64
Favour_EmperorPalpatine         int64
Favour_DarthVader               int64
Favour_LandoCalrissian          int64
Favour_BobaFett                 int64
Favour_C-3P0                    int64
Favour_R2D2                     int64
Favour_JarJarBinks              int64
Favour_PadmeAmidala             int64
Favour_Yoda                     int64
firstShotCharacter             object
knowExpUniverse                object
fanOfExpUniverse               object
fanOfStarTrekFranchise         object
Gender                         object
Age                            object
HHIncome                       object
Education                      object
Region                         object
dtype: object

Missing values¶

Check the missing values of all the columns, there shouldn't be any after data preprocessing.

df.isna().sum()

respondentId                   0
any6Films                      0
fanOfStarWars                  0
Seen_I_ThePhantomMenace        0
Seen_II_AttackoftheClones      0
Seen_III_RevengeoftheSith      0
Seen_IV_aNewHope               0
Seen_V_TheEmpireStrikesBack    0
Seen_VI_ReturnofTheJedi        0
Rank_I_ThePhantomMenace        0
Rank_II_AttackoftheClones      0
Rank_III_RevengeoftheSith      0
Rank_IV_aNewHope               0
Rank_V_TheEmpireStrikesBack    0
Rank_VI_ReturnofTheJedi        0
Favour_HanSolo                 0
Favour_LukeSkywalker           0
Favour_PrincessLeiaOrgana      0
Favour_AnakinSkywalker         0
Favour_ObiWanKenobi            0
Favour_EmperorPalpatine        0
Favour_DarthVader              0
Favour_LandoCalrissian         0
Favour_BobaFett                0
Favour_C-3P0                   0
Favour_R2D2                    0
Favour_JarJarBinks             0
Favour_PadmeAmidala            0
Favour_Yoda                    0
firstShotCharacter             0
knowExpUniverse                0
fanOfExpUniverse               0
fanOfStarTrekFranchise         0
Gender                         0
Age                            0
HHIncome                       0
Education                      0
Region                         0
dtype: int64

Task 2: Data Exploration¶

Explore the survey question:

Question 1 - Ranking¶

Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.

(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)}, then analysis how people rate Star Wars Movies.

Quick Link to Conclusion

Total Score

Rank for Most Favourite

1. Show Volume¶

Plot the volume of ranking of each star wars movies according to the value count of the columns with prefix Rank using a grouped bar chart:

x = np.arange(len(df['Rank_I_ThePhantomMenace'].value_counts()))
width = 0.12  # the width of the bars


fig, ax = plt.subplots()


rects1 = ax.bar(x + width, df['Rank_I_ThePhantomMenace'].value_counts().sort_index(), width, label='I_ThePhantomMenace')
rects2 = ax.bar(x + width*2, df['Rank_II_AttackoftheClones'].value_counts().sort_index(), width, label='II_AttackoftheClones')
rects3 = ax.bar(x + width*3, df['Rank_III_RevengeoftheSith'].value_counts().sort_index(), width, label='III_RevengeoftheSith')
rects4 = ax.bar(x + width*4, df['Rank_IV_aNewHope'].value_counts().sort_index(), width, label='IV_aNewHope')
rects5 = ax.bar(x + width*5, df['Rank_V_TheEmpireStrikesBack'].value_counts().sort_index(), width, label='V_TheEmpireStrikesBack')
rects6 = ax.bar(x + width*6, df['Rank_VI_ReturnofTheJedi'].value_counts().sort_index(), width, label='VI_ReturnofTheJedi')

ax.set_ylabel('Counts')
ax.set_xlabel('Favourite Score [-1: No response,  1 - 6 ( most favourite to least favourite)]')
ax.set_title('Rank Counts of Star Wars movies')
ax.set_xticks(x+ width*3)
ax.set_xticklabels(df['Rank_I_ThePhantomMenace'].value_counts().sort_index().index)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=3)
plt.savefig('OverAllRankCount.png', dpi=300, bbox_inches='tight')
plt.show()

The chart shows the following findings:

Each movie has approximately same number of no response on ranking
Most respondent vote V_TheEmpireStrikesBack as their most favourite movie
Most respondent vote III_RevengeoftheSith as their least favourite movie

Since the most favourite movie is ranked with score 1, and least favourite movie is ranked with score 6, we could even do a further analysis to compare the total scores of each movie and rate them in order. However, we can only calculate the total scores correctly if the ranking of the entries are valid, which is the respondents are giving ranking 1 to 6 to each star wars movie.

2. Find Invalid response¶

Subset all the columns which give ranking to the star wars movies. Since we know the values in these columns only range from 1 to 6. For valid ranking entry, the total score of that entry would sum up to be 21 (1 + 2 + 3 + 4 + 5 + 6). Check to see how many rows in the subset of data which does not sum up to be 21, excluding the rows with no response (i.e. total score will be -6 if there is no response on the ranking of all star wars movies)

rank = df.iloc[:, 9:15]
rank[rank.sum(axis=1)==21]
rank[(rank.sum(axis=1)!=21) & (rank.sum(axis=1)!=-6) ]

Only 2 entries are with partial response. Excluding these 2 entries should not have big effect on analysing the data, if this is just a small percentage out of the data set with valid ranking.

Let's find out how many entires with valid ranking. To do this, first we need to subset df to a new dataframe rank with the entries that only have total score sum up to be 21

rank = rank[rank.sum(axis=1)==21]

rank.shape

(834, 6)

3. Validate Ranking¶

Check whether each movie receives 834 ranking and all comprise by values of 1 - 6.

# Apply the function value_counts to all the column in rank, make the result to a new dataframe `rankCount`.
rankCount = rank.apply(pd.Series.value_counts).reset_index()
rankCount

# Sum the rankCount of each movie which should be equal to 834
np.sum(rankCount,axis=0)

index                           21
Rank_I_ThePhantomMenace        834
Rank_II_AttackoftheClones      834
Rank_III_RevengeoftheSith      834
Rank_IV_aNewHope               834
Rank_V_TheEmpireStrikesBack    834
Rank_VI_ReturnofTheJedi        834
dtype: int64

Let's make the column heading of rankCount more readable, strip the Rank suffix from the rank's column, and name index as 'Rank'

movie = rank.columns.str[5:].tolist()

movie

['I_ThePhantomMenace',
 'II_AttackoftheClones',
 'III_RevengeoftheSith',
 'IV_aNewHope',
 'V_TheEmpireStrikesBack',
 'VI_ReturnofTheJedi']

rankCountColName = ['Rank']
rankCountColName.extend(movie)
rankCountColName
rankCount.columns=rankCountColName
rankCount

np.sum(rankCount,axis=0)

Rank                       21
I_ThePhantomMenace        834
II_AttackoftheClones      834
III_RevengeoftheSith      834
IV_aNewHope               834
V_TheEmpireStrikesBack    834
VI_ReturnofTheJedi        834
dtype: int64

Sum rankCount by row and column, there are 834 counts for each ranking, and also 834 counts for each star wars movies. Thus all the entries in rank consist of valid ranking. We can use rank to sum up the scores.

rankCount.loc[:,rankCount.columns != 'Rank'].sum(axis=0)

I_ThePhantomMenace        834
II_AttackoftheClones      834
III_RevengeoftheSith      834
IV_aNewHope               834
V_TheEmpireStrikesBack    834
VI_ReturnofTheJedi        834
dtype: int64

SumbyRank = pd.DataFrame(rankCount.loc[:,rankCount.columns != 'Rank'].sum(axis=1))
SumbyRank.columns=['Sum']
pd.concat([rankCount['Rank'], SumbyRank ], axis=1)

4. Calculate Total Score¶

Sum the scores of each movie in rank

score = rank.sum(axis=0).reset_index()
score.columns=['movie_name', 'total']
score['movie_name']= movie
score = score.sort_values('total')
score

5. Ranked and watched All movies¶

From the handling missing values section, we noticed that none of the movie have been Seen for more than 800 respondents. Thus respondents giving valid ranking might not watch the movie before. It would be interesting to know if there is any change on the order of the total scores for these 6 movies if we only consider the rankings from respondents who have actually seen all of them. Subset column 4 to 16 from df to obtain all the information on ranking and seeing.

rankAllSeen = df.iloc[:, 3:15]

Filter rankAllSeen only with the entries which have watched all 6 star wars movies.

rankAllSeen = rankAllSeen[rankAllSeen.iloc[:,0:6].sum(axis=1)==6]
rankAllSeen.shape

(471, 12)

Further filter rankAllSeen with entries of valid ranking

rankAllSeen=rankAllSeen[rankAllSeen.iloc[:,6:].sum(axis=1)==21]
rankAllSeen.shape

(470, 12)

Drop the seeing column, as we only need the ranking column for analysis.

rankAllSeen = rankAllSeen.iloc[:,6:]

Sum the scores of each movie in rankAllSeen

scoreAllSeen = rankAllSeen.sum(axis=0).reset_index()
scoreAllSeen.columns=['movie_name', 'total']
scoreAllSeen['movie_name']= movie
scoreAllSeen = scoreAllSeen.sort_values('total')
scoreAllSeen

6. Visualize Total Score¶

Quick Link to conclusion

Plot score and scoreAllSeen in a side-by-side bar chart.

x = np.arange(len(score['movie_name'])) 

fig, axs = plt.subplots(1, 2, figsize=(9, 5), sharey=True)
axs[0].bar(score['movie_name'], score['total'])
axs[1].bar(scoreAllSeen['movie_name'], scoreAllSeen['total'])
rect0 = axs[0].patches
rect1 = axs[1].patches
axs[0].set_xticks(x)
axs[0].set_xticklabels(score['movie_name'], rotation=40, ha='right')
axs[0].set_title('Valid Ranking')
axs[1].set_xticklabels(scoreAllSeen['movie_name'], rotation=40, ha='right')
axs[1].set_title('Valid Ranking and watched all 6 movies')

fig.suptitle('Total Scores', fontsize=16)

for rect, label in zip(rect0, score['total']):
    height = rect.get_height()
    axs[0].text(rect.get_x() + rect.get_width() / 2, height + 5, label,
            ha='center', va='bottom')

for rect, label in zip(rect1, scoreAllSeen['total']):
    height = rect.get_height()
    axs[1].text(rect.get_x() + rect.get_width() / 2, height + 5, label,
            ha='center', va='bottom')
    

plt.savefig('TotalScore.png', dpi=300, bbox_inches='tight')
plt.show()

By looking at the above plot with least score on the left in ascending order, we have the following findings:

V_TheEmpireStrikesBack is the most favourite movie (for both situation)
VI_ReturnofTheJedi and IV_aNewHope would be the second or third favourite, if we only count respondents who have watched all 6 movies, order would swap for second and third favourite.
I_ThePhantomMenace would be the fourth favourite.
II_AttackoftheClones and III_RevengeoftheSith would be the fifth and least favourite, if we only count respondents who have watched all 6 movies, order would swap for fifth and least favourite.

7. Visualize Rank 1 (Most favourite) Count¶

Quick Link to Conclusion

We could also check if there is any difference in respondents voting the favourite movie (rank 1) if they have actually watched all 6 of them. Find the rank 1 count of each movie from rankCount.

rank1 = rankCount[rankCount['Rank']==1]
rank1 = rank1.iloc[:,1:].T.reset_index()
rank1.columns=['movie_name', 'count']
rank1 = rank1.sort_values('count', ascending=False)
rank1

Construct a dataframe which shows the value_counts for rankAllSeen

rankAllSeenCount = rankAllSeen.apply(pd.Series.value_counts).reset_index()

rankAllSeenCountColName = ['Rank']
rankAllSeenCountColName.extend(movie)
rankAllSeenCountColName
rankAllSeenCount.columns=rankCountColName
rankAllSeenCount

Find the rank 1 count of each movie from rankAllSeenCount.

rank1AllSeen = rankAllSeenCount[rankAllSeenCount['Rank']==1]
rank1AllSeen = rank1AllSeen.iloc[:,1:].T.reset_index()
rank1AllSeen.columns=['movie_name', 'count']
rank1AllSeen = rank1AllSeen.sort_values('count', ascending=False)
rank1AllSeen

Plot rank1 and rank1AllSeen in a side-by-side bar chart.

x = np.arange(len(rank1['movie_name'])) 

fig, axs = plt.subplots(1, 2, figsize=(9, 5), sharey=True)
axs[0].bar(rank1['movie_name'], rank1['count'])
axs[1].bar(rank1AllSeen['movie_name'], rank1AllSeen['count'])
axs[0].set_xticks(x)
axs[0].set_xticklabels(rank1['movie_name'], rotation=40, ha='right')
axs[0].set_title('Valid Ranking')
axs[1].set_xticklabels(rank1AllSeen['movie_name'], rotation=40, ha='right')
axs[1].set_title('Valid Ranking and watched all 6 movies')
fig.suptitle('Rank 1 count', fontsize=16)

plt.savefig('Rank1.png', dpi=300, bbox_inches='tight')
plt.show()

By looking at the above plot, we conclude that there is no change on the order for number of people voting on most favourite movie in both scenarios.

Question 2 - Pairwise comparison¶

Explore the relationships between columns; at least 3 visualisations with plausible hypothesis

1. Supplementary function : PrintStackRelation¶

Define a function printStackRelation which takes the following parameters and print out a horizontal stack bar chart of 2 different columns in df, first column would be drawn into horizontal bars, the count of unique value is displayed at the right hand side of each bar, the bar would be divided into different segments according to the percentage from the count of unique values in the second column :

a dataframe namely, realteDf which is contains row counts of unique values of the 2 chosen columns in the df dataframe using the group by function.
list of unique values in the first column of realteDf
list of unique values in the second column of relateDf
The string to display as the label of y-axis of the stack bar chart
The string to dispaly as the title of the stack bar chart
File name of the plot that we wanted to save as.

def printStackRelation(relateDf, col1Lvl, col2Lvl, ylabel, title, plotfile):
    data = np.zeros((len(col1Lvl), len(col2Lvl)))
    percent = np.zeros((len(col1Lvl), len(col2Lvl)))
    col1 = relateDf.columns[0]
    col2 = relateDf.columns[1]    
    j = 0
    k = 0
    for i in np.arange(len(col1Lvl)):
        for j in np.arange(len(col2Lvl)):
            if (sum((relateDf[col1] == col1Lvl[i]) &  (relateDf[col2] == col2Lvl[j])) == 0) :
                num = 0
            else:
                num=relateDf.loc[((relateDf[col1]==col1Lvl[i]) & (relateDf[col2]==col2Lvl[j] )),['count']].values[0]
            data[j, i] = num

    for i in np.arange(len(col1Lvl)):
        for j in np.arange(len(col2Lvl)):
            percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100

    y_pos = np.arange(len(col1Lvl))
    fig = plt.figure(figsize=(8,4))
    ax = fig.add_subplot(111)

    patch_handles = []
        # left alignment of data starts at zero
    left = np.zeros(len(col1Lvl)) 
    for i, d in enumerate(percent):
        patch_handles.append(ax.barh(y_pos, d, align='center', left=left, label=str(col2Lvl[i])))
        left += d

        # search all of the bar segments and annotate
    for j in range(len(patch_handles)):
        for i, patch in enumerate(patch_handles[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            ax.text(x,y, "%.1f%%" % (percent[j,i]), ha='center')
            if (j == (len(patch_handles) - 1)):
                    ax.text(108, y, "%d" % (data.sum(axis=0)[i]), ha='center')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(col1Lvl)
    plt.title(title, fontsize=14, loc='left', pad=10, x = 0.08)
    plt.legend(bbox_to_anchor=(0.6,1.02,1,0.2), loc="lower left", ncol=3)
    ax.set_xlabel('Percent (%)')
    ax.set_ylabel(ylabel, fontsize=14)
    plt.savefig(plotfile, dpi=300, bbox_inches='tight')
    
    plt.show()

2. 1st Comparison: Fans of Star Wars vs Star Trek¶

Quick Link to Conclusion

We chose to compare the relationship between Fans of star wars and Fans of star trek. Use the groupby function to construct the dataframe WarTrekDf with the cross tabular information on unique values by count between these 2 columns.

Hypothesis: Usually people who like/dislike star wars would have the same likings towards Star Trek, and vice versa.

WarTrekDf=df.groupby(['fanOfStarWars', 'fanOfStarTrekFranchise']).size().reset_index()
WarTrekDf.columns = ['fanOfStarWars', 'fanOfStarTrekFranchise', 'count']
WarTrekDf

Construct the list of all unique values from the fanOfStarWars column as StarWarLvl and from the fanOfStarTrekFranchise columns as StarTrekLvl.

StarWarLvl=df['fanOfStarWars'].unique()
StarTrekLvl=df['fanOfStarTrekFranchise'].unique()
print(StarWarLvl)
print(StarTrekLvl)

['YES' 'N/A' 'NO']
['NO' 'YES' 'N/A']

We observed the values are the same, however the order is different, we would have a better presentation on the order of segments in the stack bar chart if we arrange them similarly as below.

StarWarLvlOrder=[1,3,2]
StarWarLvl=[x for i, x in sorted(zip(StarWarLvlOrder, StarWarLvl))]
StarTrekLvlOrder=[2,1,3]
StarTrekLvl=[x for i, x in sorted(zip(StarTrekLvlOrder, StarTrekLvl))]
print(StarWarLvl)
print(StarTrekLvl)

['YES', 'NO', 'N/A']
['YES', 'NO', 'N/A']

Pass in the information we contructed in step 2 to the function printStackRelation to produce the horizontal percentage stack bar chart.

printStackRelation(WarTrekDf, StarWarLvl, StarTrekLvl, 'Fans of Star Wars', 'Fan of Star Trek','1. WarTrek.png' )

Construct the dataframe TrekWarDf with the same cross tabular information, however fanOfStarTrekFranchise would be the first column in the group by function.

TrekWarDf=df.groupby(['fanOfStarTrekFranchise', 'fanOfStarWars']).size().reset_index()
TrekWarDf.columns = ['fanOfStarTrekFranchise', 'fanOfStarWars', 'count']
TrekWarDf

Pass in the information we constructed in previous step 6 the function printStackRelation to produce the horizontal percentage stack bar chart. This time we would have Fans of star trek as the bar and Fans of star wars would be the segments.

printStackRelation(TrekWarDf, StarTrekLvl, StarWarLvl, 'Fan of Star Trek', 'Fans of Star Wars', '1. TrekWar.png')

The above graphs show that there are more respondents who are fans of star wars (552) compare to fans of star trek (427). If they are fans of Star Trek, there is a higher percentage they would like Star Wars (86.7%) compare to the reverse (as there are only 67% of Star Wars Fans who are fans of Star Trek). There are many more respondents who do not like Star Trek (641), compare to Star Wars (284), and if they don’t like Star Wars, there are a higher chance they would not like Star Trek (83.8%), while the reverse is not so significant (only 37.1%).

3. 2nd Comparison: Fans of Star wars vs Familiarity on Expanded Universe¶

Quick Link to Conclusion

We chose to compare the relationship between Fans of star wars and respondents who are you familiar with the Expanded Universe. Use the groupby function to construct the dataframe WarKnowUniDf with the cross tabular information on unique values by count between these 2 columns.

Hypothesis: Fans of Star wars should know about Expanded Universe

WarKnowUniDf=df.groupby(['fanOfStarWars', 'knowExpUniverse']).size().reset_index()
WarKnowUniDf.columns = ['fanOfStarWars', 'knowExpUniverse', 'count']
WarKnowUniDf

Construct the list of all unique values from the knowExpUniverse column as knowUniLvl

knowUniLvl=df['knowExpUniverse'].unique()
print(StarWarLvl)
print(knowUniLvl)

['YES', 'NO', 'N/A']
['YES' 'N/A' 'NO']

We observed the unique values are the same, however the order is different, we would have a better presentation on the order of segments in the stack bar chart if we arrange them similarly as below.

knowUniLvlOrder=[1,3,2]
knowUniLvl=[x for i, x in sorted(zip(knowUniLvlOrder, knowUniLvl))]
print(StarWarLvl)
print(knowUniLvl)

['YES', 'NO', 'N/A']
['YES', 'NO', 'N/A']

Pass in the information we constructed in step 10 to the function printStackRelation to produce the horizontal percentage stack bar chart.

printStackRelation(WarKnowUniDf, StarWarLvl, knowUniLvl, 'Fans of Star Wars', 'Know Expanded Universe', '2. WarKnowUni.png')

Construct the dataframe KnowUniWarDf with the same cross tabular information, however knowExpUniverse would be the first column in the group by function.

KnowUniWarDf=df.groupby(['knowExpUniverse', 'fanOfStarWars']).size().reset_index()
KnowUniWarDf.columns = ['knowExpUniverse', 'fanOfStarWars', 'count']
KnowUniWarDf

Pass in the information we constructed in previous step to the function printStackRelation to produce the horizontal percentage stack bar chart. This time we would have knowExpUniverse as the bar and Fans of star wars would be the segments.

printStackRelation(KnowUniWarDf, knowUniLvl, StarWarLvl, 'Know Expanded Universe', 'Fans of Star Wars', '2. KnowUniWar.png')

From the above 2 charts, we observe that the percentage of Fans of Star Wars having knowledge on Expanded Universe is surprisingly low (33.5%), but if they know about Expanded Universe, very likely they would be Fans of Star Wars. (86.9%). Slightly more than half of the respondents (58.5%) who don’t know about Expanded universe are Fans of Star Wars.

4. 3rd Comparison: Fans of Star Wars vs Fans of Expanded Universe¶

Quick Link to Conclusion

We chose to compare the relationship between Fans of star wars and Fans of expanded universe. Use the groupby function to construct the dataframe WarFansOfUniDf with the cross tabular information on unique values by count between these 2 columns.

Hypothesis: Fans of Expanded Universe most likely would be Star Wars Fans, presumes to be vice versa

WarFansOfUniDf=df.groupby(['fanOfStarWars', 'fanOfExpUniverse']).size().reset_index()
WarFansOfUniDf.columns = ['fanOfStarWars', 'fanOfExpUniverse', 'count']
WarFansOfUniDf

Construct the list of all unique values from the knowExpUniverse column as knowUniLvl

ExpUniLvl=df['fanOfExpUniverse'].unique()
print(StarWarLvl)
print(ExpUniLvl)

['YES', 'NO', 'N/A']
['NO' 'N/A' 'YES']

We observed the unique values are the same, however the order is different, we would have a better presentation on the order of segments in the stack bar chart if we arrange them similarly as below.

ExpUniLvlOrder=[2,3,1]
ExpUniLvl=[x for i, x in sorted(zip(ExpUniLvlOrder, ExpUniLvl))]
print(StarWarLvl)
print(ExpUniLvl)

['YES', 'NO', 'N/A']
['YES', 'NO', 'N/A']

Pass in the information we constructed in previous step to the function printStackRelation to produce the horizontal percentage stack bar chart.

printStackRelation(WarFansOfUniDf, StarWarLvl, ExpUniLvl, 'Fan of Star Wars', 'Fans of Expanded Universe', '3. WarFansUni.png')

Construct the dataframe FansOfUniWarDf with the same cross tabular information, however fanOfExpUniverse would be the first column in the group by function.

FansOfUniWarDf=df.groupby(['fanOfExpUniverse', 'fanOfStarWars']).size().reset_index()
FansOfUniWarDf.columns = ['fanOfExpUniverse', 'fanOfStarWars' , 'count']
FansOfUniWarDf

Pass in the information we constructed previous step to the function printStackRelation to produce the horizontal percentage stack bar chart. This time we would have Fans of expanded universe as the bar and Fans of star wars would be the segments.

printStackRelation(FansOfUniWarDf, ExpUniLvl, StarWarLvl, 'Fans of Expanded Universe', 'Fan of Star Wars', '3. FansUniWar.png')

The above graphs show that majority of the respondents did not answer whether they are Fans of Expanded universe (973). However, it shows that Fans of Expanded universe, most likely would be fans of star wars (93.9%). Almost same percentage of Star Wars Fans would be (16.6%) or would not be (16.7%) the Fans of Expanded Universe, this is a bit out of my expectation, despite the high percentage of no response.

Question 3 - Demographic Impact¶

Explore whether there are relationship between people's demographics (Gender, Age, Household Income, Education, Location) and their attitude to Start War characters.

1. Supplementary function: SplitCharacterByLiking¶

Define a function splitCharacterByLiking which would partition all the respondents in the dataset according to their responses on the favourability of each character. This function would return the partition data and it's corresponding percentage in a 2-dimensional array (row: character, column: favourability), as well as a list of the character names.

def splitCharacterByLiking(level):
    fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
    character = []
    data = np.zeros((len(level), len(fav_cols)))
    levelDf = pd.DataFrame()
    percent = np.zeros((len(level), len(fav_cols)))
    i = 0
    for col in fav_cols:
#        print(f"Column Name: {col}" )
        for j in np.arange(len(level)):
            levelDf = df[col].value_counts()
            levelDf.index.name='rank'
            levelDf=levelDf.to_frame('count').reset_index()            
            VF_num=levelDf.loc[levelDf['rank']==level[j],['count']].values[0]
            data[j, i] = VF_num
        total = data.sum(axis=0)
        i +=1
        character.append(col[7:])
    i=0
    for col in fav_cols:
        for j in np.arange(len(level)):
   #         print(data[j,i], data.sum(axis=0)[i], data[j,i]/data.sum(axis=0)[i] * 100)
            percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
        i += 1
    return percent, data, character

Define a function printLiking which would takes the following parameters and print out a horizontal stack bar chart of the respondents' response on the favorability of each character.

* a 2-D array on the favorability of each character (in percentage)
* a 2-D array on the favorability of each character (actual count)
* list of the characters' name
* list of favorability level

def printLiking(percent, data, character, level, plotfile='FavCharacter.png'):
    level_mapping = {'Unfamiliar (N/A)': 99, 'Very unfavorably': 0, 'Somewhat unfavorably': 1, 
                 'Neither favorably nor unfavorably (neutral)': 2, 'Somewhat favorably': 3, 'Very favorably': 4}

    key_list = list(level_mapping.keys()) 
    val_list = list(level_mapping.values())
            

    y_pos = np.arange(len(character))
    fig = plt.figure(figsize=(15,10))
    ax = fig.add_subplot(111)

    patch_handles = []
    # left alignment of data starts at zero
    left = np.zeros(len(character)) 
    for i, d in enumerate(percent):
        likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])] 
        patch_handles.append(ax.barh(y_pos, d, align='center', left=left,label=likeStr))
        left += d

    # search all of the bar segments and annotate
    for j in range(len(patch_handles)):
        for i, patch in enumerate(patch_handles[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            ax.text(x,y, "%d%%" % (percent[j,i]), ha='center')
            if (j == (len(patch_handles) - 1)):
                    ax.text(103, y, "%d" % (data.sum(axis=0)[i]), ha='center')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(character)
    plt.title('Character with count by liking', fontsize=16)
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=4)
    ax.set_xlabel('Percent (%)')
    plt.savefig(plotfile, dpi=300, bbox_inches='tight')

    plt.show()

2. Visualize Liking of each characters¶

Construct the list of favourability level by calling the unique function on the Favour_HanSolo column.

df['Favour_HanSolo'].unique()
favourLvl=df['Favour_HanSolo'].unique()
favourLvl=favourLvl[favourLvl!=-1].tolist()
favourLvl.sort()

Construct the 2-D percentage and data array on the favorability of each character and list of character by calling the function splitCharacterByLiking and then pass in all the elements to the function printLiking to print the stack bar chart.

favPercent, favData, character = splitCharacterByLiking(favourLvl)
printLiking(favPercent, favData, character, favourLvl)

3. Supplementary function: SplitCharacterByLikingWithDemo¶

Define a function splitCharacterByLikingWithDemo which would partition the respondents (with certain demographic feature) according to their response on the favorability of each character, this function takes the following argument :

* the column name of the demographic feature 
* the list of values in that demographic featurea 
* list of favorability level

def splitCharacterByLikingWithDemo(demo, demoList, level):
    fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
    character = []
    data = np.zeros((len(level), len(fav_cols)))
    levelDf = pd.DataFrame()
    percent = np.zeros((len(level), len(fav_cols)))
    i = 0
    for col in fav_cols:

        levelDf= df.groupby([demo, col]).size().reset_index()
        levelDf.columns = [demo, 'rank', 'count']
        levelDf=levelDf[levelDf[demo].isin(demoList)]
        levelDf=levelDf.groupby('rank').sum().reset_index()
        for j in np.arange(len(level)):
            if(sum(levelDf['rank']==level[j])==0):
                VF_num = 0
            else:
                VF_num=levelDf.loc[levelDf['rank']==level[j],['count']].values[0]
            data[j, i] = VF_num
        total = data.sum(axis=0)
        i +=1
        character.append(col[7:])
    i=0
    for col in fav_cols:
        for j in np.arange(len(level)):
   #         print(data[j,i], data.sum(axis=0)[i], data[j,i]/data.sum(axis=0)[i] * 100)
            percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
        i += 1
    return percent, data, character

Define a function printLiking2Chart which would takes the following parameters and print out 2 side by side horizontal stack bar chart of the respondents' response on the favorability of each character.

* a 2-D array on the favorability of each character of respondents with certain demographic feature (in percentage)
* a 2-D array on the favorability of each character of respondents with other values of the same demographic feature (in percentage)    
* a 2-D array on the favorability of each character of respondents with certain demographic feature (actual count)
* a 2-D array on the favorability of each character of respondents with other values of the same demographic feature (actual count)    
* str1 - subtitle of the left bar chart
* str2 - subtitle of the right bar chart
* list of the characters' name
* list of favorability level

def printLiking2Chart(percent1, percent2, data1, data2, str1, str2, character, level, plotfile='FavCharacterByDemo.png'):
    level_mapping = {'Unfamiliar (N/A)': 99, 'Very unfavorably': 0, 'Somewhat unfavorably': 1, 
                 'Neither favorably nor unfavorably (neutral)': 2, 'Somewhat favorably': 3, 'Very favorably': 4}

    
    key_list = list(level_mapping.keys()) 
    val_list = list(level_mapping.values())
    

    y_pos = np.arange(len(character))
    fig, axs = plt.subplots(1, 2, figsize=(15, 10), sharey=True)

    #colors ='rgwm'
    patch_handles1 = []
    # left alignment of data starts at zero
    left = np.zeros(len(character)) 
    for i, d in enumerate(percent1):
        likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
        patch_handles1.append(axs[0].barh(y_pos, d, align='center', 
          left=left,label=likeStr))
        left += d

    patch_handles2 = []
    # left alignment of data starts at zero
    left = np.zeros(len(character)) 
    for i, d in enumerate(percent2):
        likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
        patch_handles2.append(axs[1].barh(y_pos, d, align='center', 
          left=left,label=likeStr))
        left += d
        
    # search all of the bar segments and annotate
    for j in range(len(patch_handles1)):
        for i, patch in enumerate(patch_handles1[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            axs[0].text(x,y, "%d%%" % (percent1[j,i]), ha='center')
            if (j == (len(patch_handles1) - 1)):
                    axs[0].text(110, y, "%d" % (data1.sum(axis=0)[i]), ha='center')

    for j in range(len(patch_handles2)):
        for i, patch in enumerate(patch_handles2[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            axs[1].text(x,y, "%d%%" % (percent2[j,i]), ha='center')                   
            if (j == (len(patch_handles2) - 1)):
                    axs[1].text(110, y, "%d" % (data2.sum(axis=0)[i]), ha='center')

    axs[0].set_yticks(y_pos)
    axs[0].set_yticklabels(character)    
    
    axs[0].set_title(str1)
    axs[1].set_title(str2)
    axs[0].set_xlabel('Percent (%)')
    axs[1].set_xlabel('Percent (%)')

    fig.suptitle('Character with count by liking', fontsize=16)
    plt.legend(loc='upper center', bbox_to_anchor=(-0.1, -0.1), ncol=4)
    plt.savefig(plotfile, dpi=300, bbox_inches='tight')

    plt.show()

4. Visualize likings of each characters by:¶

Gender¶

Find the list of gender by calling the unique function on the Gender column

print(df['Gender'].unique())
print(df['Gender'].value_counts())

['MALE' 'N/A' 'FEMALE']
FEMALE    549
MALE      497
N/A       140
Name: Gender, dtype: int64

Construct the 2-D percentage and data array on the favorability of each character for female and Male respondents by calling the splitCharacterByLikingWithDemo function and print the side by side stack bar chart by calling the function printLiking2Chart

F_favPercent, F_favData, character = splitCharacterByLikingWithDemo('Gender', ['FEMALE'], favourLvl)
M_favPercent, M_favData, character = splitCharacterByLikingWithDemo('Gender', ['MALE'], favourLvl)

printLiking2Chart(F_favPercent, M_favPercent, F_favData, M_favData, 'Female', 'Male', character, favourLvl, '8. FavCharacterBySex.png')

From the above chart, we can observe the followings:

the split between Female and Male is almost equitable.
Respondents who provided no answer to the gender question will not be included in the chart.
Obvious variation in favourability for Jar Jar Binks (votes for Very unfavorably), R2D2 and C-3P0 (votes for Very favorably) between gender

Similarly, we can plot for the other demographic features.

Age¶

Find the list of age by calling the unique function on the Age column

print(df['Age'].unique())
print(df['Age'].value_counts())

['18-29' 'N/A' '30-44' '> 60' '45-60']
45-60    291
> 60     269
30-44    268
18-29    217
N/A      141
Name: Age, dtype: int64

Construct the 2-D percentage and data array on the favorability of each character for different age group by calling the splitCharacterByLikingWithDemo function and print the side by side stack bar chart by calling the function printLiking2Chart

a. Young versus Rest of the age group¶

Y_favPercent, Y_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44', '45-60', '> 60'], favourLvl)

printLiking2Chart(Y_favPercent, R_favPercent, Y_favData, R_favData, 'Age 18 - 29', 'The Rest', character, favourLvl, '10. FavCharacterYoungandRest.png')

b. Old versus Rest of the age group¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '30-44', '45-60'], favourLvl)
O_favPercent, O_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)

printLiking2Chart(O_favPercent, R_favPercent, O_favData, R_favData, 'Age > 60', 'The Rest', character, favourLvl, '10. FavCharacterOldandRest.png')

c. Young versus Old¶

Y_favPercent, Y_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
O_favPercent, O_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)

printLiking2Chart(Y_favPercent, O_favPercent, Y_favData, O_favData, 'Age 18-29', 'Age > 60', character, favourLvl, '10. FavCharacterYoundandOld.png')

d. 30-44 versus the Rest¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '45-60', '> 60'], favourLvl)
Middle1_favPercent, Middle1_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44'], favourLvl)

printLiking2Chart(Middle1_favPercent, R_favPercent, Middle1_favData, R_favData, 'Age 30-44', 'The Rest', character, favourLvl, '10. FavCharacter30-44Rest.png')

e. 45-60 versus the Rest¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '30-44', '> 60'], favourLvl)
Middle2_favPercent, Middle2_favData, character = splitCharacterByLikingWithDemo('Age', ['45-60'], favourLvl)

printLiking2Chart(Middle2_favPercent, R_favPercent, Middle2_favData, R_favData, 'Age 45-60', 'The Rest', character, favourLvl, '10. FavCharacter45-60Rest.png')

Income¶

Find the list of House hold income by calling the unique function on the HHIncome column

print(df['HHIncome'].unique())
print(df['HHIncome'].value_counts())

['N/A' '$0 - $24,999' '$100,000 - $149,999' '$25,000 - $49,999'
 '$50,000 - $99,999' '$150,000+']
N/A                    328
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: HHIncome, dtype: int64

Construct the 2-D percentage and data array on the favorability of each character for different household income group by calling the splitCharacterByLikingWithDemo function and print the side by side stack bar chart by calling the function printLiking2Chart

a. Lower Income versus Rest of the household income group¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$50,000 - $99,999', '$150,000+'], favourLvl)
P_favPercent, P_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999'], favourLvl)

printLiking2Chart(P_favPercent, R_favPercent, P_favData, R_favData, 'Lower Income', 'The Rest', character, favourLvl, '12. FavCharacterPoorandRest.png')

b. Higher Income versus Rest of the household income group¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999', '$50,000 - $99,999'], favourLvl)
Rich_favPercent, Rich_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$150,000+'], favourLvl)

printLiking2Chart(Rich_favPercent, R_favPercent, Rich_favData, R_favData, 'Higher Income', 'The Rest', character, favourLvl, '12. FavCharacterRichandRest.png')

c. Middle Income versus Rest of the household income group¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999', '$100,000 - $149,999', '$150,000+'], favourLvl)
MidIncome_favPercent, MidIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', [ '$50,000 - $99,999'], favourLvl)

printLiking2Chart(MidIncome_favPercent, R_favPercent, MidIncome_favData, R_favData, 'Middle Income', 'The Rest', character, favourLvl, '12. FavCharacterMidIncomeandRest.png')

Education Level¶

Find the list of Education level by calling the unique function on the Education column

print(df['Education'].unique())
print(df['Education'].value_counts())

['HIGH SCHOOL DEGREE' 'BACHELOR DEGREE' 'SOME COLLEGE OR ASSOCIATE DEGREE'
 'N/A' 'GRADUATE DEGREE' 'LESS THAN HIGH SCHOOL DEGREE']
SOME COLLEGE OR ASSOCIATE DEGREE    328
BACHELOR DEGREE                     321
GRADUATE DEGREE                     275
N/A                                 150
HIGH SCHOOL DEGREE                  105
LESS THAN HIGH SCHOOL DEGREE          7
Name: Education, dtype: int64

Construct the 2-D percentage and data array on the favorability of each character for different education level by calling the splitCharacterByLikingWithDemo function and print the side by side stack bar chart by calling the function printLiking2Chart

a. Less Educated versus Rest of the education level¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE', 'BACHELOR DEGREE', 'GRADUATE DEGREE'], favourLvl)
Less_Educated_favPercent, Less_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE'], favourLvl)

printLiking2Chart(Less_Educated_favPercent, R_favPercent, Less_Educated_favData, R_favData, 'Less Educated', 'The Rest', character, favourLvl, '14. FavCharacterLessEducatedandRest.png')

b. College or Associate versus Rest of the education level¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','BACHELOR DEGREE', 'GRADUATE DEGREE'], favourLvl)
AssociateDegree_favPercent, AssociateDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE'], favourLvl)

printLiking2Chart(AssociateDegree_favPercent, R_favPercent, AssociateDegree_favData, R_favData, 'Some College or Associate Degree', 'The Rest', character, favourLvl,'14.FavCharacterAssoDegandRest.png')

c. Bachelor Degree versus Rest of the education level¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','SOME COLLEGE OR ASSOCIATE DEGREE', 'GRADUATE DEGREE'], favourLvl)
bachelorDegree_favPercent, bachelorDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['BACHELOR DEGREE'], favourLvl)

printLiking2Chart(bachelorDegree_favPercent, R_favPercent, bachelorDegree_favData, R_favData, 'Bachelor Degree', 'The Rest', character, favourLvl,'14.FavCharacterBachDegandRest.png')

d. Graduate Degree versus Rest of the education level¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','SOME COLLEGE OR ASSOCIATE DEGREE', 'BACHELOR DEGREE'], favourLvl)
High_Educated_favPercent, High_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['GRADUATE DEGREE'], favourLvl)

printLiking2Chart(High_Educated_favPercent, R_favPercent, High_Educated_favData, R_favData, 'Graduate Degree', 'The Rest', character, favourLvl,'14.FavCharacterGraudateDegandRest.png')

Region¶

Find the list of Region by calling the unique function on the Region column

print(df['Region'].unique())
print(df['Region'].value_counts())

['SOUTH ATLANTIC' 'WEST SOUTH CENTRAL' 'WEST NORTH CENTRAL'
 'MIDDLE ATLANTIC' 'EAST NORTH CENTRAL' 'PACIFIC' 'N/A' 'MOUNTAIN'
 'NEW ENGLAND' 'EAST SOUTH CENTRAL']
EAST NORTH CENTRAL    181
PACIFIC               175
SOUTH ATLANTIC        170
N/A                   143
MIDDLE ATLANTIC       122
WEST SOUTH CENTRAL    110
WEST NORTH CENTRAL     93
MOUNTAIN               79
NEW ENGLAND            75
EAST SOUTH CENTRAL     38
Name: Region, dtype: int64

We could group the region in 4 levels for better comparison according to the geographical location from West to East as follows:

Pacific and Mountain : PACIFIC, MOUNTAIN
Central: WEST NORTH CENTRAL, WEST SOUTH CENTRAL, EAST NORTH CENTRAL, AST SOUTH CENTRAL
Atlantic: MIDDLE ATLANTIC, SOUTH ATLANTIC, NEW ENGLAND

a. Pacific and Mountain versus Rest of the Region¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL', 'MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
PM_Educated_favPercent, PM_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN'], favourLvl)

printLiking2Chart(PM_Educated_favPercent, R_favPercent, PM_Educated_favData, R_favData, 'Pacific and Mountain', 'The Rest', character, favourLvl, '15.FavCharacterPacificMountainandRest.png')

b. Central versus Rest of the Region¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN', 'MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
Central_Educated_favPercent, Central_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)

printLiking2Chart(Central_Educated_favPercent, R_favPercent, Central_Educated_favData, R_favData, 'Central', 'The Rest', character, favourLvl, '15.FavCharacterCentralandRest.png')

c. Atlantic versus Rest of the Region¶

R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN', 'WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
Atlantic_Educated_favPercent, Atlantic_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)

printLiking2Chart(Atlantic_Educated_favPercent, R_favPercent, Atlantic_Educated_favData, R_favData, 'Atlantic', 'The Rest', character, favourLvl, '15.FavCharacterAtlanticandRest.png')

9. Perform Chi Square Association Test¶

It would be more convincing to show statistical figures rather than comparing the categories of demographical features visually. We could thus perform the chi-square test for independence

1. Split all the demographic data¶

F_favPercent, F_favData, character = splitCharacterByLikingWithDemo('Gender', ['FEMALE'], favourLvl)
M_favPercent, M_favData, character = splitCharacterByLikingWithDemo('Gender', ['MALE'], favourLvl)
a18to29_favPercent, a18to29_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
a30to44_favPercent, a30to44_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44'], favourLvl)
a45to60_favPercent, a45to60_favData, character = splitCharacterByLikingWithDemo('Age', ['45-60'], favourLvl)
a60plus_favPercent, a60plus_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
lowerIncome_favPercent, lowerIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999'], favourLvl)
midIncome_favPercent, midIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', [ '$50,000 - $99,999'], favourLvl)
highIncome_favPercent, highIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$150,000+'], favourLvl)
Less_Educated_favPercent, Less_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE'], favourLvl)
AssociateDegree_favPercent, AssociateDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE'], favourLvl)
bachelorDegree_favPercent, bachelorDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['BACHELOR DEGREE'], favourLvl)
High_Educated_favPercent, High_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['GRADUATE DEGREE'], favourLvl)
PM_Region_favPercent, PM_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN'], favourLvl)
Central_Region_favPercent, Central_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
Atlantic_Region_favPercent, Atlantic_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)

2. Supplementary function: chiTest¶

define a function for chi square test which returns the p-value between 2 pass-in array

from scipy.stats import chi2_contingency
from scipy.stats import chi2

def chiTest(controlArray, testArray):
    p_list=np.zeros(controlArray.shape[1])
    for i in range(controlArray.shape[1]):
        test=testArray.T[i,]
        control=controlArray.T[i,]
        table= np.vstack((control,test))
        stat, p, dof, expected = chi2_contingency(table)
        p_list[i]=p
    return p_list

3. Chi Square test (each categories vs original dataset)¶

Quick Link to Conclusion

Perform chi square test on all the favourability split by demographic features against the master favaorability data

$H_{0}$: There is no association in favorability between demographical feature (independence)
$H_{A}$: There is an association in favorability between demographical feature (dependence)

female_pList=chiTest(favData, F_favData)
male_pList=chiTest(favData, M_favData)
a18to29_pList=chiTest(favData, a18to29_favData)
a30to44_pList=chiTest(favData, a30to44_favData)
a45to60_pList=chiTest(favData, a45to60_favData)
a60plus_pList=chiTest(favData, a60plus_favData)
lowerIncome_pList=chiTest(favData, lowerIncome_favData)
midIncome_pList=chiTest(favData, midIncome_favData)
highIncome_pList=chiTest(favData, highIncome_favData)
Less_Educated_pList=chiTest(favData, Less_Educated_favData)
AssociateDegree_pList=chiTest(favData, AssociateDegree_favData)
bachelorDegree_pList=chiTest(favData, bachelorDegree_favData)
High_Educated_pList=chiTest(favData, High_Educated_favData)
PM_Region_pList=chiTest(favData, PM_Region_favData)
Central_Region_pList=chiTest(favData, Central_Region_favData)
Atlantic_Region_pList=chiTest(favData, Atlantic_Region_favData)

i. Construct p-value list¶

Combine the p-value list of each demographical features into one dataframe, finally concatenate all into a single dataframe. Print out the probability list of each demographical features.

region_prob_list = pd.DataFrame({'PM_Region':PM_Region_pList, 'Central_Region': Central_Region_pList, 'Atlantic_Region': Atlantic_Region_pList }, index=character)
education_prob_list = pd.DataFrame({'Less_Educated':Less_Educated_pList, 'AsocDeg_Educated': AssociateDegree_pList, 'BachDeg_Educated': bachelorDegree_pList, 'High_Educated': High_Educated_pList }, index=character)
income_prob_list = pd.DataFrame({'Low_Income':lowerIncome_pList, 'Middle_Income': midIncome_pList, 'High_Income': highIncome_pList }, index=character)
age_prob_list= pd.DataFrame({'Age_18_to_29':a18to29_pList, 'Age_30_to_44': a30to44_pList, 'Age_45_to_60': a45to60_pList, 'Age_60_plus': a60plus_pList}, index=character)
gender_prob_list = pd.DataFrame({'Male':male_pList, 'Female': female_pList}, index=character)
all_prob_list = pd.concat([gender_prob_list, age_prob_list, income_prob_list, education_prob_list, region_prob_list ], axis=1, sort=False)

ii. Print out the p-value list for gender¶

gender_prob_list.round(3)

iii. Print out the p-value list for Age¶

age_prob_list.round(3)

iv. Print out the p-value list for Income¶

income_prob_list.round(3)

v. Print out the p-value list for Region¶

region_prob_list.round(3)

vi. Print out the p-value list for Education¶

education_prob_list.round(3)

vi. Print out the p-value list for all categories of different demographic features¶

all_prob_list.round(3)

vii. List out the characters with significant differences¶

Define a function that would return all the cells which p-value < alpha and return as a list

def getIndexes(dfObj, alpha): 
    listOfPos = list()
    # Get bool dataframe with True at positions where the given value exists
    result = dfObj < alpha 
    # Get list of columns that contains the value
    seriesObj = result.any()
    columnNames = list(seriesObj[seriesObj == True].index)
    # Iterate over list of columns and fetch the rows indexes where value exists
    for col in columnNames:
        rows = list(result[col][result[col] == True].index)
        for row in rows:
            listOfPos.append((row, col, dfObj.loc[row][col]))
    # Return a list of tuples indicating the positions of value in the dataframe
    return listOfPos

Make a dataframe that shows all the demographic features vs character that has probability < alpha

all_association_list = getIndexes(all_prob_list, 0.05)
df = pd.DataFrame(all_association_list, columns=['character', 'demographic feature','probability'])
df

df=df.pivot('demographic feature', 'character','probability').reset_index()
df=df.rename_axis(None, axis=1)
df=df.round(4)
df=df.where(df.notnull(), '-')
df

Conclusion: If we Reject $H_{0}$ for P-value < 0.05 ==> only Age and Gender have associations with the likings for the characters: Anakin Skywalker, Boba Fett, C-3PO, Emperor Palpatine, Jar Jar Binks, Luke Skywalker, Padme Amidala, Princess Leia Organa and R2D2 as shown in the above table.

4. Chi square test within demographical features (Gender and Age)¶

i. By Gender - Male vs Female Quick Link to Conclusion ¶

female_male_pList=chiTest(F_favData, M_favData)
female_male_df=pd.DataFrame({'Male vs Female': female_male_pList.round(4)}, index=character)
female_male_df

a. Character with significant difference by Gender¶

female_male_df.loc[(female_male_df['Male vs Female']<0.05),:]

b. Character with NO significant difference by Gender¶

female_male_df.loc[(female_male_df['Male vs Female']>0.05),:]

Conclusion: If we Reject $H_{0}$ for P-value < 0.05 ==> Emperor Palpatine, Darth Vader, Lando Calrissian, BobaFett, C-3P0, R2D2, JarJarBinks, PademAmidala and Yoda would have significant difference between Male and Female.

ii. By different age groups Quick Link to Conclusion ¶

a18to29vs30to44_pList=chiTest(a18to29_favData[1:,:], a30to44_favData[1:,:])
a18to29vs45to60_pList=chiTest(a18to29_favData[1:,:], a45to60_favData[1:,:])
a18to29vs60plus_pList=chiTest(a18to29_favData[1:,:], a60plus_favData[1:,:])
a30to44vs45to60_pList=chiTest(a45to60_favData[1:,:], a30to44_favData[1:,:])
a30to44vs60plus_pList=chiTest(a60plus_favData[1:,:], a30to44_favData[1:,:])
a45to60vs60plus_pList=chiTest(a60plus_favData[1:,:], a45to60_favData[1:,:])
all_age_prob_list= pd.DataFrame({'18to29 vs 30to44':a18to29vs30to44_pList.round(4), '18to29 vs 45to60': a18to29vs45to60_pList.round(4), '18to29 vs 60plus': a18to29vs60plus_pList.round(4), '30to44 vs 45to60': a30to44vs45to60_pList.round(4), '30to44 vs 60plus':a30to44vs60plus_pList.round(4), '45to60 vs 60plus':a45to60vs60plus_pList.round(4)}, index=character)
all_age_prob_list

a. Characters with significant difference by Age¶

all_age_prob_list.where(all_age_prob_list < 0.05, '-')

Conclusion: If we Reject $H_{0}$ for P-value < 0.05 ==> We found that there would be significant difference between age group 18-29 and 30-44 for HanSolo, LukeSkywalker, PrincessLia and Darth Vader. Similar findings for other age groups are listed in the above table.

	respondentId	Rank_I_ThePhantomMenace	Rank_II_AttackoftheClones	Rank_III_RevengeoftheSith	Rank_IV_aNewHope	Rank_V_TheEmpireStrikesBack	Rank_VI_ReturnofTheJedi
count	1.186000e+03	835.00	836.00	835.00	836.00	836.00	836.00
mean	3.290128e+09	3.73	4.09	4.34	3.27	2.51	3.05
std	1.055639e+06	1.66	1.37	1.40	1.83	1.58	1.67
min	3.288373e+09	1.00	1.00	1.00	1.00	1.00	1.00
25%	3.289451e+09	3.00	3.00	3.00	2.00	1.00	2.00
50%	3.290147e+09	4.00	4.00	5.00	3.00	2.00	3.00
75%	3.290814e+09	5.00	5.00	6.00	5.00	3.00	4.00
max	3.292880e+09	6.00	6.00	6.00	6.00	6.00	6.00

	Male	Female
HanSolo	0.991	0.966
LukeSkywalker	0.974	0.915
PrincessLeiaOrgana	0.944	0.921
AnakinSkywalker	0.598	0.611
ObiWanKenobi	0.915	0.890
EmperorPalpatine	0.074	0.027
DarthVader	0.117	0.117
LandoCalrissian	0.267	0.233
BobaFett	0.035	0.015
C-3P0	0.057	0.020
R2D2	0.098	0.021
JarJarBinks	0.118	0.099
PadmeAmidala	0.607	0.431
Yoda	0.666	0.476

	Age_18_to_29	Age_30_to_44	Age_45_to_60	Age_60_plus
HanSolo	0.248	0.211	0.899	0.203
LukeSkywalker	0.000	0.869	0.471	0.038
PrincessLeiaOrgana	0.045	0.527	0.391	0.294
AnakinSkywalker	0.023	0.002	0.265	0.000
ObiWanKenobi	0.912	0.356	0.842	0.232
EmperorPalpatine	0.136	0.878	0.855	0.461
DarthVader	0.154	0.098	0.504	0.319
LandoCalrissian	0.880	0.278	0.983	0.443
BobaFett	0.013	0.014	0.013	0.008
C-3P0	0.045	0.206	0.397	0.041
R2D2	0.319	0.961	0.566	0.882
JarJarBinks	0.031	0.005	0.687	0.000
PadmeAmidala	0.002	0.118	0.139	0.000
Yoda	0.753	0.701	0.989	0.500

	Low_Income	Middle_Income	High_Income
HanSolo	0.161	0.823	0.906
LukeSkywalker	0.080	0.744	0.759
PrincessLeiaOrgana	0.118	0.508	0.848
AnakinSkywalker	0.326	0.554	0.653
ObiWanKenobi	0.147	0.508	0.743
EmperorPalpatine	0.812	0.759	0.774
DarthVader	0.292	0.949	0.801
LandoCalrissian	0.881	0.554	0.988
BobaFett	0.993	0.917	0.822
C-3P0	0.417	0.823	0.806
R2D2	0.069	0.437	0.839
JarJarBinks	0.703	0.957	0.793
PadmeAmidala	0.553	0.775	0.762
Yoda	0.392	0.576	0.782

	PM_Region	Central_Region	Atlantic_Region
HanSolo	0.367	0.911	0.891
LukeSkywalker	0.474	0.746	0.974
PrincessLeiaOrgana	0.324	0.618	0.990
AnakinSkywalker	0.989	0.990	0.925
ObiWanKenobi	0.669	0.748	0.643
EmperorPalpatine	0.193	0.816	0.252
DarthVader	0.666	0.792	0.919
LandoCalrissian	0.874	0.840	0.827
BobaFett	0.919	0.864	0.873
C-3P0	0.569	0.691	0.982
R2D2	0.972	0.736	0.898
JarJarBinks	0.609	0.874	0.787
PadmeAmidala	0.871	0.911	0.731
Yoda	0.893	0.636	0.662

	respondentId	any6Films	fanOfStarWars	Seen_I_ThePhantomMenace	Seen_II_AttackoftheClones	Seen_III_RevengeoftheSith	Seen_IV_aNewHope	Seen_V_TheEmpireStrikesBack	Seen_VI_ReturnofTheJedi	Rank_I_ThePhantomMenace	Rank_II_AttackoftheClones	Rank_III_RevengeoftheSith	Rank_IV_aNewHope	Rank_V_TheEmpireStrikesBack	Rank_VI_ReturnofTheJedi	Favour_HanSolo	Favour_LukeSkywalker	Favour_PrincessLeiaOrgana	Favour_AnakinSkywalker	Favour_ObiWanKenobi	Favour_EmperorPalpatine	Favour_DarthVader	Favour_LandoCalrissian	Favour_BobaFett	Favour_C-3P0	Favour_R2D2	Favour_JarJarBinks	Favour_PadmeAmidala	Favour_Yoda	firstShotCharacter	knowExpUniverse	fanOfExpUniverse	fanOfStarTrekFranchise	Gender	Age	HHIncome	Education	Region
0	3292879998	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3.0	2.0	1.0	4.0	5.0	6.0	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1.0	2.0	3.0	4.0	5.0	6.0	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	6.0	1.0	2.0	4.0	3.0	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	4.0	6.0	2.0	1.0	3.0	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

	index	Rank_I_ThePhantomMenace	Rank_II_AttackoftheClones	Rank_III_RevengeoftheSith	Rank_IV_aNewHope	Rank_V_TheEmpireStrikesBack	Rank_VI_ReturnofTheJedi
0	1	128	32	36	204	288	146
1	2	71	115	46	135	235	232
2	3	130	102	150	127	106	219
3	4	237	183	182	129	47	56
4	5	100	300	203	78	117	36
5	6	168	102	217	161	41	145

	Rank	I_ThePhantomMenace	II_AttackoftheClones	III_RevengeoftheSith	IV_aNewHope	V_TheEmpireStrikesBack	VI_ReturnofTheJedi
0	1	128	32	36	204	288	146
1	2	71	115	46	135	235	232
2	3	130	102	150	127	106	219
3	4	237	183	182	129	47	56
4	5	100	300	203	78	117	36
5	6	168	102	217	161	41	145

	movie_name	total
4	V_TheEmpireStrikesBack	2095
5	VI_ReturnofTheJedi	2541
3	IV_aNewHope	2727
0	I_ThePhantomMenace	3116
1	II_AttackoftheClones	3412
2	III_RevengeoftheSith	3623

	movie_name	total
4	V_TheEmpireStrikesBack	1116
3	IV_aNewHope	1346
5	VI_ReturnofTheJedi	1378
0	I_ThePhantomMenace	1995
2	III_RevengeoftheSith	1999
1	II_AttackoftheClones	2036

	Rank	I_ThePhantomMenace	II_AttackoftheClones	III_RevengeoftheSith	IV_aNewHope	V_TheEmpireStrikesBack	VI_ReturnofTheJedi
0	1	46	18	27	128	169	82
1	2	30	46	34	107	133	120
2	3	47	52	68	86	71	146
3	4	129	84	121	59	33	44
4	5	76	186	104	30	48	26
5	6	142	84	116	60	16	52

	fanOfStarWars	fanOfStarTrekFranchise	count
0	N/A	N/A	110
1	N/A	NO	228
2	N/A	YES	12
3	NO	N/A	1
4	NO	NO	238
5	NO	YES	45
6	YES	N/A	7
7	YES	NO	175
8	YES	YES	370

	fanOfStarTrekFranchise	fanOfStarWars	count
0	N/A	N/A	110
1	N/A	NO	1
2	N/A	YES	7
3	NO	N/A	228
4	NO	NO	238
5	NO	YES	175
6	YES	N/A	12
7	YES	NO	45
8	YES	YES	370

	fanOfStarWars	knowExpUniverse	count
0	N/A	N/A	350
1	NO	N/A	1
2	NO	NO	255
3	NO	YES	28
4	YES	N/A	7
5	YES	NO	360
6	YES	YES	185

	knowExpUniverse	fanOfStarWars	count
0	N/A	N/A	350
1	N/A	NO	1
2	N/A	YES	7
3	NO	NO	255
4	NO	YES	360
5	YES	NO	28
6	YES	YES	185

	Less_Educated	AsocDeg_Educated	BachDeg_Educated	High_Educated
HanSolo	0.151	0.888	0.717	0.974
LukeSkywalker	0.686	0.667	0.611	0.834
PrincessLeiaOrgana	0.559	0.929	0.988	0.996
AnakinSkywalker	0.660	0.385	0.325	0.815
ObiWanKenobi	0.309	0.717	0.423	0.797
EmperorPalpatine	0.218	0.272	0.735	0.604
DarthVader	0.511	0.467	0.467	0.555
LandoCalrissian	0.644	0.827	0.754	0.761
BobaFett	0.450	0.804	0.999	0.168
C-3P0	0.357	0.621	0.918	0.978
R2D2	0.184	0.947	0.560	0.898
JarJarBinks	0.545	0.493	0.749	0.818
PadmeAmidala	0.499	0.324	0.305	0.701
Yoda	0.562	0.863	0.953	0.641

	demographic feature	AnakinSkywalker	BobaFett	C-3P0	EmperorPalpatine	JarJarBinks	LukeSkywalker	PadmeAmidala	PrincessLeiaOrgana	R2D2
0	Age_18_to_29	0.0231	0.013	0.0452	-	0.0308	0.0005	0.0017	0.0454	-
1	Age_30_to_44	0.0018	0.0136	-	-	0.0053	-	-	-	-
2	Age_45_to_60	-	0.0133	-	-	-	-	-	-	-
3	Age_60_plus	0.0001	0.0082	0.0412	-	0	0.0383	0.0001	-	-
4	Female	-	0.0154	0.0199	0.0272	-	-	-	-	0.0205
5	Male	-	0.0352	-	-	-	-	-	-	-

	18to29 vs 30to44	18to29 vs 45to60	18to29 vs 60plus	30to44 vs 45to60	30to44 vs 60plus	45to60 vs 60plus
HanSolo	0.0102	0.1495	0.0292	0.2506	0.0070	0.3032
LukeSkywalker	0.0039	0.0000	0.0000	0.4423	0.0340	0.4645
PrincessLeiaOrgana	0.0275	0.0015	0.0037	0.2756	0.0259	0.2062
AnakinSkywalker	0.5079	0.0007	0.0000	0.0000	0.0000	0.0746
ObiWanKenobi	0.6956	0.4331	0.1062	0.1223	0.0062	0.5216
EmperorPalpatine	0.2667	0.0512	0.0604	0.5853	0.4441	0.9578
DarthVader	0.0287	0.8646	0.2197	0.0766	0.0064	0.2660
LandoCalrissian	0.0674	0.8729	0.5609	0.2379	0.0145	0.3670
BobaFett	0.4066	0.0000	0.0000	0.0000	0.0000	0.1671
C-3P0	0.1318	0.0013	0.0001	0.0215	0.0005	0.3445
R2D2	0.3715	0.0371	0.3776	0.5646	0.6592	0.4000
JarJarBinks	0.4344	0.0140	0.0001	0.0823	0.0018	0.4443
PadmeAmidala	0.1862	0.0004	0.0000	0.0540	0.0000	0.0304
Yoda	0.2064	0.6988	0.4165	0.8375	0.5336	0.5397

	18to29 vs 30to44	18to29 vs 45to60	18to29 vs 60plus	30to44 vs 45to60	30to44 vs 60plus	45to60 vs 60plus
HanSolo	0.0102	-	0.0292	-	0.007	-
LukeSkywalker	0.0039	0	0	-	0.034	-
PrincessLeiaOrgana	0.0275	0.0015	0.0037	-	0.0259	-
AnakinSkywalker	-	0.0007	0	0	0	-
ObiWanKenobi	-	-	-	-	0.0062	-
EmperorPalpatine	-	-	-	-	-	-
DarthVader	0.0287	-	-	-	0.0064	-
LandoCalrissian	-	-	-	-	0.0145	-
BobaFett	-	0	0	0	0	-
C-3P0	-	0.0013	0.0001	0.0215	0.0005	-
R2D2	-	0.0371	-	-	-	-
JarJarBinks	-	0.014	0.0001	-	0.0018	-
PadmeAmidala	-	0.0004	0	-	0	0.0304
Yoda	-	-	-	-	-	-

	index	Rank_I_ThePhantomMenace	Rank_II_AttackoftheClones	Rank_III_RevengeoftheSith	Rank_IV_aNewHope	Rank_V_TheEmpireStrikesBack	Rank_VI_ReturnofTheJedi
0	1	128	32	36	204	288	146
1	2	71	115	46	135	235	232
2	3	130	102	150	127	106	219
3	4	237	183	182	129	47	56
4	5	100	300	203	78	117	36
5	6	168	102	217	161	41	145

	Rank	I_ThePhantomMenace	II_AttackoftheClones	III_RevengeoftheSith	IV_aNewHope	V_TheEmpireStrikesBack	VI_ReturnofTheJedi
0	1	128	32	36	204	288	146
1	2	71	115	46	135	235	232
2	3	130	102	150	127	106	219
3	4	237	183	182	129	47	56
4	5	100	300	203	78	117	36
5	6	168	102	217	161	41	145

	Rank	I_ThePhantomMenace	II_AttackoftheClones	III_RevengeoftheSith	IV_aNewHope	V_TheEmpireStrikesBack	VI_ReturnofTheJedi
0	1	46	18	27	128	169	82
1	2	30	46	34	107	133	120
2	3	47	52	68	86	71	146
3	4	129	84	121	59	33	44
4	5	76	186	104	30	48	26
5	6	142	84	116	60	16	52

	fanOfStarWars	fanOfStarTrekFranchise	count
0	N/A	N/A	110
1	N/A	NO	228
2	N/A	YES	12
3	NO	N/A	1
4	NO	NO	238
5	NO	YES	45
6	YES	N/A	7
7	YES	NO	175
8	YES	YES	370

	fanOfStarTrekFranchise	fanOfStarWars	count
0	N/A	N/A	110
1	N/A	NO	1
2	N/A	YES	7
3	NO	N/A	228
4	NO	NO	238
5	NO	YES	175
6	YES	N/A	12
7	YES	NO	45
8	YES	YES	370

	fanOfStarWars	knowExpUniverse	count
0	N/A	N/A	350
1	NO	N/A	1
2	NO	NO	255
3	NO	YES	28
4	YES	N/A	7
5	YES	NO	360
6	YES	YES	185

	knowExpUniverse	fanOfStarWars	count
0	N/A	N/A	350
1	N/A	NO	1
2	N/A	YES	7
3	NO	NO	255
4	NO	YES	360
5	YES	NO	28
6	YES	YES	185

	fanOfStarWars	fanOfExpUniverse	count
0	N/A	N/A	350
1	NO	N/A	256
2	NO	NO	22
3	NO	YES	6
4	YES	N/A	367
5	YES	NO	92
6	YES	YES	93

	fanOfExpUniverse	fanOfStarWars	count
0	N/A	N/A	350
1	N/A	NO	256
2	N/A	YES	367
3	NO	NO	22
4	NO	YES	92
5	YES	NO	6
6	YES	YES	93

	character	demographic feature	probability
0	BobaFett	Male	0.035203
1	EmperorPalpatine	Female	0.027167
2	BobaFett	Female	0.015359
3	C-3P0	Female	0.019869
4	R2D2	Female	0.020548
5	LukeSkywalker	Age_18_to_29	0.000477
6	PrincessLeiaOrgana	Age_18_to_29	0.045426
7	AnakinSkywalker	Age_18_to_29	0.023067
8	BobaFett	Age_18_to_29	0.013010
9	C-3P0	Age_18_to_29	0.045152
10	JarJarBinks	Age_18_to_29	0.030795
11	PadmeAmidala	Age_18_to_29	0.001704
12	AnakinSkywalker	Age_30_to_44	0.001781
13	BobaFett	Age_30_to_44	0.013628
14	JarJarBinks	Age_30_to_44	0.005269
15	BobaFett	Age_45_to_60	0.013320
16	LukeSkywalker	Age_60_plus	0.038312
17	AnakinSkywalker	Age_60_plus	0.000058
18	BobaFett	Age_60_plus	0.008195
19	C-3P0	Age_60_plus	0.041164
20	JarJarBinks	Age_60_plus	0.000041
21	PadmeAmidala	Age_60_plus	0.000144

Practical Data Science: Star Wars Dataset

Executive Summary¶

Task 1: Data Preparation¶

1. Import Pacakges¶

2. Read in Data File¶

3. Check Dimensions¶

4. Show content¶

5. Show Data Types¶

6. Descriptive Statistics¶

Numerical columns¶

Categorical columns¶

7. Inspect and Clean (Categorical columns )¶

8. Handling Missing Values¶

Check¶

Fix Categoical columns¶

Fix the columns prefix with Seen¶

Fix the columns prefix with Rank¶

Fix the columns prefix with Favour¶

9. Apply test suite¶

10. Validation after Fixes¶

Data Types¶

Missing values¶

Task 2: Data Exploration¶

Question 1 - Ranking¶

1. Show Volume¶

2. Find Invalid response¶

3. Validate Ranking¶

4. Calculate Total Score¶

5. Ranked and watched All movies¶

6. Visualize Total Score¶

7. Visualize Rank 1 (Most favourite) Count¶

Question 2 - Pairwise comparison¶

1. Supplementary function : PrintStackRelation¶

2. 1st Comparison: Fans of Star Wars vs Star Trek¶

3. 2nd Comparison: Fans of Star wars vs Familiarity on Expanded Universe¶

4. 3rd Comparison: Fans of Star Wars vs Fans of Expanded Universe¶

Question 3 - Demographic Impact¶

1. Supplementary function: SplitCharacterByLiking¶

2. Visualize Liking of each characters¶

3. Supplementary function: SplitCharacterByLikingWithDemo¶

4. Visualize likings of each characters by:¶

Gender¶

Age¶

a. Young versus Rest of the age group¶

b. Old versus Rest of the age group¶

c. Young versus Old¶

d. 30-44 versus the Rest¶

e. 45-60 versus the Rest¶

Income¶

a. Lower Income versus Rest of the household income group¶

b. Higher Income versus Rest of the household income group¶

c. Middle Income versus Rest of the household income group¶

Education Level¶

a. Less Educated versus Rest of the education level¶

b. College or Associate versus Rest of the education level¶

c. Bachelor Degree versus Rest of the education level¶

d. Graduate Degree versus Rest of the education level¶

Region¶

a. Pacific and Mountain versus Rest of the Region¶

b. Central versus Rest of the Region¶

c. Atlantic versus Rest of the Region¶

9. Perform Chi Square Association Test¶

1. Split all the demographic data¶

2. Supplementary function: chiTest¶

3. Chi Square test (each categories vs original dataset)¶

i. Construct p-value list¶

ii. Print out the p-value list for gender¶

iii. Print out the p-value list for Age¶

iv. Print out the p-value list for Income¶

v. Print out the p-value list for Region¶

vi. Print out the p-value list for Education¶

vi. Print out the p-value list for all categories of different demographic features¶

vii. List out the characters with significant differences¶

4. Chi square test within demographical features (Gender and Age)¶

i. By Gender - Male vs Female Quick Link to Conclusion¶

a. Character with significant difference by Gender¶

b. Character with NO significant difference by Gender¶

ii. By different age groups Quick Link to Conclusion¶

a. Characters with significant difference by Age¶

Fix the columns prefix with `Seen`¶

Fix the columns prefix with `Rank`¶

Fix the columns prefix with `Favour`¶

i. By Gender - Male vs Female Quick Link to Conclusion ¶

ii. By different age groups Quick Link to Conclusion ¶

	index	Rank_I_ThePhantomMenace	Rank_II_AttackoftheClones	Rank_III_RevengeoftheSith	Rank_IV_aNewHope	Rank_V_TheEmpireStrikesBack	Rank_VI_ReturnofTheJedi
0	1	128	32	36	204	288	146
1	2	71	115	46	135	235	232
2	3	130	102	150	127	106	219
3	4	237	183	182	129	47	56
4	5	100	300	203	78	117	36
5	6	168	102	217	161	41	145

	Rank	I_ThePhantomMenace	II_AttackoftheClones	III_RevengeoftheSith	IV_aNewHope	V_TheEmpireStrikesBack	VI_ReturnofTheJedi
0	1	128	32	36	204	288	146
1	2	71	115	46	135	235	232
2	3	130	102	150	127	106	219
3	4	237	183	182	129	47	56
4	5	100	300	203	78	117	36
5	6	168	102	217	161	41	145

	Rank	I_ThePhantomMenace	II_AttackoftheClones	III_RevengeoftheSith	IV_aNewHope	V_TheEmpireStrikesBack	VI_ReturnofTheJedi
0	1	46	18	27	128	169	82
1	2	30	46	34	107	133	120
2	3	47	52	68	86	71	146
3	4	129	84	121	59	33	44
4	5	76	186	104	30	48	26
5	6	142	84	116	60	16	52

	fanOfStarWars	fanOfStarTrekFranchise	count
0	N/A	N/A	110
1	N/A	NO	228
2	N/A	YES	12
3	NO	N/A	1
4	NO	NO	238
5	NO	YES	45
6	YES	N/A	7
7	YES	NO	175
8	YES	YES	370

	fanOfStarTrekFranchise	fanOfStarWars	count
0	N/A	N/A	110
1	N/A	NO	1
2	N/A	YES	7
3	NO	N/A	228
4	NO	NO	238
5	NO	YES	175
6	YES	N/A	12
7	YES	NO	45
8	YES	YES	370

	fanOfStarWars	knowExpUniverse	count
0	N/A	N/A	350
1	NO	N/A	1
2	NO	NO	255
3	NO	YES	28
4	YES	N/A	7
5	YES	NO	360
6	YES	YES	185

	knowExpUniverse	fanOfStarWars	count
0	N/A	N/A	350
1	N/A	NO	1
2	N/A	YES	7
3	NO	NO	255
4	NO	YES	360
5	YES	NO	28
6	YES	YES	185

	fanOfStarWars	fanOfExpUniverse	count
0	N/A	N/A	350
1	NO	N/A	256
2	NO	NO	22
3	NO	YES	6
4	YES	N/A	367
5	YES	NO	92
6	YES	YES	93

	fanOfExpUniverse	fanOfStarWars	count
0	N/A	N/A	350
1	N/A	NO	256
2	N/A	YES	367
3	NO	NO	22
4	NO	YES	92
5	YES	NO	6
6	YES	YES	93

	Male vs Female
HanSolo	0.8525
LukeSkywalker	0.6514
PrincessLeiaOrgana	0.5699
AnakinSkywalker	0.0573
ObiWanKenobi	0.4709
EmperorPalpatine	0.0000
DarthVader	0.0001
LandoCalrissian	0.0015
BobaFett	0.0000
C-3P0	0.0000
R2D2	0.0000
JarJarBinks	0.0001
PadmeAmidala	0.0288
Yoda	0.0468