Please refer to the Jupyter NBExtension README page to display
the table of contents in a floating window and to expand the body of the contents.

Assignment 1

Executive Summary

In this assignment, we clean and explore the StarWars dataset. The data was collected via a SurveyMonkey poll with 1,186 respondents. The questions asked in the survey are listed below:

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?
  • Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
  • Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
  • Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
  • Which character shot first?
  • Are you familiar with the Expanded Universe?
  • Do you consider yourself to be a fan of the Expanded Universe?
  • Do you consider yourself to be a fan of the Star Trek franchise?
  • Gender
  • Age
  • Household Income
  • Education
  • Location (Census Region)

Task 1: Data Preparation

1. Import Packages

Import all the necessary packages: numpy, pandas, and matplotlib.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

2. Read in Data File

Read StarWars.csv into the dataframe df and assign the column names as follows:

  • column 1 - respondentId
  • column 2 - any6Films, corresponds to the question 'Have you seen any of the 6 films in the Star Wars franchise?'
  • column 3 - fanOfStarWars, corresponds to the question 'Do you consider yourself to be a fan of the Star Wars film franchise?'
  • column 4 to 9 - Seen_I_ThePhantomMenace, Seen_II_AttackoftheClones, Seen_III_RevengeoftheSith, Seen_IV_aNewHope, Seen_V_TheEmpireStrikesBack, Seen_VI_ReturnofTheJedi, corresponds to the question 'Which of the following Star Wars films have you seen?'
  • column 10 to 15 - Rank_I_ThePhantomMenace, Rank_II_AttackoftheClones, Rank_III_RevengeoftheSith, Rank_IV_aNewHope, Rank_V_TheEmpireStrikesBack, Rank_VI_ReturnofTheJedi, corresponds to the question 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.'
  • column 16 to 29 - Favour_HanSolo, Favour_LukeSkywalker, Favour_PrincessLeiaOrgana, Favour_AnakinSkywalker, Favour_ObiWanKenobi, Favour_EmperorPalpatine, Favour_DarthVader, Favour_LandoCalrissian, Favour_BobaFett, Favour_C-3P0, Favour_R2D2, Favour_JarJarBinks, Favour_PadmeAmidala, Favour_Yoda, corresponds to the question 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.'
  • column 30 - firstShotCharacter, corresponds to the question 'Which character shot first?'
  • column 31 - knowExpUniverse, corresponds to the question 'Are you familiar with the Expanded Universe?'
  • column 32 - fanOfExpUniverse, corresponds to the question 'Do you consider yourself to be a fan of the Expanded Universe?'
  • column 33 - fanOfStarTrekFranchise, corresponds to the question 'Do you consider yourself to be a fan of the Star Trek franchise?'
  • column 34 - Gender
  • column 35 - Age
  • column 36 - HHIncome, represents household income
  • column 37 - Education
  • column 38 - Region
In [2]:
column_names = [
    'respondentId', 'any6Films', 'fanOfStarWars',
    'Seen_I_ThePhantomMenace', 'Seen_II_AttackoftheClones',
    'Seen_III_RevengeoftheSith', 'Seen_IV_aNewHope',
    'Seen_V_TheEmpireStrikesBack', 'Seen_VI_ReturnofTheJedi',
    'Rank_I_ThePhantomMenace', 'Rank_II_AttackoftheClones',
    'Rank_III_RevengeoftheSith', 'Rank_IV_aNewHope',
    'Rank_V_TheEmpireStrikesBack', 'Rank_VI_ReturnofTheJedi',
    'Favour_HanSolo', 'Favour_LukeSkywalker', 'Favour_PrincessLeiaOrgana',
    'Favour_AnakinSkywalker', 'Favour_ObiWanKenobi', 'Favour_EmperorPalpatine',
    'Favour_DarthVader', 'Favour_LandoCalrissian', 'Favour_BobaFett',
    'Favour_C-3P0', 'Favour_R2D2', 'Favour_JarJarBinks',
    'Favour_PadmeAmidala', 'Favour_Yoda',
    'firstShotCharacter', 'knowExpUniverse', 'fanOfExpUniverse',
    'fanOfStarTrekFranchise', 'Gender', 'Age', 'HHIncome',
    'Education', 'Region']

# Skip the two header rows in the file and apply our own column names
df = pd.read_csv("StarWars.csv", names=column_names,
                 index_col=False, header=None, skiprows=2)

3. Check Dimensions

Check the size of the loaded dataset: (x, y), where x is the number of records and y is the number of fields.

In [3]:
df.shape
Out[3]:
(1186, 38)
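As a quick sanity check, the loaded shape can be compared against what we expect from the survey: 1,186 respondents and the 38 column names we assigned. A minimal sketch (the helper name `check_dimensions` is ours, not part of any library):

```python
import pandas as pd

def check_dimensions(df: pd.DataFrame, n_rows: int = 1186, n_cols: int = 38) -> bool:
    """Return True if the frame matches the expected survey dimensions."""
    return df.shape == (n_rows, n_cols)
```

Here the expected shape matches, so the file was parsed with the intended number of records and fields.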

4. Show content

Display the first 5 rows of the dataset. As there are 38 columns, we need to set the option display.max_columns so that all columns are shown.

In [4]:
pd.set_option('display.max_columns', 50)
df.head(5)
Out[4]:
respondentId any6Films fanOfStarWars Seen_I_ThePhantomMenace Seen_II_AttackoftheClones Seen_III_RevengeoftheSith Seen_IV_aNewHope Seen_V_TheEmpireStrikesBack Seen_VI_ReturnofTheJedi Rank_I_ThePhantomMenace Rank_II_AttackoftheClones Rank_III_RevengeoftheSith Rank_IV_aNewHope Rank_V_TheEmpireStrikesBack Rank_VI_ReturnofTheJedi Favour_HanSolo Favour_LukeSkywalker Favour_PrincessLeiaOrgana Favour_AnakinSkywalker Favour_ObiWanKenobi Favour_EmperorPalpatine Favour_DarthVader Favour_LandoCalrissian Favour_BobaFett Favour_C-3P0 Favour_R2D2 Favour_JarJarBinks Favour_PadmeAmidala Favour_Yoda firstShotCharacter knowExpUniverse fanOfExpUniverse fanOfStarTrekFranchise Gender Age HHIncome Education Region
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 2.0 1.0 4.0 5.0 6.0 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1.0 2.0 3.0 4.0 5.0 6.0 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 6.0 1.0 2.0 4.0 3.0 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 4.0 6.0 2.0 1.0 3.0 Very favorably Somewhat favorably Somewhat favorably Somewhat unfavorably Very favorably Very unfavorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat favorably Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5. Show Data Types

Show the data type of each column.

In [5]:
df.dtypes
Out[5]:
respondentId                     int64
any6Films                       object
fanOfStarWars                   object
Seen_I_ThePhantomMenace         object
Seen_II_AttackoftheClones       object
Seen_III_RevengeoftheSith       object
Seen_IV_aNewHope                object
Seen_V_TheEmpireStrikesBack     object
Seen_VI_ReturnofTheJedi         object
Rank_I_ThePhantomMenace        float64
Rank_II_AttackoftheClones      float64
Rank_III_RevengeoftheSith      float64
Rank_IV_aNewHope               float64
Rank_V_TheEmpireStrikesBack    float64
Rank_VI_ReturnofTheJedi        float64
Favour_HanSolo                  object
Favour_LukeSkywalker            object
Favour_PrincessLeiaOrgana       object
Favour_AnakinSkywalker          object
Favour_ObiWanKenobi             object
Favour_EmperorPalpatine         object
Favour_DarthVader               object
Favour_LandoCalrissian          object
Favour_BobaFett                 object
Favour_C-3P0                    object
Favour_R2D2                     object
Favour_JarJarBinks              object
Favour_PadmeAmidala             object
Favour_Yoda                     object
firstShotCharacter              object
knowExpUniverse                 object
fanOfExpUniverse                object
fanOfStarTrekFranchise          object
Gender                          object
Age                             object
HHIncome                        object
Education                       object
Region                          object
dtype: object

6. Descriptive Statistics

Numerical columns

Only respondentId and the six ranking columns are numeric. All ranking columns range from 1 to 6.

In [6]:
df.describe(include=np.number).round(2)
Out[6]:
respondentId Rank_I_ThePhantomMenace Rank_II_AttackoftheClones Rank_III_RevengeoftheSith Rank_IV_aNewHope Rank_V_TheEmpireStrikesBack Rank_VI_ReturnofTheJedi
count 1.186000e+03 835.00 836.00 835.00 836.00 836.00 836.00
mean 3.290128e+09 3.73 4.09 4.34 3.27 2.51 3.05
std 1.055639e+06 1.66 1.37 1.40 1.83 1.58 1.67
min 3.288373e+09 1.00 1.00 1.00 1.00 1.00 1.00
25% 3.289451e+09 3.00 3.00 3.00 2.00 1.00 2.00
50% 3.290147e+09 4.00 4.00 5.00 3.00 2.00 3.00
75% 3.290814e+09 5.00 5.00 6.00 5.00 3.00 4.00
max 3.292880e+09 6.00 6.00 6.00 6.00 6.00 6.00

Categorical columns

Show the unique values and value counts for each categorical variable.

In [7]:
for col in df.columns:
    print(f"Column Name: {col}")
    print(df[col].unique())
    print(df[col].value_counts())
    print("\n")
    print("\n")
Column Name: respondentId
[3292879998 3292879538 3292765271 ... 3288375286 3288373068 3288372923]
3289370623    1
3290063543    1
3289539921    1
3290206907    1
3291022013    1
             ..
3289861464    1
3289481310    1
3290944858    1
3288653149    1
3289878528    1
Name: respondentId, Length: 1186, dtype: int64


Column Name: any6Films
['Yes' 'No' 'Yes ']
Yes     935
No      250
Yes       1
Name: any6Films, dtype: int64


Column Name: fanOfStarWars
['Yes' nan 'No' 'Yess' 'Noo']
Yes     551
No      283
Yess      1
Noo       1
Name: fanOfStarWars, dtype: int64


Column Name: Seen_I_ThePhantomMenace
['Star Wars: Episode I  The Phantom Menace' nan]
Star Wars: Episode I  The Phantom Menace    673
Name: Seen_I_ThePhantomMenace, dtype: int64


Column Name: Seen_II_AttackoftheClones
['Star Wars: Episode II  Attack of the Clones' nan]
Star Wars: Episode II  Attack of the Clones    571
Name: Seen_II_AttackoftheClones, dtype: int64


Column Name: Seen_III_RevengeoftheSith
['Star Wars: Episode III  Revenge of the Sith' nan]
Star Wars: Episode III  Revenge of the Sith    550
Name: Seen_III_RevengeoftheSith, dtype: int64


Column Name: Seen_IV_aNewHope
['Star Wars: Episode IV  A New Hope' nan]
Star Wars: Episode IV  A New Hope    607
Name: Seen_IV_aNewHope, dtype: int64


Column Name: Seen_V_TheEmpireStrikesBack
['Star Wars: Episode V The Empire Strikes Back' nan]
Star Wars: Episode V The Empire Strikes Back    758
Name: Seen_V_TheEmpireStrikesBack, dtype: int64


Column Name: Seen_VI_ReturnofTheJedi
['Star Wars: Episode VI Return of the Jedi' nan]
Star Wars: Episode VI Return of the Jedi    738
Name: Seen_VI_ReturnofTheJedi, dtype: int64


Column Name: Rank_I_ThePhantomMenace
[ 3. nan  1.  5.  6.  4.  2.]
4.0    237
6.0    168
3.0    130
1.0    129
5.0    100
2.0     71
Name: Rank_I_ThePhantomMenace, dtype: int64


Column Name: Rank_II_AttackoftheClones
[ 2. nan  6.  4.  5.  1.  3.]
5.0    300
4.0    183
2.0    116
3.0    103
6.0    102
1.0     32
Name: Rank_II_AttackoftheClones, dtype: int64


Column Name: Rank_III_RevengeoftheSith
[ 1. nan  3.  6.  4.  5.  2.]
6.0    217
5.0    203
4.0    182
3.0    150
2.0     47
1.0     36
Name: Rank_III_RevengeoftheSith, dtype: int64


Column Name: Rank_IV_aNewHope
[ 4. nan  2.  6.  3.  1.  5.]
1.0    204
6.0    161
2.0    135
4.0    130
3.0    127
5.0     79
Name: Rank_IV_aNewHope, dtype: int64


Column Name: Rank_V_TheEmpireStrikesBack
[ 5. nan  4.  1.  2.  3.  6.]
1.0    289
2.0    235
5.0    118
3.0    106
4.0     47
6.0     41
Name: Rank_V_TheEmpireStrikesBack, dtype: int64


Column Name: Rank_VI_ReturnofTheJedi
[ 6. nan  3.  2.  1.  4.  5.]
2.0    232
3.0    220
1.0    146
6.0    145
4.0     57
5.0     36
Name: Rank_VI_ReturnofTheJedi, dtype: int64


Column Name: Favour_HanSolo
['Very favorably' nan 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Unfamiliar (N/A)' 'Very unfavorably']
Very favorably                                 610
Somewhat favorably                             151
Neither favorably nor unfavorably (neutral)     44
Unfamiliar (N/A)                                15
Somewhat unfavorably                             8
Very unfavorably                                 1
Name: Favour_HanSolo, dtype: int64


Column Name: Favour_LukeSkywalker
['Very favorably' nan 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Very unfavorably'
 'Unfamiliar (N/A)']
Very favorably                                 552
Somewhat favorably                             219
Neither favorably nor unfavorably (neutral)     38
Somewhat unfavorably                            13
Unfamiliar (N/A)                                 6
Very unfavorably                                 3
Name: Favour_LukeSkywalker, dtype: int64


Column Name: Favour_PrincessLeiaOrgana
['Very favorably' nan 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Very unfavorably'
 'Unfamiliar (N/A)']
Very favorably                                 547
Somewhat favorably                             210
Neither favorably nor unfavorably (neutral)     48
Somewhat unfavorably                            12
Unfamiliar (N/A)                                 8
Very unfavorably                                 6
Name: Favour_PrincessLeiaOrgana, dtype: int64


Column Name: Favour_AnakinSkywalker
['Very favorably' nan 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Very unfavorably'
 'Unfamiliar (N/A)']
Somewhat favorably                             269
Very favorably                                 245
Neither favorably nor unfavorably (neutral)    135
Somewhat unfavorably                            83
Unfamiliar (N/A)                                52
Very unfavorably                                39
Name: Favour_AnakinSkywalker, dtype: int64


Column Name: Favour_ObiWanKenobi
['Very favorably' nan 'Somewhat favorably' 'Very unfavorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Unfamiliar (N/A)']
Very favorably                                 591
Somewhat favorably                             159
Neither favorably nor unfavorably (neutral)     43
Unfamiliar (N/A)                                17
Somewhat unfavorably                             8
Very unfavorably                                 7
Name: Favour_ObiWanKenobi, dtype: int64


Column Name: Favour_EmperorPalpatine
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Very unfavorably' 'Neither favorably nor unfavorably (neutral)'
 'Somewhat unfavorably']
Neither favorably nor unfavorably (neutral)    213
Unfamiliar (N/A)                               156
Somewhat favorably                             143
Very unfavorably                               124
Very favorably                                 110
Somewhat unfavorably                            68
Name: Favour_EmperorPalpatine, dtype: int64


Column Name: Favour_DarthVader
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Somewhat unfavorably' 'Very unfavorably'
 'Neither favorably nor unfavorably (neutral)']
Very favorably                                 310
Somewhat favorably                             171
Very unfavorably                               149
Somewhat unfavorably                           102
Neither favorably nor unfavorably (neutral)     84
Unfamiliar (N/A)                                10
Name: Favour_DarthVader, dtype: int64


Column Name: Favour_LandoCalrissian
['Unfamiliar (N/A)' nan 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Very favorably'
 'Somewhat unfavorably' 'Very unfavorably']
Neither favorably nor unfavorably (neutral)    236
Somewhat favorably                             223
Unfamiliar (N/A)                               148
Very favorably                                 142
Somewhat unfavorably                            63
Very unfavorably                                 8
Name: Favour_LandoCalrissian, dtype: int64


Column Name: Favour_BobaFett
['Unfamiliar (N/A)' nan 'Somewhat unfavorably' 'Very favorably'
 'Somewhat favorably' 'Neither favorably nor unfavorably (neutral)'
 'Very unfavorably']
Neither favorably nor unfavorably (neutral)    248
Somewhat favorably                             153
Very favorably                                 138
Unfamiliar (N/A)                               132
Somewhat unfavorably                            96
Very unfavorably                                45
Name: Favour_BobaFett, dtype: int64


Column Name: Favour_C-3P0
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Very unfavorably']
Very favorably                                 474
Somewhat favorably                             229
Neither favorably nor unfavorably (neutral)     79
Somewhat unfavorably                            23
Unfamiliar (N/A)                                15
Very unfavorably                                 7
Name: Favour_C-3P0, dtype: int64


Column Name: Favour_R2D2
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Very unfavorably']
Very favorably                                 562
Somewhat favorably                             185
Neither favorably nor unfavorably (neutral)     57
Unfamiliar (N/A)                                10
Somewhat unfavorably                            10
Very unfavorably                                 6
Name: Favour_R2D2, dtype: int64


Column Name: Favour_JarJarBinks
['Very favorably' nan 'Unfamiliar (N/A)' 'Very unfavorably'
 'Somewhat favorably' 'Somewhat unfavorably'
 'Neither favorably nor unfavorably (neutral)']
Very unfavorably                               204
Neither favorably nor unfavorably (neutral)    164
Somewhat favorably                             130
Very favorably                                 112
Unfamiliar (N/A)                               109
Somewhat unfavorably                           102
Name: Favour_JarJarBinks, dtype: int64


Column Name: Favour_PadmeAmidala
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Neither favorably nor unfavorably (neutral)' 'Somewhat unfavorably'
 'Very unfavorably']
Neither favorably nor unfavorably (neutral)    207
Somewhat favorably                             183
Very favorably                                 168
Unfamiliar (N/A)                               164
Somewhat unfavorably                            58
Very unfavorably                                34
Name: Favour_PadmeAmidala, dtype: int64


Column Name: Favour_Yoda
['Very favorably' nan 'Unfamiliar (N/A)' 'Somewhat favorably'
 'Very unfavorably' 'Neither favorably nor unfavorably (neutral)'
 'Somewhat unfavorably']
Very favorably                                 605
Somewhat favorably                             144
Neither favorably nor unfavorably (neutral)     51
Unfamiliar (N/A)                                10
Somewhat unfavorably                             8
Very unfavorably                                 8
Name: Favour_Yoda, dtype: int64


Column Name: firstShotCharacter
["I don't understand this question" nan 'Greedo' 'Han']
Han                                 325
I don't understand this question    306
Greedo                              197
Name: firstShotCharacter, dtype: int64


Column Name: knowExpUniverse
['Yes' nan 'No']
No     615
Yes    213
Name: knowExpUniverse, dtype: int64


Column Name: fanOfExpUniverse
['No' nan 'Yes' 'Yess']
No      114
Yes      98
Yess      1
Name: fanOfExpUniverse, dtype: int64


Column Name: fanOfStarTrekFranchise
['No' 'Yes' nan 'Noo' 'yes' 'no ']
No     639
Yes    426
yes      1
Noo      1
no       1
Name: fanOfStarTrekFranchise, dtype: int64


Column Name: Gender
['Male' nan 'Female' 'F' 'female' 'male']
Female    546
Male      496
female      2
F           1
male        1
Name: Gender, dtype: int64


Column Name: Age
['18-29' nan '500' '30-44' '> 60' '45-60']
45-60    291
> 60     269
30-44    268
18-29    217
500        1
Name: Age, dtype: int64


Column Name: HHIncome
[nan '$0 - $24,999' '$100,000 - $149,999' '$25,000 - $49,999'
 '$50,000 - $99,999' '$150,000+']
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: HHIncome, dtype: int64


Column Name: Education
['High school degree' 'Bachelor degree' 'Some college or Associate degree'
 nan 'Graduate degree' 'Less than high school degree']
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64


Column Name: Region
['South Atlantic' 'West South Central' 'West North Central'
 'Middle Atlantic' 'East North Central' 'Pacific' nan 'Mountain'
 'New England' 'East South Central']
East North Central    181
Pacific               175
South Atlantic        170
Middle Atlantic       122
West South Central    110
West North Central     93
Mountain               79
New England            75
East South Central     38
Name: Region, dtype: int64


7. Inspect and Clean (Categorical columns )

Inspect the values of all the categorical columns carefully and handle irregular cardinality issues: strip redundant whitespace, and replace capitalization mismatches and data-entry errors with the majority value of similar meaning.

i. Whitespace found in 'Yes ' in the column 'any6Films'

ii. Typo 'Yess' and 'Noo' found in the column 'fanOfStarWars'

iii. Typo 'Yess' found in the column 'fanOfExpUniverse'

iv. Typo 'Noo' and lower case letters 'yes' 'no' found in the column 'fanOfStarTrekFranchise'

v. Irregular cardinality 'F' and lower case letters 'male' 'female' found in the column 'Gender'

In [8]:
# Strip trailing whitespace so 'Yes ' collapses into 'Yes'
df['any6Films']=df['any6Films'].str.strip()
# Fix the typos 'Yess' and 'Noo'
df['fanOfStarWars']=df['fanOfStarWars'].replace(['Yess'], 'Yes')
df['fanOfStarWars']=df['fanOfStarWars'].replace(['Noo'], 'No')
df['fanOfExpUniverse']=df['fanOfExpUniverse'].replace(['Yess'], 'Yes')
# Fix the case mismatches 'yes'/'no ' and the typo 'Noo'
df['fanOfStarTrekFranchise']=df['fanOfStarTrekFranchise'].replace(['yes'], 'Yes')
df['fanOfStarTrekFranchise']=df['fanOfStarTrekFranchise'].replace(['Noo', 'no '], 'No')
# Normalize 'male'/'female'/'F' to 'Male'/'Female'
df['Gender']=df['Gender'].replace(['male'], 'Male')
df['Gender']=df['Gender'].replace(['female', 'F'], 'Female')
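The same fixes can be expressed more compactly as a single replacement map, which keeps every known variant next to its canonical value in one place. A sketch (the `typo_map` dict and `fix_typos` helper are our own names, not part of pandas):

```python
import pandas as pd

# Every irregular value observed above, mapped to its canonical form.
typo_map = {
    'Yess': 'Yes', 'Noo': 'No', 'yes': 'Yes', 'no ': 'No',
    'male': 'Male', 'female': 'Female', 'F': 'Female',
}

def fix_typos(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Return a copy of df with the known typos replaced in the given columns."""
    df = df.copy()
    df[cols] = df[cols].replace(typo_map)
    return df
```

Applied to ['fanOfStarWars', 'fanOfExpUniverse', 'fanOfStarTrekFranchise', 'Gender'], this produces the same result as the cell above (the whitespace strip on 'any6Films' still needs its own str.strip() call).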

Cast all text data to upper case and strip any extra whitespace using str.upper() and str.strip().

In [9]:
categorical_cols = df.columns[df.dtypes==object].tolist()
for col in categorical_cols:
    print(f"Column Name: {col}")
    df[col]=df[col].str.upper()
    df[col]=df[col].str.strip()
    print(df[col].unique())
    print(df[col].value_counts())
    print("\n")
Column Name: any6Films
['YES' 'NO']
YES    936
NO     250
Name: any6Films, dtype: int64


Column Name: fanOfStarWars
['YES' nan 'NO']
YES    552
NO     284
Name: fanOfStarWars, dtype: int64


Column Name: Seen_I_ThePhantomMenace
['STAR WARS: EPISODE I  THE PHANTOM MENACE' nan]
STAR WARS: EPISODE I  THE PHANTOM MENACE    673
Name: Seen_I_ThePhantomMenace, dtype: int64


Column Name: Seen_II_AttackoftheClones
['STAR WARS: EPISODE II  ATTACK OF THE CLONES' nan]
STAR WARS: EPISODE II  ATTACK OF THE CLONES    571
Name: Seen_II_AttackoftheClones, dtype: int64


Column Name: Seen_III_RevengeoftheSith
['STAR WARS: EPISODE III  REVENGE OF THE SITH' nan]
STAR WARS: EPISODE III  REVENGE OF THE SITH    550
Name: Seen_III_RevengeoftheSith, dtype: int64


Column Name: Seen_IV_aNewHope
['STAR WARS: EPISODE IV  A NEW HOPE' nan]
STAR WARS: EPISODE IV  A NEW HOPE    607
Name: Seen_IV_aNewHope, dtype: int64


Column Name: Seen_V_TheEmpireStrikesBack
['STAR WARS: EPISODE V THE EMPIRE STRIKES BACK' nan]
STAR WARS: EPISODE V THE EMPIRE STRIKES BACK    758
Name: Seen_V_TheEmpireStrikesBack, dtype: int64


Column Name: Seen_VI_ReturnofTheJedi
['STAR WARS: EPISODE VI RETURN OF THE JEDI' nan]
STAR WARS: EPISODE VI RETURN OF THE JEDI    738
Name: Seen_VI_ReturnofTheJedi, dtype: int64


Column Name: Favour_HanSolo
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'UNFAMILIAR (N/A)' 'VERY UNFAVORABLY']
VERY FAVORABLY                                 610
SOMEWHAT FAVORABLY                             151
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     44
UNFAMILIAR (N/A)                                15
SOMEWHAT UNFAVORABLY                             8
VERY UNFAVORABLY                                 1
Name: Favour_HanSolo, dtype: int64


Column Name: Favour_LukeSkywalker
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY UNFAVORABLY'
 'UNFAMILIAR (N/A)']
VERY FAVORABLY                                 552
SOMEWHAT FAVORABLY                             219
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     38
SOMEWHAT UNFAVORABLY                            13
UNFAMILIAR (N/A)                                 6
VERY UNFAVORABLY                                 3
Name: Favour_LukeSkywalker, dtype: int64


Column Name: Favour_PrincessLeiaOrgana
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY UNFAVORABLY'
 'UNFAMILIAR (N/A)']
VERY FAVORABLY                                 547
SOMEWHAT FAVORABLY                             210
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     48
SOMEWHAT UNFAVORABLY                            12
UNFAMILIAR (N/A)                                 8
VERY UNFAVORABLY                                 6
Name: Favour_PrincessLeiaOrgana, dtype: int64


Column Name: Favour_AnakinSkywalker
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY UNFAVORABLY'
 'UNFAMILIAR (N/A)']
SOMEWHAT FAVORABLY                             269
VERY FAVORABLY                                 245
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    135
SOMEWHAT UNFAVORABLY                            83
UNFAMILIAR (N/A)                                52
VERY UNFAVORABLY                                39
Name: Favour_AnakinSkywalker, dtype: int64


Column Name: Favour_ObiWanKenobi
['VERY FAVORABLY' nan 'SOMEWHAT FAVORABLY' 'VERY UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'UNFAMILIAR (N/A)']
VERY FAVORABLY                                 591
SOMEWHAT FAVORABLY                             159
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     43
UNFAMILIAR (N/A)                                17
SOMEWHAT UNFAVORABLY                             8
VERY UNFAVORABLY                                 7
Name: Favour_ObiWanKenobi, dtype: int64


Column Name: Favour_EmperorPalpatine
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'VERY UNFAVORABLY' 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
 'SOMEWHAT UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    213
UNFAMILIAR (N/A)                               156
SOMEWHAT FAVORABLY                             143
VERY UNFAVORABLY                               124
VERY FAVORABLY                                 110
SOMEWHAT UNFAVORABLY                            68
Name: Favour_EmperorPalpatine, dtype: int64


Column Name: Favour_DarthVader
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'SOMEWHAT UNFAVORABLY' 'VERY UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)']
VERY FAVORABLY                                 310
SOMEWHAT FAVORABLY                             171
VERY UNFAVORABLY                               149
SOMEWHAT UNFAVORABLY                           102
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     84
UNFAMILIAR (N/A)                                10
Name: Favour_DarthVader, dtype: int64


Column Name: Favour_LandoCalrissian
['UNFAMILIAR (N/A)' nan 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'VERY FAVORABLY'
 'SOMEWHAT UNFAVORABLY' 'VERY UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    236
SOMEWHAT FAVORABLY                             223
UNFAMILIAR (N/A)                               148
VERY FAVORABLY                                 142
SOMEWHAT UNFAVORABLY                            63
VERY UNFAVORABLY                                 8
Name: Favour_LandoCalrissian, dtype: int64


Column Name: Favour_BobaFett
['UNFAMILIAR (N/A)' nan 'SOMEWHAT UNFAVORABLY' 'VERY FAVORABLY'
 'SOMEWHAT FAVORABLY' 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
 'VERY UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    248
SOMEWHAT FAVORABLY                             153
VERY FAVORABLY                                 138
UNFAMILIAR (N/A)                               132
SOMEWHAT UNFAVORABLY                            96
VERY UNFAVORABLY                                45
Name: Favour_BobaFett, dtype: int64


Column Name: Favour_C-3P0
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'VERY UNFAVORABLY']
VERY FAVORABLY                                 474
SOMEWHAT FAVORABLY                             229
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     79
SOMEWHAT UNFAVORABLY                            23
UNFAMILIAR (N/A)                                15
VERY UNFAVORABLY                                 7
Name: Favour_C-3P0, dtype: int64


Column Name: Favour_R2D2
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'VERY UNFAVORABLY']
VERY FAVORABLY                                 562
SOMEWHAT FAVORABLY                             185
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     57
SOMEWHAT UNFAVORABLY                            10
UNFAMILIAR (N/A)                                10
VERY UNFAVORABLY                                 6
Name: Favour_R2D2, dtype: int64


Column Name: Favour_JarJarBinks
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'VERY UNFAVORABLY'
 'SOMEWHAT FAVORABLY' 'SOMEWHAT UNFAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)']
VERY UNFAVORABLY                               204
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    164
SOMEWHAT FAVORABLY                             130
VERY FAVORABLY                                 112
UNFAMILIAR (N/A)                               109
SOMEWHAT UNFAVORABLY                           102
Name: Favour_JarJarBinks, dtype: int64


Column Name: Favour_PadmeAmidala
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' 'SOMEWHAT UNFAVORABLY'
 'VERY UNFAVORABLY']
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)    207
SOMEWHAT FAVORABLY                             183
VERY FAVORABLY                                 168
UNFAMILIAR (N/A)                               164
SOMEWHAT UNFAVORABLY                            58
VERY UNFAVORABLY                                34
Name: Favour_PadmeAmidala, dtype: int64


Column Name: Favour_Yoda
['VERY FAVORABLY' nan 'UNFAMILIAR (N/A)' 'SOMEWHAT FAVORABLY'
 'VERY UNFAVORABLY' 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
 'SOMEWHAT UNFAVORABLY']
VERY FAVORABLY                                 605
SOMEWHAT FAVORABLY                             144
NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)     51
UNFAMILIAR (N/A)                                10
SOMEWHAT UNFAVORABLY                             8
VERY UNFAVORABLY                                 8
Name: Favour_Yoda, dtype: int64


Column Name: firstShotCharacter
["I DON'T UNDERSTAND THIS QUESTION" nan 'GREEDO' 'HAN']
HAN                                 325
I DON'T UNDERSTAND THIS QUESTION    306
GREEDO                              197
Name: firstShotCharacter, dtype: int64


Column Name: knowExpUniverse
['YES' nan 'NO']
NO     615
YES    213
Name: knowExpUniverse, dtype: int64


Column Name: fanOfExpUniverse
['NO' nan 'YES']
NO     114
YES     99
Name: fanOfExpUniverse, dtype: int64


Column Name: fanOfStarTrekFranchise
['NO' 'YES' nan]
NO     641
YES    427
Name: fanOfStarTrekFranchise, dtype: int64


Column Name: Gender
['MALE' nan 'FEMALE']
FEMALE    549
MALE      497
Name: Gender, dtype: int64


Column Name: Age
['18-29' nan '500' '30-44' '> 60' '45-60']
45-60    291
> 60     269
30-44    268
18-29    217
500        1
Name: Age, dtype: int64


Column Name: HHIncome
[nan '$0 - $24,999' '$100,000 - $149,999' '$25,000 - $49,999'
 '$50,000 - $99,999' '$150,000+']
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: HHIncome, dtype: int64


Column Name: Education
['HIGH SCHOOL DEGREE' 'BACHELOR DEGREE' 'SOME COLLEGE OR ASSOCIATE DEGREE'
 nan 'GRADUATE DEGREE' 'LESS THAN HIGH SCHOOL DEGREE']
SOME COLLEGE OR ASSOCIATE DEGREE    328
BACHELOR DEGREE                     321
GRADUATE DEGREE                     275
HIGH SCHOOL DEGREE                  105
LESS THAN HIGH SCHOOL DEGREE          7
Name: Education, dtype: int64


Column Name: Region
['SOUTH ATLANTIC' 'WEST SOUTH CENTRAL' 'WEST NORTH CENTRAL'
 'MIDDLE ATLANTIC' 'EAST NORTH CENTRAL' 'PACIFIC' nan 'MOUNTAIN'
 'NEW ENGLAND' 'EAST SOUTH CENTRAL']
EAST NORTH CENTRAL    181
PACIFIC               175
SOUTH ATLANTIC        170
MIDDLE ATLANTIC       122
WEST SOUTH CENTRAL    110
WEST NORTH CENTRAL     93
MOUNTAIN               79
NEW ENGLAND            75
EAST SOUTH CENTRAL     38
Name: Region, dtype: int64


8. Handling Missing Values

Check

Check the missing values in each column

In [10]:
df.isna().sum()
Out[10]:
respondentId                     0
any6Films                        0
fanOfStarWars                  350
Seen_I_ThePhantomMenace        513
Seen_II_AttackoftheClones      615
Seen_III_RevengeoftheSith      636
Seen_IV_aNewHope               579
Seen_V_TheEmpireStrikesBack    428
Seen_VI_ReturnofTheJedi        448
Rank_I_ThePhantomMenace        351
Rank_II_AttackoftheClones      350
Rank_III_RevengeoftheSith      351
Rank_IV_aNewHope               350
Rank_V_TheEmpireStrikesBack    350
Rank_VI_ReturnofTheJedi        350
Favour_HanSolo                 357
Favour_LukeSkywalker           355
Favour_PrincessLeiaOrgana      355
Favour_AnakinSkywalker         363
Favour_ObiWanKenobi            361
Favour_EmperorPalpatine        372
Favour_DarthVader              360
Favour_LandoCalrissian         366
Favour_BobaFett                374
Favour_C-3P0                   359
Favour_R2D2                    356
Favour_JarJarBinks             365
Favour_PadmeAmidala            372
Favour_Yoda                    360
firstShotCharacter             358
knowExpUniverse                358
fanOfExpUniverse               973
fanOfStarTrekFranchise         118
Gender                         140
Age                            140
HHIncome                       328
Education                      150
Region                         143
dtype: int64

Fix Categorical Columns

By examining the content of the dataset, we make the following findings:

i. The columns fanOfStarWars, knowExpUniverse, fanOfExpUniverse and fanOfStarTrekFranchise are intended for respondents to answer 'Yes' or 'No'; however, some respondents skipped the question and did not provide an answer.

ii. Respondents were meant to select one of the respective answers for the columns below; however, besides these unique values, some respondents skipped the question and did not provide an answer:

`firstShotCharacter` : I DON'T UNDERSTAND THIS QUESTION, GREEDO, HAN

`Gender` : FEMALE, MALE

`Age`: '18-29', '500', '30-44', '> 60', '45-60'

`HHIncome`:  '\\$0 - \\$24,999',  '\\$100,000 - \\$149,999',  '\\$25,000 - \\$49,999', '\\$50,000 - \\$99,999', '\\$150,000+'

`Education`: HIGH SCHOOL DEGREE, BACHELOR DEGREE, SOME COLLEGE OR ASSOCIATE DEGREE, GRADUATE DEGREE, LESS THAN HIGH SCHOOL DEGREE

`Region`: SOUTH ATLANTIC, WEST SOUTH CENTRAL, WEST NORTH CENTRAL, MIDDLE ATLANTIC, EAST NORTH CENTRAL, PACIFIC, MOUNTAIN, NEW ENGLAND, EAST SOUTH CENTRAL

We impute the value 'N/A' for the missing entries in all of the above columns.
In [11]:
df.loc[(df['fanOfStarWars'].isna()), 'fanOfStarWars']='N/A'
df.loc[(df['knowExpUniverse'].isna()), 'knowExpUniverse']='N/A'
df.loc[(df['fanOfExpUniverse'].isna()), 'fanOfExpUniverse']='N/A'
df.loc[(df['fanOfStarTrekFranchise'].isna()), 'fanOfStarTrekFranchise']='N/A'
df.loc[(df['firstShotCharacter'].isna()), 'firstShotCharacter']='N/A'
df.loc[(df['Gender'].isna()), 'Gender'] = 'N/A'
df.loc[(df['Age'].isna()), 'Age'] = 'N/A'
df.loc[(df['HHIncome'].isna()), 'HHIncome'] = 'N/A'
df.loc[(df['Education'].isna()), 'Education'] = 'N/A'
df.loc[(df['Region'].isna()), 'Region'] = 'N/A'
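As a side note, the ten assignments above can be written more compactly with `fillna` applied to a list of columns; a minimal sketch on a toy frame (the column subset here is illustrative):

```python
import pandas as pd

# toy frame standing in for df; real column names follow the notebook
toy = pd.DataFrame({'Gender': ['MALE', None, 'FEMALE'],
                    'Age': [None, '18-29', '45-60']})

cat_cols = ['Gender', 'Age']
# fillna applied column-wise replaces every missing entry in one pass
toy[cat_cols] = toy[cat_cols].fillna('N/A')
print(toy['Gender'].tolist())  # ['MALE', 'N/A', 'FEMALE']
```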

Fix the columns prefixed with Seen

For all the columns whose names are prefixed with Seen, if a respondent has seen the movie, the entry holds that movie's name (and only that name); otherwise it is empty. The information is easier to manipulate if we encode non-null entries as 1 and null entries as 0 for all these columns.

In [12]:
seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
        print(f"Column Name: {col} has value of:", df[col].unique() )
        
Column Name: Seen_I_ThePhantomMenace has value of: ['STAR WARS: EPISODE I  THE PHANTOM MENACE' nan]
Column Name: Seen_II_AttackoftheClones has value of: ['STAR WARS: EPISODE II  ATTACK OF THE CLONES' nan]
Column Name: Seen_III_RevengeoftheSith has value of: ['STAR WARS: EPISODE III  REVENGE OF THE SITH' nan]
Column Name: Seen_IV_aNewHope has value of: ['STAR WARS: EPISODE IV  A NEW HOPE' nan]
Column Name: Seen_V_TheEmpireStrikesBack has value of: ['STAR WARS: EPISODE V THE EMPIRE STRIKES BACK' nan]
Column Name: Seen_VI_ReturnofTheJedi has value of: ['STAR WARS: EPISODE VI RETURN OF THE JEDI' nan]
In [13]:
seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
    df.loc[~(df[col].isna()), col]=1
    df.loc[(df[col].isna()), col]=0
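The same encoding can be done in one vectorized step with `notna`; a sketch on a toy column:

```python
import pandas as pd

# toy 'Seen' column: movie title when watched, NaN/None otherwise
seen = pd.Series(['STAR WARS: EPISODE IV  A NEW HOPE', None,
                  'STAR WARS: EPISODE IV  A NEW HOPE'])

# notna() gives a boolean mask; astype(int) turns it into 1/0
encoded = seen.notna().astype(int)
print(encoded.tolist())  # [1, 0, 1]
```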

Checking the unique values of these columns after encoding.

In [14]:
seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
    print(f"Column Name: {col}")
    print(df[col].unique())
    print(df[col].value_counts())    
Column Name: Seen_I_ThePhantomMenace
[1 0]
1    673
0    513
Name: Seen_I_ThePhantomMenace, dtype: int64
Column Name: Seen_II_AttackoftheClones
[1 0]
0    615
1    571
Name: Seen_II_AttackoftheClones, dtype: int64
Column Name: Seen_III_RevengeoftheSith
[1 0]
0    636
1    550
Name: Seen_III_RevengeoftheSith, dtype: int64
Column Name: Seen_IV_aNewHope
[1 0]
1    607
0    579
Name: Seen_IV_aNewHope, dtype: int64
Column Name: Seen_V_TheEmpireStrikesBack
[1 0]
1    758
0    428
Name: Seen_V_TheEmpireStrikesBack, dtype: int64
Column Name: Seen_VI_ReturnofTheJedi
[1 0]
1    738
0    448
Name: Seen_VI_ReturnofTheJedi, dtype: int64

Fix the columns prefixed with Rank

For all the columns whose names are prefixed with 'Rank', respondents rank each Star Wars movie in order of preference (1 being favourite, 6 being least favourite). Check whether these columns only contain values ranging from 1 to 6.

In [15]:
def checkRank(col):
    is1to6=True
    for i in np.arange(len(col)):
        if(np.isnan(col[i])==False):
            if (col[i].is_integer()== False):
                print('             Row ', i, 'contains a non-integer')
                is1to6=False
            else:
                check = 1 <= col[i] <= 6
                if (check == False):
                    print('             Row', i, 'has value of', col[i], 'and does not range from 1 to 6')
                    is1to6=False
    return is1to6 
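For reference, the same validation can be expressed without the explicit loop using pandas' vectorized operations; a sketch on a toy column:

```python
import pandas as pd

# toy rank column with a missing value, a non-integer, and an out-of-range value
ranks = pd.Series([1.0, 6.0, None, 3.5, 9.0])

nonmissing = ranks.dropna()
# valid entries are whole numbers between 1 and 6 inclusive
is_integer = (nonmissing % 1 == 0)
in_range = nonmissing.between(1, 6)
bad = nonmissing[~(is_integer & in_range)]
print(bad.tolist())  # [3.5, 9.0]
```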
In [16]:
rank_cols = df.columns[df.columns.str.find('Rank', 0, 4)!=-1].tolist()
for col in rank_cols:
    print(f"Column Name: {col}" )
    btn1and6 = checkRank(df[col])
    if btn1and6 is True:
        print("             ONLY contains values between 1 to 6")
Column Name: Rank_I_ThePhantomMenace
             ONLY contains values between 1 to 6
Column Name: Rank_II_AttackoftheClones
             ONLY contains values between 1 to 6
Column Name: Rank_III_RevengeoftheSith
             ONLY contains values between 1 to 6
Column Name: Rank_IV_aNewHope
             ONLY contains values between 1 to 6
Column Name: Rank_V_TheEmpireStrikesBack
             ONLY contains values between 1 to 6
Column Name: Rank_VI_ReturnofTheJedi
             ONLY contains values between 1 to 6

After the validation in the previous step, we know that the entries with missing values are those without any response on the preference of the Star Wars movies. As these columns are ordinal, we should keep them numerical for easier manipulation; thus it is wiser to impute the entries without a response with another unique value (-1).

In [17]:
rank_cols = df.columns[df.columns.str.find('Rank', 0, 4)!=-1].tolist()
for col in rank_cols:
    df.loc[(df[col].isna()), col]=-1
    df[col]=df[col].astype('int64')

Fix the columns prefixed with Favour

For all the columns whose names are prefixed with 'Favour', respondents rate each Star Wars character with one of the following likings (n.b. all likings appear in upper case, since we cast all textual data to upper case):

* 'UNFAMILIAR (N/A)'
* 'VERY UNFAVORABLY'
* 'SOMEWHAT UNFAVORABLY'
* 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
* 'SOMEWHAT FAVORABLY'
* 'VERY FAVORABLY'
In [18]:
def checkFavourite(col):
    isLikingValid=True
    liking = ['UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'] 
    for i in np.arange(len(col)):
        if(pd.isna(col[i])==False): 
            check = (col[i] in liking)
            if (check == False):
                print('             Row', i, 'has value of', col[i], 'and is not in the expected liking list.')
                isLikingValid=False
    return isLikingValid 
In [19]:
print("Liking list values are 'UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'\n" )
fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
for col in fav_cols:
    print(f"Column Name: {col}" )
    likingInList = checkFavourite(df[col])
    if likingInList is True:
        print("             ONLY contains values in the liking list")
Liking list values are 'UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'

Column Name: Favour_HanSolo
             ONLY contains values in the liking list
Column Name: Favour_LukeSkywalker
             ONLY contains values in the liking list
Column Name: Favour_PrincessLeiaOrgana
             ONLY contains values in the liking list
Column Name: Favour_AnakinSkywalker
             ONLY contains values in the liking list
Column Name: Favour_ObiWanKenobi
             ONLY contains values in the liking list
Column Name: Favour_EmperorPalpatine
             ONLY contains values in the liking list
Column Name: Favour_DarthVader
             ONLY contains values in the liking list
Column Name: Favour_LandoCalrissian
             ONLY contains values in the liking list
Column Name: Favour_BobaFett
             ONLY contains values in the liking list
Column Name: Favour_C-3P0
             ONLY contains values in the liking list
Column Name: Favour_R2D2
             ONLY contains values in the liking list
Column Name: Favour_JarJarBinks
             ONLY contains values in the liking list
Column Name: Favour_PadmeAmidala
             ONLY contains values in the liking list
Column Name: Favour_Yoda
             ONLY contains values in the liking list

Two findings:

1) Respondents can give no response on their liking of any of these characters, which leaves the entries empty.

2) The likings are ordinal.

It would be more sensible to encode these columns as follows:

* 'UNFAMILIAR (N/A)' : 99
* 'VERY UNFAVORABLY' : 0
* 'SOMEWHAT UNFAVORABLY' : 1
* 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' :2
* 'SOMEWHAT FAVORABLY' : 3
* 'VERY FAVORABLY' : 4

and -1 for entries without a value.

In [20]:
level_mapping = {'UNFAMILIAR (N/A)': 99, 'VERY UNFAVORABLY': 0, 'SOMEWHAT UNFAVORABLY': 1, 
                 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)': 2, 'SOMEWHAT FAVORABLY': 3, 'VERY FAVORABLY': 4}


fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
for col in fav_cols:
    df[col] = df[col].replace(level_mapping)
    df.loc[(df[col].isna()), col]=-1
    df[col]=df[col].astype('int64')
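An alternative worth noting (not what this notebook does, since it opts for plain integer codes) is pandas' ordered `Categorical`, which keeps the labels, still encodes the order, and maps missing values to code -1 automatically; a sketch:

```python
import pandas as pd

# ordinal liking levels, least to most favourable
levels = ['VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY',
          'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)',
          'SOMEWHAT FAVORABLY', 'VERY FAVORABLY']

col = pd.Series(['VERY FAVORABLY', 'SOMEWHAT UNFAVORABLY', None])
col = pd.Categorical(col, categories=levels, ordered=True)
# ordered categoricals support comparisons without losing the labels;
# .codes exposes the integer encoding, with -1 for missing entries
print(col.codes.tolist())  # [4, 1, -1]
```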

9. Apply test suite

Run a small test suite to verify that all values in the Age column fall into the defined age brackets: '18-29', '30-44', '45-60' and '> 60'. Since we imputed 'N/A' for missing values, include it in the list.

In [21]:
ageBracket = ['18-29', '30-44', '> 60', '45-60', 'N/A'] 
for i in np.arange(len(df['Age'])):
        check = (df['Age'][i] in ageBracket)
        if (check == False):
            print("We found values out of Age Bracket at index", i, "with value", df['Age'][i])
We found values out of Age Bracket at index 12 with value 500

Since only one entry is found and 500 is well outside a normal age range, we impute it with 'N/A', the same as for missing values.

In [22]:
df.loc[df['Age']=='500', 'Age'] = 'N/A'
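The loop-based test suite above can also be expressed with `isin`; a sketch on a toy column:

```python
import pandas as pd

ageBracket = ['18-29', '30-44', '> 60', '45-60', 'N/A']
ages = pd.Series(['18-29', '500', '45-60'])

# isin flags every entry inside the bracket; the negation picks out the outliers
outliers = ages[~ages.isin(ageBracket)]
print(outliers.tolist())  # ['500']
```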

10. Validation after Fixes

Data Types

Check the data types of all the columns in the dataframe after data preprocessing.

In [23]:
df.dtypes
Out[23]:
respondentId                    int64
any6Films                      object
fanOfStarWars                  object
Seen_I_ThePhantomMenace        object
Seen_II_AttackoftheClones      object
Seen_III_RevengeoftheSith      object
Seen_IV_aNewHope               object
Seen_V_TheEmpireStrikesBack    object
Seen_VI_ReturnofTheJedi        object
Rank_I_ThePhantomMenace         int64
Rank_II_AttackoftheClones       int64
Rank_III_RevengeoftheSith       int64
Rank_IV_aNewHope                int64
Rank_V_TheEmpireStrikesBack     int64
Rank_VI_ReturnofTheJedi         int64
Favour_HanSolo                  int64
Favour_LukeSkywalker            int64
Favour_PrincessLeiaOrgana       int64
Favour_AnakinSkywalker          int64
Favour_ObiWanKenobi             int64
Favour_EmperorPalpatine         int64
Favour_DarthVader               int64
Favour_LandoCalrissian          int64
Favour_BobaFett                 int64
Favour_C-3P0                    int64
Favour_R2D2                     int64
Favour_JarJarBinks              int64
Favour_PadmeAmidala             int64
Favour_Yoda                     int64
firstShotCharacter             object
knowExpUniverse                object
fanOfExpUniverse               object
fanOfStarTrekFranchise         object
Gender                         object
Age                            object
HHIncome                       object
Education                      object
Region                         object
dtype: object

Missing values

Check the missing values of all the columns; there shouldn't be any after data preprocessing.

In [24]:
df.isna().sum()
Out[24]:
respondentId                   0
any6Films                      0
fanOfStarWars                  0
Seen_I_ThePhantomMenace        0
Seen_II_AttackoftheClones      0
Seen_III_RevengeoftheSith      0
Seen_IV_aNewHope               0
Seen_V_TheEmpireStrikesBack    0
Seen_VI_ReturnofTheJedi        0
Rank_I_ThePhantomMenace        0
Rank_II_AttackoftheClones      0
Rank_III_RevengeoftheSith      0
Rank_IV_aNewHope               0
Rank_V_TheEmpireStrikesBack    0
Rank_VI_ReturnofTheJedi        0
Favour_HanSolo                 0
Favour_LukeSkywalker           0
Favour_PrincessLeiaOrgana      0
Favour_AnakinSkywalker         0
Favour_ObiWanKenobi            0
Favour_EmperorPalpatine        0
Favour_DarthVader              0
Favour_LandoCalrissian         0
Favour_BobaFett                0
Favour_C-3P0                   0
Favour_R2D2                    0
Favour_JarJarBinks             0
Favour_PadmeAmidala            0
Favour_Yoda                    0
firstShotCharacter             0
knowExpUniverse                0
fanOfExpUniverse               0
fanOfStarTrekFranchise         0
Gender                         0
Age                            0
HHIncome                       0
Education                      0
Region                         0
dtype: int64

Task 2: Data Exploration

Explore the survey question:

Question 1 - Ranking

Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.

(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi), then analyse how people rate the Star Wars movies.

Quick Link to Conclusion

Total Score

Rank for Most Favourite

1. Show Volume

Plot the volume of rankings of each Star Wars movie, according to the value counts of the columns prefixed with Rank, using a grouped bar chart:

In [25]:
x = np.arange(len(df['Rank_I_ThePhantomMenace'].value_counts()))
width = 0.12  # the width of the bars


fig, ax = plt.subplots()


rects1 = ax.bar(x + width, df['Rank_I_ThePhantomMenace'].value_counts().sort_index(), width, label='I_ThePhantomMenace')
rects2 = ax.bar(x + width*2, df['Rank_II_AttackoftheClones'].value_counts().sort_index(), width, label='II_AttackoftheClones')
rects3 = ax.bar(x + width*3, df['Rank_III_RevengeoftheSith'].value_counts().sort_index(), width, label='III_RevengeoftheSith')
rects4 = ax.bar(x + width*4, df['Rank_IV_aNewHope'].value_counts().sort_index(), width, label='IV_aNewHope')
rects5 = ax.bar(x + width*5, df['Rank_V_TheEmpireStrikesBack'].value_counts().sort_index(), width, label='V_TheEmpireStrikesBack')
rects6 = ax.bar(x + width*6, df['Rank_VI_ReturnofTheJedi'].value_counts().sort_index(), width, label='VI_ReturnofTheJedi')

ax.set_ylabel('Counts')
ax.set_xlabel('Favourite Score [-1: No response,  1 - 6 ( most favourite to least favourite)]')
ax.set_title('Rank Counts of Star Wars movies')
ax.set_xticks(x+ width*3)
ax.set_xticklabels(df['Rank_I_ThePhantomMenace'].value_counts().sort_index().index)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=3)
plt.savefig('OverAllRankCount.png', dpi=300, bbox_inches='tight')
plt.show()

The chart shows the following findings:

  • Each movie has approximately the same number of no-responses on ranking
  • V_TheEmpireStrikesBack received the most rank-1 (most favourite) votes
  • III_RevengeoftheSith received the most rank-6 (least favourite) votes
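As a side note, the six value_counts calls behind the bars can be assembled into one frame first, after which pandas can draw the same grouped chart in a single call; a minimal sketch on toy columns (column names are illustrative):

```python
import pandas as pd

# toy rank columns standing in for the six Rank_* columns
toy = pd.DataFrame({'Rank_A': [1, 2, 1, -1], 'Rank_B': [2, 1, 2, 2]})

# one value_counts per column, aligned on the rank values
counts = toy.apply(pd.Series.value_counts).fillna(0).sort_index()
# counts.plot.bar() would then draw the grouped chart in one call
print(counts.shape)  # (3, 2)
```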

Since the most favourite movie is ranked 1 and the least favourite is ranked 6, we can further compare the total scores of each movie and order them. However, the total scores are only meaningful if the rankings are valid, i.e. each respondent assigned the ranks 1 to 6 across the Star Wars movies.

2. Find Invalid response

Subset all the columns that rank the Star Wars movies. Since we know the values in these columns range only from 1 to 6, a valid ranking row sums to 21 (1 + 2 + 3 + 4 + 5 + 6). Check how many rows in the subset do not sum to 21, excluding the rows with no response at all (whose total is -6, since every rank was imputed with -1).

In [26]:
rank = df.iloc[:, 9:15]
rank[rank.sum(axis=1)==21]
rank[(rank.sum(axis=1)!=21) & (rank.sum(axis=1)!=-6) ]
Out[26]:
Rank_I_ThePhantomMenace Rank_II_AttackoftheClones Rank_III_RevengeoftheSith Rank_IV_aNewHope Rank_V_TheEmpireStrikesBack Rank_VI_ReturnofTheJedi
384 1 2 -1 4 5 3
889 -1 3 2 5 1 4

Only 2 entries have a partial response. Excluding these 2 entries should not have a big effect on the analysis, as long as they are only a small percentage of the entries with valid rankings.
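Note that summing to 21 is a necessary but not a sufficient condition for a valid ranking: a row such as 1, 1, 4, 4, 5, 6 also sums to 21 without being a permutation of 1 to 6. A stricter per-row check, sketched on toy data:

```python
import pandas as pd

ranks = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                      [1, 1, 4, 4, 5, 6]])  # sums to 21 but repeats ranks

# a row is a valid ranking only if its sorted values are exactly 1..6
is_perm = ranks.apply(lambda r: sorted(r) == [1, 2, 3, 4, 5, 6], axis=1)
print(is_perm.tolist())  # [True, False]
```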

Let's find out how many entries have a valid ranking. To do this, we first subset df to a new dataframe rank containing only the entries whose ranks sum to 21.

In [27]:
rank = rank[rank.sum(axis=1)==21]
In [28]:
rank.shape
Out[28]:
(834, 6)

3. Validate Ranking

Check whether each movie receives 834 rankings, all comprising values from 1 to 6.

In [29]:
# Apply the function value_counts to all the column in rank, make the result to a new dataframe `rankCount`.
rankCount = rank.apply(pd.Series.value_counts).reset_index()
rankCount
Out[29]:
index Rank_I_ThePhantomMenace Rank_II_AttackoftheClones Rank_III_RevengeoftheSith Rank_IV_aNewHope Rank_V_TheEmpireStrikesBack Rank_VI_ReturnofTheJedi
0 1 128 32 36 204 288 146
1 2 71 115 46 135 235 232
2 3 130 102 150 127 106 219
3 4 237 183 182 129 47 56
4 5 100 300 203 78 117 36
5 6 168 102 217 161 41 145
In [30]:
# Sum the rankCount of each movie which should be equal to 834
np.sum(rankCount,axis=0)
Out[30]:
index                           21
Rank_I_ThePhantomMenace        834
Rank_II_AttackoftheClones      834
Rank_III_RevengeoftheSith      834
Rank_IV_aNewHope               834
Rank_V_TheEmpireStrikesBack    834
Rank_VI_ReturnofTheJedi        834
dtype: int64

Let's make the column headings of rankCount more readable: strip the Rank_ prefix from the column names, and rename index to 'Rank'.

In [31]:
movie = rank.columns.str[5:].tolist()
In [32]:
movie
Out[32]:
['I_ThePhantomMenace',
 'II_AttackoftheClones',
 'III_RevengeoftheSith',
 'IV_aNewHope',
 'V_TheEmpireStrikesBack',
 'VI_ReturnofTheJedi']
In [33]:
rankCountColName = ['Rank']
rankCountColName.extend(movie)
rankCount.columns=rankCountColName
rankCount
Out[33]:
Rank I_ThePhantomMenace II_AttackoftheClones III_RevengeoftheSith IV_aNewHope V_TheEmpireStrikesBack VI_ReturnofTheJedi
0 1 128 32 36 204 288 146
1 2 71 115 46 135 235 232
2 3 130 102 150 127 106 219
3 4 237 183 182 129 47 56
4 5 100 300 203 78 117 36
5 6 168 102 217 161 41 145
In [34]:
np.sum(rankCount,axis=0)
Out[34]:
Rank                       21
I_ThePhantomMenace        834
II_AttackoftheClones      834
III_RevengeoftheSith      834
IV_aNewHope               834
V_TheEmpireStrikesBack    834
VI_ReturnofTheJedi        834
dtype: int64

Summing rankCount by row and by column, there are 834 counts for each ranking and 834 counts for each Star Wars movie. Thus all the entries in rank contain valid rankings, and we can use rank to sum up the scores.

In [35]:
rankCount.loc[:,rankCount.columns != 'Rank'].sum(axis=0) 
Out[35]:
I_ThePhantomMenace        834
II_AttackoftheClones      834
III_RevengeoftheSith      834
IV_aNewHope               834
V_TheEmpireStrikesBack    834
VI_ReturnofTheJedi        834
dtype: int64
In [36]:
SumbyRank = pd.DataFrame(rankCount.loc[:,rankCount.columns != 'Rank'].sum(axis=1))
SumbyRank.columns=['Sum']
pd.concat([rankCount['Rank'], SumbyRank ], axis=1)
Out[36]:
Rank Sum
0 1 834
1 2 834
2 3 834
3 4 834
4 5 834
5 6 834

4. Calculate Total Score

Sum the scores of each movie in rank

In [37]:
score = rank.sum(axis=0).reset_index()
score.columns=['movie_name', 'total']
score['movie_name']= movie
score = score.sort_values('total')
score
Out[37]:
movie_name total
4 V_TheEmpireStrikesBack 2095
5 VI_ReturnofTheJedi 2541
3 IV_aNewHope 2727
0 I_ThePhantomMenace 3116
1 II_AttackoftheClones 3412
2 III_RevengeoftheSith 3623

5. Ranked and watched All movies

From the missing-value handling section, we noticed that none of the movies has been seen by more than 800 respondents. Thus respondents who gave a valid ranking might not have watched every movie. It would be interesting to know whether the order of the total scores for these 6 movies changes if we only consider rankings from respondents who have actually seen all of them. Subset columns 4 to 15 of df to obtain all the seeing and ranking information.

In [38]:
rankAllSeen = df.iloc[:, 3:15]

Filter rankAllSeen to only the entries that have watched all 6 Star Wars movies.

In [39]:
rankAllSeen = rankAllSeen[rankAllSeen.iloc[:,0:6].sum(axis=1)==6]
rankAllSeen.shape
Out[39]:
(471, 12)

Further filter rankAllSeen to the entries with a valid ranking.

In [40]:
rankAllSeen=rankAllSeen[rankAllSeen.iloc[:,6:].sum(axis=1)==21]
rankAllSeen.shape
Out[40]:
(470, 12)

Drop the seeing columns, as we only need the ranking columns for the analysis.

In [41]:
rankAllSeen = rankAllSeen.iloc[:,6:]
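The filtering steps above can be condensed into one expression; a sketch on a toy frame with 2 Seen and 2 Rank columns (the real data has 6 of each, so the thresholds would be 6 and 21):

```python
import pandas as pd

# toy layout mirroring the notebook: Seen flags followed by Rank columns
toy = pd.DataFrame({'Seen_A': [1, 1, 0], 'Seen_B': [1, 1, 1],
                    'Rank_A': [1, 2, 1], 'Rank_B': [2, -1, 2]})

seen = toy[['Seen_A', 'Seen_B']]
ranks = toy[['Rank_A', 'Rank_B']]
# keep rows that watched everything AND gave a complete ranking (1 + 2 = 3 here)
valid = ranks[(seen.sum(axis=1) == 2) & (ranks.sum(axis=1) == 3)]
print(valid.index.tolist())  # [0]
```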

Sum the scores of each movie in rankAllSeen

In [42]:
scoreAllSeen = rankAllSeen.sum(axis=0).reset_index()
scoreAllSeen.columns=['movie_name', 'total']
scoreAllSeen['movie_name']= movie
scoreAllSeen = scoreAllSeen.sort_values('total')
scoreAllSeen
Out[42]:
movie_name total
4 V_TheEmpireStrikesBack 1116
3 IV_aNewHope 1346
5 VI_ReturnofTheJedi 1378
0 I_ThePhantomMenace 1995
2 III_RevengeoftheSith 1999
1 II_AttackoftheClones 2036

6. Visualize Total Score

Quick Link to Conclusion

Plot score and scoreAllSeen in a side-by-side bar chart.

In [43]:
x = np.arange(len(score['movie_name'])) 

fig, axs = plt.subplots(1, 2, figsize=(9, 5), sharey=True)
axs[0].bar(score['movie_name'], score['total'])
axs[1].bar(scoreAllSeen['movie_name'], scoreAllSeen['total'])
rect0 = axs[0].patches
rect1 = axs[1].patches
axs[0].set_xticks(x)
axs[0].set_xticklabels(score['movie_name'], rotation=40, ha='right')
axs[0].set_title('Valid Ranking')
axs[1].set_xticklabels(scoreAllSeen['movie_name'], rotation=40, ha='right')
axs[1].set_title('Valid Ranking and watched all 6 movies')

fig.suptitle('Total Scores', fontsize=16)

for rect, label in zip(rect0, score['total']):
    height = rect.get_height()
    axs[0].text(rect.get_x() + rect.get_width() / 2, height + 5, label,
            ha='center', va='bottom')

for rect, label in zip(rect1, scoreAllSeen['total']):
    height = rect.get_height()
    axs[1].text(rect.get_x() + rect.get_width() / 2, height + 5, label,
            ha='center', va='bottom')
    

plt.savefig('TotalScore.png', dpi=300, bbox_inches='tight')
plt.show()

By looking at the above plots, with the lowest (best) score on the left in ascending order, we have the following findings:

  • V_TheEmpireStrikesBack is the most favourite movie in both situations.
  • VI_ReturnofTheJedi and IV_aNewHope are the second and third favourites; if we only count respondents who have watched all 6 movies, the second and third places swap.
  • I_ThePhantomMenace is the fourth favourite in both situations.
  • II_AttackoftheClones and III_RevengeoftheSith are the fifth and least favourite; if we only count respondents who have watched all 6 movies, the fifth and last places swap.

7. Visualize Rank 1 (Most favourite) Count

Quick Link to Conclusion

We can also check whether the votes for the favourite movie (rank 1) change when respondents have actually watched all 6 movies. Find the rank-1 count of each movie from rankCount.

In [44]:
rank1 = rankCount[rankCount['Rank']==1]
rank1 = rank1.iloc[:,1:].T.reset_index()
rank1.columns=['movie_name', 'count']
rank1 = rank1.sort_values('count', ascending=False)
rank1
Out[44]:
movie_name count
4 V_TheEmpireStrikesBack 288
3 IV_aNewHope 204
5 VI_ReturnofTheJedi 146
0 I_ThePhantomMenace 128
2 III_RevengeoftheSith 36
1 II_AttackoftheClones 32
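As a cross-check, the rank-1 counts can also be read straight off the rank dataframe without going through rankCount; a sketch on a toy frame:

```python
import pandas as pd

# toy rank frame: each column holds the ranks one movie received
rank = pd.DataFrame({'A': [1, 2, 1], 'B': [2, 1, 2]})

# eq(1) marks the rank-1 cells; summing per column counts first-place votes
rank1_counts = rank.eq(1).sum().sort_values(ascending=False)
print(rank1_counts.tolist())  # [2, 1]
```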

Construct a dataframe which shows the value_counts for rankAllSeen

In [45]:
rankAllSeenCount = rankAllSeen.apply(pd.Series.value_counts).reset_index()
In [46]:
rankAllSeenCountColName = ['Rank']
rankAllSeenCountColName.extend(movie)
rankAllSeenCount.columns=rankAllSeenCountColName
rankAllSeenCount
Out[46]:
Rank I_ThePhantomMenace II_AttackoftheClones III_RevengeoftheSith IV_aNewHope V_TheEmpireStrikesBack VI_ReturnofTheJedi
0 1 46 18 27 128 169 82
1 2 30 46 34 107 133 120
2 3 47 52 68 86 71 146
3 4 129 84 121 59 33 44
4 5 76 186 104 30 48 26
5 6 142 84 116 60 16 52

Find the rank 1 count of each movie from rankAllSeenCount.

In [47]:
rank1AllSeen = rankAllSeenCount[rankAllSeenCount['Rank']==1]
rank1AllSeen = rank1AllSeen.iloc[:,1:].T.reset_index()
rank1AllSeen.columns=['movie_name', 'count']
rank1AllSeen = rank1AllSeen.sort_values('count', ascending=False)
rank1AllSeen
Out[47]:
movie_name count
4 V_TheEmpireStrikesBack 169
3 IV_aNewHope 128
5 VI_ReturnofTheJedi 82
0 I_ThePhantomMenace 46
2 III_RevengeoftheSith 27
1 II_AttackoftheClones 18

Plot rank1 and rank1AllSeen in a side-by-side bar chart.

In [48]:
x = np.arange(len(rank1['movie_name'])) 

fig, axs = plt.subplots(1, 2, figsize=(9, 5), sharey=True)
axs[0].bar(rank1['movie_name'], rank1['count'])
axs[1].bar(rank1AllSeen['movie_name'], rank1AllSeen['count'])
axs[0].set_xticks(x)
axs[0].set_xticklabels(rank1['movie_name'], rotation=40, ha='right')
axs[0].set_title('Valid Ranking')
axs[1].set_xticklabels(rank1AllSeen['movie_name'], rotation=40, ha='right')
axs[1].set_title('Valid Ranking and watched all 6 movies')
fig.suptitle('Rank 1 count', fontsize=16)

plt.savefig('Rank1.png', dpi=300, bbox_inches='tight')
plt.show()

By looking at the above plot, we conclude that the order of the movies by number of rank-1 (most favourite) votes is the same in both scenarios.

Question 2 - Pairwise comparison

Explore the relationships between columns, producing at least 3 visualisations with plausible hypotheses.

1. Supplementary function : PrintStackRelation

Define a function printStackRelation which takes the following parameters and print out a horizontal stack bar chart of 2 different columns in df, first column would be drawn into horizontal bars, the count of unique value is displayed at the right hand side of each bar, the bar would be divided into different segments according to the percentage from the count of unique values in the second column :

  • a dataframe, relateDf, which contains the row counts of unique value pairs of the 2 chosen columns of the df dataframe, built with the groupby function.
  • list of unique values in the first column of relateDf
  • list of unique values in the second column of relateDf
  • The string to display as the label of the y-axis of the stacked bar chart
  • The string to display as the title of the stacked bar chart
  • File name under which to save the plot.
In [49]:
def printStackRelation(relateDf, col1Lvl, col2Lvl, ylabel, title, plotfile):
    data = np.zeros((len(col1Lvl), len(col2Lvl)))
    percent = np.zeros((len(col1Lvl), len(col2Lvl)))
    col1 = relateDf.columns[0]
    col2 = relateDf.columns[1]    
    j = 0
    k = 0
    for i in np.arange(len(col1Lvl)):
        for j in np.arange(len(col2Lvl)):
            if (sum((relateDf[col1] == col1Lvl[i]) &  (relateDf[col2] == col2Lvl[j])) == 0) :
                num = 0
            else:
                num=relateDf.loc[((relateDf[col1]==col1Lvl[i]) & (relateDf[col2]==col2Lvl[j] )),['count']].values[0]
            data[j, i] = num

    for i in np.arange(len(col1Lvl)):
        for j in np.arange(len(col2Lvl)):
            percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100

    y_pos = np.arange(len(col1Lvl))
    fig = plt.figure(figsize=(8,4))
    ax = fig.add_subplot(111)

    patch_handles = []
    # left alignment of data starts at zero
    left = np.zeros(len(col1Lvl))
    for i, d in enumerate(percent):
        patch_handles.append(ax.barh(y_pos, d, align='center', left=left, label=str(col2Lvl[i])))
        left += d

    # walk all of the bar segments and annotate each with its percentage
    for j in range(len(patch_handles)):
        for i, patch in enumerate(patch_handles[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            ax.text(x, y, "%.1f%%" % (percent[j,i]), ha='center')
            if (j == (len(patch_handles) - 1)):
                ax.text(108, y, "%d" % (data.sum(axis=0)[i]), ha='center')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(col1Lvl)
    plt.title(title, fontsize=14, loc='left', pad=10, x = 0.08)
    plt.legend(bbox_to_anchor=(0.6,1.02,1,0.2), loc="lower left", ncol=3)
    ax.set_xlabel('Percent (%)')
    ax.set_ylabel(ylabel, fontsize=14)
    plt.savefig(plotfile, dpi=300, bbox_inches='tight')
    
    plt.show() 
    

2. 1st Comparison: Fans of Star Wars vs Star Trek

Quick Link to Conclusion

We chose to compare the relationship between fans of Star Wars and fans of Star Trek. Use the groupby function to construct the dataframe WarTrekDf, holding the cross-tabulated counts of unique value pairs between these 2 columns.

Hypothesis: People who like (or dislike) Star Wars would usually feel the same way about Star Trek, and vice versa.

In [50]:
WarTrekDf=df.groupby(['fanOfStarWars', 'fanOfStarTrekFranchise']).size().reset_index()
WarTrekDf.columns = ['fanOfStarWars', 'fanOfStarTrekFranchise', 'count']
WarTrekDf
Out[50]:
fanOfStarWars fanOfStarTrekFranchise count
0 N/A N/A 110
1 N/A NO 228
2 N/A YES 12
3 NO N/A 1
4 NO NO 238
5 NO YES 45
6 YES N/A 7
7 YES NO 175
8 YES YES 370
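As a cross-check, the same counts can be obtained in a single call with pd.crosstab. This is a minimal sketch on toy data (the survey's df is built earlier in the notebook, so a small stand-in dataframe is used here):

```python
import pandas as pd

# Toy stand-in for the survey dataframe's two fan columns
toy = pd.DataFrame({
    'fanOfStarWars':          ['YES', 'YES', 'NO', 'NO', 'N/A'],
    'fanOfStarTrekFranchise': ['YES', 'NO',  'NO', 'NO', 'YES'],
})

# Long format, as produced by the groupby approach above
long_counts = toy.groupby(['fanOfStarWars', 'fanOfStarTrekFranchise']).size().reset_index(name='count')

# pd.crosstab yields the same counts directly as a wide contingency table
wide_counts = pd.crosstab(toy['fanOfStarWars'], toy['fanOfStarTrekFranchise'])
print(wide_counts)
```

The wide form is also exactly the shape a contingency-table test expects later on.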

Construct the list of all unique values from the fanOfStarWars column as StarWarLvl and from the fanOfStarTrekFranchise column as StarTrekLvl.

In [51]:
StarWarLvl=df['fanOfStarWars'].unique()
StarTrekLvl=df['fanOfStarTrekFranchise'].unique()
print(StarWarLvl)
print(StarTrekLvl)
['YES' 'N/A' 'NO']
['NO' 'YES' 'N/A']

The unique values are the same, but their order differs. The segments of the stacked bar chart present better if both lists are arranged in the same order, as below.

In [52]:
StarWarLvlOrder=[1,3,2]
StarWarLvl=[x for i, x in sorted(zip(StarWarLvlOrder, StarWarLvl))]
StarTrekLvlOrder=[2,1,3]
StarTrekLvl=[x for i, x in sorted(zip(StarTrekLvlOrder, StarTrekLvl))]
print(StarWarLvl)
print(StarTrekLvl)
['YES', 'NO', 'N/A']
['YES', 'NO', 'N/A']
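An alternative to the sorted(zip(...)) trick is an ordered pd.Categorical, which makes the intended segment order explicit (a minimal sketch using the values shown above):

```python
import pandas as pd

desired = ['YES', 'NO', 'N/A']    # segment order we want in the chart
observed = ['NO', 'YES', 'N/A']   # e.g. the order returned by unique()

# An ordered categorical sorts by the declared category order instead of alphabetically
levels = list(pd.Categorical(observed, categories=desired, ordered=True).sort_values())
print(levels)  # ['YES', 'NO', 'N/A']
```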

Pass the information constructed above to the function printStackRelation to produce the horizontal percentage stacked bar chart.

In [53]:
printStackRelation(WarTrekDf, StarWarLvl, StarTrekLvl, 'Fans of Star Wars', 'Fan of Star Trek','1. WarTrek.png' )

Construct the dataframe TrekWarDf with the same cross tabular information, however fanOfStarTrekFranchise would be the first column in the group by function.

In [54]:
TrekWarDf=df.groupby(['fanOfStarTrekFranchise', 'fanOfStarWars']).size().reset_index()
TrekWarDf.columns = ['fanOfStarTrekFranchise', 'fanOfStarWars', 'count']
TrekWarDf
Out[54]:
fanOfStarTrekFranchise fanOfStarWars count
0 N/A N/A 110
1 N/A NO 1
2 N/A YES 7
3 NO N/A 228
4 NO NO 238
5 NO YES 175
6 YES N/A 12
7 YES NO 45
8 YES YES 370

Pass the information constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart. This time Fans of Star Trek form the bars and Fans of Star Wars form the segments.

In [55]:
printStackRelation(TrekWarDf, StarTrekLvl, StarWarLvl, 'Fan of Star Trek', 'Fans of Star Wars', '1. TrekWar.png')

The above graphs show that more respondents are fans of Star Wars (552) than of Star Trek (427). Fans of Star Trek are more likely to also like Star Wars (86.7%) than the reverse (only 67% of Star Wars fans are fans of Star Trek). Many more respondents dislike Star Trek (641) than Star Wars (284), and those who dislike Star Wars are very likely to also dislike Star Trek (83.8%), while the reverse is much weaker (only 37.1%).

3. 2nd Comparison: Fans of Star Wars vs Familiarity with the Expanded Universe

Quick Link to Conclusion

We chose to compare the relationship between fans of Star Wars and respondents who are familiar with the Expanded Universe. Use the groupby function to construct the dataframe WarKnowUniDf, holding the cross-tabulated counts of unique value pairs between these 2 columns.

Hypothesis: Fans of Star Wars should know about the Expanded Universe

In [56]:
WarKnowUniDf=df.groupby(['fanOfStarWars', 'knowExpUniverse']).size().reset_index()
WarKnowUniDf.columns = ['fanOfStarWars', 'knowExpUniverse', 'count']
WarKnowUniDf
Out[56]:
fanOfStarWars knowExpUniverse count
0 N/A N/A 350
1 NO N/A 1
2 NO NO 255
3 NO YES 28
4 YES N/A 7
5 YES NO 360
6 YES YES 185

Construct the list of all unique values from the knowExpUniverse column as knowUniLvl

In [57]:
knowUniLvl=df['knowExpUniverse'].unique()
print(StarWarLvl)
print(knowUniLvl)
['YES', 'NO', 'N/A']
['YES' 'N/A' 'NO']

The unique values are the same, but their order differs. The segments of the stacked bar chart present better if both lists are arranged in the same order, as below.

In [58]:
knowUniLvlOrder=[1,3,2]
knowUniLvl=[x for i, x in sorted(zip(knowUniLvlOrder, knowUniLvl))]
print(StarWarLvl)
print(knowUniLvl)
['YES', 'NO', 'N/A']
['YES', 'NO', 'N/A']

Pass the information constructed above to the function printStackRelation to produce the horizontal percentage stacked bar chart.

In [59]:
printStackRelation(WarKnowUniDf, StarWarLvl, knowUniLvl, 'Fans of Star Wars', 'Know Expanded Universe', '2. WarKnowUni.png')

Construct the dataframe KnowUniWarDf with the same cross tabular information, however knowExpUniverse would be the first column in the group by function.

In [60]:
KnowUniWarDf=df.groupby(['knowExpUniverse', 'fanOfStarWars']).size().reset_index()
KnowUniWarDf.columns = ['knowExpUniverse', 'fanOfStarWars', 'count']
KnowUniWarDf
Out[60]:
knowExpUniverse fanOfStarWars count
0 N/A N/A 350
1 N/A NO 1
2 N/A YES 7
3 NO NO 255
4 NO YES 360
5 YES NO 28
6 YES YES 185

Pass the information constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart. This time knowExpUniverse forms the bars and Fans of Star Wars form the segments.

In [61]:
printStackRelation(KnowUniWarDf, knowUniLvl, StarWarLvl, 'Know Expanded Universe', 'Fans of Star Wars', '2. KnowUniWar.png')

From the above 2 charts, we observe that the percentage of Star Wars fans who know about the Expanded Universe is surprisingly low (33.5%), but respondents who do know about the Expanded Universe are very likely to be fans of Star Wars (86.9%). Slightly more than half of the respondents who don't know about the Expanded Universe (58.5%) are fans of Star Wars.

4. 3rd Comparison: Fans of Star Wars vs Fans of Expanded Universe

Quick Link to Conclusion

We chose to compare the relationship between fans of Star Wars and fans of the Expanded Universe. Use the groupby function to construct the dataframe WarFansOfUniDf, holding the cross-tabulated counts of unique value pairs between these 2 columns.

Hypothesis: Fans of the Expanded Universe are most likely Star Wars fans, and presumably vice versa

In [62]:
WarFansOfUniDf=df.groupby(['fanOfStarWars', 'fanOfExpUniverse']).size().reset_index()
WarFansOfUniDf.columns = ['fanOfStarWars', 'fanOfExpUniverse', 'count']
WarFansOfUniDf
Out[62]:
fanOfStarWars fanOfExpUniverse count
0 N/A N/A 350
1 NO N/A 256
2 NO NO 22
3 NO YES 6
4 YES N/A 367
5 YES NO 92
6 YES YES 93

Construct the list of all unique values from the fanOfExpUniverse column as ExpUniLvl

In [63]:
ExpUniLvl=df['fanOfExpUniverse'].unique()
print(StarWarLvl)
print(ExpUniLvl)
['YES', 'NO', 'N/A']
['NO' 'N/A' 'YES']

The unique values are the same, but their order differs. The segments of the stacked bar chart present better if both lists are arranged in the same order, as below.

In [64]:
ExpUniLvlOrder=[2,3,1]
ExpUniLvl=[x for i, x in sorted(zip(ExpUniLvlOrder, ExpUniLvl))]
print(StarWarLvl)
print(ExpUniLvl)
['YES', 'NO', 'N/A']
['YES', 'NO', 'N/A']

Pass the information constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart.

In [65]:
printStackRelation(WarFansOfUniDf, StarWarLvl, ExpUniLvl, 'Fan of Star Wars', 'Fans of Expanded Universe', '3. WarFansUni.png')

Construct the dataframe FansOfUniWarDf with the same cross tabular information, however fanOfExpUniverse would be the first column in the group by function.

In [66]:
FansOfUniWarDf=df.groupby(['fanOfExpUniverse', 'fanOfStarWars']).size().reset_index()
FansOfUniWarDf.columns = ['fanOfExpUniverse', 'fanOfStarWars' , 'count']
FansOfUniWarDf
Out[66]:
fanOfExpUniverse fanOfStarWars count
0 N/A N/A 350
1 N/A NO 256
2 N/A YES 367
3 NO NO 22
4 NO YES 92
5 YES NO 6
6 YES YES 93

Pass the information constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart. This time Fans of the Expanded Universe form the bars and Fans of Star Wars form the segments.

In [67]:
printStackRelation(FansOfUniWarDf, ExpUniLvl, StarWarLvl, 'Fans of Expanded Universe', 'Fan of Star Wars', '3. FansUniWar.png')

The above graphs show that the majority of respondents did not answer whether they are fans of the Expanded Universe (973). Those who are fans of the Expanded Universe are very likely to be fans of Star Wars (93.9%). Almost the same percentage of Star Wars fans are (16.6%) and are not (16.7%) fans of the Expanded Universe, which is somewhat unexpected, although the high non-response rate should be kept in mind.

Question 3 - Demographic Impact

Explore whether there are relationships between people's demographics (Gender, Age, Household Income, Education, Location) and their attitude towards Star Wars characters.

1. Supplementary function: SplitCharacterByLiking

Define a function splitCharacterByLiking which partitions all the respondents in the dataset according to their responses on the favourability of each character. The function returns the partitioned counts and the corresponding percentages as 2-dimensional arrays (rows: favourability levels, columns: characters), as well as a list of the character names.

In [68]:
def splitCharacterByLiking(level):
    fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
    character = []
    data = np.zeros((len(level), len(fav_cols)))
    levelDf = pd.DataFrame()
    percent = np.zeros((len(level), len(fav_cols)))
    i = 0
    for col in fav_cols:
#        print(f"Column Name: {col}" )
        for j in np.arange(len(level)):
            levelDf = df[col].value_counts()
            levelDf.index.name='rank'
            levelDf=levelDf.to_frame('count').reset_index()            
            VF_num=levelDf.loc[levelDf['rank']==level[j],['count']].values[0]
            data[j, i] = VF_num
        total = data.sum(axis=0)
        i +=1
        character.append(col[7:])
    i=0
    for col in fav_cols:
        for j in np.arange(len(level)):
   #         print(data[j,i], data.sum(axis=0)[i], data[j,i]/data.sum(axis=0)[i] * 100)
            percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
        i += 1
    return percent, data, character

Define a function printLiking which takes the following parameters and prints a horizontal stacked bar chart of the respondents' responses on the favourability of each character.

* a 2-D array on the favorability of each character (in percentage)
* a 2-D array on the favorability of each character (actual count)
* list of the characters' name
* list of favorability level
In [69]:
def printLiking(percent, data, character, level, plotfile='FavCharacter.png'):
    level_mapping = {'Unfamiliar (N/A)': 99, 'Very unfavorably': 0, 'Somewhat unfavorably': 1, 
                 'Neither favorably nor unfavorably (neutral)': 2, 'Somewhat favorably': 3, 'Very favorably': 4}

    key_list = list(level_mapping.keys()) 
    val_list = list(level_mapping.values())
            

    y_pos = np.arange(len(character))
    fig = plt.figure(figsize=(15,10))
    ax = fig.add_subplot(111)

    patch_handles = []
    # left alignment of data starts at zero
    left = np.zeros(len(character)) 
    for i, d in enumerate(percent):
        likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])] 
        patch_handles.append(ax.barh(y_pos, d, align='center', left=left,label=likeStr))
        left += d

    # search all of the bar segments and annotate
    for j in range(len(patch_handles)):
        for i, patch in enumerate(patch_handles[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            ax.text(x,y, "%d%%" % (percent[j,i]), ha='center')
            if (j == (len(patch_handles) - 1)):
                    ax.text(103, y, "%d" % (data.sum(axis=0)[i]), ha='center')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(character)
    plt.title('Character with count by liking', fontsize=16)
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=4)
    ax.set_xlabel('Percent (%)')
    plt.savefig(plotfile, dpi=300, bbox_inches='tight')

    plt.show()

2. Visualize Liking of each character

Construct the list of favourability levels by calling the unique function on the Favour_HanSolo column.

In [70]:
favourLvl = df['Favour_HanSolo'].unique()
favourLvl = favourLvl[favourLvl != -1].tolist()
favourLvl.sort()

Construct the 2-D percentage and count arrays on the favourability of each character, and the list of characters, by calling the function splitCharacterByLiking; then pass all the elements to the function printLiking to print the stacked bar chart.

In [71]:
favPercent, favData, character = splitCharacterByLiking(favourLvl)
printLiking(favPercent, favData, character, favourLvl)

3. Supplementary function: SplitCharacterByLikingWithDemo

Define a function splitCharacterByLikingWithDemo which partitions the respondents having a certain demographic feature according to their responses on the favourability of each character. This function takes the following arguments:

* the column name of the demographic feature
* the list of values of that demographic feature
* list of favourability levels
In [72]:
def splitCharacterByLikingWithDemo(demo, demoList, level):
    fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
    character = []
    data = np.zeros((len(level), len(fav_cols)))
    levelDf = pd.DataFrame()
    percent = np.zeros((len(level), len(fav_cols)))
    i = 0
    for col in fav_cols:

        levelDf= df.groupby([demo, col]).size().reset_index()
        levelDf.columns = [demo, 'rank', 'count']
        levelDf=levelDf[levelDf[demo].isin(demoList)]
        levelDf=levelDf.groupby('rank').sum().reset_index()
        for j in np.arange(len(level)):
            if(sum(levelDf['rank']==level[j])==0):
                VF_num = 0
            else:
                VF_num=levelDf.loc[levelDf['rank']==level[j],['count']].values[0]
            data[j, i] = VF_num
        total = data.sum(axis=0)
        i +=1
        character.append(col[7:])
    i=0
    for col in fav_cols:
        for j in np.arange(len(level)):
   #         print(data[j,i], data.sum(axis=0)[i], data[j,i]/data.sum(axis=0)[i] * 100)
            percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
        i += 1
    return percent, data, character

Define a function printLiking2Chart which takes the following parameters and prints 2 side-by-side horizontal stacked bar charts of the respondents' responses on the favourability of each character.

* a 2-D array on the favorability of each character of respondents with certain demographic feature (in percentage)
* a 2-D array on the favorability of each character of respondents with other values of the same demographic feature (in percentage)    
* a 2-D array on the favorability of each character of respondents with certain demographic feature (actual count)
* a 2-D array on the favorability of each character of respondents with other values of the same demographic feature (actual count)    
* str1 - subtitle of the left bar chart
* str2 - subtitle of the right bar chart
* list of the characters' name
* list of favorability level
In [73]:
def printLiking2Chart(percent1, percent2, data1, data2, str1, str2, character, level, plotfile='FavCharacterByDemo.png'):
    level_mapping = {'Unfamiliar (N/A)': 99, 'Very unfavorably': 0, 'Somewhat unfavorably': 1, 
                 'Neither favorably nor unfavorably (neutral)': 2, 'Somewhat favorably': 3, 'Very favorably': 4}

    
    key_list = list(level_mapping.keys()) 
    val_list = list(level_mapping.values())
    

    y_pos = np.arange(len(character))
    fig, axs = plt.subplots(1, 2, figsize=(15, 10), sharey=True)

    #colors ='rgwm'
    patch_handles1 = []
    # left alignment of data starts at zero
    left = np.zeros(len(character)) 
    for i, d in enumerate(percent1):
        likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
        patch_handles1.append(axs[0].barh(y_pos, d, align='center', 
          left=left,label=likeStr))
        left += d

    patch_handles2 = []
    # left alignment of data starts at zero
    left = np.zeros(len(character)) 
    for i, d in enumerate(percent2):
        likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
        patch_handles2.append(axs[1].barh(y_pos, d, align='center', 
          left=left,label=likeStr))
        left += d
        
    # search all of the bar segments and annotate
    for j in range(len(patch_handles1)):
        for i, patch in enumerate(patch_handles1[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            axs[0].text(x,y, "%d%%" % (percent1[j,i]), ha='center')
            if (j == (len(patch_handles1) - 1)):
                    axs[0].text(110, y, "%d" % (data1.sum(axis=0)[i]), ha='center')

    for j in range(len(patch_handles2)):
        for i, patch in enumerate(patch_handles2[j].get_children()):
            bl = patch.get_xy()
            x = 0.5*patch.get_width() + bl[0]
            y = 0.5*patch.get_height() + bl[1]
            axs[1].text(x,y, "%d%%" % (percent2[j,i]), ha='center')                   
            if (j == (len(patch_handles2) - 1)):
                    axs[1].text(110, y, "%d" % (data2.sum(axis=0)[i]), ha='center')

    axs[0].set_yticks(y_pos)
    axs[0].set_yticklabels(character)    
    
    axs[0].set_title(str1)
    axs[1].set_title(str2)
    axs[0].set_xlabel('Percent (%)')
    axs[1].set_xlabel('Percent (%)')

    fig.suptitle('Character with count by liking', fontsize=16)
    plt.legend(loc='upper center', bbox_to_anchor=(-0.1, -0.1), ncol=4)
    plt.savefig(plotfile, dpi=300, bbox_inches='tight')

    plt.show()

4. Visualize liking of each character by:

Gender

Find the list of gender by calling the unique function on the Gender column

In [74]:
print(df['Gender'].unique())
print(df['Gender'].value_counts())
['MALE' 'N/A' 'FEMALE']
FEMALE    549
MALE      497
N/A       140
Name: Gender, dtype: int64

Construct the 2-D percentage and count arrays on the favourability of each character for Female and Male respondents by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar chart by calling the function printLiking2Chart.

In [75]:
F_favPercent, F_favData, character = splitCharacterByLikingWithDemo('Gender', ['FEMALE'], favourLvl)
M_favPercent, M_favData, character = splitCharacterByLikingWithDemo('Gender', ['MALE'], favourLvl)
In [76]:
printLiking2Chart(F_favPercent, M_favPercent, F_favData, M_favData, 'Female', 'Male', character, favourLvl, '8. FavCharacterBySex.png')

From the above chart, we can observe the following:

  • the split between Female and Male respondents is almost even.
  • Respondents who provided no answer to the gender question are not included in the chart.
  • There is obvious variation between genders in favourability for Jar Jar Binks (votes for Very unfavorably) and for R2-D2 and C-3P0 (votes for Very favorably).

Similarly, we can plot for the other demographic features.

Age

Find the list of age by calling the unique function on the Age column

In [77]:
print(df['Age'].unique())
print(df['Age'].value_counts())
['18-29' 'N/A' '30-44' '> 60' '45-60']
45-60    291
> 60     269
30-44    268
18-29    217
N/A      141
Name: Age, dtype: int64

Construct the 2-D percentage and count arrays on the favourability of each character for different age groups by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar charts by calling the function printLiking2Chart.

a. Young versus Rest of the age group
In [78]:
Y_favPercent, Y_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44', '45-60', '> 60'], favourLvl)
In [79]:
printLiking2Chart(Y_favPercent, R_favPercent, Y_favData, R_favData, 'Age 18 - 29', 'The Rest', character, favourLvl, '10. FavCharacterYoungandRest.png')
b. Old versus Rest of the age group
In [80]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '30-44', '45-60'], favourLvl)
O_favPercent, O_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
In [81]:
printLiking2Chart(O_favPercent, R_favPercent, O_favData, R_favData, 'Age > 60', 'The Rest', character, favourLvl, '10. FavCharacterOldandRest.png')
c. Young versus Old
In [82]:
Y_favPercent, Y_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
O_favPercent, O_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
In [83]:
printLiking2Chart(Y_favPercent, O_favPercent, Y_favData, O_favData, 'Age 18-29', 'Age > 60', character, favourLvl, '10. FavCharacterYoundandOld.png')
d. 30-44 versus the Rest
In [84]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '45-60', '> 60'], favourLvl)
Middle1_favPercent, Middle1_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44'], favourLvl)
In [85]:
printLiking2Chart(Middle1_favPercent, R_favPercent, Middle1_favData, R_favData, 'Age 30-44', 'The Rest', character, favourLvl, '10. FavCharacter30-44Rest.png')
e. 45-60 versus the Rest
In [86]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '30-44', '> 60'], favourLvl)
Middle2_favPercent, Middle2_favData, character = splitCharacterByLikingWithDemo('Age', ['45-60'], favourLvl)
In [87]:
printLiking2Chart(Middle2_favPercent, R_favPercent, Middle2_favData, R_favData, 'Age 45-60', 'The Rest', character, favourLvl, '10. FavCharacter45-60Rest.png')

Income

Find the list of household income levels by calling the unique function on the HHIncome column

In [88]:
print(df['HHIncome'].unique())
print(df['HHIncome'].value_counts())
['N/A' '$0 - $24,999' '$100,000 - $149,999' '$25,000 - $49,999'
 '$50,000 - $99,999' '$150,000+']
N/A                    328
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: HHIncome, dtype: int64

Construct the 2-D percentage and count arrays on the favourability of each character for different household income groups by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar charts by calling the function printLiking2Chart.

a. Lower Income versus Rest of the household income group
In [89]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$50,000 - $99,999', '$150,000+'], favourLvl)
P_favPercent, P_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999'], favourLvl)
In [90]:
printLiking2Chart(P_favPercent, R_favPercent, P_favData, R_favData, 'Lower Income', 'The Rest', character, favourLvl, '12. FavCharacterPoorandRest.png')
b. Higher Income versus Rest of the household income group
In [91]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999', '$50,000 - $99,999'], favourLvl)
Rich_favPercent, Rich_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$150,000+'], favourLvl)
In [92]:
printLiking2Chart(Rich_favPercent, R_favPercent, Rich_favData, R_favData, 'Higher Income', 'The Rest', character, favourLvl, '12. FavCharacterRichandRest.png')
c. Middle Income versus Rest of the household income group
In [93]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999', '$100,000 - $149,999', '$150,000+'], favourLvl)
MidIncome_favPercent, MidIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', [ '$50,000 - $99,999'], favourLvl)
In [94]:
printLiking2Chart(MidIncome_favPercent, R_favPercent, MidIncome_favData, R_favData, 'Middle Income', 'The Rest', character, favourLvl, '12. FavCharacterMidIncomeandRest.png')

Education Level

Find the list of education levels by calling the unique function on the Education column

In [95]:
print(df['Education'].unique())
print(df['Education'].value_counts())
['HIGH SCHOOL DEGREE' 'BACHELOR DEGREE' 'SOME COLLEGE OR ASSOCIATE DEGREE'
 'N/A' 'GRADUATE DEGREE' 'LESS THAN HIGH SCHOOL DEGREE']
SOME COLLEGE OR ASSOCIATE DEGREE    328
BACHELOR DEGREE                     321
GRADUATE DEGREE                     275
N/A                                 150
HIGH SCHOOL DEGREE                  105
LESS THAN HIGH SCHOOL DEGREE          7
Name: Education, dtype: int64

Construct the 2-D percentage and count arrays on the favourability of each character for different education levels by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar charts by calling the function printLiking2Chart.

a. Less Educated versus Rest of the education level
In [96]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE', 'BACHELOR DEGREE', 'GRADUATE DEGREE'], favourLvl)
Less_Educated_favPercent, Less_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE'], favourLvl)
In [97]:
printLiking2Chart(Less_Educated_favPercent, R_favPercent, Less_Educated_favData, R_favData, 'Less Educated', 'The Rest', character, favourLvl, '14. FavCharacterLessEducatedandRest.png')
b. College or Associate versus Rest of the education level
In [98]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','BACHELOR DEGREE', 'GRADUATE DEGREE'], favourLvl)
AssociateDegree_favPercent, AssociateDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE'], favourLvl)
In [99]:
printLiking2Chart(AssociateDegree_favPercent, R_favPercent, AssociateDegree_favData, R_favData, 'Some College or Associate Degree', 'The Rest', character, favourLvl,'14.FavCharacterAssoDegandRest.png')
c. Bachelor Degree versus Rest of the education level
In [100]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','SOME COLLEGE OR ASSOCIATE DEGREE', 'GRADUATE DEGREE'], favourLvl)
bachelorDegree_favPercent, bachelorDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['BACHELOR DEGREE'], favourLvl)
In [101]:
printLiking2Chart(bachelorDegree_favPercent, R_favPercent, bachelorDegree_favData, R_favData, 'Bachelor Degree', 'The Rest', character, favourLvl,'14.FavCharacterBachDegandRest.png')
d. Graduate Degree versus Rest of the education level
In [102]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','SOME COLLEGE OR ASSOCIATE DEGREE', 'BACHELOR DEGREE'], favourLvl)
High_Educated_favPercent, High_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['GRADUATE DEGREE'], favourLvl)
In [103]:
printLiking2Chart(High_Educated_favPercent, R_favPercent, High_Educated_favData, R_favData, 'Graduate Degree', 'The Rest', character, favourLvl,'14.FavCharacterGraudateDegandRest.png')

Region

Find the list of Region by calling the unique function on the Region column

In [104]:
print(df['Region'].unique())
print(df['Region'].value_counts())
['SOUTH ATLANTIC' 'WEST SOUTH CENTRAL' 'WEST NORTH CENTRAL'
 'MIDDLE ATLANTIC' 'EAST NORTH CENTRAL' 'PACIFIC' 'N/A' 'MOUNTAIN'
 'NEW ENGLAND' 'EAST SOUTH CENTRAL']
EAST NORTH CENTRAL    181
PACIFIC               175
SOUTH ATLANTIC        170
N/A                   143
MIDDLE ATLANTIC       122
WEST SOUTH CENTRAL    110
WEST NORTH CENTRAL     93
MOUNTAIN               79
NEW ENGLAND            75
EAST SOUTH CENTRAL     38
Name: Region, dtype: int64

We could group the regions into 3 groups for better comparison, according to their geographical location from West to East, as follows:

  • Pacific and Mountain: PACIFIC, MOUNTAIN
  • Central: WEST NORTH CENTRAL, WEST SOUTH CENTRAL, EAST NORTH CENTRAL, EAST SOUTH CENTRAL
  • Atlantic: MIDDLE ATLANTIC, SOUTH ATLANTIC, NEW ENGLAND
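This grouping can also be materialised as a lookup dictionary applied with Series.map, which avoids repeating the long region lists in every call. A minimal sketch on a few sample values (the real df's Region column holds the values shown above):

```python
import pandas as pd

# Census region -> coarse West-to-East group, taken from the list above
region_group = {
    'PACIFIC': 'Pacific and Mountain', 'MOUNTAIN': 'Pacific and Mountain',
    'WEST NORTH CENTRAL': 'Central', 'WEST SOUTH CENTRAL': 'Central',
    'EAST NORTH CENTRAL': 'Central', 'EAST SOUTH CENTRAL': 'Central',
    'MIDDLE ATLANTIC': 'Atlantic', 'SOUTH ATLANTIC': 'Atlantic',
    'NEW ENGLAND': 'Atlantic',
}

regions = pd.Series(['PACIFIC', 'NEW ENGLAND', 'EAST SOUTH CENTRAL', 'N/A'])
groups = regions.map(region_group).fillna('N/A')   # unmapped values stay 'N/A'
print(groups.tolist())
```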
a. Pacific and Mountain versus Rest of the Region
In [105]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL', 'MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
PM_Educated_favPercent, PM_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN'], favourLvl)
In [106]:
printLiking2Chart(PM_Educated_favPercent, R_favPercent, PM_Educated_favData, R_favData, 'Pacific and Mountain', 'The Rest', character, favourLvl, '15.FavCharacterPacificMountainandRest.png')
b. Central versus Rest of the Region
In [107]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN', 'MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
Central_Educated_favPercent, Central_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
In [108]:
printLiking2Chart(Central_Educated_favPercent, R_favPercent, Central_Educated_favData, R_favData, 'Central', 'The Rest', character, favourLvl, '15.FavCharacterCentralandRest.png')
c. Atlantic versus Rest of the Region
In [109]:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN', 'WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
Atlantic_Educated_favPercent, Atlantic_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
In [110]:
printLiking2Chart(Atlantic_Educated_favPercent, R_favPercent, Atlantic_Educated_favData, R_favData, 'Atlantic', 'The Rest', character, favourLvl, '15.FavCharacterAtlanticandRest.png')

9. Perform Chi Square Association Test

It would be more convincing to show statistical figures rather than comparing the categories of the demographic features visually. We could thus perform the chi-square test for independence.
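As a minimal illustration of the test applied below, `scipy.stats.chi2_contingency` takes a table of observed counts and returns the test statistic, the p-value, the degrees of freedom, and the expected counts under independence. The counts here are made up:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table: rows = two demographic groups,
# columns = favorability levels (counts are invented)
observed = np.array([[90, 30, 20],
                     [60, 35, 45]])
stat, p, dof, expected = chi2_contingency(observed)
print(f'chi2={stat:.2f}, p={p:.4f}, dof={dof}')
# Reject H0 (independence) at the 5% level when p < 0.05
```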

1. Split all the demographic data

In [111]:
F_favPercent, F_favData, character = splitCharacterByLikingWithDemo('Gender', ['FEMALE'], favourLvl)
M_favPercent, M_favData, character = splitCharacterByLikingWithDemo('Gender', ['MALE'], favourLvl)
a18to29_favPercent, a18to29_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
a30to44_favPercent, a30to44_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44'], favourLvl)
a45to60_favPercent, a45to60_favData, character = splitCharacterByLikingWithDemo('Age', ['45-60'], favourLvl)
a60plus_favPercent, a60plus_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
lowerIncome_favPercent, lowerIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999'], favourLvl)
midIncome_favPercent, midIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', [ '$50,000 - $99,999'], favourLvl)
highIncome_favPercent, highIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$150,000+'], favourLvl)
Less_Educated_favPercent, Less_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE'], favourLvl)
AssociateDegree_favPercent, AssociateDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE'], favourLvl)
bachelorDegree_favPercent, bachelorDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['BACHELOR DEGREE'], favourLvl)
High_Educated_favPercent, High_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['GRADUATE DEGREE'], favourLvl)
PM_Region_favPercent, PM_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN'], favourLvl)
Central_Region_favPercent, Central_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
Atlantic_Region_favPercent, Atlantic_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)

2. Supplementary function: chiTest

Define a function for the chi-square test that returns the p-value for each pair of corresponding columns in the two passed-in arrays.

In [112]:
from scipy.stats import chi2_contingency
from scipy.stats import chi2

def chiTest(controlArray, testArray):
    # Return one p-value per column (character): each control/test pair
    # of count vectors is stacked into a 2 x k contingency table and
    # tested for independence
    p_list = np.zeros(controlArray.shape[1])
    for i in range(controlArray.shape[1]):
        test = testArray.T[i, ]
        control = controlArray.T[i, ]
        table = np.vstack((control, test))
        stat, p, dof, expected = chi2_contingency(table)
        p_list[i] = p
    return p_list
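The function can be exercised on toy counts; the standalone copy and the numbers below are invented for illustration only. Rows are favorability levels, columns are characters:

```python
import numpy as np
from scipy.stats import chi2_contingency

def chiTest(controlArray, testArray):
    # Standalone copy of the notebook's function: one p-value per column
    p_list = np.zeros(controlArray.shape[1])
    for i in range(controlArray.shape[1]):
        table = np.vstack((controlArray.T[i, ], testArray.T[i, ]))
        stat, p, dof, expected = chi2_contingency(table)
        p_list[i] = p
    return p_list

# Made-up counts: 3 favorability levels x 2 characters
control = np.array([[50, 10],
                    [30, 40],
                    [20, 50]])
test = np.array([[45, 12],
                 [35, 38],
                 [20, 50]])
print(chiTest(control, test).round(3))
```

The second character's counts are nearly identical in both groups, so its p-value comes out high (no evidence against independence).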

3. Chi Square test (each category vs the original dataset)

Quick Link to Conclusion

Perform the chi-square test on the favorability counts of each demographic split against the master favorability data.

  • $H_{0}$: There is no association between favorability and the demographic feature (independence)
  • $H_{A}$: There is an association between favorability and the demographic feature (dependence)
In [113]:
female_pList=chiTest(favData, F_favData)
male_pList=chiTest(favData, M_favData)
a18to29_pList=chiTest(favData, a18to29_favData)
a30to44_pList=chiTest(favData, a30to44_favData)
a45to60_pList=chiTest(favData, a45to60_favData)
a60plus_pList=chiTest(favData, a60plus_favData)
lowerIncome_pList=chiTest(favData, lowerIncome_favData)
midIncome_pList=chiTest(favData, midIncome_favData)
highIncome_pList=chiTest(favData, highIncome_favData)
Less_Educated_pList=chiTest(favData, Less_Educated_favData)
AssociateDegree_pList=chiTest(favData, AssociateDegree_favData)
bachelorDegree_pList=chiTest(favData, bachelorDegree_favData)
High_Educated_pList=chiTest(favData, High_Educated_favData)
PM_Region_pList=chiTest(favData, PM_Region_favData)
Central_Region_pList=chiTest(favData, Central_Region_favData)
Atlantic_Region_pList=chiTest(favData, Atlantic_Region_favData)
i. Construct p-value list

Combine the p-value lists into one dataframe per demographic feature, then concatenate them all into a single dataframe. Print out the p-value list of each demographic feature.

In [114]:
region_prob_list = pd.DataFrame({'PM_Region':PM_Region_pList, 'Central_Region': Central_Region_pList, 'Atlantic_Region': Atlantic_Region_pList }, index=character)
education_prob_list = pd.DataFrame({'Less_Educated':Less_Educated_pList, 'AsocDeg_Educated': AssociateDegree_pList, 'BachDeg_Educated': bachelorDegree_pList, 'High_Educated': High_Educated_pList }, index=character)
income_prob_list = pd.DataFrame({'Low_Income':lowerIncome_pList, 'Middle_Income': midIncome_pList, 'High_Income': highIncome_pList }, index=character)
age_prob_list= pd.DataFrame({'Age_18_to_29':a18to29_pList, 'Age_30_to_44': a30to44_pList, 'Age_45_to_60': a45to60_pList, 'Age_60_plus': a60plus_pList}, index=character)
gender_prob_list = pd.DataFrame({'Male':male_pList, 'Female': female_pList}, index=character)
all_prob_list = pd.concat([gender_prob_list, age_prob_list, income_prob_list, education_prob_list, region_prob_list ], axis=1, sort=False)
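The `pd.concat` call with `axis=1` aligns the per-feature frames column-wise on their shared character index; a toy version with made-up p-values:

```python
import pandas as pd

idx = ['HanSolo', 'Yoda']  # shared character index
gender = pd.DataFrame({'Male': [0.99, 0.67], 'Female': [0.97, 0.48]}, index=idx)
age = pd.DataFrame({'Age_18_to_29': [0.25, 0.75]}, index=idx)

# axis=1 stacks the frames side by side, matching rows by index
combined = pd.concat([gender, age], axis=1, sort=False)
print(combined.columns.tolist())  # ['Male', 'Female', 'Age_18_to_29']
```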
ii. Print out the p-value list for gender
In [115]:
gender_prob_list.round(3)
Out[115]:
Male Female
HanSolo 0.991 0.966
LukeSkywalker 0.974 0.915
PrincessLeiaOrgana 0.944 0.921
AnakinSkywalker 0.598 0.611
ObiWanKenobi 0.915 0.890
EmperorPalpatine 0.074 0.027
DarthVader 0.117 0.117
LandoCalrissian 0.267 0.233
BobaFett 0.035 0.015
C-3P0 0.057 0.020
R2D2 0.098 0.021
JarJarBinks 0.118 0.099
PadmeAmidala 0.607 0.431
Yoda 0.666 0.476
iii. Print out the p-value list for Age
In [116]:
age_prob_list.round(3)
Out[116]:
Age_18_to_29 Age_30_to_44 Age_45_to_60 Age_60_plus
HanSolo 0.248 0.211 0.899 0.203
LukeSkywalker 0.000 0.869 0.471 0.038
PrincessLeiaOrgana 0.045 0.527 0.391 0.294
AnakinSkywalker 0.023 0.002 0.265 0.000
ObiWanKenobi 0.912 0.356 0.842 0.232
EmperorPalpatine 0.136 0.878 0.855 0.461
DarthVader 0.154 0.098 0.504 0.319
LandoCalrissian 0.880 0.278 0.983 0.443
BobaFett 0.013 0.014 0.013 0.008
C-3P0 0.045 0.206 0.397 0.041
R2D2 0.319 0.961 0.566 0.882
JarJarBinks 0.031 0.005 0.687 0.000
PadmeAmidala 0.002 0.118 0.139 0.000
Yoda 0.753 0.701 0.989 0.500
iv. Print out the p-value list for Income
In [117]:
income_prob_list.round(3)
Out[117]:
Low_Income Middle_Income High_Income
HanSolo 0.161 0.823 0.906
LukeSkywalker 0.080 0.744 0.759
PrincessLeiaOrgana 0.118 0.508 0.848
AnakinSkywalker 0.326 0.554 0.653
ObiWanKenobi 0.147 0.508 0.743
EmperorPalpatine 0.812 0.759 0.774
DarthVader 0.292 0.949 0.801
LandoCalrissian 0.881 0.554 0.988
BobaFett 0.993 0.917 0.822
C-3P0 0.417 0.823 0.806
R2D2 0.069 0.437 0.839
JarJarBinks 0.703 0.957 0.793
PadmeAmidala 0.553 0.775 0.762
Yoda 0.392 0.576 0.782
v. Print out the p-value list for Region
In [118]:
region_prob_list.round(3)
Out[118]:
PM_Region Central_Region Atlantic_Region
HanSolo 0.367 0.911 0.891
LukeSkywalker 0.474 0.746 0.974
PrincessLeiaOrgana 0.324 0.618 0.990
AnakinSkywalker 0.989 0.990 0.925
ObiWanKenobi 0.669 0.748 0.643
EmperorPalpatine 0.193 0.816 0.252
DarthVader 0.666 0.792 0.919
LandoCalrissian 0.874 0.840 0.827
BobaFett 0.919 0.864 0.873
C-3P0 0.569 0.691 0.982
R2D2 0.972 0.736 0.898
JarJarBinks 0.609 0.874 0.787
PadmeAmidala 0.871 0.911 0.731
Yoda 0.893 0.636 0.662
vi. Print out the p-value list for Education
In [119]:
education_prob_list.round(3)
Out[119]:
Less_Educated AsocDeg_Educated BachDeg_Educated High_Educated
HanSolo 0.151 0.888 0.717 0.974
LukeSkywalker 0.686 0.667 0.611 0.834
PrincessLeiaOrgana 0.559 0.929 0.988 0.996
AnakinSkywalker 0.660 0.385 0.325 0.815
ObiWanKenobi 0.309 0.717 0.423 0.797
EmperorPalpatine 0.218 0.272 0.735 0.604
DarthVader 0.511 0.467 0.467 0.555
LandoCalrissian 0.644 0.827 0.754 0.761
BobaFett 0.450 0.804 0.999 0.168
C-3P0 0.357 0.621 0.918 0.978
R2D2 0.184 0.947 0.560 0.898
JarJarBinks 0.545 0.493 0.749 0.818
PadmeAmidala 0.499 0.324 0.305 0.701
Yoda 0.562 0.863 0.953 0.641
vii. Print out the p-value list for all categories of the demographic features
In [120]:
all_prob_list.round(3)
Out[120]:
Male Female Age_18_to_29 Age_30_to_44 Age_45_to_60 Age_60_plus Low_Income Middle_Income High_Income Less_Educated AsocDeg_Educated BachDeg_Educated High_Educated PM_Region Central_Region Atlantic_Region
HanSolo 0.991 0.966 0.248 0.211 0.899 0.203 0.161 0.823 0.906 0.151 0.888 0.717 0.974 0.367 0.911 0.891
LukeSkywalker 0.974 0.915 0.000 0.869 0.471 0.038 0.080 0.744 0.759 0.686 0.667 0.611 0.834 0.474 0.746 0.974
PrincessLeiaOrgana 0.944 0.921 0.045 0.527 0.391 0.294 0.118 0.508 0.848 0.559 0.929 0.988 0.996 0.324 0.618 0.990
AnakinSkywalker 0.598 0.611 0.023 0.002 0.265 0.000 0.326 0.554 0.653 0.660 0.385 0.325 0.815 0.989 0.990 0.925
ObiWanKenobi 0.915 0.890 0.912 0.356 0.842 0.232 0.147 0.508 0.743 0.309 0.717 0.423 0.797 0.669 0.748 0.643
EmperorPalpatine 0.074 0.027 0.136 0.878 0.855 0.461 0.812 0.759 0.774 0.218 0.272 0.735 0.604 0.193 0.816 0.252
DarthVader 0.117 0.117 0.154 0.098 0.504 0.319 0.292 0.949 0.801 0.511 0.467 0.467 0.555 0.666 0.792 0.919
LandoCalrissian 0.267 0.233 0.880 0.278 0.983 0.443 0.881 0.554 0.988 0.644 0.827 0.754 0.761 0.874 0.840 0.827
BobaFett 0.035 0.015 0.013 0.014 0.013 0.008 0.993 0.917 0.822 0.450 0.804 0.999 0.168 0.919 0.864 0.873
C-3P0 0.057 0.020 0.045 0.206 0.397 0.041 0.417 0.823 0.806 0.357 0.621 0.918 0.978 0.569 0.691 0.982
R2D2 0.098 0.021 0.319 0.961 0.566 0.882 0.069 0.437 0.839 0.184 0.947 0.560 0.898 0.972 0.736 0.898
JarJarBinks 0.118 0.099 0.031 0.005 0.687 0.000 0.703 0.957 0.793 0.545 0.493 0.749 0.818 0.609 0.874 0.787
PadmeAmidala 0.607 0.431 0.002 0.118 0.139 0.000 0.553 0.775 0.762 0.499 0.324 0.305 0.701 0.871 0.911 0.731
Yoda 0.666 0.476 0.753 0.701 0.989 0.500 0.392 0.576 0.782 0.562 0.863 0.953 0.641 0.893 0.636 0.662
viii. List out the characters with significant differences

Define a function that returns all the cells with p-value < alpha as a list of (row, column, p-value) tuples.

In [121]:
def getIndexes(dfObj, alpha): 
    listOfPos = list()
    # Get bool dataframe with True where the p-value is below alpha
    result = dfObj < alpha 
    # Get list of columns that contain at least one such value
    seriesObj = result.any()
    columnNames = list(seriesObj[seriesObj == True].index)
    # Iterate over those columns and fetch the row indexes where the value exists
    for col in columnNames:
        rows = list(result[col][result[col] == True].index)
        for row in rows:
            listOfPos.append((row, col, dfObj.loc[row, col]))
    # Return a list of (row, column, p-value) tuples
    return listOfPos
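On a toy p-value frame (hypothetical numbers), a condensed standalone equivalent of the function returns one tuple per cell below alpha:

```python
import pandas as pd

def getIndexes(dfObj, alpha):
    # Collect (row, column, p-value) for every cell with p < alpha
    listOfPos = []
    result = dfObj < alpha
    for col in result.columns[result.any()]:
        for row in result.index[result[col]]:
            listOfPos.append((row, col, float(dfObj.loc[row, col])))
    return listOfPos

probs = pd.DataFrame({'Male': [0.9, 0.03], 'Female': [0.2, 0.01]},
                     index=['HanSolo', 'BobaFett'])
print(getIndexes(probs, 0.05))
# [('BobaFett', 'Male', 0.03), ('BobaFett', 'Female', 0.01)]
```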

Make a dataframe that shows all the demographic feature / character pairs with probability < alpha.

In [122]:
all_association_list = getIndexes(all_prob_list, 0.05)
df = pd.DataFrame(all_association_list, columns=['character', 'demographic feature','probability'])
df
Out[122]:
character demographic feature probability
0 BobaFett Male 0.035203
1 EmperorPalpatine Female 0.027167
2 BobaFett Female 0.015359
3 C-3P0 Female 0.019869
4 R2D2 Female 0.020548
5 LukeSkywalker Age_18_to_29 0.000477
6 PrincessLeiaOrgana Age_18_to_29 0.045426
7 AnakinSkywalker Age_18_to_29 0.023067
8 BobaFett Age_18_to_29 0.013010
9 C-3P0 Age_18_to_29 0.045152
10 JarJarBinks Age_18_to_29 0.030795
11 PadmeAmidala Age_18_to_29 0.001704
12 AnakinSkywalker Age_30_to_44 0.001781
13 BobaFett Age_30_to_44 0.013628
14 JarJarBinks Age_30_to_44 0.005269
15 BobaFett Age_45_to_60 0.013320
16 LukeSkywalker Age_60_plus 0.038312
17 AnakinSkywalker Age_60_plus 0.000058
18 BobaFett Age_60_plus 0.008195
19 C-3P0 Age_60_plus 0.041164
20 JarJarBinks Age_60_plus 0.000041
21 PadmeAmidala Age_60_plus 0.000144
In [123]:
df=df.pivot(index='demographic feature', columns='character', values='probability').reset_index()
df=df.rename_axis(None, axis=1)
df=df.round(4)
df=df.where(df.notnull(), '-')
df
Out[123]:
demographic feature AnakinSkywalker BobaFett C-3P0 EmperorPalpatine JarJarBinks LukeSkywalker PadmeAmidala PrincessLeiaOrgana R2D2
0 Age_18_to_29 0.0231 0.013 0.0452 - 0.0308 0.0005 0.0017 0.0454 -
1 Age_30_to_44 0.0018 0.0136 - - 0.0053 - - - -
2 Age_45_to_60 - 0.0133 - - - - - - -
3 Age_60_plus 0.0001 0.0082 0.0412 - 0 0.0383 0.0001 - -
4 Female - 0.0154 0.0199 0.0272 - - - - 0.0205
5 Male - 0.0352 - - - - - - -
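The reshaping step above can be reproduced on a toy long-format frame (the p-values below are invented); note that newer pandas versions require keyword arguments for `DataFrame.pivot`:

```python
import pandas as pd

long_df = pd.DataFrame({
    'character': ['BobaFett', 'BobaFett', 'R2D2'],
    'demographic feature': ['Male', 'Female', 'Female'],
    'probability': [0.035, 0.015, 0.021],
})
wide = (long_df
        .pivot(index='demographic feature', columns='character',
               values='probability')
        .reset_index()
        .rename_axis(None, axis=1))
# Missing combinations (here R2D2 x Male) come out as NaN,
# which the notebook replaces with '-'
wide = wide.where(wide.notnull(), '-')
print(wide)
```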

Conclusion: Rejecting $H_{0}$ when the p-value < 0.05, only Age and Gender show associations with the liking for the characters Anakin Skywalker, Boba Fett, C-3PO, Emperor Palpatine, Jar Jar Binks, Luke Skywalker, Padme Amidala, Princess Leia Organa and R2-D2, as shown in the above table.

4. Chi Square test within demographic features (Gender and Age)

In [124]:
female_male_pList=chiTest(F_favData, M_favData)
female_male_df=pd.DataFrame({'Male vs Female': female_male_pList.round(4)}, index=character)
female_male_df
Out[124]:
Male vs Female
HanSolo 0.8525
LukeSkywalker 0.6514
PrincessLeiaOrgana 0.5699
AnakinSkywalker 0.0573
ObiWanKenobi 0.4709
EmperorPalpatine 0.0000
DarthVader 0.0001
LandoCalrissian 0.0015
BobaFett 0.0000
C-3P0 0.0000
R2D2 0.0000
JarJarBinks 0.0001
PadmeAmidala 0.0288
Yoda 0.0468
a. Character with significant difference by Gender
In [125]:
female_male_df.loc[(female_male_df['Male vs Female']<0.05),:]
Out[125]:
Male vs Female
EmperorPalpatine 0.0000
DarthVader 0.0001
LandoCalrissian 0.0015
BobaFett 0.0000
C-3P0 0.0000
R2D2 0.0000
JarJarBinks 0.0001
PadmeAmidala 0.0288
Yoda 0.0468
b. Character with NO significant difference by Gender
In [126]:
female_male_df.loc[(female_male_df['Male vs Female']>0.05),:]
Out[126]:
Male vs Female
HanSolo 0.8525
LukeSkywalker 0.6514
PrincessLeiaOrgana 0.5699
AnakinSkywalker 0.0573
ObiWanKenobi 0.4709

Conclusion: Rejecting $H_{0}$ when the p-value < 0.05, the characters Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3PO, R2-D2, Jar Jar Binks, Padme Amidala and Yoda show a significant difference in favorability between Male and Female respondents.

In [127]:
a18to29vs30to44_pList=chiTest(a18to29_favData[1:,:], a30to44_favData[1:,:])
a18to29vs45to60_pList=chiTest(a18to29_favData[1:,:], a45to60_favData[1:,:])
a18to29vs60plus_pList=chiTest(a18to29_favData[1:,:], a60plus_favData[1:,:])
a30to44vs45to60_pList=chiTest(a45to60_favData[1:,:], a30to44_favData[1:,:])
a30to44vs60plus_pList=chiTest(a60plus_favData[1:,:], a30to44_favData[1:,:])
a45to60vs60plus_pList=chiTest(a60plus_favData[1:,:], a45to60_favData[1:,:])
all_age_prob_list= pd.DataFrame({'18to29 vs 30to44':a18to29vs30to44_pList.round(4), '18to29 vs 45to60': a18to29vs45to60_pList.round(4), '18to29 vs 60plus': a18to29vs60plus_pList.round(4), '30to44 vs 45to60': a30to44vs45to60_pList.round(4), '30to44 vs 60plus':a30to44vs60plus_pList.round(4), '45to60 vs 60plus':a45to60vs60plus_pList.round(4)}, index=character)
all_age_prob_list
Out[127]:
18to29 vs 30to44 18to29 vs 45to60 18to29 vs 60plus 30to44 vs 45to60 30to44 vs 60plus 45to60 vs 60plus
HanSolo 0.0102 0.1495 0.0292 0.2506 0.0070 0.3032
LukeSkywalker 0.0039 0.0000 0.0000 0.4423 0.0340 0.4645
PrincessLeiaOrgana 0.0275 0.0015 0.0037 0.2756 0.0259 0.2062
AnakinSkywalker 0.5079 0.0007 0.0000 0.0000 0.0000 0.0746
ObiWanKenobi 0.6956 0.4331 0.1062 0.1223 0.0062 0.5216
EmperorPalpatine 0.2667 0.0512 0.0604 0.5853 0.4441 0.9578
DarthVader 0.0287 0.8646 0.2197 0.0766 0.0064 0.2660
LandoCalrissian 0.0674 0.8729 0.5609 0.2379 0.0145 0.3670
BobaFett 0.4066 0.0000 0.0000 0.0000 0.0000 0.1671
C-3P0 0.1318 0.0013 0.0001 0.0215 0.0005 0.3445
R2D2 0.3715 0.0371 0.3776 0.5646 0.6592 0.4000
JarJarBinks 0.4344 0.0140 0.0001 0.0823 0.0018 0.4443
PadmeAmidala 0.1862 0.0004 0.0000 0.0540 0.0000 0.0304
Yoda 0.2064 0.6988 0.4165 0.8375 0.5336 0.5397
a. Characters with significant difference by Age
In [128]:
all_age_prob_list.where(all_age_prob_list < 0.05, '-')
Out[128]:
18to29 vs 30to44 18to29 vs 45to60 18to29 vs 60plus 30to44 vs 45to60 30to44 vs 60plus 45to60 vs 60plus
HanSolo 0.0102 - 0.0292 - 0.007 -
LukeSkywalker 0.0039 0 0 - 0.034 -
PrincessLeiaOrgana 0.0275 0.0015 0.0037 - 0.0259 -
AnakinSkywalker - 0.0007 0 0 0 -
ObiWanKenobi - - - - 0.0062 -
EmperorPalpatine - - - - - -
DarthVader 0.0287 - - - 0.0064 -
LandoCalrissian - - - - 0.0145 -
BobaFett - 0 0 0 0 -
C-3P0 - 0.0013 0.0001 0.0215 0.0005 -
R2D2 - 0.0371 - - - -
JarJarBinks - 0.014 0.0001 - 0.0018 -
PadmeAmidala - 0.0004 0 - 0 0.0304
Yoda - - - - - -

Conclusion: Rejecting $H_{0}$ when the p-value < 0.05, we find a significant difference between age groups 18-29 and 30-44 for Han Solo, Luke Skywalker, Princess Leia Organa and Darth Vader. Similar findings for the other age-group pairs are listed in the above table.

In [ ]: