Please refer to the Jupyter NBExtension README page to display the table of contents in a floating window and expand the body of the contents.
In this assignment, we clean and explore the StarWars dataset. The data was collected by running a poll through SurveyMonkey, with 1,186 respondents. The description of the questions asked in the survey is given below:
Import all the necessary packages: numpy, pandas, matplotlib, etc.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
Read StarWars.csv into the dataframe df and assign the column names as below:
df = pd.read_csv("StarWars.csv", names=['respondentId', 'any6Films', 'fanOfStarWars', 'Seen_I_ThePhantomMenace', 'Seen_II_AttackoftheClones', 'Seen_III_RevengeoftheSith', 'Seen_IV_aNewHope', 'Seen_V_TheEmpireStrikesBack','Seen_VI_ReturnofTheJedi', 'Rank_I_ThePhantomMenace', 'Rank_II_AttackoftheClones', 'Rank_III_RevengeoftheSith', 'Rank_IV_aNewHope', 'Rank_V_TheEmpireStrikesBack','Rank_VI_ReturnofTheJedi', 'Favour_HanSolo', 'Favour_LukeSkywalker', 'Favour_PrincessLeiaOrgana', 'Favour_AnakinSkywalker', 'Favour_ObiWanKenobi', 'Favour_EmperorPalpatine', 'Favour_DarthVader', 'Favour_LandoCalrissian', 'Favour_BobaFett', 'Favour_C-3P0', 'Favour_R2D2', 'Favour_JarJarBinks', 'Favour_PadmeAmidala', 'Favour_Yoda', 'firstShotCharacter', 'knowExpUniverse', 'fanOfExpUniverse', 'fanOfStarTrekFranchise', 'Gender', 'Age', 'HHIncome', 'Education', 'Region'], index_col=False, header = None, skiprows=2)
Check the size of the loaded dataset: (x, y), where x is the number of records and y the number of fields.
df.shape
Display the first 5 rows of the dataset. As there are 38 columns, we need to set the option display.max_columns to display all of them.
pd.set_option('display.max_columns', 50)
df.head(5)
Show the data types of each column
df.dtypes
df.describe(include=np.number).round(2)
Show the unique values and counts for each categorical variable
categorical_cols = df.columns[df.dtypes==object].tolist()
for col in categorical_cols:
print(f"Column Name: {col}")
print(df[col].unique())
print(df[col].value_counts())
print("\n")
Inspect the values of all the categorical columns carefully and handle the irregular cardinality issues, such as stripping redundant whitespace and replacing capitalisation mismatches and data-entry errors with the majority value of similar meaning:
i. White space found in 'Yes ' in the column 'any6Films'
ii. Typos 'Yess' and 'Noo' found in the column 'fanOfStarWars'
iii. Typo 'Yess' found in the column 'fanOfExpUniverse'
iv. Typo 'Noo' and lower-case 'yes' and 'no' found in the column 'fanOfStarTrekFranchise'
v. Irregular cardinality 'F' and lower-case 'male' and 'female' found in the column 'Gender'
df['any6Films']=df['any6Films'].str.strip()
df['fanOfStarWars']=df['fanOfStarWars'].replace(['Yess'], 'Yes')
df['fanOfStarWars']=df['fanOfStarWars'].replace(['Noo'], 'No')
df['fanOfExpUniverse']=df['fanOfExpUniverse'].replace(['Yess'], 'Yes')
df['fanOfStarTrekFranchise']=df['fanOfStarTrekFranchise'].replace(['yes'], 'Yes')
df['fanOfStarTrekFranchise']=df['fanOfStarTrekFranchise'].replace(['Noo', 'no '], 'No')
df['Gender']=df['Gender'].replace(['male'], 'Male')
df['Gender']=df['Gender'].replace(['female', 'F'], 'Female')
Cast all text data to upper-case and strip any extra white space by using str.upper() and str.strip().
categorical_cols = df.columns[df.dtypes==object].tolist()
for col in categorical_cols:
print(f"Column Name: {col}")
df[col]=df[col].str.upper()
df[col]=df[col].str.strip()
print(df[col].unique())
print(df[col].value_counts())
print("\n")
df.isna().sum()
By examining the content of the dataset, we have the below findings:
i. The columns fanOfStarWars, knowExpUniverse, fanOfExpUniverse and fanOfStarTrekFranchise are intended for respondents to answer 'Yes' or 'No'; however, some respondents skipped the question and did not provide any answer.
ii. Respondents were meant to select one of the listed answers for the following columns; however, besides these unique values, some respondents skipped the question and did not provide any answer:
`firstShotCharacter` : I DON'T UNDERSTAND THIS QUESTION, GREEDO, HAN
`Gender` : FEMALE, MALE
`Age`: '18-29', '500', '30-44', '> 60', '45-60'
`HHIncome`: '$0 - $24,999', '$100,000 - $149,999', '$25,000 - $49,999', '$50,000 - $99,999', '$150,000+'
`Education`: HIGH SCHOOL DEGREE, BACHELOR DEGREE, SOME COLLEGE OR ASSOCIATE DEGREE, GRADUATE DEGREE, LESS THAN HIGH SCHOOL DEGREE
`Region`: SOUTH ATLANTIC, WEST SOUTH CENTRAL, WEST NORTH CENTRAL, MIDDLE ATLANTIC, EAST NORTH CENTRAL, PACIFIC, MOUNTAIN, NEW ENGLAND, EAST SOUTH CENTRAL
We impute the value 'N/A' for the missing entries in all the above columns.
# impute 'N/A' for the missing entries in all the columns listed above
na_cols = ['fanOfStarWars', 'knowExpUniverse', 'fanOfExpUniverse', 'fanOfStarTrekFranchise',
           'firstShotCharacter', 'Gender', 'Age', 'HHIncome', 'Education', 'Region']
for col in na_cols:
    df[col] = df[col].fillna('N/A')
Seen
For all the columns whose name is prefixed with Seen, if the respondent has seen the movie, that movie's name (only that movie name) is listed as the value of the entry; otherwise no value is shown. The information is easier to manipulate if we encode 1 for entries that are not null and 0 for entries that are null in all these columns.
seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
print(f"Column Name: {col} has value of:", df[col].unique() )
seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
df.loc[~(df[col].isna()), col]=1
df.loc[(df[col].isna()), col]=0
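Equivalently, the whole 0/1 encoding could be done in a single vectorised statement (a sketch that would replace the loop above; it must run on the raw columns, not after the loop, since nothing is null once encoded):
# alternative to the loop above -- notna() marks seen entries as True, then cast to 1/0
# df[seen_cols] = df[seen_cols].notna().astype('int64')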
Checking the unique values of these columns after encoding.
seen_cols = df.columns[df.columns.str.find('Seen', 0, 4)!=-1].tolist()
for col in seen_cols:
print(f"Column Name: {col}")
print(df[col].unique())
print(df[col].value_counts())
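We can also total the number of respondents who have seen each movie (a quick check with the encoded columns; it backs up the later observation that no movie was seen by more than 800 respondents):
# entries are 1 if seen, 0 otherwise, so a column sum counts the viewers
df[seen_cols].sum()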
Rank
For all the columns whose name is prefixed with 'Rank', respondents rated each Star Wars movie in order of preference (1 being favourite, 6 being least favourite). Check whether these columns only contain values ranging from 1 to 6.
def checkRank(col):
is1to6=True
for i in np.arange(len(col)):
if(np.isnan(col[i])==False):
if (col[i].is_integer()== False):
print(' Row ', i, 'contains a non-integer')
is1to6=False
else:
check = 1 <= col[i] <= 6
if (check == False):
print(' Row', i, 'has value of', col[i], 'and does not range from 1 to 6')
is1to6=False
return is1to6
rank_cols = df.columns[df.columns.str.find('Rank', 0, 4)!=-1].tolist()
for col in rank_cols:
print(f"Column Name: {col}" )
btn1and6 = checkRank(df[col])
if btn1and6:
print(" ONLY contains values between 1 to 6")
After the validation in the previous step, we know that entries with missing values are those without any response on the preference of the Star Wars movies. As the data in these columns is ordinal, we should keep the columns numerical for easier manipulation; thus it is wiser to impute the entries without any response with another unique value (-1).
rank_cols = df.columns[df.columns.str.find('Rank', 0, 4)!=-1].tolist()
for col in rank_cols:
df.loc[(df[col].isna()), col]=-1
df[col]=df[col].astype('int64')
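Equivalently, the imputation loop above could be written with fillna (a sketch; it is harmless to run after the loop, as no NaN remains by then):
df[rank_cols] = df[rank_cols].fillna(-1).astype('int64')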
Favour
For all the columns whose name is prefixed with 'Favour', respondents rated each Star Wars character with one of the following likings (n.b. all likings are now upper case, since we cast all textual data to upper case):
* 'UNFAMILIAR (N/A)'
* 'VERY UNFAVORABLY'
* 'SOMEWHAT UNFAVORABLY'
* 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)'
* 'SOMEWHAT FAVORABLY'
* 'VERY FAVORABLY'
def checkFavourite(col):
isLikingValid=True
liking = ['UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY']
for i in np.arange(len(col)):
if(pd.isna(col[i])==False):
check = (col[i] in liking)
if (check == False):
print(' Row', i, 'has value of', col[i], 'and is not in the liking list.')
isLikingValid=False
return isLikingValid
print("Liking list values are 'UNFAMILIAR (N/A)', 'VERY UNFAVORABLY', 'SOMEWHAT UNFAVORABLY', 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)', 'SOMEWHAT FAVORABLY', 'VERY FAVORABLY'\n" )
fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
for col in fav_cols:
print(f"Column Name: {col}" )
likingInList = checkFavourite(df[col])
if likingInList:
print(" ONLY contains values in the liking list")
Two findings:
1) Respondents can give no response on the liking of any of these characters, which leaves those entries empty.
2) Likings are ordinal.
It would be more sensible to encode these columns as follows:
* 'UNFAMILIAR (N/A)' : 99
* 'VERY UNFAVORABLY' : 0
* 'SOMEWHAT UNFAVORABLY' : 1
* 'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)' :2
* 'SOMEWHAT FAVORABLY' : 3
* 'VERY FAVORABLY' : 4
and -1 for entries without values
level_mapping = {'UNFAMILIAR (N/A)': 99, 'VERY UNFAVORABLY': 0, 'SOMEWHAT UNFAVORABLY': 1,
'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)': 2, 'SOMEWHAT FAVORABLY': 3, 'VERY FAVORABLY': 4}
fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
for col in fav_cols:
df[col] = df[col].replace(level_mapping)
df.loc[(df[col].isna()), col]=-1
df[col]=df[col].astype('int64')
Run a small test suite to check that all values in the Age column are in the defined age brackets: '18-29', '30-44', '45-60' and '> 60'. Since we imputed 'N/A' for the missing values, include it in the list.
ageBracket = ['18-29', '30-44', '> 60', '45-60', 'N/A']
for i in np.arange(len(df['Age'])):
check = (df['Age'][i] in ageBracket)
if (check == False):
print("We found values out of Age Bracket at index", i, "with value", df['Age'][i])
Since only one such entry is found and 500 is definitely outside a normal age range, we impute it with 'N/A', the same as we did for missing values.
df.loc[df['Age']=='500', 'Age'] = 'N/A'
df.dtypes
Check the missing values of all the columns; there shouldn't be any after data preprocessing.
df.isna().sum()
Explore the survey question:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi), then analyse how people rate the Star Wars movies.
Quick Link to Conclusion
Plot the volume of rankings for each Star Wars movie according to the value counts of the columns with prefix Rank, using a grouped bar chart:
x = np.arange(len(df['Rank_I_ThePhantomMenace'].value_counts()))
width = 0.12 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x + width, df['Rank_I_ThePhantomMenace'].value_counts().sort_index(), width, label='I_ThePhantomMenace')
rects2 = ax.bar(x + width*2, df['Rank_II_AttackoftheClones'].value_counts().sort_index(), width, label='II_AttackoftheClones')
rects3 = ax.bar(x + width*3, df['Rank_III_RevengeoftheSith'].value_counts().sort_index(), width, label='III_RevengeoftheSith')
rects4 = ax.bar(x + width*4, df['Rank_IV_aNewHope'].value_counts().sort_index(), width, label='IV_aNewHope')
rects5 = ax.bar(x + width*5, df['Rank_V_TheEmpireStrikesBack'].value_counts().sort_index(), width, label='V_TheEmpireStrikesBack')
rects6 = ax.bar(x + width*6, df['Rank_VI_ReturnofTheJedi'].value_counts().sort_index(), width, label='VI_ReturnofTheJedi')
ax.set_ylabel('Counts')
ax.set_xlabel('Favourite Score [-1: No response, 1 - 6 ( most favourite to least favourite)]')
ax.set_title('Rank Counts of Star Wars movies')
ax.set_xticks(x+ width*3)
ax.set_xticklabels(df['Rank_I_ThePhantomMenace'].value_counts().sort_index().index)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=3)
plt.savefig('OverAllRankCount.png', dpi=300, bbox_inches='tight')
plt.show()
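The six near-identical ax.bar calls could also be generated in a loop; a sketch that should produce the same bars, assuming rank_cols holds the six Rank columns in episode order as built earlier:
fig, ax = plt.subplots()
for i, col in enumerate(rank_cols, start=1):
    counts = df[col].value_counts().sort_index()
    ax.bar(x + width*i, counts, width, label=col[5:])  # strip the 'Rank_' prefix for the legend
# axis labels, ticks, legend and savefig as in the cell above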
The chart shows the following findings:
Since the most favourite movie is ranked 1 and the least favourite is ranked 6, we can go further and compare the total score of each movie to order them. However, we can only calculate the total scores correctly for entries with a valid ranking, i.e. where the respondent gave each Star Wars movie a rank from 1 to 6.
Subset all the columns which rank the Star Wars movies. Since the values in these columns only range from 1 to 6, a valid ranking entry has scores summing to 21 (1 + 2 + 3 + 4 + 5 + 6). Check how many rows in the subset do not sum to 21, excluding the rows with no response at all (their total score is -6, as all six ranks were imputed with -1).
rank = df.iloc[:, 9:15]
rank[rank.sum(axis=1)==21]
rank[(rank.sum(axis=1)!=21) & (rank.sum(axis=1)!=-6) ]
Only 2 entries have a partial response. Excluding these 2 entries should not have a big effect on the analysis, as long as they are only a small percentage of the entries with valid rankings; see the quick check below.
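A quick check of that percentage (a sketch comparing the 2 partial rows against the rows whose ranks sum to 21):
n_valid = (rank.sum(axis=1) == 21).sum()
n_partial = ((rank.sum(axis=1) != 21) & (rank.sum(axis=1) != -6)).sum()
print(n_partial, 'partial rows =', round(n_partial / n_valid * 100, 2), '% of', n_valid, 'valid rows')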
Let's find out how many entries have a valid ranking. To do this, filter rank down to the entries whose total score is 21.
rank = rank[rank.sum(axis=1)==21]
rank.shape
Check whether each movie receives 834 rankings, all consisting of values 1 to 6.
# Apply value_counts to every column in rank and collect the results into a new dataframe rankCount
rankCount = rank.apply(pd.Series.value_counts).reset_index()
rankCount
# Sum the rankCount of each movie, which should equal 834
np.sum(rankCount,axis=0)
Let's make the column headings of rankCount more readable: strip the Rank_ prefix from the rank columns and name the index column 'Rank'.
movie = rank.columns.str[5:].tolist()
movie
rankCountColName = ['Rank']
rankCountColName.extend(movie)
rankCountColName
rankCount.columns=rankCountColName
rankCount
np.sum(rankCount,axis=0)
Summing rankCount by row and by column, there are 834 counts for each ranking and also 834 counts for each Star Wars movie. Thus all the entries in rank consist of valid rankings, and we can use rank to sum up the scores.
rankCount.loc[:,rankCount.columns != 'Rank'].sum(axis=0)
SumbyRank = pd.DataFrame(rankCount.loc[:,rankCount.columns != 'Rank'].sum(axis=1))
SumbyRank.columns=['Sum']
pd.concat([rankCount['Rank'], SumbyRank ], axis=1)
Sum the scores of each movie in rank
score = rank.sum(axis=0).reset_index()
score.columns=['movie_name', 'total']
score['movie_name']= movie
score = score.sort_values('total')
score
From the missing-value handling section, we noticed that none of the movies has been seen by more than 800 respondents. Thus respondents giving a valid ranking might not have watched every movie. It would be interesting to know whether the order of the total scores for these 6 movies changes if we only consider the rankings from respondents who have actually seen all of them. Subset columns 4 to 15 from df to obtain all the seeing and ranking information.
rankAllSeen = df.iloc[:, 3:15]
Filter rankAllSeen to only the entries which have watched all 6 Star Wars movies.
rankAllSeen = rankAllSeen[rankAllSeen.iloc[:,0:6].sum(axis=1)==6]
rankAllSeen.shape
Further filter rankAllSeen to the entries with a valid ranking.
rankAllSeen=rankAllSeen[rankAllSeen.iloc[:,6:].sum(axis=1)==21]
rankAllSeen.shape
Drop the Seen columns, as we only need the ranking columns for the analysis.
rankAllSeen = rankAllSeen.iloc[:,6:]
Sum the scores of each movie in rankAllSeen
scoreAllSeen = rankAllSeen.sum(axis=0).reset_index()
scoreAllSeen.columns=['movie_name', 'total']
scoreAllSeen['movie_name']= movie
scoreAllSeen = scoreAllSeen.sort_values('total')
scoreAllSeen
Plot score and scoreAllSeen in a side-by-side bar chart.
x = np.arange(len(score['movie_name']))
fig, axs = plt.subplots(1, 2, figsize=(9, 5), sharey=True)
axs[0].bar(score['movie_name'], score['total'])
axs[1].bar(scoreAllSeen['movie_name'], scoreAllSeen['total'])
rect0 = axs[0].patches
rect1 = axs[1].patches
axs[0].set_xticks(x)
axs[0].set_xticklabels(score['movie_name'], rotation=40, ha='right')
axs[0].set_title('Valid Ranking')
axs[1].set_xticklabels(scoreAllSeen['movie_name'], rotation=40, ha='right')
axs[1].set_title('Valid Ranking and watched all 6 movies')
fig.suptitle('Total Scores', fontsize=16)
for rect, label in zip(rect0, score['total']):
height = rect.get_height()
axs[0].text(rect.get_x() + rect.get_width() / 2, height + 5, label,
ha='center', va='bottom')
for rect, label in zip(rect1, scoreAllSeen['total']):
height = rect.get_height()
axs[1].text(rect.get_x() + rect.get_width() / 2, height + 5, label,
ha='center', va='bottom')
plt.savefig('TotalScore.png', dpi=300, bbox_inches='tight')
plt.show()
Looking at the above plot, with the lowest score on the left in ascending order, we have the following findings:
We could also check whether there is any difference in how respondents vote for their favourite movie (rank 1) if they have actually watched all 6 of them. Find the rank 1 count of each movie from rankCount.
rank1 = rankCount[rankCount['Rank']==1]
rank1 = rank1.iloc[:,1:].T.reset_index()
rank1.columns=['movie_name', 'count']
rank1 = rank1.sort_values('count', ascending=False)
rank1
Construct a dataframe which shows the value_counts for rankAllSeen
rankAllSeenCount = rankAllSeen.apply(pd.Series.value_counts).reset_index()
rankAllSeenCountColName = ['Rank']
rankAllSeenCountColName.extend(movie)
rankAllSeenCountColName
rankAllSeenCount.columns=rankAllSeenCountColName
rankAllSeenCount
Find the rank 1 count of each movie from rankAllSeenCount.
rank1AllSeen = rankAllSeenCount[rankAllSeenCount['Rank']==1]
rank1AllSeen = rank1AllSeen.iloc[:,1:].T.reset_index()
rank1AllSeen.columns=['movie_name', 'count']
rank1AllSeen = rank1AllSeen.sort_values('count', ascending=False)
rank1AllSeen
Plot rank1 and rank1AllSeen in a side-by-side bar chart.
x = np.arange(len(rank1['movie_name']))
fig, axs = plt.subplots(1, 2, figsize=(9, 5), sharey=True)
axs[0].bar(rank1['movie_name'], rank1['count'])
axs[1].bar(rank1AllSeen['movie_name'], rank1AllSeen['count'])
axs[0].set_xticks(x)
axs[0].set_xticklabels(rank1['movie_name'], rotation=40, ha='right')
axs[0].set_title('Valid Ranking')
axs[1].set_xticklabels(rank1AllSeen['movie_name'], rotation=40, ha='right')
axs[1].set_title('Valid Ranking and watched all 6 movies')
fig.suptitle('Rank 1 count', fontsize=16)
plt.savefig('Rank1.png', dpi=300, bbox_inches='tight')
plt.show()
Explore the relationships between columns; at least 3 visualisations with plausible hypotheses.
Define a function printStackRelation which takes the following parameters and prints a horizontal stacked bar chart relating 2 different columns in df. The first column is drawn as horizontal bars, with the count of each unique value displayed at the right-hand side of its bar; each bar is divided into segments according to the percentages of the counts of unique values in the second column:
* relateDf : a dataframe containing the row counts of the unique value pairs of the 2 chosen columns in df, built with the groupby function
* col1Lvl, col2Lvl : the ordered lists of unique values of the first and second columns
* ylabel, title, plotfile : the y-axis label, chart title and output file name
def printStackRelation(relateDf, col1Lvl, col2Lvl, ylabel, title, plotfile):
data = np.zeros((len(col1Lvl), len(col2Lvl)))
percent = np.zeros((len(col1Lvl), len(col2Lvl)))
col1 = relateDf.columns[0]
col2 = relateDf.columns[1]
for i in np.arange(len(col1Lvl)):
for j in np.arange(len(col2Lvl)):
if (sum((relateDf[col1] == col1Lvl[i]) & (relateDf[col2] == col2Lvl[j])) == 0) :
num = 0
else:
num=relateDf.loc[((relateDf[col1]==col1Lvl[i]) & (relateDf[col2]==col2Lvl[j] )),['count']].values[0]
data[j, i] = num
for i in np.arange(len(col1Lvl)):
for j in np.arange(len(col2Lvl)):
percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
y_pos = np.arange(len(col1Lvl))
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
patch_handles = []
# left alignment of data starts at zero
left = np.zeros(len(col1Lvl))
for i, d in enumerate(percent):
patch_handles.append(ax.barh(y_pos, d, align='center', left=left, label=str(col2Lvl[i])))
left += d
# search all of the bar segments and annotate
for j in range(len(patch_handles)):
for i, patch in enumerate(patch_handles[j].get_children()):
bl = patch.get_xy()
x = 0.5*patch.get_width() + bl[0]
y = 0.5*patch.get_height() + bl[1]
ax.text(x,y, "%.1f%%" % (percent[j,i]), ha='center')
if (j == (len(patch_handles) - 1)):
ax.text(108, y, "%d" % (data.sum(axis=0)[i]), ha='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(col1Lvl)
plt.title(title, fontsize=14, loc='left', pad=10, x = 0.08)
plt.legend(bbox_to_anchor=(0.6,1.02,1,0.2), loc="lower left", ncol=3)
ax.set_xlabel('Percent (%)')
ax.set_ylabel(ylabel, fontsize=14)
plt.savefig(plotfile, dpi=300, bbox_inches='tight')
plt.show()
We chose to compare the relationship between fans of Star Wars and fans of Star Trek. Use the groupby function to construct the dataframe WarTrekDf with the cross-tabular counts of unique values between these 2 columns.
Hypothesis: people who like/dislike Star Wars usually have the same liking towards Star Trek, and vice versa.
WarTrekDf=df.groupby(['fanOfStarWars', 'fanOfStarTrekFranchise']).size().reset_index()
WarTrekDf.columns = ['fanOfStarWars', 'fanOfStarTrekFranchise', 'count']
WarTrekDf
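As a cross-check, the same cross-tabular counts can be produced in one line with pd.crosstab (a sketch):
pd.crosstab(df['fanOfStarWars'], df['fanOfStarTrekFranchise'])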
Construct the list of all unique values from the fanOfStarWars column as StarWarLvl, and from the fanOfStarTrekFranchise column as StarTrekLvl.
StarWarLvl=df['fanOfStarWars'].unique()
StarTrekLvl=df['fanOfStarTrekFranchise'].unique()
print(StarWarLvl)
print(StarTrekLvl)
We observed that the values are the same but the order differs; the segments of the stacked bar chart present better if we arrange both lists in the same order, as below.
StarWarLvlOrder=[1,3,2]
StarWarLvl=[x for i, x in sorted(zip(StarWarLvlOrder, StarWarLvl))]
StarTrekLvlOrder=[2,1,3]
StarTrekLvl=[x for i, x in sorted(zip(StarTrekLvlOrder, StarTrekLvl))]
print(StarWarLvl)
print(StarTrekLvl)
Pass the information we constructed in step 2 to the function printStackRelation to produce the horizontal percentage stacked bar chart.
printStackRelation(WarTrekDf, StarWarLvl, StarTrekLvl, 'Fans of Star Wars', 'Fan of Star Trek','1. WarTrek.png' )
Construct the dataframe TrekWarDf with the same cross-tabular information, but with fanOfStarTrekFranchise as the first column in the groupby function.
TrekWarDf=df.groupby(['fanOfStarTrekFranchise', 'fanOfStarWars']).size().reset_index()
TrekWarDf.columns = ['fanOfStarTrekFranchise', 'fanOfStarWars', 'count']
TrekWarDf
Pass the information we constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart. This time fans of Star Trek form the bars and fans of Star Wars form the segments.
printStackRelation(TrekWarDf, StarTrekLvl, StarWarLvl, 'Fan of Star Trek', 'Fans of Star Wars', '1. TrekWar.png')
The above graphs show that there are more respondents who are fans of Star Wars (552) compared to fans of Star Trek (427). If they are fans of Star Trek, there is a higher chance they like Star Wars (86.7%) compared to the reverse (only 67% of Star Wars fans are fans of Star Trek). Many more respondents do not like Star Trek (641) compared to Star Wars (284), and if they don't like Star Wars, there is a higher chance they do not like Star Trek (83.8%), while the reverse is not as pronounced (only 37.1%).
We chose to compare the relationship between fans of Star Wars and respondents who are familiar with the Expanded Universe. Use the groupby function to construct the dataframe WarKnowUniDf with the cross-tabular counts of unique values between these 2 columns.
Hypothesis: fans of Star Wars should know about the Expanded Universe.
WarKnowUniDf=df.groupby(['fanOfStarWars', 'knowExpUniverse']).size().reset_index()
WarKnowUniDf.columns = ['fanOfStarWars', 'knowExpUniverse', 'count']
WarKnowUniDf
Construct the list of all unique values from the knowExpUniverse column as knowUniLvl.
knowUniLvl=df['knowExpUniverse'].unique()
print(StarWarLvl)
print(knowUniLvl)
We observed that the unique values are the same but the order differs; the segments of the stacked bar chart present better if we arrange the lists in the same order, as below.
knowUniLvlOrder=[1,3,2]
knowUniLvl=[x for i, x in sorted(zip(knowUniLvlOrder, knowUniLvl))]
print(StarWarLvl)
print(knowUniLvl)
Pass the information we constructed in step 10 to the function printStackRelation to produce the horizontal percentage stacked bar chart.
printStackRelation(WarKnowUniDf, StarWarLvl, knowUniLvl, 'Fans of Star Wars', 'Know Expanded Universe', '2. WarKnowUni.png')
Construct the dataframe KnowUniWarDf with the same cross-tabular information, but with knowExpUniverse as the first column in the groupby function.
KnowUniWarDf=df.groupby(['knowExpUniverse', 'fanOfStarWars']).size().reset_index()
KnowUniWarDf.columns = ['knowExpUniverse', 'fanOfStarWars', 'count']
KnowUniWarDf
Pass in the information we constructed in previous step to the function printStackRelation
to produce the horizontal percentage stack bar chart. This time we would have knowExpUniverse as the bar and Fans of star wars would be the segments.
printStackRelation(KnowUniWarDf, knowUniLvl, StarWarLvl, 'Know Expanded Universe', 'Fans of Star Wars', '2. KnowUniWar.png')
From the above 2 charts, we observe that the percentage of Star Wars fans having knowledge of the Expanded Universe is surprisingly low (33.5%), but those who know about the Expanded Universe are very likely to be fans of Star Wars (86.9%). Slightly more than half of the respondents (58.5%) who don't know about the Expanded Universe are fans of Star Wars.
We chose to compare the relationship between fans of Star Wars and fans of the Expanded Universe. Use the groupby function to construct the dataframe WarFansOfUniDf with the cross-tabular counts of unique values between these 2 columns.
Hypothesis: fans of the Expanded Universe are most likely Star Wars fans, and presumably vice versa.
WarFansOfUniDf=df.groupby(['fanOfStarWars', 'fanOfExpUniverse']).size().reset_index()
WarFansOfUniDf.columns = ['fanOfStarWars', 'fanOfExpUniverse', 'count']
WarFansOfUniDf
Construct the list of all unique values from the fanOfExpUniverse column as ExpUniLvl.
ExpUniLvl=df['fanOfExpUniverse'].unique()
print(StarWarLvl)
print(ExpUniLvl)
We observed that the unique values are the same but the order differs; the segments of the stacked bar chart present better if we arrange the lists in the same order, as below.
ExpUniLvlOrder=[2,3,1]
ExpUniLvl=[x for i, x in sorted(zip(ExpUniLvlOrder, ExpUniLvl))]
print(StarWarLvl)
print(ExpUniLvl)
Pass the information we constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart.
printStackRelation(WarFansOfUniDf, StarWarLvl, ExpUniLvl, 'Fan of Star Wars', 'Fans of Expanded Universe', '3. WarFansUni.png')
Construct the dataframe FansOfUniWarDf with the same cross-tabular information, but with fanOfExpUniverse as the first column in the groupby function.
FansOfUniWarDf=df.groupby(['fanOfExpUniverse', 'fanOfStarWars']).size().reset_index()
FansOfUniWarDf.columns = ['fanOfExpUniverse', 'fanOfStarWars' , 'count']
FansOfUniWarDf
Pass the information we constructed in the previous step to the function printStackRelation to produce the horizontal percentage stacked bar chart. This time fans of the Expanded Universe form the bars and fans of Star Wars form the segments.
The above graphs show that the majority of respondents did not answer whether they are fans of the Expanded Universe (973). However, fans of the Expanded Universe are most likely fans of Star Wars (93.9%). Almost the same percentage of Star Wars fans are (16.6%) or are not (16.7%) fans of the Expanded Universe; this is a bit unexpected, despite the high percentage of no response.
Explore whether there are relationships between people's demographics (Gender, Age, Household Income, Education, Location) and their attitude to Star Wars characters.
Define a function splitCharacterByLiking which partitions all the respondents in the dataset according to their responses on the favourability of each character.
This function returns the partitioned data and its corresponding percentages in 2-dimensional arrays (row: favourability level, column: character), as well as a list of the character names.
def splitCharacterByLiking(level):
fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
character = []
data = np.zeros((len(level), len(fav_cols)))
levelDf = pd.DataFrame()
percent = np.zeros((len(level), len(fav_cols)))
i = 0
for col in fav_cols:
# print(f"Column Name: {col}" )
for j in np.arange(len(level)):
levelDf = df[col].value_counts()
levelDf.index.name='rank'
levelDf=levelDf.to_frame('count').reset_index()
VF_num=levelDf.loc[levelDf['rank']==level[j],['count']].values[0]
data[j, i] = VF_num
total = data.sum(axis=0)
i +=1
character.append(col[7:])
i=0
for col in fav_cols:
for j in np.arange(len(level)):
# print(data[j,i], data.sum(axis=0)[i], data[j,i]/data.sum(axis=0)[i] * 100)
percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
i += 1
return percent, data, character
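Note that the percentage double loop inside this function could be vectorised with NumPy broadcasting; a sketch of the equivalent one-liner:
# each column of data is divided by that character's total response count
percent = data / data.sum(axis=0) * 100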
Define a function printLiking which takes the following parameters and prints a horizontal stacked bar chart of the respondents' responses on the favourability of each character.
* a 2-D array of the favourability of each character (in percentage)
* a 2-D array of the favourability of each character (actual count)
* list of the characters' names
* list of favourability levels
def printLiking(percent, data, character, level, plotfile='FavCharacter.png'):
level_mapping = {'Unfamiliar (N/A)': 99, 'Very unfavorably': 0, 'Somewhat unfavorably': 1,
'Neither favorably nor unfavorably (neutral)': 2, 'Somewhat favorably': 3, 'Very favorably': 4}
key_list = list(level_mapping.keys())
val_list = list(level_mapping.values())
y_pos = np.arange(len(character))
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
patch_handles = []
# left alignment of data starts at zero
left = np.zeros(len(character))
for i, d in enumerate(percent):
likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
patch_handles.append(ax.barh(y_pos, d, align='center', left=left,label=likeStr))
left += d
# search all of the bar segments and annotate
for j in range(len(patch_handles)):
for i, patch in enumerate(patch_handles[j].get_children()):
bl = patch.get_xy()
x = 0.5*patch.get_width() + bl[0]
y = 0.5*patch.get_height() + bl[1]
ax.text(x,y, "%d%%" % (percent[j,i]), ha='center')
if (j == (len(patch_handles) - 1)):
ax.text(103, y, "%d" % (data.sum(axis=0)[i]), ha='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(character)
plt.title('Character with count by liking', fontsize=16)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=4)
ax.set_xlabel('Percent (%)')
plt.savefig(plotfile, dpi=300, bbox_inches='tight')
plt.show()
Construct the list of favourability levels by calling the unique function on the Favour_HanSolo column.
df['Favour_HanSolo'].unique()
favourLvl=df['Favour_HanSolo'].unique()
favourLvl=favourLvl[favourLvl!=-1].tolist()
favourLvl.sort()
Construct the 2-D percentage and data arrays on the favourability of each character, and the list of characters, by calling the function splitCharacterByLiking; then pass all the elements to the function printLiking to print the stacked bar chart.
favPercent, favData, character = splitCharacterByLiking(favourLvl)
printLiking(favPercent, favData, character, favourLvl)
Define a function splitCharacterByLikingWithDemo which partitions the respondents (with a certain demographic feature) according to their responses on the favourability of each character. This function takes the following arguments:
* the column name of the demographic feature
* the list of values in that demographic feature
* list of favourability levels
def splitCharacterByLikingWithDemo(demo, demoList, level):
fav_cols = df.columns[df.columns.str.find('Favour', 0, 6)!=-1].tolist()
character = []
data = np.zeros((len(level), len(fav_cols)))
levelDf = pd.DataFrame()
percent = np.zeros((len(level), len(fav_cols)))
i = 0
for col in fav_cols:
levelDf= df.groupby([demo, col]).size().reset_index()
levelDf.columns = [demo, 'rank', 'count']
levelDf=levelDf[levelDf[demo].isin(demoList)]
levelDf=levelDf.groupby('rank').sum().reset_index()
for j in np.arange(len(level)):
if(sum(levelDf['rank']==level[j])==0):
VF_num = 0
else:
VF_num=levelDf.loc[levelDf['rank']==level[j],['count']].values[0]
data[j, i] = VF_num
total = data.sum(axis=0)
i +=1
character.append(col[7:])
i=0
for col in fav_cols:
for j in np.arange(len(level)):
# print(data[j,i], data.sum(axis=0)[i], data[j,i]/data.sum(axis=0)[i] * 100)
percent[j, i]= data[j,i]/data.sum(axis=0)[i] * 100
i += 1
return percent, data, character
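For a single character and demographic subset, the counts can be cross-checked with a simple boolean mask (a sketch; 'FEMALE' and Favour_HanSolo are example choices):
mask = df['Gender'].isin(['FEMALE'])
df.loc[mask, 'Favour_HanSolo'].value_counts().sort_index()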
Define a function printLiking2Chart which takes the following parameters and prints 2 side-by-side horizontal stacked bar charts of the respondents' responses on the favourability of each character.
* a 2-D array of the favourability of each character for respondents with a certain demographic feature (in percentage)
* a 2-D array of the favourability of each character for respondents with other values of the same demographic feature (in percentage)
* a 2-D array of the favourability of each character for respondents with a certain demographic feature (actual count)
* a 2-D array of the favourability of each character for respondents with other values of the same demographic feature (actual count)
* str1 - subtitle of the left bar chart
* str2 - subtitle of the right bar chart
* list of the characters' names
* list of favourability levels
def printLiking2Chart(percent1, percent2, data1, data2, str1, str2, character, level, plotfile='FavCharacterByDemo.png'):
level_mapping = {'Unfamiliar (N/A)': 99, 'Very unfavorably': 0, 'Somewhat unfavorably': 1,
'Neither favorably nor unfavorably (neutral)': 2, 'Somewhat favorably': 3, 'Very favorably': 4}
key_list = list(level_mapping.keys())
val_list = list(level_mapping.values())
y_pos = np.arange(len(character))
fig, axs = plt.subplots(1, 2, figsize=(15, 10), sharey=True)
#colors ='rgwm'
patch_handles1 = []
# left alignment of data starts at zero
left = np.zeros(len(character))
for i, d in enumerate(percent1):
likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
patch_handles1.append(axs[0].barh(y_pos, d, align='center',
left=left,label=likeStr))
left += d
patch_handles2 = []
# left alignment of data starts at zero
left = np.zeros(len(character))
for i, d in enumerate(percent2):
likeStr = str(level[i]) + "-" +key_list[val_list.index(level[i])]
patch_handles2.append(axs[1].barh(y_pos, d, align='center',
left=left,label=likeStr))
left += d
# search all of the bar segments and annotate
for j in range(len(patch_handles1)):
for i, patch in enumerate(patch_handles1[j].get_children()):
bl = patch.get_xy()
x = 0.5*patch.get_width() + bl[0]
y = 0.5*patch.get_height() + bl[1]
axs[0].text(x,y, "%d%%" % (percent1[j,i]), ha='center')
if (j == (len(patch_handles1) - 1)):
axs[0].text(110, y, "%d" % (data1.sum(axis=0)[i]), ha='center')
for j in range(len(patch_handles2)):
for i, patch in enumerate(patch_handles2[j].get_children()):
bl = patch.get_xy()
x = 0.5*patch.get_width() + bl[0]
y = 0.5*patch.get_height() + bl[1]
axs[1].text(x,y, "%d%%" % (percent2[j,i]), ha='center')
if (j == (len(patch_handles2) - 1)):
axs[1].text(110, y, "%d" % (data2.sum(axis=0)[i]), ha='center')
axs[0].set_yticks(y_pos)
axs[0].set_yticklabels(character)
axs[0].set_title(str1)
axs[1].set_title(str2)
axs[0].set_xlabel('Percent (%)')
axs[1].set_xlabel('Percent (%)')
fig.suptitle('Character with count by liking', fontsize=16)
plt.legend(loc='upper center', bbox_to_anchor=(-0.1, -0.1), ncol=4)
plt.savefig(plotfile, dpi=300, bbox_inches='tight')
plt.show()
print(df['Gender'].unique())
print(df['Gender'].value_counts())
Construct the 2-D percentage and data arrays on the favourability of each character for Female and Male respondents by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar chart by calling the function printLiking2Chart.
F_favPercent, F_favData, character = splitCharacterByLikingWithDemo('Gender', ['FEMALE'], favourLvl)
M_favPercent, M_favData, character = splitCharacterByLikingWithDemo('Gender', ['MALE'], favourLvl)
printLiking2Chart(F_favPercent, M_favPercent, F_favData, M_favData, 'Female', 'Male', character, favourLvl, '8. FavCharacterBySex.png')
From the above chart, we can observe the following:
Similarly, we can plot for the other demographic features.
Find the list of age brackets by calling the unique function on the Age column.
print(df['Age'].unique())
print(df['Age'].value_counts())
Construct the 2-D percentage and data arrays on the favourability of each character for different age groups by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar charts by calling the function printLiking2Chart.
Y_favPercent, Y_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44', '45-60', '> 60'], favourLvl)
printLiking2Chart(Y_favPercent, R_favPercent, Y_favData, R_favData, 'Age 18 - 29', 'The Rest', character, favourLvl, '10. FavCharacterYoungandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '30-44', '45-60'], favourLvl)
O_favPercent, O_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
printLiking2Chart(O_favPercent, R_favPercent, O_favData, R_favData, 'Age > 60', 'The Rest', character, favourLvl, '10. FavCharacterOldandRest.png')
Y_favPercent, Y_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
O_favPercent, O_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
printLiking2Chart(Y_favPercent, O_favPercent, Y_favData, O_favData, 'Age 18-29', 'Age > 60', character, favourLvl, '10. FavCharacterYoundandOld.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '45-60', '> 60'], favourLvl)
Middle1_favPercent, Middle1_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44'], favourLvl)
printLiking2Chart(Middle1_favPercent, R_favPercent, Middle1_favData, R_favData, 'Age 30-44', 'The Rest', character, favourLvl, '10. FavCharacter30-44Rest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29', '30-44', '> 60'], favourLvl)
Middle2_favPercent, Middle2_favData, character = splitCharacterByLikingWithDemo('Age', ['45-60'], favourLvl)
printLiking2Chart(Middle2_favPercent, R_favPercent, Middle2_favData, R_favData, 'Age 45-60', 'The Rest', character, favourLvl, '10. FavCharacter45-60Rest.png')
Find the list of household income brackets by calling the unique function on the HHIncome column.
print(df['HHIncome'].unique())
print(df['HHIncome'].value_counts())
Construct the 2-D percentage and data arrays on the favourability of each character for different household income groups by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar charts by calling the function printLiking2Chart.
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$50,000 - $99,999', '$150,000+'], favourLvl)
P_favPercent, P_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999'], favourLvl)
printLiking2Chart(P_favPercent, R_favPercent, P_favData, R_favData, 'Lower Income', 'The Rest', character, favourLvl, '12. FavCharacterPoorandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999', '$50,000 - $99,999'], favourLvl)
Rich_favPercent, Rich_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$150,000+'], favourLvl)
printLiking2Chart(Rich_favPercent, R_favPercent, Rich_favData, R_favData, 'Higher Income', 'The Rest', character, favourLvl, '12. FavCharacterRichandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999', '$100,000 - $149,999', '$150,000+'], favourLvl)
MidIncome_favPercent, MidIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', [ '$50,000 - $99,999'], favourLvl)
printLiking2Chart(MidIncome_favPercent, R_favPercent, MidIncome_favData, R_favData, 'Middle Income', 'The Rest', character, favourLvl, '12. FavCharacterMidIncomeandRest.png')
Find the list of education levels by calling the unique function on the Education column.
print(df['Education'].unique())
print(df['Education'].value_counts())
Construct the 2-D percentage and data arrays on the favourability of each character for different education levels by calling the splitCharacterByLikingWithDemo function, and print the side-by-side stacked bar charts by calling the function printLiking2Chart.
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE', 'BACHELOR DEGREE', 'GRADUATE DEGREE'], favourLvl)
Less_Educated_favPercent, Less_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE'], favourLvl)
printLiking2Chart(Less_Educated_favPercent, R_favPercent, Less_Educated_favData, R_favData, 'Less Educated', 'The Rest', character, favourLvl, '14. FavCharacterLessEducatedandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','BACHELOR DEGREE', 'GRADUATE DEGREE'], favourLvl)
AssociateDegree_favPercent, AssociateDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE'], favourLvl)
printLiking2Chart(AssociateDegree_favPercent, R_favPercent, AssociateDegree_favData, R_favData, 'Some College or Associate Degree', 'The Rest', character, favourLvl,'14.FavCharacterAssoDegandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','SOME COLLEGE OR ASSOCIATE DEGREE', 'GRADUATE DEGREE'], favourLvl)
bachelorDegree_favPercent, bachelorDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['BACHELOR DEGREE'], favourLvl)
printLiking2Chart(bachelorDegree_favPercent, R_favPercent, bachelorDegree_favData, R_favData, 'Bachelor Degree', 'The Rest', character, favourLvl,'14.FavCharacterBachDegandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE','SOME COLLEGE OR ASSOCIATE DEGREE', 'BACHELOR DEGREE'], favourLvl)
High_Educated_favPercent, High_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['GRADUATE DEGREE'], favourLvl)
printLiking2Chart(High_Educated_favPercent, R_favPercent, High_Educated_favData, R_favData, 'Graduate Degree', 'The Rest', character, favourLvl,'14.FavCharacterGraudateDegandRest.png')
Find the list of regions by calling the unique function on the Region column.
print(df['Region'].unique())
print(df['Region'].value_counts())
We could group the regions into 3 groups (Pacific and Mountain; Central; Atlantic) for better comparison, according to their geographical location from West to East, as follows:
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL', 'MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
PM_Educated_favPercent, PM_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN'], favourLvl)
printLiking2Chart(PM_Educated_favPercent, R_favPercent, PM_Educated_favData, R_favData, 'Pacific and Mountain', 'The Rest', character, favourLvl, '15.FavCharacterPacificMountainandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN', 'MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
Central_Educated_favPercent, Central_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
printLiking2Chart(Central_Educated_favPercent, R_favPercent, Central_Educated_favData, R_favData, 'Central', 'The Rest', character, favourLvl, '15.FavCharacterCentralandRest.png')
R_favPercent, R_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN', 'WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
Atlantic_Educated_favPercent, Atlantic_Educated_favData, character = splitCharacterByLikingWithDemo('Region', ['MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
printLiking2Chart(Atlantic_Educated_favPercent, R_favPercent, Atlantic_Educated_favData, R_favData, 'Atlantic', 'The Rest', character, favourLvl, '15.FavCharacterAtlanticandRest.png')
It would be more convincing to show statistical figures rather than comparing the categories of demographic features visually. We could thus perform the chi-square test for independence.
F_favPercent, F_favData, character = splitCharacterByLikingWithDemo('Gender', ['FEMALE'], favourLvl)
M_favPercent, M_favData, character = splitCharacterByLikingWithDemo('Gender', ['MALE'], favourLvl)
a18to29_favPercent, a18to29_favData, character = splitCharacterByLikingWithDemo('Age', ['18-29'], favourLvl)
a30to44_favPercent, a30to44_favData, character = splitCharacterByLikingWithDemo('Age', ['30-44'], favourLvl)
a45to60_favPercent, a45to60_favData, character = splitCharacterByLikingWithDemo('Age', ['45-60'], favourLvl)
a60plus_favPercent, a60plus_favData, character = splitCharacterByLikingWithDemo('Age', ['> 60'], favourLvl)
lowerIncome_favPercent, lowerIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$0 - $24,999', '$25,000 - $49,999'], favourLvl)
midIncome_favPercent, midIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', [ '$50,000 - $99,999'], favourLvl)
highIncome_favPercent, highIncome_favData, character = splitCharacterByLikingWithDemo('HHIncome', ['$100,000 - $149,999', '$150,000+'], favourLvl)
Less_Educated_favPercent, Less_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['LESS THAN HIGH SCHOOL DEGREE', 'HIGH SCHOOL DEGREE'], favourLvl)
AssociateDegree_favPercent, AssociateDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['SOME COLLEGE OR ASSOCIATE DEGREE'], favourLvl)
bachelorDegree_favPercent, bachelorDegree_favData, character = splitCharacterByLikingWithDemo('Education', ['BACHELOR DEGREE'], favourLvl)
High_Educated_favPercent, High_Educated_favData, character = splitCharacterByLikingWithDemo('Education', ['GRADUATE DEGREE'], favourLvl)
PM_Region_favPercent, PM_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['PACIFIC', 'MOUNTAIN'], favourLvl)
Central_Region_favPercent, Central_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['WEST NORTH CENTRAL', 'WEST SOUTH CENTRAL', 'EAST NORTH CENTRAL', 'EAST SOUTH CENTRAL'], favourLvl)
Atlantic_Region_favPercent, Atlantic_Region_favData, character = splitCharacterByLikingWithDemo('Region', ['MIDDLE ATLANTIC', 'SOUTH ATLANTIC', 'NEW ENGLAND'], favourLvl)
Define a function for the chi-square test which returns the p-values between 2 passed-in arrays.
from scipy.stats import chi2_contingency
def chiTest(controlArray, testArray):
p_list=np.zeros(controlArray.shape[1])
for i in range(controlArray.shape[1]):
test=testArray.T[i,]
control=controlArray.T[i,]
table= np.vstack((control,test))
stat, p, dof, expected = chi2_contingency(table)
p_list[i]=p
return p_list
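For instance, chi2_contingency on a small contingency table (hypothetical counts, purely for illustration):
table = np.array([[10, 20, 30],
                  [15, 25, 20]])
stat, p, dof, expected = chi2_contingency(table)
print('p-value:', round(p, 4))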
Perform the chi-square test on each favourability split by demographic feature against the master favourability data.
female_pList=chiTest(favData, F_favData)
male_pList=chiTest(favData, M_favData)
a18to29_pList=chiTest(favData, a18to29_favData)
a30to44_pList=chiTest(favData, a30to44_favData)
a45to60_pList=chiTest(favData, a45to60_favData)
a60plus_pList=chiTest(favData, a60plus_favData)
lowerIncome_pList=chiTest(favData, lowerIncome_favData)
midIncome_pList=chiTest(favData, midIncome_favData)
highIncome_pList=chiTest(favData, highIncome_favData)
Less_Educated_pList=chiTest(favData, Less_Educated_favData)
AssociateDegree_pList=chiTest(favData, AssociateDegree_favData)
bachelorDegree_pList=chiTest(favData, bachelorDegree_favData)
High_Educated_pList=chiTest(favData, High_Educated_favData)
PM_Region_pList=chiTest(favData, PM_Region_favData)
Central_Region_pList=chiTest(favData, Central_Region_favData)
Atlantic_Region_pList=chiTest(favData, Atlantic_Region_favData)
Combine the p-value lists of each demographic feature into one dataframe per feature, then concatenate them all into a single dataframe. Print out the probability list of each demographic feature.
region_prob_list = pd.DataFrame({'PM_Region':PM_Region_pList, 'Central_Region': Central_Region_pList, 'Atlantic_Region': Atlantic_Region_pList }, index=character)
education_prob_list = pd.DataFrame({'Less_Educated':Less_Educated_pList, 'AsocDeg_Educated': AssociateDegree_pList, 'BachDeg_Educated': bachelorDegree_pList, 'High_Educated': High_Educated_pList }, index=character)
income_prob_list = pd.DataFrame({'Low_Income':lowerIncome_pList, 'Middle_Income': midIncome_pList, 'High_Income': highIncome_pList }, index=character)
age_prob_list= pd.DataFrame({'Age_18_to_29':a18to29_pList, 'Age_30_to_44': a30to44_pList, 'Age_45_to_60': a45to60_pList, 'Age_60_plus': a60plus_pList}, index=character)
gender_prob_list = pd.DataFrame({'Male':male_pList, 'Female': female_pList}, index=character)
all_prob_list = pd.concat([gender_prob_list, age_prob_list, income_prob_list, education_prob_list, region_prob_list ], axis=1, sort=False)
gender_prob_list.round(3)
age_prob_list.round(3)
income_prob_list.round(3)
region_prob_list.round(3)
education_prob_list.round(3)
all_prob_list.round(3)
Define a function that returns, as a list, all the cells with p-value < alpha.
def getIndexes(dfObj, alpha):
listOfPos = list()
# Get bool dataframe with True at positions where the given value exists
result = dfObj < alpha
# Get list of columns that contains the value
seriesObj = result.any()
columnNames = list(seriesObj[seriesObj == True].index)
# Iterate over list of columns and fetch the rows indexes where value exists
for col in columnNames:
rows = list(result[col][result[col] == True].index)
for row in rows:
listOfPos.append((row, col, dfObj.loc[row][col]))
# Return a list of tuples indicating the positions of value in the dataframe
return listOfPos
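An equivalent, more idiomatic version using stack (a sketch; getIndexesStacked is a hypothetical name):
def getIndexesStacked(dfObj, alpha):
    # stack() yields a Series indexed by (row, column) pairs; keep cells below alpha
    s = dfObj.stack()
    s = s[s < alpha]
    return [(row, col, val) for (row, col), val in s.items()]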
Make a dataframe that shows all the demographic feature vs character pairs with probability < alpha.
all_association_list = getIndexes(all_prob_list, 0.05)
# build the association table under a new name rather than overwriting the survey dataframe df
assoc_df = pd.DataFrame(all_association_list, columns=['character', 'demographic feature', 'probability'])
assoc_df
assoc_df = assoc_df.pivot(index='demographic feature', columns='character', values='probability').reset_index()
assoc_df = assoc_df.rename_axis(None, axis=1)
assoc_df = assoc_df.round(4)
assoc_df = assoc_df.where(assoc_df.notnull(), '-')
assoc_df
female_male_pList=chiTest(F_favData, M_favData)
female_male_df=pd.DataFrame({'Male vs Female': female_male_pList.round(4)}, index=character)
female_male_df
female_male_df.loc[(female_male_df['Male vs Female']<0.05),:]
female_male_df.loc[(female_male_df['Male vs Female']>0.05),:]
a18to29vs30to44_pList=chiTest(a18to29_favData[1:,:], a30to44_favData[1:,:])
a18to29vs45to60_pList=chiTest(a18to29_favData[1:,:], a45to60_favData[1:,:])
a18to29vs60plus_pList=chiTest(a18to29_favData[1:,:], a60plus_favData[1:,:])
a30to44vs45to60_pList=chiTest(a45to60_favData[1:,:], a30to44_favData[1:,:])
a30to44vs60plus_pList=chiTest(a60plus_favData[1:,:], a30to44_favData[1:,:])
a45to60vs60plus_pList=chiTest(a60plus_favData[1:,:], a45to60_favData[1:,:])
all_age_prob_list= pd.DataFrame({'18to29 vs 30to44':a18to29vs30to44_pList.round(4), '18to29 vs 45to60': a18to29vs45to60_pList.round(4), '18to29 vs 60plus': a18to29vs60plus_pList.round(4), '30to44 vs 45to60': a30to44vs45to60_pList.round(4), '30to44 vs 60plus':a30to44vs60plus_pList.round(4), '45to60 vs 60plus':a45to60vs60plus_pList.round(4)}, index=character)
all_age_prob_list
all_age_prob_list.where(all_age_prob_list < 0.05, '-')