Please refer to the Jupyter NBExtension README page to display the table of contents in a floating window and to expand the body of the contents.
The dataset was created from participants in experimental speed dating events held from 2002 to 2004. During these events, each participant had a four-minute "first date" with every other participant of the opposite sex, after which they rated their partner. They were also asked whether they would like to see their date again; if so, the first date was classified as a positive match.
The dataset is obtained from https://github.com/vaksakalli/datasets. Each observation (row) represents a single dating event. All columns are descriptive features except the last column, `match`, which is our target feature. The descriptive features are used to classify the value of the target feature; they are listed below:
| Feature | Description |
| --- | --- |
| `has_null` | Whether the row has any null values |
| `wave` | Experiment number |
| `gender` | Gender of self |
| `age` | Age of self |
| `age_o` | Age of partner |
| `d_age` | Difference in age |
| `d_d_age` | Binned values of difference in age, 4 groups: [0-1], [2-3], [4-6] and [7-37] |
| `race` | Race of self |
| `race_o` | Race of partner |
| `samerace` | Whether the two persons have the same race or not |
| `importance_same_race` | How important is it that partner is of same race? |
| `importance_same_religion` | How important is it that partner has same religion? |
| `d_importance_same_race` | Binned values of importance of same race |
| `d_importance_same_religion` | Binned values of importance of same religion |
| `field` | Field of study |
| `pref_o_attractive` | How important does partner rate attractiveness |
| `pref_o_sinsere` | How important does partner rate sincerity |
| `pref_o_intelligence` | How important does partner rate intelligence |
| `pref_o_funny` | How important does partner rate being funny |
| `pref_o_ambitious` | How important does partner rate ambition |
| `pref_o_shared_interests` | How important does partner rate having shared interests |
| `d_pref_o_attractive` | Binned values for how important partner rates attractiveness, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_sinsere` | Binned values for how important partner rates sincerity, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_intelligence` | Binned values for how important partner rates intelligence, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_funny` | Binned values for how important partner rates being funny, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_ambitious` | Binned values for how important partner rates ambition, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_shared_interests` | Binned values for how important partner rates having shared interests, 3 groups: [0-15], [16-20] and [21-100] |
| `attractive_o` | Rating by partner (about me) at night of event on attractiveness |
| `sincere_o` | Rating by partner (about me) at night of event on sincerity |
| `intelligence_o` | Rating by partner (about me) at night of event on intelligence |
| `funny_o` | Rating by partner (about me) at night of event on being funny |
| `ambitous_o` | Rating by partner (about me) at night of event on being ambitious |
| `shared_interests_o` | Rating by partner (about me) at night of event on shared interests |
| `d_attractive_o` | Binned values for rating by partner on attractiveness, 3 groups: [0-5], [6-8] and [9-10] |
| `d_sinsere_o` | Binned values for rating by partner on sincerity, 3 groups: [0-5], [6-8] and [9-10] |
| `d_intelligence_o` | Binned values for rating by partner on intelligence, 3 groups: [0-5], [6-8] and [9-10] |
| `d_funny_o` | Binned values for rating by partner on being funny, 3 groups: [0-5], [6-8] and [9-10] |
| `d_ambitous_o` | Binned values for rating by partner on being ambitious, 3 groups: [0-5], [6-8] and [9-10] |
| `d_shared_interests_o` | Binned values for rating by partner on shared interests, 3 groups: [0-5], [6-8] and [9-10] |
| `attractive_important` | What do you look for in a partner - attractiveness |
| `sincere_important` | What do you look for in a partner - sincerity |
| `intellicence_important` | What do you look for in a partner - intelligence |
| `funny_important` | What do you look for in a partner - being funny |
| `ambtition_important` | What do you look for in a partner - ambition |
| `shared_interests_important` | What do you look for in a partner - shared interests |
| `d_attractive_important` | Binned values for what you look for in a partner - attractiveness, 3 groups: [0-15], [16-20] and [21-100] |
| `d_sincere_important` | Binned values for what you look for in a partner - sincerity, 3 groups: [0-15], [16-20] and [21-100] |
| `d_intellicence_important` | Binned values for what you look for in a partner - intelligence, 3 groups: [0-15], [16-20] and [21-100] |
| `d_funny_important` | Binned values for what you look for in a partner - being funny, 3 groups: [0-15], [16-20] and [21-100] |
| `d_ambtition_important` | Binned values for what you look for in a partner - ambition, 3 groups: [0-15], [16-20] and [21-100] |
| `d_shared_interests_important` | Binned values for what you look for in a partner - shared interests, 3 groups: [0-15], [16-20] and [21-100] |
| `attractive` | Rate yourself - attractiveness |
| `sincere` | Rate yourself - sincerity |
| `intelligence` | Rate yourself - intelligence |
| `funny` | Rate yourself - being funny |
| `ambition` | Rate yourself - ambition |
| `d_attractive` | Binned values for rate yourself - attractiveness, 3 groups: [0-5], [6-8] and [9-10] |
| `d_sincere` | Binned values for rate yourself - sincerity, 3 groups: [0-5], [6-8] and [9-10] |
| `d_intelligence` | Binned values for rate yourself - intelligence, 3 groups: [0-5], [6-8] and [9-10] |
| `d_funny` | Binned values for rate yourself - being funny, 3 groups: [0-5], [6-8] and [9-10] |
| `d_ambition` | Binned values for rate yourself - ambition, 3 groups: [0-5], [6-8] and [9-10] |
| `attractive_partner` | Rate your partner - attractiveness |
| `sincere_partner` | Rate your partner - sincerity |
| `intelligence_partner` | Rate your partner - intelligence |
| `funny_partner` | Rate your partner - being funny |
| `ambition_partner` | Rate your partner - ambition |
| `shared_interests_partner` | Rate your partner - shared interests |
| `d_attractive_partner` | Binned values for rate your partner - attractiveness, 3 groups: [0-5], [6-8] and [9-10] |
| `d_sincere_partner` | Binned values for rate your partner - sincerity, 3 groups: [0-5], [6-8] and [9-10] |
| `d_intelligence_partner` | Binned values for rate your partner - intelligence, 3 groups: [0-5], [6-8] and [9-10] |
| `d_funny_partner` | Binned values for rate your partner - being funny, 3 groups: [0-5], [6-8] and [9-10] |
| `d_ambition_partner` | Binned values for rate your partner - ambition, 3 groups: [0-5], [6-8] and [9-10] |
| `d_shared_interests_partner` | Binned values for rate your partner - shared interests, 3 groups: [0-5], [6-8] and [9-10] |
| `sports` | Your own interests - sports |
| `tvsports` | Your own interests - TV sports |
| `exercise` | Your own interests - exercise |
| `dining` | Your own interests - dining |
| `museums` | Your own interests - museums |
| `art` | Your own interests - art |
| `hiking` | Your own interests - hiking |
| `gaming` | Your own interests - gaming |
| `clubbing` | Your own interests - clubbing |
| `reading` | Your own interests - reading |
| `tv` | Your own interests - TV |
| `theater` | Your own interests - theater |
| `movies` | Your own interests - movies |
| `concerts` | Your own interests - concerts |
| `music` | Your own interests - music |
| `shopping` | Your own interests - shopping |
| `yoga` | Your own interests - yoga |
| `d_sports` | Binned values for your own interests - sports, 3 groups: [0-5], [6-8] and [9-10] |
| `d_tvsports` | Binned values for your own interests - TV sports, 3 groups: [0-5], [6-8] and [9-10] |
| `d_exercise` | Binned values for your own interests - exercise, 3 groups: [0-5], [6-8] and [9-10] |
| `d_dining` | Binned values for your own interests - dining, 3 groups: [0-5], [6-8] and [9-10] |
| `d_museums` | Binned values for your own interests - museums, 3 groups: [0-5], [6-8] and [9-10] |
| `d_art` | Binned values for your own interests - art, 3 groups: [0-5], [6-8] and [9-10] |
| `d_hiking` | Binned values for your own interests - hiking, 3 groups: [0-5], [6-8] and [9-10] |
| `d_gaming` | Binned values for your own interests - gaming, 3 groups: [0-5], [6-8] and [9-10] |
| `d_clubbing` | Binned values for your own interests - clubbing, 3 groups: [0-5], [6-8] and [9-10] |
| `d_reading` | Binned values for your own interests - reading, 3 groups: [0-5], [6-8] and [9-10] |
| `d_tv` | Binned values for your own interests - TV, 3 groups: [0-5], [6-8] and [9-10] |
| `d_theater` | Binned values for your own interests - theater, 3 groups: [0-5], [6-8] and [9-10] |
| `d_movies` | Binned values for your own interests - movies, 3 groups: [0-5], [6-8] and [9-10] |
| `d_concerts` | Binned values for your own interests - concerts, 3 groups: [0-5], [6-8] and [9-10] |
| `d_music` | Binned values for your own interests - music, 3 groups: [0-5], [6-8] and [9-10] |
| `d_shopping` | Binned values for your own interests - shopping, 3 groups: [0-5], [6-8] and [9-10] |
| `d_yoga` | Binned values for your own interests - yoga, 3 groups: [0-5], [6-8] and [9-10] |
| `interests_correlate` | Correlation between participant's and partner's ratings of interests |
| `d_interests_correlate` | Binned values of the correlation between participant's and partner's ratings of interests, 3 groups: [-1-0], [0-0.33] and [0.33-1] |
| `expected_happy_with_sd_people` | How happy do you expect to be with the people you meet during the speed-dating event? |
| `expected_num_interested_in_me` | Out of the 20 people you will meet, how many do you expect will be interested in dating you? |
| `expected_num_matches` | How many matches do you expect to get? |
| `d_expected_happy_with_sd_people` | Binned values for how happy you expect to be with the people you meet, 3 groups: [0-4], [5-6] and [7-10] |
| `d_expected_num_interested_in_me` | Binned values for how many of the 20 people you expect will be interested in dating you, 3 groups: [0-3], [4-9] and [10-20] |
| `d_expected_num_matches` | Binned values for how many matches you expect to get, 3 groups: [0-2], [3-5] and [5-18] |
| `like` | Did you like your partner? |
| `guess_prob_liked` | How likely do you think your partner likes you? |
| `d_like` | Binned values for did you like your partner, 3 groups: [0-5], [6-8] and [9-10] |
| `d_guess_prob_liked` | Binned values for how likely you think your partner likes you, 3 groups: [0-4], [5-6] and [7-10] |
| `met` | Have you met your partner before? |
| `match` | Target feature (binary): '1' is the positive class and denotes a match, '0' denotes no match |
The objective of this analysis is to predict whether a person would match (go on a second date) with another person, based on the data collected during the short interaction (the 4-minute first date) they had with each other.
To do this, we fit the processed dataset and compare the prediction results from different classifiers to determine which would be the best model after parameter tuning. Five classifiers are used in the analysis:
- K-Nearest Neighbours (KNN)
- Decision Tree
- Random Forest
- Naive Bayes
- Logistic Regression
1. We import all the necessary packages.
2. We set the display options and random seeds.
3. We import the CSV file directly from the GitHub URL into the dataframe `speed_df`.
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
sns.set(style="darkgrid")
import numpy as np
import pandas as pd
import io
import requests
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from sklearn import metrics
from scipy import stats
np.random.seed(999)
random_state=999
# so that we can see all the columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
getattr(ssl, '_create_unverified_context', None)):
ssl._create_default_https_context = ssl._create_unverified_context
speed_url = 'https://raw.githubusercontent.com/vaksakalli/datasets/master/speed_dating.csv'
url_content = requests.get(speed_url).content
speed_df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
4. Once the dataframe is created, we check whether the numbers of rows and columns of the imported data match those of the CSV file.
speed_df.shape
5. We then randomly select 10 rows to get an overview of the dataset's content.
display(HTML('<b>Table 1: Sample of Speed Dating Dataset </b>'))
speed_df.sample(n=10, random_state=8)
6. Explore the missing values in the dataset
speed_df.isnull().sum()
We found that the column `expected_num_interested_in_me` has the highest count of missing values (6578 out of 8378 records).
7. Randomly select 10 observations which have missing values in `expected_num_interested_in_me` and compare them to the values in the corresponding binned column `d_expected_num_interested_in_me`.
check_na_df=speed_df[np.isnan(speed_df.expected_num_interested_in_me)].sample(n=10, random_state=8)
check_na_df.loc[:, ["expected_num_interested_in_me", "d_expected_num_interested_in_me"]]
We found that all the observations with missing values in `expected_num_interested_in_me` are put in the lowest bin of `d_expected_num_interested_in_me`. Similar findings hold for the columns `shared_interests_partner` and `d_shared_interests_partner`, as shown below.
check_na_df=speed_df[np.isnan(speed_df.shared_interests_partner)].sample(n=10, random_state=8)
check_na_df.loc[:, ["shared_interests_partner", "d_shared_interests_partner"]]
From a further web search, we found that this dataset is also available at https://www.openml.org/d/40536. In the OpenML repository, none of the binned columns exist.
8. Based on the principle of retaining the original (unbinned) values and of predicting the target from complete descriptive features, we drop all observations with null values, as well as the following columns:
- all the columns with the `d_` prefix except `d_age`, since `d_age` is not a binned column; it contains the difference in age between the participant and his/her partner.
- the column `has_null`, as it only indicates whether the row contains any NA value and does not help determine our target feature.
- the column `wave`, as it only contains the experimental batch number of the dating event and likewise does not help determine our target feature.
# flag the binned columns: names starting with 'd_', excluding 'd_age'
bool_binned_cols = (speed_df.columns.str.find('d_', 0, 2)!=-1) & (speed_df.columns.str.find('d_age', 0, 5)==-1)
binned_cols = speed_df.columns[bool_binned_cols].tolist()
speed_df= speed_df.dropna()
speed_df=speed_df.drop(columns =['has_null','wave'])
speed_df=speed_df.drop(columns = binned_cols)
print(f"Shape of the dataset is {speed_df.shape} \n")
na_sum=speed_df.isna().sum()!=0
print(f"Final Check for Null Values: {sum(na_sum)} \n")
print(f"Now that all null values have been removed, lets check the data types of these attributes. \n")
display(HTML('<b>Table 2: Data types of the attributes </b>'))
print(speed_df.dtypes)
9. Check the *descriptive statistics* of all the numerical features using the *describe* function.
display(HTML('<b>Table 3: Summary of continuous features </b>'))
speed_df.describe(include = np.number).round(2)
10. Get the *summary statistics* of the categorical variables
display(HTML('<b>Table 4: Summary Statistics of categorical features </b>'))
# select the string-typed (object) columns
categorical_cols = speed_df.columns[speed_df.dtypes == object].tolist()
for categorical_col in categorical_cols:
print(categorical_col + ':')
print(speed_df[categorical_col].value_counts().round(2))
print('\n')
11. Group the values of `field` to:
- avoid differences in the capitalization of letters
- make the values less scattered, so that disciplines of a similar nature are combined into the same value.
(A more compact, dictionary-based alternative is sketched after the replacement calls below.)
speed_df['field']= speed_df['field'].str.strip()
speed_df['field']=speed_df['field'].replace(['law', 'LAW'], 'Law')
speed_df['field']=speed_df['field'].replace(["'Social Work'", "'social work'", "Sociology", "'Masters of Social Work'"], 'Social Work')
speed_df['field']=speed_df['field'].replace(["philosophy","psychology", "'psychology and english'", "'Organizational Psychology'", "Psychology"], 'Psychology/Philosophy')
speed_df['field']=speed_df['field'].replace(["chemistry"], 'Chemistry')
speed_df['field']=speed_df['field'].replace(["'Electrical Engineering'", "'Mechanical Engineering'", "'Biomedical Engineering'", "'Computer Science'", "Engineering"], 'Engineering/Computer Science')
speed_df['field']=speed_df['field'].replace(["Business","Finance","Economics", "Marketing", "Finance&Economics", "Finanace","'Economics; Sociology'", "'Business & International Affairs'"], "Business/Finance")
speed_df['field']=speed_df['field'].replace(["MBA", "Business- MBA", "'Business- MBA'", "'Business [MBA]'"], "Business [MBA]")
speed_df['field']=speed_df['field'].replace(["'political science'", "'Economics and Political Science'"], 'Political Science')
speed_df['field']=speed_df['field'].replace(["Journalism", "Communications", "'Masters in Public Administration'"], "Journalism/Communications")
speed_df['field']=speed_df['field'].replace(["'Elementary/Childhood Education [MA]'","'Educational Psychology'","'International Educational Development'", "'TC [Health Ed]'"], "Education")
speed_df['field']=speed_df['field'].replace(["Medicine","microbiology","'Biomedical Engineering'"], "Medical Science")
speed_df['field']=speed_df['field'].replace(["'Art History/medicine'", "Mathematics", "'Mathematical Finance'", "'financial math'","'Applied Maths/Econs'", "Statistics"], "Mathematical-related")
speed_df['field']=speed_df['field'].replace(["'Operations Research'","'Operations Research [SEAS]'"], 'Operational Research')
speed_df['field']=speed_df['field'].replace(["English", "Polish", "'German Literature'", "'Speech Language Pathology'"], "Language-related")
speed_df['field']=speed_df['field'].replace(["English", "Polish", "'German Literature'", "'Speech Language Pathology'"], "Language-related")
12. Assign all the descriptive features (every column except `match`) to `Data`, and the target column `match` to `target`.
speed_df.to_csv('speed_clean.csv', index=False)
Data = speed_df.drop(columns = 'match')
target = speed_df['match']
13. Perform one-hot encoding on the categorical descriptive features; if a categorical feature has only 2 levels, encode it with a single binary variable.
# get the list of categorical descriptive features
categorical_cols = Data.columns[Data.dtypes==object].tolist()
# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
for col in categorical_cols:
n = len(Data[col].unique())
if (n == 2):
Data[col] = pd.get_dummies(Data[col], drop_first=True)
# for other categorical features (with > 2 levels),
# use regular one-hot-encoding
# if a feature is numeric, it will be untouched
Data = pd.get_dummies(Data)
14. Check the data types and shape of the descriptive features after encoding.
Data.dtypes
Data.shape
15. Retain the column names of the descriptive and target features by copying them into the new dataframes `Data_df` and `target_df`.
Data_df = Data.copy()
target_df =target.copy()
print("Target Type:", type(target))
print("Counts Using NumPy:")
print(np.unique(target, return_counts = True))
print("Counts Using Pandas:")
print(pd.Series(target).value_counts())
16. Perform normalization on all the descriptive features using `MinMaxScaler()`.
from sklearn import preprocessing
Data = preprocessing.MinMaxScaler().fit_transform(Data)
target=target.values
17. Ensure both `Data` and `target` are NumPy arrays, so that they are ready to pass into scikit-learn for modelling.
print(type(Data))
print(type(target))
1. Explore the `gender` distribution in the dataset.
plt.figure()
speed_df['gender'].value_counts().plot(kind='bar')
plt.xticks(rotation='horizontal')
plt.ylabel('Count')
plt.title('Gender Distribution', fontsize = 16)
plt.show()
display(HTML('<b>Figure 1: Gender Distribution </b>'))
Figure 1 indicates that the two genders are equally distributed, so there is no gender bias in the sample.
2. Explore the distribution of `race` in the dataset.
plt.figure()
speed_df['race'].value_counts().plot(kind='bar')
plt.ylabel('Number of People')
plt.title('Racial Distribution',fontsize = 16)
plt.xticks(rotation=80)
plt.show()
display(HTML('<b>Figure 2: Racial Distribution </b>'))
Figure 2 shows us that there is a clear bias towards European Americans as they represent 60% of the dataset. Moreover we can see that there are very few Black/African Americans and no Native Americans.
3. Explore the `field` distribution in the dataset.
plt.figure(figsize=(10, 10))
field_df=pd.DataFrame(speed_df['field'].value_counts()).reset_index()
field_df.columns=['field_name', 'count']
sns.barplot(x="field_name", y="count", data=field_df)
plt.title('Field Distribution',fontsize = 16)
plt.xticks(rotation=80)
display(HTML('<b>Figure 3: Field Distribution </b>'))
Figure 3 shows the number of participants in each professional field.
1. Explore the count of each descriptive feature grouped by the values of the target feature.
speed_df_no_target = speed_df.drop(columns = 'match')
for i in speed_df_no_target.columns:
title_str='Count of ' + i.upper() + ' group by target feature "MATCH"'
plt.figure(figsize=(10,6))
plt.xticks(rotation=90)
g=sns.countplot(speed_df[i],hue=speed_df['match'])
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title(title_str, fontsize = 16)
2. Explore the proportion of each descriptive feature grouped by the values of the target feature.
for i in speed_df_no_target.columns:
title_str='Proportion of ' + i.upper() + ' group by target feature "MATCH"'
col_list = speed_df[i].dropna().unique().tolist()
match_list = speed_df['match'].dropna().unique().tolist()
result=pd.DataFrame(columns=[i, 'match', 'percentage'])
for col_element in col_list:
for match_value in match_list:
element_num = len(speed_df[(speed_df[i]==col_element) & (speed_df['match']==match_value)])
element_percent = element_num / len(speed_df[speed_df['match'] == match_value])
result.loc[len(result)] = [col_element, match_value, element_percent]
# print(col_element, match_value, element_num, element_percent)
# print(result)
plt.figure()
sns.barplot(x=i, y="percentage", hue="match", data=result)
plt.xticks(rotation=80)
plt.title(title_str, fontsize = 16)
plt.ylabel('Proportion')
plt.show()
After going through all the two-variable plots, we found that besides `interests_correlate`, all the numerical features in the dataset are discrete.
1. Explore the relationship between Age, Gender and the target `Match`
sns.boxplot(x="gender", y="age", hue='match', data=speed_df)
plt.title('Age vs Gender group by "MATCH"', fontsize = 16)
2. Explore the relationship between Age, Race and the target `Match`
plt.xticks(rotation=80)
sns.boxplot(x="race", y="age", hue='match', data=speed_df)
plt.title('Age vs Race group by "MATCH"', fontsize = 16)
3. Explore the relationship between Gender, the attractiveness of the participant as rated by their partner, and the target `Match`.
plt.xticks(rotation='horizontal')
sns.boxplot(x="gender", y="attractive_o", hue='match', data=speed_df)
plt.title('Gender vs Attractiveness Rated by Partner group by "MATCH"', fontsize = 16)
4. Explore the relationship between Age of participant, age of partner and the target `Match`
sns.lmplot(x='age', y='age_o', markers=['o', 'x'], hue='match',
data=speed_df.loc[speed_df['match'].isin([0,1])],
fit_reg=False
)
plt.title('Age of participant vs Age of partner group by "MATCH"', fontsize = 16)
5. Explore the relationship between the participant's attractiveness as rated by themselves vs as rated by their partner, and the target `Match`.
sns.lmplot(x='attractive', y='attractive_o', markers=['o', 'x'], hue='match',
data=speed_df.loc[speed_df['match'].isin([0,1])],
fit_reg=False
)
plt.title('Attractiveness rated by participant own self vs rated by partner group by "MATCH"', fontsize = 16)
6. Explore the relationship between how much you like your partner, how likely you think your partner likes you, and the target `Match`.
sns.lmplot(x='like', y='guess_prob_liked', markers=['o', 'x'], hue='match',
data=speed_df.loc[speed_df['match'].isin([0,1])],
fit_reg=False
)
plt.title('Like your partner vs How likely do you think your partner like you group by "MATCH"', fontsize = 16)
Grid search and repeated cross-validation are used to find the optimal hyperparameters of the following classifiers: KNN, Decision Tree, Random Forest, Naive Bayes and Logistic Regression.
Let's observe the proportion of class label and number of observations in the dataset again.
target_df.value_counts(normalize=True)
Data.shape
Clearly the class label is imbalanced in the final dataset, and the data size is not huge. Thus we chose `RepeatedStratifiedKFold`, which preserves the class proportions in each fold and averages out the variance of the validation results.
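As a quick sanity check (a minimal sketch, assuming the `Data` and `target` arrays prepared above), we can confirm that `RepeatedStratifiedKFold` keeps the proportion of positive labels roughly constant in every validation fold:
# Verify stratification: the positive rate should be similar in each fold (sketch).
from sklearn.model_selection import RepeatedStratifiedKFold
rskf_check = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=999)
for i, (train_idx, val_idx) in enumerate(rskf_check.split(Data, target)):
    print(f"fold {i:2d}: positive rate = {target[val_idx].mean():.3f}")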
We then define:
- a pipeline to host the feature-selection step and the classifier
- a grid search to host the feature-selection score function and all the parameters of the classifier
so that we can exhaustively search for the best parameters over all possible combinations while training on the dataset.
For performance analysis, we choose AUC as the scoring metric, since it provides a fairer judgement when the class labels are imbalanced.
We execute the following 3 steps before performing feature selection and hyperparameter tuning on each classifier. The products of these steps are reused in each classifier's tuning process.
1. Prepare the dataset using the `holdout` approach: split the sampled data into 70% training and 30% testing (the unseen data used later for the performance analysis of each model).
from sklearn import feature_selection as fs
from sklearn.model_selection import train_test_split
# The "\" character below allows us to split the line across multiple lines
D_train, D_test, t_train, t_test = \
train_test_split(Data, target, test_size = 0.3,
stratify=target, shuffle=True, random_state=888)
print (D_train.shape)
print (D_test.shape)
print (t_train.shape)
print (t_test.shape)
2. Define the cross-validation method `RepeatedStratifiedKFold` (with k=5, n=2) and scoring metric `AUC`.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, GridSearchCV
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=999)
scoring_metric = 'roc_auc'
3. Define a custom function to format the search results and return them as a pandas dataframe.
def get_search_results(gs):
def model_result(scores, params):
scores = {'mean_score': np.mean(scores),
'std_score': np.std(scores),
'min_score': np.min(scores),
'max_score': np.max(scores)}
return pd.Series({**params,**scores})
models = []
scores = []
for i in range(gs.n_splits_):
key = f"split{i}_test_score"
r = gs.cv_results_[key]
scores.append(r.reshape(-1,1))
all_scores = np.hstack(scores)
for p, s in zip(gs.cv_results_['params'], all_scores):
models.append((model_result(s, p)))
pipe_results = pd.concat(models, axis=1).T.sort_values(['mean_score'], ascending=False)
columns_first = ['mean_score', 'std_score', 'max_score', 'min_score']
columns = columns_first + [c for c in pipe_results.columns if c not in columns_first]
return pipe_results[columns]
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
1. Using grid search for KNN hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a KNN model with `n_neighbors` values in {10, 25, 50, 75, 100, 150, 200, 300} and `p` values in {1, 2, 5}.
from sklearn.neighbors import KNeighborsClassifier
pipe_KNN = Pipeline([('fselector', SelectKBest()),
('knn', KNeighborsClassifier())])
params_pipe_KNN = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'knn__n_neighbors': [10, 25, 50, 75,100,150, 200, 300],
'knn__p': [1, 2, 5]}
gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN,
param_grid=params_pipe_KNN,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_KNN.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_KNN = get_search_results(gs_pipe_KNN)
results_KNN.head()
results_KNN = pd.DataFrame(gs_pipe_KNN.cv_results_['params'])
results_KNN['test_score'] = gs_pipe_KNN.cv_results_['mean_test_score']
results_KNN['knn__p'] = results_KNN['knn__p'].replace([1,2,5], ["Manhattan", "Euclidean", "Minkowski"])
results_KNN['fselector__score_func'] = results_KNN['fselector__score_func'].astype(str)
results_KNN.loc[results_KNN['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_KNN.loc[results_KNN['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_KNN, height=6, col="fselector__k", row="fselector__score_func", hue = "knn__p")
g.map(plt.plot, "knn__n_neighbors", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("KNN performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
For `f_classif` with k = 5, we spot that the score peaks at n_neighbors = 50 (Minkowski) or 75 (Manhattan). For the other `f_classif` parameter combinations, although the trend is highest when n_neighbors > 150, we presume that would be overfitting for the dataset given there are only 733 observations in D_train. For `mutual_info_classif`, the score fluctuates between 0.78 and 0.82 across parameter combinations; as there is no obvious upward trend in the tail, this suggests the tuning is complete for the KNN model.
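For reference, the winning combination and its mean cross-validated AUC can also be read directly from the fitted search object (a minimal check, assuming `gs_pipe_KNN` above has been fitted):
# Best KNN parameter combination and its mean cross-validated AUC (sketch).
print(gs_pipe_KNN.best_params_)
print(round(gs_pipe_KNN.best_score_, 4))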
1. Using grid search for DT hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a DT model with both `gini` and `entropy` criterion, `min_samples_split` values in {10, 50, 60, 70, 80, 100, 150}, `max_depth` in {3, 5, 7, 9}
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(random_state=random_state)
pipe_DT = Pipeline([('fselector', SelectKBest()),
('dt', dt_classifier)])
params_pipe_DT = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'dt__criterion': ['gini', 'entropy'],
'dt__min_samples_split': [ 10, 50, 60, 70, 80, 100, 150],
'dt__max_depth': [3, 5, 7, 9]}
gs_pipe_DT = GridSearchCV(estimator=pipe_DT,
param_grid=params_pipe_DT,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_DT.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_DT = get_search_results(gs_pipe_DT)
results_DT.head()
results_DT = pd.DataFrame(gs_pipe_DT.cv_results_['params'])
results_DT['test_score'] = gs_pipe_DT.cv_results_['mean_test_score']
results_DT_gini = results_DT[results_DT['dt__criterion']=='gini']
results_DT_entropy = results_DT[results_DT['dt__criterion']=='entropy']
results_DT_gini.columns
results_DT_gini['fselector__score_func'] = results_DT_gini['fselector__score_func'].astype(str)
results_DT_gini.loc[results_DT_gini['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_DT_gini.loc[results_DT_gini['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_DT_gini, height=6, col="fselector__k", row="fselector__score_func", hue="dt__max_depth")
g.map(plt.plot, "dt__min_samples_split", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Decision Tree performance comparison (Gini index)", size=18)
g.fig.subplots_adjust(top=.9)
results_DT_entropy['fselector__score_func'] = results_DT_entropy['fselector__score_func'].astype(str)
results_DT_entropy.loc[results_DT_entropy['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_DT_entropy.loc[results_DT_entropy['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
g = sns.FacetGrid(results_DT_entropy, height=6, col="fselector__k", row="fselector__score_func", hue="dt__max_depth")
g.map(plt.plot, "dt__min_samples_split", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Decision Tree performance comparison (Entropy)", size=18)
g.fig.subplots_adjust(top=.9)
We spot that the trend in all the graphs turns downward once min_samples_split > 100 for both the `gini` and `entropy` criteria, which suggests we do not need to add any values above 100 for min_samples_split.
The scores vary across parameter combinations when min_samples_split is between 50 and 100, so we added more values (60, 70 and 80) in that range to complete the tuning of the decision tree model.
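It can also be informative to see which features `SelectKBest` kept inside the best decision tree pipeline; a small sketch (assuming `gs_pipe_DT` and `Data_df` from the steps above are in scope):
# Selected feature names and best parameters of the tuned decision tree pipeline (sketch).
best_fs_DT = gs_pipe_DT.best_estimator_.named_steps['fselector']
print(gs_pipe_DT.best_params_)
print(Data_df.columns[best_fs_DT.get_support()].tolist())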
1. Using grid search for Random Forest hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Random Forest model with `n_estimators` values in {5, 10, 20, 50, 100, 150, 200}, `max_depth` values in {5, 7, 10, 12}.
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=999)
pipe_RF = Pipeline([('fselector', SelectKBest()),
('rf', rf_classifier)])
params_pipe_RF = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'rf__n_estimators': [5, 10, 20, 50, 100, 150, 200],
'rf__max_depth': [5, 7, 10, 12]}
gs_pipe_RF = GridSearchCV(estimator=pipe_RF,
param_grid=params_pipe_RF,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_RF.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_RF = get_search_results(gs_pipe_RF)
results_RF.head()
results_RF = pd.DataFrame(gs_pipe_RF.cv_results_['params'])
results_RF['test_score'] = gs_pipe_RF.cv_results_['mean_test_score']
results_RF['fselector__score_func'] = results_RF['fselector__score_func'].astype(str)
results_RF.loc[results_RF['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_RF.loc[results_RF['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_RF, height=6,col="fselector__k", row="fselector__score_func", hue="rf__max_depth")
g.map(plt.plot, "rf__n_estimators", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Random Forest performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
For `f_classif`, we spot that the trend flattens once n_estimators >= 25 or 50; for `mutual_info_classif`, the score fluctuates a bit more, but the trend flattens as well. As there are only 733 observations in D_train, going beyond 200 for n_estimators would not make sense, so from the graphs above we can conclude that tuning is complete for the random forest model.
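To see which of the selected features the tuned random forest relies on most, we can match its feature importances to the retained feature names (a sketch, assuming `gs_pipe_RF` and `Data_df` are in scope):
# Feature importances of the best random forest, indexed by the retained feature names (sketch).
best_pipe_RF = gs_pipe_RF.best_estimator_
selected_RF = Data_df.columns[best_pipe_RF.named_steps['fselector'].get_support()]
rf_importances = pd.Series(best_pipe_RF.named_steps['rf'].feature_importances_, index=selected_RF)
print(rf_importances.sort_values(ascending=False).round(3))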
1. Using grid search for Naive Bayes hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Naive Bayes model with `var_smoothing` values in a log space from $10^{1}$ to $10^{-9}$.
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
nb_classifier = GaussianNB()
pipe_NB = Pipeline([('fselector', SelectKBest()),
('nb', nb_classifier)])
params_pipe_NB = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'nb__var_smoothing': np.logspace(1,-9, num=100)}
gs_pipe_NB = GridSearchCV(estimator=pipe_NB,
param_grid=params_pipe_NB,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
Data_transformed = PowerTransformer().fit_transform(D_train)
gs_pipe_NB.fit(Data_transformed, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_NB = get_search_results(gs_pipe_NB)
results_NB.head()
results_NB = pd.DataFrame(gs_pipe_NB.cv_results_['params'])
results_NB['test_score'] = gs_pipe_NB.cv_results_['mean_test_score']
results_NB['fselector__score_func'] = results_NB['fselector__score_func'].astype(str)
results_NB.loc[results_NB['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_NB.loc[results_NB['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_NB, height=6,col ="fselector__k", hue ="fselector__score_func")
g.map(plt.plot, "nb__var_smoothing", "test_score", marker=".", alpha=.7)
g.add_legend()
g.fig.suptitle("Naive Bayes performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
We spot that the trend for `f_classif` is very steady during parameter tuning, while for `mutual_info_classif` the score does fluctuate. Naive Bayes assumes that the descriptive features follow a normal distribution; after power-transforming the data, searching `var_smoothing` over the log space from $10^{1}$ to $10^{-9}$ produces some variation in the `mutual_info_classif` score but no significant improvement, which suggests there is no need to tune the Naive Bayes model any further.
1. Using grid search for Logistic Regression hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Logistic Regression model with `penalty` values in {l1, l2}, `C` values in {0.1, 1, 10, 100, 500, 1000, 1500} and class_weight value = `balanced`
from sklearn.linear_model import LogisticRegression
# liblinear supports both the l1 and l2 penalties used in the grid below
lr = LogisticRegression(solver='liblinear', random_state=random_state)
pipe_LR = Pipeline([('fselector', SelectKBest()),
('lr', lr)])
params_pipe_LR = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'lr__penalty': ['l1','l2'],
'lr__C': [0.1,1,10,100, 500,1000, 1500],
'lr__class_weight': ['balanced']}
gs_pipe_LR = GridSearchCV(estimator=pipe_LR,
param_grid=params_pipe_LR,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_LR.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_LR = get_search_results(gs_pipe_LR)
results_LR.head()
results_LR = pd.DataFrame(gs_pipe_LR.cv_results_['params'])
results_LR['test_score'] = gs_pipe_LR.cv_results_['mean_test_score']
results_LR['fselector__score_func'] = results_LR['fselector__score_func'].astype(str)
results_LR.loc[results_LR['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_LR.loc[results_LR['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_LR, height=6,col="fselector__k", row="fselector__score_func", hue="lr__penalty")
g.map(plt.plot, "lr__C", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Logistic Regression performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
We spot that the trend for `f_classif` flattens once C = 1; for `mutual_info_classif`, the score fluctuates but does not go beyond 0.83 for C values from 10 to 1500, which suggests that tuning is complete for the logistic regression model.
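Similarly, the coefficients of the tuned logistic regression can be matched to the retained feature names to see which features drive the prediction (a sketch, assuming `gs_pipe_LR` and `Data_df` are in scope):
# Coefficients of the best logistic regression, indexed by the retained feature names (sketch).
best_pipe_LR = gs_pipe_LR.best_estimator_
selected_LR = Data_df.columns[best_pipe_LR.named_steps['fselector'].get_support()]
lr_coefs = pd.Series(best_pipe_LR.named_steps['lr'].coef_[0], index=selected_LR)
print(lr_coefs.round(3))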
1. Compare the metrics by generating the classification reports for all 5 models with their best estimators, using the **unseen** test portion of the `hold-out` split that we prepared earlier.
# KNN
clf_best = gs_pipe_KNN.best_estimator_
predictions = clf_best.predict(D_test)
print("KNN :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#DT
clf_best = gs_pipe_DT.best_estimator_
predictions = clf_best.predict(D_test)
print("Decision Tree :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#RF
clf_best = gs_pipe_RF.best_estimator_
predictions = clf_best.predict(D_test)
print("Random Forest :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#NB
clf_best = gs_pipe_NB.best_estimator_
predictions = clf_best.predict(D_test)
print("Naive Bayes :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#LR
clf_best = gs_pipe_LR.best_estimator_
predictions = clf_best.predict(D_test)
print("Logistic Regression :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
Since `match` = 1 is the positive label, the confusion matrices above can be interpreted as
$$CM = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$$
where TN denotes True Negative, TP denotes True Positive, FP denotes False Positive and FN denotes False Negative.
In the classification report:
$Precision_{0}=\frac {TN} {(TN+FN)} $, $Precision_{1}=\frac {TP} {(TP+FP)} $, $Macro Avg_{Precision}=\frac {Precision_{0} + Precision_{1}} {2} $
$WeightedAvg_{Precision}=\frac {Precision_{0} * Number_{0} + Precision_{1} * Number_{1}} {Number_{0} + Number_{1}} $
$Accuracy=\frac {TP + TN } {(TP+TN+FP+FN)} $
$Recall_{0}=\frac {TN} {(TN+FP)} $, $Recall_{1}=\frac {TP} {(TP+FN)} $
$F1_{0}=\frac {2 * Precision_{0} * Recall_{0}} {Precision_{0} + Recall_{0}} $, $F1_{1}=\frac {2 * Precision_{1} * Recall_{1}} {Precision_{1} + Recall_{1}} $, $Macro Avg_{F1}=\frac {F1_{0} + F1_{1}} {2} $, $Weighted Avg_{F1}=\frac {F1_{0} * Number_{0} + F1_{1} * Number_{1}} {Number_{0} + Number_{1}} $
For example, $$CM_{KNN} = \begin{bmatrix} 252 & 7 \\ 38 & 18 \end{bmatrix}$$
TN = 252, FP = 7, FN = 38, TP = 18, $Number_{0}$ = 259, $Number_{1}$ = 56
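As a quick numerical check of the formulas above, we can recompute the report entries from this confusion matrix (a small sketch using the KNN values):
# Recompute the classification-report metrics from the KNN confusion matrix (sketch).
TN, FP, FN, TP = 252, 7, 38, 18
precision_0, precision_1 = TN / (TN + FN), TP / (TP + FP)
recall_0, recall_1 = TN / (TN + FP), TP / (TP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (f1_0 * (TN + FP) + f1_1 * (FN + TP)) / (TN + FP + FN + TP)
print(round(accuracy, 3), round(f1_0, 3), round(f1_1, 3), round(macro_f1, 3), round(weighted_f1, 3))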
From the classification reports of all the models, we can summarize that for Naive Bayes and Logistic Regression a large number of non-match observations are classified incorrectly as matches, which is why the accuracy of these models is lower. Thus we found that the best estimators of Naive Bayes and Logistic Regression can distinguish positive labels better than the other models, while KNN, Decision Tree and Random Forest make fewer mistakes when predicting negative labels on the unseen test dataset.
2. Prepare a dataframe with the `fpr` and `tpr` for all 5 models (using their best estimators) by calculating the prediction scores with the `predict_proba` method in scikit-learn, and plot the ROC curves of all the models in the same graph.
clf_best = gs_pipe_KNN.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
KNN_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
KNN_df['model']='KNN'
neworder = ['model','fpr','tpr']
KNN_df=KNN_df.reindex(columns=neworder)
clf_best = gs_pipe_DT.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
DT_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
DT_df['model']='Decision Tree'
neworder = ['model','fpr','tpr']
DT_df=DT_df.reindex(columns=neworder)
clf_best = gs_pipe_RF.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
RF_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
RF_df['model']='Random Forest'
neworder = ['model','fpr','tpr']
RF_df=RF_df.reindex(columns=neworder)
clf_best = gs_pipe_NB.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
NB_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
NB_df['model']='Naive Bayes'
neworder = ['model','fpr','tpr']
NB_df=NB_df.reindex(columns=neworder)
clf_best = gs_pipe_LR.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
LR_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
LR_df['model']='Logistic Regression'
neworder = ['model','fpr','tpr']
LR_df=LR_df.reindex(columns=neworder)
result = pd.concat([KNN_df, DT_df, RF_df, NB_df, LR_df], ignore_index=True, sort=False)
x=np.linspace(0.0,1.0,101)
plt.figure(figsize=(10, 10))
sns.lineplot(x="fpr", y="tpr", hue="model", data=result)
plt.plot(x, x, color='black')
plt.title("Comparison of Roc Curve of 5 classifiers", fontsize = 16)
plt.show()
From the ROC curves, we found that apart from the decision tree, the other 4 models perform similarly on the unseen dataset (an ideal model would reach fpr = 0 and tpr = 1).
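The curves can be complemented with the test-set AUC value of each tuned pipeline (a hedged sketch, assuming the `gs_pipe_*` objects fitted earlier are still in scope):
# Test-set AUC of each best estimator (sketch).
for name, gs in [('KNN', gs_pipe_KNN), ('Decision Tree', gs_pipe_DT),
                 ('Random Forest', gs_pipe_RF), ('Naive Bayes', gs_pipe_NB),
                 ('Logistic Regression', gs_pipe_LR)]:
    scores = gs.best_estimator_.predict_proba(D_test)[:, 1]
    print(f"{name:20s} test AUC = {metrics.roc_auc_score(t_test, scores):.3f}")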
3. Let's use a paired t-test to compare the mean scores of the models against each other and look for significant differences. We set a different random seed this time.
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=3,
random_state=222)
cv_results_KNN = cross_val_score(estimator=gs_pipe_KNN.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_DT = cross_val_score(estimator=gs_pipe_DT.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_RF = cross_val_score(estimator=gs_pipe_RF.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_NB = cross_val_score(estimator=gs_pipe_NB.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_LR = cross_val_score(estimator=gs_pipe_LR.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
print("KNN cross validation mean score:", cv_results_KNN.mean().round(3))
print("Decision Tree cross validation mean score:", cv_results_DT.mean().round(3))
print("Random Forest cross validation mean score:", cv_results_RF.mean().round(3))
print("Naive Bayes cross validation mean score:", cv_results_NB.mean().round(3))
print("Logistic Regression cross validation mean score:", cv_results_LR.mean().round(3))
print("==================================================================")
print("Difference between KNN and Decisioin Tree: ", stats.ttest_rel(cv_results_KNN, cv_results_DT).pvalue)
print("Difference between KNN and Random Forest: ",stats.ttest_rel(cv_results_KNN, cv_results_RF).pvalue)
print("Difference between KNN and Naive Bayes: ",stats.ttest_rel(cv_results_KNN, cv_results_NB).pvalue)
print("Difference between KNN and Logistic Regression: ", stats.ttest_rel(cv_results_KNN, cv_results_LR).pvalue)
print("==================================================================")
print("Difference between Decision Tree and Random Forest: ", stats.ttest_rel(cv_results_DT, cv_results_RF).pvalue)
print("Difference between Decision Tree and Naive Bayes: ", stats.ttest_rel(cv_results_DT, cv_results_NB).pvalue)
print("Difference between Decision Tree and Logistic Regression: ", stats.ttest_rel(cv_results_DT, cv_results_LR).pvalue)
print("==================================================================")
print("Difference between Random Forest and Naive Bayes: ", stats.ttest_rel(cv_results_RF, cv_results_NB).pvalue)
print("Difference between Random Forest and Logistic Regression: ", stats.ttest_rel(cv_results_RF, cv_results_LR).pvalue)
print("Difference between Logistic Regression and Naive Bayes: ", stats.ttest_rel(cv_results_NB, cv_results_LR).pvalue)
At the 95% confidence level, we found that only the decision tree shows a significant difference (p-value < 0.05) in mean AUC score compared to the other models.
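For readability, the pairwise p-values above can also be collected into a single table (a sketch using the `cv_results_*` arrays computed above):
# Pairwise paired t-test p-values between the cross-validated AUC scores of the models (sketch).
from itertools import combinations
cv_scores = {'KNN': cv_results_KNN, 'Decision Tree': cv_results_DT, 'Random Forest': cv_results_RF,
             'Naive Bayes': cv_results_NB, 'Logistic Regression': cv_results_LR}
pairwise_pvals = pd.DataFrame([{'model A': a, 'model B': b,
                                'p-value': stats.ttest_rel(cv_scores[a], cv_scores[b]).pvalue}
                               for a, b in combinations(cv_scores, 2)])
print(pairwise_pvals.round(4))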
One of the biggest limitations of our project is the handling of missing values: we dropped all the observations with missing values. Below are the top 3 features with the most missing values:
* expected_num_interested_in_me, 6578 observations with missing values
* expected_num_matches, 1173 observations with missing values
* shared_interests_partner, 1067 observations with missing values
If we fit the whole dataset to select the best 5 features with `f_classif`:
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=5)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:5]
fs_indices_fscore
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
None of the top 3 features with missing values would be selected.
If we fit the whole dataset to select the best 10 features with `f_classif` and `mutual_info_classif`:
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=10)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
fs_indices_fscore
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
fs_fit_fscore = fs.SelectKBest(fs.mutual_info_classif, k=10)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
fs_indices_fscore
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
`expected_num_matches` and `shared_interests_partner` would then be selected, but `expected_num_interested_in_me` would still not be among the top 10 selections under either the `mutual_info_classif` or the `f_classif` metric. It would be interesting to compare the whole analysis after removing the `expected_num_interested_in_me` feature before dropping any observations with missing values; we would have a much larger dataset, and we believe the results would change dramatically.
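As a rough check of how much data this would retain, we can re-read the raw CSV and compare row counts (a sketch, reusing `speed_url` from the import step; the cleaned `speed_df` above is left untouched):
# Rows retained with vs without the expected_num_interested_in_me column before dropna (sketch).
raw_df = pd.read_csv(io.StringIO(requests.get(speed_url).content.decode('utf-8')))
print("dropna with all columns:   ", raw_df.dropna().shape)
print("dropna without the column: ", raw_df.drop(columns=['expected_num_interested_in_me']).dropna().shape)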
Also, we deleted all the binned features; however, from reviewing the two-variable plots we find the values in `interests_correlate` are not discrete, so we might also compare the analysis if we bin this feature instead. If time and resources allowed, it would also be interesting to use all the descriptive features to train each of our classifiers and see whether the parameter tuning would differ.
On the other hand, cross-validating all the parameters of the 5 chosen models is a strength of our investigation: when we compare the mean scores obtained by cross-validating with a different random seed, they stay close to those of the best estimators found with the original random seed on the training dataset.
The main goal of this analysis is to predict whether a person would match (go on a second date) with another person based on the data collected during speed dating.
We found that visualizing the data helped us considerably throughout the analysis.
From the trained models with optimal parameters, we find that Naive Bayes distinguishes matches best. However, if we applied this model to help the speed-dating organizer identify matches on unseen data, we might end up contacting too many couples that would not actually match (for example, on the test data set we would contact 188 + 55 potential match cases, only to find that the 188 false-positive cases were wasted effort).
In contrast, if we used the KNN, Decision Tree or Random Forest model, we would miss a high percentage of match cases on the test data set (e.g. for KNN, we would miss 67.8% (38/56) of the potential matches).
Even though the AUC score from the above analysis is not low despite the imbalanced class label, if we needed to deploy this model for real we would strongly suggest training and fine-tuning it on a larger dataset (dropping the `expected_num_interested_in_me` feature before removing observations with missing values would be one option for retaining more data). Simply feeding in the sampled test data already revealed discrepancies in each model, which prevents us from achieving our goal in an overall satisfactory way. Reviewing the survey questions and improving the relevance of the descriptive features could also enhance the tuning process and yield a better model. In short, for industry deployment it is essential to repeatedly train and fine-tune a classifier that achieves higher accuracy in detecting both the positive and the negative class labels on unseen data, in order to meet the business goal.