Please refer to the Jupyter NBExtension README page to display the table of contents in a floating window and to expand the body of the contents.
The dataset was created from participants in experimental speed dating events held from 2002 to 2004. During these events, each participant had a four-minute "first date" with every other participant of the opposite sex, after which they rated their partner. They were also asked whether they would like to see their date again; if so, the first date was classified as a positive match.
The dataset is obtained from https://github.com/vaksakalli/datasets. Each observation (row) represents a single dating event. All columns are descriptive features except the last column, `match`, which is our target feature. The descriptive features are used to classify the value of the target feature; they are listed below:
| Feature | Description |
| --- | --- |
| `has_null` | Whether the row has any null values |
| `wave` | Experiment number |
| `gender` | Gender of self |
| `age` | Age of self |
| `age_o` | Age of partner |
| `d_age` | Difference in age |
| `d_d_age` | Binned values of difference in age, 4 groups: [0-1], [2-3], [4-6] and [7-37] |
| `race` | Race of self |
| `race_o` | Race of partner |
| `samerace` | Whether the two persons have the same race or not |
| `importance_same_race` | How important is it that partner is of same race? |
| `importance_same_religion` | How important is it that partner has same religion? |
| `d_importance_same_race` | Binned values of importance of same race |
| `d_importance_same_religion` | Binned values of importance of same religion |
| `field` | Field of study |
| `pref_o_attractive` | How important does partner rate attractiveness |
| `pref_o_sinsere` | How important does partner rate sincerity |
| `pref_o_intelligence` | How important does partner rate intelligence |
| `pref_o_funny` | How important does partner rate being funny |
| `pref_o_ambitious` | How important does partner rate ambition |
| `pref_o_shared_interests` | How important does partner rate having shared interests |
| `d_pref_o_attractive` | Binned values for how important partner rates attractiveness, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_sinsere` | Binned values for how important partner rates sincerity, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_intelligence` | Binned values for how important partner rates intelligence, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_funny` | Binned values for how important partner rates being funny, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_ambitious` | Binned values for how important partner rates ambition, 3 groups: [0-15], [16-20] and [21-100] |
| `d_pref_o_shared_interests` | Binned values for how important partner rates having shared interests, 3 groups: [0-15], [16-20] and [21-100] |
| `attractive_o` | Rating by partner (about me) at night of event on attractiveness |
| `sincere_o` | Rating by partner (about me) at night of event on sincerity |
| `intelligence_o` | Rating by partner (about me) at night of event on intelligence |
| `funny_o` | Rating by partner (about me) at night of event on being funny |
| `ambitous_o` | Rating by partner (about me) at night of event on being ambitious |
| `shared_interests_o` | Rating by partner (about me) at night of event on shared interests |
| `d_attractive_o` | Binned values for rating by partner on attractiveness, 3 groups: [0-5], [6-8] and [9-10] |
| `d_sinsere_o` | Binned values for rating by partner on sincerity, 3 groups: [0-5], [6-8] and [9-10] |
| `d_intelligence_o` | Binned values for rating by partner on intelligence, 3 groups: [0-5], [6-8] and [9-10] |
| `d_funny_o` | Binned values for rating by partner on being funny, 3 groups: [0-5], [6-8] and [9-10] |
| `d_ambitous_o` | Binned values for rating by partner on being ambitious, 3 groups: [0-5], [6-8] and [9-10] |
| `d_shared_interests_o` | Binned values for rating by partner on shared interests, 3 groups: [0-5], [6-8] and [9-10] |
| `attractive_important` | What do you look for in a partner - attractiveness |
| `sincere_important` | What do you look for in a partner - sincerity |
| `intellicence_important` | What do you look for in a partner - intelligence |
| `funny_important` | What do you look for in a partner - being funny |
| `ambtition_important` | What do you look for in a partner - ambition |
| `shared_interests_important` | What do you look for in a partner - shared interests |
| `d_attractive_important` | Binned values for what you look for in a partner - attractiveness, 3 groups: [0-15], [16-20] and [21-100] |
| `d_sincere_important` | Binned values for what you look for in a partner - sincerity, 3 groups: [0-15], [16-20] and [21-100] |
| `d_intellicence_important` | Binned values for what you look for in a partner - intelligence, 3 groups: [0-15], [16-20] and [21-100] |
| `d_funny_important` | Binned values for what you look for in a partner - being funny, 3 groups: [0-15], [16-20] and [21-100] |
| `d_ambtition_important` | Binned values for what you look for in a partner - ambition, 3 groups: [0-15], [16-20] and [21-100] |
| `d_shared_interests_important` | Binned values for what you look for in a partner - shared interests, 3 groups: [0-15], [16-20] and [21-100] |
| `attractive` | Rate yourself - attractiveness |
| `sincere` | Rate yourself - sincerity |
| `intelligence` | Rate yourself - intelligence |
| `funny` | Rate yourself - being funny |
| `ambition` | Rate yourself - ambition |
| `d_attractive` | Binned values for rate yourself - attractiveness, 3 groups: [0-5], [6-8] and [9-10] |
| `d_sincere` | Binned values for rate yourself - sincerity, 3 groups: [0-5], [6-8] and [9-10] |
| `d_intelligence` | Binned values for rate yourself - intelligence, 3 groups: [0-5], [6-8] and [9-10] |
| `d_funny` | Binned values for rate yourself - being funny, 3 groups: [0-5], [6-8] and [9-10] |
| `d_ambition` | Binned values for rate yourself - ambition, 3 groups: [0-5], [6-8] and [9-10] |
| `attractive_partner` | Rate your partner - attractiveness |
| `sincere_partner` | Rate your partner - sincerity |
| `intelligence_partner` | Rate your partner - intelligence |
| `funny_partner` | Rate your partner - being funny |
| `ambition_partner` | Rate your partner - ambition |
| `shared_interests_partner` | Rate your partner - shared interests |
| `d_attractive_partner` | Binned values for rate your partner - attractiveness, 3 groups: [0-5], [6-8] and [9-10] |
| `d_sincere_partner` | Binned values for rate your partner - sincerity, 3 groups: [0-5], [6-8] and [9-10] |
| `d_intelligence_partner` | Binned values for rate your partner - intelligence, 3 groups: [0-5], [6-8] and [9-10] |
| `d_funny_partner` | Binned values for rate your partner - being funny, 3 groups: [0-5], [6-8] and [9-10] |
| `d_ambition_partner` | Binned values for rate your partner - ambition, 3 groups: [0-5], [6-8] and [9-10] |
| `d_shared_interests_partner` | Binned values for rate your partner - shared interests, 3 groups: [0-5], [6-8] and [9-10] |
| `sports` | Your own interests - sports |
| `tvsports` | Your own interests - TV sports |
| `exercise` | Your own interests - exercise |
| `dining` | Your own interests - dining |
| `museums` | Your own interests - museums |
| `art` | Your own interests - art |
| `hiking` | Your own interests - hiking |
| `gaming` | Your own interests - gaming |
| `clubbing` | Your own interests - clubbing |
| `reading` | Your own interests - reading |
| `tv` | Your own interests - TV |
| `theater` | Your own interests - theater |
| `movies` | Your own interests - movies |
| `concerts` | Your own interests - concerts |
| `music` | Your own interests - music |
| `shopping` | Your own interests - shopping |
| `yoga` | Your own interests - yoga |
| `d_sports` | Binned values for your own interests - sports, 3 groups: [0-5], [6-8] and [9-10] |
| `d_tvsports` | Binned values for your own interests - TV sports, 3 groups: [0-5], [6-8] and [9-10] |
| `d_exercise` | Binned values for your own interests - exercise, 3 groups: [0-5], [6-8] and [9-10] |
| `d_dining` | Binned values for your own interests - dining, 3 groups: [0-5], [6-8] and [9-10] |
| `d_museums` | Binned values for your own interests - museums, 3 groups: [0-5], [6-8] and [9-10] |
| `d_art` | Binned values for your own interests - art, 3 groups: [0-5], [6-8] and [9-10] |
| `d_hiking` | Binned values for your own interests - hiking, 3 groups: [0-5], [6-8] and [9-10] |
| `d_gaming` | Binned values for your own interests - gaming, 3 groups: [0-5], [6-8] and [9-10] |
| `d_clubbing` | Binned values for your own interests - clubbing, 3 groups: [0-5], [6-8] and [9-10] |
| `d_reading` | Binned values for your own interests - reading, 3 groups: [0-5], [6-8] and [9-10] |
| `d_tv` | Binned values for your own interests - TV, 3 groups: [0-5], [6-8] and [9-10] |
| `d_theater` | Binned values for your own interests - theater, 3 groups: [0-5], [6-8] and [9-10] |
| `d_movies` | Binned values for your own interests - movies, 3 groups: [0-5], [6-8] and [9-10] |
| `d_concerts` | Binned values for your own interests - concerts, 3 groups: [0-5], [6-8] and [9-10] |
| `d_music` | Binned values for your own interests - music, 3 groups: [0-5], [6-8] and [9-10] |
| `d_shopping` | Binned values for your own interests - shopping, 3 groups: [0-5], [6-8] and [9-10] |
| `d_yoga` | Binned values for your own interests - yoga, 3 groups: [0-5], [6-8] and [9-10] |
| `interests_correlate` | Correlation between participant's and partner's ratings of interests |
| `d_interests_correlate` | Binned values of the correlation between participant's and partner's ratings of interests, 3 groups: [-1-0], [0-0.33] and [0.33-1] |
| `expected_happy_with_sd_people` | How happy do you expect to be with the people you meet during the speed-dating event? |
| `expected_num_interested_in_me` | Out of the 20 people you will meet, how many do you expect will be interested in dating you? |
| `expected_num_matches` | How many matches do you expect to get? |
| `d_expected_happy_with_sd_people` | Binned values for how happy you expect to be with the people you meet, 3 groups: [0-4], [5-6] and [7-10] |
| `d_expected_num_interested_in_me` | Binned values for how many of the 20 people you expect will be interested in dating you, 3 groups: [0-3], [4-9] and [10-20] |
| `d_expected_num_matches` | Binned values for how many matches you expect to get, 3 groups: [0-2], [3-5] and [5-18] |
| `like` | Did you like your partner? |
| `guess_prob_liked` | How likely do you think your partner likes you? |
| `d_like` | Binned values for did you like your partner, 3 groups: [0-5], [6-8] and [9-10] |
| `d_guess_prob_liked` | Binned values for how likely you think your partner likes you, 3 groups: [0-4], [5-6] and [7-10] |
| `met` | Have you met your partner before? |
| `match` | Target feature (binary): '1' is the positive class and denotes a match, '0' denotes no match |
The objective of this analysis is to predict whether a person would match (go on a second date) with another person, based on the data collected during the short interaction (the 4-minute first date) they had with each other.
To do this, we fit the processed dataset and compare the prediction results from different classifiers to determine which would be the best model after parameter tuning. Five classifiers are used in the analysis:
- K-Nearest Neighbours (KNN)
- Decision Tree
- Random Forest
- Naive Bayes
- Logistic Regression
1. We import all the necessary packages.
2. We set the display options and random seeds.
3. We import the CSV file directly from the GitHub URL into the dataframe `speed_df`.
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
sns.set(style="darkgrid")
import numpy as np
import pandas as pd
import io
import requests
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from sklearn import metrics
from scipy import stats
np.random.seed(999)
random_state=999
# so that we can see all the columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
getattr(ssl, '_create_unverified_context', None)):
ssl._create_default_https_context = ssl._create_unverified_context
speed_url = 'https://raw.githubusercontent.com/vaksakalli/datasets/master/speed_dating.csv'
url_content = requests.get(speed_url).content
speed_df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
4. Once the dataframe is created, we check whether the numbers of rows and columns of the imported data match those of the CSV file.
speed_df.shape
5. We then randomly select 10 rows to get an overview of the dataset's content.
display(HTML('<b>Table 1: Sample of Speed Dating Dataset </b>'))
speed_df.sample(n=10, random_state=8)
6. Explore the missing values in the dataset
speed_df.isnull().sum()
We found that the column `expected_num_interested_in_me` has the highest count of missing values (6578 out of 8378 records).
7. Randomly select 10 observations which have missing values in `expected_num_interested_in_me` and compare them to the values in the corresponding binned column `d_expected_num_interested_in_me`.
check_na_df=speed_df[np.isnan(speed_df.expected_num_interested_in_me)].sample(n=10, random_state=8)
check_na_df.loc[:, ["expected_num_interested_in_me", "d_expected_num_interested_in_me"]]
We found that all the observations with missing values in `expected_num_interested_in_me` are put in the lowest bin of `d_expected_num_interested_in_me`. Similar findings hold for the columns `shared_interests_partner` and `d_shared_interests_partner`, as shown below.
check_na_df=speed_df[np.isnan(speed_df.shared_interests_partner)].sample(n=10, random_state=8)
check_na_df.loc[:, ["shared_interests_partner", "d_shared_interests_partner"]]
From a further web search, we found that this dataset is also available at https://www.openml.org/d/40536. In the OpenML repository, none of the binned columns exist.
8. Based on the principle of retaining the original (unbinned) values and of predicting the target from complete descriptive features, we drop all observations with null values, as well as the following columns:
- all the columns with the `d_` prefix except `d_age`, since `d_age` is not a binned column; it contains the difference in age between the participant and his/her partner.
- the column `has_null`, as it only indicates whether the row contains any NA value and does not help determine our target feature.
- the column `wave`, as it only contains the experimental batch number of the dating event and likewise does not help determine our target feature.
# flag the binned columns: names starting with 'd_', excluding 'd_age'
bool_binned_cols = (speed_df.columns.str.find('d_', 0, 2)!=-1) & (speed_df.columns.str.find('d_age', 0, 5)==-1)
binned_cols = speed_df.columns[bool_binned_cols].tolist()
speed_df= speed_df.dropna()
speed_df=speed_df.drop(columns =['has_null','wave'])
speed_df=speed_df.drop(columns = binned_cols)
print(f"Shape of the dataset is {speed_df.shape} \n")
na_sum=speed_df.isna().sum()!=0
print(f"Final Check for Null Values: {sum(na_sum)} \n")
print(f"Now that all null values have been removed, lets check the data types of these attributes. \n")
display(HTML('<b>Table 2: Data types of the attributes </b>'))
print(speed_df.dtypes)
9. Check the *descriptive statistics* of all the numerical features using the *describe* function.
display(HTML('<b>Table 3: Summary of continuous features </b>'))
speed_df.describe(include = np.number).round(2)
10. Get the *summary statistics* of the categorical variables
display(HTML('<b>Table 4: Summary Statistics of categorical features </b>'))
# select the string-typed (object) columns
categorical_cols = speed_df.columns[speed_df.dtypes == object].tolist()
for categorical_col in categorical_cols:
print(categorical_col + ':')
print(speed_df[categorical_col].value_counts().round(2))
print('\n')
11. Group the values of `field` to:
- avoid differences in the capitalization of letters
- make the values less scattered, so that disciplines of a similar nature are combined into the same value.
(A more compact, dictionary-based alternative is sketched after the replacement calls below.)
speed_df['field']= speed_df['field'].str.strip()
speed_df['field']=speed_df['field'].replace(['law', 'LAW'], 'Law')
speed_df['field']=speed_df['field'].replace(["'Social Work'", "'social work'", "Sociology", "'Masters of Social Work'"], 'Social Work')
speed_df['field']=speed_df['field'].replace(["philosophy","psychology", "'psychology and english'", "'Organizational Psychology'", "Psychology"], 'Psychology/Philosophy')
speed_df['field']=speed_df['field'].replace(["chemistry"], 'Chemistry')
speed_df['field']=speed_df['field'].replace(["'Electrical Engineering'", "'Mechanical Engineering'", "'Biomedical Engineering'", "'Computer Science'", "Engineering"], 'Engineering/Computer Science')
speed_df['field']=speed_df['field'].replace(["Business","Finance","Economics", "Marketing", "Finance&Economics", "Finanace","'Economics; Sociology'", "'Business & International Affairs'"], "Business/Finance")
speed_df['field']=speed_df['field'].replace(["MBA", "Business- MBA", "'Business- MBA'", "'Business [MBA]'"], "Business [MBA]")
speed_df['field']=speed_df['field'].replace(["'political science'", "'Economics and Political Science'"], 'Political Science')
speed_df['field']=speed_df['field'].replace(["Journalism", "Communications", "'Masters in Public Administration'"], "Journalism/Communications")
speed_df['field']=speed_df['field'].replace(["'Elementary/Childhood Education [MA]'","'Educational Psychology'","'International Educational Development'", "'TC [Health Ed]'"], "Education")
speed_df['field']=speed_df['field'].replace(["Medicine","microbiology","'Biomedical Engineering'"], "Medical Science")
speed_df['field']=speed_df['field'].replace(["'Art History/medicine'", "Mathematics", "'Mathematical Finance'", "'financial math'","'Applied Maths/Econs'", "Statistics"], "Mathematical-related")
speed_df['field']=speed_df['field'].replace(["'Operations Research'","'Operations Research [SEAS]'"], 'Operational Research')
speed_df['field']=speed_df['field'].replace(["English", "Polish", "'German Literature'", "'Speech Language Pathology'"], "Language-related")
speed_df['field']=speed_df['field'].replace(["English", "Polish", "'German Literature'", "'Speech Language Pathology'"], "Language-related")
12. Assign all the descriptive features (every column except `match`) to `Data`, and the target column `match` to `target`.
speed_df.to_csv('speed_clean.csv', index=False)
Data = speed_df.drop(columns = 'match')
target = speed_df['match']
13. Perform one-hot encoding on the categorical descriptive features; if a categorical feature has only 2 levels, encode it with a single binary variable.
# get the list of categorical descriptive features
categorical_cols = Data.columns[Data.dtypes==object].tolist()
# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
for col in categorical_cols:
n = len(Data[col].unique())
if (n == 2):
Data[col] = pd.get_dummies(Data[col], drop_first=True)
# for other categorical features (with > 2 levels),
# use regular one-hot-encoding
# if a feature is numeric, it will be untouched
Data = pd.get_dummies(Data)
14. Check the data types and shape of the descriptive features after encoding.
Data.dtypes
Data.shape
15. Retain the column names of the descriptive and target features by copying them into the new dataframes `Data_df` and `target_df`.
Data_df = Data.copy()
target_df =target.copy()
print("Target Type:", type(target))
print("Counts Using NumPy:")
print(np.unique(target, return_counts = True))
print("Counts Using Pandas:")
print(pd.Series(target).value_counts())
16. Perform normalization on all the descriptive features using `MinMaxScaler()`.
from sklearn import preprocessing
Data = preprocessing.MinMaxScaler().fit_transform(Data)
target=target.values
17. Ensure both `Data` and `target` are NumPy arrays, so that they are ready to pass into scikit-learn for modelling.
print(type(Data))
print(type(target))
1. Explore the `gender` distribution in the dataset.
plt.figure()
speed_df['gender'].value_counts().plot(kind='bar')
plt.xticks(rotation='horizontal')
plt.ylabel('Count')
plt.title('Gender Distribution', fontsize = 16)
plt.show()
display(HTML('<b>Figure 1: Gender Distribution </b>'))
Figure 1 indicates that the two genders are equally distributed, so there is no gender bias in the sample.
2. Explore the distribution of `race` in the dataset.
plt.figure()
speed_df['race'].value_counts().plot(kind='bar')
plt.ylabel('Number of People')
plt.title('Racial Distribution',fontsize = 16)
plt.xticks(rotation=80)
plt.show()
display(HTML('<b>Figure 2: Racial Distribution </b>'))
Figure 2 shows us that there is a clear bias towards European Americans as they represent 60% of the dataset. Moreover we can see that there are very few Black/African Americans and no Native Americans.
3. Explore the `field` distribution in the dataset.
plt.figure(figsize=(10, 10))
field_df=pd.DataFrame(speed_df['field'].value_counts()).reset_index()
field_df.columns=['field_name', 'count']
sns.barplot(x="field_name", y="count", data=field_df)
plt.title('Field Distribution',fontsize = 16)
plt.xticks(rotation=80)
display(HTML('<b>Figure 3: Field Distribution </b>'))
Figure 3 shows the number of participants in each professional field.
1. Explore the count of each descriptive feature grouped by the values of the target feature.
speed_df_no_target = speed_df.drop(columns = 'match')
for i in speed_df_no_target.columns:
title_str='Count of ' + i.upper() + ' group by target feature "MATCH"'
plt.figure(figsize=(10,6))
plt.xticks(rotation=90)
g=sns.countplot(speed_df[i],hue=speed_df['match'])
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title(title_str, fontsize = 16)
2. Explore the proportion of each descriptive feature grouped by the values of the target feature.
for i in speed_df_no_target.columns:
title_str='Proportion of ' + i.upper() + ' group by target feature "MATCH"'
col_list = speed_df[i].dropna().unique().tolist()
match_list = speed_df['match'].dropna().unique().tolist()
result=pd.DataFrame(columns=[i, 'match', 'percentage'])
for col_element in col_list:
for match_value in match_list:
element_num = len(speed_df[(speed_df[i]==col_element) & (speed_df['match']==match_value)])
element_percent = element_num / len(speed_df[speed_df['match'] == match_value])
result.loc[len(result)] = [col_element, match_value, element_percent]
# print(col_element, match_value, element_num, element_percent)
# print(result)
plt.figure()
sns.barplot(x=i, y="percentage", hue="match", data=result)
plt.xticks(rotation=80)
plt.title(title_str, fontsize = 16)
plt.ylabel('Proportion')
plt.show()
After going through all the two-variable plots, we found that besides `interests_correlate`, all the numerical features in the dataset are discrete.
1. Explore the relationship between Age, Gender and the target `Match`
sns.boxplot(x="gender", y="age", hue='match', data=speed_df)
plt.title('Age vs Gender group by "MATCH"', fontsize = 16)
2. Explore the relationship between Age, Race and the target `Match`
plt.xticks(rotation=80)
sns.boxplot(x="race", y="age", hue='match', data=speed_df)
plt.title('Age vs Race group by "MATCH"', fontsize = 16)
3. Explore the relationship between Gender, the attractiveness of the participant as rated by their partner, and the target `Match`.
plt.xticks(rotation='horizontal')
sns.boxplot(x="gender", y="attractive_o", hue='match', data=speed_df)
plt.title('Gender vs Attractiveness Rated by Partner group by "MATCH"', fontsize = 16)
4. Explore the relationship between Age of participant, age of partner and the target `Match`
sns.lmplot(x='age', y='age_o', markers=['o', 'x'], hue='match',
data=speed_df.loc[speed_df['match'].isin([0,1])],
fit_reg=False
)
plt.title('Age of participant vs Age of partner group by "MATCH"', fontsize = 16)
5. Explore the relationship between the participant's attractiveness as rated by themselves vs as rated by their partner, and the target `Match`.
sns.lmplot(x='attractive', y='attractive_o', markers=['o', 'x'], hue='match',
data=speed_df.loc[speed_df['match'].isin([0,1])],
fit_reg=False
)
plt.title('Attractiveness rated by participant own self vs rated by partner group by "MATCH"', fontsize = 16)
6. Explore the relationship between how much you like your partner, how likely you think your partner likes you, and the target `Match`.
sns.lmplot(x='like', y='guess_prob_liked', markers=['o', 'x'], hue='match',
data=speed_df.loc[speed_df['match'].isin([0,1])],
fit_reg=False
)
plt.title('Like your partner vs How likely do you think your partner like you group by "MATCH"', fontsize = 16)
Grid search and repeated cross-validation are used to find the optimal hyperparameters of the following classifiers: KNN, Decision Tree, Random Forest, Naive Bayes and Logistic Regression.
Let's observe the proportion of class label and number of observations in the dataset again.
target_df.value_counts(normalize=True)
Data.shape
Clearly the class label is imbalanced in the final dataset, and the data size is not huge. Thus we chose `RepeatedStratifiedKFold`, which preserves the class proportions in each fold and averages out the variance of the validation results.
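As a quick sanity check (a minimal sketch, assuming the `Data` and `target` arrays prepared above), we can confirm that `RepeatedStratifiedKFold` keeps the proportion of positive labels roughly constant in every validation fold:
# Verify stratification: the positive rate should be similar in each fold (sketch).
from sklearn.model_selection import RepeatedStratifiedKFold
rskf_check = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=999)
for i, (train_idx, val_idx) in enumerate(rskf_check.split(Data, target)):
    print(f"fold {i:2d}: positive rate = {target[val_idx].mean():.3f}")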
We then define:
- a pipeline to host the feature-selection step and the classifier
- a grid search to host the feature-selection score function and all the parameters of the classifier
so that we can exhaustively search for the best parameters over all possible combinations while training on the dataset.
For performance analysis, we choose AUC as the scoring metric, since it provides a fairer judgement when the class labels are imbalanced.
We execute the following 3 steps before performing feature selection and hyperparameter tuning on each classifier. The products of these steps are reused in each classifier's tuning process.
1. Prepare the dataset using the `holdout` approach: split the sampled data into 70% training and 30% testing (the unseen data used later for the performance analysis of each model).
from sklearn import feature_selection as fs
from sklearn.model_selection import train_test_split
# The "\" character below allows us to split the line across multiple lines
D_train, D_test, t_train, t_test = \
train_test_split(Data, target, test_size = 0.3,
stratify=target, shuffle=True, random_state=888)
print (D_train.shape)
print (D_test.shape)
print (t_train.shape)
print (t_test.shape)
2. Define the cross-validation method `RepeatedStratifiedKFold` (with k=5, n=2) and scoring metric `AUC`.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, GridSearchCV
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=999)
scoring_metric = 'roc_auc'
3. Define a custom function to format the search results and return them as a pandas dataframe.
def get_search_results(gs):
def model_result(scores, params):
scores = {'mean_score': np.mean(scores),
'std_score': np.std(scores),
'min_score': np.min(scores),
'max_score': np.max(scores)}
return pd.Series({**params,**scores})
models = []
scores = []
for i in range(gs.n_splits_):
key = f"split{i}_test_score"
r = gs.cv_results_[key]
scores.append(r.reshape(-1,1))
all_scores = np.hstack(scores)
for p, s in zip(gs.cv_results_['params'], all_scores):
models.append((model_result(s, p)))
pipe_results = pd.concat(models, axis=1).T.sort_values(['mean_score'], ascending=False)
columns_first = ['mean_score', 'std_score', 'max_score', 'min_score']
columns = columns_first + [c for c in pipe_results.columns if c not in columns_first]
return pipe_results[columns]
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
1. Using grid search for KNN hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a KNN model with `n_neighbors` values in {10, 25, 50, 75, 100, 150, 200, 300} and `p` values in {1, 2, 5}.
from sklearn.neighbors import KNeighborsClassifier
pipe_KNN = Pipeline([('fselector', SelectKBest()),
('knn', KNeighborsClassifier())])
params_pipe_KNN = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'knn__n_neighbors': [10, 25, 50, 75,100,150, 200, 300],
'knn__p': [1, 2, 5]}
gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN,
param_grid=params_pipe_KNN,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_KNN.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_KNN = get_search_results(gs_pipe_KNN)
results_KNN.head()
results_KNN = pd.DataFrame(gs_pipe_KNN.cv_results_['params'])
results_KNN['test_score'] = gs_pipe_KNN.cv_results_['mean_test_score']
results_KNN['knn__p'] = results_KNN['knn__p'].replace([1,2,5], ["Manhattan", "Euclidean", "Minkowski"])
results_KNN['fselector__score_func'] = results_KNN['fselector__score_func'].astype(str)
results_KNN.loc[results_KNN['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_KNN.loc[results_KNN['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_KNN, height=6, col="fselector__k", row="fselector__score_func", hue = "knn__p")
g.map(plt.plot, "knn__n_neighbors", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("KNN performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
For `f_classif` with k = 5, we spot that the score peaks at n_neighbors = 50 (Minkowski) or 75 (Manhattan). For the other `f_classif` parameter combinations, although the trend is highest when n_neighbors > 150, we presume that would be overfitting for the dataset given there are only 733 observations in D_train. For `mutual_info_classif`, the score fluctuates between 0.78 and 0.82 across parameter combinations; as there is no obvious upward trend in the tail, this suggests the tuning is complete for the KNN model.
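For reference, the winning combination and its mean cross-validated AUC can also be read directly from the fitted search object (a minimal check, assuming `gs_pipe_KNN` above has been fitted):
# Best KNN parameter combination and its mean cross-validated AUC (sketch).
print(gs_pipe_KNN.best_params_)
print(round(gs_pipe_KNN.best_score_, 4))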
1. Using grid search for DT hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a DT model with both `gini` and `entropy` criterion, `min_samples_split` values in {10, 50, 60, 70, 80, 100, 150}, `max_depth` in {3, 5, 7, 9}
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(random_state=random_state)
pipe_DT = Pipeline([('fselector', SelectKBest()),
('dt', dt_classifier)])
params_pipe_DT = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'dt__criterion': ['gini', 'entropy'],
'dt__min_samples_split': [ 10, 50, 60, 70, 80, 100, 150],
'dt__max_depth': [3, 5, 7, 9]}
gs_pipe_DT = GridSearchCV(estimator=pipe_DT,
param_grid=params_pipe_DT,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_DT.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_DT = get_search_results(gs_pipe_DT)
results_DT.head()
results_DT = pd.DataFrame(gs_pipe_DT.cv_results_['params'])
results_DT['test_score'] = gs_pipe_DT.cv_results_['mean_test_score']
results_DT_gini = results_DT[results_DT['dt__criterion']=='gini']
results_DT_entropy = results_DT[results_DT['dt__criterion']=='entropy']
results_DT_gini.columns
results_DT_gini['fselector__score_func'] = results_DT_gini['fselector__score_func'].astype(str)
results_DT_gini.loc[results_DT_gini['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_DT_gini.loc[results_DT_gini['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_DT_gini, height=6, col="fselector__k", row="fselector__score_func", hue="dt__max_depth")
g.map(plt.plot, "dt__min_samples_split", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Decision Tree performance comparison (Gini index)", size=18)
g.fig.subplots_adjust(top=.9)
results_DT_entropy['fselector__score_func'] = results_DT_entropy['fselector__score_func'].astype(str)
results_DT_entropy.loc[results_DT_entropy['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_DT_entropy.loc[results_DT_entropy['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
g = sns.FacetGrid(results_DT_entropy, height=6, col="fselector__k", row="fselector__score_func", hue="dt__max_depth")
g.map(plt.plot, "dt__min_samples_split", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Decision Tree performance comparison (Entropy)", size=18)
g.fig.subplots_adjust(top=.9)
We spot that the trend in all the graphs turns downward once min_samples_split > 100 for both the `gini` and `entropy` criteria, which suggests we do not need to add any values above 100 for min_samples_split.
The scores vary across parameter combinations when min_samples_split is between 50 and 100, so we added more values (60, 70 and 80) in that range to complete the tuning of the decision tree model.
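It can also be informative to see which features `SelectKBest` kept inside the best decision tree pipeline; a small sketch (assuming `gs_pipe_DT` and `Data_df` from the steps above are in scope):
# Selected feature names and best parameters of the tuned decision tree pipeline (sketch).
best_fs_DT = gs_pipe_DT.best_estimator_.named_steps['fselector']
print(gs_pipe_DT.best_params_)
print(Data_df.columns[best_fs_DT.get_support()].tolist())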
1. Using grid search for Random Forest hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Random Forest model with `n_estimators` values in {5, 10, 20, 50, 100, 150, 200}, `max_depth` values in {5, 7, 10, 12}.
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=999)
pipe_RF = Pipeline([('fselector', SelectKBest()),
('rf', rf_classifier)])
params_pipe_RF = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'rf__n_estimators': [5, 10, 20, 50, 100, 150, 200],
'rf__max_depth': [5, 7, 10, 12]}
gs_pipe_RF = GridSearchCV(estimator=pipe_RF,
param_grid=params_pipe_RF,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_RF.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_RF = get_search_results(gs_pipe_RF)
results_RF.head()
results_RF = pd.DataFrame(gs_pipe_RF.cv_results_['params'])
results_RF['test_score'] = gs_pipe_RF.cv_results_['mean_test_score']
results_RF['fselector__score_func'] = results_RF['fselector__score_func'].astype(str)
results_RF.loc[results_RF['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_RF.loc[results_RF['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_RF, height=6,col="fselector__k", row="fselector__score_func", hue="rf__max_depth")
g.map(plt.plot, "rf__n_estimators", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Random Forest performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
For `f_classif`, we spot that the trend flattens once n_estimators >= 25 or 50; for `mutual_info_classif`, the score fluctuates a bit more, but the trend flattens as well. As there are only 733 observations in D_train, going beyond 200 for n_estimators would not make sense, so from the graphs above we can conclude that tuning is complete for the random forest model.
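To see which of the selected features the tuned random forest relies on most, we can match its feature importances to the retained feature names (a sketch, assuming `gs_pipe_RF` and `Data_df` are in scope):
# Feature importances of the best random forest, indexed by the retained feature names (sketch).
best_pipe_RF = gs_pipe_RF.best_estimator_
selected_RF = Data_df.columns[best_pipe_RF.named_steps['fselector'].get_support()]
rf_importances = pd.Series(best_pipe_RF.named_steps['rf'].feature_importances_, index=selected_RF)
print(rf_importances.sort_values(ascending=False).round(3))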
1. Using grid search for Naive Bayes hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Naive Bayes model with `var_smoothing` values in a log space from $10^{1}$ to $10^{-9}$.
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
nb_classifier = GaussianNB()
pipe_NB = Pipeline([('fselector', SelectKBest()),
('nb', nb_classifier)])
params_pipe_NB = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'nb__var_smoothing': np.logspace(1,-9, num=100)}
gs_pipe_NB = GridSearchCV(estimator=pipe_NB,
param_grid=params_pipe_NB,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
Data_transformed = PowerTransformer().fit_transform(D_train)
gs_pipe_NB.fit(Data_transformed, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_NB = get_search_results(gs_pipe_NB)
results_NB.head()
results_NB = pd.DataFrame(gs_pipe_NB.cv_results_['params'])
results_NB['test_score'] = gs_pipe_NB.cv_results_['mean_test_score']
results_NB['fselector__score_func'] = results_NB['fselector__score_func'].astype(str)
results_NB.loc[results_NB['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_NB.loc[results_NB['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_NB, height=6,col ="fselector__k", hue ="fselector__score_func")
g.map(plt.plot, "nb__var_smoothing", "test_score", marker=".", alpha=.7)
g.add_legend()
g.fig.suptitle("Naive Bayes performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
We spot that the trend for `f_classif` is very steady during parameter tuning, while for `mutual_info_classif` the score does fluctuate. Naive Bayes assumes that the descriptive features follow a normal distribution; after power-transforming the data, searching `var_smoothing` over the log space from $10^{1}$ to $10^{-9}$ produces some variation in the `mutual_info_classif` score but no significant improvement, which suggests there is no need to tune the Naive Bayes model any further.
1. Using grid search for Logistic Regression hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve".
- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Logistic Regression model with `penalty` values in {l1, l2}, `C` values in {0.1, 1, 10, 100, 500, 1000, 1500} and class_weight value = `balanced`
from sklearn.linear_model import LogisticRegression
# liblinear supports both the l1 and l2 penalties used in the grid below
lr = LogisticRegression(solver='liblinear', random_state=random_state)
pipe_LR = Pipeline([('fselector', SelectKBest()),
('lr', lr)])
params_pipe_LR = {'fselector__score_func': [f_classif, mutual_info_classif],
'fselector__k': [5, 10, 20],
'lr__penalty': ['l1','l2'],
'lr__C': [0.1,1,10,100, 500,1000, 1500],
'lr__class_weight': ['balanced']}
gs_pipe_LR = GridSearchCV(estimator=pipe_LR,
param_grid=params_pipe_LR,
cv=cv_method,
n_jobs=-2,
scoring='roc_auc',
verbose=1)
gs_pipe_LR.fit(D_train, t_train);
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
results_LR = get_search_results(gs_pipe_LR)
results_LR.head()
results_LR = pd.DataFrame(gs_pipe_LR.cv_results_['params'])
results_LR['test_score'] = gs_pipe_LR.cv_results_['mean_test_score']
results_LR['fselector__score_func'] = results_LR['fselector__score_func'].astype(str)
results_LR.loc[results_LR['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif"
results_LR.loc[results_LR['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
g = sns.FacetGrid(results_LR, height=6,col="fselector__k", row="fselector__score_func", hue="lr__penalty")
g.map(plt.plot, "lr__C", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Logistic Regression performance comparison", size=18)
g.fig.subplots_adjust(top=.9)
We spot that the trend for `f_classif` flattens once C = 1; for `mutual_info_classif`, the score fluctuates but does not go beyond 0.83 for C values from 10 to 1500, which suggests that tuning is complete for the logistic regression model.
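Similarly, the coefficients of the tuned logistic regression can be matched to the retained feature names to see which features drive the prediction (a sketch, assuming `gs_pipe_LR` and `Data_df` are in scope):
# Coefficients of the best logistic regression, indexed by the retained feature names (sketch).
best_pipe_LR = gs_pipe_LR.best_estimator_
selected_LR = Data_df.columns[best_pipe_LR.named_steps['fselector'].get_support()]
lr_coefs = pd.Series(best_pipe_LR.named_steps['lr'].coef_[0], index=selected_LR)
print(lr_coefs.round(3))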
1. Compare the metrics by generating the classification reports for all 5 models with their best estimators, using the **unseen** test portion of the `hold-out` split that we prepared earlier.
# KNN
clf_best = gs_pipe_KNN.best_estimator_
predictions = clf_best.predict(D_test)
print("KNN :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#DT
clf_best = gs_pipe_DT.best_estimator_
predictions = clf_best.predict(D_test)
print("Decision Tree :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#RF
clf_best = gs_pipe_RF.best_estimator_
predictions = clf_best.predict(D_test)
print("Random Forest :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#NB
clf_best = gs_pipe_NB.best_estimator_
predictions = clf_best.predict(D_test)
print("Naive Bayes :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
#LR
clf_best = gs_pipe_LR.best_estimator_
predictions = clf_best.predict(D_test)
print("Logistic Regression :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
Since `match` = 1 is the positive label, the confusion matrices above can be interpreted as
$$CM = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$$
where TN denotes True Negative, TP denotes True Positive, FP denotes False Positive and FN denotes False Negative.
In the classification report:
$Precision_{0}=\frac {TN} {(TN+FN)} $, $Precision_{1}=\frac {TP} {(TP+FP)} $, $Macro Avg_{Precision}=\frac {Precision_{0} + Precision_{1}} {2} $
$WeightedAvg_{Precision}=\frac {Precision_{0} * Number_{0} + Precision_{1} * Number_{1}} {Number_{0} + Number_{1}} $
$Accuracy=\frac {TP + TN } {(TP+TN+FP+FN)} $
$Recall_{0}=\frac {TN} {(TN+FP)} $, $Recall_{1}=\frac {TP} {(TP+FN)} $
$F1_{0}=\frac {2 * Precision_{0} * Recall_{0}} {Precision_{0} + Recall_{0}} $, $F1_{1}=\frac {2 * Precision_{1} * Recall_{1}} {Precision_{1} + Recall_{1}} $, $Macro Avg_{F1}=\frac {F1_{0} + F1_{1}} {2} $, $Weighted Avg_{F1}=\frac {F1_{0} * Number_{0} + F1_{1} * Number_{1}} {Number_{0} + Number_{1}} $
For example, $$CM_{KNN} = \begin{bmatrix} 252 & 7 \\ 38 & 18 \end{bmatrix}$$
TN = 252, FP = 7, FN = 38, TP = 18, $Number_{0}$ = 259, $Number_{1}$ = 56
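As a quick numerical check of the formulas above, we can recompute the report entries from this confusion matrix (a small sketch using the KNN values):
# Recompute the classification-report metrics from the KNN confusion matrix (sketch).
TN, FP, FN, TP = 252, 7, 38, 18
precision_0, precision_1 = TN / (TN + FN), TP / (TP + FP)
recall_0, recall_1 = TN / (TN + FP), TP / (TP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (f1_0 * (TN + FP) + f1_1 * (FN + TP)) / (TN + FP + FN + TP)
print(round(accuracy, 3), round(f1_0, 3), round(f1_1, 3), round(macro_f1, 3), round(weighted_f1, 3))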
From the classification reports of all the models, we can summarize that for Naive Bayes and Logistic Regression a large number of non-match observations are classified incorrectly as matches, which is why the accuracy of these models is lower. Thus we found that the best estimators of Naive Bayes and Logistic Regression can distinguish positive labels better than the other models, while KNN, Decision Tree and Random Forest make fewer mistakes when predicting negative labels on the unseen test dataset.
2. Prepare a dataframe with the `fpr` and `tpr` for all 5 models (using their best estimators) by calculating the prediction scores with the `predict_proba` method in scikit-learn, and plot the ROC curves of all the models in the same graph.
clf_best = gs_pipe_KNN.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
KNN_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
KNN_df['model']='KNN'
neworder = ['model','fpr','tpr']
KNN_df=KNN_df.reindex(columns=neworder)
clf_best = gs_pipe_DT.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
DT_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
DT_df['model']='Decision Tree'
neworder = ['model','fpr','tpr']
DT_df=DT_df.reindex(columns=neworder)
clf_best = gs_pipe_RF.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
RF_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
RF_df['model']='Random Forest'
neworder = ['model','fpr','tpr']
RF_df=RF_df.reindex(columns=neworder)
clf_best = gs_pipe_NB.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
NB_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
NB_df['model']='Naive Bayes'
neworder = ['model','fpr','tpr']
NB_df=NB_df.reindex(columns=neworder)
clf_best = gs_pipe_LR.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
LR_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
LR_df['model']='Logistic Regression'
neworder = ['model','fpr','tpr']
LR_df=LR_df.reindex(columns=neworder)
result = pd.concat([KNN_df, DT_df, RF_df, NB_df, LR_df], ignore_index=True, sort=False)
x=np.linspace(0.0,1.0,101)
plt.figure(figsize=(10, 10))
sns.lineplot(x="fpr", y="tpr", hue="model", data=result)
plt.plot(x, x, color='black')
plt.title("Comparison of Roc Curve of 5 classifiers", fontsize = 16)
plt.show()
From the ROC curves, we found that apart from the decision tree, the other 4 models perform similarly on the unseen dataset (an ideal model would reach fpr = 0 and tpr = 1).
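The curves can be complemented with the test-set AUC value of each tuned pipeline (a hedged sketch, assuming the `gs_pipe_*` objects fitted earlier are still in scope):
# Test-set AUC of each best estimator (sketch).
for name, gs in [('KNN', gs_pipe_KNN), ('Decision Tree', gs_pipe_DT),
                 ('Random Forest', gs_pipe_RF), ('Naive Bayes', gs_pipe_NB),
                 ('Logistic Regression', gs_pipe_LR)]:
    scores = gs.best_estimator_.predict_proba(D_test)[:, 1]
    print(f"{name:20s} test AUC = {metrics.roc_auc_score(t_test, scores):.3f}")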
3. Let's use a paired t-test to compare the mean scores of the models against each other and look for significant differences. We set a different random seed this time.
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=3,
random_state=222)
cv_results_KNN = cross_val_score(estimator=gs_pipe_KNN.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_DT = cross_val_score(estimator=gs_pipe_DT.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_RF = cross_val_score(estimator=gs_pipe_RF.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_NB = cross_val_score(estimator=gs_pipe_NB.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
cv_results_LR = cross_val_score(estimator=gs_pipe_LR.best_estimator_,
X=Data,
y=target,
cv=cv_method_ttest,
n_jobs=-2,
scoring='roc_auc')
print("KNN cross validation mean score:", cv_results_KNN.mean().round(3))
print("Decision Tree cross validation mean score:", cv_results_DT.mean().round(3))
print("Random Forest cross validation mean score:", cv_results_RF.mean().round(3))
print("Naive Bayes cross validation mean score:", cv_results_NB.mean().round(3))
print("Logistic Regression cross validation mean score:", cv_results_LR.mean().round(3))
print("==================================================================")
print("Difference between KNN and Decisioin Tree: ", stats.ttest_rel(cv_results_KNN, cv_results_DT).pvalue)
print("Difference between KNN and Random Forest: ",stats.ttest_rel(cv_results_KNN, cv_results_RF).pvalue)
print("Difference between KNN and Naive Bayes: ",stats.ttest_rel(cv_results_KNN, cv_results_NB).pvalue)
print("Difference between KNN and Logistic Regression: ", stats.ttest_rel(cv_results_KNN, cv_results_LR).pvalue)
print("==================================================================")
print("Difference between Decision Tree and Random Forest: ", stats.ttest_rel(cv_results_DT, cv_results_RF).pvalue)
print("Difference between Decision Tree and Naive Bayes: ", stats.ttest_rel(cv_results_DT, cv_results_NB).pvalue)
print("Difference between Decision Tree and Logistic Regression: ", stats.ttest_rel(cv_results_DT, cv_results_LR).pvalue)
print("==================================================================")
print("Difference between Random Forest and Naive Bayes: ", stats.ttest_rel(cv_results_RF, cv_results_NB).pvalue)
print("Difference between Random Forest and Logistic Regression: ", stats.ttest_rel(cv_results_RF, cv_results_LR).pvalue)
print("Difference between Logistic Regression and Naive Bayes: ", stats.ttest_rel(cv_results_NB, cv_results_LR).pvalue)
At the 95% confidence level, we found that only the decision tree shows a significant difference (p-value < 0.05) in mean AUC score compared to the other models.
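For readability, the pairwise p-values above can also be collected into a single table (a sketch using the `cv_results_*` arrays computed above):
# Pairwise paired t-test p-values between the cross-validated AUC scores of the models (sketch).
from itertools import combinations
cv_scores = {'KNN': cv_results_KNN, 'Decision Tree': cv_results_DT, 'Random Forest': cv_results_RF,
             'Naive Bayes': cv_results_NB, 'Logistic Regression': cv_results_LR}
pairwise_pvals = pd.DataFrame([{'model A': a, 'model B': b,
                                'p-value': stats.ttest_rel(cv_scores[a], cv_scores[b]).pvalue}
                               for a, b in combinations(cv_scores, 2)])
print(pairwise_pvals.round(4))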
One of the biggest limitations of our project is the handling of missing values: we dropped all the observations with missing values. Below are the top 3 features with the most missing values:
* expected_num_interested_in_me, 6578 observations with missing values
* expected_num_matches, 1173 observations with missing values
* shared_interests_partner, 1067 observations with missing values
If we fit the whole dataset to select the best 5 features with `f_classif`:
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=5)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:5]
fs_indices_fscore
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
None of the top 3 features with missing values would be selected.
If we fit the whole dataset to select the best 10 features with `f_classif` and `mutual_info_classif`:
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=10)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
fs_indices_fscore
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
fs_fit_fscore = fs.SelectKBest(fs.mutual_info_classif, k=10)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
fs_indices_fscore
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
`expected_num_matches` and `shared_interests_partner` would then be selected, but `expected_num_interested_in_me` would still not be among the top 10 selections under either the `mutual_info_classif` or the `f_classif` metric. It would be interesting to compare the whole analysis after removing the `expected_num_interested_in_me` feature before dropping any observations with missing values; we would have a much larger dataset, and we believe the results would change dramatically.
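As a rough check of how much data this would retain, we can re-read the raw CSV and compare row counts (a sketch, reusing `speed_url` from the import step; the cleaned `speed_df` above is left untouched):
# Rows retained with vs without the expected_num_interested_in_me column before dropna (sketch).
raw_df = pd.read_csv(io.StringIO(requests.get(speed_url).content.decode('utf-8')))
print("dropna with all columns:   ", raw_df.dropna().shape)
print("dropna without the column: ", raw_df.drop(columns=['expected_num_interested_in_me']).dropna().shape)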
Also, we deleted all the binned features; however, from reviewing the two-variable plots we find the values in `interests_correlate` are not discrete, so we might also compare the analysis if we bin this feature instead. If time and resources allowed, it would also be interesting to use all the descriptive features to train each of our classifiers and see whether the parameter tuning would differ.
On the other hand, cross-validating all the parameters of the 5 chosen models is a strength of our investigation: when we compare the mean scores obtained by cross-validating with a different random seed, they stay close to those of the best estimators found with the original random seed on the training dataset.
The main goal of this analysis is to predict whether a person would match (go on a second date) with another person based on the data collected during speed dating.
We found that visualizing the data helped us considerably throughout the analysis.
From the trained models with optimal parameters, we find that Naive Bayes distinguishes matches best. However, if we applied this model to help the speed-dating organizer identify matches on unseen data, we might end up contacting too many couples that would not actually match (for example, on the test data set we would contact 188 + 55 potential match cases, only to find that the 188 false-positive cases were wasted effort).
In contrast, if we used the KNN, Decision Tree or Random Forest model, we would miss a high percentage of match cases on the test data set (e.g. for KNN, we would miss 67.8% (38/56) of the potential matches).
Even though the AUC score from the above analysis is not low despite the imbalanced class label, if we needed to deploy this model for real we would strongly suggest training and fine-tuning it on a larger dataset (dropping the `expected_num_interested_in_me` feature before removing observations with missing values would be one option for retaining more data). Simply feeding in the sampled test data already revealed discrepancies in each model, which prevents us from achieving our goal in an overall satisfactory way. Reviewing the survey questions and improving the relevance of the descriptive features could also enhance the tuning process and yield a better model. In short, for industry deployment it is essential to repeatedly train and fine-tune a classifier that achieves higher accuracy in detecting both the positive and the negative class labels on unseen data, in order to meet the business goal.