Please refer to the Jupyter NBExtension README page to display
the table of contents in a floating window and expand the body of the contents.

S3806940_Project_for_portfolio

Executive Summary: Predicting User Behaviour in Speed Dating Events

The dataset was created from participants in experimental speed-dating events held from 2002 to 2004. During these events, each participant had a four-minute "first date" with every other participant of the opposite sex, after which they rated their partner and indicated whether they would like to see their date again. If so, the first date was classified as a positive match.

Data Source

The dataset is sourced from https://github.com/vaksakalli/datasets. Each observation (row) represents a single dating event, and each variable (column) provides:

  • particular characteristics/preferences of the participant or his/her partner.
  • ratings by the participant on the characteristics/importance of certain values, for himself/herself or for his/her partner.

All columns are identified as descriptive features except the last column, match, which is our target feature. Descriptive features are used to classify the values of the target feature. The features in this dataset are listed below:

Descriptive Features

  • has_null Whether the row has any null values
  • wave Experiment numbers
  • gender Gender of self
  • age Age of self
  • age_o Age of partner
  • d_age Difference in age
  • d_d_age Binned values of difference in age, 4 groups:[0-1], [2-3], [4-6] and [7-37]
  • race Race of self
  • race_o Race of partner
  • samerace Whether the two persons have the same race or not.
  • importance_same_race How important is it that partner is of same race?
  • importance_same_religion How important is it that partner has same religion?
  • d_importance_same_race Binned values of importance of same race
  • d_importance_same_religion Binned values of importance of same religion
  • field Field of study
  • pref_o_attractive How important does partner rate attractiveness
  • pref_o_sincere How important does partner rate sincerity
  • pref_o_intelligence How important does partner rate intelligence
  • pref_o_funny How important does partner rate being funny
  • pref_o_ambitious How important does partner rate ambition
  • pref_o_shared_interests How important does partner rate having shared interests
  • d_pref_o_attractive Binned values for how important does partner rate attractiveness, 3 groups: [0-15], [16-20] and [21-100]
  • d_pref_o_sincere Binned values for how important does partner rate sincerity, 3 groups: [0-15], [16-20] and [21-100]
  • d_pref_o_intelligence Binned values for how important does partner rate intelligence, 3 groups: [0-15], [16-20] and [21-100]
  • d_pref_o_funny Binned values for how important does partner rate being funny, 3 groups: [0-15], [16-20] and [21-100]
  • d_pref_o_ambitious Binned values for how important does partner rate ambition, 3 groups: [0-15], [16-20] and [21-100]
  • d_pref_o_shared_interests Binned values for how important does partner rate having shared interests, 3 groups: [0-15], [16-20] and [21-100]
  • attractive_o Rating by partner (about me) at night of event on attractiveness
  • sincere_o Rating by partner (about me) at night of event on sincerity
  • intelligence_o Rating by partner (about me) at night of event on intelligence
  • funny_o Rating by partner (about me) at night of event on being funny
  • ambitous_o Rating by partner (about me) at night of event on being ambitious
  • shared_interests_o Rating by partner (about me) at night of event on shared interest
  • d_attractive_o Binned values for rating by partner (about me) at night of event on attractiveness, 3 groups: [0-5], [6-8] and [9-10]
  • d_sinsere_o Binned values for rating by partner (about me) at night of event on sincerity, 3 groups: [0-5], [6-8] and [9-10]
  • d_intelligence_o Binned values for rating by partner (about me) at night of event on intelligence, 3 groups: [0-5], [6-8] and [9-10]
  • d_funny_o Binned values for rating by partner (about me) at night of event on being funny, 3 groups: [0-5], [6-8] and [9-10]
  • d_ambitous_o Binned values for rating by partner (about me) at night of event on being ambitious, 3 groups: [0-5], [6-8] and [9-10]
  • d_shared_interests_o Binned values for rating by partner (about me) at night of event on shared interests, 3 groups: [0-5], [6-8] and [9-10]
  • attractive_important What do you look for in a partner - attractiveness
  • sincere_important What do you look for in a partner - sincerity
  • intellicence_important What do you look for in a partner - intelligence
  • funny_important What do you look for in a partner - being funny
  • ambtition_important What do you look for in a partner - ambition
  • shared_interests_important What do you look for in a partner - shared interests
  • d_attractive_important Binned values for what do you look for in a partner - attractiveness, 3 groups: [0-15], [16-20] and [21-100]
  • d_sincere_important Binned values for what do you look for in a partner - sincerity, 3 groups: [0-15], [16-20] and [21-100]
  • d_intellicence_important Binned values for what do you look for in a partner - intelligence, 3 groups: [0-15], [16-20] and [21-100]
  • d_funny_important Binned values for what do you look for in a partner - being funny, 3 groups: [0-15], [16-20] and [21-100]
  • d_ambtition_important Binned values for what do you look for in a partner - ambition, 3 groups: [0-15], [16-20] and [21-100]
  • d_shared_interests_important Binned values for what do you look for in a partner - shared interests, 3 groups: [0-15], [16-20] and [21-100]
  • attractive Rate yourself - attractiveness
  • sincere Rate yourself - sincerity
  • intelligence Rate yourself - intelligence
  • funny Rate yourself - being funny
  • ambition Rate yourself - ambition
  • d_attractive Binned values for rate yourself - attractiveness, 3 groups: [0-5], [6-8] and [9-10]
  • d_sincere Binned values for rate yourself - sincerity, 3 groups: [0-5], [6-8] and [9-10]
  • d_intelligence Binned values for rate yourself - intelligence, 3 groups: [0-5], [6-8] and [9-10]
  • d_funny Binned values for rate yourself - being funny, 3 groups: [0-5], [6-8] and [9-10]
  • d_ambition Binned values for Rate yourself - ambition, 3 groups: [0-5], [6-8] and [9-10]
  • attractive_partner Rate your partner - attractiveness
  • sincere_partner Rate your partner - sincerity
  • intelligence_partner Rate your partner - intelligence
  • funny_partner Rate your partner - being funny
  • ambition_partner Rate your partner - ambition
  • shared_interests_partner Rate your partner - shared interests
  • d_attractive_partner Binned values for rate your partner - attractiveness, 3 groups: [0-5], [6-8] and [9-10]
  • d_sincere_partner Binned values for rate your partner - sincerity, 3 groups: [0-5], [6-8] and [9-10]
  • d_intelligence_partner Binned values for rate your partner - intelligence, 3 groups: [0-5], [6-8] and [9-10]
  • d_funny_partner Binned values for rate your partner - being funny, 3 groups: [0-5], [6-8] and [9-10]
  • d_ambition_partner Binned values for Rate your partner - ambition, 3 groups: [0-5], [6-8] and [9-10]
  • d_shared_interests_partner Binned values for Rate your partner - shared interest, 3 groups: [0-5], [6-8] and [9-10]
  • sports Your own interests
  • tvsports Your own interests
  • exercise Your own interests
  • dining Your own interests
  • museums Your own interests
  • art Your own interests
  • hiking Your own interests
  • gaming Your own interests
  • clubbing Your own interests
  • reading Your own interests
  • tv Your own interests
  • theater Your own interests
  • movies Your own interests
  • concerts Your own interests
  • music Your own interests
  • shopping Your own interests
  • yoga Your own interests
  • d_sports Binned values for Your own interests - sports, 3 groups: [0-5], [6-8] and [9-10]
  • d_tvsports Binned values for Your own interests - tvsports, 3 groups: [0-5], [6-8] and [9-10]
  • d_exercise Binned values for Your own interests - exercise, 3 groups: [0-5], [6-8] and [9-10]
  • d_dining Binned values for Your own interests - dining, 3 groups: [0-5], [6-8] and [9-10]
  • d_museums Binned values for Your own interests - museums, 3 groups: [0-5], [6-8] and [9-10]
  • d_art Binned values for Your own interests - art, 3 groups: [0-5], [6-8] and [9-10]
  • d_hiking Binned values for Your own interests - hiking, 3 groups: [0-5], [6-8] and [9-10]
  • d_gaming Binned values for Your own interests - gaming, 3 groups: [0-5], [6-8] and [9-10]
  • d_clubbing Binned values for Your own interests - clubbing, 3 groups: [0-5], [6-8] and [9-10]
  • d_reading Binned values for Your own interests - reading, 3 groups: [0-5], [6-8] and [9-10]
  • d_tv Binned values for Your own interests - tv, 3 groups: [0-5], [6-8] and [9-10]
  • d_theater Binned values for Your own interests - theatre, 3 groups: [0-5], [6-8] and [9-10]
  • d_movies Binned values for Your own interests - movies, 3 groups: [0-5], [6-8] and [9-10]
  • d_concerts Binned values for Your own interests - concerts, 3 groups: [0-5], [6-8] and [9-10]
  • d_music Binned values for Your own interests - music, 3 groups: [0-5], [6-8] and [9-10]
  • d_shopping Binned values for Your own interests - shopping, 3 groups: [0-5], [6-8] and [9-10]
  • d_yoga Binned values for Your own interests - yoga, 3 groups: [0-5], [6-8] and [9-10]
  • interests_correlate Correlation between participant’s and partner’s ratings of interests.
  • d_interests_correlate Binned values for correlation between participant’s and partner’s ratings of interests, 3 groups: [-1-0], [0-0.33] and [0.33-1]
  • expected_happy_with_sd_people How happy do you expect to be with the people you meet during the speed-dating event?
  • expected_num_interested_in_me Out of the 20 people you will meet, how many do you expect will be interested in dating you?
  • expected_num_matches How many matches do you expect to get?
  • d_expected_happy_with_sd_people Binned values for how happy do you expect to be with the people you meet during the speed-dating event? 3 groups: [0-4], [5-6] and [7-10]
  • d_expected_num_interested_in_me Binned values for out of the 20 people you will meet, how many do you expect will be interested in dating you, 3 groups: [0-3], [4-9] and [10-20]
  • d_expected_num_matches Binned values for How many matches do you expect to get, 3 groups: [0-2], [3-5] and [5-18]
  • like Did you like your partner?
  • guess_prob_liked How likely do you think your partner likes you?
  • d_like Binned values for did you like your partner, 3 groups: [0-5], [6-8] and [9-10]
  • d_guess_prob_liked Binned values for how likely do you think your partner likes you, 3 groups: [0-4], [5-6] and [7-10]
  • met Have you met your partner before?

Target Feature

match is our target feature and is binary: '1' is the positive class and denotes a match, while '0' denotes no match.
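Before fitting any models, it is worth checking how balanced the binary target is; a minimal sketch with toy values (illustrative only, not the actual dataset) using pandas `value_counts`:

```python
import pandas as pd

# Toy stand-in for the 'match' target column; values are illustrative only
match = pd.Series([1, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="match")

# Relative class frequencies: a strong skew toward the negative class
# means plain accuracy can be a misleading performance metric
proportions = match.value_counts(normalize=True)
print(proportions)
```

Running the same check on `speed_df["match"]` would show the actual class balance of the dataset.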

Goal and Objectives

The objective of this analysis is to predict whether a person would match (agree to a second date) with another person, based on the data collected during their short interaction (a four-minute first date).

To do this, we fit the processed dataset and compare the prediction results from different classifiers to determine which is the best model after parameter tuning. Five classifiers are used in the analysis:

  1. KNN
  2. Decision Tree
  3. Random Forest
  4. Naive Bayes
  5. Logistic Regression
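The comparison loop for these five classifiers can be sketched as below, using synthetic data in place of the processed dataset and default hyperparameters (parameter tuning in the actual analysis would replace these defaults):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the processed speed-dating features
X, y = make_classification(n_samples=300, n_features=10, random_state=999)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=999),
    "Random Forest": RandomForestClassifier(random_state=999),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each candidate model
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In the actual analysis, `X` and `y` would come from the pre-processed `speed_df`.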

Data Pre-processing

1.  We import all the necessary packages. 
2.  We set the display options and random seeds.
3.  We import the CSV file directly from the GitHub URL into the dataframe `speed_df`.
In [1]:
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
sns.set(style="darkgrid")


import numpy as np
import pandas as pd
import io
import requests
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from sklearn import metrics
from scipy import stats


np.random.seed(999)
random_state=999

# so that we can see all the columns
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)

import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

speed_url = 'https://raw.githubusercontent.com/vaksakalli/datasets/master/speed_dating.csv'
url_content = requests.get(speed_url).content
speed_df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
4.  Once the dataframe is created, we check whether the number of rows and columns of the imported data matches the CSV file.
In [2]:
speed_df.shape
Out[2]:
(8378, 121)
5. We then randomly select 10 rows to get an overview of the dataset's content.
In [3]:
display(HTML('<b>Table 1: Sample of Speed Dating Dataset </b>'))
speed_df.sample(n=10, random_state=8)
Table 1: Sample of Speed Dating Dataset
Out[3]:
has_null wave gender age age_o d_age d_d_age race race_o samerace importance_same_race importance_same_religion d_importance_same_race d_importance_same_religion field pref_o_attractive pref_o_sincere pref_o_intelligence pref_o_funny pref_o_ambitious pref_o_shared_interests d_pref_o_attractive d_pref_o_sincere d_pref_o_intelligence d_pref_o_funny d_pref_o_ambitious d_pref_o_shared_interests attractive_o sinsere_o intelligence_o funny_o ambitous_o shared_interests_o d_attractive_o d_sinsere_o d_intelligence_o d_funny_o d_ambitous_o d_shared_interests_o attractive_important sincere_important intellicence_important funny_important ambtition_important shared_interests_important d_attractive_important d_sincere_important d_intellicence_important d_funny_important d_ambtition_important d_shared_interests_important attractive sincere intelligence funny ambition d_attractive d_sincere d_intelligence d_funny d_ambition attractive_partner sincere_partner intelligence_partner funny_partner ambition_partner shared_interests_partner d_attractive_partner d_sincere_partner d_intelligence_partner d_funny_partner d_ambition_partner d_shared_interests_partner sports tvsports exercise dining museums art hiking gaming clubbing reading tv theater movies concerts music shopping yoga d_sports d_tvsports d_exercise d_dining d_museums d_art d_hiking d_gaming d_clubbing d_reading d_tv d_theater d_movies d_concerts d_music d_shopping d_yoga interests_correlate d_interests_correlate expected_happy_with_sd_people expected_num_interested_in_me expected_num_matches d_expected_happy_with_sd_people d_expected_num_interested_in_me d_expected_num_matches like guess_prob_liked d_like d_guess_prob_liked met match
1670 0 5 female 21.0 22.0 1 [0-1] European/Caucasian-American European/Caucasian-American 1 3.0 8.0 [2-5] [6-10] Economics 25.00 40.00 15.00 10.00 5.00 5.00 [21-100] [21-100] [0-15] [0-15] [0-15] [0-15] 6.0 8.0 9.0 9.0 8.0 6.0 [6-8] [6-8] [9-10] [9-10] [6-8] [6-8] 15.00 15.00 25.00 25.00 15.00 5.00 [0-15] [0-15] [21-100] [21-100] [0-15] [0-15] 8.0 10.0 10.0 8.0 9.0 [6-8] [9-10] [9-10] [6-8] [9-10] 6.0 9.0 8.0 7.0 7.0 6.0 [6-8] [9-10] [6-8] [6-8] [6-8] [6-8] 5.0 3.0 7.0 10.0 9.0 8.0 7.0 2.0 6.0 8.0 7.0 10.0 10.0 9.0 10.0 10.0 7.0 [0-5] [0-5] [6-8] [9-10] [9-10] [6-8] [6-8] [0-5] [6-8] [6-8] [6-8] [9-10] [9-10] [9-10] [9-10] [9-10] [6-8] 0.44 [0.33-1] 7.0 20.0 0.0 [7-10] [10-20] [0-2] 7.0 7.0 [6-8] [7-10] 0.0 1
1823 1 5 male NaN 20.0 20 [7-37] European/Caucasian-American 'Latino/Hispanic American' 0 1.0 1.0 [0-1] [0-1] Economics 10.00 10.00 35.00 35.00 8.00 2.00 [0-15] [0-15] [21-100] [21-100] [0-15] [0-15] 5.0 7.0 2.0 2.0 4.0 2.0 [0-5] [6-8] [0-5] [0-5] [0-5] [0-5] 40.00 20.00 20.00 20.00 NaN NaN [21-100] [16-20] [16-20] [16-20] [0-15] [0-15] 8.0 8.0 8.0 8.0 8.0 [6-8] [6-8] [6-8] [6-8] [6-8] 9.0 8.0 8.0 10.0 9.0 8.0 [9-10] [6-8] [6-8] [9-10] [9-10] [6-8] 7.0 4.0 7.0 9.0 4.0 5.0 9.0 2.0 6.0 9.0 4.0 4.0 7.0 6.0 6.0 4.0 6.0 [6-8] [0-5] [6-8] [9-10] [0-5] [0-5] [9-10] [0-5] [6-8] [9-10] [0-5] [0-5] [6-8] [6-8] [6-8] [0-5] [6-8] 0.36 [0.33-1] 10.0 10.0 5.0 [7-10] [10-20] [3-5] 8.0 8.0 [6-8] [7-10] 0.0 0
4526 1 12 female 24.0 23.0 1 [0-1] 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 1.0 1.0 [0-1] [0-1] 'Social Work' 25.00 5.00 30.00 15.00 5.00 20.00 [21-100] [0-15] [21-100] [0-15] [0-15] [16-20] 5.0 7.0 8.0 7.0 6.0 3.0 [0-5] [6-8] [6-8] [6-8] [6-8] [0-5] 10.00 40.00 10.00 20.00 10.00 10.00 [0-15] [21-100] [0-15] [16-20] [0-15] [0-15] 6.0 9.0 9.0 8.0 9.0 [6-8] [9-10] [9-10] [6-8] [9-10] 8.0 9.0 9.0 8.0 9.0 10.0 [6-8] [9-10] [9-10] [6-8] [9-10] [9-10] 7.0 1.0 7.0 8.0 9.0 8.0 9.0 4.0 9.0 9.0 6.0 7.0 9.0 7.0 9.0 5.0 4.0 [6-8] [0-5] [6-8] [6-8] [9-10] [6-8] [9-10] [0-5] [9-10] [9-10] [6-8] [6-8] [9-10] [6-8] [9-10] [0-5] [0-5] 0.40 [0.33-1] 5.0 NaN 2.0 [5-6] [0-3] [0-2] 8.0 7.0 [6-8] [7-10] 0.0 0
4420 1 11 male 28.0 29.0 1 [0-1] 'Black/African American' European/Caucasian-American 0 7.0 1.0 [6-10] [0-1] 'International Affairs' 10.00 40.00 20.00 20.00 0.00 10.00 [0-15] [21-100] [16-20] [16-20] [0-15] [0-15] 10.0 9.0 9.0 6.0 6.0 5.0 [9-10] [9-10] [9-10] [6-8] [6-8] [0-5] 20.00 18.00 20.00 17.00 10.00 15.00 [16-20] [16-20] [16-20] [16-20] [0-15] [0-15] 8.0 10.0 8.0 10.0 10.0 [6-8] [9-10] [6-8] [9-10] [9-10] 10.0 10.0 10.0 4.0 5.0 4.0 [9-10] [9-10] [9-10] [0-5] [0-5] [0-5] 10.0 5.0 8.0 8.0 7.0 5.0 10.0 7.0 9.0 7.0 4.0 7.0 9.0 5.0 10.0 1.0 1.0 [9-10] [0-5] [6-8] [6-8] [6-8] [0-5] [9-10] [6-8] [9-10] [6-8] [0-5] [6-8] [9-10] [0-5] [9-10] [0-5] [0-5] 0.48 [0.33-1] 6.0 NaN 6.0 [5-6] [0-3] [5-18] 5.0 5.0 [0-5] [5-6] 0.0 0
4565 1 12 female 21.0 32.0 11 [7-37] European/Caucasian-American European/Caucasian-American 1 8.0 7.0 [6-10] [6-10] 'speech pathology' 20.00 20.00 20.00 20.00 10.00 10.00 [16-20] [16-20] [16-20] [16-20] [0-15] [0-15] 5.0 4.0 4.0 4.0 4.0 0.0 [0-5] [0-5] [0-5] [0-5] [0-5] [0-5] 50.00 5.00 20.00 10.00 5.00 10.00 [21-100] [0-15] [16-20] [0-15] [0-15] [0-15] 8.0 8.0 8.0 8.0 10.0 [6-8] [6-8] [6-8] [6-8] [9-10] 4.0 9.0 8.0 4.0 8.0 NaN [0-5] [9-10] [6-8] [0-5] [6-8] [0-5] 10.0 10.0 9.0 7.0 5.0 6.0 6.0 6.0 8.0 5.0 7.0 7.0 7.0 9.0 9.0 6.0 6.0 [9-10] [9-10] [9-10] [6-8] [0-5] [6-8] [6-8] [6-8] [6-8] [0-5] [6-8] [6-8] [6-8] [9-10] [9-10] [6-8] [6-8] -0.30 [-1-0] 8.0 NaN 4.0 [7-10] [0-3] [3-5] 4.0 4.0 [0-5] [0-4] 0.0 0
4433 1 11 male 28.0 22.0 6 [4-6] 'Latino/Hispanic American' European/Caucasian-American 0 3.0 7.0 [2-5] [6-10] 'Business [MBA]' 25.00 7.00 25.00 25.00 8.00 10.00 [21-100] [0-15] [21-100] [21-100] [0-15] [0-15] 4.0 4.0 5.0 5.0 6.0 5.0 [0-5] [0-5] [0-5] [0-5] [6-8] [0-5] 23.00 18.00 21.00 18.00 10.00 10.00 [21-100] [16-20] [21-100] [16-20] [0-15] [0-15] 7.0 10.0 9.0 9.0 9.0 [6-8] [9-10] [9-10] [9-10] [9-10] 6.0 8.0 8.0 6.0 5.0 6.0 [6-8] [6-8] [6-8] [6-8] [0-5] [6-8] 6.0 9.0 7.0 10.0 8.0 7.0 6.0 6.0 8.0 9.0 6.0 9.0 9.0 8.0 8.0 6.0 4.0 [6-8] [9-10] [6-8] [9-10] [6-8] [6-8] [6-8] [6-8] [6-8] [9-10] [6-8] [9-10] [9-10] [6-8] [6-8] [6-8] [0-5] 0.27 [0-0.33] 6.0 NaN 3.0 [5-6] [0-3] [3-5] 6.0 6.0 [6-8] [5-6] 0.0 0
4249 1 11 male 27.0 25.0 2 [2-3] European/Caucasian-American European/Caucasian-American 1 1.0 5.0 [0-1] [2-5] 'Business School' 15.00 18.00 19.00 19.00 17.00 12.00 [0-15] [16-20] [16-20] [16-20] [16-20] [0-15] 7.0 8.0 9.0 8.0 NaN 6.0 [6-8] [6-8] [9-10] [6-8] [0-5] [6-8] 25.00 20.00 25.00 20.00 10.00 0.00 [21-100] [16-20] [21-100] [16-20] [0-15] [0-15] 7.0 6.0 7.0 8.0 7.0 [6-8] [6-8] [6-8] [6-8] [6-8] 5.0 4.0 6.0 4.0 NaN NaN [0-5] [0-5] [6-8] [0-5] [0-5] [0-5] 9.0 2.0 5.0 7.0 7.0 9.0 8.0 1.0 5.0 9.0 5.0 5.0 9.0 8.0 9.0 8.0 5.0 [9-10] [0-5] [0-5] [6-8] [6-8] [9-10] [6-8] [0-5] [0-5] [9-10] [0-5] [0-5] [9-10] [6-8] [9-10] [6-8] [0-5] 0.63 [0.33-1] 7.0 NaN 2.0 [7-10] [0-3] [0-2] 4.0 1.0 [0-5] [0-4] 0.0 0
875 0 3 female 26.0 22.0 4 [4-6] 'Latino/Hispanic American' European/Caucasian-American 0 1.0 1.0 [0-1] [0-1] law 30.00 10.00 20.00 30.00 0.00 10.00 [21-100] [0-15] [16-20] [21-100] [0-15] [0-15] 7.0 7.0 7.0 6.0 5.0 5.0 [6-8] [6-8] [6-8] [6-8] [0-5] [0-5] 30.00 10.00 20.00 20.00 10.00 10.00 [21-100] [0-15] [16-20] [16-20] [0-15] [0-15] 9.0 9.0 9.0 9.0 9.0 [9-10] [9-10] [9-10] [9-10] [9-10] 7.0 8.0 8.0 7.0 9.0 7.0 [6-8] [6-8] [6-8] [6-8] [9-10] [6-8] 8.0 5.0 7.0 8.0 6.0 6.0 7.0 7.0 7.0 7.0 3.0 2.0 9.0 7.0 9.0 9.0 2.0 [6-8] [0-5] [6-8] [6-8] [6-8] [6-8] [6-8] [6-8] [6-8] [6-8] [0-5] [0-5] [9-10] [6-8] [9-10] [9-10] [0-5] 0.14 [0-0.33] 5.0 1.0 5.0 [5-6] [0-3] [3-5] 7.0 8.0 [6-8] [7-10] 0.0 0
3010 1 9 male 23.0 25.0 2 [2-3] European/Caucasian-American 'Asian/Pacific Islander/Asian-American' 0 6.0 8.0 [6-10] [6-10] 'Computational Biochemsistry' 17.65 17.65 17.65 15.69 15.69 15.69 [16-20] [16-20] [16-20] [16-20] [16-20] [16-20] 5.0 8.0 7.0 9.0 8.0 5.0 [0-5] [6-8] [6-8] [9-10] [6-8] [0-5] 17.02 21.28 17.02 21.28 14.89 8.51 [16-20] [21-100] [16-20] [21-100] [0-15] [0-15] 8.0 10.0 10.0 9.0 8.0 [6-8] [9-10] [9-10] [9-10] [6-8] 7.0 8.0 8.0 7.0 9.0 5.0 [6-8] [6-8] [6-8] [6-8] [9-10] [0-5] 9.0 3.0 10.0 8.0 6.0 7.0 7.0 4.0 7.0 3.0 1.0 9.0 6.0 4.0 6.0 7.0 1.0 [9-10] [0-5] [9-10] [6-8] [6-8] [6-8] [6-8] [0-5] [6-8] [0-5] [0-5] [9-10] [6-8] [0-5] [6-8] [6-8] [0-5] 0.00 [-1-0] 6.0 NaN NaN [5-6] [0-3] [0-2] 6.0 6.0 [6-8] [5-6] 0.0 1
127 1 1 male 22.0 25.0 3 [2-3] 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 3.0 5.0 [2-5] [2-5] Law 9.09 18.18 27.27 18.18 18.18 9.09 [0-15] [16-20] [21-100] [16-20] [16-20] [0-15] 5.0 8.0 8.0 6.0 7.0 7.0 [0-5] [6-8] [6-8] [6-8] [6-8] [6-8] 19.00 18.00 19.00 18.00 14.00 12.00 [16-20] [16-20] [16-20] [16-20] [0-15] [0-15] 4.0 7.0 8.0 8.0 3.0 [0-5] [6-8] [6-8] [6-8] [0-5] 10.0 10.0 10.0 10.0 10.0 10.0 [9-10] [9-10] [9-10] [9-10] [9-10] [9-10] 7.0 8.0 2.0 9.0 5.0 6.0 4.0 7.0 7.0 6.0 8.0 10.0 8.0 9.0 9.0 8.0 1.0 [6-8] [6-8] [0-5] [9-10] [0-5] [6-8] [0-5] [6-8] [6-8] [6-8] [6-8] [9-10] [6-8] [9-10] [9-10] [6-8] [0-5] 0.42 [0.33-1] 3.0 4.0 NaN [0-4] [4-9] [0-2] 10.0 10.0 [9-10] [7-10] 0.0 1
6. Explore the missing values in the dataset
In [4]:
speed_df.isnull().sum()
Out[4]:
has_null                              0
wave                                  0
gender                                0
age                                  95
age_o                               104
d_age                                 0
d_d_age                               0
race                                 63
race_o                               73
samerace                              0
importance_same_race                 79
importance_same_religion             79
d_importance_same_race                0
d_importance_same_religion            0
field                                63
pref_o_attractive                    89
pref_o_sincere                       89
pref_o_intelligence                  89
pref_o_funny                         98
pref_o_ambitious                    107
pref_o_shared_interests             129
d_pref_o_attractive                   0
d_pref_o_sincere                      0
d_pref_o_intelligence                 0
d_pref_o_funny                        0
d_pref_o_ambitious                    0
d_pref_o_shared_interests             0
attractive_o                        212
sinsere_o                           287
intelligence_o                      306
funny_o                             360
ambitous_o                          722
shared_interests_o                 1076
d_attractive_o                        0
d_sinsere_o                           0
d_intelligence_o                      0
d_funny_o                             0
d_ambitous_o                          0
d_shared_interests_o                  0
attractive_important                 79
sincere_important                    79
intellicence_important               79
funny_important                      89
ambtition_important                  99
shared_interests_important          121
d_attractive_important                0
d_sincere_important                   0
d_intellicence_important              0
d_funny_important                     0
d_ambtition_important                 0
d_shared_interests_important          0
attractive                          105
sincere                             105
intelligence                        105
funny                               105
ambition                            105
d_attractive                          0
d_sincere                             0
d_intelligence                        0
d_funny                               0
d_ambition                            0
attractive_partner                  202
sincere_partner                     277
intelligence_partner                296
funny_partner                       350
ambition_partner                    712
shared_interests_partner           1067
d_attractive_partner                  0
d_sincere_partner                     0
d_intelligence_partner                0
d_funny_partner                       0
d_ambition_partner                    0
d_shared_interests_partner            0
sports                               79
tvsports                             79
exercise                             79
dining                               79
museums                              79
art                                  79
hiking                               79
gaming                               79
clubbing                             79
reading                              79
tv                                   79
theater                              79
movies                               79
concerts                             79
music                                79
shopping                             79
yoga                                 79
d_sports                              0
d_tvsports                            0
d_exercise                            0
d_dining                              0
d_museums                             0
d_art                                 0
d_hiking                              0
d_gaming                              0
d_clubbing                            0
d_reading                             0
d_tv                                  0
d_theater                             0
d_movies                              0
d_concerts                            0
d_music                               0
d_shopping                            0
d_yoga                                0
interests_correlate                 158
d_interests_correlate                 0
expected_happy_with_sd_people       101
expected_num_interested_in_me      6578
expected_num_matches               1173
d_expected_happy_with_sd_people       0
d_expected_num_interested_in_me       0
d_expected_num_matches                0
like                                240
guess_prob_liked                    309
d_like                                0
d_guess_prob_liked                    0
met                                 375
match                                 0
dtype: int64

We found that the column expected_num_interested_in_me has the highest count of missing values (6,578 out of 8,378 records).
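The per-column missing share can also be computed directly with `isnull().mean()`; a minimal sketch on a toy frame (column names from the dataset, values illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring two of the dataset's columns; values are illustrative
df = pd.DataFrame({
    "expected_num_interested_in_me": [np.nan, 2.0, np.nan, np.nan],
    "age": [21.0, 22.0, np.nan, 24.0],
})

# Fraction of missing values per column, worst first
missing_pct = df.isnull().mean().sort_values(ascending=False).round(2)
print(missing_pct)
```

Applied to `speed_df`, this ranks columns by missingness instead of raw counts, which is easier to compare across datasets of different sizes.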

7. Randomly select 10 observations that have missing values in `expected_num_interested_in_me` and compare them to the values in the corresponding binned column `d_expected_num_interested_in_me`.
In [5]:
check_na_df=speed_df[np.isnan(speed_df.expected_num_interested_in_me)].sample(n=10, random_state=8)
check_na_df.loc[:, ["expected_num_interested_in_me", "d_expected_num_interested_in_me"]]
Out[5]:
expected_num_interested_in_me d_expected_num_interested_in_me
4648 NaN [0-3]
3108 NaN [0-3]
3538 NaN [0-3]
2668 NaN [0-3]
4443 NaN [0-3]
2647 NaN [0-3]
7112 NaN [0-3]
2667 NaN [0-3]
7572 NaN [0-3]
7991 NaN [0-3]

We found that all observations with missing values in expected_num_interested_in_me fall into the lowest bin of d_expected_num_interested_in_me. The columns shared_interests_partner and d_shared_interests_partner show the same pattern, as demonstrated below.

In [6]:
check_na_df=speed_df[np.isnan(speed_df.shared_interests_partner)].sample(n=10, random_state=8)
check_na_df.loc[:, ["shared_interests_partner", "d_shared_interests_partner"]]
Out[6]:
shared_interests_partner d_shared_interests_partner
4258 NaN [0-5]
5410 NaN [0-5]
2918 NaN [0-5]
8207 NaN [0-5]
8144 NaN [0-5]
5305 NaN [0-5]
6126 NaN [0-5]
2666 NaN [0-5]
4848 NaN [0-5]
3394 NaN [0-5]
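The samples above inspect only 10 rows each; the same pattern can be verified across all rows. A minimal sketch on a toy frame mirroring the raw/binned column pair (values illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring a raw column and its binned counterpart
df = pd.DataFrame({
    "expected_num_interested_in_me": [np.nan, 5.0, np.nan],
    "d_expected_num_interested_in_me": ["[0-3]", "[4-9]", "[0-3]"],
})

# Check that every missing raw value (not just a sample) maps to the lowest bin
na_rows = df[df["expected_num_interested_in_me"].isnull()]
all_lowest = (na_rows["d_expected_num_interested_in_me"] == "[0-3]").all()
print(all_lowest)
```

Running the same check on `speed_df` would confirm the observation holds for all 6,578 missing records, not just the sampled ones.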

Looking further on the web, we found that this dataset is also available from https://www.openml.org/d/40536. In the OpenML repository, none of the binned columns exist.

8.  Based on the principle of retaining the original data, and of predicting the target from complete descriptive features, we drop all observations with null values, as well as the following columns:

- all columns with the `d_` prefix except `d_age`: `d_age` is not a binned column; it contains the difference in age between the participant and his/her partner.
- column `has_null`: it only indicates whether the row contains any NA value, which does not help determine our target feature.
- column `wave`: it only contains the experimental batch number of the dating event, which likewise does not help determine our target feature.
In [7]:
# Flag the binned columns: names starting with 'd_', excluding 'd_age'
bool_binned_cols = speed_df.columns.str.startswith('d_') & ~speed_df.columns.str.startswith('d_age')
In [8]:
binned_cols = speed_df.columns[bool_binned_cols].tolist()
In [9]:
speed_df= speed_df.dropna()
speed_df=speed_df.drop(columns =['has_null','wave'])
speed_df=speed_df.drop(columns = binned_cols)
In [10]:
print(f"Shape of the dataset is {speed_df.shape} \n")
na_sum=speed_df.isna().sum()!=0
print(f"Final Check for Null Values:  {sum(na_sum)} \n")
print(f"Now that all null values have been removed, let's check the data types of these attributes. \n")
display(HTML('<b>Table 2: Data types of the attributes </b>'))
print(speed_df.dtypes)
Shape of the dataset is (1048, 64) 

Final Check for Null Values:  0 

Now that all null values have been removed, let's check the data types of these attributes. 

Table 2: Data types of the attributes
gender                            object
age                              float64
age_o                            float64
d_age                              int64
race                              object
race_o                            object
samerace                           int64
importance_same_race             float64
importance_same_religion         float64
field                             object
pref_o_attractive                float64
pref_o_sincere                   float64
pref_o_intelligence              float64
pref_o_funny                     float64
pref_o_ambitious                 float64
pref_o_shared_interests          float64
attractive_o                     float64
sinsere_o                        float64
intelligence_o                   float64
funny_o                          float64
ambitous_o                       float64
shared_interests_o               float64
attractive_important             float64
sincere_important                float64
intellicence_important           float64
funny_important                  float64
ambtition_important              float64
shared_interests_important       float64
attractive                       float64
sincere                          float64
intelligence                     float64
funny                            float64
ambition                         float64
attractive_partner               float64
sincere_partner                  float64
intelligence_partner             float64
funny_partner                    float64
ambition_partner                 float64
shared_interests_partner         float64
sports                           float64
tvsports                         float64
exercise                         float64
dining                           float64
museums                          float64
art                              float64
hiking                           float64
gaming                           float64
clubbing                         float64
reading                          float64
tv                               float64
theater                          float64
movies                           float64
concerts                         float64
music                            float64
shopping                         float64
yoga                             float64
interests_correlate              float64
expected_happy_with_sd_people    float64
expected_num_interested_in_me    float64
expected_num_matches             float64
like                             float64
guess_prob_liked                 float64
met                              float64
match                              int64
dtype: object
9.  Check the *descriptive statistics* of all the numerical features using the *describe* function
In [11]:
display(HTML('<b>Table 3: Summary of continuous features </b>'))
speed_df.describe(include = np.number).round(2)
Table 3: Summary of continuous features
Out[11]:
age age_o d_age samerace importance_same_race importance_same_religion pref_o_attractive pref_o_sincere pref_o_intelligence pref_o_funny pref_o_ambitious pref_o_shared_interests attractive_o sinsere_o intelligence_o funny_o ambitous_o shared_interests_o attractive_important sincere_important intellicence_important funny_important ambtition_important shared_interests_important attractive sincere intelligence funny ambition attractive_partner sincere_partner intelligence_partner funny_partner ambition_partner shared_interests_partner sports tvsports exercise dining museums art hiking gaming clubbing reading tv theater movies concerts music shopping yoga interests_correlate expected_happy_with_sd_people expected_num_interested_in_me expected_num_matches like guess_prob_liked met match
count 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.0 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00 1048.00
mean 25.01 24.82 3.03 0.42 4.02 4.14 23.73 16.97 22.26 17.33 9.73 10.33 6.21 7.17 7.39 6.30 6.80 5.42 23.77 17.37 21.82 16.73 9.99 10.84 6.89 8.16 7.60 8.32 7.21 6.19 7.22 7.44 6.36 6.8 5.43 6.25 4.56 5.98 7.63 6.81 6.44 5.10 3.92 6.10 7.42 5.59 6.88 8.06 6.96 7.71 5.51 4.13 0.15 5.38 5.76 2.84 6.22 4.98 0.08 0.18
std 3.27 3.18 2.43 0.49 3.03 3.02 12.66 7.45 7.35 6.67 7.07 6.76 1.96 1.74 1.54 2.07 1.83 2.17 13.56 7.42 7.31 6.57 7.25 6.94 1.49 1.38 1.77 1.00 2.04 1.91 1.75 1.51 2.04 1.8 2.12 2.64 2.79 2.46 1.79 1.96 2.19 2.58 2.37 2.18 1.96 2.49 2.28 1.59 2.03 1.90 2.60 2.70 0.34 1.63 4.95 2.37 1.86 2.27 0.27 0.38
min 18.00 18.00 0.00 0.00 1.00 1.00 5.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 0.00 0.00 0.00 0.00 0.00 2.00 2.00 2.00 5.00 2.00 0.00 0.00 0.00 0.00 0.0 0.00 1.00 1.00 1.00 3.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 3.00 1.00 1.00 1.00 1.00 -0.63 1.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 22.00 22.00 1.00 0.00 1.00 1.00 15.00 10.00 20.00 10.83 5.00 5.00 5.00 6.00 6.88 5.00 6.00 4.00 15.00 11.11 20.00 10.00 5.00 5.00 6.00 8.00 7.00 8.00 6.00 5.00 6.00 7.00 5.00 6.0 4.00 4.00 2.00 4.00 6.00 6.00 5.00 3.00 2.00 4.00 6.00 4.00 5.00 7.00 6.00 7.00 4.00 2.00 -0.11 5.00 2.00 1.00 5.00 3.00 0.00 0.00
50% 25.00 25.00 2.00 0.00 3.00 3.00 20.00 18.00 20.00 18.18 10.00 10.00 6.00 7.00 7.00 6.00 7.00 5.00 20.00 20.00 20.00 15.00 10.00 10.00 7.00 8.00 8.00 8.00 8.00 6.00 7.00 7.00 7.00 7.0 5.00 7.00 4.00 6.00 8.00 7.00 7.00 5.00 4.00 6.00 8.00 6.00 7.00 8.00 7.00 8.00 5.00 3.00 0.15 5.00 4.00 2.00 6.00 5.00 0.00 0.00
75% 27.00 27.00 4.00 1.00 7.00 7.00 30.00 20.00 25.00 20.00 15.00 15.00 8.00 8.00 8.00 8.00 8.00 7.00 30.00 20.00 25.00 20.00 15.00 15.00 8.00 9.00 9.00 9.00 9.00 8.00 8.00 8.00 8.00 8.0 7.00 8.00 7.00 8.00 9.00 8.00 8.00 7.00 6.00 8.00 9.00 8.00 9.00 9.00 8.00 9.00 8.00 7.00 0.42 7.00 8.00 4.00 7.00 7.00 0.00 0.00
max 35.00 35.00 14.00 1.00 10.00 10.00 100.00 40.00 50.00 40.00 53.00 30.00 10.00 10.00 10.00 10.00 10.00 10.00 100.00 40.00 50.00 40.00 53.00 30.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.0 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 0.90 9.00 20.00 10.00 10.00 10.00 3.00 1.00
10. Get the *summary statistics* of the categorical variables
In [12]:
display(HTML('<b>Table 4: Summary Statistics of categorical features </b>'))
categorical_cols = speed_df.columns[speed_df.dtypes == object].tolist()  # np.object is deprecated; use the built-in object
for categorical_col in categorical_cols:
    print(categorical_col + ':')
    print(speed_df[categorical_col].value_counts().round(2))
    print('\n')
Table 4: Summary Statistics of categorical features
gender:
female    531
male      517
Name: gender, dtype: int64


race:
European/Caucasian-American                636
'Asian/Pacific Islander/Asian-American'    205
'Latino/Hispanic American'                  84
Other                                       84
'Black/African American'                    39
Name: race, dtype: int64


race_o:
European/Caucasian-American                641
'Asian/Pacific Islander/Asian-American'    196
'Latino/Hispanic American'                  77
Other                                       70
'Black/African American'                    64
Name: race_o, dtype: int64


field:
Law                                        139
'Social Work'                               81
Business                                    48
law                                         45
Psychology                                  39
Economics                                   37
Finance                                     36
MBA                                         33
Chemistry                                   30
chemistry                                   26
'Operations Research'                       26
Film                                        23
'Electrical Engineering'                    18
'political science'                         17
Engineering                                 16
Marketing                                   15
LAW                                         15
microbiology                                15
Journalism                                  15
Finance&Economics                           15
'Business- MBA'                             15
'Elementary/Childhood Education [MA]'       15
'Educational Psychology'                    14
Medicine                                    14
'Business [MBA]'                            14
'Operations Research [SEAS]'                14
Finanace                                    13
Communications                              13
Mathematics                                 13
'Masters of Social Work'                    13
'Mechanical Engineering'                    13
'International Educational Development'     13
Statistics                                  12
'Mathematical Finance'                      12
psychology                                  12
'TC [Health Ed]'                            11
'Organizational Psychology'                 11
'Masters in Public Administration'          10
'Speech Language Pathology'                 10
'Applied Maths/Econs'                        9
'Undergrad - GS'                             9
'German Literature'                          9
'social work'                                9
'Art History/medicine'                       9
philosophy                                   8
Sociology                                    8
English                                      8
'Biomedical Engineering'                     8
'psychology and english'                     8
'Computer Science'                           8
'Economics and Political Science'            8
'Business & International Affairs'           6
'Economics; Sociology'                       5
'financial math'                             3
Classics                                     1
Polish                                       1
Name: field, dtype: int64


11. Group the values of `field` to

 - remove differences in letter capitalization
 - reduce scatter, so that disciplines of a similar nature are combined into the same value.
In [13]:
speed_df['field']= speed_df['field'].str.strip()  
speed_df['field']=speed_df['field'].replace(['law', 'LAW'], 'Law') 
speed_df['field']=speed_df['field'].replace(["'Social Work'", "'social work'", "Sociology", "'Masters of Social Work'"], 'Social Work') 
speed_df['field']=speed_df['field'].replace(["philosophy","psychology", "'psychology and english'", "'Organizational Psychology'", "Psychology"], 'Psychology/Philosophy') 
speed_df['field']=speed_df['field'].replace(["chemistry"], 'Chemistry') 
speed_df['field']=speed_df['field'].replace(["'Electrical Engineering'", "'Mechanical Engineering'", "'Biomedical Engineering'", "'Computer Science'", "Engineering"], 'Engineering/Computer Science') 
speed_df['field']=speed_df['field'].replace(["Business","Finance","Economics", "Marketing", "Finance&Economics", "Finanace","'Economics; Sociology'", "'Business & International Affairs'"], "Business/Finance")
speed_df['field']=speed_df['field'].replace(["MBA", "Business- MBA", "'Business- MBA'", "'Business [MBA]'"], "Business [MBA]")
speed_df['field']=speed_df['field'].replace(["'political science'", "'Economics and Political Science'"], 'Political Science')
speed_df['field']=speed_df['field'].replace(["Journalism", "Communications", "'Masters in Public Administration'"], "Journalism/Communications")
speed_df['field']=speed_df['field'].replace(["'Elementary/Childhood Education [MA]'","'Educational Psychology'","'International Educational Development'", "'TC [Health Ed]'"], "Education")
speed_df['field']=speed_df['field'].replace(["Medicine","microbiology","'Biomedical Engineering'"], "Medical Science")
speed_df['field']=speed_df['field'].replace(["'Art History/medicine'", "Mathematics", "'Mathematical Finance'", "'financial math'","'Applied Maths/Econs'", "Statistics"], "Mathematical-related")
speed_df['field']=speed_df['field'].replace(["'Operations Research'","'Operations Research [SEAS]'"], 'Operational Research')
speed_df['field']=speed_df['field'].replace(["English", "Polish", "'German Literature'", "'Speech Language Pathology'"], "Language-related")
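As an alternative design, the long chain of `replace()` calls above can be expressed as a single lowercase-then-map pass. A minimal sketch on a hypothetical mini-sample (not the project data), covering only a few of the labels:

```python
import pandas as pd

# Hypothetical mini-example: normalise case first, then map variants through
# one dict -- the same cleanup as the replace() chain, with less repetition.
fields = pd.Series(["Law", "law", "LAW", "'Social Work'", "chemistry"])

field_map = {
    "law": "Law",
    "'social work'": "Social Work",
    "chemistry": "Chemistry",
}

# unmapped values fall back to their original spelling via fillna
cleaned = fields.str.strip().str.lower().map(field_map).fillna(fields)
print(cleaned.tolist())  # ['Law', 'Law', 'Law', 'Social Work', 'Chemistry']
```

A dict keeps every grouping decision in one place, which is easier to audit than fifteen separate `replace()` statements.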
12.  Assign all the descriptive features (i.e. everything except `match`) as `Data` and the target label column `match` as `target`
In [14]:
speed_df.to_csv('speed_clean.csv', index=False)
Data = speed_df.drop(columns = 'match')
target = speed_df['match']
13. Perform one-hot encoding on the categorical descriptive features; if a categorical feature has only two levels, encode it as a single binary variable
In [15]:
# get the list of categorical descriptive features
categorical_cols = Data.columns[Data.dtypes==object].tolist()

# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
for col in categorical_cols:
    n = len(Data[col].unique())
    if (n == 2):
        Data[col] = pd.get_dummies(Data[col], drop_first=True)

# for other categorical features (with > 2 levels), 
# use regular one-hot-encoding 
# if a feature is numeric, it will be untouched
Data = pd.get_dummies(Data)
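A toy illustration of the two-level shortcut above: with `drop_first=True`, `get_dummies` collapses a binary category into a single 0/1 column (the first level, here `female`, is dropped).

```python
import pandas as pd

# Binary category: drop_first=True leaves one indicator column instead of two,
# avoiding the redundant (perfectly collinear) second dummy.
s = pd.Series(["female", "male", "male", "female"])
print(pd.get_dummies(s, drop_first=True))
```

This is why `gender` appears as a single column after encoding, while `race` and `field` expand into one column per level.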
14.  Check the data types and shape of the descriptive features after encoding
In [16]:
Data.dtypes
Out[16]:
gender                                              uint8
age                                               float64
age_o                                             float64
d_age                                               int64
samerace                                            int64
importance_same_race                              float64
importance_same_religion                          float64
pref_o_attractive                                 float64
pref_o_sincere                                    float64
pref_o_intelligence                               float64
pref_o_funny                                      float64
pref_o_ambitious                                  float64
pref_o_shared_interests                           float64
attractive_o                                      float64
sinsere_o                                         float64
intelligence_o                                    float64
funny_o                                           float64
ambitous_o                                        float64
shared_interests_o                                float64
attractive_important                              float64
sincere_important                                 float64
intellicence_important                            float64
funny_important                                   float64
ambtition_important                               float64
shared_interests_important                        float64
attractive                                        float64
sincere                                           float64
intelligence                                      float64
funny                                             float64
ambition                                          float64
attractive_partner                                float64
sincere_partner                                   float64
intelligence_partner                              float64
funny_partner                                     float64
ambition_partner                                  float64
shared_interests_partner                          float64
sports                                            float64
tvsports                                          float64
exercise                                          float64
dining                                            float64
museums                                           float64
art                                               float64
hiking                                            float64
gaming                                            float64
clubbing                                          float64
reading                                           float64
tv                                                float64
theater                                           float64
movies                                            float64
concerts                                          float64
music                                             float64
shopping                                          float64
yoga                                              float64
interests_correlate                               float64
expected_happy_with_sd_people                     float64
expected_num_interested_in_me                     float64
expected_num_matches                              float64
like                                              float64
guess_prob_liked                                  float64
met                                               float64
race_'Asian/Pacific Islander/Asian-American'        uint8
race_'Black/African American'                       uint8
race_'Latino/Hispanic American'                     uint8
race_European/Caucasian-American                    uint8
race_Other                                          uint8
race_o_'Asian/Pacific Islander/Asian-American'      uint8
race_o_'Black/African American'                     uint8
race_o_'Latino/Hispanic American'                   uint8
race_o_European/Caucasian-American                  uint8
race_o_Other                                        uint8
field_'Undergrad - GS'                              uint8
field_Business [MBA]                                uint8
field_Business/Finance                              uint8
field_Chemistry                                     uint8
field_Classics                                      uint8
field_Education                                     uint8
field_Engineering/Computer Science                  uint8
field_Film                                          uint8
field_Journalism/Communications                     uint8
field_Language-related                              uint8
field_Law                                           uint8
field_Mathematical-related                          uint8
field_Medical Science                               uint8
field_Operational Research                          uint8
field_Political Science                             uint8
field_Psychology/Philosophy                         uint8
field_Social Work                                   uint8
dtype: object
In [17]:
Data.shape
Out[17]:
(1048, 87)
15.  Retain the column names of the descriptive and target features by copying them into the new dataframes `Data_df` and `target_df`.
In [18]:
Data_df = Data.copy()
target_df =target.copy()

print("Target Type:", type(target))
print("Counts Using NumPy:")
print(np.unique(target, return_counts = True))
print("Counts Using Pandas:")
print(pd.Series(target).value_counts())
Target Type: <class 'pandas.core.series.Series'>
Counts Using NumPy:
(array([0, 1], dtype=int64), array([862, 186], dtype=int64))
Counts Using Pandas:
0    862
1    186
Name: match, dtype: int64
16.  Perform normalization on all the descriptive features using MinMaxScaler()
In [19]:
from sklearn import preprocessing
Data = preprocessing.MinMaxScaler().fit_transform(Data)
target=target.values
17.  Ensure both `Data` and `target` are NumPy arrays, so that they are ready to pass in to scikit-learn for data modelling.
In [20]:
print(type(Data))
print(type(target))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
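A toy illustration of what step 16 does: `MinMaxScaler` rescales each column independently to [0, 1] via (x - min) / (max - min), so features on different rating scales become directly comparable for distance-based models such as KNN. The numbers below are illustrative, not taken from the project data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns on different scales: an age-like column and a 1-10 rating.
X = np.array([[18.0, 1.0],
              [25.0, 5.0],
              [35.0, 10.0]])

# Each column is mapped to [0, 1] using its own min and max.
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # [0. 0.] [1. 1.]
```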

Data exploration & Visualisation

Univariate Visualisation - One-Variable plots

1.  Explore the `gender` distribution in the dataset.
In [21]:
plt.figure()
speed_df['gender'].value_counts().plot(kind='bar')
plt.xticks(rotation='horizontal')
plt.ylabel('Count')
plt.title('Gender Distribution', fontsize = 16)
plt.show()
display(HTML('<b>Figure 1: Gender Distribution </b>'))
Figure 1: Gender Distribution

Figure 1 indicates that the two genders are almost equally represented (531 female vs 517 male), so there is no gender bias in the sample.

2.  Explore the `race` distribution in the dataset.
In [22]:
plt.figure()
speed_df['race'].value_counts().plot(kind='bar')
plt.ylabel('Number of People')
plt.title('Racial Distribution',fontsize = 16)
plt.xticks(rotation=80)
plt.show()
display(HTML('<b>Figure 2: Racial Distribution </b>'))
Figure 2: Racial Distribution

Figure 2 shows a clear skew towards European/Caucasian-Americans, who make up about 60% of the dataset. Moreover, there are very few Black/African Americans and no Native Americans.

3. Explore the `field` distribution in the dataset.
In [23]:
plt.figure(figsize=(10, 10))
field_df=pd.DataFrame(speed_df['field'].value_counts()).reset_index()
field_df.columns=['field_name', 'count']
sns.barplot(x="field_name", y="count", data=field_df)
plt.title('Field Distribution',fontsize = 16)
plt.xticks(rotation=80)
display(HTML('<b>Figure 3: Field Distribution </b>'))
Figure 3: Field Distribution

Figure 3 shows the number of participants in each field of study.

Bivariate Visualisation - Two-Variable plots

1.  Explore the count of each descriptive feature grouped by the values of the target feature
In [24]:
speed_df_no_target = speed_df.drop(columns = 'match')
In [25]:
for i in speed_df_no_target.columns:
    title_str='Count of ' + i.upper() + ' group by target feature "MATCH"'  
    plt.figure(figsize=(10,6))
    plt.xticks(rotation=90)
    g=sns.countplot(speed_df[i],hue=speed_df['match'])
    g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
    plt.title(title_str, fontsize = 16)
2.  Explore the proportion of each descriptive feature grouped by the values of the target feature
In [26]:
for i in speed_df_no_target.columns:
    title_str='Proportion of ' + i.upper() + ' group by target feature "MATCH"'  

    col_list = speed_df[i].dropna().unique().tolist()
    match_list = speed_df['match'].dropna().unique().tolist()
    result=pd.DataFrame(columns=[i, 'match', 'percentage'])

    for col_element in col_list:
        for match_value in match_list:
            element_num = len(speed_df[(speed_df[i]==col_element) & (speed_df['match']==match_value)])
            element_percent = element_num / len(speed_df[speed_df['match'] == match_value])
            result.loc[len(result)] = [col_element, match_value, element_percent]
#            print(col_element, match_value, element_num, element_percent)
#            print(result)

    plt.figure()
    sns.barplot(x=i, y="percentage", hue="match", data=result)
    plt.xticks(rotation=80)
    plt.title(title_str, fontsize = 16)
    plt.ylabel('Proportion')
    plt.show()

After going through all the two-variable plots, we found that besides `interests_correlate`, all the numerical features in the dataset take discrete values.
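The observation above can also be checked programmatically: counting distinct values per numeric column flags which ones are effectively discrete (ratings on a 0-10 scale) versus genuinely continuous. A sketch on synthetic stand-in data (the column names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "funny_o": rng.integers(0, 11, size=200).astype(float),  # discrete 0-10 rating
    "interests_correlate": rng.uniform(-1, 1, size=200),     # continuous score
})

# A discrete rating can take at most 11 distinct values no matter how many
# rows there are; a continuous score has (almost) one distinct value per row.
print(df.nunique())
```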

Trivariate Visualisation - Three-Variable plots

1.  Explore the relationship between Age, Gender and the target `Match`
In [27]:
sns.boxplot(x="gender", y="age", hue='match', data=speed_df)
plt.title('Age vs Gender group by "MATCH"', fontsize = 16)
Out[27]:
Text(0.5, 1.0, 'Age vs Gender group by "MATCH"')
 2.  Explore the relationship between Age, Race and the target `Match`
In [28]:
plt.xticks(rotation=80)
sns.boxplot(x="race", y="age", hue='match', data=speed_df)
plt.title('Age vs Race group by "MATCH"', fontsize = 16)
Out[28]:
Text(0.5, 1.0, 'Age vs Race group by "MATCH"')
3.  Explore the relationship between gender, the participant's attractiveness as rated by their partner, and the target `Match`
In [29]:
plt.xticks(rotation='horizontal')
sns.boxplot(x="gender", y="attractive_o", hue='match', data=speed_df)
plt.title('Gender vs Attractiveness Rated by partner group by "MATCH"', fontsize = 16)
Out[29]:
Text(0.5, 1.0, 'Gender vs Attractiveness Rated by partner group by "MATCH"')
4.  Explore the relationship between Age of participant, age of partner and the target `Match`
In [30]:
sns.lmplot(x='age', y='age_o', markers=['o', 'x'], hue='match',
           data=speed_df.loc[speed_df['match'].isin([0,1])],
           fit_reg=False
          )
plt.title('Age of participant vs Age of partner group by "MATCH"', fontsize = 16)
Out[30]:
Text(0.5, 1, 'Age of participant vs Age of partner group by "MATCH"')
5.  Explore the relationship between the participant's attractiveness as rated by themselves vs as rated by their partner, and the target `Match`
In [31]:
sns.lmplot(x='attractive', y='attractive_o', markers=['o', 'x'], hue='match',
           data=speed_df.loc[speed_df['match'].isin([0,1])],
           fit_reg=False
          )
plt.title('Attractiveness rated by participant own self vs rated by partner group by "MATCH"', fontsize = 16)
Out[31]:
Text(0.5, 1, 'Attractiveness rated by participant own self vs rated by partner group by "MATCH"')
6.  Explore the relationship between how much you like your partner, how likely you think your partner is to like you, and the target `Match`
In [32]:
sns.lmplot(x='like', y='guess_prob_liked', markers=['o', 'x'], hue='match',
           data=speed_df.loc[speed_df['match'].isin([0,1])],
           fit_reg=False
          )
plt.title('Like your partner vs How likely do you think your partner like you group by "MATCH"', fontsize = 16)
Out[32]:
Text(0.5, 1, 'Like your partner vs How likely do you think your partner like you group by "MATCH"')

Predictive modelling

Overview of Methodology

Grid search and repeated cross-validation are used to find the optimal parameters of the following classifiers:

  1. KNN
  2. Decision Tree
  3. Random Forest
  4. Naive Bayes
  5. Logistic Regression

Let's observe the proportion of class label and number of observations in the dataset again.

In [33]:
target_df.value_counts(normalize=True)
Out[33]:
0    0.822519
1    0.177481
Name: match, dtype: float64
In [34]:
Data.shape
Out[34]:
(1048, 87)

Clearly, the class label is imbalanced in the final dataset, and the data size is not large. We therefore chose RepeatedStratifiedKFold to preserve the class proportions within each fold, and to average out the variance of the validation results. We then define:

- a pipeline to host the feature-selection step and the classifier
- a grid search to host the feature-selection score function and all the parameters of the classifier

so that we can search exhaustively for the best parameters over all possible combinations while training on the dataset.

For performance analysis, we choose AUC as the scoring metric, since it gives an unbiased assessment under imbalanced class labels.
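The stratification argument can be verified on a toy example: with an 82/18 class split mirroring `match`, every training fold produced by `RepeatedStratifiedKFold` keeps roughly the same positive-class share instead of drifting by chance. The array below is synthetic, not the project data.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# 82 negatives, 18 positives -- the same imbalance as the `match` target.
y = np.array([0] * 82 + [1] * 18)
X = np.zeros((100, 1))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=999)

# Positive-class share inside each of the 10 (5 folds x 2 repeats) training sets.
ratios = [y[train_idx].mean() for train_idx, _ in cv.split(X, y)]
print([round(r, 2) for r in ratios])
```

Every fold stays close to the overall 18% positive rate, which is exactly what stratification guarantees.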

Feature Selection, Hyperparameter Tuning and Performance Analysis

We execute the following three steps before performing feature selection and hyperparameter tuning on each classifier; the outputs of these steps are shared by every classifier's tuning process.

1. Prepare the dataset using the `holdout` approach: split the sampled data into 70% training and 30% testing (the unseen data used for the later performance analysis of each model).
In [35]:
from sklearn import feature_selection as fs
from sklearn.model_selection import train_test_split

# The "\" character below allows us to split the line across multiple lines
D_train, D_test, t_train, t_test = \
    train_test_split(Data, target, test_size = 0.3, 
                     stratify=target, shuffle=True, random_state=888)
print (D_train.shape)
print (D_test.shape)
print (t_train.shape)
print (t_test.shape)
(733, 87)
(315, 87)
(733,)
(315,)
 2.  Define the cross-validation method `RepeatedStratifiedKFold` (with k=5, n=2) and scoring metric `AUC`.
In [36]:
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, GridSearchCV
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=999)
scoring_metric = 'roc_auc'
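A toy illustration of why `roc_auc` suits this problem: AUC depends only on how well positive cases are ranked above negative ones, which is why it stays informative under the 82/18 class imbalance where plain accuracy would not. The scores below are made up, not real model output.

```python
from sklearn.metrics import roc_auc_score

# 5 of the 6 positive/negative pairs are ranked correctly -> AUC = 5/6.
y_true  = [0, 0, 0, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.8333...
```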
3. Define a custom function to format the search results and return them as a pandas DataFrame
In [37]:
def get_search_results(gs):
    def model_result(scores, params):
        scores = {'mean_score': np.mean(scores),
             'std_score': np.std(scores),
             'min_score': np.min(scores),
             'max_score': np.max(scores)}
        return pd.Series({**params,**scores})
    models = []
    scores = []
    for i in range(gs.n_splits_):
        key = f"split{i}_test_score"
        r = gs.cv_results_[key]        
        scores.append(r.reshape(-1,1))
    all_scores = np.hstack(scores)
    for p, s in zip(gs.cv_results_['params'], all_scores):
        models.append((model_result(s, p)))
    pipe_results = pd.concat(models, axis=1).T.sort_values(['mean_score'], ascending=False)
    columns_first = ['mean_score', 'std_score', 'max_score', 'min_score']
    columns = columns_first + [c for c in pipe_results.columns if c not in columns_first]
    return pipe_results[columns]

KNN

In [38]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
1. Use grid search for KNN hyperparameter tuning via cross-validation on the **train** data. For scoring, use AUC, that is, "area under the ROC curve". 

- Perform feature selection using `SelectKBest` with the `f_classif` and `mutual_info_classif` score functions and 5, 10 and 20 features.
- Train a KNN model with `p` values in {1, 2, 5} and `n_neighbors` values in {10, 25, 50, 75, 100, 150, 200, 300}.
In [39]:
from sklearn.neighbors import KNeighborsClassifier

pipe_KNN = Pipeline([('fselector', SelectKBest()), 
                     ('knn', KNeighborsClassifier())])

params_pipe_KNN = {'fselector__score_func': [f_classif, mutual_info_classif],
                   'fselector__k': [5, 10, 20],
                   'knn__n_neighbors': [10, 25, 50, 75,100,150, 200, 300],
                   'knn__p': [1, 2, 5]}
 
gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN, 
                           param_grid=params_pipe_KNN, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='roc_auc',
                           verbose=1) 
In [40]:
gs_pipe_KNN.fit(D_train, t_train);
Fitting 10 folds for each of 144 candidates, totalling 1440 fits
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-2)]: Done 683 tasks      | elapsed:   31.1s
[Parallel(n_jobs=-2)]: Done 1402 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-2)]: Done 1440 out of 1440 | elapsed:  1.6min finished
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
In [41]:
results_KNN = get_search_results(gs_pipe_KNN)
results_KNN.head()
Out[41]:
mean_score std_score max_score min_score fselector__k fselector__score_func knn__n_neighbors knn__p
12 0.832758 0.0336231 0.878735 0.792276 5 <function f_classif at 0x0000026419EAC708> 100 1
8 0.83178 0.0433289 0.89288 0.776542 5 <function f_classif at 0x0000026419EAC708> 50 5
10 0.831373 0.0414457 0.889383 0.779879 5 <function f_classif at 0x0000026419EAC708> 75 2
11 0.831328 0.0416365 0.891291 0.779085 5 <function f_classif at 0x0000026419EAC708> 75 5
13 0.830208 0.0370159 0.882708 0.782263 5 <function f_classif at 0x0000026419EAC708> 100 2
In [42]:
results_KNN = pd.DataFrame(gs_pipe_KNN.cv_results_['params'])
results_KNN['test_score'] = gs_pipe_KNN.cv_results_['mean_test_score']
In [43]:
results_KNN['knn__p'] = results_KNN['knn__p'].replace([1,2,5], ["Manhattan", "Euclidean", "Minkowski"])
In [44]:
results_KNN['fselector__score_func'] = results_KNN['fselector__score_func'].astype(str)
results_KNN.loc[results_KNN['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif" 
results_KNN.loc[results_KNN['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
In [45]:
g = sns.FacetGrid(results_KNN,  height=6, col="fselector__k", row="fselector__score_func", hue = "knn__p")
g.map(plt.plot, "knn__n_neighbors", "test_score", alpha=.7,  marker=".")
g.add_legend()
g.fig.suptitle("KNN performance comparison", size=18)
g.fig.subplots_adjust(top=.9)

For f_classif with k = 5, the score peaks at n_neighbors = 50 (Minkowski) or 75 (Manhattan). For the other f_classif parameter combinations, although the score is highest when n_neighbors > 150, we suspect such large neighbourhoods would not generalise well given only 733 observations in D_train. For mutual_info_classif, the score fluctuates between 0.78 and 0.82 across parameter combinations; since there is no obvious upward trend in the tail, tuning of the KNN model can be considered complete.
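Once tuning looks converged, the winning combination can also be read straight off the fitted search object and compared against the top row of the custom results table. A minimal sketch on synthetic data (the grid and sizes are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small imbalanced synthetic problem standing in for the real train split.
X, y = make_classification(n_samples=200, weights=[0.8], random_state=0)

# A fitted GridSearchCV exposes best_params_ and best_score_ directly.
gs = GridSearchCV(KNeighborsClassifier(),
                  {"n_neighbors": [5, 15], "p": [1, 2]},
                  scoring="roc_auc", cv=3).fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))
```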

Decision Tree

1. Use grid search for DT hyperparameter tuning via cross-validation on the **train** data. For scoring, use AUC, that is, "area under the ROC curve". 

- Perform feature selection using `SelectKBest` with the `f_classif` and `mutual_info_classif` score functions and 5, 10 and 20 features.
- Train a DT model with both `gini` and `entropy` criteria, `min_samples_split` values in {10, 50, 60, 70, 80, 100, 150} and `max_depth` values in {3, 5, 7, 9}.
In [46]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(random_state=999)  # fixed seed for reproducibility


pipe_DT = Pipeline([('fselector', SelectKBest()), 
                     ('dt', dt_classifier)])

params_pipe_DT = {'fselector__score_func': [f_classif, mutual_info_classif],
                  'fselector__k': [5, 10, 20],
                  'dt__criterion': ['gini', 'entropy'],
                  'dt__min_samples_split': [ 10, 50, 60, 70, 80, 100, 150],
                  'dt__max_depth': [3, 5, 7, 9]}

 
gs_pipe_DT = GridSearchCV(estimator=pipe_DT, 
                           param_grid=params_pipe_DT, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='roc_auc',
                           verbose=1) 


gs_pipe_DT.fit(D_train, t_train);
Fitting 10 folds for each of 336 candidates, totalling 3360 fits
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  55 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-2)]: Done 207 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-2)]: Done 457 tasks      | elapsed:   39.0s
[Parallel(n_jobs=-2)]: Done 807 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-2)]: Done 1257 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-2)]: Done 1807 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-2)]: Done 2457 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-2)]: Done 3207 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-2)]: Done 3360 out of 3360 | elapsed:  4.1min finished
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
In [47]:
results_DT = get_search_results(gs_pipe_DT)
results_DT.head()
Out[47]:
mean_score std_score max_score min_score dt__criterion dt__max_depth dt__min_samples_split fselector__k fselector__score_func
96 0.793543 0.03456 0.852829 0.749682 gini 7 60 5 <function f_classif at 0x0000026419EAC708>
60 0.792593 0.0468779 0.866179 0.732517 gini 5 70 5 <function f_classif at 0x0000026419EAC708>
237 0.791185 0.0546864 0.846795 0.654005 entropy 5 80 10 <function mutual_info_classif at 0x00000264179...
54 0.790705 0.0459067 0.866179 0.729339 gini 5 60 5 <function f_classif at 0x0000026419EAC708>
48 0.78979 0.0422831 0.867451 0.720121 gini 5 50 5 <function f_classif at 0x0000026419EAC708>
In [48]:
results_DT = pd.DataFrame(gs_pipe_DT.cv_results_['params'])
results_DT['test_score'] = gs_pipe_DT.cv_results_['mean_test_score']
results_DT_gini = results_DT[results_DT['dt__criterion']=='gini']
results_DT_entropy = results_DT[results_DT['dt__criterion']=='entropy']
results_DT_gini.columns
Out[48]:
Index(['dt__criterion', 'dt__max_depth', 'dt__min_samples_split',
       'fselector__k', 'fselector__score_func', 'test_score'],
      dtype='object')
In [49]:
results_DT_gini['fselector__score_func'] = results_DT_gini['fselector__score_func'].astype(str)
results_DT_gini.loc[results_DT_gini['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif" 
results_DT_gini.loc[results_DT_gini['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
In [50]:
g = sns.FacetGrid(results_DT_gini,   height=6, col="fselector__k", row="fselector__score_func", hue="dt__max_depth")
g.map(plt.plot, "dt__min_samples_split", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Decision Tree performance comparison (Gini index)", size=18)
g.fig.subplots_adjust(top=.9)
In [51]:
results_DT_entropy['fselector__score_func'] = results_DT_entropy['fselector__score_func'].astype(str)
results_DT_entropy.loc[results_DT_entropy['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif" 
results_DT_entropy.loc[results_DT_entropy['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
In [52]:
g = sns.FacetGrid(results_DT_entropy,  height=6, col="fselector__k", row="fselector__score_func", hue="dt__max_depth")
g.map(plt.plot, "dt__min_samples_split", "test_score", alpha=.7, marker=".")
g.add_legend()
g.fig.suptitle("Decision Tree performance comparison (Entropy)", size=18)
g.fig.subplots_adjust(top=.9)

For both the Gini and entropy criteria, the curves in all the facets trend downward once min_samples_split > 100, which suggests we do not need to add any grid values beyond 100. The scores vary across parameter combinations when min_samples_split is between 50 and 100, so we added more values (60, 70 and 80) in that range to complete the tuning of the decision tree model.
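To illustrate why larger min_samples_split values eventually hurt the score, here is a small sketch on synthetic data (not the dating data): a node is only eligible for splitting if it holds at least `min_samples_split` samples, so raising the value yields shallower, less overfit trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the real D_train is not reproduced here).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
t = (X[:, 0] + X[:, 1] > 0).astype(int)

# Larger min_samples_split forbids splitting small nodes, so the tree
# is implicitly pruned and its depth shrinks (or at least never grows).
depths = {}
for mss in (2, 50, 150):
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=0).fit(X, t)
    depths[mss] = tree.get_depth()
print(depths)
```

This is consistent with the downward trend in the plots once min_samples_split exceeds 100: with only 733 training observations, very large values prune too aggressively.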

Random Forest

1. Using grid search for Random Forest hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve". 

- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Random Forest model with `n_estimators` values in {5, 10, 20, 50, 100, 150, 200},  `max_depth` values in {5, 7, 10, 12}.
In [53]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=999)

pipe_RF = Pipeline([('fselector', SelectKBest()), 
                     ('rf', rf_classifier)])

params_pipe_RF = {'fselector__score_func': [f_classif, mutual_info_classif],
                  'fselector__k': [5, 10, 20],
                  'rf__n_estimators': [5, 10, 20, 50, 100, 150, 200],
                  'rf__max_depth': [5, 7, 10, 12]}

 
gs_pipe_RF = GridSearchCV(estimator=pipe_RF, 
                           param_grid=params_pipe_RF, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='roc_auc',
                           verbose=1) 


gs_pipe_RF.fit(D_train, t_train);
Fitting 10 folds for each of 168 candidates, totalling 1680 fits
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  74 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-2)]: Done 347 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-2)]: Done 597 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-2)]: Done 947 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-2)]: Done 1429 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-2)]: Done 1680 out of 1680 | elapsed:  3.5min finished
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
In [54]:
results_RF = get_search_results(gs_pipe_RF)
results_RF.head()
Out[54]:
mean_score std_score max_score min_score fselector__k fselector__score_func rf__max_depth rf__n_estimators
165 0.833761 0.0391557 0.890496 0.777404 20 <function mutual_info_classif at 0x00000264179... 12 100
167 0.833072 0.0420947 0.898284 0.780128 20 <function mutual_info_classif at 0x00000264179... 12 200
5 0.83234 0.0347954 0.880006 0.77225 5 <function f_classif at 0x0000026419EAC708> 5 150
6 0.832308 0.0358723 0.880642 0.771297 5 <function f_classif at 0x0000026419EAC708> 5 200
10 0.83163 0.0434665 0.886364 0.747139 5 <function f_classif at 0x0000026419EAC708> 7 50
In [55]:
results_RF = pd.DataFrame(gs_pipe_RF.cv_results_['params'])
results_RF['test_score'] = gs_pipe_RF.cv_results_['mean_test_score']
In [56]:
results_RF['fselector__score_func'] = results_RF['fselector__score_func'].astype(str)
results_RF.loc[results_RF['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif" 
results_RF.loc[results_RF['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
In [57]:
g = sns.FacetGrid(results_RF, height=6,col="fselector__k", row="fselector__score_func", hue="rf__max_depth")
g.map(plt.plot, "rf__n_estimators", "test_score", alpha=.7,  marker=".")
g.add_legend()
g.fig.suptitle("Random Forest performance comparison", size=18)
g.fig.subplots_adjust(top=.9)

For f_classif, the curves flatten once n_estimators reaches roughly 25 to 50. For mutual_info_classif the score fluctuates a bit more, but the trend flattens as well. Since there are only 733 observations in D_train, going beyond 200 for n_estimators would make little sense, so from the graph above we can conclude that tuning is complete for the random forest model.

Naive Bayes

1. Using grid search for Naive Bayes hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve". 

- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Naive Bayes model with `var_smoothing` values log-spaced (`logspace`) from $10^1$ down to $10^{-9}$
In [58]:
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer

nb_classifier = GaussianNB()

pipe_NB = Pipeline([('fselector', SelectKBest()), 
                     ('nb', nb_classifier)])

params_pipe_NB = {'fselector__score_func': [f_classif, mutual_info_classif],
                  'fselector__k': [5, 10, 20],
                  'nb__var_smoothing': np.logspace(1,-9, num=100)}

 
gs_pipe_NB = GridSearchCV(estimator=pipe_NB, 
                           param_grid=params_pipe_NB, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='roc_auc',
                           verbose=1) 

Data_transformed = PowerTransformer().fit_transform(D_train)

gs_pipe_NB.fit(Data_transformed, t_train);
Fitting 10 folds for each of 600 candidates, totalling 6000 fits
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  74 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-2)]: Done 1390 tasks      | elapsed:   54.8s
[Parallel(n_jobs=-2)]: Done 1640 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-2)]: Done 1990 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-2)]: Done 3539 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-2)]: Done 4701 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-2)]: Done 6000 out of 6000 | elapsed:  7.0min finished
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
In [59]:
results_NB = get_search_results(gs_pipe_NB)
results_NB.head()
Out[59]:
mean_score std_score max_score min_score fselector__k fselector__score_func nb__var_smoothing
373 0.826385 0.0461895 0.887821 0.76637 10 <function mutual_info_classif at 0x00000264179... 4.22924e-07
393 0.826037 0.0489426 0.89288 0.726282 10 <function mutual_info_classif at 0x00000264179... 4.03702e-09
392 0.824496 0.0395232 0.894872 0.76192 10 <function mutual_info_classif at 0x00000264179... 5.09414e-09
374 0.824082 0.0375718 0.86904 0.768277 10 <function mutual_info_classif at 0x00000264179... 3.3516e-07
233 0.823413 0.0459822 0.883026 0.745709 10 <function f_classif at 0x0000026419EAC708> 0.00464159
In [60]:
results_NB = pd.DataFrame(gs_pipe_NB.cv_results_['params'])
results_NB['test_score'] = gs_pipe_NB.cv_results_['mean_test_score']
In [61]:
results_NB['fselector__score_func'] = results_NB['fselector__score_func'].astype(str)
results_NB.loc[results_NB['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif" 
results_NB.loc[results_NB['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
In [62]:
g = sns.FacetGrid(results_NB, height=6,col ="fselector__k", hue ="fselector__score_func")
g.map(plt.plot, "nb__var_smoothing", "test_score",  marker=".", alpha=.7)
g.add_legend()
g.fig.suptitle("Naive Bayes performance comparison", size=18)
g.fig.subplots_adjust(top=.9)

The f_classif curves are very steady across the tuning range, while the mutual_info_classif score does fluctuate. Gaussian Naive Bayes assumes that the descriptive features follow a normal distribution, which is why we power-transformed D_train above. Searching var_smoothing over the logspace from $10^1$ to $10^{-9}$ produces some variation in the mutual_info_classif score but no significant improvement, which suggests there is no need to tune the Naive Bayes model any further.
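As a sanity check on the grid itself: `np.logspace(1, -9, num=100)` generates 100 log-spaced values from $10^{1}$ down to $10^{-9}$. In scikit-learn's `GaussianNB`, `var_smoothing` is the portion of the largest feature variance that is added to all variances for calculation stability.

```python
import numpy as np

# The grid used above: 100 log-spaced values from 10**1 down to 10**-9.
grid = np.logspace(1, -9, num=100)
print(grid[0], grid[-1], len(grid))
```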

Logistic Regression

1. Using grid search for Logistic Regression hyperparameter tuning via cross-validation using the **train** data. For scoring, use AUC, that is, "area under the ROC curve". 

- Perform feature selection using `SelectKBest`. We shall use the `f_classif` and `mutual_info_classif` score functions with 5, 10 and 20 features.
- Train a Logistic Regression model with `penalty` values in {l1, l2}, `C` values in {0.1, 1, 10, 100, 500, 1000, 1500}, and `class_weight` set to `balanced`
In [63]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()


pipe_LR = Pipeline([('fselector', SelectKBest()), 
                     ('lr', lr)])

params_pipe_LR = {'fselector__score_func': [f_classif, mutual_info_classif],
                  'fselector__k': [5, 10, 20],
                  'lr__penalty': ['l1','l2'],
                  'lr__C': [0.1,1,10,100, 500,1000, 1500],
                  'lr__class_weight': ['balanced']}


 
gs_pipe_LR = GridSearchCV(estimator=pipe_LR, 
                           param_grid=params_pipe_LR, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='roc_auc',
                           verbose=1) 


gs_pipe_LR.fit(D_train, t_train);
Fitting 10 folds for each of 84 candidates, totalling 840 fits
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  74 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-2)]: Done 550 tasks      | elapsed:   42.8s
[Parallel(n_jobs=-2)]: Done 840 out of 840 | elapsed:  1.0min finished
2. Using the custom `get_search_results()` function, display the top 5 combinations of the pipeline.
In [64]:
results_LR = get_search_results(gs_pipe_LR)
results_LR.head()
Out[64]:
mean_score std_score max_score min_score fselector__k fselector__score_func lr__C lr__class_weight lr__penalty
73 0.830106 0.0342993 0.880801 0.769872 20 <function mutual_info_classif at 0x00000264179... 1 balanced l2
59 0.829682 0.0414529 0.884933 0.766026 20 <function f_classif at 0x0000026419EAC708> 1 balanced l2
61 0.829111 0.0357382 0.879212 0.778526 20 <function f_classif at 0x0000026419EAC708> 10 balanced l2
63 0.827005 0.0340319 0.880483 0.779167 20 <function f_classif at 0x0000026419EAC708> 100 balanced l2
65 0.826974 0.034029 0.880801 0.778846 20 <function f_classif at 0x0000026419EAC708> 500 balanced l2
In [65]:
results_LR = pd.DataFrame(gs_pipe_LR.cv_results_['params'])
results_LR['test_score'] = gs_pipe_LR.cv_results_['mean_test_score']
In [66]:
results_LR['fselector__score_func'] = results_LR['fselector__score_func'].astype(str)
results_LR.loc[results_LR['fselector__score_func'].str.find("f_classif")!= -1, 'fselector__score_func']="f_classif" 
results_LR.loc[results_LR['fselector__score_func'].str.find("mutual_info")!= -1, 'fselector__score_func']="mutual_info"
3. Visualize the tuning results.
In [67]:
g = sns.FacetGrid(results_LR, height=6,col="fselector__k", row="fselector__score_func", hue="lr__penalty")
g.map(plt.plot, "lr__C", "test_score", alpha=.7,  marker=".")
g.add_legend()
g.fig.suptitle("Logistic Regression performance comparison", size=18)
g.fig.subplots_adjust(top=.9)

The f_classif curves flatten once C = 1; the mutual_info_classif score does fluctuate, but never exceeds 0.83 for C between 10 and 1500, which suggests that tuning is complete for the logistic regression model.
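One caveat worth flagging: in recent scikit-learn releases the default `lbfgs` solver of `LogisticRegression` supports only the l2 penalty, so the l1 grid entries require a compatible solver such as `liblinear` or `saga`. A minimal sketch on made-up data (not the dating data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for D_train / t_train.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
t = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 'liblinear' supports both l1 and l2; the default 'lbfgs' is l2-only.
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0,
                           class_weight='balanced').fit(X, t)
print(lr_l1.coef_.shape)  # one coefficient row per feature set: (1, 3)
```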

Performance Comparison and Paired T-Test

1. Compare the metrics by generating the classification reports for all 5 models with their best estimator using the **unseen** test portion split from the `hold-out` sample that we prepared earlier.
In [68]:
# KNN
clf_best = gs_pipe_KNN.best_estimator_
predictions = clf_best.predict(D_test)
print("KNN :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")

#DT
clf_best = gs_pipe_DT.best_estimator_
predictions = clf_best.predict(D_test)
print("Decision Tree :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")

#RF
clf_best = gs_pipe_RF.best_estimator_
predictions = clf_best.predict(D_test)
print("Random Forest :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")

#NB
clf_best = gs_pipe_NB.best_estimator_
predictions = clf_best.predict(D_test)
print("Naive Bayes :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")

#LR
clf_best = gs_pipe_LR.best_estimator_
predictions = clf_best.predict(D_test)
print("Logistic Regression :")
print("Accuracy : ", metrics.accuracy_score(t_test, predictions))
print("Confusion Matrix : \n",metrics.confusion_matrix(t_test, predictions))
print("Classification Report: \n",metrics.classification_report(t_test, predictions))
print("=================================================================================")
KNN :
Accuracy :  0.8634920634920635
Confusion Matrix : 
 [[259   0]
 [ 43  13]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.86      1.00      0.92       259
           1       1.00      0.23      0.38        56

    accuracy                           0.86       315
   macro avg       0.93      0.62      0.65       315
weighted avg       0.88      0.86      0.83       315

=================================================================================
Decision Tree :
Accuracy :  0.8507936507936508
Confusion Matrix : 
 [[254   5]
 [ 42  14]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.98      0.92       259
           1       0.74      0.25      0.37        56

    accuracy                           0.85       315
   macro avg       0.80      0.62      0.64       315
weighted avg       0.84      0.85      0.82       315

=================================================================================
Random Forest :
Accuracy :  0.8603174603174604
Confusion Matrix : 
 [[253   6]
 [ 38  18]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.98      0.92       259
           1       0.75      0.32      0.45        56

    accuracy                           0.86       315
   macro avg       0.81      0.65      0.69       315
weighted avg       0.85      0.86      0.84       315

=================================================================================
Naive Bayes :
Accuracy :  0.26666666666666666
Confusion Matrix : 
 [[ 28 231]
 [  0  56]]
Classification Report: 
               precision    recall  f1-score   support

           0       1.00      0.11      0.20       259
           1       0.20      1.00      0.33        56

    accuracy                           0.27       315
   macro avg       0.60      0.55      0.26       315
weighted avg       0.86      0.27      0.22       315

=================================================================================
Logistic Regression :
Accuracy :  0.7333333333333333
Confusion Matrix : 
 [[184  75]
 [  9  47]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.95      0.71      0.81       259
           1       0.39      0.84      0.53        56

    accuracy                           0.73       315
   macro avg       0.67      0.77      0.67       315
weighted avg       0.85      0.73      0.76       315

=================================================================================

Since match = 1 is the positive label, the confusion matrices above can be interpreted as:

$$CM = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$$

where TN denotes True Negative, TP denotes True Positive, FP denotes False Positive, and FN denotes False Negative.

In the classification report:

  • $Precision_{0}=\frac {TN} {(TN+FN)} $, $Precision_{1}=\frac {TP} {(TP+FP)} $, $Macro Avg_{Precision}=\frac {Precision_{0} + Precision_{1}} {2} $

    $WeightedAvg_{Precision}=\frac {Precision_{0} * Number_{0} + Precision_{1} * Number_{1}} {Number_{0} + Number_{1}} $

  • $Recall_{0}=\frac {TN} {(TN+FP)} $, $Recall_{1}=\frac {TP} {(TP+FN)} $, $Macro Avg_{Recall}=\frac {Recall_{0} + Recall_{1}} {2} $, $Weighted Avg_{Recall}=\frac {Recall_{0} * Number_{0} + Recall_{1} * Number_{1}} {Number_{0} + Number_{1}} $
  • $Accuracy=\frac {TP + TN } {(TP+TN+FP+FN)} $

  • $F1_{0}=\frac {2 * (Precision_{0} * Recall_{0})} {(Precision_{0} + Recall_{0})} $, $F1_{1}=\frac {2 * (Precision_{1} * Recall_{1})} {(Precision_{1} + Recall_{1})} $, $Macro Avg_{F1}=\frac {F1_{0} + F1_{1}} {2} $, $Weighted Avg_{F1}=\frac {F1_{0} * Number_{0} + F1_{1} * Number_{1}} {Number_{0} + Number_{1}} $

For example, $$CM_{KNN} = \begin{bmatrix} 259 & 0 \\ 43 & 13 \end{bmatrix}$$

TN = 259, FP = 0, FN = 43, TP = 13, $Number_{0}$ = 259, $Number_{1}$ = 56
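The report values can be reproduced directly from the formulas above. A short sketch using the KNN confusion matrix printed earlier ([[259, 0], [43, 13]]):

```python
# Reproduce the KNN classification report by hand from its confusion matrix.
TN, FP, FN, TP = 259, 0, 43, 13

precision_0 = TN / (TN + FN)                                  # 0.86
recall_0 = TN / (TN + FP)                                     # 1.00
precision_1 = TP / (TP + FP)                                  # 1.00
recall_1 = TP / (TP + FN)                                     # 0.23
accuracy = (TP + TN) / (TP + TN + FP + FN)                    # 0.86
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # 0.38

print(round(recall_1, 2), round(accuracy, 2), round(f1_1, 2))  # 0.23 0.86 0.38
```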

From the classification reports of all the models, we could summarize as below:

  • KNN - All metrics are good except $Recall_{1}$, which is low (0.23): TP is 13 but FN is 43, meaning 43 out of 56 positive labels were classified as negative and only 13 correctly.
  • Decision Tree - All metrics are good except $Recall_{1}$, which is low (0.25, slightly higher than KNN): TP is 14 but FN is 42, meaning 42 out of 56 positive labels were classified as negative and only 14 correctly.
  • Random Forest - All metrics are good except $Recall_{1}$, which is low (0.32): TP is 18 but FN is 38, meaning 38 out of 56 positive labels were classified as negative and only 18 correctly.
  • Naive Bayes - All 56 positive labels are classified correctly, but only 28 out of 259 negative labels are; the other 231 negative observations are misclassified, which is why the accuracy of this model is so low.
  • Logistic Regression - 47 out of 56 positive labels and 184 out of 259 negative labels are classified correctly, showing that this model is better at classifying positive labels than negative ones.

Thus we find that the best estimators of Naive Bayes and Logistic Regression distinguish positive labels better than the other models, while KNN, Decision Tree and Random Forest make fewer mistakes when predicting negative labels on the unseen test dataset.

2. Prepare a dataframe with the `fpr` and `tpr` for all 5 models with their best estimator by calculating the prediction scores using the `predict_proba` method in `Scikit-learn`, and plot the ROC curves of all these models in the same graph.
In [69]:
clf_best = gs_pipe_KNN.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
KNN_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
KNN_df['model']='KNN'
neworder = ['model','fpr','tpr']
KNN_df=KNN_df.reindex(columns=neworder)


clf_best = gs_pipe_DT.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
DT_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
DT_df['model']='Decision Tree'
neworder = ['model','fpr','tpr']
DT_df=DT_df.reindex(columns=neworder)

clf_best = gs_pipe_RF.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
RF_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
RF_df['model']='Random Forest'
neworder = ['model','fpr','tpr']
RF_df=RF_df.reindex(columns=neworder)

clf_best = gs_pipe_NB.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
NB_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
NB_df['model']='Naive Bayes'
neworder = ['model','fpr','tpr']
NB_df=NB_df.reindex(columns=neworder)

clf_best = gs_pipe_LR.best_estimator_
t_prob = clf_best.predict_proba(D_test)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:, 1])
LR_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
LR_df['model']='Logistic Regression'
neworder = ['model','fpr','tpr']
LR_df=LR_df.reindex(columns=neworder)

result = pd.concat([KNN_df, DT_df, RF_df, NB_df, LR_df], ignore_index=True, sort=False)


x=np.linspace(0.0,1.0,101)
plt.figure(figsize=(10, 10))
sns.lineplot(x="fpr", y="tpr", hue="model", data=result)
plt.plot(x, x, color='black')
plt.title("Comparison of Roc Curve of 5 classifiers", fontsize = 16)
plt.show()

From the ROC curves, we find that apart from the decision tree, the other 4 models perform similarly on the unseen dataset (an ideal model would jump to tpr = 1 at fpr = 0).
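The AUC being compared is simply the area under each ROC curve, i.e. the integral of tpr over fpr. A minimal sketch on a toy curve (not one of the models above), using the same trapezoidal rule that `metrics.auc` applies:

```python
import numpy as np

# Toy ROC points (fpr sorted ascending). An ideal model would reach
# tpr = 1 already at fpr = 0, giving an AUC of 1.0.
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

# Trapezoidal rule: sum of segment width * average height.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(round(auc, 3))  # 0.825
```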

3. Let's use a paired t-test to compare the mean scores of the models against each other and look for any significant differences. We set a different random seed this time.
In [70]:
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5, 
                                          n_repeats=3, 
                                          random_state=222)

cv_results_KNN = cross_val_score(estimator=gs_pipe_KNN.best_estimator_,
                                 X=Data,
                                 y=target, 
                                 cv=cv_method_ttest, 
                                 n_jobs=-2,
                                 scoring='roc_auc')
cv_results_DT = cross_val_score(estimator=gs_pipe_DT.best_estimator_,
                                X=Data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='roc_auc')
cv_results_RF = cross_val_score(estimator=gs_pipe_RF.best_estimator_,
                                X=Data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='roc_auc')
cv_results_NB = cross_val_score(estimator=gs_pipe_NB.best_estimator_,
                                X=Data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='roc_auc')
cv_results_LR = cross_val_score(estimator=gs_pipe_LR.best_estimator_,
                                X=Data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='roc_auc')

print("KNN cross validation mean score:", cv_results_KNN.mean().round(3))
print("Decision Tree cross validation mean score:", cv_results_DT.mean().round(3))
print("Random Forest cross validation mean score:", cv_results_RF.mean().round(3))
print("Naive Bayes cross validation mean score:", cv_results_NB.mean().round(3))
print("Logistic Regression cross validation mean score:", cv_results_LR.mean().round(3))
print("==================================================================")


print("Difference between KNN and Decision Tree: ", stats.ttest_rel(cv_results_KNN, cv_results_DT).pvalue)
print("Difference between KNN and Random Forest: ",stats.ttest_rel(cv_results_KNN, cv_results_RF).pvalue)
print("Difference between KNN and Naive Bayes: ",stats.ttest_rel(cv_results_KNN, cv_results_NB).pvalue)
print("Difference between KNN and Logistic Regression: ", stats.ttest_rel(cv_results_KNN, cv_results_LR).pvalue)

print("==================================================================")


print("Difference between Decision Tree and Random Forest: ", stats.ttest_rel(cv_results_DT, cv_results_RF).pvalue)
print("Difference between Decision Tree and Naive Bayes: ", stats.ttest_rel(cv_results_DT, cv_results_NB).pvalue)
print("Difference between Decision Tree and Logistic Regression: ", stats.ttest_rel(cv_results_DT, cv_results_LR).pvalue)
print("==================================================================")

print("Difference between Random Forest and Naive Bayes: ", stats.ttest_rel(cv_results_RF, cv_results_NB).pvalue)
print("Difference between Random Forest and Logistic Regression: ", stats.ttest_rel(cv_results_RF, cv_results_LR).pvalue)

print("Difference between Logistic Regression and Naive Bayes: ", stats.ttest_rel(cv_results_NB, cv_results_LR).pvalue)
KNN cross validation mean score: 0.824
Decision Tree cross validation mean score: 0.784
Random Forest cross validation mean score: 0.839
Naive Bayes cross validation mean score: 0.832
Logistic Regression cross validation mean score: 0.833
==================================================================
Difference between KNN and Decision Tree:  2.235138967706454e-05
Difference between KNN and Random Forest:  0.03945350976204024
Difference between KNN and Naive Bayes:  0.21035155300613778
Difference between KNN and Logistic Regression:  0.15251342670858226
==================================================================
Difference between Decision Tree and Random Forest:  6.135442923421278e-06
Difference between Decision Tree and Naive Bayes:  2.038343652173581e-05
Difference between Decision Tree and Logistic Regression:  6.232179612259313e-05
==================================================================
Difference between Random Forest and Naive Bayes:  0.3083120733642208
Difference between Random Forest and Logistic Regression:  0.3824185833221857
Difference between Logistic Regression and Naive Bayes:  0.9441211442623972

At the 95% confidence level, the decision tree differs significantly (p-value < 0.05) in mean AUC score from every other model; the difference between KNN and random forest is also marginally significant (p ≈ 0.039), while no other pairwise difference is significant.
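As a reference for how these p-values arise: `stats.ttest_rel` compares the two models fold by fold on the same CV splits, not just their overall means. A minimal sketch with hypothetical per-fold AUC scores:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold AUC scores for two models on the SAME folds.
scores_a = np.array([0.82, 0.84, 0.80, 0.83, 0.81])
scores_b = np.array([0.78, 0.79, 0.77, 0.80, 0.76])

# The paired test examines the per-fold differences, so shared fold
# difficulty cancels out and small but consistent gaps become detectable.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(p_value < 0.05)  # True here: model A is consistently ahead
```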

Critique of our approach

One of the biggest limitations of our project is the handling of missing values: we dropped every observation containing a missing value. Below are the top 3 features with the most missing values:

* expected_num_interested_in_me, 6578 observations with missing values
* expected_num_matches, 1173 observations with missing values
* shared_interests_partner, 1067 observations with missing values

If we fit the whole dataset to select the best 5 features with `f_classif`:

In [71]:
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=5)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:5]
fs_indices_fscore
Out[71]:
array([57, 30, 58, 13, 18], dtype=int64)
In [72]:
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
Out[72]:
array(['like', 'attractive_partner', 'guess_prob_liked', 'attractive_o',
       'shared_interests_o'], dtype=object)

None of the top 3 features with missing values would be selected.

If we fit the whole dataset to select the best 10 features with `f_classif` and `mutual_info_classif`:

In [73]:
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=10)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
fs_indices_fscore
Out[73]:
array([57, 30, 58, 13, 18, 35, 16, 33, 56, 49], dtype=int64)
In [74]:
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
Out[74]:
array(['like', 'attractive_partner', 'guess_prob_liked', 'attractive_o',
       'shared_interests_o', 'shared_interests_partner', 'funny_o',
       'funny_partner', 'expected_num_matches', 'concerts'], dtype=object)
In [75]:
fs_fit_fscore = fs.SelectKBest(fs.mutual_info_classif, k=10)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
fs_indices_fscore
Out[75]:
array([58, 53, 57, 16, 18, 56, 13, 30, 35, 31], dtype=int64)
In [76]:
best_features_fscore = Data_df.columns[fs_indices_fscore].values
best_features_fscore
Out[76]:
array(['guess_prob_liked', 'interests_correlate', 'like', 'funny_o',
       'shared_interests_o', 'expected_num_matches', 'attractive_o',
       'attractive_partner', 'shared_interests_partner',
       'sincere_partner'], dtype=object)

expected_num_matches and shared_interests_partner would then be selected, but expected_num_interested_in_me would still not appear in the top 10 under either the f_classif or the mutual_info_classif metric. It would therefore be interesting to repeat the whole analysis after removing the expected_num_interested_in_me feature before dropping any observations with missing values. We would retain a much larger dataset that way, and we believe the results would change dramatically.
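The trade-off described above can be sketched on toy data: dropping the worst-offending column first, then dropping rows with remaining missing values, retains more observations than dropping rows straight away. The NaN pattern below is made up for illustration:

```python
import pandas as pd
import numpy as np

# Toy frame: expected_num_interested_in_me carries most of the missingness
df = pd.DataFrame({
    "expected_num_interested_in_me": [np.nan, np.nan, np.nan, 2.0, 1.0],
    "like": [7.0, 6.0, np.nan, 8.0, 5.0],
    "match": [0, 1, 0, 1, 0],
})

# Strategy A: drop rows with any missing value immediately
rows_dropna_first = len(df.dropna())

# Strategy B: drop the worst column first, then drop rows with missing values
rows_column_dropped_first = len(
    df.drop(columns=["expected_num_interested_in_me"]).dropna()
)

print(rows_dropna_first, rows_column_dropped_first)  # Strategy B keeps more rows
```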

We also deleted all the binned features. However, on reviewing the two-variable plots we found that the values of interests_correlate are not discrete, so we could also compare the analysis with this feature binned instead. If time and resources allowed, it would further be interesting to train our classifiers on all the descriptive features and see whether the parameter tuning changes.
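Binning the continuous interests_correlate values (a correlation, so roughly in [-1, 1]) could be done with pandas. A sketch with hypothetical bin edges and labels, which are not taken from the dataset's own binning scheme:

```python
import pandas as pd

# Illustrative interests_correlate values and hypothetical bin edges
vals = pd.Series([-0.45, 0.02, 0.31, 0.77, 0.90])
binned = pd.cut(vals, bins=[-1.0, 0.0, 0.5, 1.0],
                labels=["low", "medium", "high"])
print(list(binned))
```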

On the other hand, cross-validating while tuning all the parameters of our five dedicated models is a strength of our investigation: when we repeat the cross-validation with different random seeds, the mean scores stay close to those of the best estimator obtained with the original seed on the training dataset.
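The seed-stability check described above can be sketched as follows. This uses synthetic data and an arbitrary KNN configuration, not our actual models or tuned parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the speed dating set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# Re-run stratified 5-fold CV with several shuffling seeds and compare means
means = []
for seed in (0, 1, 2):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    means.append(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean())

print(means, max(means) - min(means))  # a small spread suggests a stable model
```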

Summary and Conclusion

The main goal of this analysis is to predict whether a person would match with (agree to a second date with) another person, based on the data collected during speed dating.

We found that data visualization helped the analysis considerably:

  1. we gained an in-depth understanding of the dataset after generating all the variable plots
  2. we could adjust our parameter values much more easily by observing the fine-tuning plots

Among the trained models with optimal parameters, we find that Naive Bayes distinguishes a match with the highest accuracy. However, if we applied this model to help a speed dating agency identify matches on unseen data, we might end up contacting too many couples that would not match (for example, on the test dataset we would contact 188 + 55 potential match cases, of which the 188 false positives would be wasted effort).

In contrast, if we used the KNN, decision tree, or random forest model, the test dataset shows that we would miss a high percentage of match cases (e.g. for KNN, we would miss 67.9% (38/56) of potential matches).
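The KNN miss rate quoted above follows directly from the confusion-matrix counts (56 actual matches in the test set, of which 38 are false negatives). A quick check:

```python
# Recall and miss rate from the KNN confusion-matrix counts quoted above:
# 56 actual matches, 38 of them missed (false negatives)
fn = 38
tp = 56 - 38
recall = tp / (tp + fn)       # fraction of matches correctly identified
miss_rate = fn / (tp + fn)    # fraction of matches missed
print(f"recall={recall:.3f}, miss rate={miss_rate:.1%}")
```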

Even though the AUC scores in the above analysis are not low despite the imbalanced class label, we would strongly suggest training and fine-tuning the models on a larger dataset before any real deployment (dropping the expected_num_interested_in_me feature before removing observations with missing values would be one option for retaining more data). Simply feeding in the sampled test data already exposed discrepancies in every model, so none of them achieves our goal in an overall satisfactory way. Reviewing the survey questions and improving the relevance of the descriptive features could also enhance the tuning process and yield a better model. In short, for industry deployment it is essential to repeatedly train and fine-tune a classifier until it detects both the positive and the negative class on unseen data with high accuracy, in order to achieve the business goal.
