Please refer to the Jupyter NBExtension Readme page to display
the table of contents in a floating window and to expand the body of the contents.

s3806940_A3_Q1-Final

Question 1 - Naive Bayes

This question is based on the US Census Income Dataset that we have been using in this course. The annual_income target variable is binary: either high_income or low_income. As usual, high_income will be the positive class for this problem.

For this question, we will use different variations of the Naive Bayes (NB) classifier to predict the annual_income target feature and present our results as Pandas data frames.

Part A - Data Preparation

Task 1

Transform the 2 numerical features (age and education_years) into 2 (nominal) categorical features. Specifically, use equal-width binning with the following 3 bins for each numerical feature: low, mid, and high. Once this is done, all 5 descriptive features in our dataset will be categorical. After Task 1, the dataset will be named df_all_cat.
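Before applying this to the real data, here is a minimal sketch of how pd.cut derives three equal-width bins from a column's observed range (the toy_ages values below are made up for illustration and are not from A3_Q1_train.csv):

import pandas as pd

# Hypothetical toy ages, purely for illustration
toy_ages = pd.Series([18, 25, 33, 41, 52, 67, 80])

# Passing an integer as `bins` splits the observed range (max - min) into
# equal-width intervals; retbins=True also returns the computed bin edges.
binned, edges = pd.cut(toy_ages, 3, labels=["low", "mid", "high"], retbins=True)

print(edges)                   # edges roughly at 18, 38.67, 59.33, 80 (width = (80 - 18) / 3)
print(binned.value_counts())   # how many toy ages fall into each bin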

  • Step 1: Import all necessary packages (numpy, pandas, matplotlib, seaborn, warnings and scikit-learn's PowerTransformer) and set the appropriate display options
In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)  # show all columns when displaying data frames
import warnings
warnings.filterwarnings("ignore")           # suppress warning messages in the output
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
  • Step 2: Read the A3_Q1_train.csv file into the dataframe df and drop the column row_id, as it is just a row index
In [2]:
df = pd.read_csv("A3_Q1_train.csv")
df = df.drop(columns='row_id')
In [3]:
df.head()
Out[3]:
age education_years workclass marital_status occupation annual_income
0 48 14 Local-gov Divorced Prof-specialty high_income
1 23 13 Local-gov Never-married Prof-specialty low_income
2 45 13 Local-gov Never-married Prof-specialty low_income
3 51 13 Federal-gov Married-civ-spouse Exec-managerial low_income
4 51 14 Local-gov Married-civ-spouse Prof-specialty high_income
  • Step 3: Copy all the information in df to a new dataframe df_all_cat and drop the age and education_years columns.
In [4]:
df_all_cat = df.copy()
df_all_cat = df_all_cat.drop(columns=['age', 'education_years'])
  • Step 4: Use the cut function to bin df's age and education_years into 3 equal-width bins each. Assign these binned values to df_all_cat as age and education_years respectively. Re-arrange the column order in df_all_cat so that it follows the same order as the read-in df.
In [5]:
df_all_cat['age'] = pd.cut(df['age'], 3, labels=["low", "mid", "high"])
df_all_cat['education_years'] = pd.cut(df['education_years'], 3, labels=["low", "mid", "high"])
neworder = ['age', 'education_years', 'workclass', 'marital_status', 'occupation', 'annual_income']
df_all_cat = df_all_cat.reindex(columns=neworder)
  • Step 5: Display the value_counts of each column in df_all_cat
In [6]:
for col in df_all_cat.columns.to_list():  
    print(col + ':')
    print(df_all_cat[col].value_counts())
    print('********')
age:
mid     230
low     140
high    130
Name: age, dtype: int64
********
education_years:
high    304
mid     193
low       3
Name: education_years, dtype: int64
********
workclass:
Local-gov      225
State-gov      148
Federal-gov    127
Name: workclass, dtype: int64
********
marital_status:
Married-civ-spouse    230
Never-married         155
Divorced              115
Name: marital_status, dtype: int64
********
occupation:
Prof-specialty     224
Adm-clerical       159
Exec-managerial    117
Name: occupation, dtype: int64
********
annual_income:
low_income     320
high_income    180
Name: annual_income, dtype: int64
********

Task 2

Perform one-hot-encoding (OHE) on the dataset (after the equal-width binning above). After Task 2, the dataset will be named df_all_cat_ohe.

  • Step 1: Copy all the information in df_all_cat to a new dataframe df_all_cat_ohe, drop the annual_income column.
In [7]:
df_all_cat_ohe = df_all_cat.copy()
df_all_cat_ohe = df_all_cat_ohe.drop(columns=['annual_income'])
  • Step 2: Perform one-hot-encoding on all the columns in df_all_cat_ohe, such that one-hot-encoding is done on all the descriptive features.
In [8]:
for col in df_all_cat_ohe.columns.to_list():
    n = len(df_all_cat_ohe[col].unique())
    if (n == 2):
        # Binary features are encoded as a single 0/1 column (drop_first avoids redundancy);
        # in this dataset every descriptive feature has 3 levels, so this branch is not triggered.
        df_all_cat_ohe[col] = pd.get_dummies(df_all_cat_ohe[col], drop_first=True)
df_all_cat_ohe = pd.get_dummies(df_all_cat_ohe)  # one-hot-encode the remaining (multi-level) features
  • Step 3: The LabelEncoder or get_dummies function encodes labels in alphabetical order, which would make high_income = 0 and low_income = 1. As high_income is the positive class in annual_income, we would like to encode high_income as 1 and low_income as 0. To do so, we can call the replace function on df_all_cat['annual_income'] and assign the result to a new column annual_income in df_all_cat_ohe.
In [9]:
df_all_cat_ohe['annual_income'] = df_all_cat['annual_income'].replace({'low_income': 0, 'high_income': 1}).values
  • Step 4: Display the shape and the first five rows of df_all_cat_ohe
In [10]:
print(df_all_cat_ohe.shape)
df_all_cat_ohe.head()
(500, 16)
Out[10]:
age_low age_mid age_high education_years_low education_years_mid education_years_high workclass_Federal-gov workclass_Local-gov workclass_State-gov marital_status_Divorced marital_status_Married-civ-spouse marital_status_Never-married occupation_Adm-clerical occupation_Exec-managerial occupation_Prof-specialty annual_income
0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 1
1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0
2 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0
3 0 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0
4 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1

Part B - Bernoulli NB

  • Step 1: Drop the target feature annual_income in df_all_cat_ohe so that it only contains descriptive features, turn df_all_cat_ohe into a numpy array and name it Data. Also turn the target feature df_all_cat_ohe["annual_income"] into a numpy array and name it target.
In [11]:
Data = df_all_cat_ohe.drop(columns="annual_income").values
target = df_all_cat_ohe["annual_income"].values
In [12]:
print(type(Data))
print(type(target))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
  • Step 2: Check the unique counts in target. The count for 1 should equal the value count of high_income in df_all_cat['annual_income'], and the count for 0 should equal the value count of low_income in df_all_cat['annual_income']
In [13]:
np.unique(target, return_counts = True)
Out[13]:
(array([0, 1], dtype=int64), array([320, 180], dtype=int64))
  • Step 3: Train the Bernoulli NB model with default parameters, using the entire set of Data and target. Compute the accuracy on the whole set of train data.
In [14]:
from sklearn.naive_bayes import BernoulliNB
In [15]:
bnb_classifier = BernoulliNB()
bnb_classifier.fit(Data, target)
bnb_classifier.score(Data, target)
Out[15]:
0.83
  • Step 4: Define a dataframe df_summary to display the accuracy scores of the different variations of NB classifiers:
    • 1st column will contain the method of the classifier
    • 2nd column will contain the accuracy score of that classifier.

We will use df_summary extensively in Part F.

In [16]:
df_summary = pd.DataFrame(columns=['method', 'accuracy'])
  • Step 5: Add the Bernoulli NB result into the first row of df_summary
In [17]:
df_summary.loc[len(df_summary)] = ['Bernoulli NB', bnb_classifier.score(Data, target)]

Part C - Gaussian NB

  • Step 1: Train the Gaussian NB model with default parameters, using the entire set of Data (all features are binary after one-hot-encoding) and target. Compute the accuracy on the whole set of train data.
In [18]:
from sklearn.naive_bayes import GaussianNB
In [19]:
gnb_classifier = GaussianNB()
gnb_classifier.fit(Data, target)
gnb_classifier.score(Data, target)
Out[19]:
0.728
  • Step 2: Add the Gaussian NB result into the second row of df_summary
In [20]:
df_summary.loc[len(df_summary)] = ['Gaussian NB', gnb_classifier.score(Data, target)]

Part D - Performance Tuning

Task 1 - Tuning

  • Step 1: Define a dataframe bnb_result:
    • 1st column will contain the parameter value alpha of the BernoulliNB classifier
    • 2nd column will contain the accuracy score obtained from fitting Data into BernoulliNB classifier with the respective alpha parameter in 1st column
In [21]:
bnb_result = pd.DataFrame(columns=['alpha', 'accuracy_score'])
  • Step 2: Use a for loop to train the BernoulliNB classifier with alpha ranging from 1 to 499 on the power-transformed Data, and store each alpha value and the corresponding accuracy score into the dataframe bnb_result
In [22]:
# The power transform of Data does not depend on alpha, so compute it once outside the loop
Data_transformed = PowerTransformer().fit_transform(Data)

for alpha_value in np.arange(1, 500):
    bnb_classifier = BernoulliNB(alpha=alpha_value)
    bnb_classifier.fit(Data_transformed, target)
    accuracy = bnb_classifier.score(Data_transformed, target).round(3)
    bnb_result.loc[len(bnb_result)] = [alpha_value, accuracy]
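For context, alpha in BernoulliNB is the additive (Laplace/Lidstone) smoothing parameter: for a binary feature $i$ and class $y$, scikit-learn estimates the per-feature probability roughly as

$\hat{P}(x_i = 1 \mid y) = \dfrac{N_{yi} + \alpha}{N_y + 2\alpha}$

where $N_{yi}$ is the number of class-$y$ training samples with $x_i = 1$ and $N_y$ is the total number of class-$y$ samples. Larger $\alpha$ pulls every estimate towards 0.5, which is why accuracy eventually degrades for very large values.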
  • Step 3: Define a dataframe gnb_result:
    • 1st column will contain the parameter value var_smoothing of the GaussianNB classifier
    • 2nd column will contain the accuracy score obtained from fitting Data into GaussianNB classifier with the respective var_smoothing parameter in 1st column
In [23]:
gnb_result = pd.DataFrame(columns=['var_smoothing', 'accuracy_score'])
  • Step 4: Use a for loop to train the GaussianNB classifier with var_smoothing ranging from $10^{1}$ down to $10^{-9}$ ($10^{-9}$ is the default value) on the power-transformed Data, and store each var_smoothing value and the corresponding accuracy score into the dataframe gnb_result
In [24]:
# As above, the power transform of Data does not depend on var_smoothing, so compute it once
Data_transformed = PowerTransformer().fit_transform(Data)

for var_smoothing_value in np.logspace(1, -9, num=100):
    gnb_classifier = GaussianNB(var_smoothing=var_smoothing_value)
    gnb_classifier.fit(Data_transformed, target)
    accuracy = gnb_classifier.score(Data_transformed, target).round(3)
    gnb_result.loc[len(gnb_result)] = [var_smoothing_value, accuracy]
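For context, var_smoothing in GaussianNB adds a small constant to every per-class feature variance for numerical stability, roughly

$\sigma_{jy}^2 \leftarrow \sigma_{jy}^2 + \text{var\_smoothing} \cdot \max_k \operatorname{Var}(x_k)$

where the maximum is taken over all features in the training data. Larger values flatten the per-class Gaussians and act as a regulariser, which is what the loop above explores.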

Task 2 - Plotting

  • Step 1: Plot the Bernoulli NB tuning results
In [25]:
sns.lineplot(x="alpha", y="accuracy_score", data=bnb_result)
plt.title('Tuning of BernoulliNB classifier', fontsize = 16)
Out[25]:
Text(0.5, 1.0, 'Tuning of BernoulliNB classifier')
  • Step 2: List the top 5 alpha values with the highest accuracy scores
In [26]:
bnb_result_desc_order = bnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
In [27]:
bnb_result_desc_order.head(5)
Out[27]:
alpha accuracy_score
0 135.0 0.842
1 144.0 0.842
2 136.0 0.842
3 137.0 0.842
4 138.0 0.842
  • Step 3: Add the best tuned Bernoulli NB result into the third row of df_summary
In [28]:
df_summary.loc[len(df_summary)] = ['Tuned Bernoulli NB', bnb_result_desc_order["accuracy_score"][0]]
  • Step 4: Plot the Gaussian NB tuning results
In [29]:
sns.lineplot(x="var_smoothing", y="accuracy_score", data=gnb_result)
plt.title('Tuning of GaussianNB classifier', fontsize = 16)
Out[29]:
Text(0.5, 1.0, 'Tuning of GaussianNB classifier')
  • Step 5: List the top 5 var_smoothing values with the highest accuracy scores
In [30]:
gnb_result_desc_order = gnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
gnb_result_desc_order.head(5)
Out[30]:
var_smoothing accuracy_score
0 6.280291 0.838
1 4.977024 0.834
2 3.125716 0.830
3 3.944206 0.828
4 0.385353 0.828
In [31]:
gnb_result_desc_order["accuracy_score"]
Out[31]:
0     0.838
1     0.834
2     0.830
3     0.828
4     0.828
      ...  
95    0.682
96    0.680
97    0.678
98    0.676
99    0.674
Name: accuracy_score, Length: 100, dtype: float64
  • Step 6: Add the best tuned Gaussian NB result into the fourth row of df_summary
In [32]:
df_summary.loc[len(df_summary)] = ['Tuned Gaussian NB', gnb_result_desc_order["accuracy_score"][0]]

Part E - Hybrid NB

In the real world, we usually work with datasets that contain a mix of categorical and numerical features. We have covered two NB variants so far:

  • Bernoulli NB that assumes all descriptive features are binary, and
  • Gaussian NB that assumes all descriptive features are numerical and they follow a Gaussian probability distribution.

The purpose of this part is to implement a Hybrid NB Classifier on the "A3_Q1_train.csv" dataset that uses Bernoulli NB (with default parameters) for the categorical descriptive features and Gaussian NB (with default parameters) for the numerical descriptive features. You will specifically train your Hybrid NB model using the train data and then compute its accuracy on the same train data.
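As a quick reference (these are the standard NB formulas, not anything specific to this assignment), the per-feature likelihoods assumed by the two variants, and the way the hybrid model combines them under the usual conditional-independence assumption, are:

  • Bernoulli NB (binary feature $x_i$): $P(x_i \mid y) = \theta_{iy}^{\,x_i}\,(1-\theta_{iy})^{\,1-x_i}$
  • Gaussian NB (numeric feature $x_j$): $P(x_j \mid y) = \mathcal{N}(x_j;\, \mu_{jy},\, \sigma_{jy}^2)$
  • Hybrid NB: $P(y \mid \mathbf{x}) \propto P(y)\,\prod_{i \in \text{cat}} P(x_i \mid y)\,\prod_{j \in \text{num}} P(x_j \mid y)$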

Task 1 - Split Dataset

Step 1: Construct the dataframe df_numeric containing the two continuous columns, age and education_years (copied from the dataframe df), while the remaining categorical descriptive features reside in the dataframe df_cat.

In [33]:
df.dtypes
Out[33]:
age                 int64
education_years     int64
workclass          object
marital_status     object
occupation         object
annual_income      object
dtype: object
In [34]:
df_numeric = df[['age', 'education_years']].copy()
In [35]:
df_cat = df.select_dtypes(include='object').drop(columns='annual_income')
In [36]:
df_numeric.dtypes
Out[36]:
age                int64
education_years    int64
dtype: object
In [37]:
df_cat.dtypes
Out[37]:
workclass         object
marital_status    object
occupation        object
dtype: object

Task 2 - Transformation Check (Numeric Portion Only)

Before we train df_numeric with Gaussian NB, note that our numeric data may not follow a Gaussian (normal) distribution, and the Gaussian NB classifier tends to perform better when the data do. Test and see whether we need to apply a power or Box-Cox transformation to this dataset before training.
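Besides the visual comparison below, a quick quantitative check is possible. Here is a minimal sketch (assuming df_numeric is already defined as above) that compares the skewness of each column before and after the transformations; skewness close to 0 suggests a roughly symmetric, more Gaussian-like distribution:

from scipy import stats
from sklearn.preprocessing import PowerTransformer

power = PowerTransformer().fit_transform(df_numeric)   # Yeo-Johnson by default

for i, col in enumerate(['age', 'education_years']):
    boxcox_vals, _ = stats.boxcox(df_numeric[col])     # Box-Cox requires strictly positive values
    print(col,
          'original skew:', round(stats.skew(df_numeric[col]), 3),
          'power skew:', round(stats.skew(power[:, i]), 3),
          'boxcox skew:', round(stats.skew(boxcox_vals), 3))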

  • Step 1: Plot the distribution for Age, compare with and without transformation
In [38]:
from scipy import stats

df_age = pd.DataFrame()
df_age['original']=df_numeric['age']

df_numeric_transformed =  PowerTransformer().fit_transform(df_numeric)
df_age['power']=df_numeric_transformed[:,0]
tdata = stats.boxcox(df_numeric['age'])
df_age['boxcox']=tdata[0]
df_age.head()
Out[38]:
original power boxcox
0 48 0.844505 58.088388
1 23 -1.561055 25.905422
2 45 0.549372 54.146006
3 51 1.141063 62.048351
4 51 1.141063 62.048351
In [40]:
df_age_long=df_age.melt(var_name=["data"])
g = sns.FacetGrid(df_age_long, col="data")
g.map(sns.distplot, "value")
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x2814eda5b08>

The side-by-side comparison is not clear, so let's plot the distributions separately.

Original

In [41]:
sns.distplot(df_age['original'],
             kde=True,
             bins=40);

Power Transformed

In [42]:
sns.distplot(df_age['power'],
             kde=True,
             bins=40);

Box-Cox Transformed

In [43]:
sns.distplot(df_age['boxcox'],
             kde=True,
             bins=40);
  • Step 2: Plot the distribution for Education years, compare with and without transformation
In [44]:
df_ed = pd.DataFrame()
df_ed['original']=df_numeric['education_years']
df_ed['power']=df_numeric_transformed[:,1]
tdata = stats.boxcox(df_numeric['education_years'])
df_ed['boxcox']=tdata[0]
df_ed.head()
Out[44]:
original power boxcox
0 14 0.944526 73.582644
1 13 0.407072 63.999017
2 13 0.407072 63.999017
3 13 0.407072 63.999017
4 14 0.944526 73.582644
In [45]:
df_ed_long=df_ed.melt(var_name=["data"])
g = sns.FacetGrid(df_ed_long, col="data")
g.map(sns.distplot, "value")
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x2814f1a8508>

The side-by-side comparison is not clear, so let's plot the distributions separately.

Original

In [46]:
sns.distplot(df_ed['original'],
             kde=True,
             bins=40);

Power Transformed

In [47]:
sns.distplot(df_ed['power'],
             kde=True,
             bins=40);

Box-Cox Transformed

In [48]:
sns.distplot(df_ed['boxcox'],
             kde=True,
             bins=40);
  • Step 3: Draw conclusion

By visual comparison, there is not much change in the distributions (for both age and education_years) after transformation, so we will just use the original numeric dataset for training.

Task 3 - Train and combine the outcomes

  • Step 1: Train the numeric portion using Gaussian NB
In [49]:
clf_GaussianNB = GaussianNB()
clf_GaussianNB.fit(df_numeric, target)

t_pred_GaussianNB = clf_GaussianNB.predict(df_numeric)
  • Step 2: Train the categorical portion using Bernoulli NB
In [50]:
data_cat_ohe = pd.get_dummies(df_cat)

clf_BernoulliNB = BernoulliNB()
clf_BernoulliNB.fit(data_cat_ohe, target)

t_pred_BernoulliNB = clf_BernoulliNB.predict(data_cat_ohe)
  • Step 3: Obtain the probability estimate of the label classes for each classifier and combine the outcomes
In [51]:
prob_BernoulliNB = clf_BernoulliNB.predict_proba(data_cat_ohe)
prob_GaussianNB = clf_GaussianNB.predict_proba(df_numeric)

results_combined = pd.DataFrame({'BernoulliNB_0':prob_BernoulliNB[:,0],
                                 'BernoulliNB_1':prob_BernoulliNB[:,1],
                                 'GaussianNB_0':prob_GaussianNB[:,0],
                                 'GaussianNB_1':prob_GaussianNB[:,1]})
  • Step 4: Calculate the prior of each label class based on its occurrence in the dataset
In [52]:
count_0 = np.count_nonzero(target == 0)
count_1 = np.count_nonzero(target == 1)
prior_0 = count_0 / (count_0+count_1)
prior_1 = count_1 / (count_0+count_1)
print(prior_0)
print(prior_1)
0.64
0.36

Task 4 - Classify and Score the Hybrid Model

  • Step 1: Combine the two posteriors. For each sub-model, Posterior ∝ Prior * Likelihood, so Likelihood ∝ Posterior / Prior. Multiplying the Bernoulli and Gaussian posteriors counts the prior twice, so dividing the product by the prior once gives a score proportional to the full hybrid posterior.
In [53]:
results_combined['HybridNB_0'] = results_combined['BernoulliNB_0']*results_combined['GaussianNB_0'] / prior_0
results_combined['HybridNB_1'] = results_combined['BernoulliNB_1']*results_combined['GaussianNB_1'] / prior_1
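Spelled out (standard Naive Bayes algebra, with per-row normalisation constants dropped), the combination computed in the cell above is

$P_B(y \mid \mathbf{x}_{cat}) \propto P(y)\,P(\mathbf{x}_{cat} \mid y)$ and $P_G(y \mid \mathbf{x}_{num}) \propto P(y)\,P(\mathbf{x}_{num} \mid y)$, so

$\dfrac{P_B(y \mid \mathbf{x}_{cat})\,P_G(y \mid \mathbf{x}_{num})}{P(y)} \;\propto\; P(y)\,P(\mathbf{x}_{cat} \mid y)\,P(\mathbf{x}_{num} \mid y) \;\propto\; P(y \mid \mathbf{x}_{cat}, \mathbf{x}_{num})$

HybridNB_0 and HybridNB_1 are therefore proportional to the hybrid posterior of each class; they are not normalised to sum to 1, but only their ratio matters for picking the predicted class.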
  • Step 2: Assign the class with the higher combined score as the hybrid model's prediction
In [54]:
results_combined['Target_Hybrid'] = np.where(results_combined['HybridNB_0'] > results_combined['HybridNB_1'], 0, 1)

results_combined.head()
Out[54]:
BernoulliNB_0 BernoulliNB_1 GaussianNB_0 GaussianNB_1 HybridNB_0 HybridNB_1 Target_Hybrid
0 0.972532 0.027468 0.280403 0.719597 0.426096 0.054905 0
1 0.988022 0.011978 0.940321 0.059679 1.451652 0.001986 0
2 0.988022 0.011978 0.387028 0.612972 0.597487 0.020396 0
3 0.041132 0.958868 0.337925 0.662075 0.021718 1.763451 1
4 0.184389 0.815611 0.266350 0.733650 0.076737 1.662147 1
  • Step 3: Compute the accuracy on the whole set of train data.
In [55]:
from sklearn import metrics
max_score_HybridNB_Final = metrics.accuracy_score(target, results_combined['Target_Hybrid'].values)
print(f'{max_score_HybridNB_Final:.3f}')
0.842
  • Step 4: Add the Hybrid NB result into the fifth row of df_summary
In [56]:
df_summary.loc[len(df_summary)] = ['Hybrid NB', max_score_HybridNB_Final]

Part F - Wrapping up

  • Display df_summary
In [57]:
df_summary
Out[57]:
method accuracy
0 Bernoulli NB 0.830
1 Gaussian NB 0.728
2 Tuned Bernoulli NB 0.842
3 Tuned Gaussian NB 0.838
4 Hybrid NB 0.842

df_summary shows that hyper-parameter tuning noticeably improves the performance of both Bernoulli and Gaussian NB. The accuracy of the Hybrid NB model is better than that of both the untuned Bernoulli NB and the untuned Gaussian NB for this particular dataset.