Please refer to the Jupyter NBExtension Readme page to display
the table of contents in a floating window and to expand the body of the contents.

s3806940_A3_Q1-Final

Question 1 - Naive Bayes

This question is based on the US Census Income Dataset that we have been using in this course. The annual_income target variable is binary: either high_income or low_income. As usual, high_income will be the positive class for this problem.

For this question, we will use different variations of the Naive Bayes (NB) classifier to predict the annual_income target feature and present our results as Pandas data frames.

Part A - Data Preparation

Task 1

Transform the 2 numerical features (age and education_years) into 2 (nominal) categorical features. Specifically, use equal-width binning with the following 3 bins for each numerical feature: low, mid, and high. Once this is done, all 5 descriptive features in our dataset will be categorical. After Task 1, the dataset will be named df_all_cat.
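Before applying this to the real data, here is a minimal sketch of how pd.cut derives three equal-width bins from a column's observed range (the toy_ages values below are made up for illustration and are not from A3_Q1_train.csv):

import pandas as pd

# Hypothetical toy ages, purely for illustration
toy_ages = pd.Series([18, 25, 33, 41, 52, 67, 80])

# Passing an integer as `bins` splits the observed range (max - min) into
# equal-width intervals; retbins=True also returns the computed bin edges.
binned, edges = pd.cut(toy_ages, 3, labels=["low", "mid", "high"], retbins=True)

print(edges)                   # edges roughly at 18, 38.67, 59.33, 80 (width = (80 - 18) / 3)
print(binned.value_counts())   # how many toy ages fall into each bin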

  • Step 1: Import all necessary packages (numpy, pandas, matplotlib, seaborn, warnings and scikit-learn's PowerTransformer) and set the appropriate display options
In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)  # show all columns when displaying data frames
import warnings
warnings.filterwarnings("ignore")           # suppress warning messages in the output
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
  • Step 2: Read the A3_Q1_train.csv file into the dataframe df and drop the column row_id, as it is just a row index
In [2]:
df = pd.read_csv("A3_Q1_train.csv")
df = df.drop(columns='row_id')
In [3]:
df.head()
Out[3]:
age education_years workclass marital_status occupation annual_income
0 48 14 Local-gov Divorced Prof-specialty high_income
1 23 13 Local-gov Never-married Prof-specialty low_income
2 45 13 Local-gov Never-married Prof-specialty low_income
3 51 13 Federal-gov Married-civ-spouse Exec-managerial low_income
4 51 14 Local-gov Married-civ-spouse Prof-specialty high_income
  • Step 3: Copy all the information in df to a new dataframe df_all_cat and drop the age and education_years columns.
In [4]:
df_all_cat = df.copy()
df_all_cat = df_all_cat.drop(columns=['age', 'education_years'])
  • Step 4: Use the cut function to bin df's age and education_years into 3 equal-width bins each. Assign these binned values to df_all_cat as age and education_years respectively. Re-arrange the column order in df_all_cat so that it follows the same order as the read-in df.
In [5]:
df_all_cat['age'] = pd.cut(df['age'], 3, labels=["low", "mid", "high"])
df_all_cat['education_years'] = pd.cut(df['education_years'], 3, labels=["low", "mid", "high"])
neworder = ['age', 'education_years', 'workclass', 'marital_status', 'occupation', 'annual_income']
df_all_cat = df_all_cat.reindex(columns=neworder)
  • Step 5: Display the value_counts of each column in df_all_cat
In [6]:
for col in df_all_cat.columns.to_list():  
    print(col + ':')
    print(df_all_cat[col].value_counts())
    print('********')
age:
mid     230
low     140
high    130
Name: age, dtype: int64
********
education_years:
high    304
mid     193
low       3
Name: education_years, dtype: int64
********
workclass:
Local-gov      225
State-gov      148
Federal-gov    127
Name: workclass, dtype: int64
********
marital_status:
Married-civ-spouse    230
Never-married         155
Divorced              115
Name: marital_status, dtype: int64
********
occupation:
Prof-specialty     224
Adm-clerical       159
Exec-managerial    117
Name: occupation, dtype: int64
********
annual_income:
low_income     320
high_income    180
Name: annual_income, dtype: int64
********

Task 2

Perform one-hot-encoding (OHE) on the dataset (after the equal-width binning above). After Task 2, the dataset will be named df_all_cat_ohe.

  • Step 1: Copy all the information in df_all_cat to a new dataframe df_all_cat_ohe, drop the annual_income column.
In [7]:
df_all_cat_ohe = df_all_cat.copy()
df_all_cat_ohe = df_all_cat_ohe.drop(columns=['annual_income'])
  • Step 2: Perform one-hot-encoding on all the columns in df_all_cat_ohe, such that one-hot-encoding is done on all the descriptive features.
In [8]:
for col in df_all_cat_ohe.columns.to_list():
    n = len(df_all_cat_ohe[col].unique())
    if (n == 2):
        # Binary features are encoded as a single 0/1 column (drop_first avoids redundancy);
        # in this dataset every descriptive feature has 3 levels, so this branch is not triggered.
        df_all_cat_ohe[col] = pd.get_dummies(df_all_cat_ohe[col], drop_first=True)
df_all_cat_ohe = pd.get_dummies(df_all_cat_ohe)  # one-hot-encode the remaining (multi-level) features
  • Step 3: The LabelEncoder or get_dummies function encodes labels in alphabetical order, which would make high_income = 0 and low_income = 1. As high_income is the positive class in annual_income, we would like to encode high_income as 1 and low_income as 0. To do so, we can call the replace function on df_all_cat['annual_income'] and assign the result to a new column annual_income in df_all_cat_ohe.
In [9]:
df_all_cat_ohe['annual_income'] = df_all_cat['annual_income'].replace({'low_income': 0, 'high_income': 1}).values
  • Step 4: Display the shape and the first five rows of df_all_cat_ohe
In [10]:
print(df_all_cat_ohe.shape)
df_all_cat_ohe.head()
(500, 16)
Out[10]:
age_low age_mid age_high education_years_low education_years_mid education_years_high workclass_Federal-gov workclass_Local-gov workclass_State-gov marital_status_Divorced marital_status_Married-civ-spouse marital_status_Never-married occupation_Adm-clerical occupation_Exec-managerial occupation_Prof-specialty annual_income
0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 1
1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0
2 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0
3 0 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0
4 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1

Part B - Bernoulli NB

  • Step 1: Drop the target feature annual_income in df_all_cat_ohe so that it only contains descriptive features, turn df_all_cat_ohe into a numpy array and name it Data. Also turn the target feature df_all_cat_ohe["annual_income"] into a numpy array and name it target.
In [11]:
Data = df_all_cat_ohe.drop(columns="annual_income").values
target = df_all_cat_ohe["annual_income"].values
In [12]:
print(type(Data))
print(type(target))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
  • Step 2: Check the unique counts in target. The count for 1 should equal the value count of high_income in df_all_cat['annual_income'], and the count for 0 should equal the value count of low_income in df_all_cat['annual_income']
In [13]:
np.unique(target, return_counts = True)
Out[13]:
(array([0, 1], dtype=int64), array([320, 180], dtype=int64))
  • Step 3: Train the Bernoulli NB model with default parameters, using the entire set of Data and target. Compute the accuracy on the whole set of train data.
In [14]:
from sklearn.naive_bayes import BernoulliNB
In [15]:
bnb_classifier = BernoulliNB()
bnb_classifier.fit(Data, target)
bnb_classifier.score(Data, target)
Out[15]:
0.83
  • Step 4: Define a dataframe df_summary to display the accuracy scores of the different variations of NB classifiers:
    • 1st column will contain the method of the classifier
    • 2nd column will contain the accuracy score of that classifier.

We will use df_summary extensively in Part F.

In [16]:
df_summary = pd.DataFrame(columns=['method', 'accuracy'])
  • Step 5: Add the Bernoulli NB result into the first row of df_summary
In [17]:
df_summary.loc[len(df_summary)] = ['Bernoulli NB', bnb_classifier.score(Data, target)]

Part C - Gaussian NB

  • Step 1: Train the Gaussian NB model with default parameters, using the entire set of Data (all features are binary after one-hot-encoding) and target. Compute the accuracy on the whole set of train data.
In [18]:
from sklearn.naive_bayes import GaussianNB
In [19]:
gnb_classifier = GaussianNB()
gnb_classifier.fit(Data, target)
gnb_classifier.score(Data, target)
Out[19]:
0.728
  • Step 2: Add the Gaussian NB result into the second row of df_summary
In [20]:
df_summary.loc[len(df_summary)] = ['Gaussian NB', gnb_classifier.score(Data, target)]

Part D - Performance Tuning

Task 1 - Tuning

  • Step 1: Define a dataframe bnb_result:
    • 1st column will contain the parameter value alpha of the BernoulliNB classifier
    • 2nd column will contain the accuracy score obtained from fitting Data into BernoulliNB classifier with the respective alpha parameter in 1st column
In [21]:
bnb_result = pd.DataFrame(columns=['alpha', 'accuracy_score'])
  • Step 2: Use a for loop to train the BernoulliNB classifier with alpha ranging from 1 to 499 on the power-transformed Data, and store each alpha value and the corresponding accuracy score into the dataframe bnb_result
In [22]:
# The power transform of Data does not depend on alpha, so compute it once outside the loop
Data_transformed = PowerTransformer().fit_transform(Data)

for alpha_value in np.arange(1, 500):
    bnb_classifier = BernoulliNB(alpha=alpha_value)
    bnb_classifier.fit(Data_transformed, target)
    accuracy = bnb_classifier.score(Data_transformed, target).round(3)
    bnb_result.loc[len(bnb_result)] = [alpha_value, accuracy]
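For context, alpha in BernoulliNB is the additive (Laplace/Lidstone) smoothing parameter: for a binary feature $i$ and class $y$, scikit-learn estimates the per-feature probability roughly as

$\hat{P}(x_i = 1 \mid y) = \dfrac{N_{yi} + \alpha}{N_y + 2\alpha}$

where $N_{yi}$ is the number of class-$y$ training samples with $x_i = 1$ and $N_y$ is the total number of class-$y$ samples. Larger $\alpha$ pulls every estimate towards 0.5, which is why accuracy eventually degrades for very large values.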
  • Step 3: Define a dataframe gnb_result:
    • 1st column will contain the parameter value var_smoothing of the GaussianNB classifier
    • 2nd column will contain the accuracy score obtained from fitting Data into GaussianNB classifier with the respective var_smoothing parameter in 1st column
In [23]:
gnb_result = pd.DataFrame(columns=['var_smoothing', 'accuracy_score'])
  • Step 4: Use a for loop to train the GaussianNB classifier with var_smoothing ranging from $10^{1}$ down to $10^{-9}$ ($10^{-9}$ is the default value) on the power-transformed Data, and store each var_smoothing value and the corresponding accuracy score into the dataframe gnb_result
In [24]:
# As above, the power transform of Data does not depend on var_smoothing, so compute it once
Data_transformed = PowerTransformer().fit_transform(Data)

for var_smoothing_value in np.logspace(1, -9, num=100):
    gnb_classifier = GaussianNB(var_smoothing=var_smoothing_value)
    gnb_classifier.fit(Data_transformed, target)
    accuracy = gnb_classifier.score(Data_transformed, target).round(3)
    gnb_result.loc[len(gnb_result)] = [var_smoothing_value, accuracy]
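For context, var_smoothing in GaussianNB adds a small constant to every per-class feature variance for numerical stability, roughly

$\sigma_{jy}^2 \leftarrow \sigma_{jy}^2 + \text{var\_smoothing} \cdot \max_k \operatorname{Var}(x_k)$

where the maximum is taken over all features in the training data. Larger values flatten the per-class Gaussians and act as a regulariser, which is what the loop above explores.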

Task 2 - Plotting

  • Step 1: Plot the Bernoulli NB tuning results
In [25]:
sns.lineplot(x="alpha", y="accuracy_score", data=bnb_result)
plt.title('Tuning of BernoulliNB classifier', fontsize = 16)
Out[25]:
Text(0.5, 1.0, 'Tuning of BernoulliNB classifier')
  • Step 2: List the top 5 alpha values with the highest accuracy scores
In [26]:
bnb_result_desc_order = bnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
In [27]:
bnb_result_desc_order.head(5)
Out[27]:
alpha accuracy_score
0 135.0 0.842
1 144.0 0.842
2 136.0 0.842
3 137.0 0.842
4 138.0 0.842
  • Step 3: Add the best tuned Bernoulli NB result into the third row of df_summary
In [28]:
df_summary.loc[len(df_summary)] = ['Tuned Bernoulli NB', bnb_result_desc_order["accuracy_score"][0]]
  • Step 4: Plot the Gaussian NB tuning results
In [29]:
sns.lineplot(x="var_smoothing", y="accuracy_score", data=gnb_result)
plt.title('Tuning of GaussianNB classifier', fontsize = 16)
Out[29]:
Text(0.5, 1.0, 'Tuning of GaussianNB classifier')
  • Step 5: List the top 5 var_smoothing values with the highest accuracy scores
In [30]:
gnb_result_desc_order = gnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
gnb_result_desc_order.head(5)
Out[30]:
var_smoothing accuracy_score
0 6.280291 0.838
1 4.977024 0.834
2 3.125716 0.830
3 3.944206 0.828
4 0.385353 0.828
In [31]:
gnb_result_desc_order["accuracy_score"]
Out[31]:
0     0.838
1     0.834
2     0.830
3     0.828
4     0.828
      ...  
95    0.682
96    0.680
97    0.678
98    0.676
99    0.674
Name: accuracy_score, Length: 100, dtype: float64
  • Step 6: Add the best tuned Gaussian NB result into the fourth row of df_summary
In [32]:
df_summary.loc[len(df_summary)] = ['Tuned Gaussian NB', gnb_result_desc_order["accuracy_score"][0]]

Part E - Hybrid NB

In the real world, we usually work with datasets that contain a mix of categorical and numerical features. We have covered two NB variants so far:

  • Bernoulli NB that assumes all descriptive features are binary, and
  • Gaussian NB that assumes all descriptive features are numerical and they follow a Gaussian probability distribution.

The purpose of this part is to implement a Hybrid NB Classifier on the "A3_Q1_train.csv" dataset that uses Bernoulli NB (with default parameters) for the categorical descriptive features and Gaussian NB (with default parameters) for the numerical descriptive features. You will specifically train your Hybrid NB model using the train data and then compute its accuracy on the same train data.
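As a quick reference (these are the standard NB formulas, not anything specific to this assignment), the per-feature likelihoods assumed by the two variants, and the way the hybrid model combines them under the usual conditional-independence assumption, are:

  • Bernoulli NB (binary feature $x_i$): $P(x_i \mid y) = \theta_{iy}^{\,x_i}\,(1-\theta_{iy})^{\,1-x_i}$
  • Gaussian NB (numeric feature $x_j$): $P(x_j \mid y) = \mathcal{N}(x_j;\, \mu_{jy},\, \sigma_{jy}^2)$
  • Hybrid NB: $P(y \mid \mathbf{x}) \propto P(y)\,\prod_{i \in \text{cat}} P(x_i \mid y)\,\prod_{j \in \text{num}} P(x_j \mid y)$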

Task 1 - Split Dataset

Step 1: Construct the dataframe df_numeric containing the two continuous columns, age and education_years (copied from the dataframe df), while the remaining categorical descriptive features reside in the dataframe df_cat.

In [33]:
df.dtypes
Out[33]:
age                 int64
education_years     int64
workclass          object
marital_status     object
occupation         object
annual_income      object
dtype: object
In [34]:
df_numeric = df[['age', 'education_years']].copy()
In [35]:
df_cat = df.select_dtypes(include='object').drop(columns='annual_income')
In [36]:
df_numeric.dtypes
Out[36]:
age                int64
education_years    int64
dtype: object
In [37]:
df_cat.dtypes
Out[37]:
workclass         object
marital_status    object
occupation        object
dtype: object

Task 2 - Transformation Check (Numeric Portion Only)

Before we train df_numeric with Gaussian NB, note that our numeric data may not follow a Gaussian (normal) distribution, and the Gaussian NB classifier tends to perform better when the data do. Test and see whether we need to apply a power or Box-Cox transformation to this dataset before training.
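Besides the visual comparison below, a quick quantitative check is possible. Here is a minimal sketch (assuming df_numeric is already defined as above) that compares the skewness of each column before and after the transformations; skewness close to 0 suggests a roughly symmetric, more Gaussian-like distribution:

from scipy import stats
from sklearn.preprocessing import PowerTransformer

power = PowerTransformer().fit_transform(df_numeric)   # Yeo-Johnson by default

for i, col in enumerate(['age', 'education_years']):
    boxcox_vals, _ = stats.boxcox(df_numeric[col])     # Box-Cox requires strictly positive values
    print(col,
          'original skew:', round(stats.skew(df_numeric[col]), 3),
          'power skew:', round(stats.skew(power[:, i]), 3),
          'boxcox skew:', round(stats.skew(boxcox_vals), 3))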

  • Step 1: Plot the distribution for Age, compare with and without transformation
In [38]:
from scipy import stats

df_age = pd.DataFrame()
df_age['original']=df_numeric['age']

df_numeric_transformed =  PowerTransformer().fit_transform(df_numeric)
df_age['power']=df_numeric_transformed[:,0]
tdata = stats.boxcox(df_numeric['age'])
df_age['boxcox']=tdata[0]
df_age.head()
Out[38]:
original power boxcox
0 48 0.844505 58.088388
1 23 -1.561055 25.905422
2 45 0.549372 54.146006
3 51 1.141063 62.048351
4 51 1.141063 62.048351
In [40]:
df_age_long=df_age.melt(var_name=["data"])
g = sns.FacetGrid(df_age_long, col="data")
g.map(sns.distplot, "value")
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x2814eda5b08>

The side-by-side comparison is not clear, so let's plot the distributions separately.

Original

In [41]:
sns.distplot(df_age['original'],
             kde=True,
             bins=40);

Power Transformed

In [42]:
sns.distplot(df_age['power'],
             kde=True,
             bins=40);

Box-Cox Transformed

In [43]:
sns.distplot(df_age['boxcox'],
             kde=True,
             bins=40);
  • Step 2: Plot the distribution for Education years, compare with and without transformation
In [44]:
df_ed = pd.DataFrame()
df_ed['original']=df_numeric['education_years']
df_ed['power']=df_numeric_transformed[:,1]
tdata = stats.boxcox(df_numeric['education_years'])
df_ed['boxcox']=tdata[0]
df_ed.head()
Out[44]:
original power boxcox
0 14 0.944526 73.582644
1 13 0.407072 63.999017
2 13 0.407072 63.999017
3 13 0.407072 63.999017
4 14 0.944526 73.582644
In [45]:
df_ed_long=df_ed.melt(var_name=["data"])
g = sns.FacetGrid(df_ed_long, col="data")
g.map(sns.distplot, "value")
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x2814f1a8508>

The side-by-side comparison is not clear, so let's plot the distributions separately.

Original

In [46]:
sns.distplot(df_ed['original'],
             kde=True,
             bins=40);

Power Transformed

In [47]:
sns.distplot(df_ed['power'],
             kde=True,
             bins=40);

Box-Cox Transformed

In [48]:
sns.distplot(df_ed['boxcox'],
             kde=True,
             bins=40);
  • Step 3: Draw conclusion

By visual comparison, there is not much change in the distributions (for both age and education_years) after transformation, so we will just use the original numeric dataset for training.

Task 3 - Train and combine the outcomes

  • Step 1: Train the numeric portion using Gaussian NB
In [49]:
clf_GaussianNB = GaussianNB()
clf_GaussianNB.fit(df_numeric, target)

t_pred_GaussianNB = clf_GaussianNB.predict(df_numeric)
  • Step 2: Train the categorical portion using Bernoulli NB
In [50]:
data_cat_ohe = pd.get_dummies(df_cat)

clf_BernoulliNB = BernoulliNB()
clf_BernoulliNB.fit(data_cat_ohe, target)

t_pred_BernoulliNB = clf_BernoulliNB.predict(data_cat_ohe)
  • Step 3: Obtain the probability estimate of the label classes for each classifier and combine the outcomes
In [51]:
prob_BernoulliNB = clf_BernoulliNB.predict_proba(data_cat_ohe)
prob_GaussianNB = clf_GaussianNB.predict_proba(df_numeric)

results_combined = pd.DataFrame({'BernoulliNB_0':prob_BernoulliNB[:,0],
                                 'BernoulliNB_1':prob_BernoulliNB[:,1],
                                 'GaussianNB_0':prob_GaussianNB[:,0],
                                 'GaussianNB_1':prob_GaussianNB[:,1]})
  • Step 4: Calculate the prior of each label class based on its occurrence in the dataset
In [52]:
count_0 = np.count_nonzero(target == 0)
count_1 = np.count_nonzero(target == 1)
prior_0 = count_0 / (count_0+count_1)
prior_1 = count_1 / (count_0+count_1)
print(prior_0)
print(prior_1)
0.64
0.36

Task 4 - Classify and Score the Hybrid Model

  • Step 1: Combine the two posteriors. For each sub-model, Posterior ∝ Prior * Likelihood, so Likelihood ∝ Posterior / Prior. Multiplying the Bernoulli and Gaussian posteriors counts the prior twice, so dividing the product by the prior once gives a score proportional to the full hybrid posterior.
In [53]:
results_combined['HybridNB_0'] = results_combined['BernoulliNB_0']*results_combined['GaussianNB_0'] / prior_0
results_combined['HybridNB_1'] = results_combined['BernoulliNB_1']*results_combined['GaussianNB_1'] / prior_1
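Spelled out (standard Naive Bayes algebra, with per-row normalisation constants dropped), the combination computed in the cell above is

$P_B(y \mid \mathbf{x}_{cat}) \propto P(y)\,P(\mathbf{x}_{cat} \mid y)$ and $P_G(y \mid \mathbf{x}_{num}) \propto P(y)\,P(\mathbf{x}_{num} \mid y)$, so

$\dfrac{P_B(y \mid \mathbf{x}_{cat})\,P_G(y \mid \mathbf{x}_{num})}{P(y)} \;\propto\; P(y)\,P(\mathbf{x}_{cat} \mid y)\,P(\mathbf{x}_{num} \mid y) \;\propto\; P(y \mid \mathbf{x}_{cat}, \mathbf{x}_{num})$

HybridNB_0 and HybridNB_1 are therefore proportional to the hybrid posterior of each class; they are not normalised to sum to 1, but only their ratio matters for picking the predicted class.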
  • Step 2: Assign the class with the higher combined score as the hybrid model's prediction
In [54]:
results_combined['Target_Hybrid'] = np.where(results_combined['HybridNB_0'] > results_combined['HybridNB_1'], 0, 1)

results_combined.head()
Out[54]:
BernoulliNB_0 BernoulliNB_1 GaussianNB_0 GaussianNB_1 HybridNB_0 HybridNB_1 Target_Hybrid
0 0.972532 0.027468 0.280403 0.719597 0.426096 0.054905 0
1 0.988022 0.011978 0.940321 0.059679 1.451652 0.001986 0
2 0.988022 0.011978 0.387028 0.612972 0.597487 0.020396 0
3 0.041132 0.958868 0.337925 0.662075 0.021718 1.763451 1
4 0.184389 0.815611 0.266350 0.733650 0.076737 1.662147 1
  • Step 3: Compute the accuracy on the whole set of train data.
In [55]:
from sklearn import metrics
max_score_HybridNB_Final = metrics.accuracy_score(target, results_combined['Target_Hybrid'].values)
print(f'{max_score_HybridNB_Final:.3f}')
0.842
  • Step 4: Add the Hybrid NB result into the fifth row of df_summary
In [56]:
df_summary.loc[len(df_summary)] = ['Hybrid NB', max_score_HybridNB_Final]

Part F - Wrapping up

  • Display df_summary
In [57]:
df_summary
Out[57]:
method accuracy
0 Bernoulli NB 0.830
1 Gaussian NB 0.728
2 Tuned Bernoulli NB 0.842
3 Tuned Gaussian NB 0.838
4 Hybrid NB 0.842

df_summary shows that hyper-parameter tuning noticeably improves the performance of both Bernoulli and Gaussian NB. The accuracy of the Hybrid NB model is better than that of both the untuned Bernoulli NB and the untuned Gaussian NB for this particular dataset.