This question is based on the US Census Income Dataset that we have been using in this course. The annual_income target variable is binary: either high_income or low_income. As usual, high_income is the positive class for this problem.
For this question, we will use different variations of the Naive Bayes (NB) classifier to predict the annual_income target feature and present our results as Pandas data frames.
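As a quick reminder (stated here for reference; the notation is ours, not from the assignment text), every Naive Bayes variant scores a class $c$ given descriptive features $x_1, \dots, x_m$ via the factorisation
$$P(c \mid x_1, \dots, x_m) \propto P(c) \prod_{i=1}^{m} P(x_i \mid c),$$
where Bernoulli NB models each $P(x_i \mid c)$ as a Bernoulli distribution over binary features and Gaussian NB models it as a normal distribution over numeric features.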
Transform the 2 numerical features (age and education_years) into 2 (nominal) categorical features. Specifically, use equal-width binning with the following 3 bins for each numerical feature: low, mid, and high. Once this is done, all 5 descriptive features in our dataset will be categorical. After Task 1, the dataset will be named df_all_cat.
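As a small illustration of what equal-width binning does (a toy sketch with made-up values, not part of the assignment data), pd.cut splits the observed range of a feature into bins of identical width:
import pandas as pd

# Toy ages spanning 20-80: the range of 60 is split into three bins of width 20,
# roughly [20, 40], (40, 60] and (60, 80].
toy_ages = pd.Series([20, 25, 41, 55, 62, 80])
print(pd.cut(toy_ages, 3, labels=["low", "mid", "high"]))
# Expected labels: low, low, mid, mid, high, high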
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
Read the A3_Q1_train.csv file into the dataframe df and drop the row_id column, as it is only a row index.
df = pd.read_csv("A3_Q1_train.csv")
df = df.drop(columns='row_id')
df.head()
Copy df to a new dataframe df_all_cat and drop the age and education_years columns.
df_all_cat = df.copy()
df_all_cat = df_all_cat.drop(columns=['age', 'education_years'])
Bin df's age and education_years into 3 equal-width bins each. Assign these binned values to df_all_cat as age and education_years respectively. Re-arrange the column order in df_all_cat so that it follows the same order as the read-in df.
df_all_cat['age'] = pd.cut(df['age'], 3, labels=["low", "mid", "high"])
df_all_cat['education_years'] = pd.cut(df['education_years'], 3, labels=["low", "mid", "high"])
neworder = ['age', 'education_years', 'workclass', 'marital_status', 'occupation', 'annual_income']
df_all_cat = df_all_cat.reindex(columns=neworder)
df_all_cat
for col in df_all_cat.columns.to_list():
    print(col + ':')
    print(df_all_cat[col].value_counts())
    print('********')
Perform one-hot encoding (OHE) on the dataset (after the equal-width binning above). After Task 2, the dataset will be named df_all_cat_ohe.
Copy df_all_cat to a new dataframe df_all_cat_ohe and drop the annual_income column.
df_all_cat_ohe = df_all_cat.copy()
df_all_cat_ohe = df_all_cat_ohe.drop(columns=['annual_income'])
Perform one-hot encoding on df_all_cat_ohe, such that all the descriptive features are one-hot encoded.
for col in df_all_cat_ohe.columns.to_list():
    n = len(df_all_cat_ohe[col].unique())
    if n == 2:
        # Binary features are encoded as a single 0/1 column
        df_all_cat_ohe[col] = pd.get_dummies(df_all_cat_ohe[col], drop_first=True)
df_all_cat_ohe = pd.get_dummies(df_all_cat_ohe)
The LabelEncoder or get_dummies function would encode the labels in alphabetical order, which would turn high_income into 0 and low_income into 1. As high_income is the positive class of annual_income, we would like to encode high_income as 1 and low_income as 0. To do so, we call the replace function on df_all_cat['annual_income'] and assign the result to a new annual_income column in df_all_cat_ohe.
df_all_cat_ohe['annual_income'] = pd.Series(df_all_cat['annual_income']).replace({'low_income': 0, 'high_income': 1}).values
Check the shape and the first few rows of df_all_cat_ohe.
print(df_all_cat_ohe.shape)
df_all_cat_ohe.head()
Drop annual_income from df_all_cat_ohe so that it only contains descriptive features, turn df_all_cat_ohe into a NumPy array and name it Data. Also turn the target feature df_all_cat_ohe["annual_income"] into a NumPy array and name it target.
Data = df_all_cat_ohe.drop(columns="annual_income").values
target = df_all_cat_ohe["annual_income"].values
print(type(Data))
print(type(target))
Check the value counts of target. The count for 1 should equal the value count of high_income in df_all_cat['annual_income'], and the count for 0 should equal the value count of low_income in df_all_cat['annual_income'].
np.unique(target, return_counts = True)
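As a quick sanity check (a small optional sketch, not required by the assignment), the encoded counts can be compared directly against the original labels:
# The 1/0 encoding should preserve the original class counts
assert (target == 1).sum() == (df_all_cat['annual_income'] == 'high_income').sum()
assert (target == 0).sum() == (df_all_cat['annual_income'] == 'low_income').sum()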
Fit a Bernoulli NB classifier on Data and target. Compute the accuracy on the whole train set.
from sklearn.naive_bayes import BernoulliNB
bnb_classifier = BernoulliNB()
bnb_classifier.fit(Data, target)
bnb_classifier.score(Data, target)
Create a dataframe df_summary to display the accuracy scores of the different variations of NB classifiers. We will use df_summary extensively in Part F.
df_summary = pd.DataFrame(columns=['method', 'accuracy'])
df_summary
df_summary.loc[len(df_summary)] = ['Bernoulli NB', bnb_classifier.score(Data, target)]
Fit a Gaussian NB classifier on Data (all features are binary after OHE) and target. Compute the accuracy on the whole train set.
from sklearn.naive_bayes import GaussianNB
gnb_classifier = GaussianNB()
gnb_classifier.fit(Data, target)
gnb_classifier.score(Data, target)
Add the Gaussian NB result to df_summary.
df_summary.loc[len(df_summary)] = ['Gaussian NB', gnb_classifier.score(Data, target)]
Create a dataframe bnb_result with two columns: the alpha parameter of the BernoulliNB classifier in the 1st column, and the accuracy score of fitting Data into a BernoulliNB classifier with that alpha value in the 2nd column.
bnb_result = pd.DataFrame(columns=['alpha', 'accuracy_score'])
Tune the alpha parameter of the BernoulliNB classifier over the range 1 to 500, and store each alpha value and accuracy score in the dataframe bnb_result.
# The power transform of Data does not depend on alpha, so compute it once
Data_transformed = PowerTransformer().fit_transform(Data)
for alpha_value in np.arange(1, 500):
    bnb_classifier = BernoulliNB(alpha=alpha_value)
    bnb_classifier.fit(Data_transformed, target)
    accuracy = bnb_classifier.score(Data_transformed, target).round(3)
    bnb_result.loc[len(bnb_result)] = [alpha_value, accuracy]
Create a dataframe gnb_result with two columns: the var_smoothing parameter of the GaussianNB classifier in the 1st column, and the accuracy score of fitting Data into a GaussianNB classifier with that var_smoothing value in the 2nd column.
gnb_result = pd.DataFrame(columns=['var_smoothing', 'accuracy_score'])
Tune the var_smoothing parameter of the GaussianNB classifier over the range $10^{1}$ to $10^{-9}$ ($10^{-9}$ is the default value), and store each var_smoothing value and accuracy score in the dataframe gnb_result.
# The power transform of Data does not depend on var_smoothing, so compute it once
Data_transformed = PowerTransformer().fit_transform(Data)
for var_smoothing_value in np.logspace(1, -9, num=100):
    gnb_classifier = GaussianNB(var_smoothing=var_smoothing_value)
    gnb_classifier.fit(Data_transformed, target)
    accuracy = gnb_classifier.score(Data_transformed, target).round(3)
    gnb_result.loc[len(gnb_result)] = [var_smoothing_value, accuracy]
sns.lineplot(x="alpha", y="accuracy_score", data=bnb_result)
plt.title('Tuning of BernoulliNB classifier', fontsize = 16)
bnb_result_desc_order = bnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
bnb_result_desc_order.head(5)
Add the best tuned Bernoulli NB accuracy to df_summary.
df_summary.loc[len(df_summary)] = ['Tuned Bernoulli NB', bnb_result_desc_order["accuracy_score"][0]]
sns.lineplot(x="var_smoothing", y="accuracy_score", data=gnb_result)
plt.title('Tuning of GaussianNB classifier', fontsize = 16)
gnb_result_desc_order = gnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
gnb_result_desc_order.head(5)
gnb_result_desc_order["accuracy_score"]
Add the best tuned Gaussian NB accuracy to df_summary.
df_summary.loc[len(df_summary)] = ['Tuned Gaussian NB', gnb_result_desc_order["accuracy_score"][0]]
In the real world, we usually work with datasets that have a mix of categorical and numerical features. We have covered two NB variants so far: Bernoulli NB for binary (one-hot encoded) categorical features and Gaussian NB for numerical features.
The purpose of this part is to implement a Hybrid NB classifier on the "A3_Q1_train.csv" dataset that uses Bernoulli NB (with default parameters) for the categorical descriptive features and Gaussian NB (with default parameters) for the numerical descriptive features. You will train your Hybrid NB model on the train data and compute its accuracy, again on the train data.
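The combination rule used later in this part follows directly from the Naive Bayes factorisation (a short derivation for reference; splitting the features into a categorical block $x_{cat}$ and a numeric block $x_{num}$ is our notation):
$$P(c \mid x_{cat}, x_{num}) \propto P(c)\,P(x_{cat} \mid c)\,P(x_{num} \mid c) \propto \frac{P(c \mid x_{cat})\,P(c \mid x_{num})}{P(c)},$$
so multiplying the predict_proba outputs of the two classifiers and dividing by the class prior once (each classifier already includes it) gives the hybrid posterior up to a normalising constant.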
Step 1: Construct the dataframe df_numeric whose first 2 columns, age and education_years, are continuous (copied from dataframe df), while the remaining categorical descriptive features reside in the dataframe df_cat.
df.dtypes
df_numeric = df[['age', 'education_years']].copy()
df_cat = df.select_dtypes(include='object').drop(columns='annual_income')
df_numeric.dtypes
df_cat.dtypes
Before we train a Gaussian NB on df_numeric: our numeric data may not follow a Gaussian (normal) distribution, and the Gaussian NB classifier performs better when the data is approximately Gaussian. Test whether we need to apply a power or Box-Cox transformation to this dataset before training.
from scipy import stats
df_age = pd.DataFrame()
df_age['original']=df_numeric['age']
df_numeric_transformed = PowerTransformer().fit_transform(df_numeric)
df_age['power']=df_numeric_transformed[:,0]
tdata = stats.boxcox(df_numeric['age'])
df_age['boxcox']=tdata[0]
df_age.head()
df_age_long = df_age.melt(var_name="data")
g = sns.FacetGrid(df_age_long, col="data")
g.map(sns.distplot, "value")
The side-by-side comparison is not clear, so let's plot each distribution separately.
Original
sns.distplot(df_age['original'],
kde=True,
bins=40);
Power Transformed
sns.distplot(df_age['power'],
kde=True,
bins=40);
Box-Cox Transformed
sns.distplot(df_age['boxcox'],
kde=True,
bins=40);
df_ed = pd.DataFrame()
df_ed['original']=df_numeric['education_years']
df_ed['power']=df_numeric_transformed[:,1]
tdata = stats.boxcox(df_numeric['education_years'])
df_ed['boxcox']=tdata[0]
df_ed.head()
df_ed_long = df_ed.melt(var_name="data")
g = sns.FacetGrid(df_ed_long, col="data")
g.map(sns.distplot, "value")
The side-by-side comparison is not clear, so let's plot each distribution separately.
Original
sns.distplot(df_ed['original'],
kde=True,
bins=40);
Power Transformed
sns.distplot(df_ed['power'],
kde=True,
bins=40);
Box-Cox Transformed
sns.distplot(df_ed['boxcox'],
kde=True,
bins=40);
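As a complement to the visual comparison, a quick numeric check of skewness (a sketch using the df_age and df_ed frames built above; values near zero indicate a roughly symmetric, Gaussian-like shape) quantifies how much each transformation changes the distributions:
# Skewness of the original, power-transformed and Box-Cox-transformed versions
print("age skewness:")
print(df_age.skew().round(3))
print("education_years skewness:")
print(df_ed.skew().round(3))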
From the visual comparison, there is not much change in the distributions of either age or education_years after the transformations, so we will just use the original numeric dataset for training.
clf_GaussianNB = GaussianNB()
clf_GaussianNB.fit(df_numeric, target)
t_pred_GaussianNB = clf_GaussianNB.predict(df_numeric)
data_cat_ohe = pd.get_dummies(df_cat)
clf_BernoulliNB = BernoulliNB()
clf_BernoulliNB.fit(data_cat_ohe, target)
t_pred_BernoulliNB = clf_BernoulliNB.predict(data_cat_ohe)
prob_BernoulliNB = clf_BernoulliNB.predict_proba(data_cat_ohe)
prob_GaussianNB = clf_GaussianNB.predict_proba(df_numeric)
results_combined = pd.DataFrame({'BernoulliNB_0':prob_BernoulliNB[:,0],
'BernoulliNB_1':prob_BernoulliNB[:,1],
'GaussianNB_0':prob_GaussianNB[:,0],
'GaussianNB_1':prob_GaussianNB[:,1]})
count_0 = np.count_nonzero(target == 0)
count_1 = np.count_nonzero(target == 1)
prior_0 = count_0 / (count_0+count_1)
prior_1 = count_1 / (count_0+count_1)
print(prior_0)
print(prior_1)
# Multiply the per-class posteriors of the two models and divide by the class
# prior once, since each predict_proba already includes the prior
results_combined['HybridNB_0'] = results_combined['BernoulliNB_0']*results_combined['GaussianNB_0'] / prior_0
results_combined['HybridNB_1'] = results_combined['BernoulliNB_1']*results_combined['GaussianNB_1'] / prior_1
# Predict the class with the larger (unnormalised) hybrid posterior
results_combined['Target_Hybrid'] = np.where(results_combined['HybridNB_0'] > results_combined['HybridNB_1'], 0, 1)
results_combined.head()
from sklearn import metrics
max_score_HybridNB_Final = metrics.accuracy_score(target, results_combined['Target_Hybrid'].values)
print(f'{max_score_HybridNB_Final:.3f}')
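For reuse, the same combination logic can be wrapped in a small helper (a sketch only; hybrid_nb_predict is a hypothetical name, not part of the assignment):
def hybrid_nb_predict(clf_cat, clf_num, X_cat, X_num, priors):
    # Each predict_proba already contains the class prior, so divide by the
    # priors once to avoid double-counting them, then pick the larger posterior.
    proba = clf_cat.predict_proba(X_cat) * clf_num.predict_proba(X_num) / priors
    return np.argmax(proba, axis=1)

# Example usage with the objects built above; since the target labels are 0/1,
# the argmax column index coincides with the predicted label.
hybrid_preds = hybrid_nb_predict(clf_BernoulliNB, clf_GaussianNB,
                                 data_cat_ohe, df_numeric,
                                 np.array([prior_0, prior_1]))
print(metrics.accuracy_score(target, hybrid_preds).round(3))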
Add the Hybrid NB accuracy to df_summary.
df_summary.loc[len(df_summary)] = ['Hybrid NB', max_score_HybridNB_Final]
Display the final df_summary.
df_summary
From df_summary, we see that hyper-parameter tuning really improves the performance of both the Bernoulli and Gaussian NB classifiers.
The accuracy of the hybrid NB model is better than both the untuned Bernoulli NB and the untuned Gaussian NB for this particular dataset.