Please refer to the Jupyter NBExtension README page to display the table of contents in a floating window and to expand the body of the contents.
This question is based on the US Census Income Dataset that we have been using in this course. The annual_income target variable is binary: either high_income or low_income. As usual, high_income will be the positive class for this problem.
For this question, we will use different variations of the Naive Bayes (NB) classifier to predict the annual_income target feature and present our results as Pandas data frames.
Transform the 2 numerical features (age and education_years) into 2 (nominal) categorical features. Specifically, use equal-width binning with the following 3 bins for each numerical feature: low, mid, and high. Once this is done, all 5 descriptive features in our dataset will be categorical. After Task 1, the dataset will be named df_all_cat.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
Read the A3_Q1_train.csv file into the dataframe df, then drop the column row_id as it is just an index.
df = pd.read_csv("A3_Q1_train.csv")
df=df.drop(columns='row_id')
df.head()
Copy df to a new dataframe df_all_cat, then drop the age and education_years columns.
df_all_cat = df.copy()
df_all_cat=df_all_cat.drop(columns=['age', 'education_years'])
Bin df's age and education_years into 3 equal-width bins. Assign these binned values to df_all_cat as age and education_years respectively. Re-arrange the column order in df_all_cat so that it follows the same order as the read-in df.
df_all_cat['age']=pd.cut(df['age'],3, labels=["low", "mid", "high"])
df_all_cat['education_years']=pd.cut(df['education_years'],3, labels=["low", "mid", "high"])
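To see the boundaries that equal-width binning produces, pd.cut can also return the computed bin edges. A quick optional check (retbins=True is used here purely for inspection; each bin spans (max - min) / 3 of the feature's range):

# Optional check: inspect the equal-width bin edges computed for age.
_, age_edges = pd.cut(df['age'], 3, labels=["low", "mid", "high"], retbins=True)
print(age_edges)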
neworder = ['age','education_years','workclass', 'marital_status', 'occupation', 'annual_income']
df_all_cat=df_all_cat.reindex(columns=neworder)
Inspect the value counts of each column in df_all_cat.
for col in df_all_cat.columns.to_list():
    print(col + ':')
    print(df_all_cat[col].value_counts())
    print('********')
Perform one-hot encoding (OHE) on the dataset (after the equal-width binning above). After Task 2, the dataset will be named df_all_cat_ohe.
Copy df_all_cat to a new dataframe df_all_cat_ohe, then drop the annual_income column.
df_all_cat_ohe = df_all_cat.copy()
df_all_cat_ohe=df_all_cat_ohe.drop(columns=['annual_income'])
One-hot encode all the descriptive features in df_all_cat_ohe. Binary features are encoded with drop_first=True so that each keeps a single column; the remaining features are expanded by pd.get_dummies.
for col in df_all_cat_ohe.columns.to_list():
    n = len(df_all_cat_ohe[col].unique())
    if n == 2:
        df_all_cat_ohe[col] = pd.get_dummies(df_all_cat_ohe[col], drop_first=True)
df_all_cat_ohe = pd.get_dummies(df_all_cat_ohe)
The LabelEncoder or get_dummies function would encode the labels in alphabetical order, which would make high_income = 0 and low_income = 1. As high_income is the positive class in annual_income, we would like to encode high_income as 1 and low_income as 0. To do so, we call the replace function on df_all_cat['annual_income'] and assign the result to a new annual_income column in df_all_cat_ohe.
df_all_cat_ohe['annual_income'] = pd.Series(df_all_cat['annual_income']).replace({'low_income': 0, 'high_income': 1}).values
Check the shape and the first few rows of df_all_cat_ohe.
print(df_all_cat_ohe.shape)
df_all_cat_ohe.head()
Drop annual_income from df_all_cat_ohe so that it contains only descriptive features, turn the result into a NumPy array and name it Data. Also turn the target feature df_all_cat_ohe["annual_income"] into a NumPy array and name it target.
Data = df_all_cat_ohe.drop(columns="annual_income").values
target = df_all_cat_ohe["annual_income"].values
print(type(Data))
print(type(target))
Check the class counts in target. The count for 1 should equal the value count of high_income in df_all_cat['annual_income'], and the count for 0 should equal the value count of low_income in df_all_cat['annual_income'].
np.unique(target, return_counts = True)
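As a quick sanity check (a small optional snippet that simply restates the expectation above in code):

# The counts returned above should line up with the label counts before encoding.
print(dict(zip(*np.unique(target, return_counts=True))))
print(df_all_cat['annual_income'].value_counts().to_dict())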
Fit a Bernoulli NB classifier on Data and target. Compute the accuracy on the whole training set.
from sklearn.naive_bayes import BernoulliNB
bnb_classifier = BernoulliNB()
bnb_classifier.fit(Data, target)
bnb_classifier.score(Data, target)
Create a dataframe df_summary to display the accuracy scores of the different variations of NB classifiers. We will use df_summary extensively in Part F.
df_summary = pd.DataFrame(columns=['method', 'accuracy'])
Append the Bernoulli NB result to df_summary.
df_summary.loc[len(df_summary)] = ['Bernoulli NB', bnb_classifier.score(Data, target)]
Fit a Gaussian NB classifier on Data (all features are binary) and target. Compute the accuracy on the whole training set.
from sklearn.naive_bayes import GaussianNB
gnb_classifier = GaussianNB()
gnb_classifier.fit(Data, target)
gnb_classifier.score(Data, target)
Append the Gaussian NB result to df_summary.
df_summary.loc[len(df_summary)] = ['Gaussian NB', gnb_classifier.score(Data, target)]
Create a dataframe bnb_result to store the alpha parameter of the BernoulliNB classifier in the first column and the corresponding accuracy score (obtained by fitting Data with that alpha) in the second column.
bnb_result = pd.DataFrame(columns=['alpha', 'accuracy_score'])
Tune the alpha parameter of the BernoulliNB classifier over the range 1 to 500, storing each alpha value and its accuracy score in the dataframe bnb_result.
for alpha_value in np.arange(1, 500):
    bnb_classifier = BernoulliNB(alpha=alpha_value)
    Data_transformed = PowerTransformer().fit_transform(Data)
    bnb_classifier.fit(Data_transformed, target)
    accuracy = bnb_classifier.score(Data_transformed, target).round(3)
    bnb_result.loc[len(bnb_result)] = [alpha_value, accuracy]
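For context on what alpha controls: BernoulliNB applies Laplace/Lidstone smoothing to its per-feature likelihood estimates. With $N_c$ the number of training samples in class $c$ and $N_{ic}$ the number of those samples where feature $i$ equals 1, the smoothed estimate is
$$\hat{P}(x_i = 1 \mid c) = \frac{N_{ic} + \alpha}{N_c + 2\alpha}.$$
Very large alpha values push these estimates towards 0.5, washing out the class-conditional differences, which is why accuracy eventually flattens or degrades as alpha grows.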
Create a dataframe gnb_result to store the var_smoothing parameter of the GaussianNB classifier in the first column and the corresponding accuracy score (obtained by fitting Data with that var_smoothing) in the second column.
gnb_result = pd.DataFrame(columns=['var_smoothing', 'accuracy_score'])
Tune the var_smoothing parameter of the GaussianNB classifier over values ranging from $10^{1}$ down to $10^{-9}$ ($10^{-9}$ is the default value), storing each var_smoothing value and its accuracy score in the dataframe gnb_result.
for var_smoothing_value in np.logspace(1, -9, num=100):
    gnb_classifier = GaussianNB(var_smoothing=var_smoothing_value)
    Data_transformed = PowerTransformer().fit_transform(Data)
    gnb_classifier.fit(Data_transformed, target)
    accuracy = gnb_classifier.score(Data_transformed, target).round(3)
    gnb_result.loc[len(gnb_result)] = [var_smoothing_value, accuracy]
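For context on what var_smoothing does: GaussianNB adds a portion of the largest feature variance to every per-feature variance for numerical stability, roughly $\sigma_i^2 \leftarrow \sigma_i^2 + \epsilon \cdot \max_j \sigma_j^2$, where $\epsilon$ is var_smoothing. Larger values broaden all the fitted Gaussians, so tuning $\epsilon$ trades numerical stability against the sharpness of the class-conditional densities.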
sns.lineplot(x="alpha", y="accuracy_score", data=bnb_result)
plt.title('Tuning of BernoulliNB classifier', fontsize = 16)
bnb_result_desc_order = bnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
bnb_result_desc_order.head(5)
Append the best tuned Bernoulli NB result to df_summary.
df_summary.loc[len(df_summary)] = ['Tuned Bernoulli NB', bnb_result_desc_order["accuracy_score"][0]]
sns.lineplot(x="var_smoothing", y="accuracy_score", data=gnb_result)
plt.title('Tuning of GaussianNB classifier', fontsize = 16)
gnb_result_desc_order = gnb_result.sort_values(by=["accuracy_score"], ascending = False).reset_index(drop=True)
gnb_result_desc_order.head(5)
gnb_result_desc_order["accuracy_score"]
Append the best tuned Gaussian NB result to df_summary.
df_summary.loc[len(df_summary)] = ['Tuned Gaussian NB', gnb_result_desc_order["accuracy_score"][0]]
In the real world, we usually work with datasets containing a mix of categorical and numerical features. We have covered two NB variants so far: Bernoulli NB, which works with binary (one-hot encoded) categorical features, and Gaussian NB, which works with numerical features.
The purpose of this part is to implement a Hybrid NB Classifier on the "A3_Q1_train.csv" dataset that uses Bernoulli NB (with default parameters) for the categorical descriptive features and Gaussian NB (with default parameters) for the numerical descriptive features. You will train your Hybrid NB model on the train data and, again, compute its accuracy on the train data.
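As a brief sketch of why this hybrid combination works (using the same conditional-independence assumption Naive Bayes already makes): each fitted model returns a posterior that already contains the class prior, so multiplying the two posteriors and dividing by the prior once recovers a quantity proportional to the full posterior,
$$P(c \mid \mathbf{x}_{\text{cat}}, \mathbf{x}_{\text{num}}) \propto P(c)\,P(\mathbf{x}_{\text{cat}} \mid c)\,P(\mathbf{x}_{\text{num}} \mid c) = \frac{P(c \mid \mathbf{x}_{\text{cat}})\,P(c \mid \mathbf{x}_{\text{num}})}{P(c)}\;P(\mathbf{x}_{\text{cat}})\,P(\mathbf{x}_{\text{num}}),$$
where the last two factors do not depend on the class. We therefore predict the class with the larger value of $P(c \mid \mathbf{x}_{\text{cat}})\,P(c \mid \mathbf{x}_{\text{num}})\,/\,P(c)$; this is exactly the combination applied to the predicted probabilities later in this part.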
Step 1: Construct the dataframe df_numeric, whose two columns, age and education_years, remain continuous (copied from dataframe df), while the remaining categorical features reside in the dataframe df_cat.
df.dtypes
df_numeric = df[['age', 'education_years']].copy()
df_cat = df.select_dtypes(include='object').drop(columns='annual_income')
df_numeric.dtypes
df_cat.dtypes
Before we train Gaussian NB on df_numeric: our numeric data may not follow a Gaussian (normal) distribution, and the Gaussian NB classifier tends to perform better when the data are approximately Gaussian. Test whether we need to apply a power or Box-Cox transformation to this dataset before training.
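Besides the visual comparison below, a quick numeric check of skewness can help (a minimal optional sketch, assuming df_numeric is already defined; values close to 0 indicate a more symmetric, Gaussian-like distribution):

# Compare skewness before and after the candidate transformations.
from scipy import stats
from sklearn.preprocessing import PowerTransformer

for col in ['age', 'education_years']:
    original = df_numeric[col]
    power = PowerTransformer().fit_transform(df_numeric[[col]]).ravel()
    boxcox = stats.boxcox(original)[0]  # Box-Cox requires strictly positive values
    print(col,
          '| original skew:', round(stats.skew(original), 3),
          '| power skew:', round(stats.skew(power), 3),
          '| boxcox skew:', round(stats.skew(boxcox), 3))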
from scipy import stats
df_age = pd.DataFrame()
df_age['original']=df_numeric['age']
df_numeric_transformed = PowerTransformer().fit_transform(df_numeric)
df_age['power']=df_numeric_transformed[:,0]
tdata = stats.boxcox(df_numeric['age'])
df_age['boxcox']=tdata[0]
df_age.head()
df_age_long=df_age.melt(var_name=["data"])
g = sns.FacetGrid(df_age_long, col="data")
g.map(sns.distplot, "value")
The side-by-side comparison is not clear, so let's plot the distributions separately.
Original
sns.distplot(df_age['original'],
kde=True,
bins=40);
Power Transformed
sns.distplot(df_age['power'],
kde=True,
bins=40);
Box-Cox Transformed
sns.distplot(df_age['boxcox'],
kde=True,
bins=40);
df_ed = pd.DataFrame()
df_ed['original']=df_numeric['education_years']
df_ed['power']=df_numeric_transformed[:,1]
tdata = stats.boxcox(df_numeric['education_years'])
df_ed['boxcox']=tdata[0]
df_ed.head()
df_ed_long=df_ed.melt(var_name=["data"])
g = sns.FacetGrid(df_ed_long, col="data")
g.map(sns.distplot, "value")
The side-by-side comparison is not clear, so let's plot the distributions separately.
Original
sns.distplot(df_ed['original'],
kde=True,
bins=40);
Power Transformed
sns.distplot(df_ed['power'],
kde=True,
bins=40);
Box-Cox Transformed
sns.distplot(df_ed['boxcox'],
kde=True,
bins=40);
From the visual comparison, there is not much change in the distributions (for both age and education_years) after transformation, so we will just use the original numeric dataset for training.
clf_GaussianNB = GaussianNB()
clf_GaussianNB.fit(df_numeric, target)
t_pred_GaussianNB = clf_GaussianNB.predict(df_numeric)
data_cat_ohe = pd.get_dummies(df_cat)
clf_BernoulliNB = BernoulliNB()
clf_BernoulliNB.fit(data_cat_ohe, target)
t_pred_BernoulliNB = clf_BernoulliNB.predict(data_cat_ohe)
prob_BernoulliNB = clf_BernoulliNB.predict_proba(data_cat_ohe)
prob_GaussianNB = clf_GaussianNB.predict_proba(df_numeric)
results_combined = pd.DataFrame({'BernoulliNB_0':prob_BernoulliNB[:,0],
'BernoulliNB_1':prob_BernoulliNB[:,1],
'GaussianNB_0':prob_GaussianNB[:,0],
'GaussianNB_1':prob_GaussianNB[:,1]})
count_0 = np.count_nonzero(target == 0)
count_1 = np.count_nonzero(target == 1)
prior_0 = count_0 / (count_0+count_1)
prior_1 = count_1 / (count_0+count_1)
print(prior_0)
print(prior_1)
results_combined['HybridNB_0'] = results_combined['BernoulliNB_0']*results_combined['GaussianNB_0'] / prior_0
results_combined['HybridNB_1'] = results_combined['BernoulliNB_1']*results_combined['GaussianNB_1'] / prior_1
results_combined['Target_Hybrid'] = np.where(results_combined['HybridNB_0'] > results_combined['HybridNB_1'], 0, 1)
results_combined.head()
from sklearn import metrics
max_score_HybridNB_Final = metrics.accuracy_score(target, results_combined['Target_Hybrid'].values)
print(f'{max_score_HybridNB_Final:.3f}')
Append the Hybrid NB result to df_summary.
df_summary.loc[len(df_summary)] = ['Hybrid NB', max_score_HybridNB_Final]
Display df_summary.
df_summary
df_summary shows that hyper-parameter tuning noticeably improves the performance of both the Bernoulli and Gaussian NB classifiers.
The accuracy of the hybrid NB model is better than that of both the untuned Bernoulli NB and the untuned Gaussian NB for this particular dataset.