Please refer to the Jupyter NBExtension Readme page to display the table of contents in a floating window and expand the body of the contents.
Build a simple decision tree of depth 1 using the given dataset (shown below) to predict annual income (the target feature). Use the Gini index as the split criterion and present the result as a Pandas dataframe.
ID | Age | Education | Marital_Status | Occupation | Annual_Income |
---|---|---|---|---|---|
1 | 39 | bachelors | never married | professional | high |
2 | 50 | doctorate | married | professional | mid |
3 | 18 | high school | never married | agriculture | low |
4 | 30 | bachelors | married | professional | mid |
5 | 37 | high school | married | agriculture | mid |
6 | 23 | high school | never married | agriculture | low |
7 | 52 | high school | divorced | transport | mid |
8 | 40 | doctorate | married | professional | high |
9 | 46 | bachelors | divorced | transport | mid |
10 | 33 | high school | married | transport | mid |
11 | 36 | high school | never married | transport | mid |
12 | 45 | doctorate | married | professional | mid |
13 | 23 | bachelors | never married | agriculture | low |
14 | 25 | high school | married | professional | high |
15 | 35 | bachelors | married | agriculture | mid |
16 | 29 | bachelors | never married | agriculture | mid |
17 | 44 | doctorate | divorced | transport | mid |
18 | 37 | bachelors | married | professional | mid |
19 | 39 | high school | divorced | professional | high |
20 | 25 | bachelors | married | transport | high |
import pandas as pd
import numpy as np
from tabulate import tabulate
pd.set_option('display.max_columns', None)
Read the A2_Q1.csv file and drop the column 'ID', as it is only a row index.
Q1 = pd.read_csv("A2_Q1.csv")
Q1=Q1.drop(columns='ID')
Q1.head()
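As a quick optional check (not required by the question), confirm the load: after dropping ID the dataframe should have 20 rows and 5 columns.
Q1.shape   # expected: (20, 5)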
The Gini impurity index is defined as follows: $$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$ The idea behind the Gini index is the same as with entropy: the more heterogeneous and impure a feature is, the higher the Gini index.
A nice property of the Gini index is that it is always between 0 and 1, and this may make it easier to compare Gini indices across different features.
As a hand check, the Gini impurity of the target feature Annual_Income can be computed directly from the class counts in the table above (3 low, 12 mid and 5 high out of 20 instances):
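$$ \mbox{Gini}(\mbox{Annual\_Income}) = 1 - \left[\left(\tfrac{3}{20}\right)^{2} + \left(\tfrac{12}{20}\right)^{2} + \left(\tfrac{5}{20}\right)^{2}\right] = 1 - (0.0225 + 0.36 + 0.0625) = 0.555 $$
The compute_impurity() call further below should reproduce this value.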
Define a function compute_impurity() that calculates the impurity of a feature using either entropy or the Gini index and returns the impurity value.
def compute_impurity(feature, impurity_criterion):
"""
This function calculates impurity of a feature.
Supported impurity criteria: 'entropy', 'gini'
input: feature (this needs to be a Pandas series)
output: feature impurity
"""
probs = feature.value_counts(normalize=True)
if impurity_criterion == 'entropy':
impurity = -1 * np.sum(np.log2(probs) * probs)
elif impurity_criterion == 'gini':
impurity = 1 - np.sum(np.square(probs))
else:
raise ValueError('Unknown impurity criterion')
return(impurity)
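As a quick sanity check on compute_impurity() (using hypothetical toy series, not the assignment data): a perfectly pure series should give an impurity of 0 under both criteria, and a 50/50 two-level series should give 0.5 under Gini and 1.0 under entropy.
pure = pd.Series(['mid'] * 4)
balanced = pd.Series(['low', 'high'] * 2)
print(compute_impurity(pure, 'gini'), compute_impurity(pure, 'entropy'))          # both ~0
print(compute_impurity(balanced, 'gini'), compute_impurity(balanced, 'entropy'))  # 0.5 and 1.0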
Call the compute_impurity() function to calculate the impurity of the target feature Annual_Income.
target_gini_index=compute_impurity(Q1['Annual_Income'], 'gini')
print("Impurity (Gini Index) of Annual Income is:", round(target_gini_index,3))
Next, compute the information gain of the continuous feature Age by partitioning it at an optimal threshold. First, sort the dataset by Age in ascending order.
Q1_sorted_Age = Q1.sort_values(by='Age', ascending=True).reset_index(drop=True)
Q1_sorted_Age.head()
Define a function calculate_optimal_threshold(), which finds the candidate optimal thresholds by locating adjacent instances in the ordered list that have different target feature levels, and returns all the threshold values in a list.
def calculate_optimal_threshold(df, target, descriptive_feature):
    if not np.issubdtype(df[descriptive_feature].dtype, np.number):
        raise ValueError('Descriptive feature is not numeric')
    optimal_threshold_list = list()
    # a candidate threshold is the midpoint between adjacent instances with different target levels
    for i in range(len(df) - 1):
        if df[target][i] != df[target][i+1]:
            optimal_threshold = (df[descriptive_feature][i] + df[descriptive_feature][i+1]) / 2
            optimal_threshold_list.append(optimal_threshold)
    return optimal_threshold_list
Call the calculate_optimal_threshold() function, passing in the sorted dataframe, to obtain age_optimal_threshold_list.
age_optimal_threshold_list=calculate_optimal_threshold(Q1_sorted_Age, 'Annual_Income', 'Age')
age_optimal_threshold_list
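As an illustrative check on a tiny hypothetical dataframe (not part of the assignment data): the target level changes between ages 20 and 30 and between 30 and 40, so the candidate thresholds should be the midpoints 25.0 and 35.0.
toy = pd.DataFrame({'Age': [20, 30, 40], 'Annual_Income': ['low', 'mid', 'high']})
calculate_optimal_threshold(toy, 'Annual_Income', 'Age')   # expected: [25.0, 35.0]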
Define a function comp_feature_information_gain_continuous(), which calculates the information gain for a continuous descriptive feature by splitting at each threshold value in the passed-in optimal_threshold_list. It returns the remaining impurity values and the information gains as two separate lists.
def comp_feature_information_gain_continuous(df, target, descriptive_feature, optimal_threshold_list, split_criterion):
    if not np.issubdtype(df[descriptive_feature].dtype, np.number):
        raise ValueError('Descriptive feature is not numeric')
    remainder_list = list()
    info_gain_list = list()
    target_entropy = compute_impurity(df[target], split_criterion)
    print("Impurity of target", round(target_entropy, 3), "in", split_criterion)
    # loop over each optimal threshold value
    # to partition the dataset with respect to that threshold value
    # and compute the impurity and the weight of the upper and lower partitions based on that threshold value
    for optimal_threshold in optimal_threshold_list:
        print('==============================================================')
        print('Optimal_threshold:', optimal_threshold)
        # we define two arrays below:
        # entropy_array to store the impurity of each partition
        # weight_array to store the relative number of observations in each partition
        entropy_array = np.zeros(2)
        weight_array = np.zeros(2)
        for i in range(2):
            if i == 0:
                df_feature_level = df[df[descriptive_feature] < optimal_threshold]
                print('Lower partition:')
            else:
                df_feature_level = df[df[descriptive_feature] >= optimal_threshold]
                print('Upper partition:')
            print(tabulate(df_feature_level, headers='keys', tablefmt='psql'))
            entropy_array[i] = compute_impurity(df_feature_level[target], split_criterion)
            weight_array[i] = len(df_feature_level) / len(df)
        print('impurity of partitions:', entropy_array.round(decimals=3))
        print('weights of partitions:', weight_array.round(decimals=3))
        feature_remaining_impurity = np.sum(entropy_array * weight_array)
        remainder_list.append(feature_remaining_impurity)
        print('remaining impurity:', round(feature_remaining_impurity, 3))
        information_gain = target_entropy - feature_remaining_impurity
        info_gain_list.append(information_gain)
        print('information gain:', round(information_gain, 3))
        print('==============================================================')
    return remainder_list, info_gain_list
Call the comp_feature_information_gain_continuous() function to obtain age_remainder_list and age_info_gain_list.
age_remainder_list, age_info_gain_list = comp_feature_information_gain_continuous(Q1, 'Annual_Income', 'Age', age_optimal_threshold_list, 'gini' )
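As a hand check for the candidate threshold Age ≥ 24 (the midpoint between the oldest 'low' instance, aged 23, and the youngest 'high' instance, aged 25): the lower partition holds the 3 'low' instances (Gini 0) and the upper partition holds the remaining 17 instances (12 'mid', 5 'high'), so
$$ \mbox{Gini}(\mbox{upper}) = 1 - \left[\left(\tfrac{12}{17}\right)^{2} + \left(\tfrac{5}{17}\right)^{2}\right] \approx 0.415, \quad \mbox{remaining impurity} = \tfrac{3}{20}\cdot 0 + \tfrac{17}{20}\cdot 0.415 \approx 0.353, \quad \mbox{information gain} \approx 0.555 - 0.353 = 0.202 $$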
Define a function comp_feature_information_gain(), which calculates the information gain on the target feature by partitioning a categorical descriptive feature at each of its unique levels. It returns the remaining impurity and the information gain for that feature.
def comp_feature_information_gain(df, target, descriptive_feature, split_criterion):
    print('=====================================================================================================')
    print('target feature:', target)
    print('descriptive_feature:', descriptive_feature)
    print('split criterion:', split_criterion)
    target_entropy = compute_impurity(df[target], split_criterion)
    # we define two lists below:
    # entropy_list to store the entropy of each partition
    # weight_list to store the relative number of observations in each partition
    entropy_list = list()
    weight_list = list()
    # loop over each level of the descriptive feature
    # to partition the dataset with respect to that level
    # and compute the entropy and the weight of the level's partition
    for level in df[descriptive_feature].unique():
        df_feature_level = df[df[descriptive_feature] == level]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))
    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)
    feature_remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list))
    print('remaining impurity:', round(feature_remaining_impurity, 3))
    information_gain = target_entropy - feature_remaining_impurity
    print('information gain:', round(information_gain, 3))
    return feature_remaining_impurity, information_gain
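As a hand check using the class counts from the table above: the bachelors partition has 8 instances (1 low, 5 mid, 2 high), doctorate has 4 (3 mid, 1 high) and high school has 8 (2 low, 4 mid, 2 high), so
$$ \mbox{Gini}(\mbox{bachelors}) = 1 - \left[\left(\tfrac{1}{8}\right)^{2} + \left(\tfrac{5}{8}\right)^{2} + \left(\tfrac{2}{8}\right)^{2}\right] = 0.53125, \quad \mbox{Gini}(\mbox{doctorate}) = 0.375, \quad \mbox{Gini}(\mbox{high school}) = 0.625 $$
$$ \mbox{remaining impurity} = \tfrac{8}{20}(0.53125) + \tfrac{4}{20}(0.375) + \tfrac{8}{20}(0.625) = 0.5375, \quad \mbox{information gain} = 0.555 - 0.5375 = 0.0175 $$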
Call the comp_feature_information_gain() function for each categorical descriptive feature (Education, Marital_Status, Occupation) in the dataframe to obtain feature_remainder_list and feature_info_gain_list.
split_criterion = 'gini'
feature_remainder_list = list()
feature_info_gain_list = list()
feature_list=list()
# loop over each categorical feature and call the comp_feature_information_gain() function to
# calculate the remaining impurity and information gain for each feature
for feature in Q1.drop(columns=['Annual_Income', 'Age']).columns:
    feature_remainder, feature_info_gain = comp_feature_information_gain(Q1, 'Annual_Income', feature, split_criterion)
    feature_list.append(feature)
    feature_remainder_list.append(feature_remainder)
    feature_info_gain_list.append(feature_info_gain)
    # loop over each level of the descriptive feature and calculate the impurity of the
    # target feature and the relative number of observations in the level's partition
    for level in Q1[feature].unique():
        print('+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+')
        df_feature_level = Q1[Q1[feature] == level]
        print('corresponding data partition:')
        print(tabulate(df_feature_level, headers='keys', tablefmt='psql'))
        print('partition target feature impurity:', compute_impurity(df_feature_level['Annual_Income'], split_criterion))
        print('partition weight:', str(len(df_feature_level)) + '/' + str(len(Q1)))
Convert the threshold values in age_optimal_threshold_list to strings with the prefix Age_>= so they can be used as split labels.
age_optimal_threshold_list_str = ['Age_>=' + str(int(x)) for x in age_optimal_threshold_list]
age_optimal_threshold_list_str
Combine the categorical splits and the Age threshold splits into a single dataframe df_splits.
categorical_df= pd.DataFrame({'Split':feature_list, 'Remainder': feature_remainder_list, 'Information_Gain': feature_info_gain_list})
age_df= pd.DataFrame({'Split':age_optimal_threshold_list_str, 'Remainder': age_remainder_list, 'Information_Gain': age_info_gain_list})
df_splits = pd.concat([categorical_df, age_df], axis=0, sort=False)
df_splits
Sort df_splits by Information_Gain in descending order, then add an additional column Is_Optimal that is True for the row with the highest information gain (the optimal split) and False for all other rows. Display df_splits.
df_splits = df_splits.sort_values(by='Information_Gain', ascending=False).reset_index(drop=True)
df_splits['Is_Optimal']=False
df_splits.loc[0, 'Is_Optimal']=True
df_splits=df_splits.round(3)
df_splits
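If desired, the chosen split can be pulled out directly (a minimal sketch, assuming df_splits has been built as above):
df_splits.loc[df_splits['Is_Optimal'], ['Split', 'Remainder', 'Information_Gain']]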
Assume the descriptive feature Education is chosen as the root node and make predictions for the Annual_Income target feature.
For each level of Education, calculate the probability of each unique value of the target feature ('low', 'mid', 'high') within that partition, and take the value with the highest probability as the leaf prediction. Save this information for each level in the dictionary d_value_counts and insert it as a new row into the dataframe df_prediction.
df_prediction = pd.DataFrame(columns=['Leaf_Condition', 'Low_Income_Prob', 'Mid_Income_Prob', 'High_Income_Prob', 'Leaf_Prediction'])
# loop over each level in 'Education' and calculate the probability of occurrence of each unique value of
# the target feature Annual_Income
for level in Q1['Education'].unique():
    print('+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+')
    d_prediction = {'Leaf_Condition': "Education == '" + level + "'"}
    df_feature_level = Q1[Q1['Education'] == level]
    print(tabulate(df_feature_level, headers='keys', tablefmt='psql'))
    d_value_counts = df_feature_level['Annual_Income'].value_counts(normalize=True).to_dict()
    # Find the value in the target feature that has the highest probability; use it as the leaf prediction
    Keymax = max(d_value_counts, key=d_value_counts.get)
    print("Target value with highest probability: ", Keymax)
    # rename the target levels to the corresponding probability column names
    # (a dict comprehension avoids mutating d_value_counts while iterating over it)
    column_names = {'low': 'Low_Income_Prob', 'mid': 'Mid_Income_Prob', 'high': 'High_Income_Prob'}
    d_value_counts = {column_names[k]: v for k, v in d_value_counts.items()}
    d_prediction.update(d_value_counts)
    d_prediction.update({'Leaf_Prediction': Keymax})
    print(d_prediction)
    # Insert all the information as a new row in the dataframe df_prediction
    # (pd.concat is used because DataFrame.append has been removed in recent pandas versions)
    df_prediction = pd.concat([df_prediction, pd.DataFrame([d_prediction])], ignore_index=True)
Display df_prediction.
df_prediction
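Finally, a minimal sketch (assuming df_prediction has been built as above; the helper name predict_income is hypothetical) of using this depth-1 tree to predict Annual_Income for a new instance from its Education level:
def predict_income(education, rules=df_prediction):
    # look up the leaf whose condition matches the given Education level
    leaf = rules[rules['Leaf_Condition'] == "Education == '" + education + "'"]
    if leaf.empty:
        raise ValueError('Unseen Education level: ' + education)
    return leaf['Leaf_Prediction'].iloc[0]

predict_income('doctorate')   # returns the majority Annual_Income level of the doctorate partition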