
s3806940_A2_Q1

Question 1 - Decision Tree

Build a simple decision tree with depth 1 using the given dataset (as shown below) to predict annual income (target feature). Use Gini Index as the split criterion and present the result as Pandas data frame.

ID Age Education Marital_Status Occupation Annual_Income
1 39 bachelors never married professional high
2 50 doctorate married professional mid
3 18 high school never married agriculture low
4 30 bachelors married professional mid
5 37 high school married agriculture mid
6 23 high school never married agriculture low
7 52 high school divorced transport mid
8 40 doctorate married professional high
9 46 bachelors divorced transport mid
10 33 high school married transport mid
11 36 high school never married transport mid
12 45 doctorate married professional mid
13 23 bachelors never married agriculture low
14 25 high school married professional high
15 35 bachelors married agriculture mid
16 29 bachelors never married agriculture mid
17 44 doctorate divorced transport mid
18 37 bachelors married professional mid
19 39 high school divorced professional high
20 25 bachelors married transport high

Data Processing

  • Step 1: Import all necessary packages: numpy, pandas and tabulate, and set the appropriate display options
In [1]:
import pandas as pd
import numpy as np
from tabulate import tabulate 
pd.set_option('display.max_columns', None) 
  • Step 2: Read in the A2_Q1.csv file, drop the column 'ID' as it is an index
In [2]:
Q1 = pd.read_csv("A2_Q1.csv")
Q1=Q1.drop(columns='ID')
  • Step 3: Display the first 5 rows in the dataset
In [3]:
Q1.head()
Out[3]:
Age Education Marital_Status Occupation Annual_Income
0 39 bachelors never married professional high
1 50 doctorate married professional mid
2 18 high school never married agriculture low
3 30 bachelors married professional mid
4 37 high school married agriculture mid

Part A - Impurity of Target Feature

Compute the impurity of the target feature

The Gini impurity index is defined as follows: $$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$ The intuition behind the Gini index is the same as for entropy: the more heterogeneous and impure a feature is, the higher its Gini index.

A nice property of the Gini index is that it is always between 0 and 1, and this may make it easier to compare Gini indices across different features.

The Gini impurity of the target feature Annual_Income is calculated below.
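As a quick worked check, plugging the class counts read off the dataset above (5 high, 12 mid and 3 low out of 20 instances) into the Gini formula gives:

$$\mbox{Gini}(\mbox{Annual\_Income}) = 1 - \left(\tfrac{5}{20}\right)^{2} - \left(\tfrac{12}{20}\right)^{2} - \left(\tfrac{3}{20}\right)^{2} = 1 - 0.0625 - 0.36 - 0.0225 = 0.555$$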

  • Step 1: Define a function called compute_impurity() that calculates the impurity of a feature using either entropy or the Gini index, and returns the impurity value
In [4]:
def compute_impurity(feature, impurity_criterion):
    """
    This function calculates impurity of a feature.
    Supported impurity criteria: 'entropy', 'gini'
    input: feature (this needs to be a Pandas series)
    output: feature impurity
    """
    probs = feature.value_counts(normalize=True)
    
    if impurity_criterion == 'entropy':
        impurity = -1 * np.sum(np.log2(probs) * probs)
    elif impurity_criterion == 'gini':
        impurity = 1 - np.sum(np.square(probs))
    else:
        raise ValueError('Unknown impurity criterion')
        
    return(impurity)
  • Step 2: Call the function compute_impurity() to calculate the impurity of the target feature Annual Income
In [5]:
target_gini_index=compute_impurity(Q1['Annual_Income'], 'gini')
print("Impurity (Gini Index) of Annual Income is:", round(target_gini_index,3))
Impurity (Gini Index) of Annual Income is: 0.555

Part B - Find the root node

Determine the root node for your decision tree

  • Step 1: Handle the continuous feature Age by partitioning it at candidate threshold values. First, sort the dataset by Age in ascending order
In [6]:
Q1_sorted_Age=Q1.sort_values(by='Age', ascending=True).reset_index(drop=True)
Q1_sorted_Age.head()
Out[6]:
Age Education Marital_Status Occupation Annual_Income
0 18 high school never married agriculture low
1 23 high school never married agriculture low
2 23 bachelors never married agriculture low
3 25 bachelors married transport high
4 25 high school married professional high
  • Step 2: Define a function called calculate_optimal_threshold(), which finds the candidate optimal thresholds as the midpoints between adjacent instances in the ordered list that have different target feature levels, and returns all the threshold values in a list
In [7]:
def calculate_optimal_threshold(df, target, descriptive_feature):
    """
    Return candidate split thresholds for a numeric descriptive feature.
    Note: df must already be sorted by descriptive_feature with a reset index.
    """
    if not np.issubdtype(df[descriptive_feature].dtype, np.number):
        raise ValueError('Descriptive feature is not numeric')
    optimal_threshold_list = list()
    for i in range(len(df)-1):
        # adjacent instances with different target levels mark a candidate split;
        # the threshold is the midpoint of their feature values
        if df[target][i] != df[target][i+1]:
            optimal_threshold = (df[descriptive_feature][i] + df[descriptive_feature][i+1])/2
            optimal_threshold_list.append(optimal_threshold)
    return optimal_threshold_list
  • Step 3: Call the calculate_optimal_threshold() function by passing in the sorted dataframe to obtain the age_optimal_threshold_list
In [8]:
age_optimal_threshold_list=calculate_optimal_threshold(Q1_sorted_Age, 'Annual_Income', 'Age')
age_optimal_threshold_list
Out[8]:
[24.0, 27.0, 38.0, 42.0]
  • Step 4: Define a function called comp_feature_information_gain_continuous(), which calculates the information gain for a continuous descriptive feature by splitting at each threshold value in the passed-in optimal_threshold_list. Returns the remaining impurities and information gains as two lists.
In [9]:
def comp_feature_information_gain_continuous(df, target, descriptive_feature, optimal_threshold_list, split_criterion):

    if(np.issubdtype(df[descriptive_feature].dtype, np.number)==False):
        raise ValueError('Descriptive feature is not numeric')

    remainder_list = list()
    info_gain_list = list()
    target_entropy = compute_impurity(df[target], split_criterion)
    print("Impurity of target", round(target_entropy,3), "in", split_criterion )

    # loop over each optimal threshold value
    # to partition the dataset with respect to that threshold value
    # and compute the impurity and the weight of the upper and lower partitions
    # based on that threshold value

    for optimal_threshold in optimal_threshold_list:
        print('==============================================================')
        print('Optimal_threshold:', optimal_threshold)

        # we define two arrays below:
        # entropy_array to store the impurity of each partition
        # weight_array to store the relative number of observations in each partition
        entropy_array = np.zeros(2)
        weight_array = np.zeros(2)
        for i in range(2):
            if (i==0):
                df_feature_level = df[df[descriptive_feature] < optimal_threshold]
                print('Lower partition:')
            else:
                df_feature_level = df[df[descriptive_feature] >= optimal_threshold]
                print('Upper partition:')

            print(tabulate(df_feature_level,headers='keys', tablefmt='psql'))
            entropy_array[i] = compute_impurity(df_feature_level[target], split_criterion)
            weight_array[i]= len(df_feature_level) / len(df)

        print('impurity of partitions:', entropy_array.round(decimals=3))
        print('weights of partitions:', weight_array.round(decimals=3))

        feature_remaining_impurity = np.sum(entropy_array * weight_array)
        remainder_list.append(feature_remaining_impurity)
        print('remaining impurity:', round(feature_remaining_impurity,3))

        information_gain = target_entropy - feature_remaining_impurity
        info_gain_list.append(information_gain)
        print('information gain:', round(information_gain,3))
    print('==============================================================')
    return remainder_list, info_gain_list
  • Step 5: Call the comp_feature_information_gain_continuous() function to obtain the age_remainder_list and age_info_gain_list
In [10]:
age_remainder_list, age_info_gain_list = comp_feature_information_gain_continuous(Q1, 'Annual_Income', 'Age', age_optimal_threshold_list, 'gini' )
Impurity of target 0.555 in gini
==============================================================
Optimal_threshold: 24.0
Lower partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  2 |    18 | high school | never married    | agriculture  | low             |
|  5 |    23 | high school | never married    | agriculture  | low             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
+----+-------+-------------+------------------+--------------+-----------------+
Upper partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  1 |    50 | doctorate   | married          | professional | mid             |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  6 |    52 | high school | divorced         | transport    | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 13 |    25 | high school | married          | professional | high            |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 18 |    39 | high school | divorced         | professional | high            |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
impurity of partitions: [0.    0.415]
weights of partitions: [0.15 0.85]
remaining impurity: 0.353
information gain: 0.202
==============================================================
Optimal_threshold: 27.0
Lower partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  2 |    18 | high school | never married    | agriculture  | low             |
|  5 |    23 | high school | never married    | agriculture  | low             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 13 |    25 | high school | married          | professional | high            |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
Upper partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  1 |    50 | doctorate   | married          | professional | mid             |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  6 |    52 | high school | divorced         | transport    | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 18 |    39 | high school | divorced         | professional | high            |
+----+-------+-------------+------------------+--------------+-----------------+
impurity of partitions: [0.48 0.32]
weights of partitions: [0.25 0.75]
remaining impurity: 0.36
information gain: 0.195
==============================================================
Optimal_threshold: 38.0
Lower partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  2 |    18 | high school | never married    | agriculture  | low             |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  5 |    23 | high school | never married    | agriculture  | low             |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 13 |    25 | high school | married          | professional | high            |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
Upper partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  1 |    50 | doctorate   | married          | professional | mid             |
|  6 |    52 | high school | divorced         | transport    | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
| 18 |    39 | high school | divorced         | professional | high            |
+----+-------+-------------+------------------+--------------+-----------------+
impurity of partitions: [0.569 0.469]
weights of partitions: [0.6 0.4]
remaining impurity: 0.529
information gain: 0.026
==============================================================
Optimal_threshold: 42.0
Lower partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  2 |    18 | high school | never married    | agriculture  | low             |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  5 |    23 | high school | never married    | agriculture  | low             |
|  7 |    40 | doctorate   | married          | professional | high            |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 13 |    25 | high school | married          | professional | high            |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 18 |    39 | high school | divorced         | professional | high            |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
Upper partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  1 |    50 | doctorate   | married          | professional | mid             |
|  6 |    52 | high school | divorced         | transport    | mid             |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
+----+-------+-------------+------------------+--------------+-----------------+
impurity of partitions: [0.631 0.   ]
weights of partitions: [0.75 0.25]
remaining impurity: 0.473
information gain: 0.082
==============================================================
  • Step 6: Define a function called comp_feature_information_gain(), which calculates the information gain on the target feature by partitioning a categorical descriptive feature at each of its unique levels. Returns the remaining impurity and the information gain for that feature.
In [11]:
def comp_feature_information_gain(df, target, descriptive_feature,  split_criterion):
    
    print('=====================================================================================================')   
    print('target feature:', target)
    print('descriptive_feature:', descriptive_feature)
    print('split criterion:', split_criterion)
            
    target_entropy = compute_impurity(df[target], split_criterion)

    # we define two lists below:
    # entropy_list to store the entropy of each partition
    # weight_list to store the relative number of observations in each partition
    
    entropy_list = list()
    weight_list = list()
    
    # loop over each level of the descriptive feature
    # to partition the dataset with respect to that level
    # and compute the entropy and the weight of the level's partition

    for level in df[descriptive_feature].unique():      
  
        df_feature_level = df[df[descriptive_feature] == level]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))

    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)

    feature_remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list))
    print('remaining impurity:', round(feature_remaining_impurity,3))
    
    information_gain = target_entropy - feature_remaining_impurity
    print('information gain:', round(information_gain,3))
    

    return(feature_remaining_impurity, information_gain)
  • Step 7: Call the comp_feature_information_gain() function for each categorical descriptive feature (Education, Marital_Status, Occupation) in the dataframe to obtain feature_remainder_list and feature_info_gain_list
In [12]:
split_criterion = 'gini'
feature_remainder_list = list()
feature_info_gain_list = list()
feature_list=list()

# loop over each categorical feature and call the comp_feature_information_gain()
# function to calculate the remainder impurity and information gain for each feature

for feature in Q1.drop(columns=['Annual_Income', 'Age']).columns:
    feature_remainder, feature_info_gain = comp_feature_information_gain(Q1, 'Annual_Income', feature, split_criterion)
    feature_list.append(feature)
    feature_remainder_list.append(feature_remainder)
    feature_info_gain_list.append(feature_info_gain)

    # loop over each level of the descriptive feature and calculate the impurity of
    # the target feature and the relative number of observations in the level's partition

    for level in Q1[feature].unique():
        print('+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+')
        df_feature_level = Q1[Q1[feature] == level]
        print('corresponding data partition:')
        print(tabulate(df_feature_level,headers='keys', tablefmt='psql'))
        print('partition target feature impurity:', compute_impurity(df_feature_level['Annual_Income'], split_criterion))
        print('partition weight:', str(len(df_feature_level)) + '/' + str(len(Q1)))
=====================================================================================================
target feature: Annual_Income
descriptive_feature: Education
split criterion: gini
impurity of partitions: [0.531, 0.375, 0.625]
weights of partitions: [0.4, 0.2, 0.4]
remaining impurity: 0.537
information gain: 0.018
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.53125
partition weight: 8/20
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  1 |    50 | doctorate   | married          | professional | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.375
partition weight: 4/20
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  2 |    18 | high school | never married    | agriculture  | low             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  5 |    23 | high school | never married    | agriculture  | low             |
|  6 |    52 | high school | divorced         | transport    | mid             |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 13 |    25 | high school | married          | professional | high            |
| 18 |    39 | high school | divorced         | professional | high            |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.625
partition weight: 8/20
=====================================================================================================
target feature: Annual_Income
descriptive_feature: Marital_Status
split criterion: gini
impurity of partitions: [0.611, 0.42, 0.375]
weights of partitions: [0.3, 0.5, 0.2]
remaining impurity: 0.468
information gain: 0.087
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  2 |    18 | high school | never married    | agriculture  | low             |
|  5 |    23 | high school | never married    | agriculture  | low             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.6111111111111112
partition weight: 6/20
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  1 |    50 | doctorate   | married          | professional | mid             |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
|  9 |    33 | high school | married          | transport    | mid             |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 13 |    25 | high school | married          | professional | high            |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.42000000000000004
partition weight: 10/20
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  6 |    52 | high school | divorced         | transport    | mid             |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
| 18 |    39 | high school | divorced         | professional | high            |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.375
partition weight: 4/20
=====================================================================================================
target feature: Annual_Income
descriptive_feature: Occupation
split criterion: gini
impurity of partitions: [0.5, 0.5, 0.278]
weights of partitions: [0.4, 0.3, 0.3]
remaining impurity: 0.433
information gain: 0.122
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  1 |    50 | doctorate   | married          | professional | mid             |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 13 |    25 | high school | married          | professional | high            |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 18 |    39 | high school | divorced         | professional | high            |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.5
partition weight: 8/20
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  2 |    18 | high school | never married    | agriculture  | low             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  5 |    23 | high school | never married    | agriculture  | low             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.5
partition weight: 6/20
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
corresponding data partition:
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  6 |    52 | high school | divorced         | transport    | mid             |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
partition target feature impurity: 0.2777777777777777
partition weight: 6/20
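As a quick arithmetic check on the impurity figure above (an illustrative sketch, not part of the assignment code): the transport partition contains 5 'mid' and 1 'high' row out of 6, so its Gini index can be recomputed directly:

```python
# Gini index of the transport partition: 5 'mid' and 1 'high' out of 6 rows
gini_transport = 1 - (5/6)**2 - (1/6)**2   # = 10/36, matching the 0.2777... above
print(gini_transport)
```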
  • Step 8: Concatenate the values in the age_optimal_threshold_list with the prefix 'Age_>='
In [13]:
age_optimal_threshold_list_str = ['Age_>=' + str(int(x)) for x in age_optimal_threshold_list]
age_optimal_threshold_list_str
Out[13]:
['Age_>=24', 'Age_>=27', 'Age_>=38', 'Age_>=42']
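The four thresholds above can be reproduced with a common heuristic: sort the rows by Age and take the midpoint between each pair of consecutive distinct ages where the target value changes. The sketch below assumes that heuristic (it is not the notebook's own code), with the (Age, Annual_Income) pairs copied from the question's table:

```python
# (Age, Annual_Income) pairs from the dataset in the question
data = [(39, 'high'), (50, 'mid'), (18, 'low'), (30, 'mid'), (37, 'mid'),
        (23, 'low'), (52, 'mid'), (40, 'high'), (46, 'mid'), (33, 'mid'),
        (36, 'mid'), (45, 'mid'), (23, 'low'), (25, 'high'), (35, 'mid'),
        (29, 'mid'), (44, 'mid'), (37, 'mid'), (39, 'high'), (25, 'high')]

pairs = sorted(data)  # sort by age
# midpoint between consecutive distinct ages whose target values differ
thresholds = sorted({(a1 + a2) / 2 for (a1, y1), (a2, y2) in zip(pairs, pairs[1:])
                     if a1 != a2 and y1 != y2})
print(thresholds)  # [24.0, 27.0, 38.0, 42.0]
```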
  • Step 9: Construct the data frames for the categorical and continuous descriptive features with their corresponding remainder and information-gain lists, then combine them into df_splits
In [14]:
categorical_df= pd.DataFrame({'Split':feature_list, 'Remainder': feature_remainder_list, 'Information_Gain': feature_info_gain_list})
age_df= pd.DataFrame({'Split':age_optimal_threshold_list_str, 'Remainder': age_remainder_list, 'Information_Gain': age_info_gain_list})
df_splits = pd.concat([categorical_df, age_df], axis=0, sort=False)
  • Step 10: Sort df_splits by Information_Gain in descending order, then add a column Is_Optimal that is True for the row with the highest information gain (the optimal split) and False for all other rows. Display df_splits.
In [15]:
df_splits=df_splits.sort_values(by='Information_Gain', ascending=False).reset_index(drop=True)
df_splits['Is_Optimal']=False
df_splits.loc[0, 'Is_Optimal']=True
df_splits=df_splits.round(3)
df_splits
Out[15]:
Split Remainder Information_Gain Is_Optimal
0 Age_>=24 0.353 0.202 True
1 Age_>=27 0.360 0.195 False
2 Occupation 0.433 0.122 False
3 Marital_Status 0.468 0.087 False
4 Age_>=42 0.473 0.082 False
5 Age_>=38 0.529 0.026 False
6 Education 0.537 0.018 False
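The Education row of this table can be verified by hand from the class counts alone. The sketch below (an illustrative check, not part of the assignment code) recomputes its remainder and information gain with the Gini index, using the (low, mid, high) counts read off the dataset:

```python
def gini(counts):
    """Gini index of a class-count vector."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Annual_Income counts as (low, mid, high), taken from the dataset
root = [3, 12, 5]                          # all 20 rows
partitions = {'bachelors':   [1, 5, 2],
              'doctorate':   [0, 3, 1],
              'high school': [2, 4, 2]}

n = sum(root)
remainder = sum(sum(c) / n * gini(c) for c in partitions.values())
info_gain = gini(root) - remainder
print(remainder, info_gain)  # ~0.5375 and ~0.0175, i.e. the 0.537 / 0.018 row above
```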

Part C - Make Prediction

Assume the descriptive feature Education is chosen as the root node, and make predictions for the Annual_Income target feature.

  • Step 1: Partition the data by each level of Education. Within each partition, calculate the probability of each unique value of the target feature ('low', 'mid', 'high'), and take the value with the highest probability as the leaf prediction. Collect this information for each level into the dictionary d_value_counts and insert it as a new row of the data frame df_prediction.
In [16]:
df_prediction  = pd.DataFrame(columns=['Leaf_Condition', 'Low_Income_Prob', 'Mid_Income_Prob', 'High_Income_Prob', 'Leaf_Prediction'])

# loop over each level in 'Education', calculate the probability of occurrence of each
# unique value of the target feature Annual_Income

for level in Q1['Education'].unique():
    print('+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+')
    d_prediction = {'Leaf_Condition' : "Education == '" + level +"'"}
    df_feature_level = Q1[Q1['Education'] == level]
    print(tabulate(df_feature_level,headers='keys', tablefmt='psql'))
    d_value_counts=(df_feature_level['Annual_Income'].value_counts(normalize=True).to_dict())
    
    # Find the value in the target feature that has highest probability as the leaf prediction
    
    Keymax = max(d_value_counts, key=d_value_counts.get)     
    print("Target value with highest probability: ", Keymax)
    # Rename the raw value_counts keys to the column names used in df_prediction;
    # building a new dict avoids mutating d_value_counts while iterating over it
    rename = {'low': 'Low_Income_Prob', 'mid': 'Mid_Income_Prob', 'high': 'High_Income_Prob'}
    d_value_counts = {rename[k]: v for k, v in d_value_counts.items()}

    d_prediction.update(d_value_counts)
    d_prediction.update({'Leaf_Prediction': Keymax})        
    print(d_prediction)
    
    # Insert all the information as a new row in the data frame df_prediction
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df_prediction = pd.concat([df_prediction, pd.DataFrame([d_prediction])], ignore_index=True)
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  0 |    39 | bachelors   | never married    | professional | high            |
|  3 |    30 | bachelors   | married          | professional | mid             |
|  8 |    46 | bachelors   | divorced         | transport    | mid             |
| 12 |    23 | bachelors   | never married    | agriculture  | low             |
| 14 |    35 | bachelors   | married          | agriculture  | mid             |
| 15 |    29 | bachelors   | never married    | agriculture  | mid             |
| 17 |    37 | bachelors   | married          | professional | mid             |
| 19 |    25 | bachelors   | married          | transport    | high            |
+----+-------+-------------+------------------+--------------+-----------------+
Target value with highest probability:  mid
{'Leaf_Condition': "Education == 'bachelors'", 'Mid_Income_Prob': 0.625, 'High_Income_Prob': 0.25, 'Low_Income_Prob': 0.125, 'Leaf_Prediction': 'mid'}
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  1 |    50 | doctorate   | married          | professional | mid             |
|  7 |    40 | doctorate   | married          | professional | high            |
| 11 |    45 | doctorate   | married          | professional | mid             |
| 16 |    44 | doctorate   | divorced         | transport    | mid             |
+----+-------+-------------+------------------+--------------+-----------------+
Target value with highest probability:  mid
{'Leaf_Condition': "Education == 'doctorate'", 'Mid_Income_Prob': 0.75, 'High_Income_Prob': 0.25, 'Leaf_Prediction': 'mid'}
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
+----+-------+-------------+------------------+--------------+-----------------+
|    |   Age | Education   | Marital_Status   | Occupation   | Annual_Income   |
|----+-------+-------------+------------------+--------------+-----------------|
|  2 |    18 | high school | never married    | agriculture  | low             |
|  4 |    37 | high school | married          | agriculture  | mid             |
|  5 |    23 | high school | never married    | agriculture  | low             |
|  6 |    52 | high school | divorced         | transport    | mid             |
|  9 |    33 | high school | married          | transport    | mid             |
| 10 |    36 | high school | never married    | transport    | mid             |
| 13 |    25 | high school | married          | professional | high            |
| 18 |    39 | high school | divorced         | professional | high            |
+----+-------+-------------+------------------+--------------+-----------------+
Target value with highest probability:  mid
{'Leaf_Condition': "Education == 'high school'", 'Mid_Income_Prob': 0.5, 'Low_Income_Prob': 0.25, 'High_Income_Prob': 0.25, 'Leaf_Prediction': 'mid'}
  • Step 2: Display df_prediction
In [17]:
df_prediction
Out[17]:
Leaf_Condition Low_Income_Prob Mid_Income_Prob High_Income_Prob Leaf_Prediction
0 Education == 'bachelors' 0.125 0.625 0.25 mid
1 Education == 'doctorate' NaN 0.750 0.25 mid
2 Education == 'high school' 0.250 0.500 0.25 mid
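For reference, the same table can be derived more compactly with a groupby. This is an alternative sketch rather than the assignment's required loop; the df below re-creates only the two relevant columns of Q1 so the snippet is self-contained:

```python
import pandas as pd

# the Education and Annual_Income columns of the Q1 dataset
df = pd.DataFrame({
    'Education': ['bachelors', 'doctorate', 'high school', 'bachelors', 'high school',
                  'high school', 'high school', 'doctorate', 'bachelors', 'high school',
                  'high school', 'doctorate', 'bachelors', 'high school', 'bachelors',
                  'bachelors', 'doctorate', 'bachelors', 'high school', 'bachelors'],
    'Annual_Income': ['high', 'mid', 'low', 'mid', 'mid', 'low', 'mid', 'high', 'mid',
                      'mid', 'mid', 'mid', 'low', 'high', 'mid', 'mid', 'mid', 'mid',
                      'high', 'high']})

# per-level class probabilities, one column per income value
probs = (df.groupby('Education')['Annual_Income']
           .value_counts(normalize=True)
           .unstack(fill_value=0))

# the most probable class in each level is the leaf prediction
probs['Leaf_Prediction'] = probs[['low', 'mid', 'high']].idxmax(axis=1)
print(probs)
```

All three leaves again predict 'mid', matching df_prediction above.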