Please refer to the Jupyter NBExtension Readme page to display the table of contents in a floating window and expand the body of the contents.
Build a simple decision tree of depth 1 using the given dataset (shown below) to predict annual income (the target feature). Use the Gini index as the split criterion and present the result as a Pandas dataframe.
ID | Age | Education | Marital_Status | Occupation | Annual_Income |
---|---|---|---|---|---|
1 | 39 | bachelors | never married | professional | high |
2 | 50 | doctorate | married | professional | mid |
3 | 18 | high school | never married | agriculture | low |
4 | 30 | bachelors | married | professional | mid |
5 | 37 | high school | married | agriculture | mid |
6 | 23 | high school | never married | agriculture | low |
7 | 52 | high school | divorced | transport | mid |
8 | 40 | doctorate | married | professional | high |
9 | 46 | bachelors | divorced | transport | mid |
10 | 33 | high school | married | transport | mid |
11 | 36 | high school | never married | transport | mid |
12 | 45 | doctorate | married | professional | mid |
13 | 23 | bachelors | never married | agriculture | low |
14 | 25 | high school | married | professional | high |
15 | 35 | bachelors | married | agriculture | mid |
16 | 29 | bachelors | never married | agriculture | mid |
17 | 44 | doctorate | divorced | transport | mid |
18 | 37 | bachelors | married | professional | mid |
19 | 39 | high school | divorced | professional | high |
20 | 25 | bachelors | married | transport | high |
import pandas as pd
import numpy as np
from tabulate import tabulate
pd.set_option('display.max_columns', None)
Read the A2_Q1.csv file and drop the column 'ID', as it is only a row index.
Q1 = pd.read_csv("A2_Q1.csv")
Q1=Q1.drop(columns='ID')
Q1.head()
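As a quick optional check (not required by the question), confirm the load: after dropping ID the dataframe should have 20 rows and 5 columns.
Q1.shape   # expected: (20, 5)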
The Gini impurity index is defined as follows: $$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$ The idea behind the Gini index is the same as with entropy: the more heterogeneous and impure a feature is, the higher the Gini index.
A nice property of the Gini index is that it is always between 0 and 1, and this may make it easier to compare Gini indices across different features.
As a hand check, the Gini impurity of the target feature Annual_Income can be computed directly from the class counts in the table above (3 low, 12 mid and 5 high out of 20 instances):
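$$ \mbox{Gini}(\mbox{Annual\_Income}) = 1 - \left[\left(\tfrac{3}{20}\right)^{2} + \left(\tfrac{12}{20}\right)^{2} + \left(\tfrac{5}{20}\right)^{2}\right] = 1 - (0.0225 + 0.36 + 0.0625) = 0.555 $$
The compute_impurity() call further below should reproduce this value.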
Define a function compute_impurity() that calculates the impurity of a feature using either entropy or the Gini index and returns the impurity value.
def compute_impurity(feature, impurity_criterion):
"""
This function calculates impurity of a feature.
Supported impurity criteria: 'entropy', 'gini'
input: feature (this needs to be a Pandas series)
output: feature impurity
"""
probs = feature.value_counts(normalize=True)
if impurity_criterion == 'entropy':
impurity = -1 * np.sum(np.log2(probs) * probs)
elif impurity_criterion == 'gini':
impurity = 1 - np.sum(np.square(probs))
else:
raise ValueError('Unknown impurity criterion')
return(impurity)
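As a quick sanity check on compute_impurity() (using hypothetical toy series, not the assignment data): a perfectly pure series should give an impurity of 0 under both criteria, and a 50/50 two-level series should give 0.5 under Gini and 1.0 under entropy.
pure = pd.Series(['mid'] * 4)
balanced = pd.Series(['low', 'high'] * 2)
print(compute_impurity(pure, 'gini'), compute_impurity(pure, 'entropy'))          # both ~0
print(compute_impurity(balanced, 'gini'), compute_impurity(balanced, 'entropy'))  # 0.5 and 1.0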
Call the compute_impurity() function to calculate the impurity of the target feature Annual_Income.
target_gini_index=compute_impurity(Q1['Annual_Income'], 'gini')
print("Impurity (Gini Index) of Annual Income is:", round(target_gini_index,3))
Next, compute the information gain of the continuous feature Age by partitioning it at an optimal threshold. First, sort the dataset by Age in ascending order.
Q1_sorted_Age = Q1.sort_values(by='Age', ascending=True).reset_index(drop=True)
Q1_sorted_Age.head()
Define a function calculate_optimal_threshold(), which finds the candidate optimal thresholds by locating adjacent instances in the ordered list that have different target feature levels, and returns all the threshold values in a list.
def calculate_optimal_threshold(df, target, descriptive_feature):
    if not np.issubdtype(df[descriptive_feature].dtype, np.number):
        raise ValueError('Descriptive feature is not numeric')
    optimal_threshold_list = list()
    # a candidate threshold is the midpoint between adjacent instances with different target levels
    for i in range(len(df) - 1):
        if df[target][i] != df[target][i+1]:
            optimal_threshold = (df[descriptive_feature][i] + df[descriptive_feature][i+1]) / 2
            optimal_threshold_list.append(optimal_threshold)
    return optimal_threshold_list
Call the calculate_optimal_threshold() function, passing in the sorted dataframe, to obtain age_optimal_threshold_list.
age_optimal_threshold_list=calculate_optimal_threshold(Q1_sorted_Age, 'Annual_Income', 'Age')
age_optimal_threshold_list
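As an illustrative check on a tiny hypothetical dataframe (not part of the assignment data): the target level changes between ages 20 and 30 and between 30 and 40, so the candidate thresholds should be the midpoints 25.0 and 35.0.
toy = pd.DataFrame({'Age': [20, 30, 40], 'Annual_Income': ['low', 'mid', 'high']})
calculate_optimal_threshold(toy, 'Annual_Income', 'Age')   # expected: [25.0, 35.0]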
Define a function comp_feature_information_gain_continuous(), which calculates the information gain for a continuous descriptive feature by splitting at each threshold value in the passed-in optimal_threshold_list. It returns the remaining impurity values and the information gains as two separate lists.
def comp_feature_information_gain_continuous(df, target, descriptive_feature, optimal_threshold_list, split_criterion):
    if not np.issubdtype(df[descriptive_feature].dtype, np.number):
        raise ValueError('Descriptive feature is not numeric')
    remainder_list = list()
    info_gain_list = list()
    target_entropy = compute_impurity(df[target], split_criterion)
    print("Impurity of target", round(target_entropy, 3), "in", split_criterion)
    # loop over each optimal threshold value
    # to partition the dataset with respect to that threshold value
    # and compute the impurity and the weight of the upper and lower partitions based on that threshold value
    for optimal_threshold in optimal_threshold_list:
        print('==============================================================')
        print('Optimal_threshold:', optimal_threshold)
        # we define two arrays below:
        # entropy_array to store the impurity of each partition
        # weight_array to store the relative number of observations in each partition
        entropy_array = np.zeros(2)
        weight_array = np.zeros(2)
        for i in range(2):
            if i == 0:
                df_feature_level = df[df[descriptive_feature] < optimal_threshold]
                print('Lower partition:')
            else:
                df_feature_level = df[df[descriptive_feature] >= optimal_threshold]
                print('Upper partition:')
            print(tabulate(df_feature_level, headers='keys', tablefmt='psql'))
            entropy_array[i] = compute_impurity(df_feature_level[target], split_criterion)
            weight_array[i] = len(df_feature_level) / len(df)
        print('impurity of partitions:', entropy_array.round(decimals=3))
        print('weights of partitions:', weight_array.round(decimals=3))
        feature_remaining_impurity = np.sum(entropy_array * weight_array)
        remainder_list.append(feature_remaining_impurity)
        print('remaining impurity:', round(feature_remaining_impurity, 3))
        information_gain = target_entropy - feature_remaining_impurity
        info_gain_list.append(information_gain)
        print('information gain:', round(information_gain, 3))
        print('==============================================================')
    return remainder_list, info_gain_list
Call the comp_feature_information_gain_continuous() function to obtain age_remainder_list and age_info_gain_list.
age_remainder_list, age_info_gain_list = comp_feature_information_gain_continuous(Q1, 'Annual_Income', 'Age', age_optimal_threshold_list, 'gini' )
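As a hand check for the candidate threshold Age ≥ 24 (the midpoint between the oldest 'low' instance, aged 23, and the youngest 'high' instance, aged 25): the lower partition holds the 3 'low' instances (Gini 0) and the upper partition holds the remaining 17 instances (12 'mid', 5 'high'), so
$$ \mbox{Gini}(\mbox{upper}) = 1 - \left[\left(\tfrac{12}{17}\right)^{2} + \left(\tfrac{5}{17}\right)^{2}\right] \approx 0.415, \quad \mbox{remaining impurity} = \tfrac{3}{20}\cdot 0 + \tfrac{17}{20}\cdot 0.415 \approx 0.353, \quad \mbox{information gain} \approx 0.555 - 0.353 = 0.202 $$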
Define a function comp_feature_information_gain(), which calculates the information gain on the target feature by partitioning a categorical descriptive feature at each of its unique levels. It returns the remaining impurity and the information gain for that feature.
def comp_feature_information_gain(df, target, descriptive_feature, split_criterion):
    print('=====================================================================================================')
    print('target feature:', target)
    print('descriptive_feature:', descriptive_feature)
    print('split criterion:', split_criterion)
    target_entropy = compute_impurity(df[target], split_criterion)
    # we define two lists below:
    # entropy_list to store the entropy of each partition
    # weight_list to store the relative number of observations in each partition
    entropy_list = list()
    weight_list = list()
    # loop over each level of the descriptive feature
    # to partition the dataset with respect to that level
    # and compute the entropy and the weight of the level's partition
    for level in df[descriptive_feature].unique():
        df_feature_level = df[df[descriptive_feature] == level]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))
    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)
    feature_remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list))
    print('remaining impurity:', round(feature_remaining_impurity, 3))
    information_gain = target_entropy - feature_remaining_impurity
    print('information gain:', round(information_gain, 3))
    return feature_remaining_impurity, information_gain
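As a hand check using the class counts from the table above: the bachelors partition has 8 instances (1 low, 5 mid, 2 high), doctorate has 4 (3 mid, 1 high) and high school has 8 (2 low, 4 mid, 2 high), so
$$ \mbox{Gini}(\mbox{bachelors}) = 1 - \left[\left(\tfrac{1}{8}\right)^{2} + \left(\tfrac{5}{8}\right)^{2} + \left(\tfrac{2}{8}\right)^{2}\right] = 0.53125, \quad \mbox{Gini}(\mbox{doctorate}) = 0.375, \quad \mbox{Gini}(\mbox{high school}) = 0.625 $$
$$ \mbox{remaining impurity} = \tfrac{8}{20}(0.53125) + \tfrac{4}{20}(0.375) + \tfrac{8}{20}(0.625) = 0.5375, \quad \mbox{information gain} = 0.555 - 0.5375 = 0.0175 $$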
Call the comp_feature_information_gain() function for each categorical descriptive feature (Education, Marital_Status, Occupation) in the dataframe to obtain feature_remainder_list and feature_info_gain_list.
split_criterion = 'gini'
feature_remainder_list = list()
feature_info_gain_list = list()
feature_list=list()
# loop over each categorical feature and call the comp_feature_information_gain() function to
# calculate the remaining impurity and information gain for each feature
for feature in Q1.drop(columns=['Annual_Income', 'Age']).columns:
    feature_remainder, feature_info_gain = comp_feature_information_gain(Q1, 'Annual_Income', feature, split_criterion)
    feature_list.append(feature)
    feature_remainder_list.append(feature_remainder)
    feature_info_gain_list.append(feature_info_gain)
    # loop over each level of the descriptive feature and calculate the impurity of the
    # target feature and the relative number of observations in the level's partition
    for level in Q1[feature].unique():
        print('+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+')
        df_feature_level = Q1[Q1[feature] == level]
        print('corresponding data partition:')
        print(tabulate(df_feature_level, headers='keys', tablefmt='psql'))
        print('partition target feature impurity:', compute_impurity(df_feature_level['Annual_Income'], split_criterion))
        print('partition weight:', str(len(df_feature_level)) + '/' + str(len(Q1)))
Convert the threshold values in age_optimal_threshold_list to strings with the prefix Age_>= so they can be used as split labels.
age_optimal_threshold_list_str = ['Age_>=' + str(int(x)) for x in age_optimal_threshold_list]
age_optimal_threshold_list_str
Combine the categorical splits and the Age threshold splits into a single dataframe df_splits.
categorical_df= pd.DataFrame({'Split':feature_list, 'Remainder': feature_remainder_list, 'Information_Gain': feature_info_gain_list})
age_df= pd.DataFrame({'Split':age_optimal_threshold_list_str, 'Remainder': age_remainder_list, 'Information_Gain': age_info_gain_list})
df_splits = pd.concat([categorical_df, age_df], axis=0, sort=False)
df_splits
Sort df_splits by Information_Gain in descending order, then add an additional column Is_Optimal that is True for the row with the highest information gain (the optimal split) and False for all other rows. Display df_splits.
df_splits = df_splits.sort_values(by='Information_Gain', ascending=False).reset_index(drop=True)
df_splits['Is_Optimal']=False
df_splits.loc[0, 'Is_Optimal']=True
df_splits=df_splits.round(3)
df_splits
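If desired, the chosen split can be pulled out directly (a minimal sketch, assuming df_splits has been built as above):
df_splits.loc[df_splits['Is_Optimal'], ['Split', 'Remainder', 'Information_Gain']]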
Assume the descriptive feature Education is chosen as the root node and make predictions for the Annual_Income target feature.
For each level of Education, calculate the probability of each unique value of the target feature ('low', 'mid', 'high') within that partition, and take the value with the highest probability as the leaf prediction. Save this information for each level in the dictionary d_value_counts and insert it as a new row into the dataframe df_prediction.
df_prediction = pd.DataFrame(columns=['Leaf_Condition', 'Low_Income_Prob', 'Mid_Income_Prob', 'High_Income_Prob', 'Leaf_Prediction'])
# loop over each level in 'Education' and calculate the probability of occurrence of each unique value of
# the target feature Annual_Income
for level in Q1['Education'].unique():
    print('+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+')
    d_prediction = {'Leaf_Condition': "Education == '" + level + "'"}
    df_feature_level = Q1[Q1['Education'] == level]
    print(tabulate(df_feature_level, headers='keys', tablefmt='psql'))
    d_value_counts = df_feature_level['Annual_Income'].value_counts(normalize=True).to_dict()
    # Find the value in the target feature that has the highest probability; use it as the leaf prediction
    Keymax = max(d_value_counts, key=d_value_counts.get)
    print("Target value with highest probability: ", Keymax)
    # rename the target levels to the corresponding probability column names
    # (a dict comprehension avoids mutating d_value_counts while iterating over it)
    column_names = {'low': 'Low_Income_Prob', 'mid': 'Mid_Income_Prob', 'high': 'High_Income_Prob'}
    d_value_counts = {column_names[k]: v for k, v in d_value_counts.items()}
    d_prediction.update(d_value_counts)
    d_prediction.update({'Leaf_Prediction': Keymax})
    print(d_prediction)
    # Insert all the information as a new row in the dataframe df_prediction
    # (pd.concat is used because DataFrame.append has been removed in recent pandas versions)
    df_prediction = pd.concat([df_prediction, pd.DataFrame([d_prediction])], ignore_index=True)
Display df_prediction.
df_prediction
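Finally, a minimal sketch (assuming df_prediction has been built as above; the helper name predict_income is hypothetical) of using this depth-1 tree to predict Annual_Income for a new instance from its Education level:
def predict_income(education, rules=df_prediction):
    # look up the leaf whose condition matches the given Education level
    leaf = rules[rules['Leaf_Condition'] == "Education == '" + education + "'"]
    if leaf.empty:
        raise ValueError('Unseen Education level: ' + education)
    return leaf['Leaf_Prediction'].iloc[0]

predict_income('doctorate')   # returns the majority Annual_Income level of the doctorate partition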