Please refer to the Jupyter NBExtension README page to display
the table of contents in a floating window and expand the body of the contents.

s3806940_A2_Q2

Question 2 - Model Evaluation

Data Processing

  • Step 1: Install vega & vega_datasets and upgrade Altair each time the Azure project's server is started, then restart the kernel. These two lines are commented out once vega and vega_datasets have been installed.
In [1]:
#!pip install --upgrade altair

#!pip install vega vega_datasets
  • Step 2: Import all necessary packages: numpy, pandas, tabulate and altair, and set the appropriate display options
In [2]:
import pandas as pd
import numpy as np
from tabulate import tabulate
import altair as alt
alt.renderers.enable('notebook')
pd.set_option('display.max_columns', None) 
  • Step 3: Read in the A2_Q2.csv file, drop the column 'ID' as it is an index
In [3]:
Q2 = pd.read_csv("A2_Q2.csv")
Q2=Q2.drop(columns='ID')
  • Step 4: Display the first 5 rows in the dataset
In [4]:
Q2.head()
Out[4]:
Target Score
0 False 0.46
1 False 0.14
2 False 0.48
3 True 0.91
4 False 0.24

Part A - Confusion Matrix

Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.

  • Step 1: First, sort the dataset by Score in ascending order
In [5]:
Q2=Q2.sort_values(by='Score', ascending=True).reset_index(drop=True)
Q2.head()
Out[5]:
Target Score
0 False 0.03
1 False 0.11
2 False 0.14
3 False 0.14
4 False 0.17
  • Step 2: Define a function called makePrediction(), which adds an additional column Prediction (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C) to the passed-in dataframe df. The value in the Prediction column is marked False if its corresponding score is below the threshold, and True otherwise. Returns the column name of the prediction column. (A quick sanity check of this behaviour follows the cell below.)
In [6]:
def makePrediction(df, score_threshold):
    Prediction_Column = 'Prediction_' + str(score_threshold) 
    df[Prediction_Column]=np.where((df['Score'] < score_threshold), False, True)
    return Prediction_Column
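A quick sanity check of makePrediction() on a toy dataframe (illustrative values, not from A2_Q2.csv): scores below the 0.5 threshold become False, and scores at or above it become True.
In [ ]:
toy = pd.DataFrame({'Target': [False, True, True], 'Score': [0.46, 0.50, 0.91]})
makePrediction(toy, 0.5)
toy   # Prediction_0.5 should be [False, True, True]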
  • Step 3: Call the makePrediction() function by passing in the sorted dataframe Q2 and 0.5 as the score threshold, set the returned column name to prediction_column_point5
In [7]:
prediction_column_point5=makePrediction(Q2, 0.5)
  • Step 4: Define a function called makeOutcome(), which adds an additional column Outcome (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C) to the passed-in dataframe df. The value in the Outcome column is marked as follows (a toy example follows the cell below):

    • i. TP if its corresponding target and prediction values are both True
    • ii. TN if its corresponding target and prediction values are both False
    • iii. FN if its corresponding target value is True but the prediction value is False
    • iv. FP if its corresponding target value is False but the prediction value is True

      Returns the column name of the outcome column.

In [8]:
def makeOutcome(df, Prediction_column):
    # Derive the outcome column name from the prediction column name
    Outcome_Column = Prediction_column.replace('Prediction', 'Outcome')
    df[Outcome_Column]='NA'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == True), Outcome_Column] = 'TP'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == False), Outcome_Column] = 'TN'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == False), Outcome_Column] = 'FN'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == True), Outcome_Column] = 'FP'
    return Outcome_Column
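Continuing the toy example (illustrative values only, not from A2_Q2.csv), makeOutcome() should label the three rows TN, TP and TP:
In [ ]:
toy = pd.DataFrame({'Target': [False, True, True], 'Score': [0.46, 0.50, 0.91]})
pred_col = makePrediction(toy, 0.5)
out_col = makeOutcome(toy, pred_col)
toy[['Target', pred_col, out_col]]   # Outcome_0.5 should be ['TN', 'TP', 'TP']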
  • Step 5: Call the makeOutcome() function by passing in the sorted dataframe Q2 and prediction_column_point5, and set the returned column name to outcome_Column_point5
In [9]:
outcome_Column_point5 = makeOutcome(Q2, prediction_column_point5)
  • Step 6: Show the first 10 rows of the sorted dataframe Q2.
In [10]:
Q2.head(10)
Out[10]:
Target Score Prediction_0.5 Outcome_0.5
0 False 0.03 False TN
1 False 0.11 False TN
2 False 0.14 False TN
3 False 0.14 False TN
4 False 0.17 False TN
5 False 0.24 False TN
6 False 0.26 False TN
7 True 0.27 False FN
8 True 0.32 False FN
9 False 0.35 False TN
  • Step 7: Construct the confusion matrix using the pandas pd.crosstab() function; each column of the matrix represents a predicted class and each row represents an actual target class. (A cross-check of the cell counts follows the output below.)
In [11]:
confusion_matrix_point5 = pd.crosstab(index=Q2['Target'], columns=Q2[prediction_column_point5], rownames=['Target'], colnames=['Prediction'])
confusion_matrix_point5
Out[11]:
Prediction False True
Target
False 11 6
True 4 9
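As a quick cross-check (not one of the original cells), the crosstab counts above should agree with the outcome labels created in step 5, i.e. TN=11, FP=6, FN=4 and TP=9:
In [ ]:
Q2[outcome_Column_point5].value_counts()   # expect TN 11, FP 6, FN 4, TP 9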

Part B - Metrics

Compute the following 5 metrics

    1. Error Rate
    2. Precision
    3. TPR (True Positive Rate) (also known as Recall)
    4. F1-Score
    5. FPR (False Positive Rate)

Display the answers as a Pandas data frame called df_metrics. (The formulas used are written out below for reference.)
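For reference, these are the standard definitions implemented in calculateMetrics() below:

Error Rate = (FP + FN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
TPR (Recall) = TP / (TP + FN)
F1-Score = 2 * Precision * TPR / (Precision + TPR)
FPR = FP / (FP + TN)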
  • Step 1: Define a function called calculateMetrics(), which calculates the above 5 metrics from the passed-in Outcome_column of the dataframe df (the values of the outcome column are TP, TN, FP or FN, as generated in Part A step 4). If metric_type is specified as ROC, only TPR and FPR are calculated (so that we can reuse this function in Part C). Returns the calculated metrics as a dictionary (for the benefit of Part C).
In [12]:
def calculateMetrics(df, Outcome_column, metric_type='whole'):

    # Count each outcome label using the passed-in dataframe (not the global Q2)
    dict_Outcome = df[Outcome_column].value_counts().to_dict()
    TN=TP=FP=FN=0
    
    for k, v in dict_Outcome.items():
        if k == 'TN':
            TN = v 
        if k == 'TP':
            TP = v 
        if k == 'FP':
            FP = v 
        if k == 'FN':
            FN = v 
    TPR = TP / (TP+FN) 
    FPR = 1 - (TN /(TN+FP))

    if (metric_type != 'ROC'):
        Error_Rate = (FP+FN)/(TN+TP+FP+FN)
        Precision = TP / (TP+FP)
        F1_Score = 2 * ((Precision*TPR)/(Precision+TPR))
        dict_Metrics={'Error_Rate':Error_Rate, 'Precision':Precision, 'TPR':TPR, 'F1_Score':F1_Score, 'FPR':FPR}
    else:
        dict_Metrics={'TPR':TPR, 'FPR':FPR}

    return dict_Metrics
  • Step 2: Call the calculateMetrics() function by passing in the sorted dataframe Q2 and outcome_Column_point5 (the column name obtained from Part A step 5), and set the returned dictionary to dict_Metrics_Point5
In [13]:
dict_Metrics_Point5 = calculateMetrics(Q2, outcome_Column_point5)
  • Step 3: Construct the df_metrics dataframe from the dictionary dict_Metrics_Point5. Transpose df_metrics to show each metric value by row, as requested. (A quick hand-check of these values follows the output.)
In [14]:
# Build the one-row metrics frame directly from the dictionary
df_metrics = pd.DataFrame([dict_Metrics_Point5], columns=['Error_Rate', 'Precision', 'TPR', 'F1_Score', 'FPR'])

df_metrics=df_metrics.transpose().reset_index()
df_metrics.columns=['Metric', 'Value']

df_metrics.round(3)
Out[14]:
Metric Value
0 Error_Rate 0.333
1 Precision 0.600
2 TPR 0.692
3 F1_Score 0.643
4 FPR 0.353
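As a hand-check (not one of the original cells), plugging the Part A confusion-matrix counts (TP=9, TN=11, FP=6, FN=4) into the formulas reproduces the values above:
In [ ]:
TP, TN, FP, FN = 9, 11, 6, 4
precision, tpr = TP / (TP + FP), TP / (TP + FN)
print(round((FP + FN) / (TP + TN + FP + FN), 3))          # Error_Rate -> 0.333
print(round(precision, 3))                                # Precision  -> 0.6
print(round(tpr, 3))                                      # TPR        -> 0.692
print(round(2 * precision * tpr / (precision + tpr), 3))  # F1_Score   -> 0.643
print(round(FP / (FP + TN), 3))                           # FPR        -> 0.353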

Part C - TPR and FPR

By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. Display the answers as a Pandas data frame called df_roc

  • Step 1: Construct the score_threshold_list, which consists of the values from 0.1 to 0.9 in 0.1 increments. (An equivalent construction is sketched after the output below.)
In [15]:
score_threshold_list = [ x*0.1 for x in range(1, 10)]
score_threshold_list = np.round(score_threshold_list, 1)   
score_threshold_list
Out[15]:
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
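An equivalent construction (a minimal alternative, not from the original notebook) uses np.linspace, which avoids the floating-point pitfalls of stepping by 0.1:
In [ ]:
np.round(np.linspace(0.1, 0.9, 9), 1)   # array([0.1, 0.2, ..., 0.9])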
  • Step 2: Keep only the Target and Score columns in Q2, since we recompute everything from scratch.
In [16]:
Q2=Q2.loc[:, ['Target', 'Score']]
  • Step 3: Construct the df_roc dataframe with the desired column names. Loop over each score threshold and call the functions below (an optional cross-check follows the cell):

    • i. makePrediction(); the function is explained in Part A steps 2 and 3.
    • ii. makeOutcome(); the function is explained in Part A steps 4 and 5.

    These add additional prediction and outcome columns to the dataframe Q2.

    • iii. calculateMetrics() with metric_type='ROC'; the function is explained in Part B steps 1 and 2.

    This returns the calculated metrics d_metric, which is appended as a new row to the dataframe df_roc.

In [17]:
# Collect one (Threshold, TPR, FPR) row per threshold, then build df_roc in one step
roc_rows = []
for score_threshold in score_threshold_list:
    d_roc = {'Threshold': score_threshold}
    prediction_column = makePrediction(Q2, score_threshold)
    outcome_column = makeOutcome(Q2, prediction_column)
    d_metric = calculateMetrics(Q2, outcome_column, 'ROC')
    d_roc.update(d_metric)
    roc_rows.append(d_roc)
df_roc = pd.DataFrame(roc_rows, columns=['Threshold', 'TPR', 'FPR'])
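As an optional cross-check (assuming scikit-learn is available; this is not part of the original notebook), sklearn.metrics.roc_curve picks its own thresholds (the distinct score values), so its points differ from df_roc, but the resulting (FPR, TPR) pairs should trace the same curve:
In [ ]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(Q2['Target'], Q2['Score'])
pd.DataFrame({'Threshold': thresholds, 'TPR': tpr, 'FPR': fpr}).round(3)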
  • Step 4: Show the first 10 rows of the sorted dataframe Q2.
In [18]:
Q2.head(10)
Out[18]:
Target Score Prediction_0.1 Outcome_0.1 Prediction_0.2 Outcome_0.2 Prediction_0.3 Outcome_0.3 Prediction_0.4 Outcome_0.4 Prediction_0.5 Outcome_0.5 Prediction_0.6 Outcome_0.6 Prediction_0.7 Outcome_0.7 Prediction_0.8 Outcome_0.8 Prediction_0.9 Outcome_0.9
0 False 0.03 False TN False TN False TN False TN False TN False TN False TN False TN False TN
1 False 0.11 True FP False TN False TN False TN False TN False TN False TN False TN False TN
2 False 0.14 True FP False TN False TN False TN False TN False TN False TN False TN False TN
3 False 0.14 True FP False TN False TN False TN False TN False TN False TN False TN False TN
4 False 0.17 True FP False TN False TN False TN False TN False TN False TN False TN False TN
5 False 0.24 True FP True FP False TN False TN False TN False TN False TN False TN False TN
6 False 0.26 True FP True FP False TN False TN False TN False TN False TN False TN False TN
7 True 0.27 True TP True TP False FN False FN False FN False FN False FN False FN False FN
8 True 0.32 True TP True TP True TP False FN False FN False FN False FN False FN False FN
9 False 0.35 True FP True FP True FP False TN False TN False TN False TN False TN False TN
  • Step 5: Display the dataframe df_roc.
In [19]:
df_roc.round(3)
Out[19]:
Threshold TPR FPR
0 0.1 1.000 0.941
1 0.2 1.000 0.706
2 0.3 0.923 0.588
3 0.4 0.846 0.471
4 0.5 0.692 0.353
5 0.6 0.615 0.235
6 0.7 0.462 0.118
7 0.8 0.308 0.059
8 0.9 0.077 0.000

Part D - ROC curve

Use the dataframe df_roc and display an ROC curve with appropriate axes labels and a title.

  • Step 1: Refer to Cells #20 and #21 in our SK4 Tutorial. The ROC curve is plotted below:
In [20]:
import altair as alt
alt.renderers.enable('html')
base = alt.Chart(df_roc, 
                 title='ROC Curve of A2_Q2.csv'
                ).properties(width=300)

roc_curve = base.mark_line(point=True).encode(
    alt.X('FPR', title='False Positive Rate (FPR)',  sort=None),
    alt.Y('TPR', title='True Positive Rate (TPR) (a.k.a Recall)'),
)

roc_rule = base.mark_line(color='green').encode(
    x='FPR',
    y='FPR',
    size=alt.value(2)
)

(roc_curve + roc_rule).interactive()
Out[20]: (Altair chart: the ROC curve of A2_Q2.csv, with a green diagonal reference line)