Please refer to the Jupyter NBExtension README page to display
the table of contents in a floating window and expand the body of the contents.

s3806940_A2_Q2

Question 2 - Model Evaluation

Data Processing

  • Step 1: Install vega & vega_datasets and upgrade Altair each time the Azure project's server is started, then restart the kernel. These two lines are commented out once vega and vega_datasets have been installed.
In [1]:
#!pip install --upgrade altair

#!pip install vega vega_datasets
  • Step 2: Import all necessary packages: numpy, pandas, tabulate and altair, and set the appropriate display options
In [2]:
import pandas as pd
import numpy as np
from tabulate import tabulate
import altair as alt
alt.renderers.enable('notebook')
pd.set_option('display.max_columns', None) 
  • Step 3: Read in the A2_Q2.csv file, drop the column 'ID' as it is an index
In [3]:
Q2 = pd.read_csv("A2_Q2.csv")
Q2=Q2.drop(columns='ID')
  • Step 4: Display the first 5 rows in the dataset
In [4]:
Q2.head()
Out[4]:
Target Score
0 False 0.46
1 False 0.14
2 False 0.48
3 True 0.91
4 False 0.24

Part A - Confusion Matrix

Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.

  • Step 1: First, sort the dataset by Score in ascending order
In [5]:
Q2=Q2.sort_values(by='Score', ascending=True).reset_index(drop=True)
Q2.head()
Out[5]:
Target Score
0 False 0.03
1 False 0.11
2 False 0.14
3 False 0.14
4 False 0.17
  • Step 2: Define a function called makePrediction(), which adds an additional column Prediction (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C) to the passed-in dataframe df. The value in the Prediction column is marked False if its corresponding score is below the threshold, and True otherwise. Returns the column name of the prediction column. (A quick sanity check of this behaviour follows the cell below.)
In [6]:
def makePrediction(df, score_threshold):
    Prediction_Column = 'Prediction_' + str(score_threshold) 
    df[Prediction_Column]=np.where((df['Score'] < score_threshold), False, True)
    return Prediction_Column
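A quick sanity check of makePrediction() on a toy dataframe (illustrative values, not from A2_Q2.csv): scores below the 0.5 threshold become False, and scores at or above it become True.
In [ ]:
toy = pd.DataFrame({'Target': [False, True, True], 'Score': [0.46, 0.50, 0.91]})
makePrediction(toy, 0.5)
toy   # Prediction_0.5 should be [False, True, True]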
  • Step 3: Call the makePrediction() function by passing in the sorted dataframe Q2 and 0.5 as the score threshold, set the returned column name to prediction_column_point5
In [7]:
prediction_column_point5=makePrediction(Q2, 0.5)
  • Step 4: Define a function called makeOutcome(), which adds an additional column Outcome (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C) to the passed-in dataframe df. The value in the Outcome column is marked as follows (a toy example follows the cell below):

    • i. TP if its corresponding target and prediction values are both True
    • ii. TN if its corresponding target and prediction values are both False
    • iii. FN if its corresponding target value is True but the prediction value is False
    • iv. FP if its corresponding target value is False but the prediction value is True

      Returns the column name of the outcome column.

In [8]:
def makeOutcome(df, Prediction_column):
    # Derive the outcome column name from the prediction column name
    Outcome_Column = Prediction_column.replace('Prediction', 'Outcome')
    df[Outcome_Column]='NA'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == True), Outcome_Column] = 'TP'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == False), Outcome_Column] = 'TN'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == False), Outcome_Column] = 'FN'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == True), Outcome_Column] = 'FP'
    return Outcome_Column
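Continuing the toy example (illustrative values only, not from A2_Q2.csv), makeOutcome() should label the three rows TN, TP and TP:
In [ ]:
toy = pd.DataFrame({'Target': [False, True, True], 'Score': [0.46, 0.50, 0.91]})
pred_col = makePrediction(toy, 0.5)
out_col = makeOutcome(toy, pred_col)
toy[['Target', pred_col, out_col]]   # Outcome_0.5 should be ['TN', 'TP', 'TP']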
  • Step 5: Call the makeOutcome() function by passing in the sorted dataframe Q2 and prediction_column_point5, and set the returned column name to outcome_Column_point5
In [9]:
outcome_Column_point5 = makeOutcome(Q2, prediction_column_point5)
  • Step 6: Show the first 10 rows of the sorted dataframe Q2.
In [10]:
Q2.head(10)
Out[10]:
Target Score Prediction_0.5 Outcome_0.5
0 False 0.03 False TN
1 False 0.11 False TN
2 False 0.14 False TN
3 False 0.14 False TN
4 False 0.17 False TN
5 False 0.24 False TN
6 False 0.26 False TN
7 True 0.27 False FN
8 True 0.32 False FN
9 False 0.35 False TN
  • Step 7: Construct the confusion matrix using the pandas pd.crosstab() function; each column of the matrix represents a predicted class and each row represents an actual target class. (A cross-check of the cell counts follows the output below.)
In [11]:
confusion_matrix_point5 = pd.crosstab(index=Q2['Target'], columns=Q2[prediction_column_point5], rownames=['Target'], colnames=['Prediction'])
confusion_matrix_point5
Out[11]:
Prediction False True
Target
False 11 6
True 4 9
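As a quick cross-check (not one of the original cells), the crosstab counts above should agree with the outcome labels created in step 5, i.e. TN=11, FP=6, FN=4 and TP=9:
In [ ]:
Q2[outcome_Column_point5].value_counts()   # expect TN 11, FP 6, FN 4, TP 9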

Part B - Metrics

Compute the following 5 metrics

    1. Error Rate
    2. Precision
    3. TPR (True Positive Rate) (also known as Recall)
    4. F1-Score
    5. FPR (False Positive Rate)

Display the answers as a Pandas data frame called df_metrics. (The formulas used are written out below for reference.)
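For reference, these are the standard definitions implemented in calculateMetrics() below:

Error Rate = (FP + FN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
TPR (Recall) = TP / (TP + FN)
F1-Score = 2 * Precision * TPR / (Precision + TPR)
FPR = FP / (FP + TN)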
  • Step 1: Define a function called calculateMetrics(), which calculates the above 5 metrics from the passed-in Outcome_column of the dataframe df (the values of the outcome column are TP, TN, FP or FN, as generated in Part A step 4). If metric_type is specified as ROC, only TPR and FPR are calculated (so that we can reuse this function in Part C). Returns the calculated metrics as a dictionary (for the benefit of Part C).
In [12]:
def calculateMetrics(df, Outcome_column, metric_type='whole'):

    # Count each outcome label using the passed-in dataframe (not the global Q2)
    dict_Outcome = df[Outcome_column].value_counts().to_dict()
    TN=TP=FP=FN=0
    
    for k, v in dict_Outcome.items():
        if k == 'TN':
            TN = v 
        if k == 'TP':
            TP = v 
        if k == 'FP':
            FP = v 
        if k == 'FN':
            FN = v 
    TPR = TP / (TP+FN) 
    FPR = 1 - (TN /(TN+FP))

    if (metric_type != 'ROC'):
        Error_Rate = (FP+FN)/(TN+TP+FP+FN)
        Precision = TP / (TP+FP)
        F1_Score = 2 * ((Precision*TPR)/(Precision+TPR))
        dict_Metrics={'Error_Rate':Error_Rate, 'Precision':Precision, 'TPR':TPR, 'F1_Score':F1_Score, 'FPR':FPR}
    else:
        dict_Metrics={'TPR':TPR, 'FPR':FPR}

    return dict_Metrics
  • Step 2: Call the calculateMetrics() function by passing in the sorted dataframe Q2 and outcome_Column_point5 (the column name obtained from Part A step 5), and set the returned dictionary to dict_Metrics_Point5
In [13]:
dict_Metrics_Point5 = calculateMetrics(Q2, outcome_Column_point5)
  • Step 3: Construct the df_metrics dataframe from the dictionary dict_Metrics_Point5. Transpose df_metrics to show each metric value by row, as requested. (A quick hand-check of these values follows the output.)
In [14]:
# Build the one-row metrics frame directly from the dictionary
df_metrics = pd.DataFrame([dict_Metrics_Point5], columns=['Error_Rate', 'Precision', 'TPR', 'F1_Score', 'FPR'])

df_metrics=df_metrics.transpose().reset_index()
df_metrics.columns=['Metric', 'Value']

df_metrics.round(3)
Out[14]:
Metric Value
0 Error_Rate 0.333
1 Precision 0.600
2 TPR 0.692
3 F1_Score 0.643
4 FPR 0.353
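As a hand-check (not one of the original cells), plugging the Part A confusion-matrix counts (TP=9, TN=11, FP=6, FN=4) into the formulas reproduces the values above:
In [ ]:
TP, TN, FP, FN = 9, 11, 6, 4
precision, tpr = TP / (TP + FP), TP / (TP + FN)
print(round((FP + FN) / (TP + TN + FP + FN), 3))          # Error_Rate -> 0.333
print(round(precision, 3))                                # Precision  -> 0.6
print(round(tpr, 3))                                      # TPR        -> 0.692
print(round(2 * precision * tpr / (precision + tpr), 3))  # F1_Score   -> 0.643
print(round(FP / (FP + TN), 3))                           # FPR        -> 0.353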

Part C - TPR and FPR

By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. Display the answers as a Pandas data frame called df_roc

  • Step 1: Construct the score_threshold_list, which consists of the values from 0.1 to 0.9 in 0.1 increments. (An equivalent construction is sketched after the output below.)
In [15]:
score_threshold_list = [ x*0.1 for x in range(1, 10)]
score_threshold_list = np.round(score_threshold_list, 1)   
score_threshold_list
Out[15]:
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
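An equivalent construction (a minimal alternative, not from the original notebook) uses np.linspace, which avoids the floating-point pitfalls of stepping by 0.1:
In [ ]:
np.round(np.linspace(0.1, 0.9, 9), 1)   # array([0.1, 0.2, ..., 0.9])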
  • Step 2: Keep only the Target and Score columns in Q2, since we recompute everything from scratch.
In [16]:
Q2=Q2.loc[:, ['Target', 'Score']]
  • Step 3: Construct the df_roc dataframe with the desired column names. Loop over each score threshold and call the functions below (an optional cross-check follows the cell):

    • i. makePrediction(); the function is explained in Part A steps 2 and 3.
    • ii. makeOutcome(); the function is explained in Part A steps 4 and 5.

    These add additional prediction and outcome columns to the dataframe Q2.

    • iii. calculateMetrics() with metric_type='ROC'; the function is explained in Part B steps 1 and 2.

    This returns the calculated metrics d_metric, which is appended as a new row to the dataframe df_roc.

In [17]:
# Collect one (Threshold, TPR, FPR) row per threshold, then build df_roc in one step
roc_rows = []
for score_threshold in score_threshold_list:
    d_roc = {'Threshold': score_threshold}
    prediction_column = makePrediction(Q2, score_threshold)
    outcome_column = makeOutcome(Q2, prediction_column)
    d_metric = calculateMetrics(Q2, outcome_column, 'ROC')
    d_roc.update(d_metric)
    roc_rows.append(d_roc)
df_roc = pd.DataFrame(roc_rows, columns=['Threshold', 'TPR', 'FPR'])
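As an optional cross-check (assuming scikit-learn is available; this is not part of the original notebook), sklearn.metrics.roc_curve picks its own thresholds (the distinct score values), so its points differ from df_roc, but the resulting (FPR, TPR) pairs should trace the same curve:
In [ ]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(Q2['Target'], Q2['Score'])
pd.DataFrame({'Threshold': thresholds, 'TPR': tpr, 'FPR': fpr}).round(3)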
  • Step 4: Show the first 10 rows of the sorted dataframe Q2.
In [18]:
Q2.head(10)
Out[18]:
Target Score Prediction_0.1 Outcome_0.1 Prediction_0.2 Outcome_0.2 Prediction_0.3 Outcome_0.3 Prediction_0.4 Outcome_0.4 Prediction_0.5 Outcome_0.5 Prediction_0.6 Outcome_0.6 Prediction_0.7 Outcome_0.7 Prediction_0.8 Outcome_0.8 Prediction_0.9 Outcome_0.9
0 False 0.03 False TN False TN False TN False TN False TN False TN False TN False TN False TN
1 False 0.11 True FP False TN False TN False TN False TN False TN False TN False TN False TN
2 False 0.14 True FP False TN False TN False TN False TN False TN False TN False TN False TN
3 False 0.14 True FP False TN False TN False TN False TN False TN False TN False TN False TN
4 False 0.17 True FP False TN False TN False TN False TN False TN False TN False TN False TN
5 False 0.24 True FP True FP False TN False TN False TN False TN False TN False TN False TN
6 False 0.26 True FP True FP False TN False TN False TN False TN False TN False TN False TN
7 True 0.27 True TP True TP False FN False FN False FN False FN False FN False FN False FN
8 True 0.32 True TP True TP True TP False FN False FN False FN False FN False FN False FN
9 False 0.35 True FP True FP True FP False TN False TN False TN False TN False TN False TN
  • Step 5: Display the dataframe df_roc.
In [19]:
df_roc.round(3)
Out[19]:
Threshold TPR FPR
0 0.1 1.000 0.941
1 0.2 1.000 0.706
2 0.3 0.923 0.588
3 0.4 0.846 0.471
4 0.5 0.692 0.353
5 0.6 0.615 0.235
6 0.7 0.462 0.118
7 0.8 0.308 0.059
8 0.9 0.077 0.000

Part D - ROC curve

Use the dataframe df_roc and display an ROC curve with appropriate axes labels and a title.

  • Step 1: Refer to Cells #20 and #21 in our SK4 Tutorial. The ROC curve is plotted below:
In [20]:
import altair as alt
alt.renderers.enable('html')
base = alt.Chart(df_roc, 
                 title='ROC Curve of A2_Q2.csv'
                ).properties(width=300)

roc_curve = base.mark_line(point=True).encode(
    alt.X('FPR', title='False Positive Rate (FPR)',  sort=None),
    alt.Y('TPR', title='True Positive Rate (TPR) (a.k.a Recall)'),
)

roc_rule = base.mark_line(color='green').encode(
    x='FPR',
    y='FPR',
    size=alt.value(2)
)

(roc_curve + roc_rule).interactive()
Out[20]: (Altair chart: the ROC curve of A2_Q2.csv, with a green diagonal reference line)