Please refer to the Jupyter NBExtension Readme page to display the table of contents in a floating window and to expand the body of the contents.
#!pip install --upgrade altair
#!pip install vega vega_datasets
import pandas as pd
import numpy as np
from tabulate import tabulate
import altair as alt
alt.renderers.enable('notebook')
pd.set_option('display.max_columns', None)
Read in the A2_Q2.csv file and drop the column 'ID' as it is an index.
Q2 = pd.read_csv("A2_Q2.csv")
Q2=Q2.drop(columns='ID')
Q2.head()
Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.
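For reference, with the row and column labels used below (rows for the actual Target values and columns for the Prediction, each ordered False then True as pd.crosstab() sorts them), the four cells of the confusion matrix are:

$$\begin{array}{c|cc}
 & \text{Prediction}=\text{False} & \text{Prediction}=\text{True} \\ \hline
\text{Target}=\text{False} & TN & FP \\
\text{Target}=\text{True} & FN & TP
\end{array}$$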
Sort Q2 by Score in ascending order.
Q2 = Q2.sort_values(by='Score', ascending=True).reset_index(drop=True)
Q2.head()
Define a function called makePrediction(), which will add an additional column Prediction (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C) to the passed-in dataframe df. Mark the value in the Prediction column as True if its corresponding score is >= the threshold, otherwise mark it as False. Return the column name of the prediction column.
def makePrediction(df, score_threshold):
    Prediction_Column = 'Prediction_' + str(score_threshold)
    # Predict True when the score reaches the threshold, False otherwise
    df[Prediction_Column] = np.where(df['Score'] < score_threshold, False, True)
    return Prediction_Column
Call the makePrediction() function by passing in the sorted dataframe Q2 and 0.5 as the score threshold; set the returned column name to prediction_column_point5.
prediction_column_point5 = makePrediction(Q2, 0.5)
Step 4: Define a function called makeOutcome(), which will add an additional column Outcome (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C) to the passed-in dataframe df. Mark the value in the Outcome column as follows:
i. TP if its corresponding target value is True and the prediction value is True
ii. TN if its corresponding target value is False and the prediction value is False
iii. FN if its corresponding target value is True, but the prediction value is False
iv. FP if its corresponding target value is False, but the prediction value is True
Return the column name of the outcome column.
def makeOutcome(df, Prediction_column):
    Outcome_Column = Prediction_column.replace('Prediction', 'Outcome')
    df[Outcome_Column] = 'NA'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == True), Outcome_Column] = 'TP'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == False), Outcome_Column] = 'TN'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == False), Outcome_Column] = 'FN'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == True), Outcome_Column] = 'FP'
    return Outcome_Column
Step 5: Call the makeOutcome() function by passing in the sorted dataframe Q2 and prediction_column_point5; set the returned column name to outcome_Column_point5.
outcome_Column_point5 = makeOutcome(Q2, prediction_column_point5)
Display the first 10 rows of the updated dataframe Q2.
Q2.head(10)
Work out the confusion matrix using the pd.crosstab() function; each column of the matrix represents the predicted instances, and each row represents the instances in the actual target class.
confusion_matrix_point5 = pd.crosstab(index=Q2['Target'], columns=Q2[prediction_column_point5], rownames=['Target'], colnames=['Prediction'])
confusion_matrix_point5
Compute the following 5 metrics: Error Rate, Precision, TPR (Recall), F1 Score and FPR. Display the answers as a Pandas data frame called df_metrics.
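These definitions, matching the calculations in calculateMetrics() below, can be written as:

$$\text{Error Rate} = \frac{FP+FN}{TP+TN+FP+FN},\qquad \text{Precision} = \frac{TP}{TP+FP},\qquad \text{TPR (Recall)} = \frac{TP}{TP+FN}$$

$$\text{F1 Score} = 2\cdot\frac{\text{Precision}\cdot\text{TPR}}{\text{Precision}+\text{TPR}},\qquad \text{FPR} = \frac{FP}{TN+FP} = 1-\frac{TN}{TN+FP}$$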
Step 1: Define a function called calculateMetrics(), which will calculate the above 5 metrics for the passed-in Outcome_column of the dataframe df (values of the Outcome_column are either TP, TN, FP or FN, as generated in Part A step 4). If metric_type is specified as ROC, only TPR and FPR are calculated (so that we can reuse this function in Part C). Return the calculated metrics as a dictionary (this is for the benefit of Part C).
def calculateMetrics(df, Outcome_column, metric_type='whole'):
    # Count each outcome label in the passed-in dataframe (df, not the global Q2)
    dict_Outcome = df[Outcome_column].value_counts().to_dict()
    TN = TP = FP = FN = 0
    for k, v in dict_Outcome.items():
        if k == 'TN':
            TN = v
        if k == 'TP':
            TP = v
        if k == 'FP':
            FP = v
        if k == 'FN':
            FN = v
    TPR = TP / (TP + FN)
    FPR = 1 - (TN / (TN + FP))
    if metric_type != 'ROC':
        Error_Rate = (FP + FN) / (TN + TP + FP + FN)
        Precision = TP / (TP + FP)
        F1_Score = 2 * ((Precision * TPR) / (Precision + TPR))
        dict_Metrics = {'Error_Rate': Error_Rate, 'Precision': Precision, 'TPR': TPR, 'F1_Score': F1_Score, 'FPR': FPR}
    else:
        dict_Metrics = {'TPR': TPR, 'FPR': FPR}
    return dict_Metrics
Step 2: Call the calculateMetrics() function by passing in the sorted dataframe Q2 and outcome_Column_point5 (the column name obtained from Part A step 5); set the returned dictionary to dict_Metrics_Point5.
dict_Metrics_Point5 = calculateMetrics(Q2, outcome_Column_point5)
Construct the df_metrics dataframe from the dictionary dict_Metrics_Point5. Transpose df_metrics to show each metric value by row as requested.
df_metrics = pd.DataFrame(columns=['Error_Rate', 'Precision', 'TPR', 'F1_Score', 'FPR'])
df_metrics = df_metrics.append(dict_Metrics_Point5, ignore_index=True)
df_metrics=df_metrics.transpose().reset_index()
df_metrics.columns=['Metric', 'Value']
df_metrics.round(3)
By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. Display the answers as a Pandas data frame called df_roc.
Step 1: Construct a list score_threshold_list which consists of the values from 0.1 to 0.9 in 0.1 increments.
score_threshold_list = [x*0.1 for x in range(1, 10)]
score_threshold_list = np.round(score_threshold_list, 1)
score_threshold_list
Step 2: Keep only the Target and Score columns in Q2, as we start to compute everything all over again.
Q2 = Q2.loc[:, ['Target', 'Score']]
Step 3: Construct the df_roc dataframe with the desired column names. Loop over each score threshold and call the functions below:
- makePrediction() and makeOutcome(), to add additional prediction and outcome columns to the dataframe Q2.
- calculateMetrics() with metric_type specified as ROC (the function is explained in Part B steps 1 and 2), to obtain the calculated metrics d_metric, and append it as a new row to the dataframe df_roc.
df_roc = pd.DataFrame(columns=['Threshold', 'TPR', 'FPR'])
for score_threshold in score_threshold_list:
    d_roc = {'Threshold': score_threshold}
    prediction_column = makePrediction(Q2, score_threshold)
    outcome_column = makeOutcome(Q2, prediction_column)
    d_metric = calculateMetrics(Q2, outcome_column, 'ROC')
    d_roc.update(d_metric)
    df_roc = df_roc.append(d_roc, ignore_index=True)
Display the first 10 rows of the updated dataframe Q2.
Q2.head(10)
Display df_roc, rounded to 3 decimal places.
df_roc.round(3)
import altair as alt
alt.renderers.enable('html')
base = alt.Chart(df_roc,
                 title='ROC Curve of A2_Q2.csv'
                 ).properties(width=300)
roc_curve = base.mark_line(point=True).encode(
    alt.X('FPR', title='False Positive Rate (FPR)', sort=None),
    alt.Y('TPR', title='True Positive Rate (TPR) (a.k.a Recall)'),
)
# Diagonal reference line (a random classifier): plotting FPR on both axes gives the line y = x
roc_rule = base.mark_line(color='green').encode(
    x='FPR',
    y='FPR',
    size=alt.value(2)
)
(roc_curve + roc_rule).interactive()
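As a quick sanity check on the plotted curve, one could also approximate the area under the ROC curve from df_roc using the trapezoidal rule. This is a minimal sketch (not part of the original steps), assuming df_roc holds the TPR and FPR columns computed in Part C:
# Sort by FPR so the x-values increase before integrating
df_auc = df_roc.sort_values(by='FPR')
# Trapezoidal approximation of the area under the ROC curve;
# with only 9 thresholds this is a rough estimate
auc_estimate = np.trapz(df_auc['TPR'].astype(float), df_auc['FPR'].astype(float))
round(auc_estimate, 3)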