Please refer to the Jupyter NBExtension Readme page to display the table of contents in a floating window and expand the body of the contents.
#!pip install --upgrade altair
#!pip install vega vega_datasets
import pandas as pd
import numpy as np
from tabulate import tabulate
import altair as alt
alt.renderers.enable('notebook')
pd.set_option('display.max_columns', None)
Read in the A2_Q2.csv file and drop the column 'ID', as it is an index.
Q2 = pd.read_csv("A2_Q2.csv")
Q2 = Q2.drop(columns='ID')
Q2.head()
Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.
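Before working through the steps, here is a minimal sketch (using made-up toy labels, not the A2_Q2.csv data) of the layout pd.crosstab() produces when the rows are the actual target values and the columns are the predictions:
# Hypothetical toy labels, purely to illustrate the crosstab layout
toy = pd.DataFrame({'Target':     [True, True, False, False, True],
                    'Prediction': [True, False, False, True, True]})
pd.crosstab(index=toy['Target'], columns=toy['Prediction'],
            rownames=['Target'], colnames=['Prediction'])
# TN sits at (False, False), FP at (False, True), FN at (True, False), TP at (True, True)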
Step 1: Sort the dataframe by the Score column in ascending order.
Q2 = Q2.sort_values(by='Score', ascending=True).reset_index(drop=True)
Q2.head()
Step 2: Define a function called makePrediction(), which adds an additional column Prediction to the passed-in dataframe df (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C). Mark the value in the Prediction column as False if its corresponding score is below the threshold, otherwise mark it as True. The function returns the name of the prediction column.
def makePrediction(df, score_threshold):
    Prediction_Column = 'Prediction_' + str(score_threshold)
    # Scores below the threshold are predicted False, all others True
    df[Prediction_Column] = np.where(df['Score'] < score_threshold, False, True)
    return Prediction_Column
Step 3: Call the makePrediction() function by passing in the sorted dataframe Q2 and 0.5 as the score threshold, and set the returned column name to prediction_column_point5.
prediction_column_point5 = makePrediction(Q2, 0.5)
Step 4: Define a function called makeOutcome(), which adds an additional column Outcome to the passed-in dataframe df (the column name is concatenated with the score threshold as a suffix, so that we can reuse this function in Part C). Mark the value in the Outcome column as follows:
i. TP if its corresponding target value is True and the prediction value is True
ii. TN if its corresponding target value is False and the prediction value is False
iii. FN if its corresponding target value is True, but the prediction value is False
iv. FP if its corresponding target value is False, but the prediction value is True
The function returns the name of the outcome column.
def makeOutcome(df, Prediction_column):
    # Derive the outcome column name from the prediction column name
    Outcome_Column = Prediction_column.replace('Prediction', 'Outcome')
    df[Outcome_Column] = 'NA'
    # Label each row according to its target vs. prediction combination
    df.loc[(df['Target'] == True) & (df[Prediction_column] == True), Outcome_Column] = 'TP'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == False), Outcome_Column] = 'TN'
    df.loc[(df['Target'] == True) & (df[Prediction_column] == False), Outcome_Column] = 'FN'
    df.loc[(df['Target'] == False) & (df[Prediction_column] == True), Outcome_Column] = 'FP'
    return Outcome_Column
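As an optional sanity check, the two helper functions can be exercised together on a small made-up dataframe (the values below are hypothetical and not taken from A2_Q2.csv):
# Hypothetical toy data to sanity-check makePrediction() and makeOutcome()
toy = pd.DataFrame({'Target': [True, False, True, False],
                    'Score':  [0.9, 0.8, 0.3, 0.1]})
p_col = makePrediction(toy, 0.5)   # adds 'Prediction_0.5'
o_col = makeOutcome(toy, p_col)    # adds 'Outcome_0.5'
toy
# Expected outcomes, row by row: TP, FP, FN, TN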
Step 5: Call the makeOutcome() function by passing in the sorted dataframe Q2 and prediction_column_point5, and set the returned column name to outcome_column_point5.
outcome_column_point5 = makeOutcome(Q2, prediction_column_point5)
Display the first 10 rows of the updated dataframe Q2.
Q2.head(10)
Step 6: Work out the confusion matrix using the pd.crosstab() function; each column of the matrix represents the predicted instances, and each row represents the instances in an actual target class.
confusion_matrix_point5 = pd.crosstab(index=Q2['Target'], columns=Q2[prediction_column_point5], rownames=['Target'], colnames=['Prediction'])
confusion_matrix_point5
Compute the following 5 metrics: error rate, precision, TPR (recall), F1 score and FPR. Display the answers as a Pandas data frame called df_metrics.
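For reference, the standard definitions of these metrics in terms of the confusion-matrix counts are:
Error rate = (FP + FN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
TPR (recall) = TP / (TP + FN)
F1 score = 2 * Precision * TPR / (Precision + TPR)
FPR = FP / (FP + TN)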
Step 1: Define a function called calculateMetrics(), which calculates the above 5 metrics from the passed-in Outcome_column of the dataframe df (the values of the outcome column are either TP, TN, FP or FN, as generated in Part A step 4). If metric_type is specified as ROC, only TPR and FPR are calculated (so that we can reuse this function in Part C). The function returns the calculated metrics as a dictionary (this is for the benefit of Part C).
def calculateMetrics(df, Outcome_column, metric_type='whole'):
    # Count how many rows fall into each outcome class
    dict_Outcome = df[Outcome_column].value_counts().to_dict()
    TN = TP = FP = FN = 0
    for k, v in dict_Outcome.items():
        if k == 'TN':
            TN = v
        if k == 'TP':
            TP = v
        if k == 'FP':
            FP = v
        if k == 'FN':
            FN = v
    TPR = TP / (TP + FN)
    FPR = 1 - (TN / (TN + FP))
    if metric_type != 'ROC':
        Error_Rate = (FP + FN) / (TN + TP + FP + FN)
        Precision = TP / (TP + FP)
        F1_Score = 2 * ((Precision * TPR) / (Precision + TPR))
        dict_Metrics = {'Error_Rate': Error_Rate, 'Precision': Precision, 'TPR': TPR, 'F1_Score': F1_Score, 'FPR': FPR}
    else:
        dict_Metrics = {'TPR': TPR, 'FPR': FPR}
    return dict_Metrics
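As a quick illustration (using a hypothetical mini dataframe, not the assignment data), a column containing one outcome of each kind should yield 0.5 for every metric:
# Hypothetical mini example: one outcome of each kind, so TP = TN = FP = FN = 1
toy_out = pd.DataFrame({'Outcome_demo': ['TP', 'TN', 'FP', 'FN']})
calculateMetrics(toy_out, 'Outcome_demo')
# Every metric works out to 0.5 here: e.g. TPR = 1/(1+1), FPR = 1 - 1/(1+1)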
Step 2: Call the calculateMetrics() function by passing in the sorted dataframe Q2 and outcome_column_point5 (the column name obtained from Part A step 5), and set the returned dictionary to dict_Metrics_Point5.
dict_Metrics_Point5 = calculateMetrics(Q2, outcome_column_point5)
Step 3: Construct the df_metrics dataframe from the dictionary dict_Metrics_Point5, and transpose df_metrics to show each metric value by row as requested.
df_metrics = pd.DataFrame([dict_Metrics_Point5], columns=['Error_Rate', 'Precision', 'TPR', 'F1_Score', 'FPR'])
df_metrics = df_metrics.transpose().reset_index()
df_metrics.columns = ['Metric', 'Value']
df_metrics.round(3)
By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. Display the answers as a Pandas data frame called df_roc.
Step 1: Construct a list called score_threshold_list, which consists of the values from 0.1 to 0.9 in 0.1 increments.
score_threshold_list = [x * 0.1 for x in range(1, 10)]
score_threshold_list = np.round(score_threshold_list, 1)
score_threshold_list
Step 2: Keep only the Target and Score columns in Q2, as we start to compute everything all over again.
Q2 = Q2.loc[:, ['Target', 'Score']]
Step 3: Construct the df_roc dataframe with the desired column names. Loop over each score threshold and call the functions below:
i. makePrediction() and makeOutcome(), to add additional prediction and outcome columns to the dataframe Q2.
ii. calculateMetrics() with metric_type set to ROC (the function is explained in Part B steps 1 and 2), to obtain the calculated metrics d_metric, which are appended as a new row to the dataframe df_roc.
df_roc = pd.DataFrame(columns=['Threshold', 'TPR', 'FPR'])
for score_threshold in score_threshold_list:
    d_roc = {'Threshold': score_threshold}
    prediction_column = makePrediction(Q2, score_threshold)
    outcome_column = makeOutcome(Q2, prediction_column)
    # Only TPR and FPR are needed for the ROC curve
    d_metric = calculateMetrics(Q2, outcome_column, 'ROC')
    d_roc.update(d_metric)
    df_roc = pd.concat([df_roc, pd.DataFrame([d_roc])], ignore_index=True)
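If scikit-learn happens to be installed (it is not otherwise used in this notebook, so treat this as an optional, assumed dependency), the TPR/FPR values can be cross-checked against sklearn.metrics.roc_curve, which computes them directly from the target and score columns:
# Optional cross-check against scikit-learn (assumed installed; not required by the assignment)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(Q2['Target'], Q2['Score'])
pd.DataFrame({'Threshold': thresholds, 'TPR': tpr, 'FPR': fpr}).round(3)
# roc_curve picks its thresholds from the observed scores, so its rows will not line up
# one-to-one with the 0.1-0.9 grid above, but the (FPR, TPR) points trace the same curve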
Display the first 10 rows of the updated dataframe Q2.
Q2.head(10)
Display the dataframe df_roc, rounded to 3 decimal places.
df_roc.round(3)
import altair as alt
alt.renderers.enable('html')
base = alt.Chart(df_roc,
title='ROC Curve of A2_Q2.csv'
).properties(width=300)
roc_curve = base.mark_line(point=True).encode(
    alt.X('FPR', title='False Positive Rate (FPR)', sort=None),
    alt.Y('TPR', title='True Positive Rate (TPR) (a.k.a. Recall)'),
)
# Diagonal reference line (a random classifier has TPR == FPR)
roc_rule = base.mark_line(color='green').encode(
    x='FPR',
    y='FPR',
    size=alt.value(2)
)
(roc_curve + roc_rule).interactive()
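If the Altair renderer is not available in a given environment, a plain matplotlib fallback (assuming matplotlib is installed) produces an equivalent static plot:
# Fallback static ROC plot with matplotlib, in case the Altair renderer is unavailable
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
plt.plot(df_roc['FPR'], df_roc['TPR'], marker='o', label='ROC curve')
plt.plot([0, 1], [0, 1], color='green', label='Random classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR) (a.k.a. Recall)')
plt.title('ROC Curve of A2_Q2.csv')
plt.legend()
plt.show()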