Please refer to the Jupyter NBExtension Readme page to display the table of contents in a floating window and to expand the body of the contents.
Data preprocessing is a critical component of machine learning, and its importance cannot be overstated.
For this question, we perform all data preprocessing steps on a dataset from the UCI ML Datasets Repository so that the clean dataset can be fed directly into any classification algorithm within the Scikit-Learn Python module without further changes.
This dataset is the Credit Approval data at the following address:
https://archive.ics.uci.edu/ml/datasets/Credit+Approval
The UCI Repository provides four files for this dataset, but only two of them are relevant: the data file (Assignment1_Q1_crx.data) and the names file (Assignment1_Q1_crx.names).
Import the required modules: numpy, pandas, matplotlib and statistics.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics as stat
%matplotlib inline
Read in Assignment1_Q1_crx.data and assign the column names as described in the file Assignment1_Q1_crx.names.
crx_columns = list(map(str, range(1, 17)))
crx_columns = ['A' + sub for sub in crx_columns]
Data = pd.read_csv("Assignment1_Q1_crx.data", names=crx_columns, index_col=False, header=None)
Display the first 5 rows in the dataset
Data.head()
print(Data.shape)
print(Data.dtypes)
According to the file Assignment1_Q1_crx.names, features A2 and A14 should be numeric and continuous. Convert these 2 features to numeric and transform the non-numeric values into NaN.
Data['A2']=Data['A2'].apply(pd.to_numeric, errors='coerce')
Data['A14']=Data['A14'].apply(pd.to_numeric, errors='coerce')
Check whether there are any null values in the dataset. NaN values will also return True with the isnull() function.
Data.isnull().sum()
Null or NaN values are found only in features A2 and A14. Locate the rows with NaN or null values in these 2 features. For feature A2, the first row with a NaN value is at index 83; for feature A14, the first row with a NaN value is at index 71.
Data.loc[Data['A2'].isnull()]
Data.loc[Data['A14'].isnull()]
Since features A2 and A14 are both numeric, we impute the NaN values in these columns with their respective medians as requested.
print("Median for column A2:", round(np.nanmedian(Data['A2']),2))
print("Median for column A14:", round(np.nanmedian(Data['A14']),2))
Data.loc[Data['A2'].isnull(), 'A2']= round(np.nanmedian(Data['A2']),2)
Data.loc[Data['A14'].isnull(), 'A14']= round(np.nanmedian(Data['A14']),2)
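Equivalently, the same imputation could be done with fillna; this is a minimal sketch that matches the .loc assignment above (including rounding the median to 2 decimal places), shown for reference only.
# equivalent one-liner imputation with the rounded median (no-op if already imputed above)
Data['A2'] = Data['A2'].fillna(round(Data['A2'].median(), 2))
Data['A14'] = Data['A14'].fillna(round(Data['A14'].median(), 2))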
Sanity-check whether the median values have been imputed at the indices identified for features A2 and A14 as described in step 7.
print("First Imputed values in column A2:", Data.loc[83,'A2'])
print("First Imputed values in column A14:", Data.loc[71,'A14'])
Generate summary statistics for all the numerical columns and a boxplot for each numerical feature to better understand the distribution of the data.
Data.describe(include=np.number).round(2)
numerical_cols = Data.select_dtypes(include=['float64', 'int64']).columns.tolist()
for col in numerical_cols:
    print(f"Column Name: {col}:")
    plt.boxplot(Data[col])
    plt.show()
From the above boxplots, we find that the data are not normally distributed for any of the features. The range is particularly large for features A14 and A15. As this is an anonymised dataset, we have limited domain knowledge and no reason to believe that IQR-based outlier detection is applicable. Thus, all the values in the dataset might be perfectly valid, and we do not flag any numerical values as outliers regardless of their magnitude.
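For reference only, if IQR-based outlier detection were deemed applicable, a minimal sketch that flags (without removing) values outside 1.5 x IQR would look like this; we do not apply it here.
# illustrative only: count values outside the 1.5*IQR whiskers per numerical column
for col in numerical_cols:
    q1, q3 = Data[col].quantile(0.25), Data[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_flagged = ((Data[col] < lower) | (Data[col] > upper)).sum()
    print(f'{col}: {n_flagged} values outside [{lower:.2f}, {upper:.2f}]')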
Generate the summary statistics and list the value counts for each categorical feature.
Data.describe(include=object)
categorical_cols = Data.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    print(f'Before Transformation {col} -', Data[col].unique())
    print(Data[col].value_counts())
The value '?' is found in some of the categorical features. Replace it with the mode value of the corresponding feature, then check the unique values and their counts for each feature to verify that the mode values have been imputed.
categorical_cols = Data.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    Data.loc[(Data[col] == '?'), col] = stat.mode(Data[col])
    print(f'After Transformation {col} -', Data[col].unique())
    print(Data[col].value_counts())
From the output of the previous step, we can see that columns A4 and A5 have the same number of unique values. Check whether each unique value in A4 maps to a corresponding value in A5.
Data.groupby(['A4', 'A5']).size()
The above output shows that columns A4 and A5 might carry the same information: whenever A4 has the value l, A5 has the value gg, and so on. However, the file Assignment1_Q1_crx.names tells us that A4 can take 4 values while A5 can only take 3, as below:
A4: u, y, l, t.
A5: g, p, gg.
Even though we could simplify our dataset by removing a column with redundant information, we have no domain knowledge of this anonymised dataset, so it is safer to keep both columns. We keep this finding in mind and can revisit the cleaning approach once we understand the data better.
For the numerical descriptive feature A2, discretize it via equal-frequency binning with 3 bins named "low", "medium", and "high", and then apply integer encoding.
Data['A2'] = pd.qcut(Data['A2'], q=3, labels=['low', 'medium', 'high'])
level_mapping = {'low': 0, 'medium': 1, 'high': 2}
Data['A2'] = Data['A2'].replace(level_mapping)
Data['A2'].value_counts()
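Depending on the pandas version, the replace call above may leave A2 with a categorical dtype. If one wants to guarantee plain integers before the later one-hot encoding step, the replace line could be swapped for an explicit map; this is a sketch of an alternative, not an additional step (do not run both, since map expects the original string labels).
# alternative to the replace call above: always yields plain integers
# Data['A2'] = Data['A2'].map(level_mapping).astype(int)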
Remove feature A16, as it is the class attribute in the dataset, and call it target. The rest of the features will be the descriptive features in our dataset.
target = Data['A16']
Data = Data.drop(columns='A16')
target.value_counts()
Label-encode the target feature so that the positive class is encoded as 1 and the negative class as 0. Confirm the correctness of our label-encoding by checking the value counts.
target = target.replace({'-': 0, '+': 1})
target.value_counts()
As all the categorical features appear to be nominal, perform one-hot encoding for all the descriptive categorical features and call the encoded data frame Data_encoded. If a categorical descriptive feature has only 2 levels, encode it with only one binary variable. For the other categorical features (with more than 2 levels), use regular one-hot encoding (where the number of binary variables equals the number of distinct levels).
Data_encoded = Data.copy() # retain original Data without encoding for further analysis
print(Data_encoded.columns)
# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
categorical_cols = Data_encoded.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    q = len(Data_encoded[col].unique())
    if q == 2:
        Data_encoded[col] = pd.get_dummies(Data_encoded[col], drop_first=True)
# for other categorical features (with > 2 levels),
# perform regular one-hot-encoding using pd.get_dummies()
# if a feature is numeric, it will be untouched
Data_encoded = pd.get_dummies(Data_encoded)
print(Data_encoded.columns)
Check the data types of all the descriptive features. Once they are all numerical, the dataset is almost ready to be fed into the Scikit-Learn module.
Data_encoded.dtypes
Normalize the descriptive features using standard scaling (StandardScaler from the preprocessing submodule of Scikit-Learn), and call the result Data_encoded_norm_numpy. We keep Data_encoded around to keep track of the column names.
from sklearn import preprocessing
Data_scaler = preprocessing.StandardScaler()
Data_encoded_norm_numpy = Data_scaler.fit_transform(Data_encoded)
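If strict range (min-max) normalization were preferred instead of standard scaling, a minimal sketch with MinMaxScaler would be as follows; we do not use this variant here, and the rest of the notebook continues with the StandardScaler output.
# alternative: min-max (range) normalization to [0, 1]
range_scaler = preprocessing.MinMaxScaler()
Data_encoded_range_numpy = range_scaler.fit_transform(Data_encoded)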
As Data_encoded_norm_numpy is a NumPy array, all the column names are lost. Define a new Pandas data frame called Data_encoded_norm_df from Data_encoded_norm_numpy with the column names of Data_encoded. Finally, get the shape and a description of Data_encoded_norm_df with the include='all' option.
Data_encoded_norm_df = pd.DataFrame(Data_encoded_norm_numpy,
columns=Data_encoded.columns)
print(f'Shape of Data_encoded_norm_df is {Data_encoded_norm_df.shape}\n')
Data_encoded_norm_df.describe(include='all').round(3)
Sanity-check some categorical features: there should be only 3 unique values for feature A2 and 2 unique values for feature A7_h.
Data_encoded_norm_df['A2'].value_counts()
Data_encoded_norm_df['A7_h'].value_counts()
Define a new data frame called df_clean, which is the combination of the normalized and scaled descriptive features and the target feature, with the target feature as the last column. Ensure the target feature is a NumPy array by calling .values.
df_clean = Data_encoded_norm_df.assign(target = target.values)
Generate the summary output of the final data frame df_clean.
pd.set_option('display.max_columns', 50)
df_clean.shape
df_clean.describe(include='all').round(3)
df_clean.head(5)
Write the final dataset df_clean to a CSV file called df_clean.csv.
# set index to False so that row IDs are not written
df_clean.to_csv('df_clean.csv', index=False)
Using the KNN algorithm with Manhattan distance, predict the level of corruption in Russia based on a range of macro-economic and social features of a list of given countries. The CPI index measures the perceived level of corruption in the public sector of a country and ranges from 0 (highly corrupt) to 10 (very clean).
Read the Asignment1_Q2.csv file into a dataframe oCorr, which stands for "original Corruption".
oCorr = pd.read_csv("Asignment1_Q2.csv", index_col=False, header=0)
oCorr
oCorr.dtypes
Since CPI should be a numeric value, convert it to numeric.
oCorr['CPI']=oCorr['CPI'].apply(pd.to_numeric, errors='coerce')
The last row in the dataset is our query row, for which we want to predict the CPI index. Extract the last row into a new dataframe, q, and remove it from the dataset oCorr; oCorr is now our training data set. Please note, the information for Afghanistan is located at index 0, etc.
q = oCorr.iloc[-1,:].to_frame()
q=q.T.reset_index(drop=True)
q
oCorr = oCorr.iloc[:-1,:]
oCorr
As CPI is our target feature, extract it into a separate dataframe, oTarget, and drop this feature from the oCorr and q dataframes.
oTarget = oCorr['CPI']
oCorr = oCorr.drop(columns='CPI')
q = q.drop(columns='CPI')
Transform the training and query dataframes into 2-dimensional NumPy arrays (excluding the COUNTRY_ID column) for later calculations.
OCorrMatrix = oCorr.iloc[:, 1:].values
qMatrix = q.iloc[:, 1:].values
type(OCorrMatrix)
OCorrMatrix
qMatrix
What value would a 3-nearest neighbor prediction model using Manhattan distance return for the CPI of Russia?
Define a function, calManhattan, which calculates the Manhattan distance from each row of the descriptive matrix to the target (query) matrix. There is only one row in the target matrix, with index 0. The calculated distances are returned as a 1-dimensional NumPy array.
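For reference, the Manhattan (L1) distance between a training row $\mathbf{a}$ and the query row $\mathbf{b}$ is
$$d(\mathbf{a}, \mathbf{b}) = \sum_{i} |a_i - b_i|$$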
def calManhattan(descMatrix, targetMatrix):
    M_dist = np.zeros(len(descMatrix))
    for i in range(len(descMatrix)):
        M_dist[i] = sum(np.abs(descMatrix[i, :] - targetMatrix[0, :]))
    return M_dist
Pass the training and query matrices to the calManhattan function to obtain the Manhattan distance for each row, and store the result in the NumPy array M. Please note, the first cell, with index 0, is the Manhattan distance between Afghanistan and Russia, etc.
M = calManhattan(OCorrMatrix, qMatrix)
M
Construct a dataframe, result_a, which shows the Manhattan distance for each country and its corresponding target feature.
data_a = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': M.tolist(), 'CPI' : oTarget.tolist()}
result_a = pd.DataFrame(data_a)
Sort result_a based on the Manhattan distance in ascending order.
result_a = result_a.sort_values('Manhattan distance to Russia')
result_a
Draw the conclusion from result_a as below:
The nearest 3 neighbors to Russia are Argentina, USA and China. The CPI value returned by the model is the average CPI score of these 3 neighbors, which is:
nearest3CPI = result_a.iloc[0:3,2].tolist()
print('CPI for the first 3 neighbors:', nearest3CPI)
avgnearest3CPI = np.round((sum(nearest3CPI)/3), 4)
print('Average CPI for the first 3 neighbors:', avgnearest3CPI)
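As an optional sanity check (not part of the required manual calculation), the same 3-NN prediction could be reproduced with Scikit-Learn's KNeighborsRegressor using the Manhattan metric; a minimal sketch, assuming the descriptive columns in OCorrMatrix are all numeric:
# optional cross-check of the manual 3-NN calculation
from sklearn.neighbors import KNeighborsRegressor
knn3 = KNeighborsRegressor(n_neighbors=3, metric='manhattan')
knn3.fit(OCorrMatrix, oTarget.values)
print('3-NN prediction for Russia:', np.round(knn3.predict(qMatrix)[0], 4))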
What value would a weighted k-NN prediction model return for the CPI of Russia? Use k = 16 (i.e., the full dataset) and a weighting scheme of the reciprocal of the squared Manhattan distance between the neighbor and the query.
Calculate the weight array, W, which is the reciprocal of the squared Manhattan distance between each neighbor country and the query, Russia.
W = 1 / (M)**2
W
Calculate the weights multiplied by the instance target values (i.e. CPI) and store the result as our product array.
product = W*oTarget.values
Construct a dataframe, result_b, which shows the Manhattan distance for each country and its corresponding target feature (i.e. CPI), weight and weight x CPI.
data_b = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': M.tolist(), 'CPI' : oTarget.tolist(), 'Weight' : W.tolist(), 'Weight x CPI': product.tolist()}
result_b = pd.DataFrame(data_b)
Sort result_b based on the Manhattan distance in ascending order.
result_b = result_b.sort_values('Manhattan distance to Russia')
result_b
Draw the conclusion from result_b as below:
Since we are using k = 16, the value returned by the model is the sum of the instance weights multiplied by the instance target values, divided by the sum of the instance weights over all instances:
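In formula form, with $d_i$ the Manhattan distance of neighbor $i$ to the query and weights $w_i = 1 / d_i^2$, the prediction is
$$\widehat{\text{CPI}}_q = \frac{\sum_{i=1}^{k} w_i \, \text{CPI}_i}{\sum_{i=1}^{k} w_i}$$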
weight16 = np.round(sum(result_b['Weight x CPI']) / sum(result_b['Weight']),4)
print('Weight CPI using full set of data k = 16:', weight16)
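Similarly to part a, this weighted prediction could be cross-checked with KNeighborsRegressor by passing a callable weighting function; a sketch only, assuming no neighbor is at zero distance from the query:
# optional cross-check: weighted k-NN over the full training set (k = 16)
from sklearn.neighbors import KNeighborsRegressor
knn16 = KNeighborsRegressor(n_neighbors=16, metric='manhattan',
                            weights=lambda d: 1 / d**2)
knn16.fit(OCorrMatrix, oTarget.values)
print('Weighted 16-NN prediction for Russia:', np.round(knn16.predict(qMatrix)[0], 4))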
The descriptive features in this dataset are of different types. For example, some are percentages, others are measured in years, and others are measured in counts per 1,000. We should always consider normalizing our data, but it is particularly important when the descriptive features are measured in different units. What value would a 3-nearest neighbor prediction model using Manhattan distance return for the CPI of Russia when the descriptive features have been normalized using range normalization?
Define a function, normal2DByRow, which takes a 2-dimensional descMatrix as a parameter and returns this matrix after min-max normalization (applied column by column, i.e. per feature).
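For each feature (column) $x$, min-max (range) normalization maps every value into $[0, 1]$:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$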
def normal2DByRow(descMatrix):
    norm = np.zeros((len(descMatrix), descMatrix.shape[1]))
    for i in range(descMatrix.shape[1]):
        minVal = min(descMatrix[:, i])
        maxVal = max(descMatrix[:, i])
        for j in range(len(descMatrix)):
            norm[j, i] = (descMatrix[j, i] - minVal) / (maxVal - minVal)
    return norm
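The explicit loops above are kept for clarity; an equivalent vectorized sketch (the same column-wise min-max normalization, using NumPy broadcasting, with nCorrMatrix_vec as an illustrative name) would be:
# vectorized column-wise min-max normalization, equivalent to normal2DByRow
col_min = OCorrMatrix.min(axis=0)
col_max = OCorrMatrix.max(axis=0)
nCorrMatrix_vec = (OCorrMatrix - col_min) / (col_max - col_min)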
Pass the training matrix OCorrMatrix to the normal2DByRow function to obtain nCorrMatrix.
nCorrMatrix=normal2DByRow(OCorrMatrix)
nCorrMatrix
Define a function, normalTFromDByRow, which takes two 2-dimensional matrices, descMatrix and targetMatrix, as parameters. descMatrix can have any number of rows and columns, while targetMatrix must have exactly 1 row and the same number of columns as descMatrix. We then normalize targetMatrix based on the column-wise minimum and maximum values of descMatrix. This function returns the normalized targetMatrix.
def normalTFromDByRow(descMatrix, targetMatrix):
    norm = np.zeros((1, targetMatrix.shape[1]))
    for i in range(descMatrix.shape[1]):
        minVal = min(descMatrix[:, i])
        maxVal = max(descMatrix[:, i])
        norm[0, i] = (targetMatrix[0, i] - minVal) / (maxVal - minVal)
    return norm
Pass the training matrix OCorrMatrix and the query matrix qMatrix to the normalTFromDByRow function to obtain nQMatrix.
nQMatrix = normalTFromDByRow(OCorrMatrix,qMatrix)
nQMatrix
Construct a dataframe, nCorrDf, to show the normalized descriptive features with headings.
nCorrDf = pd.DataFrame(data=nCorrMatrix, columns=oCorr.columns[1:])
nCorrDf = pd.concat([oCorr['COUNTRY_ID'].to_frame(), nCorrDf], axis=1)
nCorrDf
Construct a dataframe, nQdf, to show the normalized query with headings.
nQdf = pd.DataFrame(data=nQMatrix, columns=q.columns[1:])
qCountry = q['COUNTRY_ID'].reset_index(drop=True).to_frame()
nQdf = pd.concat([qCountry, nQdf], axis=1)
nQdf
Pass the normalized training matrix (nCorrMatrix) and query matrix (nQMatrix) to the calManhattan function to obtain the Manhattan distance for each row, and store the result in the NumPy array nM. Please note, the first cell, with index 0, is the Manhattan distance between Afghanistan and Russia, etc.
nM = calManhattan(nCorrMatrix, nQMatrix)
nM
Construct a dataframe, result_c, which shows the Manhattan distance after normalization for each country and its corresponding target feature. Please note, the target feature is not normalized.
data_c = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': nM.tolist(), 'CPI' : oTarget.tolist()}
result_c = pd.DataFrame(data_c)
result_c = result_c.sort_values('Manhattan distance to Russia')
result_c
Draw the conclusion from result_c as below:
The nearest 3 neighbors to Russia are USA, UK and Argentina. The CPI value returned by the model is the average CPI score of these 3 neighbors, which is:
nearest3CPINorm = result_c.iloc[0:3,2].tolist()
print('CPI for the first 3 neighbors:', nearest3CPINorm)
avgnearest3CPINorm = np.round((sum(nearest3CPINorm)/3), 4)
print('Average CPI for the first 3 neighbors:', avgnearest3CPINorm)
What value would a weighted k-NN prediction model (with k = 16, i.e., the full dataset, and a weighting scheme of the reciprocal of the squared Manhattan distance between the neighbor and the query) return for the CPI of Russia when applied to the range-normalized data?
Calculate the weight array, nW, which is the reciprocal of the squared Manhattan distance, after normalization, between each neighbor country and the query, Russia.
nW = 1 / (nM)**2
nW
Calculate the weights multiplied by the instance target values (i.e. CPI) and store the result as our nProduct array.
nProduct =nW*oTarget.values
nProduct
Construct a dataframe, result_d, which shows the Manhattan distance after normalization for each country and its corresponding target feature (i.e. CPI), weight and weight x CPI. Please note, the target feature is not normalized.
data_d = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': nM.tolist(), 'CPI' : oTarget.tolist(), 'Weight' : nW.tolist(), 'Weight x CPI': nProduct.tolist()}
result_d = pd.DataFrame(data_d)
Sort result_d based on the Manhattan distance in ascending order.
result_d = result_d.sort_values('Manhattan distance to Russia')
result_d
Draw the conclusion from result_d as below:
Since we are using k = 16, the value returned by the model is the sum of the instance weights multiplied by the instance target values, divided by the sum of the instance weights over all instances:
weightNorm16 = np.round(sum(result_d['Weight x CPI']) / sum(result_d['Weight']),4)
print('Weight CPI using full set of data k = 16:', weightNorm16)
The actual 2011 CPI for Russia was 2.4488. Which of the predictions made was the most accurate? Why do you think this was?
The closest prediction was the 3-nearest neighbor prediction based on the original (unnormalized) data. Because the data ranges in the dataset are so different, normalization would normally give an unbiased distance between the training instances and the query (the ranking of instances would not be dominated by the features with larger values) and would therefore be expected to produce the most sensible CPI. If we further examine the data according to the ranking of Manhattan distance against the CPI, we find that for k = 3 the distance ranking of Russia against the original data set happens to be slightly better aligned with a CPI close to 2.4488 than the ranking against the normalized dataset. Overall, however, there is no clear relationship between distance ranking and the CPI index in either dataset for this particular instance (the results in parts b and d show this clearly when we predict the CPI using the full dataset). Normalizing the data and multiplying the weights with the CPI only amplifies this non-linear relationship for this particular query.
It would be more logical to have a few more query (or testing) instances to justify which model works best for this scenario (we could also assess the relevance of each macro-economic feature to the CPI, rather than relying on one particular instance). We could then fine-tune the choice of distance metric and the value of k to maximize the accuracy of predicting the CPI index for unseen countries.
We construct the dataframes below to show all the features in both datasets, together with the ranking of Manhattan distance, for further reference.
targetDf = pd.DataFrame(oTarget.tolist(), columns=['CPI'])
DistRank = pd.DataFrame(np.arange(1,17), columns=['Rank <lower indicates closer>'])
nMDf = pd.DataFrame(nM.tolist(), columns=['Manhattan distance to Russia'])
nRankDf = pd.concat([nCorrDf,nMDf, targetDf ], axis=1)
nRankDf = nRankDf.sort_values('Manhattan distance to Russia').reset_index(drop=True)
nRankDf = pd.concat([nRankDf, DistRank ], axis=1)
nRankDf = nRankDf.sort_values('CPI').reset_index(drop=True)
MDf = pd.DataFrame(M.tolist(), columns=['Manhattan distance to Russia'])
oRankDf = pd.concat([oCorr, MDf, targetDf ], axis=1)
oRankDf = oRankDf.sort_values('Manhattan distance to Russia').reset_index(drop=True)
oRankDf = pd.concat([oRankDf, DistRank ], axis=1)
oRankDf = oRankDf.sort_values('CPI').reset_index(drop=True)
Original Data with Distance ranking and CPI in ascending order
oRankDf
Normalized Data with Distance ranking and CPI in ascending order
nRankDf