Please refer to the Jupyter NBExtension Readme page to display the table of contents in a floating window and to expand the body of the contents.
Data preprocessing is a critical component of machine learning, and its importance cannot be overstated.
For this question, we perform all data preprocessing steps on a dataset from the UCI ML Datasets Repository so that the clean dataset can be fed directly into any classification algorithm within the Scikit-Learn Python module without further changes.
This dataset is the Credit Approval data at the following address:
https://archive.ics.uci.edu/ml/datasets/Credit+Approval
The UCI Repository provides four files for this dataset, but only two of them are relevant: the data file (Assignment1_Q1_crx.data) and the names file (Assignment1_Q1_crx.names).
Import the required modules: numpy, pandas, matplotlib and statistics.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics as stat
%matplotlib inline
Read in Assignment1_Q1_crx.data and assign the column names as described in the file Assignment1_Q1_crx.names.
crx_columns = list(map(str, range(1, 17)))
crx_columns = ['A' + sub for sub in crx_columns]
Data = pd.read_csv("Assignment1_Q1_crx.data", names=crx_columns, index_col=False, header=None)
Display the first 5 rows in the dataset
Data.head()
print(Data.shape)
print(Data.dtypes)
According to the file Assignment1_Q1_crx.names, features A2 and A14 should be numeric and continuous. Convert these 2 features to numeric and transform the non-numeric values into NaN.
Data['A2']=Data['A2'].apply(pd.to_numeric, errors='coerce')
Data['A14']=Data['A14'].apply(pd.to_numeric, errors='coerce')
Check whether there are any null values in the dataset. NaN values will also return True with the isnull() function.
Data.isnull().sum()
Null or NaN values are found only in features A2 and A14. Locate the rows with NaN or null values in these 2 features. For feature A2, the first row with a NaN value is at index 83; for feature A14, the first row with a NaN value is at index 71.
Data.loc[Data['A2'].isnull()]
Data.loc[Data['A14'].isnull()]
Since features A2 and A14 are both numeric, we impute the NaN values in these columns with their respective medians as requested.
print("Median for column A2:", round(np.nanmedian(Data['A2']),2))
print("Median for column A14:", round(np.nanmedian(Data['A14']),2))
Data.loc[Data['A2'].isnull(), 'A2']= round(np.nanmedian(Data['A2']),2)
Data.loc[Data['A14'].isnull(), 'A14']= round(np.nanmedian(Data['A14']),2)
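Equivalently, the same imputation could be done with fillna; this is a minimal sketch that matches the .loc assignment above (including rounding the median to 2 decimal places), shown for reference only.
# equivalent one-liner imputation with the rounded median (no-op if already imputed above)
Data['A2'] = Data['A2'].fillna(round(Data['A2'].median(), 2))
Data['A14'] = Data['A14'].fillna(round(Data['A14'].median(), 2))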
Sanity-check whether the median values have been imputed at the indices identified for features A2 and A14 as described in step 7.
print("First Imputed values in column A2:", Data.loc[83,'A2'])
print("First Imputed values in column A14:", Data.loc[71,'A14'])
Generate summary statistics for all the numerical columns and a boxplot for each numerical feature to better understand the distribution of the data.
Data.describe(include=np.number).round(2)
numerical_cols = Data.select_dtypes(include=['float64', 'int64']).columns.tolist()
for col in numerical_cols:
    print(f"Column Name: {col}:")
    plt.boxplot(Data[col])
    plt.show()
From the above boxplots, we find that the data are not normally distributed for any of the features. The range is particularly large for features A14 and A15. As this is an anonymised dataset, we have limited domain knowledge and no reason to believe that IQR-based outlier detection is applicable. Thus, all the values in the dataset might be perfectly valid, and we do not flag any numerical values as outliers regardless of their magnitude.
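For reference only, if IQR-based outlier detection were deemed applicable, a minimal sketch that flags (without removing) values outside 1.5 x IQR would look like this; we do not apply it here.
# illustrative only: count values outside the 1.5*IQR whiskers per numerical column
for col in numerical_cols:
    q1, q3 = Data[col].quantile(0.25), Data[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_flagged = ((Data[col] < lower) | (Data[col] > upper)).sum()
    print(f'{col}: {n_flagged} values outside [{lower:.2f}, {upper:.2f}]')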
Generate the summary statistics and list the value counts for each categorical feature.
Data.describe(include=object)
categorical_cols = Data.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    print(f'Before Transformation {col} -', Data[col].unique())
    print(Data[col].value_counts())
The value '?' is found in some of the categorical features. Replace it with the mode value of the corresponding feature, then check the unique values and their counts for each feature to verify that the mode values have been imputed.
categorical_cols = Data.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    Data.loc[(Data[col] == '?'), col] = stat.mode(Data[col])
    print(f'After Transformation {col} -', Data[col].unique())
    print(Data[col].value_counts())
From the output of the previous step, we can see that columns A4 and A5 have the same number of unique values. Check whether each unique value in A4 maps to a corresponding value in A5.
Data.groupby(['A4', 'A5']).size()
The above output shows that columns A4 and A5 might carry the same information: whenever A4 has the value l, A5 has the value gg, and so on. However, the file Assignment1_Q1_crx.names tells us that A4 can take 4 values while A5 can only take 3, as below:
A4: u, y, l, t.
A5: g, p, gg.
Even though we could simplify our dataset by removing a column with redundant information, we have no domain knowledge of this anonymised dataset, so it is safer to keep both columns. We keep this finding in mind and can revisit the cleaning approach once we understand the data better.
For the numerical descriptive feature A2, discretize it via equal-frequency binning with 3 bins named "low", "medium", and "high", and then apply integer encoding.
Data['A2'] = pd.qcut(Data['A2'], q=3, labels=['low', 'medium', 'high'])
level_mapping = {'low': 0, 'medium': 1, 'high': 2}
Data['A2'] = Data['A2'].replace(level_mapping)
Data['A2'].value_counts()
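Depending on the pandas version, the replace call above may leave A2 with a categorical dtype. If one wants to guarantee plain integers before the later one-hot encoding step, the replace line could be swapped for an explicit map; this is a sketch of an alternative, not an additional step (do not run both, since map expects the original string labels).
# alternative to the replace call above: always yields plain integers
# Data['A2'] = Data['A2'].map(level_mapping).astype(int)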
Remove feature A16, as it is the class attribute in the dataset, and call it target. The rest of the features will be the descriptive features in our dataset.
target = Data['A16']
Data = Data.drop(columns='A16')
target.value_counts()
Label-encode the target feature so that the positive class is encoded as 1 and the negative class as 0. Confirm the correctness of our label-encoding by checking the value counts.
target = target.replace({'-': 0, '+': 1})
target.value_counts()
As all the categorical features appear to be nominal, perform one-hot encoding for all the descriptive categorical features and call the encoded data frame Data_encoded. If a categorical descriptive feature has only 2 levels, encode it with only one binary variable. For the other categorical features (with more than 2 levels), use regular one-hot encoding (where the number of binary variables equals the number of distinct levels).
Data_encoded = Data.copy() # retain original Data without encoding for further analysis
print(Data_encoded.columns)
# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
categorical_cols = Data_encoded.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    q = len(Data_encoded[col].unique())
    if q == 2:
        Data_encoded[col] = pd.get_dummies(Data_encoded[col], drop_first=True)
# for other categorical features (with > 2 levels),
# perform regular one-hot-encoding using pd.get_dummies()
# if a feature is numeric, it will be untouched
Data_encoded = pd.get_dummies(Data_encoded)
print(Data_encoded.columns)
Check the data types of all the descriptive features. Once they are all numerical, the dataset is almost ready to be fed into the Scikit-Learn module.
Data_encoded.dtypes
Normalize the descriptive features using standard scaling (StandardScaler from the preprocessing submodule of Scikit-Learn), and call the result Data_encoded_norm_numpy. We keep Data_encoded around to keep track of the column names.
from sklearn import preprocessing
Data_scaler = preprocessing.StandardScaler()
Data_encoded_norm_numpy = Data_scaler.fit_transform(Data_encoded)
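If strict range (min-max) normalization were preferred instead of standard scaling, a minimal sketch with MinMaxScaler would be as follows; we do not use this variant here, and the rest of the notebook continues with the StandardScaler output.
# alternative: min-max (range) normalization to [0, 1]
range_scaler = preprocessing.MinMaxScaler()
Data_encoded_range_numpy = range_scaler.fit_transform(Data_encoded)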
As Data_encoded_norm_numpy is a NumPy array, all the column names are lost. Define a new Pandas data frame called Data_encoded_norm_df from Data_encoded_norm_numpy with the column names of Data_encoded. Finally, get the shape and a description of Data_encoded_norm_df with the include='all' option.
Data_encoded_norm_df = pd.DataFrame(Data_encoded_norm_numpy,
columns=Data_encoded.columns)
print(f'Shape of Data_encoded_norm_df is {Data_encoded_norm_df.shape}\n')
Data_encoded_norm_df.describe(include='all').round(3)
Sanity-check some categorical features: there should be only 3 unique values for feature A2 and 2 unique values for feature A7_h.
Data_encoded_norm_df['A2'].value_counts()
Data_encoded_norm_df['A7_h'].value_counts()
Define a new data frame called df_clean, which is the combination of the normalized and scaled descriptive features and the target feature, with the target feature as the last column. Ensure the target feature is a NumPy array by calling .values.
df_clean = Data_encoded_norm_df.assign(target = target.values)
Generate the summary output of the final data frame df_clean.
pd.set_option('display.max_columns', 50)
df_clean.shape
df_clean.describe(include='all').round(3)
df_clean.head(5)
Write the final dataset df_clean to a CSV file called df_clean.csv.
# set index to False so that row IDs are not written
df_clean.to_csv('df_clean.csv', index=False)
Using the KNN algorithm with Manhattan distance, predict the level of corruption in Russia based on a range of macro-economic and social features of a list of given countries. The CPI index measures the perceived level of corruption in the public sector of a country and ranges from 0 (highly corrupt) to 10 (very clean).
Read the Asignment1_Q2.csv file into a dataframe oCorr, which stands for "original Corruption".
oCorr = pd.read_csv("Asignment1_Q2.csv", index_col=False, header=0)
oCorr
oCorr.dtypes
Since CPI should be a numeric value, convert it to numeric.
oCorr['CPI']=oCorr['CPI'].apply(pd.to_numeric, errors='coerce')
The last row in the dataset is our query row, for which we want to predict the CPI index. Extract the last row into a new dataframe, q, and remove it from the dataset oCorr; oCorr is now our training data set. Please note, the information for Afghanistan is located at index 0, etc.
q = oCorr.iloc[-1,:].to_frame()
q=q.T.reset_index(drop=True)
q
oCorr = oCorr.iloc[:-1,:]
oCorr
As CPI is our target feature, extract it into a separate dataframe, oTarget, and drop this feature from the oCorr and q dataframes.
oTarget = oCorr['CPI']
oCorr = oCorr.drop(columns='CPI')
q = q.drop(columns='CPI')
Transform the training and query dataframes into 2-dimensional NumPy arrays (excluding the COUNTRY_ID column) for later calculations.
OCorrMatrix = oCorr.iloc[:, 1:].values
qMatrix = q.iloc[:, 1:].values
type(OCorrMatrix)
OCorrMatrix
qMatrix
What value would a 3-nearest neighbor prediction model using Manhattan distance return for the CPI of Russia?
Define a function, calManhattan, which calculates the Manhattan distance from each row of the descriptive matrix to the target (query) matrix. There is only one row in the target matrix, with index 0. The calculated distances are returned as a 1-dimensional NumPy array.
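For reference, the Manhattan (L1) distance between a training row $\mathbf{a}$ and the query row $\mathbf{b}$ is
$$d(\mathbf{a}, \mathbf{b}) = \sum_{i} |a_i - b_i|$$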
def calManhattan(descMatrix, targetMatrix):
    M_dist = np.zeros(len(descMatrix))
    for i in range(len(descMatrix)):
        M_dist[i] = sum(np.abs(descMatrix[i, :] - targetMatrix[0, :]))
    return M_dist
Pass the training and query matrices to the calManhattan function to obtain the Manhattan distance for each row, and store the result in the NumPy array M. Please note, the first cell, with index 0, is the Manhattan distance between Afghanistan and Russia, etc.
M = calManhattan(OCorrMatrix, qMatrix)
M
Construct a dataframe, result_a, which shows the Manhattan distance for each country and its corresponding target feature.
data_a = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': M.tolist(), 'CPI' : oTarget.tolist()}
result_a = pd.DataFrame(data_a)
Sort result_a based on the Manhattan distance in ascending order.
result_a = result_a.sort_values('Manhattan distance to Russia')
result_a
Draw the conclusion from result_a as below:
The nearest 3 neighbors to Russia are Argentina, USA and China. The CPI value returned by the model is the average CPI score of these 3 neighbors, which is:
nearest3CPI = result_a.iloc[0:3,2].tolist()
print('CPI for the first 3 neighbors:', nearest3CPI)
avgnearest3CPI = np.round((sum(nearest3CPI)/3), 4)
print('Average CPI for the first 3 neighbors:', avgnearest3CPI)
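As an optional sanity check (not part of the required manual calculation), the same 3-NN prediction could be reproduced with Scikit-Learn's KNeighborsRegressor using the Manhattan metric; a minimal sketch, assuming the descriptive columns in OCorrMatrix are all numeric:
# optional cross-check of the manual 3-NN calculation
from sklearn.neighbors import KNeighborsRegressor
knn3 = KNeighborsRegressor(n_neighbors=3, metric='manhattan')
knn3.fit(OCorrMatrix, oTarget.values)
print('3-NN prediction for Russia:', np.round(knn3.predict(qMatrix)[0], 4))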
What value would a weighted k-NN prediction model return for the CPI of Russia? Use k = 16 (i.e., the full dataset) and a weighting scheme of the reciprocal of the squared Manhattan distance between the neighbor and the query.
Calculate the weight array, W, which is the reciprocal of the squared Manhattan distance between each neighbor country and the query, Russia.
W = 1 / (M)**2
W
Calculate the weights multiplied by the instance target values (i.e. CPI) and store the result as our product array.
product = W*oTarget.values
Construct a dataframe, result_b, which shows the Manhattan distance for each country and its corresponding target feature (i.e. CPI), weight and weight x CPI.
data_b = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': M.tolist(), 'CPI' : oTarget.tolist(), 'Weight' : W.tolist(), 'Weight x CPI': product.tolist()}
result_b = pd.DataFrame(data_b)
Sort result_b based on the Manhattan distance in ascending order.
result_b = result_b.sort_values('Manhattan distance to Russia')
result_b
Draw the conclusion from result_b as below:
Since we are using k = 16, the value returned by the model is the sum of the instance weights multiplied by the instance target values, divided by the sum of the instance weights over all instances:
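In formula form, with $d_i$ the Manhattan distance of neighbor $i$ to the query and weights $w_i = 1 / d_i^2$, the prediction is
$$\widehat{\text{CPI}}_q = \frac{\sum_{i=1}^{k} w_i \, \text{CPI}_i}{\sum_{i=1}^{k} w_i}$$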
weight16 = np.round(sum(result_b['Weight x CPI']) / sum(result_b['Weight']),4)
print('Weight CPI using full set of data k = 16:', weight16)
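Similarly to part a, this weighted prediction could be cross-checked with KNeighborsRegressor by passing a callable weighting function; a sketch only, assuming no neighbor is at zero distance from the query:
# optional cross-check: weighted k-NN over the full training set (k = 16)
from sklearn.neighbors import KNeighborsRegressor
knn16 = KNeighborsRegressor(n_neighbors=16, metric='manhattan',
                            weights=lambda d: 1 / d**2)
knn16.fit(OCorrMatrix, oTarget.values)
print('Weighted 16-NN prediction for Russia:', np.round(knn16.predict(qMatrix)[0], 4))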
The descriptive features in this dataset are of different types. For example, some are percentages, others are measured in years, and others are measured in counts per 1,000. We should always consider normalizing our data, but it is particularly important when the descriptive features are measured in different units. What value would a 3-nearest neighbor prediction model using Manhattan distance return for the CPI of Russia when the descriptive features have been normalized using range normalization?
Define a function, normal2DByRow, which takes a 2-dimensional descMatrix as a parameter and returns this matrix after min-max normalization (applied column by column, i.e. per feature).
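For each feature (column) $x$, min-max (range) normalization maps every value into $[0, 1]$:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$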
def normal2DByRow(descMatrix):
    norm = np.zeros((len(descMatrix), descMatrix.shape[1]))
    for i in range(descMatrix.shape[1]):
        minVal = min(descMatrix[:, i])
        maxVal = max(descMatrix[:, i])
        for j in range(len(descMatrix)):
            norm[j, i] = (descMatrix[j, i] - minVal) / (maxVal - minVal)
    return norm
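The explicit loops above are kept for clarity; an equivalent vectorized sketch (the same column-wise min-max normalization, using NumPy broadcasting, with nCorrMatrix_vec as an illustrative name) would be:
# vectorized column-wise min-max normalization, equivalent to normal2DByRow
col_min = OCorrMatrix.min(axis=0)
col_max = OCorrMatrix.max(axis=0)
nCorrMatrix_vec = (OCorrMatrix - col_min) / (col_max - col_min)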
Pass the training matrix OCorrMatrix to the normal2DByRow function to obtain nCorrMatrix.
nCorrMatrix=normal2DByRow(OCorrMatrix)
nCorrMatrix
Define a function, normalTFromDByRow, which takes two 2-dimensional matrices, descMatrix and targetMatrix, as parameters. descMatrix can have any number of rows and columns, while targetMatrix must have exactly 1 row and the same number of columns as descMatrix. We then normalize targetMatrix based on the column-wise minimum and maximum values of descMatrix. This function returns the normalized targetMatrix.
def normalTFromDByRow(descMatrix, targetMatrix):
    norm = np.zeros((1, targetMatrix.shape[1]))
    for i in range(descMatrix.shape[1]):
        minVal = min(descMatrix[:, i])
        maxVal = max(descMatrix[:, i])
        norm[0, i] = (targetMatrix[0, i] - minVal) / (maxVal - minVal)
    return norm
Pass the training matrix OCorrMatrix and the query matrix qMatrix to the normalTFromDByRow function to obtain nQMatrix.
nQMatrix = normalTFromDByRow(OCorrMatrix,qMatrix)
nQMatrix
Construct a dataframe, nCorrDf, to show the normalized descriptive features with headings.
nCorrDf = pd.DataFrame(data=nCorrMatrix, columns=oCorr.columns[1:])
nCorrDf = pd.concat([oCorr['COUNTRY_ID'].to_frame(), nCorrDf], axis=1)
nCorrDf
Construct a dataframe, nQdf, to show the normalized query with headings.
nQdf = pd.DataFrame(data=nQMatrix, columns=q.columns[1:])
qCountry = q['COUNTRY_ID'].reset_index(drop=True).to_frame()
nQdf = pd.concat([qCountry, nQdf], axis=1)
nQdf
Pass the normalized training matrix (nCorrMatrix) and query matrix (nQMatrix) to the calManhattan function to obtain the Manhattan distance for each row, and store the result in the NumPy array nM. Please note, the first cell, with index 0, is the Manhattan distance between Afghanistan and Russia, etc.
nM = calManhattan(nCorrMatrix, nQMatrix)
nM
Construct a dataframe, result_c, which shows the Manhattan distance after normalization for each country and its corresponding target feature. Please note, the target feature is not normalized.
data_c = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': nM.tolist(), 'CPI' : oTarget.tolist()}
result_c = pd.DataFrame(data_c)
result_c = result_c.sort_values('Manhattan distance to Russia')
result_c
Draw the conclusion from result_c as below:
The nearest 3 neighbors to Russia are USA, UK and Argentina. The CPI value returned by the model is the average CPI score of these 3 neighbors, which is:
nearest3CPINorm = result_c.iloc[0:3,2].tolist()
print('CPI for the first 3 neighbors:', nearest3CPINorm)
avgnearest3CPINorm = np.round((sum(nearest3CPINorm)/3), 4)
print('Average CPI for the first 3 neighbors:', avgnearest3CPINorm)
What value would a weighted k-NN prediction model (with k = 16, i.e., the full dataset, and a weighting scheme of the reciprocal of the squared Manhattan distance between the neighbor and the query) return for the CPI of Russia when applied to the range-normalized data?
Calculate the weight array, nW, which is the reciprocal of the squared Manhattan distance, after normalization, between each neighbor country and the query, Russia.
nW = 1 / (nM)**2
nW
Calculate the weights multiplied by the instance target values (i.e. CPI) and store the result as our nProduct array.
nProduct =nW*oTarget.values
nProduct
Construct a dataframe, result_d, which shows the Manhattan distance after normalization for each country and its corresponding target feature (i.e. CPI), weight and weight x CPI. Please note, the target feature is not normalized.
data_d = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': nM.tolist(), 'CPI' : oTarget.tolist(), 'Weight' : nW.tolist(), 'Weight x CPI': nProduct.tolist()}
result_d = pd.DataFrame(data_d)
Sort result_d based on the Manhattan distance in ascending order.
result_d = result_d.sort_values('Manhattan distance to Russia')
result_d
Draw the conclusion from result_d as below:
Since we are using k = 16, the value returned by the model is the sum of the instance weights multiplied by the instance target values, divided by the sum of the instance weights over all instances:
weightNorm16 = np.round(sum(result_d['Weight x CPI']) / sum(result_d['Weight']),4)
print('Weight CPI using full set of data k = 16:', weightNorm16)
The actual 2011 CPI for Russia was 2.4488. Which of the predictions made was the most accurate? Why do you think this was?
The closest prediction was the 3-nearest neighbor prediction based on the original (unnormalized) data. Because the data ranges in the dataset are so different, normalization would normally give an unbiased distance between the training instances and the query (the ranking of instances would not be dominated by the features with larger values) and would therefore be expected to produce the most sensible CPI. If we further examine the data according to the ranking of Manhattan distance against the CPI, we find that for k = 3 the distance ranking of Russia against the original data set happens to be slightly better aligned with a CPI close to 2.4488 than the ranking against the normalized dataset. Overall, however, there is no clear relationship between distance ranking and the CPI index in either dataset for this particular instance (the results in parts b and d show this clearly when we predict the CPI using the full dataset). Normalizing the data and multiplying the weights with the CPI only amplifies this non-linear relationship for this particular query.
It would be more logical to have a few more query (or testing) instances to justify which model works best for this scenario (we could also assess the relevance of each macro-economic feature to the CPI, rather than relying on one particular instance). We could then fine-tune the choice of distance metric and the value of k to maximize the accuracy of predicting the CPI index for unseen countries.
We construct the dataframes below to show all the features in both datasets, together with the ranking of Manhattan distance, for further reference.
targetDf = pd.DataFrame(oTarget.tolist(), columns=['CPI'])
DistRank = pd.DataFrame(np.arange(1,17), columns=['Rank <lower indicates closer>'])
nMDf = pd.DataFrame(nM.tolist(), columns=['Manhattan distance to Russia'])
nRankDf = pd.concat([nCorrDf,nMDf, targetDf ], axis=1)
nRankDf = nRankDf.sort_values('Manhattan distance to Russia').reset_index(drop=True)
nRankDf = pd.concat([nRankDf, DistRank ], axis=1)
nRankDf = nRankDf.sort_values('CPI').reset_index(drop=True)
MDf = pd.DataFrame(M.tolist(), columns=['Manhattan distance to Russia'])
oRankDf = pd.concat([oCorr, MDf, targetDf ], axis=1)
oRankDf = oRankDf.sort_values('Manhattan distance to Russia').reset_index(drop=True)
oRankDf = pd.concat([oRankDf, DistRank ], axis=1)
oRankDf = oRankDf.sort_values('CPI').reset_index(drop=True)
Original Data with Distance ranking and CPI in ascending order
oRankDf
Normalized Data with Distance ranking and CPI in ascending order
nRankDf