Please refer to the Jupyter NBExtension Readme page for instructions on displaying
the table of contents in a floating window and expanding the body of the contents.

Assignment 1

Question 1 - Data Preprocessing

Data preprocessing is a critical component in machine learning and its importance cannot be overstated: no matter how sophisticated the model, predictions built on improperly cleaned data will still be incorrect.

For this question, we need to perform all data preprocessing steps on a dataset from the UCI ML Datasets Repository so that the clean dataset can be fed directly into any classification algorithm within the Scikit-Learn Python module without any further changes.

This dataset is the Credit Approval data at the following address:

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

The UCI Repository provides four files for this dataset, but only two of them are relevant:

  • Assignment1_Q1_crx.names: Some basic info on the dataset together with the feature names & values
  • Assignment1_Q1_crx.data: The actual data in comma-separated format

1. Import all necessary packages:

numpy, pandas, matplotlib and statistics

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics as stat

%matplotlib inline 

2. Read in the given data file

Read in Assignment1_Q1_crx.data and assign the column names as described in the file Assignment1_Q1_crx.names

In [2]:
crx_columns = ['A' + str(i) for i in range(1, 17)]
Data = pd.read_csv("Assignment1_Q1_crx.data", names=crx_columns, index_col=False, header=None)

3. Show the read-in file

Display the first 5 rows in the dataset

In [3]:
Data.head()
Out[3]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.000 u g w v 1.25 t t 1 f g 00202 0 +
1 a 58.67 4.460 u g q h 3.04 t t 6 f g 00043 560 +
2 a 24.50 0.500 u g q h 1.50 t f 0 f g 00280 824 +
3 b 27.83 1.540 u g w v 3.75 t t 5 t g 00100 3 +
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 00120 0 +

4. Check dimension and data types

In [4]:
print(Data.shape)
print(Data.dtypes)
(690, 16)
A1      object
A2      object
A3     float64
A4      object
A5      object
A6      object
A7      object
A8     float64
A9      object
A10     object
A11      int64
A12     object
A13     object
A14     object
A15      int64
A16     object
dtype: object

5. Handling Numerical Descriptive features

According to the file Assignment1_Q1_crx.names, features A2 and A14 should be numeric and continuous. Convert these 2 features to numeric, transforming any non-numeric values into NaN.

In [5]:
Data['A2']=Data['A2'].apply(pd.to_numeric, errors='coerce')
Data['A14']=Data['A14'].apply(pd.to_numeric, errors='coerce')

Check whether there are any null values in the dataset. NaN values will also return True with the isnull() function.

In [6]:
Data.isnull().sum()
Out[6]:
A1      0
A2     12
A3      0
A4      0
A5      0
A6      0
A7      0
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64

Only features A2 and A14 contain null (NaN) values. Locate the rows in these 2 features with NaN values: for feature A2 the first NaN appears at index 83, and for feature A14 at index 71.

In [7]:
Data.loc[Data['A2'].isnull()]
Out[7]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
83 a NaN 3.500 u g d v 3.000 t f 0 t g 300.0 0 -
86 b NaN 0.375 u g d v 0.875 t f 0 t s 928.0 0 -
92 b NaN 5.000 y p aa v 8.500 t f 0 f g 0.0 0 -
97 b NaN 0.500 u g c bb 0.835 t f 0 t s 320.0 0 -
254 b NaN 0.625 u g k v 0.250 f f 0 f g 380.0 2010 -
286 a NaN 1.500 u g ff ff 0.000 f t 2 t g 200.0 105 -
329 b NaN 4.000 y p i v 0.085 f f 0 t g 411.0 0 -
445 a NaN 11.250 u g ff ff 0.000 f f 0 f g NaN 5200 -
450 b NaN 3.000 y p i bb 7.000 f f 0 f g 0.0 1 -
500 b NaN 4.000 u g x v 5.000 t t 3 t g 290.0 2279 +
515 b NaN 10.500 u g x v 6.500 t f 0 f g 0.0 0 +
608 b NaN 0.040 y p d v 4.250 f f 0 t g 460.0 0 -
In [8]:
Data.loc[Data['A14'].isnull()]
Out[8]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
71 b 34.83 4.000 u g d bb 12.500 t f 0 t g NaN 0 -
202 b 24.83 2.750 u g c v 2.250 t t 6 f g NaN 600 +
206 a 71.58 0.000 ? ? ? ? 0.000 f f 0 f p NaN 0 +
243 a 18.75 7.500 u g q v 2.710 t t 5 f g NaN 26726 +
270 b 37.58 0.000 ? ? ? ? 0.000 f f 0 f p NaN 0 +
278 b 24.58 13.500 y p ff ff 0.000 f f 0 f g NaN 0 -
330 b 20.42 0.000 ? ? ? ? 0.000 f f 0 f p NaN 0 -
406 a 40.33 8.125 y p k v 0.165 f t 2 f g NaN 18 -
445 a NaN 11.250 u g ff ff 0.000 f f 0 f g NaN 5200 -
456 b 34.58 0.000 ? ? ? ? 0.000 f f 0 f p NaN 0 -
592 b 23.17 0.000 ? ? ? ? 0.000 f f 0 f p NaN 0 +
622 a 25.58 0.000 ? ? ? ? 0.000 f f 0 f p NaN 0 +
626 b 22.00 7.835 y p i bb 0.165 f f 0 t g NaN 0 -

Since features A2 and A14 are both numeric, we impute the NaN values in these columns with their respective medians, as requested.

In [9]:
print("Median for column A2:", round(np.nanmedian(Data['A2']),2))
print("Median for column A14:", round(np.nanmedian(Data['A14']),2))
Data.loc[Data['A2'].isnull(), 'A2']= round(np.nanmedian(Data['A2']),2)
Data.loc[Data['A14'].isnull(), 'A14']= round(np.nanmedian(Data['A14']),2)
Median for column A2: 28.46
Median for column A14: 160.0

Sanity-check that the median values have been imputed for the rows identified above (index 83 for feature A2 and index 71 for feature A14).

In [10]:
print("First Imputed values in column A2:", Data.loc[83,'A2'])
print("First Imputed values in column A14:", Data.loc[71,'A14'])
First Imputed values in column A2: 28.46
First Imputed values in column A14: 160.0

Generate summary statistics for all the numerical columns and a boxplot for each numerical feature to better understand the distribution of the data.

In [11]:
Data.describe(include=np.number).round(2)
Out[11]:
A2 A3 A8 A11 A14 A15
count 690.00 690.00 690.00 690.00 690.00 690.00
mean 31.51 4.76 2.22 2.40 183.56 1017.39
std 11.86 4.98 3.35 4.86 172.19 5210.10
min 13.75 0.00 0.00 0.00 0.00 0.00
25% 22.67 1.00 0.16 0.00 80.00 0.00
50% 28.46 2.75 1.00 0.00 160.00 5.00
75% 37.71 7.21 2.62 3.00 272.00 395.50
max 80.25 28.00 28.50 67.00 2000.00 100000.00
In [12]:
numerical_cols = Data.select_dtypes(include=['float64', 'int64']).columns.tolist()
for col in numerical_cols:
    print(f"Column Name: {col}:")
    plt.boxplot(Data[col])
    plt.show()
Column Name: A2:
Column Name: A3:
Column Name: A8:
Column Name: A11:
Column Name: A14:
Column Name: A15:

From the above boxplots, we find that the data are not normally distributed for any of the features. The range is particularly large for features A14 and A15. As this is an anonymised dataset, we have limited domain knowledge and no reason to believe that IQR-based outlier detection is applicable. Thus, all the values in the dataset might be perfectly valid, and we do not flag any numerical values as outliers regardless of their magnitude.
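
For reference, a minimal sketch of what IQR-based flagging would look like had we chosen to apply it (the 1.5 * IQR thresholds are a common convention, not something required here):

# Count how many values per numerical column would be flagged under a
# 1.5 * IQR rule; we do NOT act on these counts, for the reasons above.
for col in numerical_cols:
    q1, q3 = Data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outlier_mask = (Data[col] < q1 - 1.5 * iqr) | (Data[col] > q3 + 1.5 * iqr)
    print(col, outlier_mask.sum())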

6. Handling Categorical Descriptive features

Generate the summary statistics and list the value counts for each categorical feature.

In [13]:
Data.describe(include=object)
Out[13]:
A1 A4 A5 A6 A7 A9 A10 A12 A13 A16
count 690 690 690 690 690 690 690 690 690 690
unique 3 4 4 15 10 2 2 2 3 2
top b u g c v t f f g -
freq 468 519 519 137 399 361 395 374 625 383
In [14]:
categorical_cols = Data.columns[Data.dtypes==object].tolist()
for col in categorical_cols:    
    print(f'Before Transformation {col} -', Data[col].unique())
    print(Data[col].value_counts())
Before Transformation A1 - ['b' 'a' '?']
b    468
a    210
?     12
Name: A1, dtype: int64
Before Transformation A4 - ['u' 'y' '?' 'l']
u    519
y    163
?      6
l      2
Name: A4, dtype: int64
Before Transformation A5 - ['g' 'p' '?' 'gg']
g     519
p     163
?       6
gg      2
Name: A5, dtype: int64
Before Transformation A6 - ['w' 'q' 'm' 'r' 'cc' 'k' 'c' 'd' 'x' 'i' 'e' 'aa' 'ff' 'j' '?']
c     137
q      78
w      64
i      59
aa     54
ff     53
k      51
cc     41
m      38
x      38
d      30
e      25
j      10
?       9
r       3
Name: A6, dtype: int64
Before Transformation A7 - ['v' 'h' 'bb' 'ff' 'j' 'z' '?' 'o' 'dd' 'n']
v     399
h     138
bb     59
ff     57
?       9
j       8
z       8
dd      6
n       4
o       2
Name: A7, dtype: int64
Before Transformation A9 - ['t' 'f']
t    361
f    329
Name: A9, dtype: int64
Before Transformation A10 - ['t' 'f']
f    395
t    295
Name: A10, dtype: int64
Before Transformation A12 - ['f' 't']
f    374
t    316
Name: A12, dtype: int64
Before Transformation A13 - ['g' 's' 'p']
g    625
s     57
p      8
Name: A13, dtype: int64
Before Transformation A16 - ['+' '-']
-    383
+    307
Name: A16, dtype: int64

'?' is found in some of the categorical features. Replace it with the mode value of that feature, then check the unique values and their corresponding counts of each feature to verify that the mode values have been imputed.

In [15]:
categorical_cols = Data.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    Data.loc[(Data[col]=='?'), col]= stat.mode(Data[col])
    print(f'After Transformation {col} -', Data[col].unique())
    print(Data[col].value_counts())
After Transformation A1 - ['b' 'a']
b    480
a    210
Name: A1, dtype: int64
After Transformation A4 - ['u' 'y' 'l']
u    525
y    163
l      2
Name: A4, dtype: int64
After Transformation A5 - ['g' 'p' 'gg']
g     525
p     163
gg      2
Name: A5, dtype: int64
After Transformation A6 - ['w' 'q' 'm' 'r' 'cc' 'k' 'c' 'd' 'x' 'i' 'e' 'aa' 'ff' 'j']
c     146
q      78
w      64
i      59
aa     54
ff     53
k      51
cc     41
m      38
x      38
d      30
e      25
j      10
r       3
Name: A6, dtype: int64
After Transformation A7 - ['v' 'h' 'bb' 'ff' 'j' 'z' 'o' 'dd' 'n']
v     408
h     138
bb     59
ff     57
j       8
z       8
dd      6
n       4
o       2
Name: A7, dtype: int64
After Transformation A9 - ['t' 'f']
t    361
f    329
Name: A9, dtype: int64
After Transformation A10 - ['t' 'f']
f    395
t    295
Name: A10, dtype: int64
After Transformation A12 - ['f' 't']
f    374
t    316
Name: A12, dtype: int64
After Transformation A13 - ['g' 's' 'p']
g    625
s     57
p      8
Name: A13, dtype: int64
After Transformation A16 - ['+' '-']
-    383
+    307
Name: A16, dtype: int64
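
A pandas-only alternative sketch, equivalent here: treat '?' as missing and impute with the mode computed over the remaining values. (Note that stat.mode above counts the '?' occurrences as well, which is harmless only because '?' is never the most frequent level in any column.)

# Equivalent imputation using pandas only (sketch, not the cell run above)
for col in categorical_cols:
    mode_val = Data.loc[Data[col] != '?', col].mode().iloc[0]  # most frequent non-'?' level
    Data[col] = Data[col].replace('?', mode_val)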

From the output of the previous step, we can see that columns A4 and A5 have the same number of unique values. Check whether their values map one-to-one by counting each combination of levels.

In [16]:
Data.groupby(['A4', 'A5']).size()
Out[16]:
A4  A5
l   gg      2
u   g     525
y   p     163
dtype: int64

The above output shows that columns A4 and A5 might carry the same information: whenever A4 has the value l, A5 has the value gg, and so on. However, the file Assignment1_Q1_crx.names tells us that A4 can take 4 values while A5 can only take 3:

A4: u, y, l, t.
A5: g, p, gg.

Even though we would normally simplify the dataset by removing columns with redundant information, in this case we have no domain knowledge of this anonymised dataset, so it is safer to keep both columns. We keep this finding in mind and can revisit the cleaning approach once our knowledge of the dataset improves.
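
As a compact alternative to the groupby above, a contingency table makes the pairing explicit; every level of A4 should co-occur with exactly one level of A5:

# Cross-tabulate A4 against A5; zeros everywhere except one cell per row
pd.crosstab(Data['A4'], Data['A5'])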

7. Discretize the A2 numerical descriptive feature

For the A2 numerical descriptive feature, discretize it via equal-frequency binning with 3 bins named "low", "medium", and "high", and then integer-encode the bins.

In [17]:
Data['A2'] = pd.qcut(Data['A2'], q=3, labels=['low', 'medium', 'high'])
In [18]:
level_mapping = {'low': 0, 'medium': 1, 'high': 2}
Data['A2'] = Data['A2'].replace(level_mapping)
Data['A2'].value_counts()
Out[18]:
1    231
0    230
2    229
Name: A2, dtype: int64
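
For reference, pandas can perform both steps at once: with labels=False, qcut returns the integer bin codes directly (0 for the lowest bin). This one-liner is a sketch of an equivalent alternative to the two cells above:

# Equal-frequency binning with integer codes in one step (0=low, 1=medium, 2=high)
Data['A2'] = pd.qcut(Data['A2'], q=3, labels=False)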

8. Handling target feature

Remove feature A16, as it is the class attribute in the dataset, and call it target. The remaining features are the descriptive features of our dataset.

In [19]:
target = Data['A16']
Data = Data.drop(columns='A16')

target.value_counts()
Out[19]:
-    383
+    307
Name: A16, dtype: int64

Label-encode the target feature so that the positive class is encoded as 1 and the negative class as 0. Confirm the correctness of our label encoding with a value count.

In [20]:
target = target.replace({'-': 0, '+': 1})

target.value_counts()
Out[20]:
0    383
1    307
Name: A16, dtype: int64

9. One-hot encoding for categorical features

As all the categorical features appear to be nominal, perform one-hot encoding for all the descriptive categorical features and call the encoded data frame Data_encoded. If a categorical descriptive feature has only 2 levels, encode it with only one binary variable. For the other categorical features (with more than 2 levels), use regular one-hot encoding (where the number of binary variables equals the number of distinct levels).

In [21]:
Data_encoded = Data.copy()  # retain original Data without encoding for further analysis
print(Data_encoded.columns)

# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
categorical_cols = Data_encoded.columns[Data.dtypes==object].tolist()
for col in categorical_cols:
    q = len(Data_encoded[col].unique())
    if (q == 2):
        Data_encoded[col] = pd.get_dummies(Data_encoded[col], drop_first=True)

# for other categorical features (with > 2 levels), 
# perform regular one-hot-encoding using pd.get_dummies()
# if a feature is numeric, it will be untouched
Data_encoded = pd.get_dummies(Data_encoded)

print(Data_encoded.columns)
Index(['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11',
       'A12', 'A13', 'A14', 'A15'],
      dtype='object')
Index(['A1', 'A2', 'A3', 'A8', 'A9', 'A10', 'A11', 'A12', 'A14', 'A15', 'A4_l',
       'A4_u', 'A4_y', 'A5_g', 'A5_gg', 'A5_p', 'A6_aa', 'A6_c', 'A6_cc',
       'A6_d', 'A6_e', 'A6_ff', 'A6_i', 'A6_j', 'A6_k', 'A6_m', 'A6_q', 'A6_r',
       'A6_w', 'A6_x', 'A7_bb', 'A7_dd', 'A7_ff', 'A7_h', 'A7_j', 'A7_n',
       'A7_o', 'A7_v', 'A7_z', 'A13_g', 'A13_p', 'A13_s'],
      dtype='object')

Check the data types of all the descriptive features. Once they are all numeric, the dataset is almost ready to be fed into the Scikit-Learn module.

In [22]:
Data_encoded.dtypes
Out[22]:
A1         uint8
A2         int64
A3       float64
A8       float64
A9         uint8
A10        uint8
A11        int64
A12        uint8
A14      float64
A15        int64
A4_l       uint8
A4_u       uint8
A4_y       uint8
A5_g       uint8
A5_gg      uint8
A5_p       uint8
A6_aa      uint8
A6_c       uint8
A6_cc      uint8
A6_d       uint8
A6_e       uint8
A6_ff      uint8
A6_i       uint8
A6_j       uint8
A6_k       uint8
A6_m       uint8
A6_q       uint8
A6_r       uint8
A6_w       uint8
A6_x       uint8
A7_bb      uint8
A7_dd      uint8
A7_ff      uint8
A7_h       uint8
A7_j       uint8
A7_n       uint8
A7_o       uint8
A7_v       uint8
A7_z       uint8
A13_g      uint8
A13_p      uint8
A13_s      uint8
dtype: object

10. Normalization

Normalize the descriptive features using standard scaling from the preprocessing submodule of Scikit-Learn, and call the result Data_encoded_norm_numpy. We keep Data_encoded around to keep track of the column names.
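
For reference, standard scaling transforms each feature $x$ using its mean $\mu$ and standard deviation $\sigma$:

$$x' = \frac{x - \mu}{\sigma}$$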

In [23]:
from sklearn import preprocessing

Data_scaler = preprocessing.StandardScaler()

Data_encoded_norm_numpy = Data_scaler.fit_transform(Data_encoded)

As Data_encoded_norm_numpy is a NumPy array, the column names are lost. Define a new Pandas data frame called Data_encoded_norm_df from Data_encoded_norm_numpy with the column names of Data_encoded. Finally, get the shape and a description of Data_encoded_norm_df with the include='all' option.

In [24]:
Data_encoded_norm_df = pd.DataFrame(Data_encoded_norm_numpy, 
                                    columns=Data_encoded.columns)
In [25]:
print(f'Shape of Data_encoded_norm_df is {Data_encoded_norm_df.shape}\n') 

Data_encoded_norm_df.describe(include='all').round(3)
Shape of Data_encoded_norm_df is (690, 42)

Out[25]:
A1 A2 A3 A8 A9 A10 A11 A12 A14 A15 ... A7_ff A7_h A7_j A7_n A7_o A7_v A7_z A13_g A13_p A13_s
count 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 ... 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000
mean 0.000 -0.000 0.000 0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 ... 0.000 0.000 0.000 -0.000 -0.000 -0.000 -0.000 0.000 -0.000 -0.000
std 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 ... 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001
min -1.512 -1.224 -0.957 -0.665 -1.048 -0.864 -0.494 -0.919 -1.067 -0.195 ... -0.300 -0.500 -0.108 -0.076 -0.054 -1.203 -0.108 -3.101 -0.108 -0.300
25% -1.512 -1.224 -0.756 -0.616 -1.048 -0.864 -0.494 -0.919 -0.602 -0.195 ... -0.300 -0.500 -0.108 -0.076 -0.054 -1.203 -0.108 0.322 -0.108 -0.300
50% 0.661 0.002 -0.404 -0.366 0.955 -0.864 -0.494 -0.919 -0.137 -0.194 ... -0.300 -0.500 -0.108 -0.076 -0.054 0.831 -0.108 0.322 -0.108 -0.300
75% 0.661 1.228 0.492 0.120 0.955 1.157 0.123 1.088 0.514 -0.119 ... -0.300 -0.500 -0.108 -0.076 -0.054 0.831 -0.108 0.322 -0.108 -0.300
max 0.661 1.228 4.672 7.858 0.955 1.157 13.294 1.088 10.557 19.012 ... 3.332 2.000 9.233 13.096 18.547 0.831 9.233 0.322 9.233 3.332

8 rows × 42 columns

Sanity-check some of the encoded features: there should be only 3 unique values for feature A2 and 2 unique values for feature A7_h.

In [26]:
Data_encoded_norm_df['A2'].value_counts()
Out[26]:
 0.001777    231
-1.224303    230
 1.227857    229
Name: A2, dtype: int64
In [27]:
Data_encoded_norm_df['A7_h'].value_counts()
Out[27]:
-0.5    552
 2.0    138
Name: A7_h, dtype: int64

Define a new data frame called df_clean, which combines the normalized and scaled descriptive features with the target feature as the last column. Ensure the target feature is passed in as a NumPy array by calling .values.

In [28]:
df_clean = Data_encoded_norm_df.assign(target = target.values)

11. Summary of the final clean dataset: df_clean

Generate the summary output of the final data frame df_clean.

In [29]:
pd.set_option('display.max_columns', 50)
df_clean.shape
Out[29]:
(690, 43)
In [30]:
df_clean.describe(include='all').round(3) 
Out[30]:
A1 A2 A3 A8 A9 A10 A11 A12 A14 A15 A4_l A4_u A4_y A5_g A5_gg A5_p A6_aa A6_c A6_cc A6_d A6_e A6_ff A6_i A6_j A6_k A6_m A6_q A6_r A6_w A6_x A7_bb A7_dd A7_ff A7_h A7_j A7_n A7_o A7_v A7_z A13_g A13_p A13_s target
count 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000 690.000
mean 0.000 -0.000 0.000 0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000 0.000 0.000 0.000 -0.000 0.000 0.000 0.000 -0.000 0.000 0.000 0.000 -0.000 0.000 0.000 0.000 0.000 -0.000 -0.000 -0.000 -0.000 -0.000 0.000 0.000 0.000 -0.000 -0.000 -0.000 -0.000 0.000 -0.000 -0.000 0.445
std 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 0.497
min -1.512 -1.224 -0.957 -0.665 -1.048 -0.864 -0.494 -0.919 -1.067 -0.195 -0.054 -1.784 -0.556 -1.784 -0.054 -0.556 -0.291 -0.518 -0.251 -0.213 -0.194 -0.288 -0.306 -0.121 -0.283 -0.241 -0.357 -0.066 -0.320 -0.241 -0.306 -0.094 -0.300 -0.500 -0.108 -0.076 -0.054 -1.203 -0.108 -3.101 -0.108 -0.300 0.000
25% -1.512 -1.224 -0.756 -0.616 -1.048 -0.864 -0.494 -0.919 -0.602 -0.195 -0.054 0.561 -0.556 0.561 -0.054 -0.556 -0.291 -0.518 -0.251 -0.213 -0.194 -0.288 -0.306 -0.121 -0.283 -0.241 -0.357 -0.066 -0.320 -0.241 -0.306 -0.094 -0.300 -0.500 -0.108 -0.076 -0.054 -1.203 -0.108 0.322 -0.108 -0.300 0.000
50% 0.661 0.002 -0.404 -0.366 0.955 -0.864 -0.494 -0.919 -0.137 -0.194 -0.054 0.561 -0.556 0.561 -0.054 -0.556 -0.291 -0.518 -0.251 -0.213 -0.194 -0.288 -0.306 -0.121 -0.283 -0.241 -0.357 -0.066 -0.320 -0.241 -0.306 -0.094 -0.300 -0.500 -0.108 -0.076 -0.054 0.831 -0.108 0.322 -0.108 -0.300 0.000
75% 0.661 1.228 0.492 0.120 0.955 1.157 0.123 1.088 0.514 -0.119 -0.054 0.561 -0.556 0.561 -0.054 -0.556 -0.291 -0.518 -0.251 -0.213 -0.194 -0.288 -0.306 -0.121 -0.283 -0.241 -0.357 -0.066 -0.320 -0.241 -0.306 -0.094 -0.300 -0.500 -0.108 -0.076 -0.054 0.831 -0.108 0.322 -0.108 -0.300 1.000
max 0.661 1.228 4.672 7.858 0.955 1.157 13.294 1.088 10.557 19.012 18.547 0.561 1.798 0.561 18.547 1.798 3.432 1.930 3.979 4.690 5.158 3.467 3.270 8.246 3.540 4.142 2.801 15.133 3.127 4.142 3.270 10.677 3.332 2.000 9.233 13.096 18.547 0.831 9.233 0.322 9.233 3.332 1.000
In [31]:
df_clean.head(5)
Out[31]:
A1 A2 A3 A8 A9 A10 A11 A12 A14 A15 A4_l A4_u A4_y A5_g A5_gg A5_p A6_aa A6_c A6_cc A6_d A6_e A6_ff A6_i A6_j A6_k A6_m A6_q A6_r A6_w A6_x A7_bb A7_dd A7_ff A7_h A7_j A7_n A7_o A7_v A7_z A13_g A13_p A13_s target
0 0.661438 0.001777 -0.956613 -0.291083 0.95465 1.157144 -0.288101 -0.919195 0.107155 -0.195413 -0.053916 0.560612 -0.556146 0.560612 -0.053916 -0.556146 -0.291386 -0.518056 -0.251345 -0.213201 -0.193892 -0.288448 -0.305782 -0.121268 -0.282511 -0.241417 -0.357003 -0.066082 3.127499 -0.241417 -0.305782 -0.093659 -0.300079 -0.5 -0.108306 -0.07636 -0.053916 0.831370 -0.108306 0.322490 -0.108306 -0.300079 1
1 -1.511858 1.227857 -0.060051 0.244190 0.95465 1.157144 0.740830 -0.919195 -0.816912 -0.087852 -0.053916 0.560612 -0.556146 0.560612 -0.053916 -0.556146 -0.291386 -0.518056 -0.251345 -0.213201 -0.193892 -0.288448 -0.305782 -0.121268 -0.282511 -0.241417 2.801099 -0.066082 -0.319744 -0.241417 -0.305782 -0.093659 -0.300079 2.0 -0.108306 -0.07636 -0.053916 -1.202834 -0.108306 0.322490 -0.108306 -0.300079 1
2 -1.511858 0.001777 -0.856102 -0.216324 0.95465 -0.864196 -0.493887 -0.919195 0.560471 -0.037144 -0.053916 0.560612 -0.556146 0.560612 -0.053916 -0.556146 -0.291386 -0.518056 -0.251345 -0.213201 -0.193892 -0.288448 -0.305782 -0.121268 -0.282511 -0.241417 2.801099 -0.066082 -0.319744 -0.241417 -0.305782 -0.093659 -0.300079 2.0 -0.108306 -0.07636 -0.053916 -1.202834 -0.108306 0.322490 -0.108306 -0.300079 1
3 0.661438 0.001777 -0.647038 0.456505 0.95465 1.157144 0.535044 1.087908 -0.485643 -0.194837 -0.053916 0.560612 -0.556146 0.560612 -0.053916 -0.556146 -0.291386 -0.518056 -0.251345 -0.213201 -0.193892 -0.288448 -0.305782 -0.121268 -0.282511 -0.241417 -0.357003 -0.066082 3.127499 -0.241417 -0.305782 -0.093659 -0.300079 -0.5 -0.108306 -0.07636 -0.053916 0.831370 -0.108306 0.322490 -0.108306 -0.300079 1
4 0.661438 -1.224303 0.174141 -0.153526 0.95465 -0.864196 -0.493887 -0.919195 -0.369408 -0.195413 -0.053916 0.560612 -0.556146 0.560612 -0.053916 -0.556146 -0.291386 -0.518056 -0.251345 -0.213201 -0.193892 -0.288448 -0.305782 -0.121268 -0.282511 -0.241417 -0.357003 -0.066082 3.127499 -0.241417 -0.305782 -0.093659 -0.300079 -0.5 -0.108306 -0.07636 -0.053916 0.831370 -0.108306 -3.100868 -0.108306 3.332456 1

Write the final dataset df_clean to a CSV file called df_clean.csv.

In [32]:
# set index to False so that row IDs are not written
df_clean.to_csv('df_clean.csv', index=False)
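
As a quick smoke test (not part of the assignment), the clean dataset can indeed be fed directly into a Scikit-Learn classifier; a hypothetical sketch:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = df_clean.drop(columns='target')
y = df_clean['target']

# 5-fold cross-validated accuracy; the exact score is incidental,
# the point is that no further preprocessing is needed
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())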

Question 2 - KNN

Using the k-NN algorithm with Manhattan distance, predict the level of corruption in Russia based on a range of macro-economic and social features of a list of given countries. The CPI measures the perceived level of corruption in the public sector of a country and ranges from 0 (highly corrupt) to 10 (very clean).

1. Read in the given data file

Read the Asignment1_Q2.csv file into a dataframe oCorr, which stands for "original Corruption".

In [33]:
oCorr = pd.read_csv("Asignment1_Q2.csv", index_col=False, header=0)
oCorr
Out[33]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS CPI
0 Afghanistan 59.61 23.21 74.3 4.44 0.4 1.5171
1 Haiti 45.00 47.67 73.1 0.09 3.4 1.7999
2 Nigeria 51.30 38.23 82.6 1.07 4.1 2.4493
3 Egypt 70.48 26.58 19.6 1.86 5.3 2.8622
4 Argentina 75.77 32.30 13.3 0.76 10.1 2.9961
5 China 74.87 29.98 13.7 1.95 6.4 3.6356
6 Brazil 73.12 42.93 14.5 1.43 7.2 3.7741
7 Israel 81.30 28.80 3.6 6.77 12.5 5.8069
8 U.S.A 78.51 29.85 6.3 4.72 13.7 7.1357
9 Ireland 80.15 27.23 3.5 0.60 11.5 7.536
10 U.K. 80.09 28.49 4.4 2.59 13.0 7.7751
11 Germany 80.24 22.07 3.5 1.31 12.0 8.0461
12 Canada 80.99 24.79 4.9 1.42 14.2 8.6725
13 Australia 82.09 25.40 4.2 1.86 11.5 8.8442
14 Sweden 81.43 22.18 2.4 1.27 12.8 9.2985
15 New Zealand 80.67 27.81 4.9 1.13 12.3 9.4627
16 Russia 67.62 31.68 10.0 3.87 12.9 ?

2. Check the datatypes

In [34]:
oCorr.dtypes
Out[34]:
COUNTRY_ID       object
LIFE_EXP        float64
TOP10_INCOME    float64
INFANT_MORT     float64
MIL_SPEND       float64
SCHOOL_YEARS    float64
CPI              object
dtype: object

3. Data validation and preparation

Since CPI should be a numeric value, convert it to numeric.

In [35]:
oCorr['CPI']=oCorr['CPI'].apply(pd.to_numeric, errors='coerce')

The last row in the dataset is the query row for which we want to predict the CPI. Extract the last row into a new dataframe, q, and remove it from the dataset, oCorr; oCorr is now our training dataset. Please note that the information for Afghanistan is located at index 0, and so on.

In [36]:
q = oCorr.iloc[-1,:].to_frame()
q=q.T.reset_index(drop=True)
In [37]:
q
Out[37]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS CPI
0 Russia 67.62 31.68 10 3.87 12.9 NaN
In [38]:
oCorr = oCorr.iloc[:-1,:]
In [39]:
oCorr
Out[39]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS CPI
0 Afghanistan 59.61 23.21 74.3 4.44 0.4 1.5171
1 Haiti 45.00 47.67 73.1 0.09 3.4 1.7999
2 Nigeria 51.30 38.23 82.6 1.07 4.1 2.4493
3 Egypt 70.48 26.58 19.6 1.86 5.3 2.8622
4 Argentina 75.77 32.30 13.3 0.76 10.1 2.9961
5 China 74.87 29.98 13.7 1.95 6.4 3.6356
6 Brazil 73.12 42.93 14.5 1.43 7.2 3.7741
7 Israel 81.30 28.80 3.6 6.77 12.5 5.8069
8 U.S.A 78.51 29.85 6.3 4.72 13.7 7.1357
9 Ireland 80.15 27.23 3.5 0.60 11.5 7.5360
10 U.K. 80.09 28.49 4.4 2.59 13.0 7.7751
11 Germany 80.24 22.07 3.5 1.31 12.0 8.0461
12 Canada 80.99 24.79 4.9 1.42 14.2 8.6725
13 Australia 82.09 25.40 4.2 1.86 11.5 8.8442
14 Sweden 81.43 22.18 2.4 1.27 12.8 9.2985
15 New Zealand 80.67 27.81 4.9 1.13 12.3 9.4627

As CPI is our target feature, extract it into a separate dataframe, oTarget, and drop this feature from the oCorr and q dataframes.

In [40]:
oTarget = oCorr['CPI']
oCorr = oCorr.drop(columns='CPI')
q = q.drop(columns='CPI')
Transform the training and query dataframes into 2-dimensional NumPy arrays for the later distance calculations.

In [41]:
OCorrMatrix = oCorr.iloc[:, 1:].values
qMatrix = q.iloc[:, 1:].values

In [42]:
type(OCorrMatrix)
Out[42]:
numpy.ndarray
In [43]:
OCorrMatrix
Out[43]:
array([[59.61, 23.21, 74.3 ,  4.44,  0.4 ],
       [45.  , 47.67, 73.1 ,  0.09,  3.4 ],
       [51.3 , 38.23, 82.6 ,  1.07,  4.1 ],
       [70.48, 26.58, 19.6 ,  1.86,  5.3 ],
       [75.77, 32.3 , 13.3 ,  0.76, 10.1 ],
       [74.87, 29.98, 13.7 ,  1.95,  6.4 ],
       [73.12, 42.93, 14.5 ,  1.43,  7.2 ],
       [81.3 , 28.8 ,  3.6 ,  6.77, 12.5 ],
       [78.51, 29.85,  6.3 ,  4.72, 13.7 ],
       [80.15, 27.23,  3.5 ,  0.6 , 11.5 ],
       [80.09, 28.49,  4.4 ,  2.59, 13.  ],
       [80.24, 22.07,  3.5 ,  1.31, 12.  ],
       [80.99, 24.79,  4.9 ,  1.42, 14.2 ],
       [82.09, 25.4 ,  4.2 ,  1.86, 11.5 ],
       [81.43, 22.18,  2.4 ,  1.27, 12.8 ],
       [80.67, 27.81,  4.9 ,  1.13, 12.3 ]])
In [44]:
qMatrix
Out[44]:
array([[67.62, 31.68, 10.0, 3.87, 12.9]], dtype=object)

4. Part A

Question

What value would a 3-nearest neighbor prediction model using Manhattan distance return for the CPI of Russia?

Define a function, calManhattan, which calculates the Manhattan distance from each row of the descriptive matrix to the target matrix; the target matrix contains only one row, at index 0. The calculated distances are returned as a 1-dimensional NumPy array.
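
For reference, for two m-dimensional instances $\mathbf{a}$ and $\mathbf{b}$, the Manhattan (city-block) distance is

$$d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{m} \lvert a_i - b_i \rvert$$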

In [45]:
def calManhattan(descMatrix, targetMatrix):
    # one Manhattan distance per row of the descriptive matrix
    M_dist = np.zeros(len(descMatrix))
    for i in range(len(descMatrix)):
        M_dist[i] = sum(np.abs(descMatrix[i, :] - targetMatrix[0, :]))
    return M_dist
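
The explicit loop above can also be written as a single NumPy broadcast; a sketch of an equivalent vectorized version for a one-row target matrix:

def calManhattanVectorized(descMatrix, targetMatrix):
    # broadcast the query row against every training row, then sum |differences|
    return np.abs(descMatrix - targetMatrix[0]).sum(axis=1)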

Pass the training and query matrices to the calManhattan function to obtain the Manhattan distance of each row, and store the result in the NumPy array M. Please note that the first cell, with index 0, is the Manhattan distance between Afghanistan and Russia, and so on.

In [46]:
M = calManhattan(OCorrMatrix, qMatrix)
M
Out[46]:
array([ 93.85, 114.99, 107.07,  27.17,  17.98,  21.07,  29.39,  26.26,
        18.07,  28.15,  22.64,  32.19,  29.11,  29.96,  33.61,  25.36])

Construct a dataframe, result_a, which shows the Manhattan distance for each country and its corresponding target feature.

In [47]:
data_a = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': M.tolist(), 'CPI' : oTarget.tolist()} 
result_a = pd.DataFrame(data_a)

Sort result_a based on the Manhattan distance in ascending order.

In [48]:
result_a = result_a.sort_values('Manhattan distance to Russia')
result_a
Out[48]:
COUNTRY_ID Manhattan distance to Russia CPI
4 Argentina 17.98 2.9961
8 U.S.A 18.07 7.1357
5 China 21.07 3.6356
10 U.K. 22.64 7.7751
15 New Zealand 25.36 9.4627
7 Israel 26.26 5.8069
3 Egypt 27.17 2.8622
9 Ireland 28.15 7.5360
12 Canada 29.11 8.6725
6 Brazil 29.39 3.7741
13 Australia 29.96 8.8442
11 Germany 32.19 8.0461
14 Sweden 33.61 9.2985
0 Afghanistan 93.85 1.5171
2 Nigeria 107.07 2.4493
1 Haiti 114.99 1.7999

Draw the conclusion from result_a below:

Answer

The 3 nearest neighbors to Russia are Argentina, the U.S.A. and China. The CPI value returned by the model is the average CPI score of these 3 neighbors, which is

In [49]:
nearest3CPI = result_a.iloc[0:3,2].tolist()
print('CPI for the first 3 neighbors:', nearest3CPI)
CPI for the first 3 neighbors: [2.9961, 7.1357, 3.6356]
In [50]:
avgnearest3CPI = np.round((sum(nearest3CPI)/3), 4)
print('Average CPI for the first 3 neighbors:', avgnearest3CPI)
Average CPI for the first 3 neighbors: 4.5891
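
As a hypothetical cross-check (not required by the question), Scikit-Learn's KNeighborsRegressor with the Manhattan metric should reproduce this value:

from sklearn.neighbors import KNeighborsRegressor

# 3-NN with Manhattan distance and uniform weights, fit on the raw features
knn = KNeighborsRegressor(n_neighbors=3, metric='manhattan')
knn.fit(OCorrMatrix, oTarget.values)
print(knn.predict(qMatrix.astype(float)))  # expect approximately 4.5891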

5. Part B

Question

What value would a weighted k-NN prediction model return for the CPI of Russia? Use k =16 (i.e., the full dataset) and a weighting scheme of the reciprocal of the squared Manhattan distance between the neighbor and the query.

Calculate the weight array, W, which is the reciprocal of the squared Manhattan distance between each neighbor country and the query, Russia.

In [51]:
W = 1 / (M)**2
In [52]:
W
Out[52]:
array([1.13535440e-04, 7.56275188e-05, 8.72297032e-05, 1.35463011e-03,
       3.09328991e-03, 2.25253179e-03, 1.15771278e-03, 1.45014208e-03,
       3.06255358e-03, 1.26195306e-03, 1.95095456e-03, 9.65068292e-04,
       1.18009125e-03, 1.11408001e-03, 8.85243966e-04, 1.55489656e-03])

Calculate the weights multiplied by the instance target values (i.e. CPI) and store the result as product.

In [53]:
product = W*oTarget.values

Construct a dataframe, result_b, which shows the Manhattan distance for each country and its corresponding target feature (i.e. CPI), weight, and weight x CPI.

In [54]:
data_b = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': M.tolist(), 'CPI' : oTarget.tolist(), 'Weight' : W.tolist(), 'Weight x CPI': product.tolist()} 
result_b = pd.DataFrame(data_b)

Sort result_b based on the Manhattan distance in ascending order.

In [55]:
result_b = result_b.sort_values('Manhattan distance to Russia')
result_b
Out[55]:
COUNTRY_ID Manhattan distance to Russia CPI Weight Weight x CPI
4 Argentina 17.98 2.9961 0.003093 0.009268
8 U.S.A 18.07 7.1357 0.003063 0.021853
5 China 21.07 3.6356 0.002253 0.008189
10 U.K. 22.64 7.7751 0.001951 0.015169
15 New Zealand 25.36 9.4627 0.001555 0.014714
7 Israel 26.26 5.8069 0.001450 0.008421
3 Egypt 27.17 2.8622 0.001355 0.003877
9 Ireland 28.15 7.5360 0.001262 0.009510
12 Canada 29.11 8.6725 0.001180 0.010234
6 Brazil 29.39 3.7741 0.001158 0.004369
13 Australia 29.96 8.8442 0.001114 0.009853
11 Germany 32.19 8.0461 0.000965 0.007765
14 Sweden 33.61 9.2985 0.000885 0.008231
0 Afghanistan 93.85 1.5171 0.000114 0.000172
2 Nigeria 107.07 2.4493 0.000087 0.000214
1 Haiti 114.99 1.7999 0.000076 0.000136

Answer

Draw the conclusion from result_b below:

Since we are using k=16, the value returned by the model is the sum of the instance weights multiplied by the instance target values, divided by the sum of the instance weights over all instances:
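
In formula form, with $t_i$ the CPI of neighbor $i$, $d$ the Manhattan distance and $q$ the query:

$$\hat{t}(q) = \frac{\sum_{i=1}^{k} w_i \, t_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(q, \mathbf{x}_i)^2}$$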

In [56]:
weight16 = np.round(sum(result_b['Weight x CPI']) / sum(result_b['Weight']),4)
print('Weight CPI using full set of data k = 16:', weight16)
Weight CPI using full set of data k = 16: 6.1215
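
Again as a hypothetical cross-check, the same weighted prediction can be obtained in Scikit-Learn by passing a callable weighting scheme:

from sklearn.neighbors import KNeighborsRegressor

# k=16 with inverse squared-distance weights; expect approximately 6.1215
knn_w = KNeighborsRegressor(n_neighbors=16, metric='manhattan',
                            weights=lambda d: 1.0 / d**2)
knn_w.fit(OCorrMatrix, oTarget.values)
print(knn_w.predict(qMatrix.astype(float)))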

6. Part C

Question

The descriptive features in this dataset are of different types. For example, some are percentages, others are measured in years, and others are measured in counts per 1,000. We should always consider normalizing our data, but it is particularly important to do this when the descriptive features are measured in different units. What value would a 3-nearest neighbor prediction model using Manhattan distance return for the CPI of Russia when the descriptive features have been normalized using range normalization?

Define a function, normal2DByRow, which takes a 2-dimensional descMatrix as a parameter and returns the matrix after column-wise (per-feature) min-max normalization.

In [57]:
def normal2DByRow(descMatrix):
    norm = np.zeros((len(descMatrix), descMatrix.shape[1]))
    # normalize each feature (column) to [0, 1] using its own min and max
    for i in range(descMatrix.shape[1]):
        minVal = min(descMatrix[:, i])
        maxVal = max(descMatrix[:, i])
        for j in range(len(descMatrix)):
            norm[j, i] = (descMatrix[j, i] - minVal) / (maxVal - minVal)
    return norm
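
Min-max (range) normalization maps each feature value $a$ to $a' = (a - \min) / (\max - \min)$ using the column's minimum and maximum. The nested loops can be collapsed into a NumPy broadcast; a sketch of an equivalent version:

def normal2DVectorized(descMatrix):
    # column-wise min-max normalization in one broadcast expression
    colMin = descMatrix.min(axis=0)
    colMax = descMatrix.max(axis=0)
    return (descMatrix - colMin) / (colMax - colMin)

The same column minima and maxima are reused to normalize the query row, which is what normalTFromDByRow does below.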

Pass in the training matrix OCorrMatrix to the normal2DByRow function to obtain nCorrMatrix.

In [58]:
nCorrMatrix=normal2DByRow(OCorrMatrix)
In [59]:
nCorrMatrix
Out[59]:
array([[0.39390671, 0.04453125, 0.89650873, 0.6511976 , 0.        ],
       [0.        , 1.        , 0.88154613, 0.        , 0.2173913 ],
       [0.1698571 , 0.63125   , 1.        , 0.14670659, 0.26811594],
       [0.68697762, 0.17617187, 0.21446384, 0.26497006, 0.35507246],
       [0.82960367, 0.39960937, 0.13591022, 0.1002994 , 0.70289855],
       [0.80533837, 0.30898437, 0.14089776, 0.27844311, 0.43478261],
       [0.75815584, 0.81484375, 0.15087282, 0.2005988 , 0.49275362],
       [0.97870046, 0.26289062, 0.01496259, 1.        , 0.87681159],
       [0.90347803, 0.30390625, 0.04862843, 0.69311377, 0.96376812],
       [0.9476948 , 0.2015625 , 0.01371571, 0.07634731, 0.80434783],
       [0.94607711, 0.25078125, 0.02493766, 0.3742515 , 0.91304348],
       [0.95012133, 0.        , 0.01371571, 0.18263473, 0.84057971],
       [0.97034241, 0.10625   , 0.03117207, 0.1991018 , 1.        ],
       [1.        , 0.13007812, 0.02244389, 0.26497006, 0.80434783],
       [0.98220545, 0.00429687, 0.        , 0.17664671, 0.89855072],
       [0.96171475, 0.22421875, 0.03117207, 0.15568862, 0.86231884]])

Define a function, normalTFromDByRow, which takes two 2-dimensional matrices, descMatrix and targetMatrix, as parameters. descMatrix can have any number of rows and columns, while targetMatrix must have exactly 1 row and the same number of columns as descMatrix. We then normalize targetMatrix based on the column-wise minimum and maximum values of descMatrix. The function returns the normalized targetMatrix.

In [60]:
def normalTFromDByRow(descMatrix, targetMatrix):
    norm = np.zeros((1, targetMatrix.shape[1]))
    # normalize the single query row using the column min/max of the training data
    for i in range(descMatrix.shape[1]):
        minVal = min(descMatrix[:, i])
        maxVal = max(descMatrix[:, i])
        norm[0, i] = (targetMatrix[0, i] - minVal) / (maxVal - minVal)
    return norm

Pass in the training OCorrMatrix and query matrix qMatrix to the normalTFromDByRow function to obtain nQMatrix.

In [61]:
nQMatrix = normalTFromDByRow(OCorrMatrix,qMatrix)
In [62]:
nQMatrix
Out[62]:
array([[0.60986789, 0.37539062, 0.09476309, 0.56586826, 0.9057971 ]])

Construct a dataframe nCorrDf to show the normalized descriptive features with headings.

In [63]:
nCorrDf = pd.DataFrame(data=nCorrMatrix, columns=oCorr.columns[1:])
nCorrDf = pd.concat([oCorr['COUNTRY_ID'].to_frame(), nCorrDf], axis=1)
nCorrDf
Out[63]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS
0 Afghanistan 0.393907 0.044531 0.896509 0.651198 0.000000
1 Haiti 0.000000 1.000000 0.881546 0.000000 0.217391
2 Nigeria 0.169857 0.631250 1.000000 0.146707 0.268116
3 Egypt 0.686978 0.176172 0.214464 0.264970 0.355072
4 Argentina 0.829604 0.399609 0.135910 0.100299 0.702899
5 China 0.805338 0.308984 0.140898 0.278443 0.434783
6 Brazil 0.758156 0.814844 0.150873 0.200599 0.492754
7 Israel 0.978700 0.262891 0.014963 1.000000 0.876812
8 U.S.A 0.903478 0.303906 0.048628 0.693114 0.963768
9 Ireland 0.947695 0.201563 0.013716 0.076347 0.804348
10 U.K. 0.946077 0.250781 0.024938 0.374251 0.913043
11 Germany 0.950121 0.000000 0.013716 0.182635 0.840580
12 Canada 0.970342 0.106250 0.031172 0.199102 1.000000
13 Australia 1.000000 0.130078 0.022444 0.264970 0.804348
14 Sweden 0.982205 0.004297 0.000000 0.176647 0.898551
15 New Zealand 0.961715 0.224219 0.031172 0.155689 0.862319

Construct a dataframe nQdf to show the normalized query with headings.

In [64]:
nQdf = pd.DataFrame(data=nQMatrix, columns=q.columns[1:])
qCountry = q['COUNTRY_ID'].reset_index(drop=True).to_frame()
nQdf = pd.concat([qCountry, nQdf], axis=1)
nQdf
Out[64]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS
0 Russia 0.609868 0.375391 0.094763 0.565868 0.905797

Pass the normalized training matrix (nCorrMatrix) and query matrix (nQMatrix) to the calManhattan function to obtain the Manhattan distance of each row, and store the result in the NumPy array nM. Please note that the first cell, with index 0, is the Manhattan distance between Afghanistan and Russia, and so on.

In [65]:
nM = calManhattan(nCorrMatrix, nQMatrix)
In [66]:
nM
Out[66]:
array([2.33969263, 3.27553437, 2.6579499 , 1.24765207, 0.95356907,
       1.06645103, 1.42216374, 1.02425031, 0.5964457 , 1.18367265,
       0.72950718, 1.24514237, 1.15417553, 1.11011129, 1.23466233,
       1.02026766])

Construct a dataframe, result_c, which shows the Manhattan distance after normalization for each country and its corresponding target feature. Please note that the target feature is not normalized.

In [67]:
data_c = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': nM.tolist(), 'CPI' : oTarget.tolist()} 
result_c = pd.DataFrame(data_c)
In [68]:
result_c = result_c.sort_values('Manhattan distance to Russia')
result_c
Out[68]:
COUNTRY_ID Manhattan distance to Russia CPI
8 U.S.A 0.596446 7.1357
10 U.K. 0.729507 7.7751
4 Argentina 0.953569 2.9961
15 New Zealand 1.020268 9.4627
7 Israel 1.024250 5.8069
5 China 1.066451 3.6356
13 Australia 1.110111 8.8442
12 Canada 1.154176 8.6725
9 Ireland 1.183673 7.5360
14 Sweden 1.234662 9.2985
11 Germany 1.245142 8.0461
3 Egypt 1.247652 2.8622
6 Brazil 1.422164 3.7741
0 Afghanistan 2.339693 1.5171
2 Nigeria 2.657950 2.4493
1 Haiti 3.275534 1.7999

Answer

Draw the conclusion from result_c below:

The 3 nearest neighbors to Russia are the U.S.A., the U.K. and Argentina. The CPI value returned by the model is the average CPI score of these 3 neighbors, which is

In [69]:
nearest3CPINorm = result_c.iloc[0:3,2].tolist()
print('CPI for the first 3 neighbors:', nearest3CPINorm)

avgnearest3CPINorm = np.round((sum(nearest3CPINorm)/3), 4)
print('Average CPI for the first 3 neighbors:', avgnearest3CPINorm)
CPI for the first 3 neighbors: [7.1357, 7.7751, 2.9961]
Average CPI for the first 3 neighbors: 5.969

7. Part D

Question

What value would a weighted k-NN prediction model, with k=16 (i.e., the full dataset) and a weighting scheme of the reciprocal of the squared Manhattan distance between the neighbor and the query, return for the CPI of Russia when it is applied to the range-normalized data?

Calculate the weight array, nW, which is the reciprocal of the squared Manhattan distance between each neighbor country and the query, Russia, after normalization.

In [70]:
nW = 1 / (nM)**2
In [71]:
nW
Out[71]:
array([0.18267638, 0.09320424, 0.14154887, 0.64241107, 1.09975434,
       0.87926171, 0.49442543, 0.95320824, 2.81098273, 0.71373464,
       1.87906094, 0.64500336, 0.75068247, 0.81145971, 0.65599964,
       0.96066454])

Calculate the weights multiplied by the instance target values (i.e. CPI) and store the result as nProduct.

In [72]:
nProduct =nW*oTarget.values
nProduct
Out[72]:
array([ 0.27713833,  0.16775832,  0.34669566,  1.83870896,  3.29497399,
        3.19664388,  1.86601101,  5.53518496, 20.05832945,  5.37870426,
       14.60988668,  5.18976154,  6.51029368,  7.17671193,  6.09981267,
        9.09048034])

Construct a dataframe, result_d, which shows the Manhattan distance for each country and its corresponding target feature (i.e. CPI), weight, and weight x CPI after normalization. Please note that the target feature is not normalized.

In [73]:
data_d = { 'COUNTRY_ID':oCorr['COUNTRY_ID'].tolist() , 'Manhattan distance to Russia': nM.tolist(), 'CPI' : oTarget.tolist(), 'Weight' : nW.tolist(), 'Weight x CPI': nProduct.tolist()} 
result_d = pd.DataFrame(data_d)

Sort result_d based on the Manhattan distance in ascending order.

In [75]:
result_d = result_d.sort_values('Manhattan distance to Russia')
result_d
Out[75]:
COUNTRY_ID Manhattan distance to Russia CPI Weight Weight x CPI
8 U.S.A 0.596446 7.1357 2.810983 20.058329
10 U.K. 0.729507 7.7751 1.879061 14.609887
4 Argentina 0.953569 2.9961 1.099754 3.294974
15 New Zealand 1.020268 9.4627 0.960665 9.090480
7 Israel 1.024250 5.8069 0.953208 5.535185
5 China 1.066451 3.6356 0.879262 3.196644
13 Australia 1.110111 8.8442 0.811460 7.176712
12 Canada 1.154176 8.6725 0.750682 6.510294
9 Ireland 1.183673 7.5360 0.713735 5.378704
14 Sweden 1.234662 9.2985 0.656000 6.099813
11 Germany 1.245142 8.0461 0.645003 5.189762
3 Egypt 1.247652 2.8622 0.642411 1.838709
6 Brazil 1.422164 3.7741 0.494425 1.866011
0 Afghanistan 2.339693 1.5171 0.182676 0.277138
2 Nigeria 2.657950 2.4493 0.141549 0.346696
1 Haiti 3.275534 1.7999 0.093204 0.167758

Answer

Draw the conclusion from result_d below:

Since we are using k=16, the value returned by the model is the sum of the instance weights multiplied by the instance target values, divided by the sum of the instance weights over all instances:

In [76]:
weightNorm16 = np.round(sum(result_d['Weight x CPI']) / sum(result_d['Weight']),4)
print('Weight CPI using full set of data k = 16:', weightNorm16)
Weight CPI using full set of data k = 16: 6.6091

8. Part E

Question

The actual 2011 CPI for Russia was 2.4488. Which of the predictions made was the most accurate? Why do you think this was?

Answer

The closest prediction came from the 3-nearest neighbor model based on the original (unnormalized) data. Because the feature ranges in the dataset differ so widely, normalization should in principle yield unbiased distances (so that the ranking of instances is not dominated by the features with larger values) and thus the more sensible CPI estimate. If we further examine the data according to the ranking of Manhattan distance against CPI, the distance ranking on the original dataset happens to place countries with CPI close to 2.4488 slightly better than the normalized dataset when k=3. Overall, however, we observe no clear relationship between distance ranking and CPI in either dataset for this particular instance (the results in parts B and D show this clearly when the full dataset is used). Normalizing the data and multiplying the weights by CPI only amplifies this non-linear relationship for this particular query.

It would be more rigorous to use a few more query (or test) instances to justify which model works best for this scenario (we could also assess the relevance of each macro-economic feature to the CPI, rather than relying on one particular instance). We could then fine-tune the choice of distance metric and the value of k to maximize the accuracy of CPI predictions for unseen countries.

We have constructed the dataframes below to show all the features in both datasets together with the ranking of Manhattan distance, for further reference.

In [77]:
targetDf = pd.DataFrame(oTarget.tolist(), columns=['CPI'])
DistRank = pd.DataFrame(np.arange(1,17), columns=['Rank <lower indicates closer>'])
In [78]:
nMDf = pd.DataFrame(nM.tolist(), columns=['Manhattan distance to Russia'])
nRankDf = pd.concat([nCorrDf,nMDf, targetDf ], axis=1)
nRankDf = nRankDf.sort_values('Manhattan distance to Russia').reset_index(drop=True)
nRankDf = pd.concat([nRankDf, DistRank ], axis=1)
nRankDf = nRankDf.sort_values('CPI').reset_index(drop=True)
In [79]:
MDf = pd.DataFrame(M.tolist(), columns=['Manhattan distance to Russia'])
oRankDf = pd.concat([oCorr, MDf, targetDf ], axis=1)
oRankDf = oRankDf.sort_values('Manhattan distance to Russia').reset_index(drop=True)
oRankDf = pd.concat([oRankDf, DistRank ], axis=1)
oRankDf = oRankDf.sort_values('CPI').reset_index(drop=True)

Original Data with Distance ranking and CPI in ascending order

In [80]:
oRankDf
Out[80]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS Manhattan distance to Russia CPI Rank <lower indicates closer>
0 Afghanistan 59.61 23.21 74.3 4.44 0.4 93.85 1.5171 14
1 Haiti 45.00 47.67 73.1 0.09 3.4 114.99 1.7999 16
2 Nigeria 51.30 38.23 82.6 1.07 4.1 107.07 2.4493 15
3 Egypt 70.48 26.58 19.6 1.86 5.3 27.17 2.8622 7
4 Argentina 75.77 32.30 13.3 0.76 10.1 17.98 2.9961 1
5 China 74.87 29.98 13.7 1.95 6.4 21.07 3.6356 3
6 Brazil 73.12 42.93 14.5 1.43 7.2 29.39 3.7741 10
7 Israel 81.30 28.80 3.6 6.77 12.5 26.26 5.8069 6
8 U.S.A 78.51 29.85 6.3 4.72 13.7 18.07 7.1357 2
9 Ireland 80.15 27.23 3.5 0.60 11.5 28.15 7.5360 8
10 U.K. 80.09 28.49 4.4 2.59 13.0 22.64 7.7751 4
11 Germany 80.24 22.07 3.5 1.31 12.0 32.19 8.0461 12
12 Canada 80.99 24.79 4.9 1.42 14.2 29.11 8.6725 9
13 Australia 82.09 25.40 4.2 1.86 11.5 29.96 8.8442 11
14 Sweden 81.43 22.18 2.4 1.27 12.8 33.61 9.2985 13
15 New Zealand 80.67 27.81 4.9 1.13 12.3 25.36 9.4627 5

Normalized Data with Distance ranking and CPI in ascending order

In [81]:
nRankDf
Out[81]:
COUNTRY_ID LIFE_EXP TOP10_INCOME INFANT_MORT MIL_SPEND SCHOOL_YEARS Manhattan distance to Russia CPI Rank <lower indicates closer>
0 Afghanistan 0.393907 0.044531 0.896509 0.651198 0.000000 2.339693 1.5171 14
1 Haiti 0.000000 1.000000 0.881546 0.000000 0.217391 3.275534 1.7999 16
2 Nigeria 0.169857 0.631250 1.000000 0.146707 0.268116 2.657950 2.4493 15
3 Egypt 0.686978 0.176172 0.214464 0.264970 0.355072 1.247652 2.8622 12
4 Argentina 0.829604 0.399609 0.135910 0.100299 0.702899 0.953569 2.9961 3
5 China 0.805338 0.308984 0.140898 0.278443 0.434783 1.066451 3.6356 6
6 Brazil 0.758156 0.814844 0.150873 0.200599 0.492754 1.422164 3.7741 13
7 Israel 0.978700 0.262891 0.014963 1.000000 0.876812 1.024250 5.8069 5
8 U.S.A 0.903478 0.303906 0.048628 0.693114 0.963768 0.596446 7.1357 1
9 Ireland 0.947695 0.201563 0.013716 0.076347 0.804348 1.183673 7.5360 9
10 U.K. 0.946077 0.250781 0.024938 0.374251 0.913043 0.729507 7.7751 2
11 Germany 0.950121 0.000000 0.013716 0.182635 0.840580 1.245142 8.0461 11
12 Canada 0.970342 0.106250 0.031172 0.199102 1.000000 1.154176 8.6725 8
13 Australia 1.000000 0.130078 0.022444 0.264970 0.804348 1.110111 8.8442 7
14 Sweden 0.982205 0.004297 0.000000 0.176647 0.898551 1.234662 9.2985 10
15 New Zealand 0.961715 0.224219 0.031172 0.155689 0.862319 1.020268 9.4627 4