Please refer to the Jupyter NBExtension README page to display the table of contents in a floating window and expand the body of the contents.
The BLE RSSI (Received Signal Strength Indicator) Dataset for Indoor Localization was chosen for this assignment. The goal of our project is to identify the signal measurement patterns across all the observations in the dataset and to investigate whether clustering can group these patterns into an organized structure. We performed the following steps for data preparation.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import cluster
from IPython.display import display, HTML
from sklearn.cluster import KMeans
import math
import warnings
from sklearn import metrics
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Read iBeacon_RSSI_Labeled.csv, which is located in the same directory as this Jupyter notebook. Retain all the column names from the CSV file.
df_rssi = pd.read_csv("iBeacon_RSSI_Labeled.csv", parse_dates=['date'], index_col=False)
Check the size and data types of the loaded-in dataset
print("Shape of df_rssi:")
print(df_rssi.shape)
print("Datatypes of df_rssi:")
print(df_rssi.dtypes)
Check whether there are any missing values in the dataset.
df_rssi.isna().sum()
==> No missing values are found in the dataset.
Display the first 5 rows of the dataset. We notice that, in each observation, the value in the location column records the location grid cell (refer to iBeacon_layout.jpg) for the RSSI readings that come from the particular iBeacon(s) in the associated column(s).
df_rssi.head()
n_readings=df_rssi.shape[0]
n_readings
Set all the columns with prefix 'b3' as beacons_col and the number of beacons as n_beacons
beacons_col = df_rssi.columns[df_rssi.columns.str.find('b3', 0, 2)!=-1].tolist()
n_beacons=len(beacons_col)
n_beacons
Run a small test suite to verify that every value in the location column falls within the defined location grid 'A01' to 'W18'.
# Build the list of valid grid labels 'A01' .. 'W18'
grid_col1 = ['0' + s for s in map(str, range(1, 10))]
grid_col2 = list(map(str, range(10, 19)))
grid_col = grid_col1 + grid_col2
grid_all = list()
for c in 'ABCDEFGHIJKLMNOPQRSTUVW':
    grid_all.extend(c + sub for sub in grid_col)
# Check that every location value is a valid grid label
checkAllTrue = True
for i in np.arange(len(df_rssi['location'])):
    if df_rssi['location'][i] not in grid_all:
        print("We found a value not in the grid labels at row", i, "with value", df_rssi['location'][i])
        checkAllTrue = False
if checkAllTrue:
    print("All locations are within grid labels")
For out-of-range readings, RSSI is indicated by -200 (refer to https://www.speedcheck.org/wiki/rssi/). RSSI is measured in decibels from 0 (zero) to -120 (minus 120). Define a function checkRssi() to verify that every value in the beacon columns is either -200 or lies between -120 and 0. Run checkRssi().
def checkRssi():
    # Verify that every beacon reading is either -200 (out of range) or within [-120, 0)
    checkAllTrue = True
    for col in beacons_col:
        for i in np.arange(len(df_rssi[col])):
            check = ((df_rssi[col][i] == -200) | ((df_rssi[col][i] >= -120) & (df_rssi[col][i] < 0)))
            if check == False:
                print("We found a value outside the RSSI ranges at row", i, "for", col, "where the value is:", df_rssi[col][i])
                checkAllTrue = False
    if checkAllTrue == True:
        print("All RSSI values are valid")
checkRssi()
All these out-of-range values are close to -200, so impute them to -200.
# Impute any reading that is neither -200 nor within [-120, 0) to -200
for col in beacons_col:
    for i in np.arange(len(df_rssi[col])):
        check = ((df_rssi[col][i] == -200) | ((df_rssi[col][i] >= -120) & (df_rssi[col][i] < 0)))
        if check == False:
            df_rssi.loc[i, col] = -200  # use .loc to avoid pandas chained-assignment issues
Run checkRssi()
again
checkRssi()
Explore each column in the dataset. Since we do not need the time-stamp of each observation to identify the locations/coverage, we only need to inspect the beacon and location columns.
Check the summary statistics of the readings from each iBeacon
df_rssi.describe(include = np.number).round(2)
Since -200 indicates that an iBeacon is out of range (i.e. no reading) for the location in an observation, we remove all the -200 values from each iBeacon column to obtain the valid readings. After eliminating the -200 values in each iBeacon, we can inspect the summary statistics again and plot the range of valid readings for all iBeacons in the same graph.
valid_reading_all_beacon = pd.DataFrame()
for col in beacons_col:
    # Keep only the valid (non -200) readings for this beacon
    valid_reading_each_beacon = pd.DataFrame(df_rssi.loc[df_rssi[col]!=-200,col].reset_index(drop=True))
    print(valid_reading_each_beacon.describe(include = np.number))
    print('=========================================================')
    valid_reading_each_beacon['beacon']=col
    valid_reading_each_beacon.columns=['Readings', 'Beacon']
    valid_reading_each_beacon = valid_reading_each_beacon.reindex(columns=['Beacon', 'Readings'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    valid_reading_all_beacon = pd.concat([valid_reading_all_beacon, valid_reading_each_beacon], ignore_index=True, sort=False)
plt.figure(figsize=(16,10))
g=sns.boxplot(x="Beacon", y="Readings", data=valid_reading_all_beacon)
g.set_title('BoxPlot of Readings for all Beacons', fontsize = 25)
plt.xticks(rotation=45)
plt.savefig('1. BeaconReadings.png', dpi=300, bbox_inches='tight')
Let's look at the unique values in the location
column from the dataset.
loc_df=np.unique(df_rssi['location'], return_counts = True)
loc_df=pd.DataFrame(loc_df).T
loc_df.columns=["location", "count"]
loc_df
Plot the count of non -200
RSSI readings at each location
plt.figure(figsize=(25, 10))
plt.xticks(rotation='vertical')
sns.barplot(x='location', y='count', data=loc_df)
plt.title('Count of non "-200" RSSI readings at each location', fontsize = 25)
plt.savefig('2. non -200.png', dpi=300, bbox_inches='tight')
Set the number of unique locations as n_locations
n_locations = len(loc_df)
n_locations
After gaining a basic understanding of the dataset, we set the goal of the project: to identify the reading coverage of each iBeacon. With the above findings, we have completed the checks on the loaded dataset for this goal, and we can now move on to pairwise comparison.
(a) Assign a location code to each unique location and add a column to the master dataframe `df_rssi` that records the location code of each observation (or reading).
# Assign a sequential location code to each unique location
loc_df["location_code"] = np.arange(len(loc_df))
# Map each observation's location to its location code
loc_dict=loc_df.set_index('location')['location_code'].to_dict()
df_rssi['location_code']=df_rssi['location'].replace(loc_dict)
(b) Construct 2 matrices, `bLoc` and `bLocCnt`, each with dimension `n_locations` x `n_beacons`. The matrices represent the crosstab relationship between each location and each beacon; we would like to see which beacons produce readings at which locations. For example, the first row of the dataset has only 1 non "-200" value, in beacon `b3006` (beacon index=5), and its location is O02 (location code 58), so we set both bLoc[58,5] and bLocCnt[58,5] to 1. The second row has non "-200" values in both `b3006` and `b3005` and its location is also O02, so bLocCnt[58,5] is incremented to 2 while bLoc[58,5] stays 1, and bLoc[58,4] and bLocCnt[58,4] are set to 1, and so forth. Thus, for each row in the dataset, if the reading value of a beacon is not -200 we:
- set the cell in bLoc for the corresponding location and beacon to 1
- increment the cell in bLocCnt for the corresponding location and beacon
bLoc = np.zeros((n_locations, n_beacons))
bLocCnt = np.zeros((n_locations, n_beacons))
for i in range(0, n_readings):
    for j in range(0, n_beacons):
        # beacon readings are assumed to start at column index 2 of df_rssi
        if df_rssi.iloc[i][j+2] != -200:
            bLoc[df_rssi['location_code'][i], j] = 1
            bLocCnt[df_rssi['location_code'][i], j] = bLocCnt[df_rssi['location_code'][i], j] + 1
(c) We can sum up bLocCnt to find out the number of non -200 readings in the entire file
sum(sum(bLocCnt))
(d) Make a list of beacons from b3001 to b3013, as specified in the given data file
beacon_colunm1 = list(map(str, range(1, 10)))
beacon_colunm1 = ['b300' + sub for sub in beacon_colunm1]
beacon_colunm2 = list(map(str, range(10, 14)))
beacon_colunm2 = ['b30' + sub for sub in beacon_colunm2]
beacon_colunm=beacon_colunm1+beacon_colunm2
(e) Plot a heatmap to illustrate the relationship between beacon and location.
y_axis_labels = loc_df['location']
x_axis_labels = beacon_colunm
plt.figure(figsize=(16, 30))
sns.heatmap(bLocCnt, cmap="Blues", annot=True, fmt='.0f', yticklabels=y_axis_labels, xticklabels=x_axis_labels, linewidths=.5, cbar_kws={"shrink": 0.5})
plt.title("Locations with Presence of Readings from iBeacon", fontsize = 25)
plt.xlabel("Beacons", fontsize = 20)
plt.ylabel("Location", fontsize = 20)
plt.savefig('3.readingsLocBeacon.png', dpi=300, bbox_inches='tight')
The above heatmap shows the associations between beacons and locations. However, it gives no indication of how the observations in the given file should be grouped.
For example, a valid reading detected by the beacon combination of b3002 and b3003 can correspond to location I03, I05 or I06, etc., whereas a valid reading detected by the combination of b3002, b3004 and b3006 only comes up for locations I03 and K05, but not for I05 or I06.
We need a way to group our observations and relate each group to beacons and locations, so that we can use these groups as our target and see whether clustering can identify the target automatically. A quick cross-check of the b3002/b3003 example is shown below.
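As a hedged, illustrative cross-check of the example above (the column names b3002 and b3003 come from the dataset), the cell below lists every location that has at least one observation in which both b3002 and b3003 report a valid (non -200) reading; note that other beacons may also report in those observations.
# Illustrative cross-check: locations with at least one observation where both
# b3002 and b3003 have valid (non -200) readings (other beacons may also report)
both_mask = (df_rssi['b3002'] != -200) & (df_rssi['b3003'] != -200)
sorted(df_rssi.loc[both_mask, 'location'].unique())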
Let's group the observations and find the relationship of each group with beacons and locations
(a) Construct a matrix, `bReadings`, which has the dimension `n_readings` x `n_beacons`. The matrix represents the `non -200` values of the beacons in each reading: `1` stands for a `non -200` value, `0` stands for a `-200` value.
bReadings = np.zeros((n_readings, n_beacons))
for i in range(0, n_readings):
    for j in range(0, n_beacons):
        # beacon readings are assumed to start at column index 2 of df_rssi
        if df_rssi.iloc[i][j+2] != -200:
            bReadings[i, j] = 1
bReadings[0:15,:]
bReadings.shape
(b) Make a new column `readings_group` in dataframe `df_rssi`, initialize the value as `-1` for all observations
df_rssi['readings_group']=-1
df_rssi.head(5)
(c) Use a for loop to read all the observations and check whether the same beacon combination of non -200 readings appeared in a previous row; if so, classify them into the same readings group. Shorten the numpy array `bReadings` to contain no duplicate beacon combinations, and name it `bReadingsGroup` (an alternative numpy-based cross-check is sketched after the loop below).
# Give the first observation readings group 0
group_cnt = df_rssi.loc[0, "readings_group"] = 0
arr = [bReadings[0]]
for i in range(1, n_readings):
    last_row = i - 1
    for j in range(0, i):
        # If an earlier observation has the same beacon combination, reuse its group
        if np.array_equal(bReadings[i], bReadings[j]):
            df_rssi.loc[i, "readings_group"] = df_rssi.loc[j, "readings_group"]
            break
        # Otherwise, once every earlier row has been checked, open a new group
        if j == last_row:
            group_cnt += 1
            df_rssi.loc[i, "readings_group"] = group_cnt
            arr.append(bReadings[i])
bReadingsGroup = np.array(arr)
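As a minimal sketch of an equivalent (and faster) way to obtain the distinct beacon combinations, np.unique over the rows of bReadings returns each combination once, together with an inverse index mapping every observation to its combination. The numbering differs from the loop above (np.unique orders rows lexicographically rather than by first occurrence), so this is shown only as a cross-check.
# Cross-check only: distinct beacon combinations via np.unique over rows.
# 'inverse' maps each observation to a combination, but with a different
# numbering than the readings_group assigned by the loop above.
bReadingsGroup_alt, inverse = np.unique(bReadings, axis=0, return_inverse=True)
print("distinct beacon combinations:", len(bReadingsGroup_alt))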
(d) Construct a matrix, `bReadingsGroupCnt`, of the same dimension as `bReadingsGroup`, and use it to count the frequency of occurrence of each beacon combination over all observations. Also, make a list `locationReadingList` which records the locations that come up in dataframe `df_rssi` for each `bReadingsGroup`.
bReadingsGroupCnt=np.zeros((len(bReadingsGroup), n_beacons))
locationReadingList = [list() for i in range(len(bReadingsGroup))]
for i in range(0, n_readings):
for j in range(0, len(bReadingsGroup)):
if np.array_equal(bReadings[i],bReadingsGroup[j]):
bReadingsGroupCnt[j]=bReadingsGroupCnt[j]+bReadings[i]
if df_rssi['location'][i] not in locationReadingList[j]:
locationReadingList[j].append(df_rssi['location'][i])
(e) Construct a list `beaconReadingList` which will record the beacon combination for each `bReadingsGroup`.
beaconReadingList = [list() for i in range(len(bReadingsGroup))]
for i in range(0, len(bReadingsGroup)):
    bArray = np.argwhere(bReadingsGroup[i] != 0).reshape(-1)
    for j in range(0, len(bArray)):
        # Map the beacon index back to its column name (b3001 .. b3013);
        # beacons_col follows the same column order used to build bReadings
        beacon = beacons_col[bArray[j]]
        beaconReadingList[i].append(beacon)
(f) Show the length of `bReadingsGroup`
len(bReadingsGroup)
(g) Summing up bReadingsGroupCnt also gives the number of non -200 readings in the entire file
sum(sum(bReadingsGroupCnt))
(h) Define a function listToString to combine a list into a string
def listToString(s):
    # Join the list elements into a single space-separated string
    return " ".join(s)
(i) Construct a dataframe `readingList_df` which shows the readings group, the combination of beacons and the locations where it occurs, side by side
count = 0
readingList_df = pd.DataFrame(columns=["readings_group", "beacon_list", "location_list"])
for i in range(0, len(bReadingsGroup)):
beaconStr=listToString(beaconReadingList[i])
locationStr=listToString(locationReadingList[i])
readingList_df.loc[len(readingList_df)] = [count, beaconStr, locationStr]
count+=1
pd.options.display.max_colwidth = None
display(HTML('<b>Table 1: Readings groups by beacon combination and locations of occurrence</b>'))
readingList_df
readingList_df.to_csv('a. readingList_df.csv', index=False)
(j) Plot a heatmap to illustrate the relationship between bReadingsGroupCnt and the beacon combinations.
import matplotlib.pyplot as plt
x_axis_labels=beacon_colunm
y_axis_labels = readingList_df['beacon_list']
plt.figure(figsize=(16, 16))
sns.heatmap(bReadingsGroupCnt, cmap="Greens", annot=True, fmt='.0f', yticklabels=y_axis_labels, xticklabels =x_axis_labels, linewidths=.5, cbar_kws={"shrink": 0.5})
plt.title("Readings distributions with Beacons", fontsize = 25)
plt.xlabel("Beacons", fontsize = 20)
plt.ylabel("Beacon Combination (Readings Group)", fontsize = 20)
plt.savefig('4.readingGroupBeacon.png', dpi=300, bbox_inches='tight')
(k) Construct a new dataframe `df_rssi_readingList` which shows the count of observations grouped by location and readings_group.
df_rssi_readingList=df_rssi.groupby(['location', 'readings_group']).size().reset_index()
df_rssi_readingList.columns=['location', 'readings_group', 'count']
df_rssi_readingList
(l) Map the beacon combination list to df_rssi_readingList.
beaconList_dict=readingList_df.set_index('readings_group')['beacon_list'].to_dict()
df_rssi_readingList['beacon_list']=df_rssi_readingList['readings_group'].replace(beaconList_dict)
df_rssi_readingList = df_rssi_readingList.reindex(columns=['location', 'readings_group', 'beacon_list', 'count']).reset_index(drop=True)
df_rssi_readingList.head(15)
df_rssi_readingList.to_csv('b. df_rssi_readingList.csv', index=False)
(m) Row 1 of Table 1 can be explained by filtering `df_rssi_readingList` for readings_group == 0, together with an associated bar chart.
df_rssi_readingList[df_rssi_readingList['readings_group']==0]
plt.figure(figsize=(10, 7))
sns.barplot(x='location', y='count', hue="beacon_list", data=df_rssi_readingList[df_rssi_readingList['readings_group']==0])
plt.title("Location where b3006 can detect signal", fontsize = 20)
plt.savefig('5.b300Loc.png', dpi=300, bbox_inches='tight')
(n) Any location in the bar chart, for example K08, can be explained by filtering df_rssi_readingList for location == 'K08', together with an associated bar chart.
df_rssi_readingList[df_rssi_readingList['location']=='K08']
plt.title("Beacon combinations which detect \nsignal at location K08", fontsize = 18)
g=sns.barplot(x='location', y='count', hue="beacon_list", data=df_rssi_readingList[df_rssi_readingList['location']=='K08'])
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.savefig('6. kn08.png', dpi=300, bbox_inches='tight')
Read all the RSSI readings into an array array_rssi.
array_rssi=df_rssi[beacons_col].values
array_rssi.shape
(a) Construct the k-distance graph to determine the eps parameter for DBSCAN.
from sklearn.neighbors import NearestNeighbors
nbrs=NearestNeighbors().fit(array_rssi)
distances, indices=nbrs.kneighbors(array_rssi, 20)
kDis=distances[:,10]
kDis2=distances[:,5]
kDis3=distances[:,1]
kDis.sort()
kDis2.sort()
kDis3.sort()
kDis=kDis[range(len(kDis)-1, 0, -1)]
kDis2=kDis2[range(len(kDis2)-1, 0, -1)]
kDis3=kDis3[range(len(kDis3)-1, 0, -1)]
plt.xlabel('observations')
plt.ylabel('distance')
plt.plot(range(0, len(kDis)), kDis, color='blue', label="10th neighbour")
plt.plot(range(0, len(kDis2)), kDis2, color='orange', label="5th neighbour")
plt.plot(range(0, len(kDis3)), kDis3, color='green', label="1st neighbour")
plt.legend(loc="upper right")
plt.title("Determine Eps - k Distance Graph \nfor DBScan Clustering", fontsize = 18)
plt.savefig('7. EPS.png', dpi=300, bbox_inches='tight')
The k-distance graph helps us determine a suitable eps for fitting DBSCAN clustering.
Besides eps, DBSCAN has another parameter, min_samples (we chose to tune only eps in this assignment), which defaults to 5 and corresponds to the 5th-neighbour curve in the k-distance graph. The orange line suggests the elbow occurs at around 13 for the 5th neighbour, so we fit our original data (an n_readings x n_beacons matrix of RSSI measurements) into DBSCAN with eps=13 and min_samples=5 (the default).
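The eps=13 value above was read off the plot by eye. As a hedged cross-check (not part of the original selection), the elbow of the sorted 5th-neighbour curve can also be located programmatically as the point furthest from the straight line joining the two ends of the curve:
# Illustrative elbow finder: index of the point on the curve with the maximum
# perpendicular distance to the chord joining the curve's endpoints
def find_elbow(curve):
    n = len(curve)
    x = np.arange(n, dtype=float)
    line_vec = np.array([n - 1, curve[-1] - curve[0]], dtype=float)
    line_vec /= np.linalg.norm(line_vec)
    vecs = np.column_stack([x, curve - curve[0]])
    proj = vecs @ line_vec                     # scalar projections onto the chord
    perp = vecs - np.outer(proj, line_vec)     # components perpendicular to the chord
    return int(np.argmax(np.linalg.norm(perp, axis=1)))
elbow_idx = find_elbow(kDis2)                  # kDis2 holds the sorted 5th-neighbour distances
print("suggested eps from the 5th-neighbour curve:", round(float(kDis2[elbow_idx]), 1))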
(b) Perform DBSCAN by fitting `array_rssi` with the optimal eps from the k-distance graph.
dbs_13 = cluster.DBSCAN(eps=13)
dbs_fit_13=dbs_13.fit(array_rssi)
(c) Find the label from the fitted model. Check the number of clusters in the label.
labels_dbs_13=dbs_fit_13.labels_
np.unique(labels_dbs_13, return_counts = True)
(d) Construct a dataframe `X` which will compare the cluster labels with the target (which is the readings_group in `df_rssi` dataframe)
df_rssi['beacon_list']=df_rssi['readings_group'].replace(beaconList_dict)
X=pd.DataFrame(array_rssi)
X['cluster']=labels_dbs_13
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
X.head(10)
(e) Use the crosstab function to construct a confusion matrix.
cm_dbs13=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_dbs13
cm_dbs13.to_csv('c1. cm_dbs13.csv', index=True)
The above confusion matrix shows that target group 00_b3006 has been clustered into group 0 by DBSCAN, 01_b3003 into group 1, and so on. However, groups 07_b3002 b3003 b3005 and 09_b3003 b3004 b3006 have both been assigned to group -1, which is the outlier group in DBSCAN.
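As a small illustrative follow-up (using the cm_dbs13 confusion matrix from above), we can count how many target groups fall entirely into DBSCAN's outlier cluster -1 at eps=13:
# Target groups whose only non-zero column in the confusion matrix is the outlier cluster -1
non_outlier = cm_dbs13.drop(columns=[-1], errors='ignore')
entirely_outlier = non_outlier.index[non_outlier.sum(axis=1) == 0]
print(len(entirely_outlier), "target group(s) are labelled entirely as outliers at eps=13")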
(f) Run some of the clustering metrics from sklearn to compare the target and the fitted model
label_target=df_rssi['readings_group'].values
print("adjusted_mutual_info :", metrics.adjusted_mutual_info_score(label_target, labels_dbs_13))
print("adjusted_rand_score :", metrics.adjusted_rand_score(label_target, labels_dbs_13))
print("normalized_mutual_info :", metrics.normalized_mutual_info_score(label_target, labels_dbs_13))
print("fowlkes_mallows_score :", metrics.fowlkes_mallows_score(label_target, labels_dbs_13))
print("silhouette_score :", metrics.silhouette_score(array_rssi, labels_dbs_13))
print("calinski_harabasz_score :", metrics.calinski_harabasz_score(array_rssi, labels_dbs_13))
print("davies_bouldin_score :", metrics.davies_bouldin_score(array_rssi, labels_dbs_13))
print("homogeneity_completeness_v_measure :", metrics.homogeneity_completeness_v_measure(label_target, labels_dbs_13))
There are many performance evaluation metrics for clustering, each with its own advantages and drawbacks. For a more rounded evaluation, we look at the following 4 metrics (a small toy example follows the list below):
- Adjusted mutual information score: measures the agreement of the two assignments, ignoring permutations; it is normalized against chance and the score is symmetric, so swapping the arguments does not change it.
- Homogeneity: each cluster contains only members of a single class.
- Completeness: all members of a given class are assigned to the same cluster.
- V-measure: the harmonic mean of homogeneity and completeness.
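As a toy illustration (the labels below are made up, not taken from the dataset), the following cell shows how these metrics behave when one cluster merges two target classes, and confirms that the v-measure equals the harmonic mean of homogeneity and completeness:
# Toy example: cluster 1 merges target classes 1 and 2, so homogeneity drops
# while completeness stays perfect; v_measure is the harmonic mean of the two
toy_target  = [0, 0, 1, 1, 2, 2]
toy_cluster = [0, 0, 1, 1, 1, 1]
h, c, v = metrics.homogeneity_completeness_v_measure(toy_target, toy_cluster)
print("homogeneity :", round(h, 3))
print("completeness:", round(c, 3))
print("v_measure   :", round(v, 3), "=", round(2 * h * c / (h + c), 3))
print("ami_score   :", round(metrics.adjusted_mutual_info_score(toy_target, toy_cluster), 3))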
(g) We sweep the eps value from 2 to 50 in a for loop, fitting the model repeatedly, and record the homogeneity, completeness, v-measure and adjusted mutual information of each model in the dataframe `dbs_metric`.
dbs_metric=pd.DataFrame(columns=["eps", "metric","score"])
dbs_cluster_num=pd.DataFrame(columns=["eps", "num_cluster"])
for eps_num in range(2, 50):
dbs = cluster.DBSCAN(eps=eps_num)
dbs_fit=dbs.fit(array_rssi)
labels_dbs=dbs_fit.labels_
a = metrics.adjusted_mutual_info_score(label_target, labels_dbs)
h,c,v= metrics.homogeneity_completeness_v_measure(label_target, labels_dbs)
dbs_metric.loc[len(dbs_metric)] = [eps_num, "homogeneity", h]
dbs_metric.loc[len(dbs_metric)] = [eps_num, "completeness", c]
dbs_metric.loc[len(dbs_metric)] = [eps_num, "v_measure", v]
dbs_metric.loc[len(dbs_metric)] = [eps_num, "ami_score", a]
dbs_cluster_num.loc[len(dbs_cluster_num)] = [eps_num, len(np.unique(labels_dbs, return_counts = True)[0])]
(h) Visualize the `dbs_metric`.
plt.figure(figsize=(10, 7))
sns.lineplot(x='eps', y='score', hue="metric", data=dbs_metric)
plt.title("Performance analysis for DBScan \nClustering with different eps", fontsize = 18)
plt.savefig('8. DBScan perform.png', dpi=300, bbox_inches='tight')
(i) Visualize the relationship between eps and the number of clusters for this dataset.
plt.plot(dbs_cluster_num['eps'], dbs_cluster_num['num_cluster'])
plt.title("Relationship between eps and num_cluster in DBScan")
plt.xlabel("EPS", fontsize = 15)
plt.ylabel("Number of Cluster", fontsize = 15)
plt.savefig('8a. DBScan cluster.png', dpi=300, bbox_inches='tight')
(j) Compare the confusion matrices for an eps value with worse performance (eps=3) and one with better performance (eps=21)
dbs_metric_sorted=dbs_metric.pivot(index='eps', columns='metric', values='score').reset_index().rename_axis(None, axis=1)
dbs_metric_sorted=dbs_metric_sorted.sort_values(by=["completeness", "homogeneity", "v_measure", "ami_score"], ascending=[False, False, False, False])
dbs_metric_sorted.head(1)
dbs_3 = cluster.DBSCAN(eps=3)
dbs_fit_3=dbs_3.fit(array_rssi)
labels_dbs_3=dbs_fit_3.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_dbs_3
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_dbs3=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_dbs3
dbs_21 = cluster.DBSCAN(eps=21)
dbs_fit_21=dbs_21.fit(array_rssi)
labels_dbs_21=dbs_fit_21.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_dbs_21
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_dbs21=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_dbs21
cm_dbs3.to_csv('c2. cm_dbs3.csv', index=True)
cm_dbs21.to_csv('c3. cm_dbs21.csv', index=True)
We found that for DBSCAN clustering on this dataset, performance jumps at eps = 5 and flattens out after eps = 21 (21 is the smallest eps with the best score across the 4 metrics). Comparing the confusion matrices for 3 models with different eps values (3, 13 and 21), we have the following findings:
Eps = 3: Besides the outlier group, DBSCAN groups the data into 28 clusters, and each of these clusters belongs to only one single target group. (Homogeneity would therefore be 1; however, because of the outlier group, the model still scores low on homogeneity.) 58 out of 63 target groups have some members labelled as outliers. 17 of these 58 groups have been split by DBSCAN into 2 or more clusters (for example, in group 06_b3002 b3003, 73 members are clustered into group 5, 7 members into group 8 and 9 members into the outliers), while 41 of them are classified entirely as outliers (for example, group 13_b3002 b3003 b3006, with 19 members, is classified entirely as outliers).
Eps = 13: DBSCAN groups the data into 31 clusters, each belonging to only one single target group. No target group has been split into separate clusters, and 32 groups are classified entirely as outliers.
Eps = 21: DBSCAN groups the data into 32 clusters, each belonging to only one single target group. No target group has been split into separate clusters, and 31 groups are classified entirely as outliers. The only difference compared with eps=13 is group 32_b3010 b3011, whose 7 members are now identified as a cluster of their own instead of outliers. All remaining outliers now come from target groups with fewer than 5 members (4 target groups with 4 members, 2 target groups with 3 members, and the rest with just 1 or 2 members).
Since the number of clusters does not change much across these eps values, we plotted the graph in step (i) above to show the relationship between eps and the number of clusters in DBSCAN; the number of clusters stays within the range of 29 to 33.
(a) Construct the average-distance-to-centroids graph to determine an optimal `n_clusters` value for K-means clustering. The function below is written for this dataset: the 13 beacons give 13 dimensions for calculating the distance of each observation to its centroid.
def avgDistToCentroids(dataArray, k_dist):
    # For each candidate k, fit KMeans and compute the average Euclidean distance
    # of every observation (a 13-dimensional RSSI vector) to its cluster centroid
    k_disValues = np.zeros(len(k_dist))
    for cur_k_ind in range(0, len(k_dist)):
        # try each k value, starting from the first one with index 0
        K_rssi = k_dist[cur_k_ind]
        km_rssi = KMeans(n_clusters=K_rssi)
        labels_rssi = km_rssi.fit(dataArray).labels_
        # distance of each observation to the centroid of its assigned cluster
        dists = np.linalg.norm(dataArray - km_rssi.cluster_centers_[labels_rssi], axis=1)
        k_disValues[cur_k_ind] = dists.sum() / n_readings
    return k_disValues
k_dist = range(20,105)
print(k_dist)
k_disValues = avgDistToCentroids(array_rssi, k_dist)
plt.plot(k_dist, k_disValues)
plt.xlabel('k values')
plt.ylabel('average distance to the centroids')
plt.title("Determine number of clusters - k values and \ndistance to centroids Graph for Kmean Clustering", fontsize = 18)
plt.savefig('9. avgDistCentroids.png', dpi=300, bbox_inches='tight')
From the graph, the curve decreases slowly and we cannot find a distinctive elbow where the trend flattens. Thus we chose n_clusters=60 as an arbitrary value for fitting our original data (an n_readings x n_beacons matrix of RSSI measurements) with K-means clustering. A hedged silhouette-based sanity check of this choice is sketched below.
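Since n_clusters=60 is an arbitrary pick, the cell below sketches a hedged sanity check (not used for the final choice): the average silhouette score over a coarse grid of k values, sampled every 10th k to keep the run time modest.
# Illustrative sanity check of the arbitrary n_clusters choice via silhouette scores
for k in range(20, 101, 10):
    labels_k = KMeans(n_clusters=k, random_state=999).fit_predict(array_rssi)
    print("k =", k, " silhouette =", round(metrics.silhouette_score(array_rssi, labels_k), 3))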
(b) Perform KMeans clustering by fitting `array_rssi` with n_clusters=60
km_rssi = KMeans(n_clusters=60, random_state=999)
km_fit = km_rssi.fit(array_rssi)
(c) Find the label from the fitted model. Check the number of clusters in the label.
labels_kmean=km_fit.labels_
np.unique(labels_kmean, return_counts = True)
(d) Construct a dataframe `X` which will compare the cluster labels with the target (which is the readings_group in `df_rssi` dataframe)
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
X.head(10)
(e) Use the crosstab function to construct a confusion matrix.
cm_kmean=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean
We observe that there is no outlier group in K-means clustering. The above table shows that 00_b3006 has been clustered into group 6, 01_b3003 into group 7, and so on.
(f) Run some of the clustering metrics from sklearn to compare the target and the fitted model
print("adjusted_mutual_info :", metrics.adjusted_mutual_info_score(label_target, labels_kmean))
print("adjusted_rand_score :", metrics.adjusted_rand_score(label_target, labels_kmean))
print("normalized_mutual_info :", metrics.normalized_mutual_info_score(label_target, labels_kmean))
print("fowlkes_mallows_score :", metrics.fowlkes_mallows_score(label_target, labels_kmean))
print("silhouette_score :", metrics.silhouette_score(array_rssi, labels_kmean))
print("calinski_harabasz_score :", metrics.calinski_harabasz_score(array_rssi, labels_kmean))
print("davies_bouldin_score :", metrics.davies_bouldin_score(array_rssi, labels_kmean))
print("homogeneity_completeness_v_measure :", metrics.homogeneity_completeness_v_measure(label_target, labels_kmean))
(g) As with DBSCAN, we evaluate with the same 4 metrics. We sweep n_clusters from 15 to 80 in a for loop, fitting the model repeatedly, and record the metrics of each model in the dataframe `kmeans_metric`.
kmeans_metric=pd.DataFrame(columns=["n_cluster", "metric","score"])
for cluster_num in range(15, 80):
km_rssi = KMeans(n_clusters=cluster_num, random_state=999)
labels_kmean=km_rssi.fit(array_rssi).labels_
a = metrics.adjusted_mutual_info_score(label_target, labels_kmean)
h,c,v= metrics.homogeneity_completeness_v_measure(label_target, labels_kmean)
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "homogeneity", h]
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "completeness", c]
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "v_measure", v]
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "ami_score", a]
(h) Visualize the `kmeans_metric`.
g=sns.lineplot(x='n_cluster', y='score', hue="metric", data=kmeans_metric)
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Performance analysis for Kmean \nClustering with different n_cluster", fontsize = 18)
plt.savefig('10. KMean perform.png', dpi=300, bbox_inches='tight')
(i) Compare the confusion matrices for n_clusters values with worse performance (n_clusters=30 and 70) and better performance (n_clusters=62)
kmeans_metric_sorted=kmeans_metric.pivot(index='n_cluster', columns='metric', values='score').reset_index().rename_axis(None, axis=1)
kmeans_metric_sorted=kmeans_metric_sorted.sort_values(by=["completeness", "homogeneity", "v_measure", "ami_score"], ascending=[False, False, False, False])
kmeans_metric_sorted.head(5)
km_rssi_62 = KMeans(n_clusters=62, random_state=999)
km_fit_62 = km_rssi_62.fit(array_rssi)
labels_kmean_62=km_fit_62.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean_62
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_kmean_62= pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean_62
km_rssi_30 = KMeans(n_clusters=30, random_state=999)
km_fit_30 = km_rssi_30.fit(array_rssi)
labels_kmean_30=km_fit_30.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean_30
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_kmean_30= pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean_30
km_rssi_70 = KMeans(n_clusters=70, random_state=999)
km_fit_70 = km_rssi_70.fit(array_rssi)
labels_kmean_70=km_fit_70.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean_70
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_kmean_70= pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean_70
cm_kmean_62.to_csv('d1. cm_kmean_62.csv', index=True)
cm_kmean_70.to_csv('d2. cm_kmean_70.csv', index=True)
cm_kmean_30.to_csv('d3. cm_kmean_30.csv', index=True)
We found that for KMeans clustering on this dataset, performance on v_measure, ami_score and homogeneity increases gradually until n_cluster reaches 62 (62 is the n_cluster with the best score on all 4 metrics). Completeness stays close to its perfect score until n_cluster reaches 62. Beyond 62, v_measure, ami_score and completeness drop dramatically, while homogeneity stays flat. To investigate this further, we compare the confusion matrices for 3 models with different n_cluster values (30, 62 and 70) and have the following findings:
N_cluster = 30: KMeans groups the data into 30 clusters, of which only 7 contain a single target group. All target groups are kept within a single cluster, except group 09_b3003 b3004 b3006. On closer inspection, whenever a cluster contains multiple target groups, those target groups share similar beacon combinations; for example, cluster 4 comprises 5 target groups, namely 06_b3002 b3003, 07_b3002 b3003 b3005, 12_b3002 b3003 b3007, 56_b3002 b3003 b3004 and 62_b3002 b3003 b3004 b3007, all containing the combination of b3002 and b3003. Apart from group 06_b3002 b3003, which has 89 members, these target groups have only 1 or 2 members.
N_cluster = 62: KMeans groups the data into 62 clusters; only cluster 48 comprises two target groups, 21_b3004 b3005 (2 members) and 57_b3004 b3005 b3006 (1 member). All other clusters and target groups have a one-to-one mapping.
N_cluster = 70: KMeans groups the data into 70 clusters, each belonging to one single target group. However, 6 target groups have been split into separate clusters; for example, 03_b3004 has been split into clusters 0 and 66. This is clearly the result of having too many clusters for mapping 63 target groups.
Compare the homogeneity and completeness of all our confusion matrices for DBSCAN and KMeans
(a) Define a function to check each column for homogeneity and each row for completeness
def checkConfusionMatrix(cm):
perfect_homogeneity = True
perfect_completeness = True
print("Homogeneity Check:")
homo_group = 0
complete_group = 0
for col in cm.columns:
homogeneity_cnt = (cm[col]!=0).sum()
if(homogeneity_cnt != 1):
print("================================================================")
print("cluster:", col, "contains", homogeneity_cnt, "target group(s)")
print("target group: ", cm[cm[col]!=0].index.values, "are both grouped into cluster", col)
perfect_homogeneity = False
else:
homo_group = homo_group + 1
if(perfect_homogeneity == True):
print("All cluster only contains one single target group")
else:
print("\nHowever,", homo_group, "out of", len(cm.columns), "clusters contain(s) one single target group")
print("\nCompleteness Check:")
for row in cm.index:
completeness_cnt = (cm.loc[row]!=0).sum()
target_member = 0
if(completeness_cnt != 1):
print("================================================================")
perfect_completeness = False
print("target:", row, "has been split into", completeness_cnt, "cluster group")
cluster=list(np.where(cm.loc[row]!=0))[0]
for i in cluster:
col=cm.columns[i]
target_member = target_member + cm[col][row]
print("while cluster", col , "contains", cm[col][row], "members")
print("there should be ", target_member, "members in target group", row)
else:
complete_group = complete_group + 1
if(perfect_completeness == True):
print("All target groups are grouped into same cluster")
else:
print("\nHowever,", complete_group, "out of", len(cm.index), "target groups have been grouped into same cluster")
checkConfusionMatrix(cm_dbs3)
checkConfusionMatrix(cm_dbs13)
checkConfusionMatrix(cm_dbs21)
checkConfusionMatrix(cm_kmean_30)
checkConfusionMatrix(cm_kmean_62)
checkConfusionMatrix(cm_kmean_70)
To summarize our analysis, we have achieved our goal of identifying the signal measurement patterns and using clustering to group all the data into an organized structure.
For this dataset, for all the eps values we tested (ranging from 3 to 50), DBSCAN differentiates the data so that every cluster contains only one single target group. DBSCAN uses its outlier group "-1" to hold all the unidentified observations. If the eps value is too small, DBSCAN splits target groups into different clusters. At the best eps value, DBSCAN does not split any target group and simply leaves the low-density target groups as outliers. With eps values higher than the optimum, DBSCAN's performance stays the same and the model is not over-tuned.
For K-means clustering, for all the n_clusters values we tested (ranging from 30 to 70), if n_clusters is too low, K-means merges multiple target groups with similar beacon combinations into one cluster, but does not split any target group across clusters. When n_clusters is well chosen, K-means differentiates the data into an almost 1:1 mapping with the target groups. Beyond the optimal value, however, K-means performance drops and target groups are split into different clusters, although each cluster still contains only one single target group.
Even though we have found an almost perfect n_clusters value that maps clusters and targets 1:1, we would still recommend DBSCAN for processing this dataset: it is comparatively insensitive to its input parameter, the undefined clusters are actually sparse beacon combinations that come up infrequently in the dataset, and tuning eps only improves the clustering of the less dense target groups. The near-perfect n_clusters value for K-means may only work for this dataset and this particular number of beacon-combination groups, and a small change in n_clusters could give much worse clustering results.