Please refer to the Jupyter NBExtension README page to display the table of contents in a floating window and expand the body of the contents.
The BLE RSSI (Received Signal Strength Indicator) Dataset for Indoor Localization was chosen for this assignment. The goal of our project is to identify the signal measurement patterns across all the observations in the dataset and to investigate whether clustering can group these patterns into an organized structure. We performed the following steps for data preparation.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import cluster
from IPython.display import display, HTML
from sklearn.cluster import KMeans
import math
import warnings
from sklearn import metrics
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Read iBeacon_RSSI_Labeled.csv, which is located in the same directory as this Jupyter notebook. Retain all the column names from the CSV file.
df_rssi = pd.read_csv("iBeacon_RSSI_Labeled.csv", parse_dates=['date'], index_col=False)
Check the size and data types of the loaded-in dataset
print("Shape of df_rssi:")
print(df_rssi.shape)
print("Datatypes of df_rssi:")
print(df_rssi.dtypes)
Check whether there are any missing values in the dataset.
df_rssi.isna().sum()
==> No missing values are found in the dataset.
Display the first 5 rows of the dataset. We notice that, in each observation, the value in the location column records the location grid cell (refer to iBeacon_layout.jpg) for the RSSI readings that come from the particular iBeacon(s) in the associated column(s).
df_rssi.head()
n_readings=df_rssi.shape[0]
n_readings
Set all the columns with prefix 'b3' as beacons_col and the number of beacons as n_beacons
beacons_col = df_rssi.columns[df_rssi.columns.str.find('b3', 0, 2)!=-1].tolist()
n_beacons=len(beacons_col)
n_beacons
Run a small test suite to verify that every value in the location column falls within the defined location grid 'A01' to 'W18'.
# Build the list of valid grid labels 'A01' .. 'W18'
grid_col1 = ['0' + s for s in map(str, range(1, 10))]
grid_col2 = list(map(str, range(10, 19)))
grid_col = grid_col1 + grid_col2
grid_all = list()
for c in 'ABCDEFGHIJKLMNOPQRSTUVW':
    grid_all.extend(c + sub for sub in grid_col)
# Check that every location value is a valid grid label
checkAllTrue = True
for i in np.arange(len(df_rssi['location'])):
    if df_rssi['location'][i] not in grid_all:
        print("We found a value not in the grid labels at row", i, "with value", df_rssi['location'][i])
        checkAllTrue = False
if checkAllTrue:
    print("All locations are within grid labels")
For out-of-range readings, RSSI is indicated by -200 (refer to https://www.speedcheck.org/wiki/rssi/). RSSI is measured in decibels from 0 (zero) to -120 (minus 120). Define a function checkRssi() to verify that every value in the beacon columns is either -200 or lies between -120 and 0. Run checkRssi().
def checkRssi():
    # Verify that every beacon reading is either -200 (out of range) or within [-120, 0)
    checkAllTrue = True
    for col in beacons_col:
        for i in np.arange(len(df_rssi[col])):
            check = ((df_rssi[col][i] == -200) | ((df_rssi[col][i] >= -120) & (df_rssi[col][i] < 0)))
            if check == False:
                print("We found a value outside the RSSI ranges at row", i, "for", col, "where the value is:", df_rssi[col][i])
                checkAllTrue = False
    if checkAllTrue == True:
        print("All RSSI values are valid")
checkRssi()
All these out-of-range values are close to -200, so impute them to -200.
# Impute any reading that is neither -200 nor within [-120, 0) to -200
for col in beacons_col:
    for i in np.arange(len(df_rssi[col])):
        check = ((df_rssi[col][i] == -200) | ((df_rssi[col][i] >= -120) & (df_rssi[col][i] < 0)))
        if check == False:
            df_rssi.loc[i, col] = -200  # use .loc to avoid pandas chained-assignment issues
Run checkRssi()
again
checkRssi()
Explore each column in the dataset. Since we do not need the time-stamp of each observation to identify the locations/coverage, we only need to inspect the beacon and location columns.
Check the summary statistics of the readings from each iBeacon
df_rssi.describe(include = np.number).round(2)
Since -200 indicates that an iBeacon is out of range (i.e. no reading) for the location in an observation, we remove all the -200 values from each iBeacon column to obtain the valid readings. After eliminating the -200 values in each iBeacon, we can inspect the summary statistics again and plot the range of valid readings for all iBeacons in the same graph.
valid_reading_all_beacon = pd.DataFrame()
for col in beacons_col:
    # Keep only the valid (non -200) readings for this beacon
    valid_reading_each_beacon = pd.DataFrame(df_rssi.loc[df_rssi[col]!=-200,col].reset_index(drop=True))
    print(valid_reading_each_beacon.describe(include = np.number))
    print('=========================================================')
    valid_reading_each_beacon['beacon']=col
    valid_reading_each_beacon.columns=['Readings', 'Beacon']
    valid_reading_each_beacon = valid_reading_each_beacon.reindex(columns=['Beacon', 'Readings'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    valid_reading_all_beacon = pd.concat([valid_reading_all_beacon, valid_reading_each_beacon], ignore_index=True, sort=False)
plt.figure(figsize=(16,10))
g=sns.boxplot(x="Beacon", y="Readings", data=valid_reading_all_beacon)
g.set_title('BoxPlot of Readings for all Beacons', fontsize = 25)
plt.xticks(rotation=45)
plt.savefig('1. BeaconReadings.png', dpi=300, bbox_inches='tight')
Let's look at the unique values in the location
column from the dataset.
loc_df=np.unique(df_rssi['location'], return_counts = True)
loc_df=pd.DataFrame(loc_df).T
loc_df.columns=["location", "count"]
loc_df
Plot the count of non -200
RSSI readings at each location
plt.figure(figsize=(25, 10))
plt.xticks(rotation='vertical')
sns.barplot(x='location', y='count', data=loc_df)
plt.title('Count of non "-200" RSSI readings at each location', fontsize = 25)
plt.savefig('2. non -200.png', dpi=300, bbox_inches='tight')
Set the number of unique locations as n_locations
n_locations = len(loc_df)
n_locations
After gaining a basic understanding of the dataset, we set the goal of the project: to identify the reading coverage of each iBeacon. With the above findings, we have completed the checks on the loaded dataset for this goal, and we can now move on to pairwise comparison.
(a) Assign a location code to each unique location and add a column to the master dataframe `df_rssi` that records the location code of each observation (or reading).
# Assign a sequential location code to each unique location
loc_df["location_code"] = np.arange(len(loc_df))
# Map each observation's location to its location code
loc_dict=loc_df.set_index('location')['location_code'].to_dict()
df_rssi['location_code']=df_rssi['location'].replace(loc_dict)
(b) Construct 2 matrices, `bLoc` and `bLocCnt`, each with dimension `n_locations` x `n_beacons`. The matrices represent the crosstab relationship between each location and each beacon; we would like to see which beacons produce readings at which locations. For example, the first row of the dataset has only 1 non "-200" value, in beacon `b3006` (beacon index=5), and its location is O02 (location code 58), so we set both bLoc[58,5] and bLocCnt[58,5] to 1. The second row has non "-200" values in both `b3006` and `b3005` and its location is also O02, so bLocCnt[58,5] is incremented to 2 while bLoc[58,5] stays 1, and bLoc[58,4] and bLocCnt[58,4] are set to 1, and so forth. Thus, for each row in the dataset, if the reading value of a beacon is not -200 we:
- set the cell in bLoc for the corresponding location and beacon to 1
- increment the cell in bLocCnt for the corresponding location and beacon
bLoc = np.zeros((n_locations, n_beacons))
bLocCnt = np.zeros((n_locations, n_beacons))
for i in range(0, n_readings):
    for j in range(0, n_beacons):
        # beacon readings are assumed to start at column index 2 of df_rssi
        if df_rssi.iloc[i][j+2] != -200:
            bLoc[df_rssi['location_code'][i], j] = 1
            bLocCnt[df_rssi['location_code'][i], j] = bLocCnt[df_rssi['location_code'][i], j] + 1
(c) We can sum up bLocCnt to find out the number of non -200 readings in the entire file
sum(sum(bLocCnt))
(d) Make a list of beacons from b3001 to b3013, as specified in the given data file
beacon_colunm1 = list(map(str, range(1, 10)))
beacon_colunm1 = ['b300' + sub for sub in beacon_colunm1]
beacon_colunm2 = list(map(str, range(10, 14)))
beacon_colunm2 = ['b30' + sub for sub in beacon_colunm2]
beacon_colunm=beacon_colunm1+beacon_colunm2
(e) Plot a heatmap to illustrate the relationship between beacon and location.
y_axis_labels = loc_df['location']
x_axis_labels = beacon_colunm
plt.figure(figsize=(16, 30))
sns.heatmap(bLocCnt, cmap="Blues", annot=True, fmt='.0f', yticklabels=y_axis_labels, xticklabels=x_axis_labels, linewidths=.5, cbar_kws={"shrink": 0.5})
plt.title("Locations with Presence of Readings from iBeacon", fontsize = 25)
plt.xlabel("Beacons", fontsize = 20)
plt.ylabel("Location", fontsize = 20)
plt.savefig('3.readingsLocBeacon.png', dpi=300, bbox_inches='tight')
The above heatmap shows the associations between beacons and locations. However, it gives no indication of how the observations in the given file should be grouped.
For example, a valid reading detected by the beacon combination of b3002 and b3003 can correspond to location I03, I05 or I06, etc., whereas a valid reading detected by the combination of b3002, b3004 and b3006 only comes up for locations I03 and K05, but not for I05 or I06.
We need a way to group our observations and relate each group to beacons and locations, so that we can use these groups as our target and see whether clustering can identify the target automatically. A quick cross-check of the b3002/b3003 example is shown below.
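As a hedged, illustrative cross-check of the example above (the column names b3002 and b3003 come from the dataset), the cell below lists every location that has at least one observation in which both b3002 and b3003 report a valid (non -200) reading; note that other beacons may also report in those observations.
# Illustrative cross-check: locations with at least one observation where both
# b3002 and b3003 have valid (non -200) readings (other beacons may also report)
both_mask = (df_rssi['b3002'] != -200) & (df_rssi['b3003'] != -200)
sorted(df_rssi.loc[both_mask, 'location'].unique())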
Let's group the observations and find the relationship of each group with beacons and locations
(a) Construct a matrix, `bReadings`, which has the dimension `n_readings` x `n_beacons`. The matrix represents the `non -200` values of the beacons in each reading: `1` stands for a `non -200` value, `0` stands for a `-200` value.
bReadings = np.zeros((n_readings, n_beacons))
for i in range(0, n_readings):
    for j in range(0, n_beacons):
        # beacon readings are assumed to start at column index 2 of df_rssi
        if df_rssi.iloc[i][j+2] != -200:
            bReadings[i, j] = 1
bReadings[0:15,:]
bReadings.shape
(b) Make a new column `readings_group` in dataframe `df_rssi`, initialize the value as `-1` for all observations
df_rssi['readings_group']=-1
df_rssi.head(5)
(c) Use a for loop to read all the observations and check whether the same beacon combination of non -200 readings appeared in a previous row; if so, classify them into the same readings group. Shorten the numpy array `bReadings` to contain no duplicate beacon combinations, and name it `bReadingsGroup` (an alternative numpy-based cross-check is sketched after the loop below).
# Give the first observation readings group 0
group_cnt = df_rssi.loc[0, "readings_group"] = 0
arr = [bReadings[0]]
for i in range(1, n_readings):
    last_row = i - 1
    for j in range(0, i):
        # If an earlier observation has the same beacon combination, reuse its group
        if np.array_equal(bReadings[i], bReadings[j]):
            df_rssi.loc[i, "readings_group"] = df_rssi.loc[j, "readings_group"]
            break
        # Otherwise, once every earlier row has been checked, open a new group
        if j == last_row:
            group_cnt += 1
            df_rssi.loc[i, "readings_group"] = group_cnt
            arr.append(bReadings[i])
bReadingsGroup = np.array(arr)
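As a minimal sketch of an equivalent (and faster) way to obtain the distinct beacon combinations, np.unique over the rows of bReadings returns each combination once, together with an inverse index mapping every observation to its combination. The numbering differs from the loop above (np.unique orders rows lexicographically rather than by first occurrence), so this is shown only as a cross-check.
# Cross-check only: distinct beacon combinations via np.unique over rows.
# 'inverse' maps each observation to a combination, but with a different
# numbering than the readings_group assigned by the loop above.
bReadingsGroup_alt, inverse = np.unique(bReadings, axis=0, return_inverse=True)
print("distinct beacon combinations:", len(bReadingsGroup_alt))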
(d) Construct a matrix, `bReadingsGroupCnt`, of the same dimension as `bReadingsGroup`, and use it to count the frequency of occurrence of each beacon combination over all observations. Also, make a list `locationReadingList` which records the locations that come up in dataframe `df_rssi` for each `bReadingsGroup`.
bReadingsGroupCnt=np.zeros((len(bReadingsGroup), n_beacons))
locationReadingList = [list() for i in range(len(bReadingsGroup))]
for i in range(0, n_readings):
for j in range(0, len(bReadingsGroup)):
if np.array_equal(bReadings[i],bReadingsGroup[j]):
bReadingsGroupCnt[j]=bReadingsGroupCnt[j]+bReadings[i]
if df_rssi['location'][i] not in locationReadingList[j]:
locationReadingList[j].append(df_rssi['location'][i])
(e) Construct a list `beaconReadingList` which will record the beacon combination for each `bReadingsGroup`.
beaconReadingList = [list() for i in range(len(bReadingsGroup))]
for i in range(0, len(bReadingsGroup)):
    bArray = np.argwhere(bReadingsGroup[i] != 0).reshape(-1)
    for j in range(0, len(bArray)):
        # Map the beacon index back to its column name (b3001 .. b3013);
        # beacons_col follows the same column order used to build bReadings
        beacon = beacons_col[bArray[j]]
        beaconReadingList[i].append(beacon)
(f) Show the length of `bReadingsGroup`
len(bReadingsGroup)
(g) Summing up bReadingsGroupCnt also gives the number of non -200 readings in the entire file
sum(sum(bReadingsGroupCnt))
(h) Define a function listToString to combine a list into a string
def listToString(s):
    # Join the list elements into a single space-separated string
    return " ".join(s)
(i) Construct a dataframe `readingList_df` which shows the readings group, the combination of beacons and the locations where it occurs, side by side
count = 0
readingList_df = pd.DataFrame(columns=["readings_group", "beacon_list", "location_list"])
for i in range(0, len(bReadingsGroup)):
beaconStr=listToString(beaconReadingList[i])
locationStr=listToString(locationReadingList[i])
readingList_df.loc[len(readingList_df)] = [count, beaconStr, locationStr]
count+=1
pd.options.display.max_colwidth = None
display(HTML('<b>Table 1: Readings groups by beacon combination and locations of occurrence</b>'))
readingList_df
readingList_df.to_csv('a. readingList_df.csv', index=False)
(j) Plot a heatmap to illustrate the relationship between bReadingsGroupCnt and the beacon combinations.
import matplotlib.pyplot as plt
x_axis_labels=beacon_colunm
y_axis_labels = readingList_df['beacon_list']
plt.figure(figsize=(16, 16))
sns.heatmap(bReadingsGroupCnt, cmap="Greens", annot=True, fmt='.0f', yticklabels=y_axis_labels, xticklabels =x_axis_labels, linewidths=.5, cbar_kws={"shrink": 0.5})
plt.title("Readings distributions with Beacons", fontsize = 25)
plt.xlabel("Beacons", fontsize = 20)
plt.ylabel("Beacon Combination (Readings Group)", fontsize = 20)
plt.savefig('4.readingGroupBeacon.png', dpi=300, bbox_inches='tight')
(k) Construct a new dataframe `df_rssi_readingList` which shows the count of observations grouped by location and readings_group.
df_rssi_readingList=df_rssi.groupby(['location', 'readings_group']).size().reset_index()
df_rssi_readingList.columns=['location', 'readings_group', 'count']
df_rssi_readingList
(l) Map the beacon combination list to df_rssi_readingList.
beaconList_dict=readingList_df.set_index('readings_group')['beacon_list'].to_dict()
df_rssi_readingList['beacon_list']=df_rssi_readingList['readings_group'].replace(beaconList_dict)
df_rssi_readingList = df_rssi_readingList.reindex(columns=['location', 'readings_group', 'beacon_list', 'count']).reset_index(drop=True)
df_rssi_readingList.head(15)
df_rssi_readingList.to_csv('b. df_rssi_readingList.csv', index=False)
(m) Row 1 of Table 1 can be explained by filtering `df_rssi_readingList` for readings_group == 0, together with an associated bar chart.
df_rssi_readingList[df_rssi_readingList['readings_group']==0]
plt.figure(figsize=(10, 7))
sns.barplot(x='location', y='count', hue="beacon_list", data=df_rssi_readingList[df_rssi_readingList['readings_group']==0])
plt.title("Location where b3006 can detect signal", fontsize = 20)
plt.savefig('5.b300Loc.png', dpi=300, bbox_inches='tight')
(n) Any location in the bar chart, for example K08, can be explained by filtering df_rssi_readingList for location == 'K08', together with an associated bar chart.
df_rssi_readingList[df_rssi_readingList['location']=='K08']
plt.title("Beacon combinations which detect \nsignal at location K08", fontsize = 18)
g=sns.barplot(x='location', y='count', hue="beacon_list", data=df_rssi_readingList[df_rssi_readingList['location']=='K08'])
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.savefig('6. kn08.png', dpi=300, bbox_inches='tight')
Read all the RSSI readings into an array array_rssi.
array_rssi=df_rssi[beacons_col].values
array_rssi.shape
(a) Construct the k-distance graph to determine the eps parameter for DBSCAN.
from sklearn.neighbors import NearestNeighbors
nbrs=NearestNeighbors().fit(array_rssi)
distances, indices=nbrs.kneighbors(array_rssi, 20)
kDis=distances[:,10]
kDis2=distances[:,5]
kDis3=distances[:,1]
kDis.sort()
kDis2.sort()
kDis3.sort()
kDis=kDis[range(len(kDis)-1, 0, -1)]
kDis2=kDis2[range(len(kDis2)-1, 0, -1)]
kDis3=kDis3[range(len(kDis3)-1, 0, -1)]
plt.xlabel('observations')
plt.ylabel('distance')
plt.plot(range(0, len(kDis)), kDis, color='blue', label="10th neighbour")
plt.plot(range(0, len(kDis2)), kDis2, color='orange', label="5th neighbour")
plt.plot(range(0, len(kDis3)), kDis3, color='green', label="1st neighbour")
plt.legend(loc="upper right")
plt.title("Determine Eps - k Distance Graph \nfor DBScan Clustering", fontsize = 18)
plt.savefig('7. EPS.png', dpi=300, bbox_inches='tight')
The k-distance graph helps us determine a suitable eps for fitting DBSCAN clustering.
Besides eps, DBSCAN has another parameter, min_samples (we chose to tune only eps in this assignment), which defaults to 5 and corresponds to the 5th-neighbour curve in the k-distance graph. The orange line suggests the elbow occurs at around 13 for the 5th neighbour, so we fit our original data (an n_readings x n_beacons matrix of RSSI measurements) into DBSCAN with eps=13 and min_samples=5 (the default).
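The eps=13 value above was read off the plot by eye. As a hedged cross-check (not part of the original selection), the elbow of the sorted 5th-neighbour curve can also be located programmatically as the point furthest from the straight line joining the two ends of the curve:
# Illustrative elbow finder: index of the point on the curve with the maximum
# perpendicular distance to the chord joining the curve's endpoints
def find_elbow(curve):
    n = len(curve)
    x = np.arange(n, dtype=float)
    line_vec = np.array([n - 1, curve[-1] - curve[0]], dtype=float)
    line_vec /= np.linalg.norm(line_vec)
    vecs = np.column_stack([x, curve - curve[0]])
    proj = vecs @ line_vec                     # scalar projections onto the chord
    perp = vecs - np.outer(proj, line_vec)     # components perpendicular to the chord
    return int(np.argmax(np.linalg.norm(perp, axis=1)))
elbow_idx = find_elbow(kDis2)                  # kDis2 holds the sorted 5th-neighbour distances
print("suggested eps from the 5th-neighbour curve:", round(float(kDis2[elbow_idx]), 1))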
(b) Perform DBSCAN by fitting `array_rssi` with the optimal eps from the k-distance graph.
dbs_13 = cluster.DBSCAN(eps=13)
dbs_fit_13=dbs_13.fit(array_rssi)
(c) Find the label from the fitted model. Check the number of clusters in the label.
labels_dbs_13=dbs_fit_13.labels_
np.unique(labels_dbs_13, return_counts = True)
(d) Construct a dataframe `X` which will compare the cluster labels with the target (which is the readings_group in `df_rssi` dataframe)
df_rssi['beacon_list']=df_rssi['readings_group'].replace(beaconList_dict)
X=pd.DataFrame(array_rssi)
X['cluster']=labels_dbs_13
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
X.head(10)
(e) Use the crosstab function to construct a confusion matrix.
cm_dbs13=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_dbs13
cm_dbs13.to_csv('c1. cm_dbs13.csv', index=True)
The above confusion matrix shows that target group 00_b3006 has been clustered into group 0 by DBSCAN, 01_b3003 into group 1, and so on. However, groups 07_b3002 b3003 b3005 and 09_b3003 b3004 b3006 have both been assigned to group -1, which is the outlier group in DBSCAN.
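As a small illustrative follow-up (using the cm_dbs13 confusion matrix from above), we can count how many target groups fall entirely into DBSCAN's outlier cluster -1 at eps=13:
# Target groups whose only non-zero column in the confusion matrix is the outlier cluster -1
non_outlier = cm_dbs13.drop(columns=[-1], errors='ignore')
entirely_outlier = non_outlier.index[non_outlier.sum(axis=1) == 0]
print(len(entirely_outlier), "target group(s) are labelled entirely as outliers at eps=13")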
(f) Run some of the clustering metrics from sklearn to compare the target and the fitted model
label_target=df_rssi['readings_group'].values
print("adjusted_mutual_info :", metrics.adjusted_mutual_info_score(label_target, labels_dbs_13))
print("adjusted_rand_score :", metrics.adjusted_rand_score(label_target, labels_dbs_13))
print("normalized_mutual_info :", metrics.normalized_mutual_info_score(label_target, labels_dbs_13))
print("fowlkes_mallows_score :", metrics.fowlkes_mallows_score(label_target, labels_dbs_13))
print("silhouette_score :", metrics.silhouette_score(array_rssi, labels_dbs_13))
print("calinski_harabasz_score :", metrics.calinski_harabasz_score(array_rssi, labels_dbs_13))
print("davies_bouldin_score :", metrics.davies_bouldin_score(array_rssi, labels_dbs_13))
print("homogeneity_completeness_v_measure :", metrics.homogeneity_completeness_v_measure(label_target, labels_dbs_13))
There are many performance evaluation metrics for clustering, each with its own advantages and drawbacks. For a more rounded evaluation, we look at the following 4 metrics (a small toy example follows the list below):
- Adjusted mutual information score: measures the agreement of the two assignments, ignoring permutations; it is normalized against chance and the score is symmetric, so swapping the arguments does not change it.
- Homogeneity: each cluster contains only members of a single class.
- Completeness: all members of a given class are assigned to the same cluster.
- V-measure: the harmonic mean of homogeneity and completeness.
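As a toy illustration (the labels below are made up, not taken from the dataset), the following cell shows how these metrics behave when one cluster merges two target classes, and confirms that the v-measure equals the harmonic mean of homogeneity and completeness:
# Toy example: cluster 1 merges target classes 1 and 2, so homogeneity drops
# while completeness stays perfect; v_measure is the harmonic mean of the two
toy_target  = [0, 0, 1, 1, 2, 2]
toy_cluster = [0, 0, 1, 1, 1, 1]
h, c, v = metrics.homogeneity_completeness_v_measure(toy_target, toy_cluster)
print("homogeneity :", round(h, 3))
print("completeness:", round(c, 3))
print("v_measure   :", round(v, 3), "=", round(2 * h * c / (h + c), 3))
print("ami_score   :", round(metrics.adjusted_mutual_info_score(toy_target, toy_cluster), 3))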
(g) We sweep the eps value from 2 to 50 in a for loop, fitting the model repeatedly, and record the homogeneity, completeness, v-measure and adjusted mutual information of each model in the dataframe `dbs_metric`.
dbs_metric=pd.DataFrame(columns=["eps", "metric","score"])
dbs_cluster_num=pd.DataFrame(columns=["eps", "num_cluster"])
for eps_num in range(2, 50):
dbs = cluster.DBSCAN(eps=eps_num)
dbs_fit=dbs.fit(array_rssi)
labels_dbs=dbs_fit.labels_
a = metrics.adjusted_mutual_info_score(label_target, labels_dbs)
h,c,v= metrics.homogeneity_completeness_v_measure(label_target, labels_dbs)
dbs_metric.loc[len(dbs_metric)] = [eps_num, "homogeneity", h]
dbs_metric.loc[len(dbs_metric)] = [eps_num, "completeness", c]
dbs_metric.loc[len(dbs_metric)] = [eps_num, "v_measure", v]
dbs_metric.loc[len(dbs_metric)] = [eps_num, "ami_score", a]
dbs_cluster_num.loc[len(dbs_cluster_num)] = [eps_num, len(np.unique(labels_dbs, return_counts = True)[0])]
(h) Visualize the `dbs_metric`.
plt.figure(figsize=(10, 7))
sns.lineplot(x='eps', y='score', hue="metric", data=dbs_metric)
plt.title("Performance analysis for DBScan \nClustering with different eps", fontsize = 18)
plt.savefig('8. DBScan perform.png', dpi=300, bbox_inches='tight')
(i) Visualize the relationship between eps and the number of clusters for this dataset.
plt.plot(dbs_cluster_num['eps'], dbs_cluster_num['num_cluster'])
plt.title("Relationship between eps and num_cluster in DBScan")
plt.xlabel("EPS", fontsize = 15)
plt.ylabel("Number of Cluster", fontsize = 15)
plt.savefig('8a. DBScan cluster.png', dpi=300, bbox_inches='tight')
(j) Compare the confusion matrices for an eps value with worse performance (eps=3) and one with better performance (eps=21)
dbs_metric_sorted=dbs_metric.pivot(index='eps', columns='metric', values='score').reset_index().rename_axis(None, axis=1)
dbs_metric_sorted=dbs_metric_sorted.sort_values(by=["completeness", "homogeneity", "v_measure", "ami_score"], ascending=[False, False, False, False])
dbs_metric_sorted.head(1)
dbs_3 = cluster.DBSCAN(eps=3)
dbs_fit_3=dbs_3.fit(array_rssi)
labels_dbs_3=dbs_fit_3.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_dbs_3
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_dbs3=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_dbs3
dbs_21 = cluster.DBSCAN(eps=21)
dbs_fit_21=dbs_21.fit(array_rssi)
labels_dbs_21=dbs_fit_21.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_dbs_21
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_dbs21=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_dbs21
cm_dbs3.to_csv('c2. cm_dbs3.csv', index=True)
cm_dbs21.to_csv('c3. cm_dbs21.csv', index=True)
We found that for DBSCAN clustering on this dataset, performance jumps at eps = 5 and flattens out after eps = 21 (21 is the smallest eps with the best score across the 4 metrics). Comparing the confusion matrices for 3 models with different eps values (3, 13 and 21), we have the following findings:
Eps = 3: Besides the outlier group, DBSCAN groups the data into 28 clusters, and each of these clusters belongs to only one single target group. (Homogeneity would therefore be 1; however, because of the outlier group, the model still scores low on homogeneity.) 58 out of 63 target groups have some members labelled as outliers. 17 of these 58 groups have been split by DBSCAN into 2 or more clusters (for example, in group 06_b3002 b3003, 73 members are clustered into group 5, 7 members into group 8 and 9 members into the outliers), while 41 of them are classified entirely as outliers (for example, group 13_b3002 b3003 b3006, with 19 members, is classified entirely as outliers).
Eps = 13: DBSCAN groups the data into 31 clusters, each belonging to only one single target group. No target group has been split into separate clusters, and 32 groups are classified entirely as outliers.
Eps = 21: DBSCAN groups the data into 32 clusters, each belonging to only one single target group. No target group has been split into separate clusters, and 31 groups are classified entirely as outliers. The only difference compared with eps=13 is group 32_b3010 b3011, whose 7 members are now identified as a cluster of their own instead of outliers. All remaining outliers now come from target groups with fewer than 5 members (4 target groups with 4 members, 2 target groups with 3 members, and the rest with just 1 or 2 members).
Since the number of clusters does not change much across these eps values, we plotted the graph in step (i) above to show the relationship between eps and the number of clusters in DBSCAN; the number of clusters stays within the range of 29 to 33.
(a) Construct the average-distance-to-centroids graph to determine an optimal `n_clusters` value for K-means clustering. The function below is written for this dataset: the 13 beacons give 13 dimensions for calculating the distance of each observation to its centroid.
def avgDistToCentroids(dataArray, k_dist):
    # For each candidate k, fit KMeans and compute the average Euclidean distance
    # of every observation (a 13-dimensional RSSI vector) to its cluster centroid
    k_disValues = np.zeros(len(k_dist))
    for cur_k_ind in range(0, len(k_dist)):
        # try each k value, starting from the first one with index 0
        K_rssi = k_dist[cur_k_ind]
        km_rssi = KMeans(n_clusters=K_rssi)
        labels_rssi = km_rssi.fit(dataArray).labels_
        # distance of each observation to the centroid of its assigned cluster
        dists = np.linalg.norm(dataArray - km_rssi.cluster_centers_[labels_rssi], axis=1)
        k_disValues[cur_k_ind] = dists.sum() / n_readings
    return k_disValues
k_dist = range(20,105)
print(k_dist)
k_disValues = avgDistToCentroids(array_rssi, k_dist)
plt.plot(k_dist, k_disValues)
plt.xlabel('k values')
plt.ylabel('average distance to the centroids')
plt.title("Determine number of clusters - k values and \ndistance to centroids Graph for Kmean Clustering", fontsize = 18)
plt.savefig('9. avgDistCentroids.png', dpi=300, bbox_inches='tight')
From the graph, the curve decreases slowly and we cannot find a distinctive elbow where the trend flattens. Thus we chose n_clusters=60 as an arbitrary value for fitting our original data (an n_readings x n_beacons matrix of RSSI measurements) with K-means clustering. A hedged silhouette-based sanity check of this choice is sketched below.
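Since n_clusters=60 is an arbitrary pick, the cell below sketches a hedged sanity check (not used for the final choice): the average silhouette score over a coarse grid of k values, sampled every 10th k to keep the run time modest.
# Illustrative sanity check of the arbitrary n_clusters choice via silhouette scores
for k in range(20, 101, 10):
    labels_k = KMeans(n_clusters=k, random_state=999).fit_predict(array_rssi)
    print("k =", k, " silhouette =", round(metrics.silhouette_score(array_rssi, labels_k), 3))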
(b) Perform KMeans clustering by fitting `array_rssi` with n_clusters=60
km_rssi = KMeans(n_clusters=60, random_state=999)
km_fit = km_rssi.fit(array_rssi)
(c) Find the label from the fitted model. Check the number of clusters in the label.
labels_kmean=km_fit.labels_
np.unique(labels_kmean, return_counts = True)
(d) Construct a dataframe `X` which will compare the cluster labels with the target (which is the readings_group in `df_rssi` dataframe)
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
X.head(10)
(e) Use the crosstab function to construct a confusion matrix.
cm_kmean=pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean
We observe that there is no outlier group in K-means clustering. The above table shows that 00_b3006 has been clustered into group 6, 01_b3003 into group 7, and so on.
(f) Run some of the clustering metrics from sklearn to compare the target and the fitted model
print("adjusted_mutual_info :", metrics.adjusted_mutual_info_score(label_target, labels_kmean))
print("adjusted_rand_score :", metrics.adjusted_rand_score(label_target, labels_kmean))
print("normalized_mutual_info :", metrics.normalized_mutual_info_score(label_target, labels_kmean))
print("fowlkes_mallows_score :", metrics.fowlkes_mallows_score(label_target, labels_kmean))
print("silhouette_score :", metrics.silhouette_score(array_rssi, labels_kmean))
print("calinski_harabasz_score :", metrics.calinski_harabasz_score(array_rssi, labels_kmean))
print("davies_bouldin_score :", metrics.davies_bouldin_score(array_rssi, labels_kmean))
print("homogeneity_completeness_v_measure :", metrics.homogeneity_completeness_v_measure(label_target, labels_kmean))
(g) As with DBSCAN, we evaluate with the same 4 metrics. We sweep n_clusters from 15 to 80 in a for loop, fitting the model repeatedly, and record the metrics of each model in the dataframe `kmeans_metric`.
kmeans_metric=pd.DataFrame(columns=["n_cluster", "metric","score"])
for cluster_num in range(15, 80):
km_rssi = KMeans(n_clusters=cluster_num, random_state=999)
labels_kmean=km_rssi.fit(array_rssi).labels_
a = metrics.adjusted_mutual_info_score(label_target, labels_kmean)
h,c,v= metrics.homogeneity_completeness_v_measure(label_target, labels_kmean)
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "homogeneity", h]
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "completeness", c]
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "v_measure", v]
kmeans_metric.loc[len(kmeans_metric)] = [cluster_num, "ami_score", a]
(h) Visualize the `kmeans_metric`.
g=sns.lineplot(x='n_cluster', y='score', hue="metric", data=kmeans_metric)
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Performance analysis for Kmean \nClustering with different n_cluster", fontsize = 18)
plt.savefig('10. KMean perform.png', dpi=300, bbox_inches='tight')
(i) Compare the confusion matrices for n_clusters values with worse performance (n_clusters=30 and 70) and better performance (n_clusters=62)
kmeans_metric_sorted=kmeans_metric.pivot(index='n_cluster', columns='metric', values='score').reset_index().rename_axis(None, axis=1)
kmeans_metric_sorted=kmeans_metric_sorted.sort_values(by=["completeness", "homogeneity", "v_measure", "ami_score"], ascending=[False, False, False, False])
kmeans_metric_sorted.head(5)
km_rssi_62 = KMeans(n_clusters=62, random_state=999)
km_fit_62 = km_rssi_62.fit(array_rssi)
labels_kmean_62=km_fit_62.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean_62
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_kmean_62= pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean_62
km_rssi_30 = KMeans(n_clusters=30, random_state=999)
km_fit_30 = km_rssi_30.fit(array_rssi)
labels_kmean_30=km_fit_30.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean_30
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_kmean_30= pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean_30
km_rssi_70 = KMeans(n_clusters=70, random_state=999)
km_fit_70 = km_rssi_70.fit(array_rssi)
labels_kmean_70=km_fit_70.labels_
X=pd.DataFrame(array_rssi)
X['cluster']=labels_kmean_70
X['target']= df_rssi['readings_group'].astype(str).str.zfill(2) + '_' + df_rssi['beacon_list']
cm_kmean_70= pd.crosstab(index=X["target"], columns=X["cluster"])
cm_kmean_70
cm_kmean_62.to_csv('d1. cm_kmean_62.csv', index=True)
cm_kmean_70.to_csv('d2. cm_kmean_70.csv', index=True)
cm_kmean_30.to_csv('d3. cm_kmean_30.csv', index=True)
We found that for KMeans clustering on this dataset, performance on v_measure, ami_score and homogeneity increases gradually until n_cluster reaches 62 (62 is the n_cluster with the best score on all 4 metrics). Completeness stays close to its perfect score until n_cluster reaches 62. Beyond 62, v_measure, ami_score and completeness drop dramatically, while homogeneity stays flat. To investigate this further, we compare the confusion matrices for 3 models with different n_cluster values (30, 62 and 70) and have the following findings:
N_cluster = 30: KMeans groups the data into 30 clusters, of which only 7 contain a single target group. All target groups are kept within a single cluster, except group 09_b3003 b3004 b3006. On closer inspection, whenever a cluster contains multiple target groups, those target groups share similar beacon combinations; for example, cluster 4 comprises 5 target groups, namely 06_b3002 b3003, 07_b3002 b3003 b3005, 12_b3002 b3003 b3007, 56_b3002 b3003 b3004 and 62_b3002 b3003 b3004 b3007, all containing the combination of b3002 and b3003. Apart from group 06_b3002 b3003, which has 89 members, these target groups have only 1 or 2 members.
N_cluster = 62: KMeans groups the data into 62 clusters; only cluster 48 comprises two target groups, 21_b3004 b3005 (2 members) and 57_b3004 b3005 b3006 (1 member). All other clusters and target groups have a one-to-one mapping.
N_cluster = 70: KMeans groups the data into 70 clusters, each belonging to one single target group. However, 6 target groups have been split into separate clusters; for example, 03_b3004 has been split into clusters 0 and 66. This is clearly the result of having too many clusters for mapping 63 target groups.
Compare the homogeneity and completeness of all our confusion matrices for DBSCAN and KMeans
(a) Define a function to check each column for homogeneity and each row for completeness
def checkConfusionMatrix(cm):
perfect_homogeneity = True
perfect_completeness = True
print("Homogeneity Check:")
homo_group = 0
complete_group = 0
for col in cm.columns:
homogeneity_cnt = (cm[col]!=0).sum()
if(homogeneity_cnt != 1):
print("================================================================")
print("cluster:", col, "contains", homogeneity_cnt, "target group(s)")
print("target group: ", cm[cm[col]!=0].index.values, "are both grouped into cluster", col)
perfect_homogeneity = False
else:
homo_group = homo_group + 1
if(perfect_homogeneity == True):
print("All cluster only contains one single target group")
else:
print("\nHowever,", homo_group, "out of", len(cm.columns), "clusters contain(s) one single target group")
print("\nCompleteness Check:")
for row in cm.index:
completeness_cnt = (cm.loc[row]!=0).sum()
target_member = 0
if(completeness_cnt != 1):
print("================================================================")
perfect_completeness = False
print("target:", row, "has been split into", completeness_cnt, "cluster group")
cluster=list(np.where(cm.loc[row]!=0))[0]
for i in cluster:
col=cm.columns[i]
target_member = target_member + cm[col][row]
print("while cluster", col , "contains", cm[col][row], "members")
print("there should be ", target_member, "members in target group", row)
else:
complete_group = complete_group + 1
if(perfect_completeness == True):
print("All target groups are grouped into same cluster")
else:
print("\nHowever,", complete_group, "out of", len(cm.index), "target groups have been grouped into same cluster")
checkConfusionMatrix(cm_dbs3)
checkConfusionMatrix(cm_dbs13)
checkConfusionMatrix(cm_dbs21)
checkConfusionMatrix(cm_kmean_30)
checkConfusionMatrix(cm_kmean_62)
checkConfusionMatrix(cm_kmean_70)
To summarize our analysis, we have achieved our goal of identifying the signal measurement patterns and using clustering to group all the data into an organized structure.
For this dataset, for all the eps values we tested (ranging from 3 to 50), DBSCAN differentiates the data so that every cluster contains only one single target group. DBSCAN uses its outlier group "-1" to hold all the unidentified observations. If the eps value is too small, DBSCAN splits target groups into different clusters. At the best eps value, DBSCAN does not split any target group and simply leaves the low-density target groups as outliers. With eps values higher than the optimum, DBSCAN's performance stays the same and the model is not over-tuned.
For K-means clustering, for all the n_clusters values we tested (ranging from 30 to 70), if n_clusters is too low, K-means merges multiple target groups with similar beacon combinations into one cluster, but does not split any target group across clusters. When n_clusters is well chosen, K-means differentiates the data into an almost 1:1 mapping with the target groups. Beyond the optimal value, however, K-means performance drops and target groups are split into different clusters, although each cluster still contains only one single target group.
Even though we have found an almost perfect n_clusters value that maps clusters and targets 1:1, we would still recommend DBSCAN for processing this dataset: it is comparatively insensitive to its input parameter, the undefined clusters are actually sparse beacon combinations that come up infrequently in the dataset, and tuning eps only improves the clustering of the less dense target groups. The near-perfect n_clusters value for K-means may only work for this dataset and this particular number of beacon-combination groups, and a small change in n_clusters could give much worse clustering results.