I have taken Record History Data from F1 archives (Ergast API) to perform various Clustering Algorithms.
The Data is cleaned in the previous sections.
Overview of Data Cleaning:
F1 archives hold important information about races held from 1950 to the current season.
There are different tables of data that can be extracted such as Results of the Race, Constructor (Team) Information, Driver Information, Results of the Qualification, Season-End Results and many more.
Each row of these individual tables contains data nested inside dictionaries.
These dictionaries can be expanded, and every dictionary can be transformed into a separate table.
Valuable information from these tables is then concatenated/joined into one master table.
The master table consists of 26,941 rows, 21 feature variables, and 1 label column.
Some of the feature variables include laps in the race, grid position held, age at time of the race, history of wins in the past, history of laps completed in the past, weather of the race, points gained in the race and many more.
The label column is based on the position achieved by a driver in a particular race. Positions range from 1 to 20/21/22 (depending on the number of drivers that raced in that particular race). For simplification purposes, the positions have been grouped into 3 sections:
Podium (Top 3 finish)
Top_10 (Positions 4-10)
Outside Top_10 (Positions 11 onwards)
The motivation behind these sections was solely that a top-3 finish earns more points than positions 4-10, while positions outside the top 10 earn no points at all.
With the clustering analysis, I want to see whether, without the label column (unsupervised ML), we still recover the same number of clusters as the sections originally decided upon, i.e., 3.
Clustering is a Machine Learning method that groups vectors or observations (set of objects) into groups (clusters).
If we are given a set of vectors, each data point or vector can be classified into a particular group.
Suppose there are n vectors in one cluster and m clusters in total for a specific dataset. In theory, the n vectors within a cluster are similar to each other and share the same properties, while vectors in different clusters should have different properties.
Clustering is an Un-supervised ML method which means that there is no target variable present while creating a clustering model.
When no predefined classification is required but the data needs to be organized into logical groupings, clustering techniques are applied.
By examining the groups that the data points fall into after applying a clustering algorithm, we can gain valuable insights about the data.
Some of the applications of Clustering include:
Market segmentation
Social network analysis
Grouping of search results
Medical image processing
Image segmentation
Anomaly detection.
The grouping of data into clusters is based on the spatial similarity of different vectors: the smaller the spatial separation between two vectors, the more similar they are considered. There are several ways to calculate this distance, which already shows that there isn't a single, well-defined process for cluster determination. Hence, there are different algorithms to group the data into clusters:
Connectivity Models: Hierarchical Clustering (Agglomerative and Birch Clustering)
Centroid Models: K-Means Clustering
Density Models: DBSCAN and OPTICS Clustering
Distribution Models
Subspace Models
Graph-based Models
Group Models
Neural Models and so on.
As always, there are some drawbacks with Clustering:
With very large datasets, the key concept of clustering – the distance between individual measurements/observations – is weakened by the so-called curse of dimensionality. This also makes clustering very time-consuming for large datasets, especially when the space has many dimensions.
Shortcomings of existing (and applied) algorithms. These include:
the inability to recognise whether the data set is homogeneous or contains clusters.
the inability to rank variables according to their contribution to the heterogeneity of the dataset.
the inability to recognise specific patterns: Cluster centers, core regions, boundaries of clusters, zones of mixing, noise or even outliers.
the inability to estimate the correct number of clusters.
Moreover, each algorithm is designed to find one particular type of cluster. As a result, it fails as soon as the data contains a mixture of clusters with different properties.
The effectiveness of the clustering is determined by the chosen method of distance measurement. Usually, there is no universally correct result, because the outcome varies drastically between methods, each with different parameters.
Clustering with random Hyper-parameters
For every algorithm that performs clustering on a dataset, there are several hyper-parameters that need to be specified beforehand. These hyper-parameters must be tuned with our dataset in mind: choosing arbitrary hyper-parameters that do not match the data may result in loss of information and/or data points being assigned to the wrong clusters. Thus, before any hyper-parameter tuning, let's look at a k-means model with the following arbitrary hyper-parameters (the corresponding code is shown below):
Initializer: kmeans++
Number of clusters: 6
Maximum iterations: 300
Random state: 1973
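The code below, reproduced from the notebook source, fits this K-means model and attaches the resulting cluster labels to the dataframe for a pair plot; X, df, and the feature list come from the pre-processing steps in the previous sections.

```python
from sklearn.cluster import KMeans
import seaborn as sns

# K-means with arbitrary (untuned) hyper-parameters
kmeans = KMeans(init="k-means++", n_clusters=6, max_iter=300, random_state=1973)
label = kmeans.fit_predict(X)

# Attach the cluster labels and plot the clusters over a subset of features
df['cluster_label'] = label
features = ['season', 'driverId', 'round', 'circuitId', 'pole_driverId',
            'position', 'points', 'grid', 'laps', 'age_on_race',
            'cumulative_points', 'cumulative_laps', 'pole_history',
            'win_history', 'win_driverId', 'cluster_label']
sns.pairplot(df[features], hue='cluster_label')
```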
Hyper-parameters are variables that you specify while building a machine-learning model. This means that it’s the user that defines the hyper-parameters while building the model. Hyper-parameters control the learning process, while parameters are learned.
The performance of a model significantly depends on the value of hyperparameters. Note that there is no way to know in advance the best values for hyperparameters so ideally, we need to try all possible values to know the optimal values.
Doing this manually could take a considerable amount of time and resources.
Hence, we create functions with loops over candidate values and try to find the best possible hyper-parameter(s) based on a metric. This metric changes from algorithm to algorithm and also between methods within an algorithm. For example, in clustering the silhouette method is a very commonly used way to determine hyper-parameters, based on maximizing the silhouette score, but there are other methods as well, such as the Elbow method, Grubbs, and so on.
The Elbow method is good for finding optimal hyper-parameters when the dimensionality of the dataset is low, but for high-dimensional data the Silhouette method is preferred (a minimal Elbow-method sketch is shown below for context).
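For context, here is a minimal sketch of the Elbow method: it plots the K-means inertia (within-cluster sum of squared distances) against the number of clusters and looks for the "elbow" where the curve flattens. It assumes the pre-processed feature matrix X from the earlier steps.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow-method sketch: fit K-means for a range of k and record the inertia
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, init='k-means++', n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method')
plt.show()
```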
Silhouette Method
Similar to the Elbow method, the Silhouette method is a way to find the optimal number of clusters or other hyper-parameters and to validate the consistency of the data within the clusters.
This method calculates a silhouette coefficient for each point and averages these values over all samples to get the silhouette score. The set of hyper-parameters with the maximum silhouette score is chosen.
There are 2 main terms to be known to understand silhouette scores:
Cohesion (Similarity to its own cluster)
Separation (Similarity to other clusters)
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette scores lie between the range [-1,1] where a value close to the upper limit shows that a point is well matched to the cluster and a value close to the lower limit shows that it isn’t. After calculating silhouette values of all the points in the data, the silhouette score can be determined by averaging out the silhouette values.
The silhouette coefficient, s(i), can be calculated by the following formula: \[ s_{i} = \frac{b_{i} - a_{i}}{\max(b_{i}, a_{i})} \] where a(i) is the average distance of that point to all other points in the same cluster, and b(i) is the average distance of that point to all the points in the nearest neighbouring cluster.
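As a quick illustration of how the per-point coefficients s(i) combine into the overall score, here is a minimal sketch using scikit-learn's silhouette_samples and silhouette_score on toy data; the toy arrays and cluster count are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 2-D data: two well-separated blobs (values made up for illustration)
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

s_values = silhouette_samples(X_toy, labels)   # one s(i) per point
s_score = silhouette_score(X_toy, labels)      # mean of the s(i) values

print(s_values)            # all close to 1: points match their clusters well
print(round(s_score, 3))   # overall silhouette score
```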
Hyper-Parameter tuning function (maximize_silhouette)
There are tons of different models that can be used to perform Clustering but for this dataset, we are focusing mainly on the 4 most used methods which are:
K-Means
DBSCAN
Agglomerative
Birch
The tuning function takes in the dataset (X), the algorithm to use (algo), the maximum number of loop iterations (nmax), and a boolean flag for plotting the silhouette scores against the hyper-parameter (i_plot). It converts the dataset into a contiguous array, runs models with different values of the hyper-parameter in a for loop, and computes the silhouette score for each iteration. It then stores the optimal hyper-parameter value, labels, and model and returns them. If the i_plot parameter is set to True, the function also plots a graph of silhouette score vs. hyper-parameter.
Code
def maximize_silhouette(X, algo="birch", nmax=20, i_plot=False):
    # PARAM
    i_print = False  # set to True to print the score for every candidate value

    # FORCE CONTIGUOUS
    X = np.ascontiguousarray(X)

    # LOOP OVER HYPER-PARAM
    params = []
    sil_scores = []
    sil_max = -10

    if algo in ["kmeans", "birch", "ag"]:
        # n_clusters is the hyper-parameter for these three algorithms
        for param in range(3, nmax + 1):
            if algo == "birch":
                model = Birch(n_clusters=param).fit(X)
                labels = model.predict(X)
            if algo == "ag":
                model = AgglomerativeClustering(n_clusters=param).fit(X)
                labels = model.labels_
            if algo == "kmeans":
                model = KMeans(n_clusters=param, init='k-means++').fit(X)
                labels = model.predict(X)

            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(param)
            except:
                continue  # silhouette is undefined if only one cluster is found

            if i_print:
                print(param, sil_scores[-1])

            if sil_scores[-1] > sil_max:
                opt_param = param
                sil_max = sil_scores[-1]
                opt_labels = labels
                opt_model = model

    elif algo in ["dbscan"]:
        # eps is the hyper-parameter for DBSCAN; min_samples is fixed at 2 * n_features
        for eps in np.arange(0.2, nmax + 1, 0.2):
            model = DBSCAN(eps=eps, min_samples=X.shape[1] * 2).fit(X)
            labels = model.labels_

            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(eps)
            except:
                continue

            if i_print:
                print(eps, sil_scores[-1])

            if sil_scores[-1] > sil_max:
                opt_param = eps
                sil_max = sil_scores[-1]
                opt_labels = labels
                opt_model = model

    if i_plot:
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, "-o")
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()

    return opt_labels, opt_param, opt_model
K-Means Algorithm (Hyper-Parameter = n_clusters)
K-Means Algorithm is a type of Partition-based or Centroid-based clustering.
How does the K-Means algorithm work?
First we choose k random data points and assign those as centroids or cluster centres.
Then for every data point, we see which centroid is nearest to it using some measurement method such as Euclidean (default) or Manhattan.
We assign the data point to that centroid.
The new centroid is then recomputed as the mean of all the data points assigned to it.
Repeat the previous 3 steps until the centroid stops changing.
The algorithm is said to have “converged” once there are no more changes.
The goal is to minimize the sum of the distances between the data points and the cluster centroid in order to determine which group each data point should belong to.
K-means follows the same approach as Expectation-Maximization (EM). EM is an iterative method for finding maximum-likelihood estimates of parameters when the model depends on unobserved features. The approach alternates between two steps, Expectation (E) and Maximization (M); in K-means the assignment step plays the role of E and the centroid update plays the role of M, as sketched below.
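The sketch below is a minimal NumPy illustration of the assignment and update steps described above; the toy data, the value of k, and the iteration cap are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))   # toy data, for illustration only
k = 3
centroids = X_toy[rng.choice(len(X_toy), size=k, replace=False)]  # random init from data points

for _ in range(10):  # repeat until the centroids stop changing
    # Assignment (E-like) step: each point goes to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)

    # Update (M-like) step: move each centroid to the mean of its assigned points
    # (a production implementation would also handle empty clusters)
    new_centroids = np.array([X_toy[assignments == j].mean(axis=0) for j in range(k)])

    if np.allclose(new_centroids, centroids):
        break  # converged: no more changes
    centroids = new_centroids
```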
Advantages:
Relatively easy to understand and implement.
Scalable to large datasets.
Better computation cost.
Easily warm start the assignments and positions of centroids.
Disadvantages:
Choosing K manually and being dependent on the initial values.
Lacks consistent results for different values of K.
Always tends to find spherical (circular) clusters.
Centroids get dragged around by outliers in the dataset.
Curse of dimensionality: the distance-based notion of similarity becomes ineffective as the number of dimensions increases.
For this exercise, we choose n_clusters as the hyper-parameter to tune while the other hyper-parameters are left at their defaults. The tuning function searches for the n_clusters value (from 3 to 10) that maximizes the silhouette score and outputs the Silhouette vs. n_clusters graph. The resulting labels are then appended to our original dataset for plotting the clusters with Seaborn's pairplot function.
Code
algo_features = ['season', 'driverId', 'round', 'circuitId', 'pole_driverId',
                 'position', 'points', 'grid', 'laps', 'age_on_race',
                 'cumulative_points', 'cumulative_laps', 'pole_history',
                 'win_history', 'win_driverId']

start = time.time()
print('KMeans Algorithm')
kmeans_labels, n_clusters, kmeans_model = maximize_silhouette(X, 'kmeans', 10, True)
print('The number of clusters for Kmeans algorithm is: ', len(np.unique(kmeans_labels)))

df['kmeans_label'] = kmeans_labels
algo_features.append('kmeans_label')
print('Cluster Plots for Kmeans algorithm')
sns.pairplot(df[algo_features], hue='kmeans_label')
plt.show()

end = time.time()
print('Time taken for KMEANS algorithm: {} minutes'.format((end - start)/60))
KMeans Algorithm
The number of clusters for Kmeans algorithm is: 3
Cluster Plots for Kmeans algorithm
Time taken for KMEANS algorithm: 1.753977100054423 minutes
df['kmeans_accuracy'] = 0
for i in range(len(df)):
    if df['kmeans_label'][i] == 0 and df['label'][i] == 'Podium':
        df['kmeans_accuracy'][i] = 1
    elif df['kmeans_label'][i] == 1 and df['label'][i] == 'Top_10':
        df['kmeans_accuracy'][i] = 1
    elif df['kmeans_label'][i] == 2 and df['label'][i] == 'Outside_Top_10':
        df['kmeans_accuracy'][i] = 1
    else:
        df['kmeans_accuracy'][i] = 0

print('The accuracy of correct clusters formed by K-Means Algorithm = {} %'.format(df['kmeans_accuracy'].mean()*100))
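As a side note, the same matching score can be computed without an explicit Python loop; the sketch below is one possible vectorized alternative, using a label-to-cluster mapping whose values mirror the ones hard-coded in the loop above.

```python
# Vectorized alternative to the loop above (same cluster-to-label mapping)
cluster_of_label = {'Podium': 0, 'Top_10': 1, 'Outside_Top_10': 2}
matches = df['kmeans_label'].values == df['label'].map(cluster_of_label).values
print('K-Means matching score = {:.3f} %'.format(matches.mean() * 100))
```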
The accuracy of correct clusters formed by K-Means Algorithm = 15.95732692235453 %
Results for K-Means Algorithm
The optimal number of clusters formed came out to be 3, which is equal to the number of labels present in the original dataset.
However, after comparing the proposed cluster labels with the original labels, only around 14,800 out of 26,000 values (55.475%) match. This is likely because the dataset is historical and contains many inter-dependencies between features, which a simple clustering algorithm cannot separate into clean clusters. Another reason might be the high dimensionality of the data, which makes the problem more complex.
DBSCAN Algorithm (Hyper-Parameter = eps)
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and it is a Density Based Algorithm.
It works on the assumption that clusters are dense regions in space separated by regions of lower density. Densely grouped data points are grouped into a single cluster.
Clusters are identified by looking at the local density of data points in large spatial datasets.
The algorithm is robust to outliers, and the number of clusters does not need to be (and cannot be) specified as a hyper-parameter.
Given a dataset of points D = {x_i}, we have to partition the points into dense regions, which we call clusters, and sparse regions, which may contain noise.
The 2 main hyper-parameters for this model are:
Epsilon (eps): specifies how close points should be to each other to be considered a part of a cluster. It means that if the distance between two points is lower or equal to this value (eps), these points are considered to be neighbors.
Minimum number of Points (min_samples): the minimum number of points to form a dense region. For example, if we set the minPoints parameter as 5, then we need at least 5 points to form a dense region.
There are 3 types of points in this algorithm:
Core Point: a point that has at least the specified min_samples number of points (including itself) within the specified 'eps' radius. Such points always belong to a dense region.
Border Point: a point that has fewer than min_samples points within the 'eps' radius but lies in the neighborhood of a core point.
Noise Point: a point that is neither a core point nor a border point.
Given min_samples and eps, every data point in the dataset can be categorized as a core, border, or noise point.
Density Edge: if p and q are both core points and the distance between them is ≤ eps, then we can connect the vertices p and q in a graph and call this connection a "density edge".
Density Connected Points: two points p and q are said to be density-connected if both are core points and there exists a path of density edges connecting p to q.
How does the DBSCAN algorithm work?
Label the points as Core, Border and Noise Point.
Get rid of every Noise point as they do not belong to any density region (cluster).
For every core point p that has not yet been assigned to a cluster:
create a new cluster containing p, and
add all the points that are density-connected to p.
Assign each border point to the cluster of its closest core point (see the sketch below).
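As a minimal illustration of core, border, and noise points, the sketch below runs scikit-learn's DBSCAN on toy data and derives the three point types from the fitted model; the toy data and the eps and min_samples values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Toy data: two dense blobs plus a few scattered points (illustration only)
X_toy = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
                   rng.normal(5, 0.3, size=(40, 2)),
                   rng.uniform(-3, 8, size=(5, 2))])

db = DBSCAN(eps=0.8, min_samples=5).fit(X_toy)

core_mask = np.zeros(len(X_toy), dtype=bool)
core_mask[db.core_sample_indices_] = True      # core points
noise_mask = db.labels_ == -1                  # noise points are labelled -1
border_mask = ~core_mask & ~noise_mask         # in a cluster but not core

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print('clusters:', n_clusters)
print('core:', core_mask.sum(), 'border:', border_mask.sum(), 'noise:', noise_mask.sum())
```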
Advantages:
It can handle Noise very well.
DBSCAN can handle clusters of different shapes and sizes.
Disadvantages:
If your dataset has multiple densities or varying densities, DBSCAN tends to fail. It does not work very well in such cases.
It’s extremely sensitive to the hyperparameters. A slight change in hyperparameters can lead to a drastic change in the outcome.
A density-based notion may not work well for high-dimensional data, so we should avoid using it for, e.g., high-dimensional text data.
For this exercise, we are choosing the hyper-parameter to tune as eps while min_samples is taken as twice the dimensionality of the data (33*2 = 66) and other hyper-parameters are set to default.
The range in which eps should lie is estimated using a K-distance graph, which computes the K-nearest neighbours (n_neighbors = twice the dimensionality of the data) and their distances.
The tuning function then searches for the eps value that maximizes the silhouette score over steps of 0.2 (0.2, 0.4, 0.6, …) and outputs the Silhouette vs. eps graph. The resulting labels are then appended to our original dataset for plotting the clusters with Seaborn's pairplot function.
Code
neighbors = NearestNeighbors(n_neighbors=66)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
Code
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('Distance (EPS)')
plt.title('K-Distance Graph for DBSCAN')
[Figure: K-Distance Graph for DBSCAN]
Code
algo_features = ['season', 'driverId', 'round', 'circuitId', 'pole_driverId',
                 'position', 'points', 'grid', 'laps', 'age_on_race',
                 'cumulative_points', 'cumulative_laps', 'pole_history',
                 'win_history', 'win_driverId']

start = time.time()
print('DBSCAN Algorithm')
dbscan_labels, parameter, db_model = maximize_silhouette(X, 'dbscan', 2, True)
print('The EPS for DBSCAN algorithm is: {}'.format(parameter))
print('The number of clusters for DBSCAN algorithm is: ', len(np.unique(dbscan_labels)))

df['dbscan_label'] = dbscan_labels
algo_features.append('dbscan_label')
print('Cluster Plots for DBSCAN algorithm')
sns.pairplot(df[algo_features], hue='dbscan_label')
plt.title('DBSCAN Algorithm')
plt.show()

end = time.time()
print('Time taken for DBSCAN algorithm: {} minutes'.format((end - start)/60))
DBSCAN Algorithm
The EPS for DBSCAN algorithm is: 2.6
The number of clusters for DBSCAN algorithm is: 3
Cluster Plots for DBSCAN algorithm
Time taken for DBSCAN algorithm: 2.133706529935201 minutes
df['dbscan_accuracy'] = 0
for i in range(len(df)):
    if df['dbscan_label'][i] == 0 and df['label'][i] == 'Outside_Top_10':
        df['dbscan_accuracy'][i] = 1
    elif df['dbscan_label'][i] == -1 and df['label'][i] == 'Top_10':
        df['dbscan_accuracy'][i] = 1
    elif df['dbscan_label'][i] == 2 and df['label'][i] == 'Podium':
        df['dbscan_accuracy'][i] = 1
    else:
        df['dbscan_accuracy'][i] = 0

print('The accuracy of correct clusters formed by DBSCAN Algorithm = {} %'.format(df['dbscan_accuracy'].mean()*100))
The accuracy of correct clusters formed by DBSCAN Algorithm = 56.549340746027575 %
Results for DBSCAN Algorithm
From the K-distance graph, it can be seen that the curve bends in the eps range 1-3, hence the optimal value of epsilon should lie between 1 and 3.
The tuning output shows that the optimal eps value, for which the silhouette score is maximum, is 2.6.
With the hyper-parameters eps=2.6 and min_samples=66, the number of clusters formed came out to be 3, which is equal to the number of labels present in the original dataset.
However, after comparing the proposed cluster labels with the original labels, only around 15,050 out of 26,000 values (56.543%) match. As with K-means, this is likely because the dataset is historical and contains many inter-dependencies between features, and because the high dimensionality of the data makes the problem more complex.
Agglomerative (Hierarchical) Algorithm (Hyper-Parameter = n_clusters)
The Agglomerative model is a type of Hierarchical Clustering.
It is also known as bottom-up clustering.
The Agglomerative model works as follows: each object is initially considered a single-element cluster (leaf). At each step of the algorithm, the two most similar clusters are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of one single big cluster (root).
Pairs of clusters are merged as one moves up the hierarchy.
Clusters are merged based on the distance between data points; the most common distance measure is the Euclidean distance.
Euclidean distance is a method of finding distance between 2 points where distance is calculated by drawing a straight line between the 2 data points and then calculating the length of the line. The formula to calculate the distance between 2 points p and q is: \[ d\left( p,q\right) = \sqrt {\sum _{i=1}^{n} \left( q_{i}-p_{i}\right)^2 } \]
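As a tiny worked example of the formula above, the sketch below computes the Euclidean distance between two toy points; the point values are made up for illustration.

```python
import numpy as np

# Euclidean distance between two toy points p and q (values made up for illustration)
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

d = np.sqrt(np.sum((q - p) ** 2))   # same result as np.linalg.norm(q - p)
print(d)                            # sqrt(3^2 + 4^2 + 0^2) = 5.0
```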
How does Agglomerative Clustering work?
It starts by treating each observation as a separate cluster.
Distance Measurement: the distance between every pair of observations is calculated using a metric such as Euclidean or Manhattan distance. For example, suppose we have 5 observations with 3 features each. The distance between Obs. 1 and Obs. 2 can be calculated with the formula above, and likewise for Obs. 1 and Obs. 3, Obs. 2 and Obs. 4, and so on. We then merge the pair with the smallest non-zero distance in the matrix to create our first node. To calculate the distance between this newly created node and the other data points, a linkage criterion has to be set.
Linkage Criterion: the linkage criterion determines exactly where the distance is measured. The simplest linkage criterion is Single Linkage, which defines the distance between two clusters as the minimum distance between their data points. In other words, suppose in our previous example the first node turned out to be (Obs. 1, Obs. 3). To calculate the distance between this node and Obs. 2, we compute the individual Euclidean distances between the node members and Obs. 2, i.e., the distance between Obs. 1 and Obs. 2 and between Obs. 3 and Obs. 2. The minimum of these 2 distances is taken as the distance between the node (Obs. 1, Obs. 3) and Obs. 2, and the same applies to Obs. 4 and Obs. 5.
The next step is to again look at the distances and merge the points with the minimum distance. Continuing the same example, suppose Obs. 2 and Obs. 4 now have the smallest distance, so they are merged into a node. We are then left with 2 nodes, (Obs. 1, Obs. 3) and (Obs. 2, Obs. 4), and 1 remaining data point, Obs. 5. The distances are again calculated using the linkage criterion and the nodes are merged again.
We keep merging until all the data is clustered into one cluster; in the end, we obtain a dendrogram in which all the data has been merged into a single cluster (a small sketch of this merge process is given below).
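The sketch below illustrates the merge process with single linkage on 5 toy observations using SciPy's linkage function; each row of the returned matrix records one merge (the two merged cluster indices, the merge distance, and the size of the new cluster). The toy observation values are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# 5 toy observations with 3 features each (values made up for illustration)
obs = np.array([[1.0, 2.0, 0.5],
                [1.1, 2.1, 0.4],
                [5.0, 5.0, 5.0],
                [5.2, 4.9, 5.1],
                [9.0, 0.0, 3.0]])

# Single-linkage agglomerative clustering: one merge per row of Z
Z = linkage(obs, method='single', metric='euclidean')
print(Z)
```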
Advantages:
No need to specify the number of clusters in advance.
Easy to use and implement.
Disadvantages:
Merges cannot be undone (we cannot take a step back in this algorithm).
Time complexity is high, at least O(n^2 log n).
For this exercise, we again choose n_clusters as the hyper-parameter to tune while the other hyper-parameters are left at their defaults. The tuning function searches for the n_clusters value (from 3 to 10) that maximizes the silhouette score and outputs the Silhouette vs. n_clusters graph. The resulting labels are then appended to our original dataset for plotting the clusters with Seaborn's pairplot function.
Code
algo_features = ['season', 'driverId', 'round', 'circuitId', 'pole_driverId',
                 'position', 'points', 'grid', 'laps', 'age_on_race',
                 'cumulative_points', 'cumulative_laps', 'pole_history',
                 'win_history', 'win_driverId']

start = time.time()
print('Agglomerative (Hierarchical) Algorithm')
ag_labels, n_clusters, agg_model = maximize_silhouette(X, 'ag', 10, True)
print('The number of clusters for Agglomerative (Hierarchical) algorithm is: ', len(np.unique(ag_labels)))

df['ag_label'] = ag_labels
algo_features.append('ag_label')
print('Cluster Plots for Agglomerative (Hierarchical) algorithm')
sns.pairplot(df[algo_features], hue='ag_label')
plt.show()

end = time.time()
print('Time taken for Agglomerative (Hierarchical) algorithm: {} minutes'.format((end - start)/60))
Agglomerative (Hierarchical) Algorithm
The number of clusters for Agglomerative (Hierarchical) algorithm is: 4
Cluster Plots for Agglomerative (Hierarchical) algorithm
Time taken for Agglomerative (Hierarchical) algorithm: 3.548959696292877 minutes
Code
X1 = df.drop(['label', 'ag_label', 'dbscan_label', 'kmeans_label', 'cluster_label',
              'kmeans_accuracy', 'dbscan_accuracy'], axis=1)
Z = linkage(X1, method='ward')
dend = dendrogram(Z,
                  truncate_mode='lastp',  # show only the last p merged clusters
                  p=4,
                  leaf_rotation=90,
                  leaf_font_size=12,
                  show_contracted=True)
plt.title('Truncated Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
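The matching-score output that follows is produced by the same kind of comparison loop used for K-Means and DBSCAN; the code below is reproduced from the notebook source and counts clusters 2 and 3 both as Top_10.

```python
df['ag_accuracy'] = 0
for i in range(len(df)):
    if df['ag_label'][i] == 0 and df['label'][i] == 'Outside_Top_10':
        df['ag_accuracy'][i] = 1
    elif df['ag_label'][i] == 2 and df['label'][i] == 'Top_10':
        df['ag_accuracy'][i] = 1
    elif df['ag_label'][i] == 3 and df['label'][i] == 'Top_10':
        df['ag_accuracy'][i] = 1
    elif df['ag_label'][i] == 1 and df['label'][i] == 'Podium':
        df['ag_accuracy'][i] = 1
    else:
        df['ag_accuracy'][i] = 0

print('The accuracy of correct clusters formed by Agglomerative (Hierarchical) Algorithm = {} %'.format(df['ag_accuracy'].mean()*100))
```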
The accuracy of correct clusters formed by Agglomerative (Hierarchical) Algorithm = 51.185154577213474 %
Results for Agglomerative (Hierarchical) Algorithm
The optimal number of clusters formed came out to be 4 which is not equal to the number of labels present in the original dataset.
It is also seen that if we merge clusters 2 and 3 and then compare the proposed cluster labels with the original labels, around 13,600 out of 26,000 values (51.185%) match, which is the highest score among all other possible merges of clusters ([1,3], [2,4], [1,4], and so on).
This means that maybe we should consider splitting the Top_10 label into 2 and then run other models.
But since the matching score is not high, we will not implement it.
Final Results
Comparing the running times of the clustering algorithms, the Agglomerative (Hierarchical) algorithm takes the most time, followed by DBSCAN, while K-means is the fastest (roughly 3.5, 2.1, and 1.8 minutes respectively in the runs above).
The clusters formed by K-Means and DBSCAN (3) match the original number of labels (3), but the matching score for DBSCAN is higher than that of K-means (by about 1%).
The clusters formed by Agglomerative clustering (4) do not match the original number of labels (3), which gives us a new insight into the data and the target variable.
The complexity of DBSCAN is greater than that of the other 2 algorithms, even after simplifying it by fixing an important hyper-parameter (min_samples) to a constant (dimensionality*2 = 66).
From the cluster plots of each algorithm, it is seen that clusters are more distinct for Agglomerative followed by K-means and finally DBSCAN.
Conclusion
Even though the matching score is slightly higher for the DBSCAN algorithm, I would still choose the K-means algorithm as the optimal algorithm for clustering this data because of its faster processing time and lower complexity.
The Agglomerative Algorithm shows us that maybe the label variables should be split into 4, or specifically the ‘Top_10’ variable should be split into 2 to give better clusters.
From the dataset we can get the basic idea that the Top_10 label could be split into 2, with positions 4 to 6 as one group and positions 7 to 10 as the other.
But since the matching score of Agglomerative clustering is not higher than that of the other 2 algorithms, the Top_10 variable is not split and the labels we have chosen are kept.
The complexity and size of the data, together with the fact that the data is historical and many features are inter-dependent, account for the low matching scores across all clustering algorithms.