With so many graph types present in data science, have you ever wondered how taxing it is for data scientists, analysts, and others to read them all?
Unsupervised models, lacking ground truth labels, require alternative evaluation strategies. Unsupervised metrics focus on the coherence and separation of clusters, aiming to uncover meaningful structures within the data. One of these metrics that will be explored today is the silhouette.
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. It provides a quantitative measure of how well each data point fits into its assigned cluster compared to other clusters. Here is how it is computed:
Within-Cluster Distance: for each data point, calculate the average distance to all other points within the same cluster. This represents how tightly clustered the data points are within their group.
Nearest Neighbor Distance: for each data point, find the nearest cluster (excluding its own cluster) and calculate the average distance to all points in that cluster. This represents how close the data point is to the nearest neighboring cluster.
Silhouette Coefficient: the silhouette coefficient for a data point is computed as (b - a) / max(a, b), where a is the average distance within the cluster and b is the average distance to the nearest neighboring cluster.
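The three steps above can be sketched directly in code. The tiny dataset and labels below are illustrative assumptions; the manual result is checked against scikit-learn's silhouette_samples, which implements the same formula.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two small, well-separated clusters in a 1D feature space (assumed data)
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_point(i, X, labels):
    """Silhouette coefficient for point i: (b - a) / max(a, b)."""
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)
    # a: average distance to the other points in the same cluster
    same = (labels == own) & (np.arange(len(X)) != i)
    a = dists[same].mean()
    # b: smallest average distance to the points of any other cluster
    b = min(dists[labels == c].mean() for c in set(labels) if c != own)
    return (b - a) / max(a, b)

manual = np.array([silhouette_point(i, X, labels) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print(np.allclose(manual, reference))  # True
```

Because the two clusters are far apart, every coefficient here lands close to 1.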
The silhouette value ranges from -1 to 1, and can be interpreted as follows:
High Silhouette Value (close to 1): the data point is well-matched to its own cluster and far from neighboring clusters.
Low Silhouette Value (around 0): the data point is close to the decision boundary between two clusters.
Negative Silhouette Value (close to -1): the data point might have been assigned to the wrong cluster.
Before I create my own silhouette plots, let us explore a standard silhouette plot (and scatter plot) down below.
The silhouette plot on the left provides insights into the cohesion and separation of clusters. Each horizontal rectangle represents a cluster, with the silhouette values of its data points stacked vertically. The silhouette coefficient for each data point measures how similar it is to its own cluster compared to other clusters. A higher coefficient indicates a better fit within the cluster. In the example, the yellow and green clusters exhibit a lower average silhouette value, suggesting potential overlap or suboptimal clustering.
The scatter plot on the right visualizes the data points in the two-dimensional feature space, color-coded by cluster membership. It complements the silhouette plot by providing a spatial understanding of the clusters. The distribution of points within each cluster and the overlap between clusters can be observed.
Together, these plots offer valuable insights into the quality of the clustering results. For instance, if a cluster has a large number of data points with low silhouette values, it might indicate the presence of outliers or subclusters within that cluster. Similarly, if the scatter plot shows significant overlap between clusters, it suggests that the chosen number of clusters might not be optimal.
Starting off, the make_blobs function from sklearn.datasets generates isotropic Gaussian blobs for clustering. An isotropic Gaussian blob essentially means that the data points are distributed in a circular shape – or spherical, for multi-dimensional data – around the centroid. The spread or 'scatter' of these points is the same in all directions, making it isotropic.
This instance creates a dataset of 500 samples with 2 features, drawn around 5 centers positioned between -10 and 10 with a standard deviation of 1, all shuffled randomly across the 2D plane.
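A sketch of this data-generation call, using the parameters described above; the random_state value is an assumption for reproducibility, not necessarily the one in the original script.

```python
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=500,         # 500 data points
    n_features=2,          # 2D feature space
    centers=5,             # 5 cluster centers
    center_box=(-10, 10),  # centers drawn between -10 and 10
    cluster_std=1.0,       # isotropic spread of each blob
    shuffle=True,          # shuffle the samples
    random_state=42,       # assumed seed
)
print(X.shape, y.shape)  # (500, 2) (500,)
```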
The first part covers setting up the figure for the subplots and the clustering model for this instance, which is K-means clustering.
The second part covers the for loop that renders each cluster's (blob's) silhouette values into readable plot material.
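A sketch of that per-cluster loop, in the style of scikit-learn's silhouette example: each cluster's sorted silhouette values are stacked into a horizontal band. The cluster count, colormap, and spacing are assumptions.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

n_clusters = 5
X, _ = make_blobs(n_samples=500, n_features=2, centers=n_clusters,
                  center_box=(-10, 10), cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=42).fit_predict(X)
sample_values = silhouette_samples(X, labels)  # one coefficient per point

fig, ax1 = plt.subplots()
y_lower = 10  # small gap below the first band
for i in range(n_clusters):
    values = np.sort(sample_values[labels == i])
    y_upper = y_lower + values.size
    color = cm.nipy_spectral(float(i) / n_clusters)
    # Fill one horizontal band per cluster with its sorted values
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, values,
                      facecolor=color, edgecolor=color, alpha=0.7)
    ax1.text(-0.05, y_lower + 0.5 * values.size, str(i))  # label the band
    y_lower = y_upper + 10  # gap between bands
ax1.set_xlabel("Silhouette coefficient")
ax1.set_ylabel("Cluster label")
```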
The third part covers the setup of the adjacent scatter plot, using a similar for loop so that every cluster is visualized on the plot. As mentioned earlier, the standard practice for silhouette analysis is to place a scatter plot (as a subplot) right next to your silhouette plot.
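A sketch of that adjacent scatter subplot: each point is colored by its K-means cluster, and the centroids are marked with numbered white circles. The styling details are assumptions.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

n_clusters = 5
X, _ = make_blobs(n_samples=500, n_features=2, centers=n_clusters,
                  center_box=(-10, 10), cluster_std=1.0, random_state=42)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
colors = cm.nipy_spectral(labels.astype(float) / n_clusters)
ax2.scatter(X[:, 0], X[:, 1], c=colors, s=30, alpha=0.7)

# Mark each cluster centroid with a white circle and its index
centers = kmeans.cluster_centers_
ax2.scatter(centers[:, 0], centers[:, 1], marker="o",
            c="white", s=200, edgecolor="k")
for i, c in enumerate(centers):
    ax2.scatter(c[0], c[1], marker=f"${i}$", s=50)
ax2.set_xlabel("Feature 1")
ax2.set_ylabel("Feature 2")
```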
The fourth and final part covers the end of the scatter plot creation, a line of script naming the figure that contains our silhouette and scatter subplots, and the average silhouette score for each number of clusters tested.
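A sketch of this final step: titling the figure and reporting the average silhouette score via silhouette_score, which is the mean of the per-sample coefficients. The title wording is an assumption.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

n_clusters = 5
X, _ = make_blobs(n_samples=500, n_features=2, centers=n_clusters,
                  center_box=(-10, 10), cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=42).fit_predict(X)

avg = silhouette_score(X, labels)  # mean silhouette over all 500 samples
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle(f"Silhouette analysis for KMeans with n_clusters = {n_clusters}")
print(f"Average silhouette score: {avg:.3f}")
```

Repeating this for several values of n_clusters and comparing the averages is the usual way to pick the number of clusters.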
Below are the figures visualizing the results of K-means clustering from the script we walked through step by step.