Humans need to group together, lest their sanity erode.
We return to the topic of unsupervised machine learning, last covered on 28th June. To recap, unsupervised learning tackles unlabeled data, training algorithms to identify hidden patterns or structures within the data itself. We recently took a deep dive into the decision boundary method, but that applies to labelled data. So today we will delve into a method of finding patterns in unsupervised learning: clustering.
Clustering algorithms identify groups (clusters) within unlabeled data, where data points lack predefined categories. The core principle is to bring data points with similar characteristics together, creating a more organized and interpretable representation of the data.
Clustering algorithms can be categorized based on their approach to grouping data points:
Hierarchical Clustering: iteratively builds a hierarchy of clusters, starting with individual data points and progressively merging them into larger clusters based on similarity. Imagine a tree structure (a dendrogram) where branches represent clusters and splits represent divisions.
Partitional Clustering: aims to partition the data points into a fixed number of pre-defined clusters (k). K-means is a popular example, where the algorithm strategically positions k centroids (cluster centers) and assigns data points to the closest centroid. K-Medoids is another option that uses medoids (actual data points) as cluster centers.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN): focuses on identifying clusters based on areas of high density (many data points close together) separated by areas of low density (noise or sparse data). DBSCAN is robust to outliers and can handle clusters of irregular shapes.
Probability-Based Clustering: assumes that the data points stem from a mixture of statistical distributions. Algorithms like the Gaussian Mixture Model (GMM) model the data as a combination of Gaussian distributions (bell-shaped curves), where each distribution represents a potential cluster.
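To make these four families concrete, here is a minimal sketch showing how each could be applied to the same synthetic dataset. It assumes scikit-learn is available; the make_blobs data and all parameter choices (number of clusters, eps, min_samples) are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three blobs (purely illustrative).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Hierarchical: merge points bottom-up until three clusters remain.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Partitional: place k=3 centroids and assign each point to the nearest one.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Density-based: grow clusters from dense regions; sparse points are labelled -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Probability-based: fit a mixture of three Gaussians and label each point by
# its most likely component.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)
```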
While all clustering algorithms aim for an optimal grouping, K-means has a specific focus: minimizing the within-cluster variance, i.e., keeping each cluster's points close to their centroid. K-means does not necessarily find the one 'perfect' solution; its objective is simply to group similar data points into clusters. Here is a breakdown of the K-means flow (a from-scratch sketch follows the steps):
Start by randomly selecting k data points (for instance, two in a simple example) as the initial cluster centers (centroids). These centroids represent the tentative 'cores' of our clusters.
For each data point in the dataset, we calculate its distance to each centroid. This distance can be measured using various metrics, such as Euclidean (straight-line) distance. Each data point is then assigned to the cluster of its closest centroid, which creates a preliminary grouping of the data.
Once all data points are assigned to their initial clusters, we recalculate the centroid for each cluster. The new centroid becomes the average position of all data points currently belonging to that cluster. This effectively repositions the cluster centers to better reflect the current grouping of the data.
Steps 2 and 3 are repeated iteratively. With each iteration, the cluster assignments and centroids are adjusted. The algorithm converges when the assignments stabilize, meaning data points no longer switch clusters between iterations. This indicates that the algorithm has found a (locally) optimal grouping of the data.
After convergence, you might evaluate the quality of the resulting clusters using metrics like silhouette analysis (to be covered in the next chapter). Such metrics assess how similar each data point is to its own cluster compared with the other clusters.
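To tie the steps together, here is a from-scratch sketch of the loop above in NumPy. The function name, toy data, and convergence test are illustrative assumptions; production implementations (e.g., scikit-learn's KMeans) add refinements such as multiple restarts and smarter initialization that are omitted here.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Illustrative K-means following the steps above (not a library implementation)."""
    rng = np.random.default_rng(seed)

    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: converged once the centroids (and hence assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Toy usage: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(centroids)
```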
K-means is a powerful clustering algorithm, but there are some important factors to keep in mind when using it:
Impact of Random Initialization: K-means relies on randomly chosen initial centroids to start the clustering process, and different initializations can lead to different clustering results. This is because K-means can get stuck in a local optimum: from a given starting point it finds a good grouping, but not necessarily the best (global) solution for the entire dataset. The sketch after this list demonstrates the effect.
Choosing the Right Number of Clusters: unlike supervised learning, where you have predefined classes, K-means requires you to specify the desired number of clusters (k) beforehand. There is no single 'correct' answer for k; it depends on the characteristics of your data and on whether the resulting clusters reveal meaningful structure, so candidate values of k need to be compared.
Evaluating Cluster Quality: after running K-means, it is crucial to evaluate the quality of the resulting clusters. Visualizations are helpful for an initial assessment, but quantitative metrics (e.g., the silhouette coefficient) provide a more objective measure of cluster cohesion and separation. Also analyze the characteristics of each cluster to understand what distinguishes them from one another; this helps validate the chosen k and the overall effectiveness of the clustering solution.
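As a rough illustration of these points, here is a minimal sketch, assuming scikit-learn and a synthetic make_blobs dataset (both assumptions, not part of the discussion above). It first runs single-initialization K-means from a few different seeds to show that the within-cluster variance (inertia) can differ between runs, then compares the silhouette coefficient across candidate values of k; higher silhouette values indicate tighter, better-separated clusters.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with four moderately overlapping blobs (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=7)

# Random initialization: single-init runs started from different seeds can
# converge to different local optima, visible as different inertia values.
for seed in (0, 1, 2):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")

# Choosing k and judging quality: compare the silhouette coefficient across
# candidate values of k and favour the value(s) with the highest score.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}")
```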