Get ready for another text-heavy section.
Think of some things in your life. Label each one by its unique properties and put them into separate groups. You have just created labeled data. Now imagine trying to group those things without first learning their properties. Would they become harder to sort?
Unsupervised learning is a machine learning approach for exploring data when the outcome variable is unknown or unavailable. Unlike supervised learning, which relies on labeled data (data tagged with specific properties), unsupervised learning algorithms aim to uncover hidden patterns and structures within the data itself.
Evaluating the quality of unsupervised learning models can be challenging since there's no pre-defined 'correct' answer. Here are cases where unsupervised learning is applied:
Cluster Analysis: organizes items into groups (clusters) on the basis of how closely associated they are, giving each item a label afterwards. Common industrial applications include user, article, video, or voice labeling for marketing automation, digital media delivery, etc.
Association Rule Learning: discovers interesting relations between variables in large datasets. A common industrial application for association rule learning is shopping basket analysis — predicting the probability of two products being purchased together.
Anomaly Detection: the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior. Industrial applications for anomaly detection include transaction fraud detection, structural defect detection, text error recognition, etc.
Dimensionality Reduction: reduces the number of features (variables) in a dataset while aiming to retain the most important information. An application of dimensionality reduction is image compression.
Topic Modeling: discovers hidden thematic structures within a collection of unstructured data. It essentially groups items based on the patterns that frequently appear together, revealing underlying topics.
Below are some of the algorithms commonly used for unsupervised learning:
Agglomerative Hierarchical Clustering (AHC): groups data points into clusters. It starts by treating each data point as its own individual cluster and then iteratively merges the most similar clusters until a predefined stopping point is reached. The merging process results in a tree-like structure called a dendrogram. A critical aspect of AHC is defining the distance between clusters during merging. Here are some linkage methods, followed by a short code sketch comparing them:
Single Linkage: considers the distance between the closest data points in two different clusters. It can be susceptible to chaining, where clusters get linked together based on a single close point, potentially creating elongated clusters.
Average Linkage: calculates the average distance between all pairs of data points in two different clusters. It tends to produce clusters with more similar sizes compared to Single Linkage.
Complete Linkage: considers the distance between the farthest data points in two different clusters. It can be less susceptible to chaining but might lead to clusters with very different sizes.
Ward's Method: minimizes the variance (spread) within the newly formed cluster after merging. It focuses on maintaining compact and well-separated clusters.
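To make the linkage choices concrete, here is a minimal sketch using SciPy's hierarchical clustering; the toy two-blob data, the choice of two clusters, and the random seed are assumptions made for illustration.

```python
# A minimal sketch of agglomerative clustering, comparing linkage methods.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(42)
# Two loose blobs of 2-D points as toy data
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

for method in ["single", "average", "complete", "ward"]:
    Z = linkage(points, method=method)               # merge history (the dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes under each linkage

# dendrogram(Z) would draw the tree for the last linkage method (requires matplotlib)
```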
Partitional Clustering: divides a set of data points into a predefined number of clusters. Unlike AHC, which builds clusters hierarchically, partitional clustering assigns each data point to a single cluster in one go.
K-Means: one of the simplest and most common algorithms used in partitional clustering. Here is how the process goes (a minimal code sketch follows these steps):
Choose the number of clusters, k.
K-Means randomly selects k data points as initial cluster centers.
Each data point is assigned to the closest centroid based on a distance metric (e.g., Euclidean distance).
Once all data points are assigned, the centroid of each cluster is recalculated as the mean of the points belonging to that cluster.
Repeat steps 3 and 4 until a stopping criterion is met. This criterion is typically when the centroids no longer move significantly between iterations, indicating that the clusters have stabilized.
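Here is a minimal from-scratch sketch of those steps, using NumPy only; the toy data, k = 2, and the convergence check are assumptions chosen for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (no empty-cluster handling; fine for this toy example)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels), centroids.round(2))
```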
Apriori Algorithm: a popular technique in association rule learning for discovering frequent itemsets (groups of items that appear together frequently in transactions) within a transactional dataset. Here is how Apriori works (see the sketch after these steps):
Apriori starts by identifying individual items that appear frequently enough in the transactions. These frequent single items form the initial layer (L1).
In subsequent iterations, Apriori uses the previous layer to generate candidate itemsets for the next layer. For example, if L1 contains items A and B, it will consider the candidate itemset {A, B} in the next layer (L2).
Apriori then scans the transactions again to count the frequency of each candidate itemset. Any candidate that does not meet the minimum frequency threshold is discarded.
This process of generating candidate itemsets, counting their frequency, and pruning infrequent ones continues until no new frequent itemsets can be found.
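The layer-by-layer process can be sketched in a few lines of plain Python; the toy transactions and the minimum support threshold of 3 below are made-up assumptions.

```python
# A minimal from-scratch sketch of the Apriori layering described above.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # an itemset must appear in at least 3 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent single items
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Generate candidates for the next layer from the previous one, count, and prune
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for layer, itemsets in enumerate(frequent[:-1], start=1):
    print(f"L{layer}:", [set(s) for s in itemsets])
```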
Local Outlier Factor (LOF): a data mining algorithm used for anomaly detection. Several key concepts linked to this method are described below (a short code sketch follows them):
Local Density: LOF focuses on the density of data points around a specific point (p) instead of the entire dataset. This 'local neighborhood' is typically defined by the k nearest neighbors of p.
LOF Score: LOF calculates a score for each data point that reflects its degree of being an outlier. The score is based on the ratio of the local density of p compared to the local densities of its k nearest neighbors. Here is how the score is interpreted:
LOF close to 1: indicates that the local density of p is similar to its neighbors, suggesting it is not an outlier.
LOF less than 1: implies that p resides in a denser region compared to its neighbors, making it potentially interesting but less likely to be an outlier.
LOF significantly greater than 1: signifies that p sits in a much sparser region than its neighbors, suggesting a higher likelihood of being an outlier.
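As an example, here is a minimal sketch using scikit-learn's LocalOutlierFactor; the toy data, the single injected outlier, and n_neighbors=20 are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))          # a dense cluster of normal points
X = np.vstack([X, [[8.0, 8.0]]])        # one obvious outlier far from the cluster

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_  # LOF scores (higher = more anomalous)

print("predicted label of the far point:", pred[-1])
print("its LOF score:", round(scores[-1], 2))  # expected to be well above 1
```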
Principal Component Analysis (PCA): a cornerstone technique in dimensionality reduction that aims to transform a set of potentially correlated features into a new set of uncorrelated features, called principal components (PCs). These PCs capture the most significant information from the original data, often in a lower dimension.
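As a quick illustration, here is a minimal sketch with scikit-learn's PCA; the synthetic correlated data and the choice of a single component are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Two correlated features: the second is roughly twice the first plus noise
X = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=1)            # keep only the first principal component
X_reduced = pca.fit_transform(X)     # project the data onto that component

print(X.shape, "->", X_reduced.shape)                      # (200, 2) -> (200, 1)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```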
Latent Dirichlet Allocation (LDA): used for topic modeling. It analyzes a collection of documents (files) and automatically discovers hidden thematic structures within them. A few things to keep in mind about how the technique works (a short sketch follows these points):
Number of Topics: LDA does not require you to define the topics themselves. Instead, you specify how many topics to discover within the data.
Topic Distribution: LDA estimates the probability distribution of topics for each document. This distribution reflects how dominant each topic is within that document.
Word-Topic Distribution: LDA estimates the probability distribution of words for each topic. This reveals which words are most likely to appear in each topic.
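Here is a minimal sketch with scikit-learn's LatentDirichletAllocation; the tiny document collection, the two-topic setting, and the stop-word handling are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular family pets",
    "stocks and bonds rallied as markets climbed",
    "investors bought stocks after the market news",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)              # word counts per document
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # ask for 2 topics
doc_topics = lda.fit_transform(counts)        # topic distribution for each document

print(doc_topics.round(2))                    # how dominant each topic is per document
for t, weights in enumerate(lda.components_): # word weights for each topic
    top_words = vocab[weights.argsort()[-3:]][::-1]
    print(f"topic {t}:", list(top_words))
```

In this sketch, doc_topics corresponds to the topic distribution per document described above, and lda.components_ to the word-topic distribution.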