An error in the Cupoy website delayed my learning process but we are now back on track!
While PCA is a powerful tool for dimensionality reduction, it has its share of shortcomings. PCA is sensitive to the scale of features, which can bias results toward features with larger variances. It has difficulty capturing nonlinear structure underlying the data. And it trades information loss for reduced dimensionality. Luckily, there are other dimensionality reduction techniques better suited to capturing nonlinear patterns.
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique primarily used for data visualization. Unlike PCA, which focuses on preserving global structure, t-SNE excels at preserving local relationships between data points. Here is how this technique works:
Similarity Calculation: for each data point in the high-dimensional space, calculate its similarity to all other points using a Gaussian kernel.
Probability Distribution: convert these similarities into probabilities, forming a probability distribution.
Low-Dimensional Embedding: assign random positions to data points in the low-dimensional space (usually 2 or 3 dimensions).
Similarity Calculation in Low Dimensions: compute similarities between data points in the low-dimensional space using a Student's t-distribution.
Cost Function: define a cost function to measure the difference between the probability distributions in high-dimensional and low-dimensional spaces.
Optimization: use gradient descent to minimize the cost function, adjusting the positions of data points in the low-dimensional space.
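The six steps above can be sketched in NumPy. This is a deliberately minimal sketch, not scikit-learn's implementation: it uses a fixed Gaussian bandwidth `sigma` instead of the per-point perplexity search, and plain gradient descent without momentum or early exaggeration.

```python
import numpy as np

rng = np.random.default_rng(0)

def tsne_sketch(X, n_components=2, sigma=1.0, lr=100.0, n_iter=200):
    """Minimal t-SNE sketch with a fixed Gaussian bandwidth."""
    n = X.shape[0]

    # Steps 1-2: Gaussian similarities in high dimensions -> probabilities P.
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = np.maximum(P / P.sum(), 1e-12)

    # Step 3: random initial positions in the low-dimensional space.
    Y = rng.normal(scale=1e-2, size=(n, n_components))

    for _ in range(n_iter):
        # Step 4: Student's t similarities in low dimensions -> Q.
        Dy = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
        W = 1.0 / (1.0 + Dy)
        np.fill_diagonal(W, 0.0)
        Q = np.maximum(W / W.sum(), 1e-12)

        # Steps 5-6: gradient of the KL divergence between P and Q,
        # then a gradient-descent step on the embedding positions.
        grad = 4.0 * ((P - Q)[:, :, None] * W[:, :, None]
                      * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y

X = rng.normal(size=(30, 5))   # toy data just to exercise the sketch
Y = tsne_sketch(X)
```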
Perplexity is a measure of how well a probability distribution predicts a sample. In t-SNE it is the main tuning knob: it can be read as the effective number of nearest neighbors each point considers, balancing attention between the local and global aspects of the data.
In the images below, the low-perplexity run (5) focuses on local structure, capturing fine-grained details but potentially missing larger-scale patterns. Moderate perplexity (30~50) provides a good balance between local and global structure, revealing both detailed clusters and the overall data distribution. The highest perplexity (100) emphasizes global structure, showing broader patterns but potentially losing some local details.
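A sweep like the one behind these images can be reproduced with scikit-learn's TSNE by varying `perplexity`. The subsample of 500 digits here is only to keep the three runs quick; plot each embedding with a scatter colored by label to get the images.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]          # subsample to keep the sweep fast

embeddings = {}
for perp in (5, 30, 100):        # low, moderate, and high perplexity
    tsne = TSNE(n_components=2, perplexity=perp,
                random_state=42, init="pca")
    embeddings[perp] = tsne.fit_transform(X)
```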
The plot below visualizes the underlying structure of the digits dataset in a 2D space with t-SNE: it clearly separates the different digit classes and groups most data points of the same class into clusters, although a few outliers sit noticeably far from their clusters.
t-SNE excels at preserving local relationships between data points in the lower-dimensional space, making it ideal for visualizing and understanding complex datasets. It can also capture non-linear relationships between data points, unlike PCA which is limited to linear transformations.
However, t-SNE is computationally expensive, especially for large datasets, which can limit its applicability in certain scenarios. Its results are also sensitive to the random initialization, so different seeds can produce noticeably different visualizations. Finally, while t-SNE is excellent at preserving local structure, it may not accurately represent the global structure of the data.
The following parameters are specified when constructing scikit-learn's TSNE:
n_components sets the dimensionality of the embedded space.
random_state sets the random seed for reproducibility.
init specifies the initialization method, using PCA in this case.
learning_rate sets the learning rate for the gradient descent optimization.
early_exaggeration controls the initial exaggeration of distances to improve separation.
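Put together, a run on the digits dataset with these parameters might look like the following. The specific values (seed 42, learning rate 200, early exaggeration 12) are illustrative choices, not the only valid ones.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

tsne = TSNE(
    n_components=2,           # embed into a 2D space for plotting
    random_state=42,          # fix the seed for reproducibility
    init="pca",               # initialize positions with PCA
    learning_rate=200.0,      # step size for gradient descent
    early_exaggeration=12.0,  # inflate distances early on to separate clusters
)
X_2d = tsne.fit_transform(X)  # one row of (x, y) coordinates per digit image
```

Scatter-plotting `X_2d` colored by `y` then produces the cluster plot discussed below.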
Looking at the t-SNE plot, each cluster looks nonlinear and distinctly separated from the others, though some overlap between classes is visible in this 2D space. For example, a small number of 9s overlap with the 8 cluster, and a couple of outliers sit next to or inside other clusters.
Overall, I would say this t-SNE visualization of the digits dataset effectively reveals the clustering patterns, overlaps, and potential outliers, aiding in understanding the dataset's characteristics.