They say less is more, but not that more is sometimes less. This is either a 'me' problem or the work of a 'hard work' promotional campaign in action.
High-dimensional datasets present several challenges. Algorithms operating on high-dimensional data often suffer from the 'curse of dimensionality,' leading to increased computational costs and longer processing times. Models trained on high-dimensional data are more prone to overfitting as they can easily find complex patterns in the noise. High-dimensional spaces tend to be sparsely populated, making it difficult to find meaningful relationships between data points. Not to mention that our 3D brains are incapable of directly visualizing data beyond three dimensions, hindering exploratory data analysis.
Dimensionality reduction is the process of transforming high-dimensional data into a lower-dimensional space while preserving essential information. By reducing the number of features, dimensionality reduction addresses several challenges:
Improved Model Performance: fewer features can lead to simpler models that generalize better and are less prone to overfitting.
Enhanced Computational Efficiency: models trained on lower-dimensional data typically require fewer computational resources.
Facilitated Visualization: reducing dimensions to 2 or 3 allows for visual exploration of complex data.
Noise Reduction: irrelevant or redundant features can be eliminated, improving data quality.
In the example below, the original dimensionality (k) of this photo is 512. When its dimensionality is reduced to 16, the file size shrinks and the image appears slightly blurry to the human eye, but to a machine learning model it still retains enough contours and features for analysis.
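As a rough illustration of this kind of compression, the sketch below applies PCA to a grayscale photo, treating each row of pixels as a sample and keeping 16 components before reconstructing the image. The file name photo.png and the 512×512 size are placeholder assumptions rather than details from the original example.

```python
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA

# Load a grayscale photo as a 2D array; "photo.png" is a placeholder path
# for a 512x512 image like the one described above.
image = np.asarray(Image.open("photo.png").convert("L"), dtype=float)

# Treat each row of pixels as a sample and compress the 512 columns to 16 components.
pca = PCA(n_components=16)
compressed = pca.fit_transform(image)              # shape (512, 16)

# Reconstruct an approximation of the original image from those 16 components.
reconstructed = pca.inverse_transform(compressed)  # shape (512, 512), slightly blurry

print(f"Variance retained by 16 components: {pca.explained_variance_ratio_.sum():.2%}")
```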
Common dimensionality reduction techniques include (a short code comparison follows the list):
Feature Selection: identifies and retains only the most relevant features.
Principal Component Analysis (PCA): creates new uncorrelated features that capture the maximum variance in the data.
t-Distributed Stochastic Neighbor Embedding (t-SNE): preserves local relationships between data points in a lower-dimensional space, primarily used for visualization.
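To make the three approaches concrete, the sketch below runs each of them on the same data using scikit-learn. The iris dataset and the choice of two output dimensions are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature selection: keep the 2 features most strongly associated with the target.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# PCA: build 2 new, uncorrelated components that capture maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: embed into 2 dimensions while preserving local neighbourhoods (visualization only).
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_selected.shape, X_pca.shape, X_tsne.shape)  # (150, 2) (150, 2) (150, 2)
```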
On a more technical note, PCA is a statistical procedure that transforms a set of correlated variables into a set of uncorrelated variables called principal components.
Instead of discarding features, PCA creates new principal components. These components are linear combinations of the original features. By selecting only the most important principal components, we can represent the data effectively in a lower-dimensional space.
PCA is particularly useful when dealing with data that has high dimensionality and correlated features. It can help to reduce noise, improve computational efficiency, and enhance model performance by focusing on the most informative aspects of the data.
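To make the "linear combination" and "uncorrelated" claims concrete, here is a small sketch using scikit-learn's PCA. The wine dataset and the choice of three components are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The wine dataset has 13 correlated features; standardize them, then keep 3 components.
X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

# Each principal component is a linear combination of the 13 original features:
# row i of components_ holds the weights defining component i.
print(pca.components_.shape)  # (3, 13)

# The new features are uncorrelated: their covariance matrix is (numerically) diagonal.
print(np.round(np.cov(X_reduced, rowvar=False), 3))
```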
The process of PCA involves the following steps (a worked sketch in code follows the list):
Standardizing the data so that all features contribute equally. Because PCA seeks the directions of maximum variance, features with larger variances would otherwise dominate the components. Standardization scales each feature to zero mean and unit variance.
Calculating the covariance matrix to measure the relationships between features.
Performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and eigenvectors.
Selecting the top k eigenvectors corresponding to the largest eigenvalues, where k is the desired dimensionality of the new subspace.
Projecting the original data onto the subspace defined by the selected eigenvectors to obtain the transformed data.
The resulting principal components are linear combinations of the original features and are uncorrelated, maximizing the variance captured in the lower-dimensional space.
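The steps above can be written out directly. The sketch below is a bare-bones NumPy implementation, assuming the iris dataset and k = 2; in practice a library implementation such as scikit-learn's PCA would normally be preferred.

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# 1. Standardize: zero mean and unit variance for every feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalue decomposition (eigh is appropriate for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Select the top k eigenvectors by descending eigenvalue.
k = 2
order = np.argsort(eigenvalues)[::-1]
top_k = eigenvectors[:, order[:k]]

# 5. Project the standardized data onto the k-dimensional subspace.
X_projected = X_std @ top_k

print(X_projected.shape)                           # (150, 2)
print(eigenvalues[order[:k]] / eigenvalues.sum())  # variance explained by each component
```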
Prematurely applying dimensionality reduction techniques like PCA can lead to information loss and hinder model performance. It is recommended to explore feature selection or engineering methods initially to identify and potentially remove irrelevant features. Once the dataset is refined, PCA can be considered to further reduce dimensionality while preserving essential information. By carefully evaluating the impact of PCA on model accuracy, you can determine its suitability for the specific problem.
The digits dataset contains 10 classes, the handwritten digits 0 through 9. For simplification purposes, we will extract 3 of those classes before performing PCA.
Further simplifying the digits dataset, we shrink the original high-dimensional data – 64 dimensions for each 8×8 image – down to a 3D space represented by 3 principal components. Doing so allows the projected data to be visualized on a 3D scatter plot later on; a sketch of this pipeline follows below.
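A minimal sketch of this pipeline is shown below, assuming scikit-learn's digits dataset and matplotlib. The three classes kept here (0, 1, and 2) are an illustrative choice rather than the specific subset behind the plot discussed next.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the digits and keep only three classes; 0, 1, and 2 are an illustrative choice.
digits = load_digits()
mask = np.isin(digits.target, [0, 1, 2])
X, y = digits.data[mask], digits.target[mask]

# Standardize the 64 pixel features, then reduce them to 3 principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_std)
print(f"Variance explained by 3 components: {pca.explained_variance_ratio_.sum():.2%}")

# 3D scatter plot of the projected data, coloured by digit label.
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection="3d")
scatter = ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y, cmap="viridis", s=15)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
fig.colorbar(scatter, ax=ax, label="digit")
plt.show()
```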
Looking at the 3D scatter plot below, we can see that PCA has captured some of the underlying structure of the data. While there is some overlap between the clusters, particularly for certain digits, the overall separation is clearly visible. This suggests that the first three principal components explain a significant portion of the variance in the data and are effective at discriminating between the different digits.