Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally provided labels.
Labelled images are expensive and slow to produce at scale, which is exactly the bottleneck that self-supervised learning avoids.
To summarize, when human annotation of large datasets is impractical, SSL is a powerful alternative that allows models to learn useful representations from unlabeled data by creating their own supervisory signals from the data itself. This avoids the need for expensive and time-consuming manual labeling.
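As a concrete illustration (a minimal sketch, not from the original material), a common self-supervised pretext task for images is rotation prediction: the "labels" are generated from the data itself by rotating each image and asking the model to predict the rotation angle.

```python
import torch
import torch.nn as nn

# Minimal sketch of a self-supervised pretext task: rotation prediction.
# The "labels" (0, 90, 180, 270 degrees) are generated from the data itself,
# so no human annotation is needed. The model and data here are toy
# placeholders, chosen for illustration only.

def make_rotation_batch(images):
    """Rotate each (C, H, W) image by a random multiple of 90 degrees and
    return the rotated images together with the rotation index as label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

# Toy encoder + classification head (assumed architecture, for illustration).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(1 * 32 * 32, 128), nn.ReLU(),
    nn.Linear(128, 4),            # predict one of 4 rotation classes
)

images = torch.randn(8, 1, 32, 32)           # pretend unlabeled images
inputs, targets = make_rotation_batch(images)
loss = nn.CrossEntropyLoss()(model(inputs), targets)
loss.backward()                               # the supervisory signal came from the data itself
```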
Big models trained on large-scale unlabeled data in a self-supervised way are referred to as Pretrained Foundation Models (PFMs). These models are trained to learn general features and patterns from broad datasets, making them suitable for various downstream tasks after fine-tuning. On such downstream tasks, we can use a pretrained foundation model as a backbone and fine-tune it on small-scale, task-specific data.
Pre-training a model equips it with a foundational understanding of the data and world by exposing it to a vast amount of data, often unlabeled. Fine-tuning is the process of tailoring the pre-trained model's knowledge and abilities to a specific task or domain, using smaller, task-relevant labeled datasets.
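A minimal sketch of the fine-tuning step, assuming a recent torchvision is available; the choice of ResNet-18 and a 10-class task are illustrative assumptions, not part of the original text.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch: reuse a pretrained backbone and fine-tune only a small task head.
backbone = models.resnet18(weights="IMAGENET1K_V1")   # pretrained on broad data

for p in backbone.parameters():      # freeze the general-purpose features
    p.requires_grad = False

backbone.fc = nn.Linear(backbone.fc.in_features, 10)  # new task-specific head

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```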
Imagine a chef learning fundamental skills in a big kitchen: chopping vegetables (basic knife skills) and cooking common dishes (pasta, rice, eggs). When the same chef moves on to work in a sushi restaurant and receives the appropriate training, they can apply those knife skills to cutting fish and their heat control to cooking perfect rice.
Example
Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks. A PFM is trained on large-scale data which provides a reasonable parameter initialization for a wide range of downstream applications.
Paradigms
AI models, especially machine learning models, cannot directly interpret raw data such as text, images, or audio in their original form. Instead, they require input data to be converted into numerical representations — called features — that capture the relevant patterns or properties of the data.
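For example (a toy sketch, not tied to any particular library), text can be mapped to integer token IDs and images to arrays of pixel values before a model ever sees them:

```python
import numpy as np

# Toy illustration: raw data must become numbers before a model can use it.

# Text -> integer token IDs via a small, hand-built vocabulary (illustrative only).
sentence = "the cat sat on the mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
token_ids = [vocab[word] for word in sentence.split()]
print(token_ids)          # [4, 0, 3, 2, 4, 1]

# Image -> normalized array of pixel intensities (random pixels stand in for a real image).
image = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
features = image.astype(np.float32) / 255.0   # values scaled to [0, 1]
```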
An encoder is a module that takes input data (like a sentence or an image) and transforms it into a hidden representation (often a vector or a set of vectors). The idea is to capture the essential features or meaning of the input.
Imagine a secret writer (encoder) encrypting a detailed letter with a compact cipher, only the essential symbols, that hides the true content. The output is the code (encoded representation), which one can store or send on.
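A minimal encoder sketch in PyTorch; the sizes and layers are arbitrary choices for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

# Toy encoder: compresses a 784-dimensional input (e.g. a flattened 28x28 image)
# into a 32-dimensional hidden representation. Sizes are illustrative assumptions.
class Encoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)      # the compact "essence" of the input

z = Encoder()(torch.randn(1, 784))
print(z.shape)                  # torch.Size([1, 32])
```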
A decoder is a module that takes the encoded representation and produces some kind of output, such as another sentence, a label, or an image.
Continuing from the last analogy, imagine receiving the cipher (encoded message) from before with no context of what the original looked like. Using a code breaker (decoder), you follow cipher rules to decode and reconstruct the full letter. If done correctly, you get a recovered message (reconstructed data).
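Continuing the sketch above, a matching toy decoder expands the 32-dimensional code back into the original 784-dimensional space (again, all sizes are illustrative assumptions).

```python
import torch
import torch.nn as nn

# Toy decoder: reconstructs a 784-dimensional output from a 32-dimensional code,
# mirroring the toy encoder above.
class Decoder(nn.Module):
    def __init__(self, hidden_dim=32, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(),
            nn.Linear(128, output_dim),
        )

    def forward(self, z):
        return self.net(z)          # the "recovered message"

x_hat = Decoder()(torch.randn(1, 32))   # decode a 32-dimensional code
print(x_hat.shape)                      # torch.Size([1, 784])
```

Chaining the encoder and decoder together and training them to reconstruct the input gives the classic autoencoder setup.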
For example, GPT is a decoder-only architecture that is designed to generate human-like text by predicting the next word (token) in a sequence.
Imagine a decoder-only model as a single novelist doing real-time autocompletion, weaving what they have already seen directly into what they produce next.
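A hedged sketch of the decoder-only idea of next-token prediction; the tiny random-weight model below is a stand-in (it averages embeddings rather than using causal self-attention) and is not GPT itself.

```python
import torch
import torch.nn as nn

# Toy next-token predictor: at each step the model sees everything generated
# so far and outputs a distribution over the vocabulary for the next token.
# Vocabulary size, dimensions, and random weights are illustrative assumptions.
vocab_size, dim = 100, 16
embed = nn.Embedding(vocab_size, dim)
to_logits = nn.Linear(dim, vocab_size)

tokens = torch.tensor([[5, 42, 7]])                 # the "story so far"
for _ in range(3):                                  # generate 3 more tokens greedily
    hidden = embed(tokens).mean(dim=1)              # crude summary of the context
    next_token = to_logits(hidden).argmax(dim=-1)   # pick the most likely next token
    tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)
print(tokens)                                       # original tokens plus 3 predicted ones
```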
An encoder-decoder architecture combines the two modules: the encoder first compresses the input into a hidden representation, and the decoder then generates the output, such as a translation or a summary, from that representation.
Imagine a book translation service with two specialists. The reader (encoder) sits down with the original book, carefully digests every sentence, paragraph, and nuance, and converts it into a compact 'essence' summary. The writer (decoder) takes that essence summary and crafts a brand-new translated version in the target language, full of correct grammar, style, and context.
T5 (Text-To-Text Transfer Transformer) is an encoder-decoder transformer designed to handle a wide range of NLP tasks by treating them all as text-to-text problems. This eliminates the need for task-specific architectures because T5 converts every NLP task into a text generation task.
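A sketch of the text-to-text idea using the Hugging Face transformers library; the model name and prompt format follow its commonly documented usage, but treat the exact details as assumptions rather than part of the original material.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Every task is phrased as "text in, text out"; the task is named in the prompt itself.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```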
Pre-trained models in NLP
A language model is a probabilistic model of a natural language. In NLP, the language modeling task is to predict the next word given a sequence of words, i.e., to model the probability distribution over words given their past context, p_θ(w_t | w_1, w_2, …, w_{t-1}).
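Written out for a whole sequence, this is the standard chain-rule factorization used by autoregressive language models:

```latex
p_\theta(w_1, \dots, w_T) = \prod_{t=1}^{T} p_\theta\bigl(w_t \mid w_1, \dots, w_{t-1}\bigr)
```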
The Bidirectional Encoder Representations from Transformers (BERT) is a language model that uses a bidirectional transformer architecture to understand the context of words in text. Its primary objectives include:
Masked Language Modeling (MLM): predict the original word(s) that were masked in the input sequence. A portion of the input tokens (words) is randomly masked, and the model is tasked with predicting the masked tokens based on the surrounding context. This helps the model learn contextual relationships between words, which is crucial for understanding the meaning of sentences.
Next Sentence Prediction (NSP): predict whether one sentence is a logical continuation of another. The model receives two sentences as input and is trained to determine whether the second sentence is a valid continuation of the first. This helps the model understand relationships between sentences, which is important for tasks like question answering or document classification.
Unlike traditional language models that process text sequentially (left-to-right or right-to-left), BERT considers both left and right context simultaneously. This allows it to capture more nuanced relationships between words and improve performance on various natural language processing (NLP) tasks.
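A toy sketch of how MLM training data can be constructed; the 15% masking rate follows the BERT paper, but the whitespace "tokenizer" and the simplified masking scheme below are illustrative assumptions (real BERT uses subword tokens and a more involved 80/10/10 replacement rule).

```python
import random

# Toy masked-language-modeling data construction: randomly hide ~15% of tokens
# and keep the originals as the prediction targets.
def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(tok)        # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)       # no loss is computed at unmasked positions
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence)
print(masked)
print(labels)
```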
A Residual Network (ResNet) uses convolutional layers to extract initial features from the input image, which are then passed through a stack of residual blocks.
Looking deeper into ResNet, its skip connection is a direct path from the input x of a block to its output F(x). This pathway allows information to bypass intermediate layers within the block and directly contribute to the output. The core idea is to make it easier for the network to learn identity mappings, preventing information loss and vanishing gradients in deeper networks.
ResNets are preferred over standard CNNs because they address the problem of vanishing gradients in very deep networks. By introducing skip connections, ResNets allow gradients to flow more easily through the network, enabling the training of extremely deep models without degradation of performance.
Imagine a multilane highway with numerous exits (the layers of a neural network) that also has an express lane (the skip connection) letting cars bypass local exits and merge further down the road. Cars in the express lane avoid getting stuck in local traffic jams (vanishing gradients) or taking too many detours (exploding activations), so traffic (the training signal) flows smoothly from start to finish, making very deep 'highways' (networks) easier to train without losing the original information.
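A minimal residual block sketch in PyTorch; channel counts are arbitrary, and the real ResNet blocks additionally use batch normalization and downsampling variants.

```python
import torch
import torch.nn as nn

# Minimal residual block: the output is F(x) + x, so the skip connection lets
# the block default to an identity mapping and lets gradients flow past it.
class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))   # F(x): the learned residual
        return self.relu(out + x)                    # skip connection adds x back

y = ResidualBlock()(torch.randn(1, 64, 32, 32))
print(y.shape)    # torch.Size([1, 64, 32, 32])
```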
CNN to Transformer process
BERT (Bidirectional Encoder Representations from Transformers) is a pretrained, encoder-only Transformer model that produces contextualized vector embeddings for each token by attending to both left and right context. It is trained on large unlabeled corpora via masked language modeling (randomly masking tokens and predicting them) and next-sentence prediction, enabling it to be fine-tuned with minimal additional parameters for diverse downstream NLP tasks.
Imagine BERT's tasks as:
Masked language modeling (MLM): an expert who has practiced filling in missing words in sentences; when a word in a phrase is blanked out, they instantly supply the right word by looking at both what comes before and after the blank. This is how BERT learns to predict hidden tokens using full bidirectional context.
Next-sentence prediction (NSP): the same expert also reads two snippets of text and judges whether the second truly follows the first in a story, like checking whether two newspaper paragraphs belong together. This is how BERT learns how sentences link up.
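A sketch of obtaining contextual embeddings from a pretrained BERT with the Hugging Face transformers library; the model name and calls follow its commonly documented usage, treated here as assumptions rather than part of the original material.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per token, each built from both left and right context.
print(outputs.last_hidden_state.shape)   # (1, number_of_tokens, 768)
```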
Annotation in CV
CLIP (Contrastive Language-Image Pre-training) is a model that jointly trains an image encoder and a text encoder on large collections of image-caption pairs so that matching images and texts end up close together in a shared embedding space. This alignment lets CLIP perform tasks such as zero-shot image classification by comparing an image's embedding with the embeddings of candidate text prompts.
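A toy sketch of the contrastive idea behind CLIP: image and text embeddings are normalized, and matching pairs are pushed to have the highest similarity within the batch. The random vectors and fixed temperature below are stand-ins, not CLIP's actual encoders or hyperparameters.

```python
import torch
import torch.nn.functional as F

# Toy contrastive (InfoNCE-style) objective over a batch of image-text pairs.
# Random vectors stand in for the outputs of an image encoder and a text encoder.
batch, dim = 4, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

logits = image_emb @ text_emb.T / 0.07           # pairwise similarities / temperature
targets = torch.arange(batch)                    # the i-th image matches the i-th text
loss = (F.cross_entropy(logits, targets) +       # image -> text direction
        F.cross_entropy(logits.T, targets)) / 2  # text -> image direction
print(loss)
```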
Reduce the size of the training set to 10% of the original. Do you notice overfitting?
Retrain using augmented data. Does test accuracy improve?
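A sketch for these exercises, assuming an image-classification setup with torchvision; the dataset (CIFAR-10) and the particular transforms are illustrative choices, not requirements from the original text.

```python
import torch
from torchvision import datasets, transforms

# Keep only 10% of the training set to provoke overfitting.
train_set = datasets.CIFAR10(root="data", train=True, download=True,
                             transform=transforms.ToTensor())
subset_size = len(train_set) // 10
small_train = torch.utils.data.Subset(train_set, range(subset_size))

# Augmented version of the same 10% subset: random crops and flips add variety.
augmented = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set_aug = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=augmented)
small_train_aug = torch.utils.data.Subset(train_set_aug, range(subset_size))
```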