Generative AI (generative AI, GenAI, or GAI) is a type of artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts. These models learn the patterns and structure of their input training data and then generate new data with similar characteristics.
Discriminative and generative models are two types of machine learning models with different goals. Discriminative models focus on classifying data by learning a decision boundary between different classes, while generative models aim to learn the underlying data distribution to generate new, similar data.
AI is the theory and development of computer systems able to perform tasks that normally require human intelligence, whereas machine learning (ML) is a subfield of AI that gives computers the ability to learn without being explicitly programmed.
Transformers are the backbone architecture for many state-of-the-art GenAI models. Their self-attention mechanism allows models to capture global context, scale effectively, and handle diverse generative tasks across modalities with high performance.
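To make the mechanism concrete, here is a minimal numpy sketch of single-head scaled dot-product self-attention; the matrix names and dimensions are illustrative, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)       # each token attends to every other token
    return weights @ V                       # context-aware token representations

# Toy usage: 4 tokens, model dimension 8, head dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```

Each output row mixes information from every input position, weighted by learned similarity, which is what lets transformers capture global context regardless of distance in the sequence.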
Several techniques and properties underpin modern large generative models:
Large-scale pretraining
Scalability
Instruction tuning
Reinforcement Learning from Human Feedback (RLHF)
In-context learning (see the example after this list)
Knowledge distillation
Retrieval-augmented generation (RAG)
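Of these, in-context learning is the easiest to illustrate: the model is adapted with examples placed directly in the prompt, with no weight updates. The few-shot prompt below is purely illustrative:

```python
# In-context learning: the task is defined by demonstrations in the prompt alone;
# no fine-tuning or weight updates are involved. The prompt text is illustrative.
few_shot_prompt = """Classify the sentiment as positive or negative.

Review: The plot was gripping from start to finish.
Sentiment: positive

Review: I walked out halfway through.
Sentiment: negative

Review: A beautiful, moving film.
Sentiment:"""

# Sending this prompt to any instruction-following LLM should yield "positive";
# the two demonstrations define the task without any training.
print(few_shot_prompt)
```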
Given an input sequence of words, language modeling is the task of predicting the next word; a toy sketch follows the list of variants below.
Variants:
BERT: masked language modeling
T5: span corruption
GPT: predicting the next word (token)
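A toy way to see next-word prediction is a bigram model estimated from counts; real language models replace the count table with a neural network trained on billions of tokens. The corpus below is illustrative:

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model would be trained on vastly more text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram frequencies: how often each word follows each context word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(word):
    """Estimate P(next | word) from bigram counts."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_distribution("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```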
Image generation is the task of creating realistic or meaningful images from input data using AI models.
Autoencoders are an unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
The encoder learns a mapping from the data x to a low-dimensional latent space z, and the decoder learns a mapping back from the latent z to a reconstructed observation x̂.
An autoencoder is a form of compression!
The reconstruction loss forces the latent representation to capture (or encode) as much 'information' about the data as possible.
The encoder reduces the dimensionality of input data into a lower-dimensional latent space.
The compressed representation learned by autoencoders can be used for tasks requiring concise data representation.
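A minimal sketch of this encoder/decoder structure, assuming PyTorch; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps data x down to a low-dimensional latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: maps latent z back to a reconstruction x̂
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.randn(16, 784)                 # a batch of flattened 28x28 "images"
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction loss
loss.backward()
print(loss.item())
```

Minimizing the reconstruction loss is what forces the 32-dimensional latent z to retain as much information about the 784-dimensional input as possible.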
Variational autoencoders (VAEs) are a class of generative and probabilistic models that extend the concept of autoencoders (AEs) by introducing probabilistic encoding and decoding.
AEs map inputs to a fixed latent representation, whereas VAEs model the latent space as a probability distribution.
VAEs typically model the latent space as a Gaussian distribution: the encoder outputs a mean and a variance, and the latent z is sampled from the resulting distribution.
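A sketch of this probabilistic encoding, assuming PyTorch; the reparameterization trick below is the standard way to sample z while keeping the operation differentiable, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

# VAE encoder head: instead of a single fixed z, output the mean and
# log-variance of a Gaussian, then sample z so that gradients still flow.
class GaussianEncoderHead(nn.Module):
    def __init__(self, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # noise drawn from a standard normal
        z = mu + std * eps           # reparameterization: differentiable sampling
        return z, mu, logvar

head = GaussianEncoderHead()
z, mu, logvar = head(torch.randn(16, 128))
print(z.shape)  # torch.Size([16, 32])
```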
Kullback–Leibler divergence (KL divergence) is a fundamental concept in information theory and statistics. It measures how one probability distribution diverges from a second, expected probability distribution.
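For reference, the discrete-case definition, together with the closed form of the KL term used in the VAE loss when the encoder outputs a diagonal Gaussian and the prior is a standard normal:

```latex
% Discrete-case definition of KL divergence:
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

% Closed form per latent dimension in the VAE loss, where the encoder
% outputs \mu and \sigma^2 and the prior is \mathcal{N}(0, 1):
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big)
    = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log\sigma^2 - 1\right)
```

In a VAE, this KL term regularizes the learned latent distribution toward the standard Gaussian prior, while the reconstruction term keeps the latents informative.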
Generative adversarial networks (GANs) learn to generate new data similar to a training dataset. They consist of two networks, a generator and a discriminator, that play a two-player game:
Generator: Takes in random values sampled from a normal distribution and produces a new sample.
Discriminator: Tries to distinguish between real and generated samples.
The first step in training a GAN is to train its discriminator. The discriminator is a binary classifier with two classes: real and fake. The data for the real class is the given training data; the data for the fake class is produced by the generator.
To elaborate on the last part: training the generator means producing samples that the discriminator misclassifies as real. The generator's weights are updated through the (frozen) discriminator so that its outputs become progressively harder to distinguish from real data.
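A minimal single-training-step sketch of this two-player loop, assuming PyTorch; both networks and all hyperparameters are illustrative stand-ins:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)  # stand-in for a batch of real training data

# 1) Train the discriminator: real -> label 1, fake -> label 0.
z = torch.randn(32, latent_dim)
fake = G(z).detach()  # detach so this step does not update the generator
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# 2) Train the generator: its samples should be classified as real (label 1);
#    gradients flow back through the frozen discriminator into G.
z = torch.randn(32, latent_dim)
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(loss_d.item(), loss_g.item())
```

Note the detach() in the discriminator step and the frozen-discriminator gradient path in the generator step; keeping these two updates in balance is exactly the difficulty described next.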
However, a main disadvantage of GANs is their training instability. GANs are difficult to train and often suffer from issues like mode collapse, where the generator produces only a limited variety of outputs.
Because GANs involve two networks (generator and discriminator) in a minimax game, balancing them is tricky:
If the discriminator becomes too strong, the generator can’t learn.
If the generator tricks the discriminator too easily, training stops improving.
The process of training and deploying a diffusion model can be broken down into three key stages (a noising sketch in code follows the list):
The forward diffusion process, wherein an image from the training dataset is gradually corrupted with noise over many steps until it becomes pure noise, usually Gaussian.
The reverse diffusion process, wherein the model learns to invert each step of the forward diffusion process.
Image generation, wherein the trained model starts from a random sample of Gaussian noise and applies the learned reverse diffusion process to iteratively denoise it into a high-quality output.
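Here is the promised noising sketch, assuming PyTorch; the schedule and tensor shapes are illustrative. It uses the standard closed form that jumps from the clean sample x0 directly to the noised sample x_t:

```python
import torch

# Forward diffusion: gradually add Gaussian noise to a clean sample x0 over T steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (illustrative)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise, noise

x0 = torch.randn(4, 3, 32, 32)    # stand-in for a batch of training images
x_t, noise = q_sample(x0, t=999)  # by the final step, x_t is nearly pure noise

# Training teaches a network to predict `noise` from (x_t, t); generation then
# runs the learned reverse process from pure noise back to an image.
print(x_t.std().item())
```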
GAN vs Diffusion
GANs generate a sample in a single forward pass, so inference is fast, but training is unstable and prone to mode collapse. Diffusion models train more stably and typically produce higher-quality, more diverse samples, but generation requires many sequential denoising steps, making inference slower.
Retrieval-Augmented Generation (RAG) is a technique that combines the strengths of large language models (LLMs) with information retrieval systems to generate more accurate and relevant responses, especially when dealing with specific, domain-related, or up-to-date information.
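A minimal end-to-end sketch of the idea; the keyword-overlap retriever and the prompt format are deliberate simplifications and assumptions, not any specific library's API (real systems use embedding-based vector search):

```python
# Toy RAG pipeline: retrieve relevant context, then assemble a grounded prompt.
documents = [
    "The Eiffel Tower is 330 metres tall.",
    "GANs consist of a generator and a discriminator.",
    "Diffusion models generate images by iterative denoising.",
]

def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query, documents))
    # Instruct the model to ground its answer in the retrieved context.
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

print(build_prompt("How tall is the Eiffel Tower?"))
# The assembled prompt would then be sent to an LLM for generation.
```

Because the retrieved passages are injected at query time, the model can answer from domain-specific or up-to-date sources without retraining.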