Modality refers to the way in which something is expressed. Modalities can be placed along a spectrum from raw to abstract:
Raw: Modalities detected close to a sensor, such as speech recordings from a microphone or images captured by a camera.
Abstract: Those farther away from sensors, such as language extracted from speech recordings, objects detected from images, or even abstract concepts like sentiment intensity and object categories.
Modality gap refers to the differences in data characteristics and feature distributions across different modalities (e.g., images, text, audio). Because each modality encodes information in distinct formats and with different structures, their representations often lie in separate feature spaces that are not directly comparable.
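As a rough illustration of the modality gap, the sketch below (assuming PyTorch, with made-up feature dimensions rather than any particular encoder) shows why raw image and text features are not directly comparable, and how learned projection heads are one common way to map both into a shared embedding space:

```python
import torch
import torch.nn as nn

# Hypothetical feature dimensions: encoders for different modalities usually
# produce vectors of different sizes with different statistics.
image_features = torch.randn(8, 2048)   # e.g., pooled CNN features for 8 images
text_features = torch.randn(8, 768)     # e.g., pooled transformer features for 8 captions

# The raw features live in separate spaces: their dimensions do not even match,
# so a direct measure such as cosine similarity is undefined across modalities.
# A common remedy is to learn projection heads into a shared d-dimensional space.
shared_dim = 256
image_proj = nn.Linear(2048, shared_dim)
text_proj = nn.Linear(768, shared_dim)

img_emb = nn.functional.normalize(image_proj(image_features), dim=-1)
txt_emb = nn.functional.normalize(text_proj(text_features), dim=-1)

# After projection (and training), cross-modal similarity becomes meaningful.
similarity = img_emb @ txt_emb.T        # (8, 8) image-text similarity matrix
print(similarity.shape)
```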
Multimodal AI refers to AI systems that are designed to process and integrate multiple types of data, or modalities, to enhance understanding and performance in various tasks. It enables AI models to learn from multiple types of data (e.g., image, text, audio, video) rather than just one.
Imagine listening to a symphony orchestra in which each instrument represents a different type of data. Just as a conductor integrates rhythm, melody, and dynamics, multimodal AI integrates information across multiple sources to make smarter predictions, answer questions, or create richer outputs.
From a research perspective, multimodality entails the computational study of heterogeneous and interconnected modalities. There are three foundational principles in multimodal learning:
Modalities are heterogeneous: Information present in different modalities will often show diverse qualities, structures, and representations; a small code sketch after this list illustrates these differences.
Element: Each modality is typically composed of a set of basic elements, such as characters/words (text) versus pixels (images).
Distribution: Differences in frequencies and likelihoods of elements.
Structure: Can be static, temporal, spatial, or hierarchical.
Information: Total information content present in each modality. Can be measured by different information-theoretic metrics (e.g., entropy, density, information overlap, range).
Noise: Manifests as uncertainty, signal-to-noise ratio, or missing data.
Relevance: Each modality shows different relevance toward specific tasks and contexts.
Modalities are connected: Modalities are not independent entities; although heterogeneous, they are often connected through shared or complementary information.
Modalities interact: When integrated for a task, different modalities interact and combine their information in different ways.
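Returning to the heterogeneity principle, the sketch below is a small, self-contained illustration (hypothetical token IDs and a random image, assuming PyTorch): the two modalities differ in element type, structure, and empirical distribution, and a crude histogram entropy hints at their differing information content:

```python
import torch

# Text: a discrete, variable-length sequence of symbolic elements (token IDs).
text_tokens = torch.tensor([101, 7592, 2088, 7592, 102])   # hypothetical token IDs
# Image: a dense, fixed-size grid of continuous elements (pixel intensities).
image = torch.rand(3, 224, 224)                             # random RGB image in [0, 1]

# Elements and structure differ: a sequential int64 symbol stream vs. a spatial float32 grid.
print(text_tokens.dtype, tuple(text_tokens.shape))          # torch.int64 (5,)
print(image.dtype, tuple(image.shape))                      # torch.float32 (3, 224, 224)

def empirical_entropy(values, bins=32):
    # Rough information estimate: entropy of a histogram over the elements.
    hist = torch.histc(values.float(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log2()).sum().item()

# Distributions and information content also differ between the two modalities.
print("text entropy (bits):", empirical_entropy(text_tokens))
print("image entropy (bits):", empirical_entropy(image.flatten()))
```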
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs, learning a joint embedding space in which matching images and captions lie close together.
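The core training signal is a symmetric contrastive loss over a batch of image-text pairs. The sketch below is a simplified version of that objective (assuming PyTorch and hypothetical, already-projected embeddings), not OpenAI's actual implementation:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize embeddings so that dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Pairwise similarities between every image and every text in the batch.
    logits = img_emb @ txt_emb.T / temperature            # (N, N)

    # The i-th image matches the i-th text: targets are the diagonal entries.
    targets = torch.arange(logits.size(0))

    # Symmetric cross-entropy: classify the right text for each image
    # and the right image for each text.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Hypothetical batch of 16 paired embeddings in a shared 512-d space.
loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```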
BLIP (Bootstrapping Language-Image Pre-training) is a related vision-language model that unifies image-text understanding and generation, bootstrapping its training data by generating and filtering synthetic captions.
A multimodal large language model (MLLM) is an AI model trained to process and reason over multiple data modalities (e.g., text, images, audio, video, sensor data) using a unified architecture, often based on a transformer. It extends traditional language models, which handle only text, by incorporating cross-modal understanding and generation.
Imagine a detective trying to solve a mystery. They do not just read witness statements; they bring different types of evidence together, spotting connections no single specialist could see alone. By cross-referencing evidence such as a suspect's voice tone, the angle of a shadow in a photograph, and discrepancies in a written alibi, they piece together what happened and explain it clearly to the jury.
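One widespread MLLM design (a simplified, hypothetical sketch rather than any specific model's architecture) encodes the image into a few visual tokens, projects them into the language model's embedding space, and prepends them to the text tokens before decoding:

```python
import torch
import torch.nn as nn

class TinyMLLM(nn.Module):
    """Toy multimodal LM: visual tokens are projected and prepended to text tokens."""

    def __init__(self, vocab_size=1000, d_model=256, n_visual_tokens=4):
        super().__init__()
        # Stand-in vision encoder: maps an image to a few visual token embeddings.
        self.vision_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, n_visual_tokens * d_model)
        )
        self.n_visual_tokens = n_visual_tokens
        # Projector aligning visual features with the language model's embedding space.
        self.projector = nn.Linear(d_model, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Stand-in "language model": a small transformer over the joint token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_ids):
        b = image.size(0)
        visual = self.vision_encoder(image).view(b, self.n_visual_tokens, -1)
        visual = self.projector(visual)                     # (B, V, d_model)
        text = self.token_emb(text_ids)                     # (B, T, d_model)
        joint = torch.cat([visual, text], dim=1)            # (B, V + T, d_model)
        return self.lm_head(self.lm(joint))                 # per-position vocab logits

# Hypothetical inputs: a batch of 2 tiny RGB images and 2 short token sequences.
logits = TinyMLLM()(torch.rand(2, 3, 32, 32), torch.randint(0, 1000, (2, 6)))
print(logits.shape)   # torch.Size([2, 10, 1000]): 4 visual + 6 text positions
```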