Language modeling (LM) is one of the major approaches to advancing the language intelligence of machines. LM aims to model the generative likelihood of word sequences so as to predict the probabilities of future (or missing) tokens.
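Formally, an autoregressive LM factorizes the probability of a token sequence w_1, ..., w_T with the chain rule, so that training reduces to predicting each next token from its prefix:

```latex
% Chain-rule factorization of a token sequence
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P\!\left(w_t \mid w_1, \dots, w_{t-1}\right)
```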
Research on LM can be divided into four major development stages:
Statistical language models (SLM): Built on statistical learning methods that rose in the 1990s, e.g., n-gram models that predict the next word from a fixed-length context under the Markov assumption.
Neural language models (NLM): Characterize the probability of word sequences with neural networks, e.g., multi-layer perceptrons (MLPs) and recurrent neural networks (RNNs); a toy sketch follows this list.
Pre-trained language models (PLM): Pre-trained on large unlabeled corpora to produce context-aware representations that are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks.
Large language models (LLM): Researchers find that scaling a PLM (e.g., increasing model size or data size) often leads to improved model capacity on downstream tasks.
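To make the NLM idea concrete, here is a minimal toy sketch (assuming PyTorch is available; this is not any particular published architecture): an embedding layer plus an MLP that turns a fixed-length context into a distribution over the next token.

```python
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    """Minimal fixed-context neural language model (embedding + MLP)."""
    def __init__(self, vocab_size: int, context: int = 3, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mlp = nn.Sequential(
            nn.Linear(context * dim, 128),
            nn.Tanh(),
            nn.Linear(128, vocab_size),  # logits over the vocabulary
        )

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, context) token indices
        x = self.embed(context_ids).flatten(start_dim=1)
        return self.mlp(x)  # next-token logits

# Toy usage: probability distribution over the next token given a 3-token context
model = TinyNeuralLM(vocab_size=100)
logits = model(torch.tensor([[5, 17, 42]]))
next_token_probs = torch.softmax(logits, dim=-1)  # sums to 1 over the vocab
```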
LLMs: a hot topic
A Large Language Model (LLM) is a type of AI that can process, understand, and generate human language. It is trained on large datasets of text and code, allowing it to perform a variety of tasks like translation, summarization, and even creative writing.
An LLM's scale, i.e., its number of parameters and the amount of data it is trained on, heavily influences its performance, capabilities, and overall behavior. Generally, larger models tend to perform better, particularly on tasks requiring complex reasoning or an understanding of nuanced language.
However, returns diminish, and the benefits of scaling can be offset by increased computational costs and the potential for undesired behaviors at extremely large scales.
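A common way these trends are summarized quantitatively is the parametric scaling law used in studies such as Hoffmann et al. (Chinchilla), where loss improves as a power law in model size N and training tokens D:

```latex
% Parametric scaling law: loss as a function of model size N and data size D
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

Here E is the irreducible loss and A, B, alpha, beta are fitted constants; each additional doubling of N or D buys a smaller reduction in loss, which is the diminishing return noted above.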
Emergent abilities: capabilities that are absent in smaller models but appear once model scale passes a certain threshold, e.g., in-context learning, instruction following, and step-by-step (chain-of-thought) reasoning.
Notable LLM examples:
OpenAI GPT
LLaMA (Meta AI)
Prompt engineering is a relatively new discipline concerned with developing and optimizing prompts so that LMs can be used effectively across a wide variety of applications and research topics. Prompt engineering skills also help practitioners better understand the capabilities and limitations of LLMs.
Several prompt engineering and decoding approaches include (example prompts for the first three follow this list):
Zero-shot prompting: The prompt contains only the task instruction and input, with no demonstrations; the model must solve the task directly.
Few-shot prompting: The prompt includes a handful of input-output demonstrations before the new input, steering the model through in-context learning.
Chain-of-thought (CoT) prompting: The prompt elicits intermediate reasoning steps before the final answer, which improves performance on multi-step reasoning tasks.
Self-consistency (CoT-SC): Multiple chain-of-thought completions are sampled for the same prompt, and the most consistent (majority-vote) final answer is selected.
Greedy decoding: At each step the single most probable next token is chosen; deterministic and fast, but it can miss globally better continuations.
Beam search: Several of the most probable partial sequences (beams) are kept and extended at each step, trading extra computation for higher-likelihood outputs.
Tree-of-Thought (ToT): Reasoning is framed as search over a tree of intermediate "thoughts" that are generated, self-evaluated, and explored with lookahead and backtracking.
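For illustration, the snippet below builds example prompt strings for the first three techniques on a made-up arithmetic question; the task and demonstrations are hypothetical, not drawn from any benchmark.

```python
question = "A shop sells pens in packs of 12. How many pens are in 7 packs?"

# Zero-shot: the task instruction and input only, no demonstrations.
zero_shot = f"Answer the question.\nQ: {question}\nA:"

# Few-shot: a handful of input-output demonstrations before the new question.
few_shot = (
    "Q: A box holds 6 eggs. How many eggs are in 4 boxes?\nA: 24\n"
    "Q: A bag holds 9 apples. How many apples are in 3 bags?\nA: 27\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: demonstrations include intermediate reasoning steps,
# nudging the model to reason step by step before giving the final answer.
chain_of_thought = (
    "Q: A box holds 6 eggs. How many eggs are in 4 boxes?\n"
    "A: Each box holds 6 eggs, and 4 boxes hold 4 * 6 = 24 eggs. The answer is 24.\n"
    f"Q: {question}\nA: Let's think step by step."
)
```

Self-consistency would then sample several chain-of-thought completions for the same prompt and take a majority vote over the extracted final answers.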
These techniques interact with sampling controls at decoding time (see the sketch after this list):
High temperature: Divides the logits by a value greater than 1, flattening the next-token distribution and producing more diverse but less predictable output.
Low temperature: Sharpens the distribution toward the most probable tokens, producing more focused and deterministic output.
Top-K sampling: Restricts sampling to the K most probable tokens at each step.
Top-P (nucleus) sampling: Restricts sampling to the smallest set of tokens whose cumulative probability reaches P.
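The sketch below shows how these controls are typically applied to a vector of next-token logits; it is a generic NumPy implementation, not the API of any specific inference library.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token index from next-token logits with common decoding controls."""
    rng = rng or np.random.default_rng()
    # Temperature: values < 1 sharpen the distribution, values > 1 flatten it.
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]        # token ids, most probable first
    sorted_probs = probs[order]

    keep = len(probs)
    if top_k is not None:                  # keep only the K most probable tokens
        keep = min(keep, top_k)
    if top_p is not None:                  # nucleus: smallest set with cumulative prob >= top_p
        cumulative = np.cumsum(sorted_probs)
        keep = min(keep, int(np.searchsorted(cumulative, top_p) + 1))

    kept_ids = order[:keep]
    kept_probs = sorted_probs[:keep] / sorted_probs[:keep].sum()
    return int(rng.choice(kept_ids, p=kept_probs))

# Greedy decoding is the limiting case: always pick the argmax token.
logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```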
The AI alignment problem focuses on ensuring that AI systems, especially large language models (LLMs), are aligned with human values and goals. This is often framed as ensuring AI is helpful, honest, and harmless (HHH):
Helpful: AI should assist users by providing useful, relevant, and accurate information or services, enhancing productivity and effectively solving problems.
Honest: AI should be transparent and truthful in its responses, acknowledging its limitations and avoiding generating false or misleading content.
Harmless: AI should avoid causing harm by preventing the generation of biased, offensive, or unethical content, prioritizing safety and respect in its interactions.
Alignment techniques:
Instruction tuning (supervised fine-tuning): Fine-tune the model on curated instruction-response demonstrations so it learns to follow user intent.
Reinforcement learning from human feedback (RLHF): Train a reward model on human preference comparisons, then optimize the LLM against it (the objective is sketched below).
Direct preference optimization (DPO) and related methods: Optimize the model directly on preference pairs, without an explicit reward model or RL loop.
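As a rough sketch of the RLHF stage (the formulation popularized by InstructGPT-style training), a reward model r_phi trained on human preference comparisons scores candidate outputs, and the policy pi_theta is optimized to maximize reward while a KL penalty keeps it close to the supervised reference policy pi_ref:

```latex
% RLHF objective: maximize reward while penalizing drift from the reference policy
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\big[\, r_{\phi}(x, y) \,\big]
\;-\;
\beta\, D_{\mathrm{KL}}\!\big( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```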
LLM hallucination refers to a situation where an LLM generates responses that are incorrect, nonsensical, or inconsistent with the input prompt or the model's training data. This is essentially when an LLM 'makes up' information or fills in gaps with plausible but false details. These hallucinations can be caused by:
Insufficient data: LLMs need extensive training data to learn complex language patterns and context. If the training dataset is limited or biased, the model may struggle to understand nuanced language or make accurate predictions.
Data bias: Training data can reflect existing societal biases, which can be amplified by the LLM and lead to inaccurate or misleading outputs.
Data quality: Poorly curated or noisy training data can introduce errors that the model learns and reproduces.
Several ways to detect and measure hallucinations, i.e., to evaluate the faithfulness of generated content, include (a small sketch of two of these checks follows the list):
Fact-based metrics: Assess faithfulness by measuring the overlap of facts between the generated content and the source content.
Classifier-based metrics: Use trained classifiers to judge the degree of entailment between the generated content and the source content.
QA-based metrics: Employ question-answering systems to validate the consistency of information between the source content and the generated content.
Uncertainty estimation: Assesses faithfulness by measuring the model's confidence in its generated outputs.
Prompting-based metrics: Induce LLMs to serve as evaluators, assessing the faithfulness of generated content through specific prompting strategies.
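As a minimal illustration, the sketch below implements crude versions of two of these ideas with hypothetical helper functions (they are not from any standard library): a fact-overlap score based on shared content words, and an uncertainty estimate from average token log-probabilities.

```python
import math
import re

def fact_overlap(source: str, generated: str) -> float:
    """Crude fact-based check: fraction of generated content words found in the source."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    stopwords = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "to", "and"}
    source_words = tokenize(source) - stopwords
    generated_words = tokenize(generated) - stopwords
    if not generated_words:
        return 0.0
    return len(generated_words & source_words) / len(generated_words)

def average_confidence(token_logprobs: list[float]) -> float:
    """Uncertainty estimate: mean per-token probability; low values suggest hallucination risk."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

source = "The Eiffel Tower was completed in 1889 and stands in Paris."
generated = "The Eiffel Tower, completed in 1889, is located in Paris."
print(fact_overlap(source, generated))         # close to 1.0 -> well supported by the source
print(average_confidence([-0.1, -0.3, -0.2]))  # mean per-token probability, about 0.82
```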
Evaluating LLMs is a complex challenge. Unlike traditional NLP models, LLMs are open-ended, multi-purpose, and generative, making it hard to use simple metrics. Good evaluation tasks and metrics are important to build better AI. As language models evolve, mainstream NLP tasks continue to advance and become increasingly challenging.