Knowledge representation (KR) refers to the methods used in AI to store, retrieve, and handle knowledge to enable intelligent reasoning. It is needed to bridge raw data and intelligent decision-making, allow AI to reason logically and infer new facts, and enable knowledge-driven AI applications (e.g., expert systems and search engines).
For example, a medical AI system stores knowledge about symptoms, diseases, and treatments. Given a patient's symptoms, said system can infer the most likely disease and suggest treatments.
Without KR, AIs would be no more than simple pattern-copying predictors.
Structured knowledge is data organized in a predefined format with a clear structure. It is the kind of data that can be stored in databases, tables, ontologies, and knowledge graphs (e.g., SQL databases, medical taxonomies, and the Google Knowledge Graph).
AI systems rely on knowledge to understand, reason, and make decisions. Without structured knowledge, AI models would only work with raw data and struggle with logical reasoning.
Before knowledge is structured like the above, it exists as unstructured knowledge. Following no predefined structure, this raw data can be found in raw text, images, videos, and free-form documents (e.g., news articles, audio recordings, and transcripts).
Not all KR methods are equally effective in AI applications, and a good KR system must balance efficiency, flexibility, and interpretability. Hence, requirements for KR are defined to satisfy these goals. Take a self-driving car as an example:
Expressiveness: Represents real-world road conditions, traffic signals, and vehicle movement.
Computational efficiency: Processes sensor data in real-time for immediate decisions.
Scalability: Expands knowledge of new routes and driving patterns.
Interpretability: AI must explain why it brakes or changes lanes.
Modifiability: Updates driving models based on new road conditions.
One method of KR, the semantic network, is a graph-based knowledge representation where nodes represent concepts/objects and edges represent relationships between them. It is used to structure knowledge hierarchically and enable inference based on relationships.
Using semantic networks allows AI to derive new knowledge by traversing relationships in the network (a small sketch follows the list below). They notably excel in:
Natural representation of knowledge: Mimics human thought processes. Relationships are intuitive and visually clear.
Supporting logical inference: AI can deduce new facts through is-a and has-property relationships.
Efficient knowledge retrieval: Graph structures allow fast lookups using connected nodes.
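To make traversal-based inference concrete, here is a minimal sketch of a semantic network as a Python dictionary; the concepts (Tweety, Bird, Animal) and their properties are made-up illustrations, not a prescribed schema.

```python
# Nodes are concepts; edges are labeled relationships (is-a, has-property).
# Toy data for illustration only.
network = {
    "Tweety": {"is-a": ["Bird"]},
    "Bird": {"is-a": ["Animal"], "has-property": ["wings", "feathers"]},
    "Animal": {"has-property": ["can move"]},
}

def inherited_properties(concept):
    """Collect has-property values by traversing is-a edges upward."""
    properties = list(network.get(concept, {}).get("has-property", []))
    for parent in network.get(concept, {}).get("is-a", []):
        properties.extend(inherited_properties(parent))
    return properties

# Tweety is never directly linked to "wings", yet traversal infers it:
print(inherited_properties("Tweety"))  # ['wings', 'feathers', 'can move']
```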
However, if they were that great, everyone would be using them. Semantic networks have shortcomings in areas such as:
Snowballing complexity: Large networks with millions of nodes can become hard to manage.
No standardized representation: Different AI models use different graph structures, making integration difficult.
Cannot handle uncertain knowledge well: Semantic networks assume all relationships are deterministic (e.g., birds can fly but what about penguins?).
A frame is a structured representation of knowledge that groups related information about an entity into a slot-filler structure. Here, a slot is an attribute or property of a frame that stores specific knowledge (e.g., text, lists, rules, and procedural actions), and a filler is the value for that slot.
Frame-based reasoning enables AI systems to reason by leveraging the following traits (illustrated in the sketch after this list):
Default values: If a slot is empty, AI can use default knowledge.
Inheritance from frame hierarchies: Frames inherit attributes from higher-level frames (e.g., mammals have hair, therefore a dog, being a mammal, has hair).
Slot Constraints and Conditions: Some slots have restrictions on valid values.
Procedural Attachment (rules triggered by frames): Some slots trigger actions when accessed.
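A minimal sketch of a frame with slots, fillers, default values, and inheritance, using hypothetical Mammal and Dog frames:

```python
# Toy frame system: each frame holds slot -> filler pairs and may inherit
# from a higher-level frame; missing slots fall back to default values.
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent          # higher-level frame to inherit from
        self.slots = dict(slots)      # slot -> filler

    def get(self, slot, default=None):
        # Use the local filler if present, otherwise inherit from the parent,
        # otherwise fall back to a default value (default reasoning).
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot, default)
        return default

mammal = Frame("Mammal", has_hair=True, legs=4)
dog = Frame("Dog", parent=mammal, sound="bark")

print(dog.get("sound"))            # 'bark'    (local slot)
print(dog.get("has_hair"))         # True      (inherited from Mammal)
print(dog.get("diet", "unknown"))  # 'unknown' (default value)
```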
Frame-based reasoning excels in feats such as:
Structured and organized representation: Frames group related knowledge into slot-filler structures, making retrieval efficient.
Enable inheritance and default reasoning: AI can infer missing values using default values and hierarchical inheritance.
Support procedural knowledge (rules and actions): Slots can trigger actions, known as procedural attachments.
Easy to update and modify: Frames allow AI systems to modify slots and fillers dynamically.
Like semantic networks, however, frames have flaws that keep them from being used universally:
Rigid structure and limited flexibility: Frames work well when knowledge is predefined but struggle with knowledge that is ambiguous or incomplete.
Poor handling of uncertainty: Frames assume knowledge is always structured and complete, making it difficult to reason with probabilities.
Hard to scale for large knowledge bases: Frames grow complex as the number of entities and slots increases.
Lacks advanced logical reasoning: Unlike symbolic logic, frames do not perform deep logical deductions.
A rule-based system (RBS) represents knowledge as a set of if-then rules that trigger actions when conditions are met. Unlike frames or semantic networks, RBS's focus on explicit decision-making processes.
RBS's stand out from other types of KR in the following ways (a small sketch follows this list):
Transparent and explainable: Every decision is based on clear, human-readable IF-THEN rules. For example, an AI doctor follows predefined rules for diagnosis, making its reasoning understandable to doctors.
Easy to implement for well-defined problems: Rule-based AI works effectively in structured domains with known rules. For example, fraud detection systems apply rules like, "IF transaction > $10,000 AND foreign country, THEN flag as suspicious".
Works without large training data: Unlike machine learning, rule-based AI does not require massive datasets to function. For example, a legal AI assistant can apply IF-THEN logic to analyze contract clauses without needing prior training.
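A minimal sketch of a rule-based system built around the fraud-detection rule quoted above; the second rule, the thresholds, and the home-country check are invented for illustration, not a real fraud policy.

```python
# Each rule pairs an IF-condition (a predicate over a transaction) with a
# THEN-action. Rules and values are toy assumptions.
rules = [
    {
        "name": "large_foreign_transaction",
        "if": lambda tx: tx["amount"] > 10_000 and tx["country"] != "NZ",
        "then": "flag as suspicious",
    },
    {
        "name": "rapid_repeat_purchases",
        "if": lambda tx: tx["purchases_last_hour"] >= 5,
        "then": "request manual review",
    },
]

def evaluate(transaction):
    """Fire every rule whose IF-condition matches the transaction."""
    return [rule["then"] for rule in rules if rule["if"](transaction)]

tx = {"amount": 15_000, "country": "US", "purchases_last_hour": 1}
print(evaluate(tx))  # ['flag as suspicious']
```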
Like the previous two methods, though, RBS's have trouble with:
Hard to scale with complex knowledge: As rules increase, managing thousands of if-then rules becomes difficult. For example, a tax AI needs thousands of rules for different tax laws, making updates challenging.
Poor adaptability to new situations: Cannot generalize beyond predefined rules. For example, an AI chatbot using rules may fail when users ask unexpected questions.
Requires expert knowledge to define rules: Rules must be handcrafted by domain experts. For example, a legal AI must be programmed with thousands of rules by lawyers.
A knowledge graph is a graph-based knowledge representation that connects entities (nodes) with relationships (edges), but unlike frames or RBS's, it stores structured knowledge (facts, concepts, relationships), enables AI to infer new connections from known relationships, and provides a scalable and flexible way to organize information.
Knowledge graphs allow AI to infer new facts by analyzing existing relationships. They do so by letting the AI apply transitive inference (if A → B and B → C, then infer A → C), identify hidden connections between entities (relationship expansion), distinguish entities with similar names (entity disambiguation), and retrieve structured answers to queries. A small sketch of transitive inference follows.
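As a minimal illustration, the sketch below applies the transitivity rule to a handful of triples in plain Python; the entities and the assumption that located_in is transitive are illustrative only.

```python
# Toy triples; the rule "if (A, r, B) and (B, r, C) then (A, r, C)" is applied
# repeatedly until no new facts appear.
triples = {
    ("University of Auckland", "located_in", "Auckland"),
    ("Auckland", "located_in", "New Zealand"),
    ("New Zealand", "located_in", "Oceania"),
}

def transitive_closure(facts, relation):
    """Return the triples newly inferred by transitivity over `relation`."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(inferred):
            for (b2, r2, c) in list(inferred):
                if r1 == r2 == relation and b == b2 and (a, relation, c) not in inferred:
                    inferred.add((a, relation, c))
                    changed = True
    return inferred - facts

print(transitive_closure(triples, "located_in"))
# includes ('University of Auckland', 'located_in', 'New Zealand'), etc.
```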
Knowledge graphs are tidy organizers in terms of:
Highly structured and interpretable: Unlike unstructured text, a Knowledge Graph provides clearly defined relationships between entities. For example, Google’s Knowledge Graph organizes people, places, and facts in a structured way.
Enables inference and knowledge discovery: AI can infer missing knowledge based on known relationships. For example, if a scientist worked on a theory, AI could predict possible collaborations.
Scalable for large-scale knowledge representation: Works well with millions of facts and relationships. For example, AI assistants (e.g., Siri and Alexa) retrieve structured knowledge in real time.
Supports Multi-Domain Knowledge Integration: Can combine medical, scientific, business, and general knowledge into one system. For example, a medical AI links symptoms, diseases, treatments, and drugs for diagnosis.
Before deep learning, AI relied on structured knowledge — defined rules and symbolic logic — to reason. These approaches were interpretable but struggled with learning from data. Several key traditional AI + KR Methods are:
Expert systems: AI system that mimics human experts to make decisions in specialized fields. Uses IF-THEN rules for decision-making. Built with a knowledge base (storage for facts and rules) and an inference engine (applies logical rules to derive conclusions). For example, MYCIN is input with patient symptoms. IF patient has a fever AND a bacterial infection, THEN recommend an antibiotic treatment.
Ontologies: Formal representation of knowledge that defines concepts, relationships, and constraints in a domain. Unlike traditional rule-based AI, which relies on explicit IF-THEN statements, this method provides a structured framework for AI to reason, infer, and categorize knowledge. It has several key components (a small inference sketch follows this list):
Concepts (classes): Categories of entities. For example, disease, medicine, treatment, and patient.
Relationships (properties): Defines how entities are connected. For example, disease 'has symptom' fever, medicine 'treats' disease.
Instances (individuals): Specific data points. For example, flu, COVID-19, aspirin, John Doe.
Constraints and rules: Logical restrictions applied to relationships. For example, a treatment must be associated with at least one disease.
Inference mechanisms: AI can infer new relationships based on ontology structure. For example, if aspirin is a type of pain reliever, and pain relievers treat a headache, then aspirin can be used to treat a headache.
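As a minimal illustration of these components, the sketch below encodes the aspirin example as a tiny toy ontology in plain Python; the classes, instances, and rule are assumptions for illustration, not a real medical ontology.

```python
# Class hierarchy, instance typing, and a class-level relationship.
subclass_of = {"pain reliever": "medicine"}
instance_of = {"aspirin": "pain reliever", "flu": "disease"}
treats = {"pain reliever": {"headache"}}

def classes_of(instance):
    """Return the instance's class plus all ancestor classes."""
    cls = instance_of.get(instance)
    result = []
    while cls:
        result.append(cls)
        cls = subclass_of.get(cls)
    return result

def can_treat(instance, condition):
    """If X is an instance of class C and C treats D, infer that X treats D."""
    return any(condition in treats.get(cls, set()) for cls in classes_of(instance))

print(classes_of("aspirin"))             # ['pain reliever', 'medicine']
print(can_treat("aspirin", "headache"))  # True
```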
The Resource Description Framework (RDF) represents knowledge using triples in the format (Subject, Predicate, Object), written (S, P, O), or equivalently (Head, Relation, Tail), written (h, r, t). This structure allows flexible, scalable, and machine-readable knowledge storage.
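For a concrete feel of the triple format, here is a small sketch using the rdflib Python library (assumed to be installed) with a made-up example.org namespace; it builds two triples and prints them in Turtle, a common machine-readable RDF syntax.

```python
from rdflib import Graph, Namespace

# Build a tiny RDF graph of (Subject, Predicate, Object) triples under a
# made-up namespace; install rdflib first (pip install rdflib).
EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

g.add((EX.AlbertEinstein, EX.born_in, EX.Germany))
g.add((EX.AlbertEinstein, EX.developed, EX.TheoryOfRelativity))

# Serialize the graph to Turtle syntax.
print(g.serialize(format="turtle"))
```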
Web Ontology Language (OWL) extends RDF by allowing logical reasoning and ontology-based classification. It defines concepts, relationships, and hierarchy constraints in knowledge graphs.
A graph database is a specialized database designed to store, query, and manage graph-structured data efficiently. Unlike relational databases (SQL), which use tables, graph databases store data as nodes and edges, making them ideal for knowledge graphs.
To construct a knowledge graph, data is extracted from structured, unstructured, and semi-structured sources. Assuming you already know what the first two categories are, the third is partially structured and requires transformation into knowledge graph format. An example is transforming Wikipedia infoboxes into JSON and then extracting knowledge graph triples from that JSON.
A structured knowledge graph's construction process goes like this (a toy pipeline sketch follows the list):
Entity extraction: Identify entities (people, places, organizations) from text. For example, from "Albert Einstein was born in Germany and developed the Theory of Relativity", AI extracts (Albert Einstein, Germany, Theory of Relativity).
Relation extraction: Identify relationships between entities. Continuing the example, (Albert Einstein, born_in, Germany) and (Albert Einstein, developed, Theory of Relativity).
Knowledge integration: Merge duplicate entities and ensure consistency. For example, "Albert Einstein" from Wikipedia = "A. Einstein" from a research paper.
Storage and query: Store knowledge graph data in graph databases (Neo4j, ArangoDB, or RDF Stores).
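The toy sketch below walks through the four steps on the Einstein sentence; the regex-based "extractors" and the alias table are stand-in assumptions, whereas real pipelines use trained NER and relation-extraction models, and a graph database rather than a dict.

```python
import re

text = "Albert Einstein was born in Germany and developed the Theory of Relativity."

# 1. Entity extraction: capitalised word runs stand in for a proper NER model.
entities = re.findall(r"[A-Z][a-z]+(?:(?: of)? [A-Z][a-z]+)*", text)
# -> ['Albert Einstein', 'Germany', 'Theory of Relativity']

# 2. Relation extraction: simple surface patterns yield (head, relation, tail) triples.
triples = []
m = re.search(r"(.+?) was born in (\w+)", text)
if m:
    triples.append((m.group(1), "born_in", m.group(2)))
m = re.search(r"developed the (.+?)\.", text)
if m:
    triples.append((entities[0], "developed", m.group(1)))

# 3. Knowledge integration: merge duplicate entity names via an alias table.
aliases = {"A. Einstein": "Albert Einstein"}   # hypothetical duplicate
triples = [(aliases.get(h, h), r, aliases.get(t, t)) for h, r, t in triples]

# 4. Storage and query: a dict keyed by head entity stands in for a graph database.
graph = {}
for h, r, t in triples:
    graph.setdefault(h, []).append((r, t))

print(graph["Albert Einstein"])
# [('born_in', 'Germany'), ('developed', 'Theory of Relativity')]
```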
Knowledge graph inference refers to the process where AI derives new knowledge from existing facts using logical rules, embeddings, and graph-based reasoning.
To list a few types of knowledge graph inference:
Rule-based reasoning (symbolic inference – logic-based AI): Uses explicit logical rules (IF-THEN logic) to deduce new facts. Often implemented using OWL, SPARQL Protocol and RDF Query Language (SPARQL), and First-Order Logic (FOL). For example, IF (A is a part of B) AND (B is a part of C), THEN (A is a part of C).
Graph-based reasoning (path-based inference): AI traverses the graph structure to infer relationships between entities. Uses graph query languages (SPARQL, Cypher) and graph algorithms (PageRank, Shortest Path). A path-finding sketch follows this list.
Knowledge graph embeddings (embedding-based knowledge graph inference): Represents entities and relations as dense vectors in a continuous space. Embeddings allow models to predict missing links, validate facts, and reason over structured knowledge. Generated by transforming symbolic knowledge from structured graphs into continuous vector representations.
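As a small illustration of path-based inference, the sketch below runs breadth-first search over a toy knowledge graph (all entities and relations are made up for illustration) to find how two entities are connected.

```python
from collections import deque

# Adjacency list of a toy knowledge graph: entity -> [(relation, neighbour), ...]
edges = {
    "Marie Curie": [("won", "Nobel Prize in Physics"), ("worked_in", "Paris")],
    "Paris": [("located_in", "France")],
    "Nobel Prize in Physics": [("awarded_by", "Royal Swedish Academy of Sciences")],
}

def find_path(start, goal):
    """Breadth-first search: return the relation path connecting start to goal, if any."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in edges.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None

print(find_path("Marie Curie", "France"))
# [('Marie Curie', 'worked_in', 'Paris'), ('Paris', 'located_in', 'France')]
```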
TransE, short for Translating Embeddings for Modeling Multi-relational Data, is a knowledge graph embedding model that represents entities and relations as vectors in a continuous space.
To explain what TransE does: if an AI knows that (University of Auckland, located_in, Auckland) and (Auckland, part_of, New Zealand), TransE learns embeddings such that University of Auckland + located_in ≈ Auckland and Auckland + part_of ≈ New Zealand, and can thereby infer missing knowledge such as (University of Auckland, located_in, New Zealand).
The mathematical formulation of TransE involves representing entities and relations as vectors in a continuous space and defining a scoring function based on vector translations. To break it down into steps (with examples):
Vector space representation: Map each entity (head h and tail t) and each relation r to a d-dimensional vector, with h, t, r ∈ R^d.
Scoring function: Measure the plausibility of a triple (h, r, t) using the distance-based scoring function f(h, r, t) = ||h + r − t||_(L1/L2). For valid triples, the relation r acts as a translation vector, so h + r should be close to t in embedding space. A low score indicates a valid triple, whereas a high score indicates a false one.
Distance metrics: The user must choose one of the two norms for TransE to use; combining both would introduce additional hyperparameters and complexity. Each norm has its own benefits and suits different scenarios (a scoring sketch follows this list):
L1 norm (Manhattan distance): Linearly penalizes small and large deviations equally. Encourages sparse embeddings; some dimensions may become exactly zero. Effective for modeling hierarchical or categorical relations where only a few dimensions are relevant. Robust to outliers due to linear scaling.
L2 norm (Euclidean distance): Heavily penalizes large deviations due to squaring. Encourages smooth, dense embeddings; all dimensions contribute. Better for modeling continuous or nuanced relations where all dimensions matter. Sensitive to outliers but provides stable gradients during training.
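A minimal sketch of the scoring function under both norms, using made-up 3-dimensional embeddings purely for illustration:

```python
import numpy as np

def transe_score(h, r, t, norm="L1"):
    """f(h, r, t) = ||h + r - t||; lower scores mean the triple is more plausible."""
    diff = h + r - t
    return np.sum(np.abs(diff)) if norm == "L1" else np.linalg.norm(diff)

h = np.array([0.1, 0.3, -0.2])   # head entity embedding (toy values)
r = np.array([0.2, -0.1, 0.4])   # relation embedding (translation vector)
t = np.array([0.3, 0.2, 0.1])    # tail entity embedding

print(transe_score(h, r, t, norm="L1"))  # Manhattan distance
print(transe_score(h, r, t, norm="L2"))  # Euclidean distance
```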
The margin-based ranking loss in TransE is a training objective designed to separate valid triples from invalid ones by enforcing a margin γ between their scores. Its components include:
Correct triples (S): (h, r, t) are valid triples from the knowledge graph.
Corrupted triples (S′): (h', r', t') are invalid triples generated by perturbing h or t (e.g., replacing h with a random entity).
Scoring function (f): TransE’s distance metric f(h, r, t) = (∥ h + r − t ∥) _(L1/L2), which measures how well the triple fits the translation h + r ≈ t.
Margin (γ): Hyperparameter enforcing a gap between valid and invalid triples.
The goal is to minimize f(h, r, t) for valid triples and maximize f(h', r', t') for invalid ones. Concretely, the loss is L = Σ over (h, r, t) ∈ S and (h', r', t') ∈ S' of max(0, γ + f(h, r, t) − f(h', r', t')), which enforces f(h', r', t') ≥ f(h, r, t) + γ for each valid-invalid pair. If this constraint is violated, the loss becomes positive, penalizing the model.
Additionally, the max(0, …) ensures no penalty when valid triples are already sufficiently separated from invalid ones by γ.
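A small numerical sketch of this loss, using made-up scores rather than learned embeddings:

```python
import numpy as np

def margin_ranking_loss(pos_scores, neg_scores, gamma=1.0):
    """Sum of max(0, gamma + f(valid) - f(corrupted)) over paired triples."""
    return np.sum(np.maximum(0.0, gamma + pos_scores - neg_scores))

pos_scores = np.array([0.2, 0.5])   # f(h, r, t) for valid triples (low is good)
neg_scores = np.array([1.5, 0.9])   # f(h', r', t') for corrupted triples

print(margin_ranking_loss(pos_scores, neg_scores, gamma=1.0))
# pair 1: max(0, 1.0 + 0.2 - 1.5) = 0.0  (already separated by the margin)
# pair 2: max(0, 1.0 + 0.5 - 0.9) = 0.6  (violation -> positive loss)
# total loss = 0.6
```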
The TransE model training process involves learning vector embeddings for entities and relations in a knowledge graph by enforcing a geometric translation principle. Each step in learning is summarized below (with a training-loop sketch after the list):
Random initialization: Each entity (e.g., "Paris," "France") and relation (e.g., "located_in") is assigned a random d-dimensional vector in R^d. These vectors are starting points that will be refined during training to capture semantic and relational patterns.
Compute scores for triples: For each valid triple (h, r, t) from the dataset, compute its score using TransE's scoring function f(h, r, t) = ||h + r − t||_(L1/L2). A low score (small distance) indicates a plausible triple.
Negative sampling: Generate corrupted triples (h', r, t) or (h, r, t') by replacing the head h or tail t with a random entity (e.g., replacing "France" with "Germany" in [Paris, located_in, France]).
Optimize with gradient descent: Use the margin-based ranking loss L to ensure valid triples score lower than corrupted ones by a margin γ. Update embeddings using gradient descent to minimize L. Valid triples are 'pulled closer' (h + r ≈ t), while invalid ones are 'pushed away' (h' + r ≠ t').
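Below is a compact training-loop sketch, assuming PyTorch is available; the toy triples, embedding dimension, learning rate, and margin are illustrative choices, not the original paper's setup.

```python
import torch
import torch.nn as nn

entities = ["Paris", "Lyon", "France", "Germany", "Berlin"]
relations = ["located_in", "capital_of"]
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: i for i, r in enumerate(relations)}

triples = [("Paris", "located_in", "France"),
           ("Lyon", "located_in", "France"),
           ("Berlin", "located_in", "Germany"),
           ("Paris", "capital_of", "France")]

dim, gamma = 16, 1.0
ent_emb = nn.Embedding(len(entities), dim)   # random initialization
rel_emb = nn.Embedding(len(relations), dim)
optimizer = torch.optim.Adam(list(ent_emb.parameters()) + list(rel_emb.parameters()), lr=0.01)

def score(h, r, t):
    # f(h, r, t) = ||h + r - t||_1; a low score means a plausible triple.
    return torch.norm(ent_emb(h) + rel_emb(r) - ent_emb(t), p=1, dim=-1)

h = torch.tensor([e_idx[s] for s, _, _ in triples])
r = torch.tensor([r_idx[p] for _, p, _ in triples])
t = torch.tensor([e_idx[o] for _, _, o in triples])

for epoch in range(200):
    # Negative sampling: corrupt tails with random entities
    # (a full implementation would avoid re-sampling the true tail).
    t_neg = torch.randint(0, len(entities), t.shape)
    loss = torch.clamp(gamma + score(h, r, t) - score(h, r, t_neg), min=0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Keep entity embeddings on the unit sphere, as in the original TransE paper.
    with torch.no_grad():
        ent_emb.weight.data = nn.functional.normalize(ent_emb.weight.data, dim=1)

print(round(loss.item(), 3))  # the margin-based loss should shrink over training
```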
The example below uses embeddings trained with the margin-based ranking loss to predict which larger place (country or continent) a given place from a small list is located in.
On a surface level, the way to find out which place Paris is located in is to add Paris's vector to the located_in relation vector and then subtract each candidate location's vector.
The candidate tail t with the lowest score ||h + r − t|| is the most likely prediction; in the context of the example, Paris is located in France.
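Here is that prediction step as a toy sketch; all embedding values are invented for illustration, whereas a trained TransE model would supply real vectors.

```python
import numpy as np

# Made-up 2-dimensional embeddings; a trained model would provide these.
emb = {
    "Paris":      np.array([0.9, 0.1]),
    "located_in": np.array([0.1, 0.8]),
    "France":     np.array([1.0, 0.9]),
    "Germany":    np.array([0.2, 0.9]),
    "Oceania":    np.array([0.5, 0.1]),
}

# Score each candidate tail with the L1 norm ||h + r - t|| and pick the lowest.
candidates = ["France", "Germany", "Oceania"]
scores = {c: np.sum(np.abs(emb["Paris"] + emb["located_in"] - emb[c]))
          for c in candidates}

print(scores)
print("Prediction:", min(scores, key=scores.get))  # 'France' has the lowest score
```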
TransE’s core assumption that each relation can be modeled as a single translation vector (i.e., h + r ≈ t) works well for simple 1-to-1 relationships (e.g., "Paris is the capital of France"), but it struggles with 1-to-N, N-to-1, and N-to-N relations, because one translation vector cannot map many distinct heads to one tail (or one head to many tails) without distorting the entity embeddings.
For example, in (City, located_in, Country), if located_in must translate every city’s vector (e.g., Paris, Lyon, and Marseille) to France, TransE forces all cities to satisfy City_i + located_in ≈ France, which implies all cities must map to nearly the same vector. The result? City embeddings collapse into near-identical vectors, losing their distinctiveness.
Queries like "Is Marseille in France?" still work, but queries like "Is Marseille near Lyon?" fail because the city embeddings are indistinguishable.
TransE’s inability to model complex relational patterns (e.g., 1-to-N, N-to-N) led to the development of more sophisticated models. Below are key advancements, along with examples about preventing entity overlap for triples in the contexts of "Paris located_in France" and "Louvre located_in France":
TransH — Relation-Specific Hyperplanes: Each relation r is associated with a hyperplane. Entities are projected onto this hyperplane before applying the translation. Enables the same relation to interact differently with different entities, avoiding vector collapse.
Example: First, project Paris and Louvre onto the located_in hyperplane. Second, apply the relation-specific translation r on the hyperplane. Finally, Paris and Louvre are mapped to distinct points on the hyperplane, their projections depending on their original positions — preserving their uniqueness even though both are linked to France.
TransR — Separate Entity-Relation Spaces: Entities and relations reside in different vector spaces. Entities are projected into the relation’s space before translation. Allows flexible modeling of complex relations by decoupling entity and relation representations.
Example: First, project Paris and Louvre into the located_in relation space using a learnable matrix M_r. Second, apply translation r in the relation space. Finally, Paris and Louvre occupy different positions in the located_in space, preventing overlap. Each relation space can capture unique interactions, even for entities sharing the same relation.
ComplEx — Complex Embeddings for Symmetry or Antisymmetry: Represents entities and relations as complex vectors (with real and imaginary parts). Uses Hermitian dot products to model symmetric, antisymmetric, and inverse relations.
Example: First, embed Paris, Louvre, France, and located_in as complex vectors. Second, score triples using Re(⟨h, r, t̄⟩), where t̄ is the complex conjugate of t. Finally, the complex space allows Paris and Louvre to relate to France without overlapping, as their imaginary components encode distinct relational contexts. Such complex embeddings naturally handle symmetry (e.g., sibling_of) and antisymmetry (e.g., located_in).
To summarize the three models' niches, TransH prevents entity collapse via relation-specific hyperplanes, TransR decouples entity and relation spaces for richer modeling, while ComplEx leverages complex numbers to capture diverse relational properties.
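The sketch below shows the three scoring functions side by side with made-up toy vectors; it illustrates the formulas only, not trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
h, r, t = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

# TransH: project h and t onto the hyperplane with unit normal w_r,
# then apply the relation-specific translation d_r on that hyperplane.
w_r = rng.normal(size=d)
w_r /= np.linalg.norm(w_r)
d_r = rng.normal(size=d)
h_perp = h - (w_r @ h) * w_r
t_perp = t - (w_r @ t) * w_r
transh_score = np.linalg.norm(h_perp + d_r - t_perp)

# TransR: project entities into a k-dimensional relation space with matrix M_r,
# then translate with a relation vector r_k living in that space.
k = 3
M_r = rng.normal(size=(k, d))
r_k = rng.normal(size=k)
transr_score = np.linalg.norm(M_r @ h + r_k - M_r @ t)

# ComplEx: complex-valued embeddings scored by Re(sum(h * r * conj(t)));
# higher values indicate more plausible triples (unlike the distance scores above).
h_c = rng.normal(size=d) + 1j * rng.normal(size=d)
r_c = rng.normal(size=d) + 1j * rng.normal(size=d)
t_c = rng.normal(size=d) + 1j * rng.normal(size=d)
complex_score = np.real(np.sum(h_c * r_c * np.conj(t_c)))

print(transh_score, transr_score, complex_score)
```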
Large language models (LLMs) like GPT, PaLM, and Llama excel at natural language understanding but face critical limitations, including a tendency to generate incorrect or fabricated information (i.e., hallucinations), reliance on statistical patterns rather than structured logical reasoning, and limited memory for long-term dependencies and real-time knowledge updates.
The solution to these problems is to combine LLMs with KR. Doing so allows LLMs to access verified knowledge from structured sources (e.g., knowledge graphs, databases, ontologies) to reduce errors. KR also provides logic-based frameworks to support and justify LLM outputs, giving users transparency into a model's reasoning.
Having external KR systems also enables LLMs to incorporate up-to-date information without costly retraining. Imagine a brilliant student (the LLM) capable of incredible feats of language and reasoning based on what they have learned from textbooks (training data). However, they are stuck in a library with only those textbooks, and the world outside is constantly changing. KR acts as a research assistant for this student, answering the student's queries about the outside world.
Retrieval-augmented generation (RAG) is a hybrid AI technique that integrates external knowledge retrieval with generative models to enhance response accuracy and minimize hallucinations.
Unlike traditional LLMs, which depend solely on static pre-trained data, RAG dynamically pulls real-time, contextually relevant information (e.g., citations) from external sources before generating responses.
Here is how RAG works step-by-step:
User asks, "Who won the Turing Award in 2023?"
RAG searches structured (e.g., databases and knowledge graphs) and unstructured (e.g., documents and websites) sources. This can be done with either sparse retrieval (BM25, keyword-based search to rank items) or dense retrieval (FAISS/DPR, which use neural embeddings to measure semantic similarity, i.e., a degree of meaning overlap).
Retrieved items are fed into the LLM as context, e.g., "Avi Wigderson won the 2023 Turing Award for foundational contributions to the theory of computation."
Afterwards, the LLM generates an answer grounded in the retrieved data. Optional re-ranking ensures factual consistency and prioritizes high-confidence responses. A minimal end-to-end sketch follows.
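To tie the steps together, here is a bare-bones sketch of the retrieve-then-generate loop; the word-overlap retriever is a simplified stand-in for BM25 or dense retrieval (FAISS/DPR), and generate_answer is a hypothetical stand-in for a real LLM call.

```python
import re

# Toy document store; a real system would index far more sources.
documents = [
    "Avi Wigderson won the 2023 Turing Award for foundational contributions "
    "to the theory of computation.",
    "The Turing Award is given annually by the ACM.",
]

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, k=1):
    # Sparse-style retrieval: rank documents by word overlap with the query.
    q = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(tokenize(doc) & q), reverse=True)
    return ranked[:k]

def generate_answer(query, context):
    # Stand-in for the generation step: a real system would send this prompt
    # to an LLM, which answers grounded in the retrieved context.
    return f"Context: {context}\nQuestion: {query}\nAnswer grounded in the context above."

query = "Who won the Turing Award in 2023?"
context = " ".join(retrieve(query))
print(generate_answer(query, context))
```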