LISP (LISt Processing) is a family of programming languages centered on the list as its core data structure; a list can contain symbols, numbers, or even nested lists. Function definitions, expressions, and entire programs in LISP are themselves lists, enabling powerful metaprogramming capabilities like code generation and manipulation at runtime.
The properties of LISP include:
Prefix notation (Polish notation): LISP uses a uniform syntax where the operator precedes its operands. This consistency simplifies parsing and enables recursive operations naturally.
Homoiconicity: Programs are written as lists, meaning code can be treated as data and vice versa. This allows macros to dynamically generate or modify code, a cornerstone of LISP’s flexibility.
Recursion and Functional Programming: LISP emphasizes recursion over loops for iteration. A classic illustration is a recursive factorial function, sketched below; note that the naive version assumes N ≥ 1 and does not handle N = 0.
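A minimal sketch of such a recursive definition in Common Lisp (this is a standard textbook reconstruction, not code quoted from the source):

(defun factorial (n)
  ;; Naive recursion: assumes n >= 1, so (factorial 0) never reaches the base case.
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))

Because the base case tests (= n 1), calling (factorial 0) recurses indefinitely, which is exactly the N = 0 gap noted above.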
Take MYCIN, for example. Its implementation in LISP was instrumental to its functionality: LISP's strengths in symbolic computation suited rule-based reasoning, and its native list structures allowed rules to be stored and manipulated programmatically. Those rules included certainty factors (CFs) to quantify confidence in conclusions.
Forward chaining (data-driven reasoning) is a reasoning method that starts with known facts and applies inference rules to derive new conclusions or trigger actions. It is 'data-driven' because it progresses from initial data toward a goal.
Backward chaining (goal-driven reasoning) is a reasoning method that starts with a hypothesis (goal) and works backward to find supporting facts. It is 'goal-driven' because it begins with a target and verifies if it can be proven.
Returning to MYCIN, the system primarily employs backward chaining in its reasoning. A breakdown of the process:
Goal identification: MYCIN begins with potential bacterial or fungal pathogens (e.g., E. coli or Streptococcus pneumoniae) as hypotheses.
Rule activation: For each hypothesis, MYCIN checks rules that could confirm the pathogen.
Data gathering: The system prompts the user (e.g., a physician) for specific clinical or lab data needed to satisfy the rule’s conditions.
Uncertainty handling: MYCIN uses CFs, ranging from -1 to +1, to quantify confidence in conclusions. CFs propagate through rules, allowing MYCIN to rank hypotheses by likelihood.
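MYCIN's documented combination rule for two positive CFs is CF_combined = CF1 + CF2 · (1 − CF1). A rough Python sketch of propagating CFs with that rule (the evidence values below are invented for illustration, not taken from MYCIN):

# Combine two positive certainty factors, as in MYCIN's rule for agreeing evidence.
def combine_positive_cfs(cf1, cf2):
    return cf1 + cf2 * (1 - cf1)

# Hypothetical pieces of evidence supporting one pathogen, each with its own CF.
evidence_cfs = [0.4, 0.6, 0.3]

cf = 0.0
for e in evidence_cfs:
    cf = combine_positive_cfs(cf, e)

print(round(cf, 3))  # 0.832: confidence grows with each piece of evidence but never exceeds 1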
MYCIN's explanation capabilities were groundbreaking for their time, demonstrating the importance of transparency and accountability in AI systems, especially in critical domains like healthcare. They highlighted that, to properly deploy AI in a field like medicine, the system must be able to explain its reasoning.
Human experts, while highly knowledgeable, are susceptible to cognitive lapses. They may forget crucial details or struggle to accurately recall information relevant to a specific problem. Unlike computers, which maintain unwavering consistency and do not experience fatigue, human cognition is prone to errors, especially under stress or prolonged exertion.
Computers possess a significant advantage in memory capacity and information retrieval. They can maintain a much larger working memory than humans, who are typically limited to processing around seven 'chunks' of information at a time. Furthermore, computers can systematically and rapidly search their entire knowledge base, ensuring comprehensive and efficient access to relevant data.
Expert systems become particularly valuable when human expertise is unavailable, such as in emergency situations or remote environments. In these scenarios, AI can provide crucial decision support.
However, the complexity of capturing and encoding all relevant knowledge remains a significant challenge, hindering the widespread adoption of expert systems in fields like medicine. Consequently, the focus has shifted towards developing "decision support systems" that act as intelligent assistants rather than replacements for human experts.
The knowledge acquisition bottleneck refers to the challenges and limitations in effectively capturing, organizing, and disseminating knowledge, often hindering progress in areas like knowledge management and AI development. Directly inputting domain expert knowledge into AI systems is impractical. Even advanced systems like MYCIN primarily supported expert review and minor adjustments to pre-defined rules, rather than direct knowledge entry.
It does not help that knowledge, especially in fields like medicine, is constantly evolving. A system like MYCIN, based on 1976 medical practices, would be outdated today. New drugs, treatments, and emerging health issues necessitate continuous maintenance and updates to any knowledge base, even if initially 'perfect'.
Knowledge bases, especially those relying on large production rule systems, are inherently imperfect and difficult to debug. The initial belief that simply adding more rules would improve accuracy proved flawed. As early as 1974, the sheer volume of rules in MYCIN highlighted the complexity and challenges of maintaining and troubleshooting such systems.
A decision tree is a supervised learning algorithm that models decisions and their outcomes in a flowchart-like structure. Despite the name, it resembles a natural tree only in that its components borrow tree terminology. Its core structure is composed of:
Root node: Starting point where the first decision is made based on a feature (e.g., is salary between 50,000 and 80,000?). Unlike natural trees, ML decision trees are typically drawn with the root at the top, branching downward.
Internal nodes: Represent tests on features, either as a binary split (yes/no Boolean branches) or an n-way split (n classes of branches).
Branches: Outcomes of a decision (e.g., declined offer).
Leaf nodes: Terminal nodes with no outbound branches. Provide final predictions — class labels for classification or numeric values for regression.
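Structurally, a tiny tree for the hypothetical job-offer example above can be written as nested conditionals in Python (the thresholds and labels are invented):

def accept_offer(salary, commute_minutes):
    # Root node: test on salary.
    if 50_000 <= salary <= 80_000:
        # Internal node: test on commute time.
        if commute_minutes < 45:
            return "accept"       # leaf node
        return "negotiate"        # leaf node
    return "decline"              # leaf node

print(accept_offer(60_000, 30))  # accept

Each if is an internal node, each return is a leaf, and the branches are the true/false outcomes of the tests.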
Reading the layout of exhibit IV, decision trees excel in simplicity and transparency but struggle with modeling uncertainty and variability, adapting to dynamic or interdependent scenarios, and handling long-term, time-sensitive financial calculations. To elaborate:
Static probabilities: The tree assumes fixed probabilities (e.g., "High Average Demand Probability: 0.60") that may not reflect real-world volatility, leading to inaccurate risk assessments.
Simplistic outcome modeling: Outcomes are modeled as deterministic values (e.g., "$1 million/year for 10 years"), ignoring variability or uncertainty within each path. This promotes overconfidence in fixed projections and can lead to poor financial decisions in this context.
Over-simplification of complex scenarios: The tree reduces multi-faceted decisions (e.g., "Build Big Plant" vs. "Expand Investment") to binary or n-way splits, ignoring interdependencies. Such a reduction reflects the tree's failure to model feedback loops or second-order effects.
Data sensitivity and overfitting: The tree’s structure depends heavily on the accuracy of input probabilities and yields. Small errors in estimation (e.g., "Low Average Demand Probability: 0.30") can cascade into flawed conclusions. It is overall highly vulnerable to biased or incomplete data.
Lack of dynamic adaptation: The tree cannot flexibly update probabilities or outcomes in response to new information (e.g., a competitor entering the market after Year 2).
Decision tables are structured tools used to model complex decision logic compactly by organizing conditions (inputs) and corresponding actions (outputs) into a tabular format. Key features include:
Rows: Each row represents a unique rule (combination of conditions).
Columns: Each column corresponds to a decision variable (condition) or the final action to take.
Decision column: The last column specifies the outcome or action when all conditions in a row are satisfied.
Unlike decision trees, decision tables do not enforce a sequence for evaluating conditions, allowing multiple independent rules to coexist without hierarchical order. This makes them efficient for scenarios where numerous conditions interact non-sequentially (e.g., eligibility checks, business rule engines).
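A minimal Python sketch of a decision table and a rule-matching function (the eligibility conditions and actions are invented):

# Each row is one rule: a full set of conditions plus the action (decision) column.
RULES = [
    {"age_ok": True,  "income_ok": True,  "action": "approve"},
    {"age_ok": True,  "income_ok": False, "action": "refer to manual review"},
    {"age_ok": False, "income_ok": True,  "action": "reject"},
    {"age_ok": False, "income_ok": False, "action": "reject"},
]

def decide(age_ok: bool, income_ok: bool) -> str:
    # Rules are independent: the order in which rows are checked does not matter here.
    for rule in RULES:
        if rule["age_ok"] == age_ok and rule["income_ok"] == income_ok:
            return rule["action"]
    raise ValueError("no matching rule: the table is incomplete")

print(decide(age_ok=True, income_ok=False))  # refer to manual review

Because every row carries its full set of conditions, the rows can be evaluated in any order, unlike a tree, where the path fixes the evaluation sequence.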
Knowledge acquisition (KA) is the process of extracting domain-specific knowledge from human experts (e.g., doctors, engineers) through interviews, observations, or documentation. Several key challenges when performing professional KA include:
Eliciting decision variables: Experts may struggle to articulate implicit knowledge (e.g., "How do you diagnose a rare disease?").
Identifying cutoff points: Experts might use fuzzy or context-dependent criteria (e.g., "200 milliseconds response time is critical" in a robotics system).
Identifying inconsistencies: Experts may contradict themselves unknowingly (e.g., approving a loan in one scenario but rejecting it in a similar case).
KA is typically carried out through structured interviews and questioning, supported by tools such as decision trees, decision tables, and flowcharts that organize the elicited knowledge.
Another process relevant to building reliable knowledge graphs is knowledge engineering (KE), the structuring of acquired knowledge into a formal, computable system that is complete and consistent. Akin to assembling a complete puzzle, key challenges related to the pieces include:
Input handling completeness: Ensuring system handles all possible input combinations (e.g., edge cases like "Debt-to-Income ratio = 1.1").
Data consistency: Resolving conflicts (e.g., overlapping rules that lead to contradictory decisions).
Logic validation: Testing whether the system mirrors the expert’s reasoning across diverse scenarios.
KE can be done with formal logic networks (e.g., ontologies, rule-based systems) and through iterative refinement via prototyping and expert feedback.
There is a clear interplay between KA and KE. KA feeds KE with gathered and documented raw expert knowledge, which KE refines by formalizing the data, identifying gaps in it, and resolving inconsistencies. Experts then review the engineered system, prompting further KA to fill in missing logic where needed.
Decision tree induction is a machine learning technique that automates the construction of decision trees by learning hierarchical decision rules from labeled training data.
Unlike manual knowledge engineering, which requires domain experts to explicitly define variables and thresholds, induction algorithms (e.g., CART, ID3, C4.5) analyze a dataset to:
Identify splits: Determine optimal features and cutoff points (e.g., "Total Lymphocyte Count < 1.0728") that partition the data into homogeneous subsets.
Maximize information gain: Use metrics like Gini impurity or entropy to prioritize splits that best separate classes (e.g., "CD4 ≤200 cells/μL" vs. "CD4 >200 cells/μL").
Generalize rules: Build a tree structure that balances predictive accuracy with simplicity to avoid overfitting.
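A small scikit-learn sketch of induction (scikit-learn's decision trees are CART-based; the toy data and feature names below are invented):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [total_lymphocyte_count, cd4_count] -> label (1 = at risk)
X = [[0.8, 150], [0.9, 180], [1.5, 450], [2.0, 600], [1.2, 300], [0.7, 90]]
y = [1, 1, 0, 0, 0, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# The induced splits (features and cutoff points) are learned from the data, not hand-written.
print(export_text(tree, feature_names=["lymphocyte_count", "cd4"]))

export_text prints the learned splits, making it easy to see that the thresholds were chosen by the algorithm rather than by a domain expert.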
One caveat with decision tree induction: its models may produce splits that appear illogical or counterintuitive, even for simple problems. This occurs because decision trees prioritize statistical patterns in the training data over human-interpretable logic.
Consider an induced animal-classification tree with branches such as "feathers in [no] → 'no'" and "legs < 5.5 → 'yes'". These rules can seem nonsensical because animals without feathers typically still have legs, while the second rule suggests otherwise, leading to misclassification of creatures or artifacts (e.g., an 8-legged spider or a 4-legged table). Mathematically, the 5.5 cutoff for legs is statistically derived from the dataset, but it conflicts with real-world knowledge (e.g., insects have 6 legs).
This phenomenon tends to happen for the following reasons:
Data-driven splits: Decision trees use metrics like Gini impurity or information gain to select features and thresholds that best separate classes, regardless of human logic.
Overfitting to noise: The model may overfit to quirks in the training data.
Simplistic thresholds: Continuous variables are split at arbitrary decimal cutoffs for computational simplicity, ignoring biological categories (e.g., the 5.5-leg threshold above).
The CART algorithm, introduced by Leo Breiman and colleagues in 1984, is a foundational technique for building decision trees in machine learning. Here is a breakdown of its core process:
Recursive splitting: Starts with entire training dataset and recursively partitions it into subsets. For each split, evaluates all possible feature-value thresholds (e.g., "Age ≤ 30") to maximize reduction in impurity (e.g., Gini impurity for classification, variance reduction for regression).
Greedy strategy: Selects best split at each step without considering future splits (no "look-ahead"), prioritizing immediate impurity reduction.
Stopping criteria: Halts splitting when predefined conditions are met (e.g., min number of samples in a leaf node, max tree depth). Terminal nodes become leaf nodes, assigning final predictions (e.g., class labels or regression values).
CART utilizes pruning (post-tree construction) to reduce overfitting by simplifying the tree. It does so by using a validation dataset to evaluate the impact of removing each leaf node, then trimming branches that do not improve generalization performance. The output is a smaller, more robust tree with fewer spurious splits.
Overall, CART decision trees are easily interpretable, can handle mixed data types, and work well with nonlinear relationships. However, they are prone to overfitting without pruning, and their greedy splits may miss globally optimal solutions.
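The validation-driven pruning described above can be approximated in scikit-learn through cost-complexity pruning (the ccp_alpha parameter); a rough sketch with synthetic data, using a validation split to pick the pruning strength:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree, then evaluate progressively stronger pruning on held-out data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.4f}, validation accuracy={best_score:.3f}")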
Mentioned briefly in the CART case study is Gini impurity, a measure used in decision tree algorithms to quantify the 'impurity' or disorder of a dataset. It evaluates how likely a randomly chosen element from the dataset would be misclassified if labeled randomly according to the class distribution in the subset. A lower Gini impurity indicates a more homogeneous (pure) subset, which is desirable for effective splits.
In mathematical terms, for a dataset D with k classes, Gini impurity is calculated as:
Gini(D) = 1 − ∑_(i = 1..k) p_i^2, where:
p_i: Probability (proportion) of class i in the dataset.
k: Total number of classes.
For example, a subset containing 70% dog and 30% cat samples has Gini impurity 1 − (0.7^2 + 0.3^2) = 0.42. A perfectly pure subset (100% one class) has Gini(D) = 0.
When splitting a dataset D on an attribute A into subsets D_1 and D_2, the weighted Gini impurity of the split is:
Gini_A(D) = (n_1/n) · Gini(D_1) + (n_2/n) · Gini(D_2), where:
n: Total samples in D.
n_1, n_2: Sizes of subsets D_1 and D_2.
The algorithm selects the split that minimizes Gini_A(D), prioritizing splits that create the purest subsets.
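Both formulas fit in a few lines of Python (the class counts here are arbitrary):

def gini(counts):
    # Gini impurity of a node given per-class sample counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left_counts, right_counts):
    # Weighted Gini impurity of a binary split.
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return (n1 / n) * gini(left_counts) + (n2 / n) * gini(right_counts)

print(gini([7, 3]))                   # 0.42 (the 70% / 30% example above)
print(weighted_gini([7, 0], [0, 3]))  # 0.0: a perfect split leaves only pure subsets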
In practice, take the context of predicting whether patients are diseased (Class 1) or not (Class 2); here is how Gini impurity quantifies node purity and, in turn, shapes a decision tree's accuracy and interpretability:
At Node A (Gini = 0): All 10 patients in this node are disease-free. Rule is crystal clear (e.g., "If lab result < X, no disease"). No misclassifications here — 100% correct predictions.
At Node B (Gini = 0.42): 3 patients have the disease; 7 do not. Tree would split further to reduce impurity (e.g., add rule "If biomarker Y > 50, predict disease.") A subsequent split could isolate the 3 diseased patients into a pure node.
Node C (Gini = 0.5): Equal class distribution (5 disease, 5 no disease). Tree must split here to improve accuracy. For instance, using a feature like "symptom Z present" to create purer subsets.
The algorithm prioritizes splits that minimize Gini impurity. That way, accuracy increases as splits isolate diseased and non-diseased patients into purer groups. Nodes with low Gini (e.g., Node A) translate to simple, actionable rules (e.g., "If biomarker X < 100 → no disease"), while high-Gini nodes (e.g., Node C) signal ambiguous regions, prompting further splits that clinicians can validate (e.g., "If fever > 3 days, run additional tests"). Branches split this way are more readable and verifiable by human clinicians.
Gini impurity enables decision trees to partition data into increasingly pure subsets, driving accurate and interpretable models.
Other than impurity, decision trees have other metrics to determine optimal splits during training. One of them is squared error (residual error, used for regression), which quantifies prediction error:
Error(D) = ∑_i (y_i − ȳ)^2, summed over all samples in D, where:
y_i: Actual value of target variable.
ȳ: Mean of the target variable in D.
With error, splits are chosen to reduce variance, ensuring predictions are close to the mean in each subset. For example, a node predicting house prices aims to minimize variance of prices in each subset.
Entropy (information theory) is another metric for measuring classification disorder in a dataset, making it an alternative to Gini. Like its fellow metric, lower entropy indicates a more ordered/pure node, down to 0 for 100% purity.
Splits are chosen to maximize information gain (the reduction in entropy), where Entropy(D) = −∑_(i = 1..k) p_i · log_2(p_i). For example, a node with 50% "spam" and 50% "not spam" has entropy −(0.5 · log_2 0.5 + 0.5 · log_2 0.5) = 1 bit, the maximum for two classes; a split that cleanly separates spam from non-spam drives entropy toward 0 and therefore yields the largest information gain.
In practical use, Gini is a good default for balanced datasets and for speed; for imbalanced classes, or when information gain is the criterion of interest, entropy is often preferred. By selecting the appropriate metric, decision trees balance accuracy, speed, and interpretability for diverse tasks.
An ensemble in machine learning refers to a technique where multiple models, often called 'weak learners' or 'weak classifiers', are combined to produce a stronger, more robust predictive model. The core idea is that aggregating the predictions of several models reduces errors caused by individual models, leading to better generalization and performance.
Bagging (bootstrap aggregation) is an ensemble method that improves model stability and accuracy: multiple 'weak learners' are trained independently on different subsets of the original data, sampled with replacement, and their predictions are aggregated into a final prediction. Bagging works in the following order:
Bootstrap sampling: Generate multiple subsets (bootstrap samples) from the original training dataset D. Each subset D′ is created by randomly sampling n instances with replacement (allowing duplicates). On average about 37% of the original data (roughly 1/e) is excluded from each sample; these are the out-of-bag instances.
Parallel model training: Train a separate model (e.g., decision tree using CART) on each bootstrap sample D′. Models are trained independently, enabling parallel processing.
Aggregation of predictions: For classification, majority voting (most frequent class prediction across all models). For regression, averaging predictions from all models.
Performing the steps above reduces variance in decision tree models, which mitigates overfitting and stabilizes predictions. The effect is most notable for high-variance models, i.e., models overly sensitive to their training data.
The bagging approach reduces sensitivity to noise in the training data, enables parallelizable training (ability to divide training process into smaller tasks that can be done at the same time) for improved efficiency, and provides high transparency through aggregated results.
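A short scikit-learn sketch of the procedure (synthetic data; hyperparameter values are arbitrary, and the base-tree argument is named estimator in recent scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 100 trees, each fit on its own bootstrap sample; oob_score evaluates on out-of-bag instances.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
bag.fit(X, y)
print("out-of-bag accuracy:", round(bag.oob_score_, 3))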
The random forest model builds upon bagging by introducing feature randomness: at each split, only a random subset of the features is considered, further decorrelating the bagged trees to enhance generalization.
Random forests are like a team of scouts (decision trees). Each scout explores a different part of the forest (bootstrapped data samples) and notices unique details (randomly selected features). They share their findings, and the team combines all reports to map the forest accurately (aggregated predictions by majority or average vote).
Boosting in decision trees is an ensemble technique that sequentially combines weak learners (typically shallow trees) to form a strong predictive model. It does so by:
Initial model: Train a weak decision tree on original dataset. Predictions G_1(x) are made and residuals, differences between predictions and actual values, are calculated.
Data re-weighting: Instances with large residuals (high prediction errors) are assigned higher weights, forcing subsequent models to focus on correcting these errors.
Iterative model training: Train a new decision tree on the re-weighted data (or directly on residuals in gradient boosting). Each subsequent model G_2(x) and beyond targets residuals of the combined previous models.
Final model: Combine all weak learners into a weighted sum G(x) = G_1(x) + α_2[G_2(x)] + … + α_n[G_n(x)], where weights α_i are optimized to minimize overall error.
AdaBoost is an ensemble learning algorithm that combines multiple weak classifiers (e.g., decision stumps) into a strong classifier by iteratively focusing on misclassified samples. Here is a breakdown of its processes:
Initialize weights: Assign equal weights to all training samples w_i = 1/N over i = 1, 2, …, N. This ensures all samples contribute equally initially.
Train weak classifier: Fit a weak classifier (e.g., decision stump) using current sample weights w_i.
Compute weighted error: Measure the classifier's error err_m = ∑_(i = 1..N) w_i · I[y_i ≠ G_m(x_i)] / ∑_(i = 1..N) w_i, where I(…) is the indicator function (1 if misclassified, 0 otherwise), so the error is weighted by sample importance.
Compute classifier weight: α_m = log[(1 − err_m) / err_m] balances the influence of each weak learner in the final model; a lower err_m yields a higher α_m, so more trustworthy classifiers carry more weight.
Update sample weights: Through w_i ← w_i · exp{α_m · I[y_i ≠ G_m(x_i)]}, increase the weights of misclassified samples (since α_m > 0); after renormalization, correctly classified samples carry correspondingly less weight.
Final model: The strong classifier G(x) = sign[∑_(m = 1..M) α_m · G_m(x)] is a weighted majority vote of all weak classifiers.
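The steps above can be followed almost literally in a short numpy/scikit-learn sketch (labels are recoded as +1/−1; the dataset and number of rounds are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = np.where(y == 1, 1, -1)            # recode labels as +1 / -1

N, M = len(y), 10
w = np.full(N, 1.0 / N)                # step 1: equal initial weights
stumps, alphas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1)       # decision stump
    stump.fit(X, y, sample_weight=w)                  # step 2: weighted fit
    miss = (stump.predict(X) != y).astype(float)
    err = np.sum(w * miss) / np.sum(w)                # step 3: weighted error
    alpha = np.log((1.0 - err) / (err + 1e-12))       # step 4: classifier weight
    w = w * np.exp(alpha * miss)                      # step 5: boost misclassified samples
    w = w / np.sum(w)                                 # renormalize the weight distribution
    stumps.append(stump)
    alphas.append(alpha)

# step 6: weighted majority vote of all stumps
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(votes) == y))

In practice one would reach for sklearn.ensemble.AdaBoostClassifier, which implements the same idea with additional safeguards.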
Gradient boosting is a boosting technique that optimizes a loss function using gradient descent in function space (rather than parameter space). Unlike traditional boosting methods (e.g., AdaBoost), which adjust instance weights or rely on exponential loss, gradient boosting:
Fits new weak learners (typically decision trees) to negative gradients (residual errors) of the current ensemble’s predictions.
Minimizes loss function via gradient descent, iteratively correcting errors from prior models.
Generalizes boosting by supporting arbitrary differentiable loss functions (e.g., log loss) instead of being restricted to specific loss formulations.
Imagine a detective solving a case by focusing only on the mistakes of previous detectives (weak learners). Each new one reviews the errors (residuals/gradients) left by the last, corrects them, and passes the updated case file forward. By combining all these small fixes, they eventually crack the case (ensemble prediction) perfectly.
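For squared-error loss, the negative gradient is simply the residual, so the loop can be sketched directly (synthetic data; the depth, learning rate, and round count are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred = np.full_like(y, y.mean())       # stage 0: constant prediction (the mean)
trees, learning_rate = [], 0.1
for _ in range(50):
    residuals = y - pred               # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)             # fit the next weak learner to the residuals
    pred = pred + learning_rate * tree.predict(X)   # shrink and add the correction
    trees.append(tree)

print("training MSE:", round(float(np.mean((y - pred) ** 2)), 4))

Libraries such as sklearn.ensemble.GradientBoostingRegressor generalize this loop to other differentiable loss functions.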
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of gradient boosting designed for efficiency, scalability, and high predictive performance. It builds an ensemble of weak learners (decision trees) sequentially, with each tree trained to correct the residuals of the previous ensemble, and adds regularization of tree complexity, second-order gradient information, and engineering optimizations such as parallelized tree construction and column subsampling.
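A minimal sketch with the xgboost Python package (synthetic regression data; the hyperparameter values are arbitrary):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 trees is fit to the residuals of the current ensemble,
# with shrinkage (learning_rate) and L2 regularization (reg_lambda) to curb overfitting.
model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4, reg_lambda=1.0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))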
Feature bagging (random subspace method) is an enhancement to traditional bagging, designed to reduce correlation among decision trees in a random forest by introducing randomness in feature selection. For each tree in the ensemble, only a random subset of features is considered for splitting at each node. For example, if a dataset has 100 features, each tree might use √100 = 10 randomly chosen features per split.
With feature bagging, this decorrelation reduces the likelihood of all trees relying on the same dominant features (e.g., "income" or "age"), leading to more diverse, less correlated trees. Additionally, preventing the model from over-relying on a few strong features mitigates overfitting and improves generalization.
For example, in a use case about predicting political affiliation from shopping habits, there are 100+ variables (e.g., "organic purchases"). Each tree uses a random subset (e.g., 10 features) to split nodes. As a result, the trees discover varied relationships (e.g., one tree links "book purchases" to political leaning, another uses "brand loyalty"), improving overall model accuracy.
Random forests' trees are typically kept shallow (e.g., depth 5) when implemented with feature bagging, to maintain computational efficiency; yet hundreds of trees are trained so that the ensemble still covers many feature combinations.
Overall, feature bagging enhances traditional bagging by injecting randomness into feature selection, producing more robust and accurate ensemble models.
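A short scikit-learn sketch that mirrors the shallow-trees, many-trees pattern described above (synthetic data; the settings are illustrative only):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 100 features; with max_features="sqrt", each split considers only sqrt(100) = 10 of them.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,     # many shallow, decorrelated trees
    max_depth=5,
    max_features="sqrt",  # feature bagging: random subset of features per split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)
print("out-of-bag accuracy:", round(forest.oob_score_, 3))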