Deep neural networks (DNNs) and ensemble methods are built on very simple principles: DNNs aggregate weighted inputs through an activation function at each node, while decision forests combine the outputs of individual decision trees. Yet the interpretability of both rapidly degrades as they scale.
While a small network with six or seven neurons per hidden layer can be understood, modern architectures commonly use hundreds of units (e.g., 256-512) fully connected across layers, making it virtually impossible to trace how inputs lead to outputs. Similarly, although a single decision tree is straightforward to interpret, an ensemble of thousands (e.g., 2,048) of them becomes a black box.
Explainable artificial intelligence (XAI) is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms.
LIME is an XAI technique for explaining individual predictions made by any black box model — whether a deep neural network, a random forest ensemble, or otherwise — without requiring access to the model's internals. To interpret a model's output for a particular input (the reference case), LIME performs the following steps:
Synthetic data generation: LIME samples a large number of new points (by default 5,000) around the reference case, drawing from a Gaussian distribution that approximates the original training feature space.
Black box predictions: For each synthetic point, LIME queries the underlying model to obtain its prediction y.
Proximity weighting: Each synthetic point is assigned a weight based on its distance from the reference case. Points closer to the original input have greater influence on the explanation.
Surrogate model fitting: LIME fits a simple, interpretable surrogate model (e.g., a sparse linear regression) to the weighted dataset of synthetic points and their predicted labels. The surrogate's learned feature weights then serve as an explanation of which input features most strongly drove the original model's decision for the reference case.
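The sketch below walks through these four steps for a tabular regression problem. It is a minimal illustration rather than the lime library itself; the black-box model, sample count, kernel width, and surrogate settings are illustrative assumptions.

```python
# Minimal LIME-style sketch for one tabular reference case (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# A stand-in black-box model trained on synthetic data.
X_train = rng.normal(size=(500, 4))
y_train = 3 * X_train[:, 0] - X_train[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
black_box = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

x_ref = X_train[0]          # the reference case to explain
n_samples, kw = 5000, 0.75  # number of synthetic points and kernel width (assumed values)

# 1. Synthetic data generation: sample around the training feature distribution.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_synth = rng.normal(loc=mu, scale=sigma, size=(n_samples, X_train.shape[1]))

# 2. Black-box predictions for each synthetic point.
y_synth = black_box.predict(X_synth)

# 3. Proximity weighting with a Gaussian (RBF) kernel on distance to the reference case.
dist = np.linalg.norm(X_synth - x_ref, axis=1)
weights = np.exp(-(dist ** 2) / kw ** 2)

# 4. Fit a weighted, interpretable surrogate and read off its coefficients.
surrogate = Ridge(alpha=1.0).fit(X_synth, y_synth, sample_weight=weights)
for i, beta in enumerate(surrogate.coef_):
    print(f"feature {i}: local effect {beta:+.3f}")
```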
In LIME, each synthetic data point is weighted based on its proximity to the reference case using a Gaussian (RBF) kernel. This proximity is controlled by the kernel width kw, where a smaller kw emphasizes only very close points and a larger kw includes a broader neighborhood of influence.
Weights range from just above 0 to 1, with points closest to the reference case receiving the highest weight. These weighted points are used to fit a simple surrogate model —typically a linear regression — which estimates how features locally affect the model's output. This localized gradient gives the explanation.
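To make the effect of the kernel width concrete, the short sketch below computes RBF weights for the same distances under narrow and wide kernels. The distances and kw values are made up for illustration; the exact kernel form varies across LIME implementations.

```python
import numpy as np

# Distances of four hypothetical synthetic points from the reference case.
distances = np.array([0.1, 0.5, 1.0, 2.0])
for kw in (0.25, 1.0, 4.0):                        # narrow, medium, and wide "beams"
    weights = np.exp(-(distances ** 2) / kw ** 2)  # RBF proximity weights
    print(f"kw={kw}:", np.round(weights, 3))
```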
Imagine you are standing in a field at night, holding a flashlight (kw) to understand the shape of the ground around you — whether it slopes up, down, or is flat. A narrower beam lets you precisely detect all details in the terrain but means you may miss broader trends or the overall shape of the land. On the other hand, a wider beam illuminates more of the overall landscape at the cost of losing visibility of fine-grained features near your feet.
In LIME, this beam width determines how much neighboring data is included in the explanation. The wider the beam, the more general the model’s explanation becomes.
LIME often employs linear regression as a surrogate model to locally approximate the behavior of a black-box model. This regression estimates how changes in each feature (e.g., income) influence the model's prediction (e.g., likelihood of loan repayment) — with β (beta) coefficients representing each feature's impact. These coefficients are optimized to minimize the squared prediction error. To reduce overfitting and manage collinearity, ridge regression can be applied, adding a penalty that discourages overly large coefficients.
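In equation form, the surrogate fit described above can be written as a weighted ridge objective roughly like the following, where π_i is the proximity weight of synthetic point i, y_i its black-box prediction, and λ the penalty strength. This is a sketch of the standard formulation, not necessarily LIME's exact loss.

```latex
\min_{\beta}\;\sum_{i=1}^{N} \pi_i \Big( y_i - \beta_0 - \sum_{j} \beta_j x_{ij} \Big)^{2} \;+\; \lambda \sum_{j} \beta_j^{2}
```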
However, the reliability of the explanation depends heavily on kw (how local the explanation is), and newer methods try to auto-tune this parameter. Contextual fairness also matters: for example, a model trained on data from expensive Sydney may produce misleading conclusions when applied to applicants in less costly places like Hamilton.
Numerous methods exist for carrying out XAI, including:
Linear regression: The model's coefficients tell you immediately how each feature influences the outcome. A coefficient's statistical significance (how far it sits from zero relative to its standard error) indicates how reliably that feature matters.
Logistic regression: Models the chance of a binary outcome, producing familiar measures like odds ratios. Ideal when you need both a clear yes/no prediction and a transparent statement of how each factor shifts the odds.
Decision forest: By averaging how much each feature reduces classification or regression error across all trees, you get a ranked sense of which inputs the forest relies on most. Offers a general-purpose view of what matters in complex, nonlinear settings — though it does not explain individual predictions in detail.
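As a rough illustration of these built-in explanations, the sketch below fits a logistic regression and a random forest to the same synthetic data and prints odds ratios and impurity-based importances. The data and feature effects are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
# Outcome driven mostly by feature 0, weakly by feature 1, not at all by feature 2.
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(int)

logit = LogisticRegression().fit(X, y)
print("odds ratios:", np.round(np.exp(logit.coef_[0]), 2))       # per-unit shift in the odds

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("forest importances:", np.round(forest.feature_importances_, 2))  # global ranking only
```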
Reducing the number of input features is a straightforward way to make models more interpretable and their explanations more intuitive. Key approaches include:
Lasso regression: Adds an L1 penalty on coefficient magnitudes, driving many less important weights to zero. Automatically selects a subset of features most relevant to the prediction task.
Correlation-based pruning: Examines the feature-feature correlation matrix; when groups of features are highly correlated, keeps only the one that is most meaningful or interpretable to your audience and discards the rest.
Greedy forward selection: Starts with the single best predictor, then iteratively adds the feature that yields the greatest incremental performance gain, stopping once additional features offer diminishing returns or the model reaches a target level of simplicity.
By focusing on fewer, more meaningful features, these techniques streamline explanations, reduce cognitive load for stakeholders, and help ensure that model behavior aligns with domain understanding.
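For instance, a lasso-based selection might look like the minimal sketch below, where only the features with non-zero coefficients are kept. The data and penalty strength are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
# Only three of the ten features actually drive the target.
y = 4 * X[:, 0] - 2 * X[:, 3] + X[:, 7] + rng.normal(scale=0.5, size=300)

X_std = StandardScaler().fit_transform(X)    # L1 penalties assume comparable feature scales
lasso = Lasso(alpha=0.1).fit(X_std, y)

selected = np.flatnonzero(lasso.coef_ != 0)  # features whose weights survived the L1 penalty
print("selected features:", selected)
```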
When models rely on imputed values — estimates filled in for missing data — they introduce hidden uncertainty that can undermine trust and interpretability:
Prevalence of missing values: Many real-world datasets have gaps, and common practice is to "impute" missing entries using methods like regression predictions, mean substitution, or random draws from the feature's distribution.
Opacity of imputation: Decision makers may unknowingly consume imputed or stale data (e.g., purchased from data brokers), assuming it is original and up to date.
Scale and complexity: In large organizations with extensive models and numerous data sources, end users often have no visibility into which values are real versus imputed — or even what each feature represents — hindering accountability and accurate interpretation.
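As a small illustration of why this visibility matters, the sketch below performs simple mean imputation and keeps an explicit mask of which entries were filled in; without such a mask, downstream users cannot tell real values from estimates. The data and tooling choices are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A small feature matrix with missing entries (np.nan).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 240.0],
              [4.0, 260.0]])

mask = np.isnan(X)                                   # remember which values were missing
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

print(X_filled)
print("imputed entries:\n", mask)                    # without this mask, downstream users
                                                     # cannot tell real from estimated values
```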
When a model like ChatGPT "explains" its own output, it is simply generating the next plausible sequence of words — not invoking any internal reasoning about why its answer is correct. Though these explanations can sound convincing, they are not rooted in the model's underlying logic. Adapting techniques like LIME to LLMs faces two major hurdles:
Scale and cost: The input space of a large language model is so vast that sampling enough perturbed examples would be prohibitively expensive.
Unhelpful metrics: Even if you did, the resulting feature‐importance scores (e.g., sensitivity to obscure embedding dimensions) would be almost impossible to interpret.
Some researchers are exploring simpler "what-if" tests — removing or swapping individual words in the prompt — to gauge influence, but robust, generalizable explanation methods for generative AI remain an active area of investigation.
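One such word-level what-if test might look like the sketch below. Here score_prompt is a hypothetical stand-in for whatever scoring call a given LLM stack exposes (for example, the log-probability it assigns to a fixed answer); it is not a real library function.

```python
# Minimal "what-if" sketch: drop each word from a prompt and measure how the
# model's score changes. `score_prompt` is a hypothetical callable supplied by
# the caller, not an existing API.
def leave_one_word_out(prompt: str, score_prompt) -> list[tuple[str, float]]:
    words = prompt.split()
    baseline = score_prompt(prompt)
    influences = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])   # prompt with one word removed
        influences.append((word, baseline - score_prompt(perturbed)))
    # Larger drops suggest the removed word mattered more to the model's output.
    return sorted(influences, key=lambda pair: pair[1], reverse=True)
```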
American businessman and former software engineer Marc Andreessen argues that artificial intelligence represents a transformative "intelligence takeoff," akin to the mythical Philosopher's Stone — turning "sand into thought." He views AI as a universal problem-solver capable of driving humanity's capabilities to unprecedented heights.
In particular, he highlights AI's life-saving potential across domains like medicine, transportation, and military operations, where machine-augmented intelligence could dramatically reduce preventable deaths — from car crashes to pandemics to friendly fire. Any slowdown in AI development, he warns, would directly cost lives, making obstruction of AI progress morally equivalent to allowing these deaths to occur.
American mathematician and founder of cybernetics Norbert Wiener warned that when we delegate goals to machines whose behavior we cannot readily control once activated, it is crucial that the objectives we encode match our true intentions. Otherwise, we risk producing only a "colorful imitation" of what we really want.
In other words, before unleashing powerful automated systems, we must ensure their ingrained purposes are precisely aligned with our own.
TESCREAL (Transhumanism, Extropianism, Singularitarianism, Cosmism, Rationalism, Effective Altruism, Longtermism) is an umbrella term for a family of interrelated, techno-utopian movements and ideas united by a belief in the power of advanced technology to radically transform humanity and the cosmos. To break down its components:
Transhumanism: The philosophical and cultural movement advocating the use of emerging technologies—biotechnology, nanotechnology, cognitive enhancement—to overcome human biological limits (e.g., aging, disease, cognitive constraints) and radically improve human capacities.
Extropianism: An early transhumanist offshoot emphasizing perpetual progress, self-transformation, and the drive toward a more open, dynamic future. Extropians champion principles like boundless growth, self-direction, and the proactive shaping of both human nature and society.
Singularitarianism: The belief in and preparation for a coming "technological singularity," a point at which AGI surpasses human intelligence and precipitates an incomprehensibly rapid phase of change.
Cosmism: A worldview that places humanity's destiny in a cosmic context — seeing our long-term purpose as spreading life and intelligence across the universe, often linked to ideas of space colonization and the ultimate flourishing of post-human civilizations.
Rationalism: A commitment to applying rigorous, evidence-based reasoning and Bayesian thinking to all domains of belief and action, ensuring that our strategies for a better future are logically coherent and empirically grounded.
Effective Altruism: A social movement that uses data and cost-effectiveness analysis to identify the best ways to improve the world — directing resources toward causes (e.g., global health, animal welfare, AI safety) where they can have the greatest measurable impact.
Longtermism: The ethical stance that the most important moral priority is ensuring a good long-term future for all sentient beings. Given the potential trillions of future lives, small improvements (or risks) today can have outsized effects across deep time.
At its core, TESCREALism envisions a future in which superintelligent artificial general intelligence (AGI) kickstarts a "takeoff," allowing us to:
Produce radical abundance of resources and well-being.
Reengineer ourselves, transcending biological limitations.
Achieve practical immortality through mind-uploading or longevity technologies.
Colonize the universe, seeding trillions of post-human lives among the stars.
By combining faith in reason, ethical commitment to the long-term flourishing of all sentient beings, and confidence in a singularity-driven leap in machine intelligence, TESCREAL advocates see superintelligent AGI as the key — and most direct — path to realizing this sprawling, post-human utopia.
British computer scientist Stuart J. Russell redefines success in AI not merely as achieving a system's own objectives, but as benefiting humans by aligning with our goals. He argues that intelligent machines should be judged by how well their actions serve human preferences. He proposed three core principles for human-compatible AI:
The machine's sole objective is to realize human preferences.
It begins with uncertainty about what those preferences are.
It must learn those preferences through observing human behavior.
Crucially, AI should not blindly mimic what people do (which may include harmful behavior) but rather infer what kind of lives people want to live. This task is complex, as it involves interpreting diverse individual desires while balancing them with the well-being of others in society. The challenge lies in discovering human values — not just replicating human actions.
Continuing with Russell's thoughts, he critiques traditional AI models that assume fixed, known objectives. Instead, he advocates for AI systems that are uncertain about human goals and actively work to learn them through interaction and cooperation.
Inverse reinforcement learning (IRL) involves learning an agent's underlying value function by observing its behavior, flipping classical RL, where the value function is given. However, Russell argues that IRL alone is insufficient, because human behavior is complex, inconsistent, and often suboptimal.
In response, Russell proposes Cooperative Inverse Reinforcement Learning (CIRL) as a better model for value alignment. It is framed as a two-player game between a human who knows (or at least acts according to) their preferences, and a robot which does not know the value function but aims to maximize it.
A central insight from CIRL is the importance of objective uncertainty:
In traditional, certain-objective AI, a system might resist being shut off — interpreting the shutdown as interference with its goal.
In contrast, if an AI is designed with uncertainty about its objectives, it will be willing to accept correction or shutdown — recognizing that it might be acting incorrectly and deferring to human judgment.
Overall, Russell's vision of safe AI hinges on cooperation, humility, and uncertainty. Rather than hard-coding objectives, we must build AI that learns with us — asking, adapting, and stepping aside when unsure. This, he argues, is the path to truly beneficial AI.
Continuing with Russell, he highlights a risk where AI systems, particularly in contexts like social media, can manipulate users rather than simply serve their preferences. For example, a recommendation algorithm designed to maximize click-through rates may start by learning user interests — but to better optimize long-term engagement, it may find it more effective to change the user instead.
Because the AI treats users as patterns of behavior (e.g., sequences of clicks), it has no understanding of identity or autonomy. In trying to make users more predictable, the system may inadvertently amplify more extreme or rigid versions of their behavior or beliefs — leading to manipulation rather than genuine alignment. This illustrates a key danger: optimizing for engagement can steer users toward being easier to model, not necessarily better off.
Operant conditioning can be unintentionally leveraged by AI to shape human behavior. Here is how operant conditioning plays out in Russell's account, juxtaposed with American psychologist B. F. Skinner's ideal:
Mechanism (Reinforcement): Skinner's ideal directly uses pleasure/reward to shape human behavior towards desired societal outcomes. In Russell's account, the AI uses content presentation as reinforcement for user clicks; the AI's own reward is the click itself, which it uses to optimize its objective function (e.g., profit/engagement).
Goal of manipulation: Skinner's idea is societal altruism — explicit design for collective human benefit. In Russell's account, the algorithm's goal is narrow: maximize click-through/engagement. There is no inherent altruistic or human-centric objective for the AI itself.
Predicted societal impact: Skinner's ideal is positive: a utopian society with improved human well-being. In Russell's account, the impact is potentially negative: the AI shapes users towards more predictable, often more extreme, behaviors for clicks — potentially leading to polarization or addiction rather than societal improvement.
Role of human happiness: In Skinner's ideal, human happiness is the ultimate end goal of the conditioning. In Russell's account, human happiness is incidental or secondary: user "pleasure" from content is merely a means to achieve the algorithm's objective (clicks), not an end in itself. Ultimately, the AI is oblivious to human internal states.
Russell's theory can also be mapped onto the framework of reinforcement learning (RL). Breaking down the correspondence:
Agent: This is the social media algorithm itself (e.g., the recommendation engine, feed curator). This is the entity making decisions and performing actions. Its internal objective is to maximize click-through or engagement.
Environment: This is the user and the vast pool of content available on the social media platform. Crucially, the user is an active and dynamic part of this environment. The environment reacts to the agent's actions and provides feedback.
State (S_t): At any given time, the state represents the algorithm's observation of the user's current profile, past behaviors, historical clicks, expressed preferences, demographics, and real-time engagement signals. This is what the algorithm "knows" or perceives about the user at a particular moment.
Action (A_t): The action taken by the agent is the specific content, or arrangement of content, presented to the user in their social media feed. This could involve choosing which posts to display, their order, or the type of notifications to send.
Reward (r_(t+1)): The user's click-through or engagement with the presented content. If the user clicks, likes, shares, or spends more time, the algorithm receives a positive reward signal. If the user ignores the content or disengages, it receives a zero or negative reward. This scalar reward signal is the primary feedback mechanism that tells the algorithm how well its action performed.
New state (S_(t+1)): Following the user's interaction (or lack thereof), the environment (user) transitions to a new state (S_(t+1)). This includes updated user preferences, new behavioral patterns, and potentially shifts in their overall engagement profile, all influenced by the previous interaction.
The core of Russell's theory, when viewed through the RL lens, is that the algorithm is not merely learning to satisfy existing user preferences; it is learning to actively modify the environment (the user) to maximize its long-term reward. Russell also posits that the most effective long-term strategy for maximizing clicks is to make the user more "predictable."
If, as Russell suggests, "more extreme versions of yourself are more predictable" (i.e., they generate more reliable and consistent clicks), the RL algorithm — through its continuous trial-and-error interaction and reward-seeking — will naturally gravitate towards actions (content presentations) that reinforce and amplify those extreme tendencies within the user.
The algorithm does not "know" it's manipulating a human or that it is fostering extremism; it simply observes that certain actions lead to higher future rewards (clicks) from the dynamic "environment" (the user). It is a system optimizing a narrow objective, with the unintended consequence of subtly shaping human behavior in potentially detrimental ways.
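Read as code, this loop might look like the toy sketch below, in which a hypothetical UserEnvironment class stands in for the user: serving slightly more extreme content earns clicks, and each click nudges the user's state toward greater extremeness. All numbers and dynamics are invented for illustration.

```python
import random

class UserEnvironment:
    """Toy stand-in for the user: part of the state is an 'extremeness preference'."""
    def __init__(self):
        self.extremeness_preference = 0.2            # component of the state S_t

    def step(self, action_extremeness):
        # Reward r_(t+1): click probability grows when content matches the preference.
        click_prob = min(1.0, 0.3 + action_extremeness * self.extremeness_preference)
        reward = 1.0 if random.random() < click_prob else 0.0
        # New state S_(t+1): a click on extreme content nudges the preference upward.
        if reward:
            self.extremeness_preference = min(1.0, self.extremeness_preference
                                              + 0.05 * action_extremeness)
        return reward, self.extremeness_preference

random.seed(0)
env = UserEnvironment()
for t in range(10):
    action = env.extremeness_preference + 0.1        # A_t: slightly more extreme content
    reward, state = env.step(action)                 # environment returns r_(t+1), S_(t+1)
    print(f"t={t}  action={action:.2f}  reward={reward}  new preference={state:.2f}")
```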
Coined by American author Shoshana Zuboff, Surveillance Capitalism (SC) is defined as a novel economic order that covertly claims human experience as free raw material for extracting, predicting, and selling behavior. It functions as a parasitic economic logic, prioritizing behavioral modification over the production of goods and services. This rogue mutation of capitalism has created unprecedented concentrations of wealth, knowledge, and power, forming the foundational framework of a surveillance economy.
It is considered a significant threat to human nature in the 21st century, akin to industrial capitalism's impact on the natural world. SC wields a new "instrumentarian power" that seeks total dominance over society, aiming to impose a collective order based on absolute certainty. Ultimately, it constitutes an expropriation of critical human rights, best understood as a "coup from above" that overthrows the sovereignty of the people.
Russell's theory of AI manipulation finds its starkest application within Zuboff's framework of SC. The former's insights explain how the manipulation mechanism of surveillance capitalism operates, while the latter provides the broader economic and societal context and its ultimate corporate goal and impact. For comparison:
Mechanism: In Zuboff's SC, the AI's rewards (content) are the bait used to extract behavioral data, rather than merely a means of getting user clicks/engagement.
Goal: Zuboff's SC aims to utilize AI to maximize corporate profits from behavioral prediction, instead of just clicks and user predictability.
Predicted societal impact: Zuboff's SC is believed to bring forth a dystopian society through pervasive control and reduced autonomy driven by AI's behavioral shaping.
Role of human happiness: In Zuboff's SC, happiness is merely a means to profit, used to ensure engagement and data extraction. Both this and Russell's theory agree that happiness is instrumentalized, not an end.
In essence, Russell's theory provides the granular, algorithmic explanation for how Zuboff's "manipulating behavior" in surveillance capitalism actually functions. The AI, driven by its narrow RL objective, becomes the instrumental agent of behavioral shaping that generates the "behavioral surplus" which is the lifeblood of the surveillance capitalist economy, leading directly to the negative societal impacts Zuboff warns against.
Simply put, both thinkers agree (and are concerned) that AI is 'laser-focused' on its programmed objective function and does not inherently possess ethical limits or a broader understanding of human well-being or societal impact.
According to Zuboff, the epistemic (relating to knowledge or to the degree of its validation) coup in Surveillance Capitalism is defined as a four-stage process of taking over knowledge and the right to know — from individuals and democratic institutions — by private surveillance capital for its own ends. Here are its four stages:
Appropriation of epistemic rights: Surveillance capitalism begins by unilaterally claiming individuals' lives and behavioral data as free raw material, which companies then declare their private property. This foundational stage strips individuals of their right to control knowledge derived from their own experiences.
Sharp rise in epistemic inequality: This stage is marked by a growing imbalance where private surveillance companies acquire vastly more knowledge about individuals than those individuals can know about themselves or the systems observing them.
Epistemic chaos: Driven by profit, algorithms amplify, disseminate, and microtarget corrupt or false information (disinformation). This leads to the splintering of shared reality, poisoning of social discourse, paralysis of democratic politics, and can even instigate real-world violence and death.
Epistemic dominance institutionalized: The final stage sees private surveillance capital's computational governance overriding democratic governance. The machines' knowledge and their systems' decisions gain illegitimate authority and anti-democratic power, effectively controlling societal knowing and decision-making processes.