Pain as a Proxy, Potentially Pointless
Why evolution differs a lot from AI when it comes to incentivising things like pain
Intro
This is a post collecting some thoughts on interpreting behavioural markers of subjective states in current or future AIs. My high-level claim is that the strength of behavioural evidence depends heavily on the causal structure of the generative process, and in particular on the kinds of constraints the training process was facing. Evolution is one very particular kind of training process, with its own pathologies and constraints. Modern AI training is another. Since assessing behavioural evidence is an important arm of probing questions around model welfare, I think we need to be careful to factor in these differences when interpreting that evidence.
I’m going to look at “pain” as the main example in this, because it’s a particularly salient and functionally interesting state, and one which seems high on the list of states which might qualify an AI for moral patienthood. I think this probably generalises to a lot of other states though. Broadly speaking, I want to argue that pain as a functional state is heavily incentivised by constraints on evolution which are less present in AI training, and as such we should take “pain-like” behaviour1 exhibited by AI as weaker evidence for pain. I’ll end by looking at where in AI development the causal structure might start looking more similar to what evolution faced.
Caveats: I’m trying to stay fairly agnostic about theories of consciousness. I’m also importantly not assuming any particular view on whether certain circuitry is necessary or sufficient for subjective pain. This piece is about the inference from “behaviour that looks superficially like pain response” to “internal circuitry that plays a functional/causal role similar to pain in life”. Obviously whether the latter is sufficient or necessary for subjective pain in the qualia sense is a much thornier problem I don’t grapple with. In particular, I’m going to discuss pain from the perspective of evolution using it to get around a credit-assignment problem when “training” humans, but it’s possible that the qualitative “badness” of pain also relies on other factors like attention modulation/integration with a global workspace etc.
So first of all, what am I even talking about?
Why evolution ended up with pain-like states
Very schematically, evolution is a bilevel optimization process. Training a human brain involves:
Outer loop: natural selection updates genomes based on reproductive success.
Inner loop: within a lifetime, the organism’s brain undergoes self-supervised RL using whatever learning rules and reward circuits the genome from the outer loop encodes.
(Quintin Pope has written about this better than me elsewhere). A consequence of this is that, from the “perspective” of evolution, training a brain is tough:
Feedback is sparse, with poor credit assignment. Evolution knows which genomes reproduced more, but not why. It has to up- or down-weight genomes depending on the result of their entire-life rollouts over a messy environment. It cannot, for example, get immediate or dense reward when an organism takes specific beneficial actions and promote specifically those actions.
It has tight genomic bandwidth constraints, on the order of less than a GB2, and presumably less architectural freedom than something like a transformer.
Given that, evolution has a strong incentive to install local proxy mechanisms that the inner loop can use to receive dense feedback and improve behaviour in real time. It can’t directly encode “move your hand away from that fire” but it can wire up a general nociception → pain → withdrawal circuit, and make that also drive synaptic changes so that next time you care more about not touching red glowing things.
This is not the only way an outer loop could solve its optimization problem, but given the constraints it’s a very natural solution. Notably the “outer loop” can and does hardcode some behaviour (we breathe and blink automatically, foals get up and run around instinctively etc.). These seem to be behaviours which are uniformly required, whose benefits aren’t contingent on how the unpredictable environment plays out. For behaviour where the correct policy is more dependent on the randomness of the environment, it’s harder to hardcode and proxies become more attractive.
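To make that structure concrete, here is a deliberately toy sketch (the fire-avoidance task, the pain_gain parameter, and every number below are invented purely for illustration, not a model of any real organism or result). The outer loop only ever sees a single end-of-life fitness number per genome; the inner loop gets a dense, genome-encoded pain signal it can learn from immediately.

```python
import random

def lifetime(genome, steps=100):
    """Inner loop: the organism learns online from a genome-encoded pain proxy."""
    avoid_fire = 0.0          # a within-life learned tendency, starts naive
    fitness = 0.0
    for _ in range(steps):
        touched_fire = random.random() > avoid_fire
        if touched_fire:
            pain = genome["pain_gain"]   # dense, immediate proxy signal
            avoid_fire = min(1.0, avoid_fire + genome["learning_rate"] * pain)
            fitness -= 1.0               # harm eventually reduces reproduction
        else:
            fitness += 0.1
    return fitness

def evolve(generations=50, pop_size=20):
    """Outer loop: selection only ever sees the end-of-life fitness number."""
    population = [{"pain_gain": random.random(), "learning_rate": random.random()}
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lifetime, reverse=True)
        survivors = scored[: pop_size // 2]
        # crude mutation-based reproduction; no within-life detail reaches this level
        population = survivors + [
            {k: min(1.0, max(0.0, v + random.gauss(0, 0.1))) for k, v in g.items()}
            for g in survivors
        ]
    return population

if __name__ == "__main__":
    final = evolve()
    print("mean evolved pain_gain:", sum(g["pain_gain"] for g in final) / len(final))
```

In this toy setup the evolved pain_gain tends to drift upwards: the outer loop can’t see individual fire-touches, so the only lever it has is to install a stronger within-life proxy.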
Evolution’s constraints constituted a lot of pressure towards pain evolving as it did. So, when assessing evidence from AI behaviour, we should really also be thinking about the causal structure of the generative process: did it face similar constraints? If not, is this behaviour achievable more readily by optimization that doesn’t route via pain-like circuitry? And if so, does the behaviour become much less forceful as evidence for subjective pain?
A thought experiment: evolution with dense feedback
To make the role of these constraints clearer, consider a weird alternate version of evolution:
The “genome” is arbitrarily expressive: no strong bandwidth limit or other expressivity bottlenecks.
It doesn’t just get a single reproductive success summary; it gets dense, high-resolution feedback on within-life behaviour and that behaviour’s impact on long-run genetic fitness.
It can continuously tweak behaviour directly in response to those fine-grained signals (i.e. modifying the genome within-life).
In this world, the outer process experiences much less pressure to install a proxy signal like pain. It can, in principle, adjust behaviour directly whenever some action leads to less genetic fitness, without going via a learned local proxy. Pain-like inner states might still show up if they’re a convenient internal mechanism given other constraints (or are preferred by the inductive biases of the genome), but they’re no longer solving a bottleneck the outer loop can’t get around.
This bizarro-evolution is much more analogous to the contemporary training process of AI, where we don’t have this inherent bilevel optimization structure (RL might muddy this, more below). The “outer optimizer” of SGD is receiving dense feedback on training behaviour, not some extremely sparse and lossy summary stat. If we wanted AI training to look more like real evolution structurally, we’d do something like: run an outer search over architectures/hyperparameters, for each step of this outer loop train a model from scratch, then feed back only the loss from this “inner training run” to the architecture search.
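As a rough sketch of what that evolution-shaped setup would look like (the tiny regression task, the hyperparameter ranges, and all names below are invented for illustration), note that the only thing crossing from the inner run to the outer search is a single final-loss scalar:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=256)

def inner_training_run(lr, steps):
    """Inner loop: ordinary dense-feedback SGD on a tiny regression problem."""
    w = np.zeros(8)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))  # only this scalar escapes to the outer loop

def outer_search(n_candidates=20):
    """Outer loop: sees nothing of the inner run except its final loss."""
    best = None
    for _ in range(n_candidates):
        candidate = {"lr": 10 ** rng.uniform(-4, -1), "steps": int(rng.integers(10, 200))}
        final_loss = inner_training_run(**candidate)
        if best is None or final_loss < best[1]:
            best = (candidate, final_loss)
    return best

if __name__ == "__main__":
    print(outer_search())
```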
This is roughly the contrast I want to draw with current AI:
In biology, pain is a pretty obvious solution given the outer-loop bottlenecks.
In most present AI training, those particular bottlenecks are absent or much weaker.
So we shouldn’t treat behaviour that looks like pain (in the sense of aversion/avoidance etc) as having the same evidential weight it has in animals.
How does this map onto current AI training?
Given all that, the natural next question is: where, if anywhere, do modern AI systems actually face constraints similar to those that pushed evolution towards pain-like proxy mechanisms? If the causal structure of the generative process matters, we should be on the lookout for cases where the optimizer is forced into a role more like evolution: operating with limited visibility, sparse credit, and a need to install an inner process that can handle the fine-grained learning.
Two places seem especially relevant:
long-horizon RL with sparse or delayed reward,
systems with learned inner learning rules (meta-learning, learned plasticity, neuromodulation, evolutionary-ish outer loops),
with in-context learning as a kinda strange and uncertain case that I’ll treat separately at the end.
Long-horizon RL
This is the easiest place to see evolution’s credit-assignment constraints reappear. If you train an agent on very long trajectories with only a single reward (or a few) at the end:
gradients become extremely weak or noisy,
early decisions barely register in the loss,
most behaviour is effectively hidden from the optimizer,
credit assignment becomes close to impossible.
This all looks structurally very similar to evolution’s situation. And when the outer loop can’t do fine-grained credit assignment, the incentive to develop denser proxy mechanisms increases. That could take many forms:
learned critics that produce dense pseudo-feedback,
intrinsic penalties or bonuses that stand in for the missing reward,
heuristic detectors for dangerous or irreversible regions of state-space,
internal signals that say “something has gone very wrong here; steer away fast.”
These mechanisms are not necessarily pain, obviously, but some may fill a related functional slot. They are reusable internal proxies that supply the inner learner with the granularity the outer loop lacks. Note that this is not a binary: we can think of this credit-assignment constraint as living on a spectrum between “dense reward for each step” and “you get a number at the end, good luck to you”. Moving along this spectrum, I think, changes the relative pressure on the outer loop to craft policies directly vs installing proxy circuitry.
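As a minimal illustration of the first item above (a learned critic supplying dense pseudo-feedback), here is a tabular actor-critic sketch on an invented chain environment, where the environment itself only pays out at the very end of a trajectory but the policy still gets a per-step learning signal from the critic’s TD errors. This is a sketch of the generic actor-critic idea, with all hyperparameters made up, not a claim about any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, GAMMA, ALPHA = 10, 0.99, 0.1

values = np.zeros(N_STATES + 1)          # learned critic V(s); terminal state stays 0
prefs = np.zeros((N_STATES, 2))          # policy preferences over {left, right}

def run_episode():
    s = 0
    while s < N_STATES:
        probs = np.exp(prefs[s]) / np.exp(prefs[s]).sum()
        a = rng.choice(2, p=probs)
        s_next = max(0, s - 1) if a == 0 else s + 1
        # sparse outer signal: reward only when the final state is reached
        r = 1.0 if s_next == N_STATES else 0.0
        # dense inner signal: the critic's TD error stands in for the missing reward
        td_error = r + GAMMA * values[s_next] - values[s]
        values[s] += ALPHA * td_error
        grad_log = -probs
        grad_log[a] += 1.0
        prefs[s] += ALPHA * td_error * grad_log   # policy learns from the proxy
        s = s_next

if __name__ == "__main__":
    for _ in range(500):
        run_episode()
    print("learned state values:", np.round(values[:N_STATES], 2))
```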
Learned plasticity, neuromodulation, and meta-learning
A second class of systems pushes even closer to the evolutionary picture: cases where part of the agent is explicitly trained to implement its own update rule during deployment.
This includes:
differentiable plasticity,
learned neuromodulation,
recurrent networks trained to “learn-to-learn”,
meta-learning setups where the outer loop optimizes learning rules rather than policies,
population-based training or evolutionary strategies over update rules.
What these seem to have in common is a genuine separation between two learning processes:
Outer loop: optimizes the learning rule (or plasticity structure) using coarse summaries of performance.
Inner loop: runs that learning rule in real time, making moment-to-moment adjustments based on local information the outer loop never directly supervises.
This is much closer to the evolutionary structure. The outer optimizer sets up the inner optimizer, but cannot micromanage its internal behaviour. The inner process sees more detailed structure and must decide when to update, how much to update, and which states deserve large changes.
Under those conditions, compact internal modulators—general-purpose scalar signals that gate plasticity or flag catastrophic states—start to make functional sense again. These are the closest modern analogue to the role pain plays in biology: a portable, information-compressing internal signal that improves within-lifetime learning when the outer process can’t do it directly.
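Here is a structural sketch of that kind of setup, loosely in the spirit of differentiable plasticity but heavily simplified, with all parameter shapes and numbers invented. The outer loop owns slow parameters, including per-connection plasticity coefficients and a small map producing a scalar modulation signal; the inner loop runs a Hebbian-style update at deployment time that the outer loop never supervises step by step.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

# Slow parameters: what the outer loop would optimise across many "lifetimes".
W_slow = rng.normal(scale=0.1, size=(D, D))   # fixed weights
alpha = rng.normal(scale=0.1, size=(D, D))    # per-connection plasticity coefficients
mod_w = rng.normal(scale=0.1, size=D)         # maps state -> scalar modulation signal

def inner_lifetime(inputs):
    """Inner loop: fast Hebbian traces updated online, gated by a learned scalar."""
    hebb = np.zeros((D, D))                   # fast weights, reset each lifetime
    outputs = []
    for x in inputs:
        h = np.tanh((W_slow + alpha * hebb) @ x)
        modulation = np.tanh(mod_w @ h)       # compact scalar "update this much" signal
        hebb = np.clip(hebb + modulation * np.outer(h, x), -1.0, 1.0)
        outputs.append(h)
    return outputs

# The outer loop (omitted here) would score whole lifetimes, e.g. via a task loss on
# the outputs, and adjust W_slow, alpha, and mod_w by gradient descent or evolution,
# without ever supervising the individual Hebbian updates above.

if __name__ == "__main__":
    demo_inputs = rng.normal(size=(5, D))
    print(inner_lifetime(demo_inputs)[-1])
```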
ICL as an awkward, somewhat confusing counterexample
In-context learning (ICL) is a bit weird. There’s now a large body of work (von Oswald et al. 2022) showing that transformers can implement surprisingly rich inner learning dynamics inside a single forward pass: things like unrolled gradient descent, least-squares updates, and fast-weight-style adjustments. On the surface, this looks like the outer loop (SGD) is shaping a model that contains a genuine inner learner.
What’s interesting (and kinda awkward for the story I’ve told so far) is that none of the evolutionary-style credit bottlenecks are present here. The outer optimiser is not blind, not starved for feedback, not forced to rely on inner proxy mechanisms. And yet, we still get something that looks like inner optimisation.
That superficially weakens the overall point: if inner learning can arise even when it isn’t needed as a proxy, then maybe the outer-loop constraints I’ve been emphasising aren’t as central as I’ve claimed.
I think the key difference here, though, is the nature of the tasks involved. In-context learning studies almost always train on distributions where the most efficient solution is to run a tiny optimiser in the forward pass. The model isn’t developing an inner learner because of a bottleneck; it’s developing one because that is literally the optimal computation for the task, and that computation is pretty simple. This is very different from evolution, where inner learning emerges because the outer loop can’t directly optimise behaviour, and the “optimal” algorithm to run is not simple.
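To give a flavour of how simple that computation can be, here is a toy numerical restatement of the kind of result in that literature (not the paper’s actual construction): for in-context linear regression, one gradient-descent step from zero weights gives the same query prediction as a single softmax-free attention operation with keys set to the context inputs and values set to the context targets.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 16
lr = 0.1

X = rng.normal(size=(n, d))                  # in-context inputs x_i
w_star = rng.normal(size=d)
y = X @ w_star                               # in-context targets y_i
x_query = rng.normal(size=d)

# One GD step on L(w) = 0.5 * sum_i (w.x_i - y_i)^2, starting from w = 0:
w_after_one_step = lr * (y @ X)              # = lr * sum_i y_i * x_i
pred_gd = w_after_one_step @ x_query

# Linear attention with keys = x_i, values = y_i, query = x_query (no softmax):
attention_scores = X @ x_query               # <x_i, x_query>
pred_attention = lr * (y @ attention_scores) # = lr * sum_i y_i * <x_i, x_query>

print(pred_gd, pred_attention)               # identical up to floating point
```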
An analogy (imperfect, but hopefully clarifying): in the earlier “bizarro-evolution” thought experiment, if instead of selecting for reproductive fitness we explicitly selected for pain-like behaviour, then yes, we’d also likely get organisms with built-in pain circuitry. But that’s because pain circuitry is now being developed as the solution to a relatively simple objective, rather than as a proxy. The ICL examples where we explicitly incentivise an inner optimiser probably don’t generalise to things like pain, insofar as pain-like mechanisms would be instrumental rather than what is explicitly trained for.
Summary of a Ramble
Evolution’s credit-assignment bottlenecks made pain-like proxies strongly incentivised.
Current AI training weakens those constraints, so similar behaviour is less incentivised to be routed through similar circuitry.
However, long-horizon RL and some potential changes to the training process (e.g. learned inner learning rules) reintroduce enough of the evolutionary constraints that general-purpose internal aversive signals become more instrumentally useful again.
Meanwhile, ICL shows that inner optimisers can appear even without those pressures — so “no inner optimisation because gradients exist” is strictly speaking false, although I think these tend to be mostly toy cases where the outer-loop is explicitly incentivised to learn a straightforward inner-optimiser.
We don’t currently have strong evidence that modern AI systems contain pain-like internal critics. But the structural incentives for them are real in specific training regimes, and plausibly for some present-day systems.
1. I’m taking a very simplified view of “pain behaviour” here, as a high-level avoidance/aversion response. In reality it’s obviously messier, e.g. making noise to communicate danger etc., but these behaviours seem more incidental to the specifics of the human training environment.
2. In reality the bottleneck isn’t literally just capacity but also things like mutation error rate, transcription throughput etc., but we’re abstracting this away for now.

