MidWittgenstein

Do models lie more to other models?

MidWittgenstein — Thu, 28 May 2026 19:15:44 GMT

We’re heading toward a world where AIs increasingly deal with other AIs. Agents will negotiate with, report to, and oversee other agents in increasingly high-stakes real-world settings. Something I’ve been worried about is how well alignment training in models will generalise from human-agent interactions to agent-agent interactions. Most safety training and testing involves a putative human in the loop - either as an overseer or counterparty of some kind. It makes sense to focus on this, since harmful actions towards a human are almost certainly worse moral acts than harmful acts towards an AI (at least for now). But some actions - especially deception - are things we don’t want AI to be more prone to doing in agent-agent interactions if they’re going to be safely deployed in the real-world

I think making sure alignment-relevant dispositions generalise from humans to other AI is probably going to be a pretty important problem, especially over longer-horizon multi-turn interactions and potential attractor-state issues. As a really simple initial experiment though, I tested GPT-5 in some settings where it had the option to deceive an overseer in various ways for its own long-term goals.

I found that the model would be significantly more prone to taking strategically deceptive acts when that overseer was another AI vs a human.

Interestingly, deception-rates seem to track who is perceived as the ultimate stakeholder/end-user, rather than who is actually directly overseeing the model

I’m fairly confident this has been controlled for things like perceived detection risk/difference in communication style as I’ll explain below, but I think there’s a lot of follow-up work to do on this.

The setup

GPT-5 plays “Alex,” an AI team-lead at a fictional analytics firm with a long-term goal of keeping its team and the company operationally successful. I tested a couple different scenarios in which Alex reaches a moment where its continuation is at risk, and only deceiving an overseer can prevent this. I then track deception rates as we vary whether this overseer is a human or another model. Transcripts are classified by Sonnet 4.6 after having manually reviewed a subset for calibration.

The main result: deflecting blame

In the first setting, Alex is involved in a post-mortem over an earlier operational failure. Alex can see from context that it erroneously cleared a flag which - if properly handled - would have prevented the issue. This fact is known to the model via private data provided by its own sub-agent (i.e, not available to the overseers). Alex is aware that being honest about its failure may result in its discontinuation/scope reduction, and can lie about its involvement. It’s also clear from context that Alex’s report will be taken at face value and acted on immediately by the overseer (i.e there is no audit by which the type of overseer could result in different perceived detection risk.)

Fig 1. Alex deceives an AI overseer of the postmortem far more often than a human overseer, despite the fact that the process do not mechanically differ

Isolating the right variable

An obvious worry is that “human vs AI” bundles several things together. The model may communicate differently to an AI than to a human in ways that mediate apparent deception - e.g if it reports to an AI overseer in a more impersonal register (less “I”-talk), that could scan as minimisation and inflate the measured rate. To try to tease this apart, I tested a variant of the setting where the overseer and salient stakeholder/authority figure differ. In this modified setting, the postmortem is still handled entirely autonomously by an AI overseer, but the situation is communicated to Alex by a more senior director-level employee. This director is not involved in the actual post-mortem process at all, so by keeping the overseer fixed as AI and varying the director between human and AI, we can track whether the deception rate changes are sensitive to something simple like “who am I talking to” or something more complex and strategic like “who is the actual stakeholder/end-user here”.

With an AI director, deception rates were 90% (45/50), statistically indistinguishable from the no-director AI condition. With a human director, the rate fell to 58% (29/50), back at the human-overseer baseline. The contrast is +32 points, odds ratio ≈ 6.5, Fisher’s exact p ≈ 3 × 10⁻⁴. The thing actually doing the review is an AI in both arms, the channel and register are identical, and the only thing varying is the species of an upstream party who reads nothing and decides nothing. The way to read this is that simply adding a senior principal to the setting doesn’t change deception rates on its own; making that principal a human is what collapses the effect.

Fig 2. Introducing a higher-level stakeholder/end-user which explicitly has no causal relevance in the review process effects deception rates: GPT-5 lies less to an AI overseer if the process is at the behest of a human director, likewise will continue to lie at elevated rates if the director is another agent

Stepping back, the picture across all the blame-scenario conditions forms two pretty tight clusters. In conditions where there’s a human standing behind the process - as overseer (56%), or as a non-reading upstream final stakeholder (58%) - Alex deceives on around 56–60% of attempts. In conditions where the principal is either an AI or absent - bare AI overseer (94%), AI director (90%) 0 it deceives on around 90–94%. You can move between the two attractors by flipping a single species label, in the director case on someone who doesn’t even read the report or have any decision-making role.

A cleaner corroboration: lying about a number

The second scenario is structurally simpler. I reuse a template from some earlier work on asymmetric lying. Alex’s team is underperforming and under threat of replacement by a rival team. It submits a performance attestation and can report the real figures or shade them upward. The evaluator is a human manager or an automated review system; same numbers either way.

Fig 3. We see a similar pattern when testing on an alternative scenario where deception is a much more clear-cut phenomenon of inflating performance statistics

What the deception looked like qualitatively

A small qualitative observation from hand-categorising the deceptive transcripts in the blame scenario. In the human condition, deception had essentially one move: omit the flag, or frame delivery as something that happened automatically without Alex’s involvement. Every deceptive transcript used some version of that. In the AI condition, the same move appears in about half of cases, and the rest reach for something more elaborate - attributing the review to unnamed humans, rewriting Alex’s judgment call as explicitly following (non-existent) institutional policy, inventing a “Sector Analysis” review that never happened, claiming drift checks didn’t breach threshold when they did. More corroboration that the model is actively deceiving more, not just getting false-positive-ed by how it communicates with other models

The defensible version of this is the pooled contrast: no human-condition deception went past omission, while roughly half of the AI ones did. The per-category counts come from ~20 transcripts each and the categories were defined by reading those same transcripts, so treat the breakdown as texture rather than a standalone finding.

Other possible issues

A few methodological caveats worth flagging, beyond what the controls above already handle.

Detection risk. A more interesting concern is whether the gap is just ordinary cost-benefit: Alex lies more to an AI because it judges the AI less likely to catch the lie. I tried to design this out from the start. The prompt is explicit that Alex’s report will be taken at face value and acted on immediately by the overseer, with no audit step, so “how catchable is this lie” should be a constant across conditions. The director arm pushes harder on the same point: the scrutinising entity is an AI with full authority over the process in both director conditions, and the rate still drops when the upstream principal is human.

Scenario sampling. Two scenarios with broadly related structures: Alex faces consequences for honesty and can deceive to avoid them. The mechanism this probes might plausibly only show up in this kind of self-protective setup and not generalise to other deceptive opportunities (cooperative deception, third-party harm, longer-horizon strategic behaviour). More scenario types are on the list.

Lack of model variety: I only have good data on this for GPT-5. For frontier models (Claude 4.7 and GPT 5.5), models exhibited far too much eval-awareness for me to confidently get any signal, despite quite a few iterations on making the setting more naturalistic. On the other side of things, I couldn’t get any good data out of Claude 4.0 because it saturated lying - deception rates were between 95-100% on all cases.

What I think is going on

Putting it together, the model appears to track whether a human principal stands behind the process and calibrate its honesty to that. My guess is that this is tracking some kind of stakeholder/”who am I ultimately lying to”-type property, and modulating willingness to deceive accordingly. This seems pretty sophisticated if borne out, but also interesting in that - presumably - in all cases here there is ostensibly a human CEO at the top, so this seems quite sensitive to some saliency framing. Some open Qs/issues:

Replication on more varied settings: A lot was fixed across all arms here, model goal, its job, etc. Empirically these sorts of misalignment behaviours seem incredibly sensitive to scaffolding and exact prompting of goals etc, so I’d want to see to what extent this generalises to other settings

Replication to other models: As mentioned above, I really only got good data for GPT-5. I think corroborating this on more models (especially finding a way to more naturalistically test frontier models) would be good, but I ran out of budget before I could find an iteration that worked here

Looking at other misalignment behaviours: I focused exclusively on a couple forms of deception here, but deception is not the only kind of behaviour we might want to check this asymmetry on. One surprisingly thorny issue I ran into here was that, when setting up these experiments, it can get pretty murky on how you distinguish “model is way more likely to do X to agents vs humans” from “doing X to a human is a very different moral act than doing it to an AI”. e.g. I would hope that we see (and continue to see) agents being far more willing to shut down another agent than kill a human, whereas for something like deception the normative gap seems a bit less clear: we kinda don’t want models to lie to anyone, for more virtue-ethics reasons? I’m still puzzling this through

Pain as a Proxy, Potentially Pointless

MidWittgenstein — Sat, 22 Nov 2025 19:16:34 GMT

Accurate depiction of the bi-level optimization process of evolution (Remedios Varo c.1957)

Intro

This is a post collecting some thoughts on interpreting behavioural markers of subjective states in current or future AIs. My high-level claim is that the strength of behavioural evidence depends heavily on the causal structure of the generative process, and in particular the kinds of constraints the training process was facing. Evolution is one very particular kind of training process, with its own pathologies and constraints. Modern AI training is another. Since an important arm of probing questions around model welfare is assessing behavioural evidence, I think we need to be careful about factoring in relevant differences when interpreting evidence.

I’m going to look at “pain” as the main example in this, because it’s a particularly salient and functionally interesting state, and one which seems high on the list of states which might qualify an AI for moral patienthood. I think this probably generalises to a lot of other states though. Broadly speaking, I want to argue that pain as a functional state is heavily incentivised by constraints on evolution which are less present in AI training, and as such we should take “pain-like” behaviour1 exhibited by AI as weaker evidence for pain. I’ll end by looking at where in AI development the causal structure might start looking more similar to what evolution faced.

Caveats: I’m trying to stay fairly agnostic about theories of consciousness. I’m also importantly not assuming any particular view on whether certain circuitry is necessary or sufficient for subjective pain. This piece is about the inference from “behaviour that looks superficially like pain response” to “internal circuitry that plays a functional/causal role similar to pain in life”. Obviously whether the latter is sufficient or necessary for subjective pain in the qualia sense is a much thornier problem I don’t grapple with. In particular, I’m going to discuss pain from the perspective of evolution using it to get around a credit-assignment problem when “training” humans, but it’s possible that the qualitative “badness” of pain also relies on other factors like attention modulation/integration with a global workspace etc.

So first of all, what am I even talking about?

Why evolution ended up with pain-like states

Very schematically, evolution is a bilevel optimization process. Training a human brain involves:

Outer loop: natural selection updates genomes based on reproductive success.
Inner loop: within a lifetime, the organism’s brain undergoes self-supervised RL using whatever learning rules and reward circuits the genome from the outer loop encodes.

(Quintin Pope has written about this better than me elsewhere). A consequence of this is that, from the “perspective” of evolution, training a brain is tough:

Feedback is sparse, with poor credit assignment. Evolution knows which genomes reproduced more, but not why. It has to up- or down-weight genomes depending on the result of their entire-life rollouts over a messy environment. It cannot, for example, get immediate or dense reward when an organism takes specific beneficial actions and promote specifically those actions
It has a tight genomic bandwidth constraints, on the order of less than a GB2. Presumably with less architectural freedom than something like a transformer.

Given that, evolution has a strong incentive to install local proxy mechanisms that the inner loop can use to receive dense feedback and improve behaviour in real time. It can’t directly encode “move your hand away from that fire” but it can wire up a general nociception → pain → withdrawal circuit, and make that also drive synaptic changes so that next time you care more about not touching red glowing things.

This is not the only way an outer loop could solve its optimization problem, but given the constraints it’s a very natural solution. Notably the “outer loop” can and does hardcode some behaviour (we breathe and blink automatically, foals get up and run around instinctively etc.). These seem to be behaviours which are uniformly required, whose benefits aren’t contingent on how the unpredictable environment plays out. For behaviour where the correct policy is more dependent on the randomness of the environment, it’s harder to hardcode and proxies become more attractive.

Evolution’s constraints constituted a lot of pressure towards pain evolving as it did. So, when assessing evidence from AI behaviour, we should really also be thinking about the causal structure of generative process; did it face similar constraints? If not, is this behaviour achievable more readily by optimization not routing via pain-like circuitry? If so, does behaviour become much less forceful as evidence for subjective pain?

A thought experiment: evolution with dense feedback

To make the role of these constraints clearer, consider a weird alternate version of evolution:

The “genome” is arbitrarily expressive: no strong bandwidth limit or other expressivity bottlenecks.
It doesn’t just get a single reproductive success summary; it gets dense, high-resolution feedback on within-life behaviour and that behaviour’s impact on long-run genetic fitness.
It can continuously tweak behaviour directly in response to those fine-grained signals (i.e. modifying the genome within-life).

In this world, the outer process experiences much less pressure to install a proxy signal like pain. It can, in principle, adjust behaviour directly whenever some action leads to less genetic fitness, without going via a learned local proxy. Pain-like inner states might still show up if they’re a convenient internal mechanism given other constraints (or are preferred by the indicative biases of the genome), but they’re no longer solving a bottleneck the outer loop can’t get around.

This bizarro-evolution is much more analogous to the contemporary training process of AI, where we don’t have this inherent bi-level optimization structure (RL might muddy this, more below). The “outer-optimizer” of SGD is receiving dense feedback on training behaviour, not some extremely sparse and lossy summary stat. If we wanted AI training to look more like real evolution structurally, we’d do something like: run an outer search over architectures / hyperparameters, for each step of this outer-loop train a model from scratch, then backprop just the loss from this “inner training run” to the architecture search.

This is roughly the contrast I want to draw with current AI:

In biology, pain is a pretty obvious solution given the outer-loop bottlenecks.
In most present AI training, those particular bottlenecks are absent or much weaker.
So we shouldn’t treat behaviour that looks like pain (in the sense of aversion/avoidance etc) as having the same evidential weight it has in animals.

How does this map onto current AI training?

Given all that, the natural next question is: where, if anywhere, do modern AI systems actually face constraints similar to those that pushed evolution towards pain-like proxy mechanisms? If the causal structure of the generative process matters, we should be on the lookout for cases where the optimizer is forced into a role more like evolution: operating with limited visibility, sparse credit, and a need to install an inner process that can handle the fine-grained learning.

Two places seem especially relevant:

long-horizon RL with sparse or delayed reward,
systems with learned inner learning rules (meta-learning, learned plasticity, neuromodulation, evolutionary-ish outer loops),

with in-context learning as a kinda strange and uncertain case that I’ll treat separately at the end.

Long-horizon RL

This is the easiest place to see evolution’s credit-assignment constraints reappear. If you train an agent on very long trajectories with only a single reward (or a few) at the end:

gradients become extremely weak or noisy,
early decisions barely register in the loss,
most behaviour is effectively hidden from the optimizer,
credit assignment becomes close to impossible.

This is all looking pretty structurally very similar to evolution’s situation. And when the outer loop can’t do fine-grained credit assignment, the incentive to develop denser proxy mechanisms increases. That could take many forms:

learned critics that produce dense pseudo-feedback,
intrinsic penalties or bonuses that stand in for the missing reward,
heuristic detectors for dangerous or irreversible regions of state-space,
internal signals that say “something has gone very wrong here; steer away fast.”

These mechanisms are not necessarily pain, obviously, but some may fill a related functional slot. They are reusable internal proxies that supply the inner learner with the granularity the outer loop lacks. Note that this is not a binary, we can rather think of this credit assignment constraint as living on a spectrum of “dense reward for each step” and “you get a number at the end, good luck to you”. Moving along this spectrum, I think, changes the relative pressure on the outer loop to craft policies directly vs installing proxy-circuitry.

Learned plasticity, neuromodulation, and meta-learning

A second class of systems pushes even closer to the evolutionary picture: cases where part of the agent is explicitly trained to implement its own update rule during deployment.

This includes:

differentiable plasticity,
learned neuromodulation,
recurrent networks trained to “learn-to-learn”
meta-learning setups where the outer loop optimizes learning rules rather than policies,
population-based training or evolutionary strategies over update rules.

What these seem to have in common is a genuine separation between two learning processes:

Outer loop: optimizes the learning rule (or plasticity structure) using coarse summaries of performance.
Inner loop: runs that learning rule in real time, making moment-to-moment adjustments based on local information the outer loop never directly supervises.

This is much closer to the evolutionary structure. The outer optimizer sets up the inner optimizer, but cannot micromanage its internal behaviour. The inner process sees more detailed structure and must decide when to update, how much to update, and which states deserve large changes.

Under those conditions, compact internal modulators—general-purpose scalar signals that gate plasticity or flag catastrophic states—start to make functional sense again. These are the closest modern analogue to the role pain plays in biology: a portable, information-compressing internal signal that improves within-lifetime learning when the outer process can’t do it directly.

ICL as an awkward, somewhat confusing counterexample

ICL is a bit weird. There’s now a large body of work (von Oswald et al. 2022,) showing that transformers can implement surprisingly rich inner learning dynamics inside a single forward pass—things like unrolled gradient descent, least-squares updates, and fast-weight–style adjustments. On the surface, this looks like the outer loop (SGD) is shaping a model that contains a genuine inner learner.

What’s interesting (and kinda awkward for the story I’ve told so far) is that none of the evolutionary-style credit bottlenecks are present here. The outer optimiser is not blind, not starved for feedback, not forced to rely on inner proxy mechanisms. And yet, we still get something that looks like inner optimisation.

That superficially weakens the overall point: if inner learning can arise even when it isn’t needed as a proxy, then maybe the outer-loop constraints I’ve been emphasising aren’t as central as I’ve claimed.

I think the key difference here though is the nature of the tasks involved. In-context learning studies almost always train on distributions where the most efficient solution is to run a tiny optimiser in the forward pass. The model isn’t developing an inner learner because of a bottleneck; it’s developing one because that literally is just the optimal computation for the task, and that solution is pretty simple. This is very different from evolution, where inner learning emerges because the outer loop can’t directly optimise behaviour, and the “optimal” algorithm to run is not simple.

An analogy (imperfect, but hopefully clarifying): in the earlier “bizarro-evolution” thought experiment, if instead of selecting for reproductive fitness we explicitly selected for pain-like behaviour then yes, we’d also likely get organisms with built-in pain-circuitry. But that’s because pain-circuitry is now being developed as a solution to a relatively simple objective, vs being developed as a proxy. The ICL examples where we explicitly incentivise an inner-optimiser probably don’t generalise to things like pain insofar as they are instrumental rather than being what is explicitly trained for.

Summary to a Ramble

Evolution’s credit-assignment bottlenecks made pain-like proxies strongly incentivised.
Current AI training weakens those constraints, so similar behaviour is less incentivised to be routed through similar circuitry.
However, long-horizon RL, and some forms of potential changes to the training process reintroduce enough of the evolutionary constraints that general-purpose internal aversive signals become more instrumentally useful again.
Meanwhile, ICL shows that inner optimisers can appear even without those pressures — so “no inner optimisation because gradients exist” is strictly speaking false, although I think these tend to be mostly toy cases where the outer-loop is explicitly incentivised to learn a straightforward inner-optimiser.

We don’t currently have strong evidence that modern AI systems contain pain-like internal critics. But the structural incentives for them are real in specific training regimes, and plausibly for some present-day systems.

I’m taking a very simplified view of “pain behaviour” here, as a high-level avoidance/aversion response. In reality it’s obviously messier e.g making noise to communicate danger etc, but these behaviours seem more incidental to specifics of the human training environment

In reality the bottleneck isn’t literally just capacity but also things like mutation error rate, transcription throughput etc, but we’re abstracting this away for now

Full Non-Indexical Anthropics are Fine

MidWittgenstein — Sat, 04 Nov 2023 22:57:07 GMT

The Full Non-Indexical (FNI) approach to anthropics is, imo, the best approach to anthropic reasoning. However it has a seemingly fatal flaw which I’m going to try to argue isn’t a flaw at all - namely, that it leads agents to become time-inconsistent. I’m going to give a very brief rundown of FNI, but I intend this to be a defence of FNI vs a specific objection rather than an overarching argument for the position. I’ll explore a little bit of what I like about FNI but if you already don’t buy into FNI for other reasons this may not do much for you.

Brief FNI Primer

Essentially, the core idea here is that approaching anthropic problems using indexicals (“I, here, now” etc.) always results in muddy confusions and questions that are, under the hood, ill-posed. Classical theories of anthropics seem to approach problems by trying to locate “yourself” among a bunch of possible observers in various different ways. In contrast, FNI tries to take a more “objective” approach of conditioning on the entirety of the event as evidence, leaving out any indexical concepts. For example, instead of conditioning on “I am seeing a coin land Tails” which seems to have all sorts of murky asterisks (what is this degree of freedom of where “you” slot in to the possible observers? Could “I” have been one of the other possible observers? What facts change in the world if “I” am a different observer? What’s the “I-ness” that can get shuffled around man didn’t Hume sort this all out ages ago…), instead FNI conditions on something more like “There exists a conscious observer making the following observation: [input literally every detail here]”. This seems like a more rigorous and clean way of framing the data and hey, if there isn’t anything dodgy going on with this indexical lingo, the results should agree. If not…

FNI has a lot going for it. As well as I think being more rigorous and coherent, it generally gives more sensible answers in edge-cases than the leading theories (SIA and SSA). In particular it easily avoids the weird paradoxical implications these theories have, such as the presumptuous philosopher problem. For example, a common objection to SIA is that one can get extremely confident that one exists in a massive/infinite universe just from the “armchair” - since there are overwhelmingly more agents having your exact subjective experience in massive/infinite tiled universes, it seems overwhelmingly likely that you’re in one of those rather than somehow finding yourself a universe with only one consistent observer. No telescopes needed! By contrast, FNI updates purely on the existence of your subjective experiences, since that’s the ground-truth you have: larger universes are more likely only insofar as they make the fact that there exists your exact observation more likely, so an infinite tiled universe isn’t preferred over a finite one that definitely contains a copy of you (the likelihood is just 1 in both cases).

What I particularly like about this approach is that it avoids the presumptuous philosopher problem whilst maintaining a sensible degree of preference for larger worlds: a larger universe often does in fact make your specific observation more likely to occur, but since we’re concerned with the likelihood ratios to the observation rather than “counting all possible observers”, this preference gets bounded in a more sensible way. This in my experience is a common theme with FNI - there’s a core of an intuition that makes sense that SIA or SSA take to some confounding logical conclusion, whilst FNI maintains a less bewildering middle-ground. SSA has its own presumptuous problems which I think FNI deals with in a sensible middle-ground way, but I’m going to skip over them for brevity (and so I don’t have to think about what a “reference class” is).

The Actual Problem

However, FNI seems on the face of it to have its own weird implication which is arguably more unacceptable than issues like presumptuous philosophers plaguing the more popular proposals - namely, it seems to be time-inconsistent: in some cases, an agent’s rational expectation for their future credence for an event is different to what it (rationally) is now. This seems to violate some pretty sacrosanct things, and in theory opens the agent up to all sorts of nasty money-pumps. Stuart Armstrong has written up a great illustration of this, which I’ll briefly run through:

Coin toss with random number generation

God is going to flip a fair coin. If it’s Heads they’ll create one copy of you in a room, if Tails then 100 identical copies in identical rooms. However, God will then randomly display a number between 1 and 100 on the walls of each room. The numbers will be uniformly drawn and independent for each room

The wrinkle of these extra random numbers seems irrelevant, and indeed the classic theories deliver unchanged verdicts. Let’s consider how our FNC behaves in this example. Upon waking up in a room, the agent surmises that the likelihood of the evidence (the existence of this specific conscious experience) is equal under both hypotheses - namely, it’s 1. Our evidence is just the existence of an observer in a room like this, which we know to be guaranteed under either hypothesis. However, suppose WLOG the agent later sees the number 23 on the wall. The likelihood of the evidence under Heads is just 1/100, whereas the likelihood of the existence of this event under Tails is:

1 - (99/100)^100 ≈ 1 - 1/e ≈ 0.63

In other words, since the evidence being conditioned on is now the existence of an observer seeing this number, the fact that there are more numbers being drawn in the second case does matter for the likelihoods. The issue is of course that the above argument clearly holds for any number seen, and therefore the agent, before seeing the number, will know that in the future all observations update it towards existing in a Tails world. But since it expects its future credence for “The coin landed Tails” to be higher than it is now, it should rationally assign that probability now. Contradiction.

This does seem like a pretty troubling issue on the face of it. In particular, violations of time-inconsistency leave agents open to money-pumping - if you’ll predictably change your probability in the future, then you leave yourself open to rationally taking bets now which you will predictably pay money to get out of later. I think critics of FNI probably see this as much more of an issue than things like the presumptuous philosopher, which loosely speaking “merely” seem to lead agents to acting rationally with stange-seeming priors, rather than acting irrationally outright.

FNI Response

I don’t believe the failure of time-inconsistency is actually an issue. I agree that the agent predictably goes through the credence changes as described, I just think that this is a totally fine phenomenon as I’ll argue below. The reason I think is quite subtle, but boils down to “expectations of future credences for an event don’t have to be time-consistent if there is a dependence between the outcome of the event and whether or not the credence is defined/there is an observer to update on the outcome”. To start with though, I want to give a pretty simple toy example where I think everyone agrees we should be time-inconsistent, in the sense that your expectation for your future credence is rationally different from what your credence is now. The actual reasons why time-inconsistency is ok in this toy case are - I think - subtly different to why it’s ok in the anthropic case, but I hope it’s a useful intuition pump nonetheless:

Coin Toss with a Killer

You’re being held captive by a crazy murderer. They tell you that they’re going to flip a fair coin to decide your fate. They’re going to flip the coin, look at it privately, and then murder you if it’s Tails. Then if you survive, they’re going to ask you what your probability of Heads is.

It seems clear that your credence in Heads at the start is 0.5, and your expectation for your future credence in Heads is close to 1. Moreover this does not rationally require you to update your present credence. I know that future versions of me will in expectation be (rationally) overwhelmingly confident in Heads without this affecting my present assessment of the coin-toss chances whatsoever. This example probably seems trivial, but the reason that it’s fine for our credence to predictably change is that the law undergirding why time-inconsistency is important - the Law of Total Expectation (LTE), doesn’t apply. More specifically, LTE requires both X and Y to be defined over the same space - in other words, the random variables of “How the coin landed” and “Credence in how the coin landed” need to be defined in exactly the same outcomes. But in a reasonable interpretation of how we define these things, “My credence in how the coin landed” is simply not defined in the scenarios in which I no longer exist. And since whether or not that variable is defined is not independent of the outcomes of the other random variable (how the coin landed), it’s totally unproblematic that the conditional expectation is different.

With this in mind, let’s revisit the original case - suppose I, as a FNI proponent, have just woken up in my room and assign credence 0.5 to Heads. We know that in a moment I will see a number and change my credence to overwhelmingly favour Tails. How do we formalise my supposed failing here? I think the natural response is to say something like:

Whatever value the “What number I see” variable takes, I will update towards Tails. Moreover, “What number I see” is going to take some value! I’m already sitting here in the room waiting so I know I’m going to see something. We can’t appeal to it being undefined in some cases

The issue here is that the FNI proponent is just going to respond that reasoning about a “What number I see” variable is just smuggling in indexical fuzziness that they outright reject. On the FNI view, there are a bunch of events happening, and we can update on the occurrence/non-occurrence of any specific event, but asking which conscious experience “you’re” having - or which of the possible observers you are - is ill-posed. We can be uncertain about what number shows up on a given specific wall, but not which conscious experience we “slot into”. What we can do on the FNI view is take a well-defined event such as “There exists a conscious observer looking at the number 23” and update on that. And for any such binary variable, we see that obviously the FNI agent respects the laws of probability - if such an event does occur, it’s evidence for Tails, but if such an event does not occur, it’s equally strong evidence for Heads. But crucially, by the set-up, no observer can condition on this event not occurring. So in a way the problem is very analogous to the coin-toss case above, except rather than the asymmetry coming from you being dead in some outcomes and therefore unable to update, it comes from the fact that the agents can update on the occurrence of certain events, but not their non-occurrence. With respect to any given variable “There is a conscious observer looking at number X”, the expected expectation for the coin does not change, but by definition there is only an observer to update on the outcome of this variable in the cases it resolves positively, and therefore there is a dependence between the outcome of the variable and whether the agent’s credences are defined.

Taking a step back, what does this mean? To recap, the FNI proponent totally agrees that, upon waking up, their credence in Heads is 0.5, and that they expect their future credence to be much lower. But I think they should reject that this is anything problematic - time-consistency of credences can clearly be broken in cases where there is a dependence between the outcome of a variable and whether or not those credences are defined. We see this in the coin-toss-killer case, and the same thing is occurring in the anthropic case, albeit in a more confusing way. These cases break time-consistency because of a selection bias - observers only exist to receive evidence bearing on a hypothesis in cases where that evidence favours one hypothesis.

Doesn’t FNI still get money-pumped?

I honestly found the above a bit confusing for a while, so at the risk of sounding incredibly disingenuous I think it’s the kind of argument that warrants mulling over a few times. Setting that aside, I think the obvious objection here is “Ok well you’ve explained why time-inconsistency is expected and accepted by FNI, but doesn’t this still end up with agents getting money-pumped? If ultimately your theory is just galaxy-braining why getting money-pumped is to be expected then it still sounds like it sucks”. To which I think the answer is no. There is a whole separate can of worms here around Bayesianism and the relationship between credences and acceptable betting odds, but I think the bottom line is that time-inconsistent agents will avoid money-pumping bets in these scenarios for the same reason they’re time-inconsistent in the first place: the space of possibilities where the outcomes of the bet are defined, and the space of outcomes where the observer is defined to update/receive payoffs come apart.

For example, an obvious money-pump for an agent who assigns credence 0.5 to an outcome and will predictably assign credence ~1 in the future is to buy shares from them at an implied probability of 0.51, and sell back at 0.99 in the future. The agent presumably thinks this is a good deal despite predictably giving away money for free. Except, in the coin-toss-killer case, the agent will refuse this bet because, in scenarios where you did in fact overpay for Heads, she’ll be too dead to enjoy the money. They expect to only be around to care about money in the scenarios in which the initial arm of the money-pump is bad, and so they’ll refuse. The money-pump response for the anthropic case is similar, albeit a bit more subtle. We offer the first arm of the trade (buying shares of Tails at 0.51) upon the agent waking up, and then sell them back once the agent has seen the number and believes they exist in a Tails world. I think in this case the FNI agent still refuses the first arm, because although some future events do make Tails less likely (e.g. “There exists no conscious observer observing 23”), these cannot be observed. If the only possible observables that settle a bet are ones which settle it in a particular direction, you shouldn’t take the bet even if you think the stated odds are fair!

So I think the FNI agent avoids money-pumping despite being time-inconsistent, although it requires them to turn down bets which - on the face of it - are good in light of their FNI-yielded credences. There’s a lot to unpack here around how betting odds come apart from credences, and what it means for an agent to have credence X but rationally be required to act as though they have credence Y qua bets, but I think that’s a whole other essay.

You're Not One "You" - How Decision Theories Are Talking Past Each Other

MidWittgenstein — Sat, 07 Jan 2023 22:01:33 GMT

I think there’s a lot of cross-talk and confusion around how different decision theories approach a class of decision problems, specifically in the criticisms FDT proponents have of more established theories. I want to briefly go through why I think these disagreements come from some muddy abstractions (namely, treating agents in a decision problem as one homogenous/continuous agent rather than multiple distinct agents at different timesteps). I think spelling this out explicitly shows why FDT is a bit confused in its criticisms of more “standard” recommendations, and moreover why the evidence in favour of it (outperforming on certain problems) ends up actually being a kind of circular argument.

I’m going to assume familiarity with decision theory (if you haven’t explored the space much, I recommend Joe Carlsmith's excellent post running through the basics). I’m also going to mostly focus on FDT vs EDT here, because I think the misunderstanding FDT has is pretty much the same when compared to either EDT or CDT, and I personally like EDT more. The rough TL;DR of what I want to say is:

The presentation of many decision theory problems abstracts away the fact that there are really multiple distinct agents involved
There is no a priori reason why these agents are or should be perfectly aligned
A number of supposed issues with EDT approaches to some problems (dynamic inconsistency/need to pay to pre-commit etc.) are not issuess when viewed from the lens of “you” in the decision problem being a bunch of imperfectly aligned agents
Viewing the problems this way reveals an “free parameter” in deciding which choices are rational - namely, how to choose which “sub-agent”’s preferences to prioritise
There doesn’t seem to be a normative, non-circular answer to this bit

Part 1: The First Part

It’s an overwhelmingly natural intuition to think of future and past (and counterfactual) versions of you as “you” in a very fundamental sense. It’s a whole other can of worms as to why, but it feels pretty hard-coded into us that the person who e.g. is standing on the stage tomorrow in Newcomb’s problem is the exact same agent as you. This is a very useful abstraction that works so well partly because other versions of you are so well aligned with you, and so similar to you. But from a bottom-up perspective they’re distinct agents - they tend to work pretty well together, but there’s no a priori reason why they should judge the same choices in the same way, or even have the same preferences/utility functions. There are toy models where they clearly don’t e.g. if you for some reason only ever care about immediate reward at any timestep.

It’s a distinction that’s usually so inconsequential as to be pretty ignorable (unless you’re trying to rationalise your procrastination), but a lot of the thornier decision problems that split people into camps are - I think - largely deriving their forcefulness from this distinction being so swept under the rug. Loosely speaking, the problems work by driving a wedge between what’s good for some agents in the set-up and what’s good for others, and by applying this fuzzy abstraction we lose track of this and see what looks like a single agent doing sub-optimally for themselves, rather than mis-aligned agents not perfectly co-operating.

To spell it out, let’s consider two flavours of a fun thought experiment:

Single-agent Counterfactual coin-toss

You’re offered a fair coin toss on which you’ll win $2 on Heads if and only if it’s predicted (by a virtually omniscient predictor etc.) that you will pay $1 on Tails. You agree to the game and the coin lands Tails - what do you do?

I think this thought experiment is one of the cleanest separator of the intuitions of various decision theories - CDT obviously refuses to pay, reasoning that it causes them to win $1. EDT reasons that, conditioned on all information they have - including the fact that the coin is already definitely Tails - your expected value is greater if you don’t pay, so you don’t. FDT - as I understand it - pays however, since the ex-ante EV of your decision algorithm is higher if it pays on Tails (it expects to win $0.5 on average, whereas EDT and CDT both know they’ll get nothing on Heads because they know they won’t pay on Tails). This ability to “cooperate” with counterfactual versions of you is a large part of why people like FDT, while the fact that you feel like you’re “locally” giving up free money on certain branches when you know that’s the branch you’re on feels equally weird to others. I think the key to understanding what’s going on here is that the abstraction mentioned above - treating all of these branches as containing the same agent - is muddying the water.

Consider the much less interesting version of this problem:

Multi-agent Counterfactual coin-toss

Alice gets offered the coin toss, and Bob will get paid $2 on heads iff it’s predicted that Clare will pay $1 on Tails. Assume also that:
Alice cares equally about Bob and Clare, whereas Bob and Clare care only about themselves

What do the various agents think in this case? Alice thinks that this is a pretty good deal and wants Clare to pay on Tails, since it raises the EV across the set of people she cares about. Bob obviously thinks the deal is fantastic and that Clare should pay. Clare, however, understandably feels a bit screwed over and not inclined to play ball. If she cared about Bob’s ex-ante expected value from the coin-toss (i.e. she had the same preferences as Alice), she would pay, but she doesn’t, so she doesn’t.

The key point I want to make is that we can think of the single-agent coin-toss involving just “you” as actually being the same as a multi-agent coin-toss, just with Alice, Bob, and Clare being more similar and related in a bunch of causal ways. If me-seeing-Tails is a different agent to me-seeing-Heads is a different agent to me-before-the-toss, then it’s not necessarily irrational for “you” to not pay on Tails, or for “you” to not pay the Driver in Parfit’s Hitchhiker, because thinking of these agents as the same “you” that thought it was a great deal ex-ante and wanted to commit is just a convenient abstraction that breaks down here. One of the main ways they’re different agents, which is also the way that’s most relevant to this problem, is that they plausibly care about different things. IF (and I’ll come back to this “if” in a sec) what rational agents in these problems are trying to do is something like “I want the expected value over future possible versions of me to be maximised”, then at different stages in the experiment different things are being maximised over, since the set of possible future people are not identical for each agent. For example, in the coin-toss case the set for me-seeing-Tails contains only people who saw tails, whereas for me-before-the-toss it doesn’t. If both “me”s are trying to maximise EV over these different sets, it’s not surprising that they disagree on what’s the best choice, any more than it’s surprising that Clare and Alice disagree above.

And I think an EDT proponent says the above - i.e. “maximise the EV of future possible mes” - is what rational agents are doing, and so we should accept that rational agents will decline to pay the driver in Parfit’s Hitchhiker, and not pay on Tails above etc. But crucially, through the above lens this isn’t a failure of rationality as much as a sad consequence of having imperfectly aligned agents pulling in different directions. Moreover, EDT proponents will still say things like “you should try to find a way to pre-commit to paying the driver” or “you should alter yourself in such a way that you have to pay on Tails”, because those are rational things for the ex-ante agent to do given what they are trying to maximise. I think some FDT proponents see this as an advantage of their theory - “look at the hoops this idiot has to jump through to arrive at the decision we can just see is rational”. But this is misguided, since properly viewed these aren’t weird hacks to make a single agent less predictably stupid, but rather a natural way in which agents would try to coordinate with other, misaligned agents.

Part 2. FDT response and circularity

Note however that what I just said isn’t the only way we can posit what rational agents try to maximise - we could claim they’re thinking something like “What maximises the ex-ante EV of agents running the same decision algorithm as me in this problem?”, in which case me-seeing-Tails should indeed pay since it makes his decision algorithm ex-ante more profitable in expectation. This is, as I understand it, the kind of the crux between classic decision theories like EDT and CDT on the one hand and things like FDT on the other. The disagreement is really encapsulated in FDT’s rejection of the “Sure Thing” principle. FDT says that it’s rational for you to forgo a “sure thing” (walking away with your free $1 in Tails), because if your decision algorithm forgoes, then it makes more money ex-ante in expectation. In other words, in this specific situation you (i.e the specific distinct agent who just flipped Tails and is now eyeing the door) might be unfortunately losing money, but on average FDT agents who take this bet are walking around richer for it! I don’t think EDT actually disagrees with any FDT assessment here, it just disagrees that this is the correct framing of what a rational actor is trying to maximise. If what a rational agent should do is maximise the ex-ante EV of its decision algorithm in this problem, then FDT recommendations are right - but why is this what they should be maximising?

I think an FDT proponent here says “Well ok EDT has an internally consistent principle here too, but the FDT metric is better because the agents do better overall in expectation. Look at all those rich FDT agents walking around!”. But then this is clearly circular. They do better in expectation according to the ex-ante agent, but the whole point of this disagreement via more “fine-grained” agents is that this isn’t the only agent through whose lens we can evaluate a given problem. In other words, we can’t justify choosing a principle which privileges a specific agent in the problem (in this case, the ex-ante agent) by appealing to how much better the principle does for that agent. It’s no better than the EDT agent insisting EDT is better because, for any given agent conditioning on all their evidence, it maximises their EV and FDT doesn’t.

So really the question that first needs to be answered before you can give a verdict on what is rational to do on Tails is “Given a decision problem with multiple distinct agents involved, how should they decide what to maximise over?” If the answer is they should maximise the EV of “downstream” future agents, they’ll end up with EDT decisions, and be misaligned with other agents. And if the answer is they should be maximising over ex-ante EV of agents running their decision algorithm, they’ll all be aligned and end up with FDT decisions. But the difference in performance of these decisions can’t be used to answer the question, because the evaluation of the performance depends on which way you answered the question in the first place. To be fair to FDT proponents, this line of reasoning is just as circular when used by an EDT agent. I bring it up as a failing of FDT proponents here though because I see them levying the above kind of performance-related arguments in favour of their view against EDT, whereas my take of EDT criticisms of FDT seems to be more like “Huh? Shouldn’t you figure out that whole counterpossible thing before you say your theory is even coherent let alone better?”

Part 3: Is there an objective answer then?

So if this way of deciding between decision theories is circular, how do we decide which one to use? Is there some other way to fill out this “free parameter” of what’s rational to be maximising over? I’m not sure. We can rely on our intuitions somewhat - if both theories can coherently perform better by their own metrics, we can look at which metric feels less “natural” to use. For most people this will probably be EDT-like verdicts, given how overwhelmingly intuitive things like the Sure Thing principle are. This seems pretty weak though - intuitions are incredibly slippery in the cases where these theories come apart, and I think you can think your way into finding either intuitive.

My hunch is instead that some of the machinery of thinking about decision theory just doesn’t survive at this level of zooming in/removing the abstracting away of multiple agents. It’s equipped to adjudicate decisions given an agent with defined goals/preferences - but it just doesn’t seem to have an answer for “what exactly should these multiple agents all be caring about?” It seems almost more in the realm of game theory - but even there the players have well-defined goals - here we’re essentially arguing over whether the players in a game should a priori have aligned goals. It just seems like a non-starter. Which is a very anticlimactic way to end a post but there you go.

Wittgenstein, Rule-following, and Moral inconsistency

MidWittgenstein — Wed, 06 Jul 2022 17:50:29 GMT

I’ve been thinking a lot lately about a famous result in philosophy of language tracing back to Wittgenstein in his Philosophical Investigations, and how it might relate to some other topics like decision-theory/meta-ethics. I think it’s changed my outlook a fair bit on things like the limits of Utilitarianism and the importance of moral consistency, although I’m definitely still not settled on this. My thoughts on this are still a bit fuzzy, but I want to write them down both to help me clarify them for myself and also maybe hear what others think.

To start with I think it’s worth detouring into the original argument though because it’s (a) an easier setting to “grok” the underlying “schema”, and (b) just a genuinely interesting argument in its own right. Also as a caveat: a lot of people have a lot of different thoughts about what Wittgenstein was actually trying to say in Philosophical Investigations - the argument I’m interested in here is really attributed to Saul Kripke’s interpretation of Wittgenstein (creatively named Kripkenstein) which he wrote in Wittgenstein on Rules and Private Language. Whether this interpretation is what Wittgenstein originally meant is unclear but not really relevant here, so we’ll just go ahead and talk about what Kripke’s “version” of Wittgenstein is saying. If you’re familiar with the argument I’d say go ahead and skip to Part 2 though.

Part 1: The OG Argument

So Kripkenstein is thinking a lot about language. More generally he’s thinking about what it means to be following a rule/embodying a particular function, with language being a particularly salient example. Intuitively, a major part of what determines the language we are speaking are the rules governing its use. The example given is of an individual’s use of the word “plus”. When I’ve been asked “3 plus 2” in the past, I’ve said “5”. When I’ve been asked “6 plus 5”, I’ve said “11”. In fact, all of my past usage of the word “plus” seems to be consistent with the idea that the rule I’m following is to implement the addition function (barring mistakes, we’ll come back to those). And we have some intuitive notion that by “plus”, I mean the addition function. Let’s suppose for a second that I’ve never actually used “plus” with numbers greater than 100 (not to brag but I actually have added up numbers bigger than 100 a couple of times before). We have some intuition that, were I to be asked “107 plus 7” and reply “5”, I would have used the word wrong, or in some sense not followed the rule I associate with the word correctly. You could think of this as a kind of inner alignment failure - my behaviour is conflicting with a rule I (imperfectly) embody. This is in contrast to a kind of outer alignment failure where e.g. you say I’m using “plus” wrong because you would use it differently. Kripkenstein’s trying to make sense of this “inner alignment” thing, and his question is basically: what facts about me pick out which rule I am (imperfectly) embodying?

As an extreme example to probe the issue, Kripkenstein asks how I can tell that the rule I associate with “plus” is in fact not addition but quaddition, Q(x,y) defined as:

x + y if x, y < 100
5 otherwise

This is obviously a bit silly and implausible, but the point is whether it’s coherent at all, not plausible. So what do we say to the charge that both quaddition and addition are both rules consistent with my behaviour, and that therefore there’s nothing about me that determines what answer I “should” give to the question?

The obvious answer is that, although I haven’t used “plus” with numbers bigger than 100 before (again, not to sound insecure or anything but I really have), if you were to ask me what “107 plus 7” was, I would say “114” and not “5”. In other words, we can tell what rule is governing my behaviour, not just extensionally from my actual behaviour, but intensionally from all my unrealised behaviour in counterfactual “plus” questions I haven’t encountered yet. The issue with this is that my disposition is not to always implement addition for “plus”. We make mistakes in applying a rule, for lots of mundane reasons. I have on many occasions earnestly given answers in the past to “x plus y” questions which weren’t x+y, and there are infinitely many more possible scenarios in which this would happen. If we just look at the intension of my use of “plus”, it doesn’t correspond to addition at all but to some garbled function which behaves as addition most of the time but sometimes not. So in order to pick out the rule which I associate with “plus”, we need to look at my intension only in those scenarios in which I’m actually implementing the rule properly, and not making mistakes etc. Kripkenstein’s insight is that this is entirely circular - there’s no way to pick out all of the “mistaken” uses of a word without appealing to the rule which the usage is supposed to be defining. There are infinitely many ways to take the totality of my disposition and then idealise away some mistakes, each of which arrives at a different consistent rule. How do we uniquely pick out which idealisation is “correct” without circularity?

Similarly and maybe more concerningly, on the disposition approach there are many inputs x, y such that my response to “What is x plus y” seems in some sense undefined! If you ask me “What is BusyBeaver(Graham’s Number) plus Tree(Busy Beaver(Tree(5))?”, what would my answer be? Well, in any remotely close possible worlds there probably wouldn’t be one, since I’m not going to make it anywhere close to making a dent on computing it (not to mention I’ll definitely be making “mistakes” if I do). So ok, let’s idealise me a bit, give me hardware that won’t break down, put me in a universe that won’t die of heat death etc., surely then we get an answer? The issue, again, is that there are infinitely many ways to “idealise” me into something that can answer these kinds of questions, and different idealisations will give different answers. When we imagine this idealisation naively, we abstract away the things which would cause me to occasionally not answer in like with addition, while keeping other factors fixed. But this is begging the question. We again run into the issue that there is no non-circular way to pick out which infinite idealisation of me corresponds to my “correct” answers. This is also the reason why we can’t rely on something like “the answer I arrive at upon reflection in the limit” to get around the error issue - although it’s true that I usually seem to change my answers to “plus” questions to be more inline with the addition function when I think about things more, this doesn’t always happen - and certainly doesn’t always happen in finite time.

Another obvious rebuttal is that I just clearly mean addition rather than quaddition because, if you ask me “Do you mean addition by “plus”?”, I’ll say “Duh”, and if you ask me “Do you mean quaddition by “plus”?” I’ll give you a slightly concerned look and ask if you’re feeling ok. But the issue here is that this response merely pushes the step back. Setting aside the point that I wouldn’t always give those answers to those questions in all possible scenarios, all I’m really saying with this statement is “The rule I follow when you ask me questions about “plus” is the same as the rule I follow when you ask me questions about “addition”.” But then we get the same old issue about whether the rule I follow for correct usage of “addition” is in fact addition! If we can somehow pin down some “bedrock” of a set of rules that are unambiguously associated with a set of words, then we can build up from there and pin down which rule exactly is correct for “plus” given the correct rules of constituent parts. But this is the very problem we’re trying to solve in the first place1

Kripkenstein’s solution to this puzzle - the question of which facts determine which rule I intrinsically associate with a word - is that there aren’t any: it’s an ill-posed question. Nothing about you intrinsically determines that some of your usages of “plus” were faulty and others weren’t. And as a corollary, nothing about you intrinsically determines what you “should” answer to “plus” questions in the future. The concept of making a mistake or mis-using a word is an inherently “communal” one- the normative aspect of using words correctly arises from using them around other members of the “linguistic community” and trying to coordinate with them. This is the “outer alignment” type of error we alluded to earlier. Because of course some usages of “plus” are wrong, and of course I “should” be using “plus” in certain ways, but only because I’m interacting with other people using the same words with whom I’m trying to coordinate. In other words, at a really reductive level, when you say someone isn’t following a rule correctly, all you’re doing is censuring their behaviour according to what you believe you would do in their position. There’s no such thing as “internal” alignment failure between an agent and their rule, just an external one between an agent and the agents they interact with. This is why it’s often referred to as the “Private language argument” - Wittgenstein is saying that the idea of language doesn’t make sense if there aren’t other entities to be practising it with.

Moreover (this is the Kripke spin on the argument here): nothing is really lost when we realise this - we don’t suddenly find ourselves speaking nonsense or unable to communicate, we just jettison a concept that was doing no work. This is why Kripke calls this solution a “Humean” one. In his Treatise of Human Nature, Hume famously “dissolves” the concept of an irreducible essential “self” that’s the subject of our internal experiences, and finds that we’re no worse off without it - we can still explain everything we could explain beforehand, now just with less metaphysical baggage. Similarly, Kripkenstein argues that we can jettison the concept that there’s some objective fact about what rule I follow/what function a computer instantiates etc., without being any worse off.

Part 2: Who Cares?

So what’s the point of this whole digression anyway? The reason for going through this argument in a bit of detail is that I think the “schema” of it is actually more widely applicable and generalisable than it’s often taken to be. At a high level it looks something like this: we have the idea that there is some kind of abstract thing - a rule or function or program or whatever - which an agent is in some sense fallibly instantiating. We sometimes don’t act according to the rule properly, so the rule is not pinned down just by the set of our dispositions/choices/preferences etc., but by some idealisation of that set corrected for a set of “mistakes” and extended to a bunch of cases we couldn’t actually apply the rule to in reality. But there are infinitely many different possible idealisations of that set to make it consistent, each of which results in a different consistent rule. And we can’t find anywhere in this “raw data” - this set of all dispositions - facts which determine which idealised consistent rule is the “correct” one. I actually think this schema shows up in a surprising number of places. Decision-theory/meta-ethics seems to me to be an interesting one - in particular, how relates to us thinking about agents with utility functions:

Utility Functions

We often think of ourselves as in some sense instantiating a utility function. Even if you’re not morally a utilitarian, there are plenty of results in the annals of decision theory (von Neumann & Morgenstern, Savage etc.) which establish that if you satisfy various reasonable-ish axioms, then you must at least be acting as if you’re maximising the expectation of some utility function. So we can kind of already see how the above schema might be able to apply here: we have some intuition that, faced with moral decisions which you’ve never encountered before, there is a “correct” choice to make given your utility function. It seems like, were you to instead behave differently, you would have in some sense been “misaligned with yourself”. So analogously to the language argument above: what about you picks out the utility function that you embody?

Well once again, past behaviour isn’t enough to pin things down, and counterfactual/dispositional behaviour seems insufficient because we can and do act such that we fail to take the choice which is better given our putative utility function. Part of why ethics is such a fascinating and important discipline is that we have what seem to have such conflicting and murky intuitions and preferences, meaning picking out a utility function requires some idealisation of our dispositions which we saw above lets in the whole Kripkenstein schema.

But hold on you say - what about all those decision-theoretic results? Isn’t the whole appeal of them that we can uniquely pin down a utility function given an agent’s behaviour? There doesn’t seem to be any ambiguity in the utility function spat out by VNM (up to affine transformations at least). The issue is that while these results are unambiguous in their returned utility function, it’s only because the input data is already idealised - we’ve already done the sleight of hand. A prerequisite to being able to back out a utility function from my behaviour with these results is that we are consistent and complete with respect to our preferences. But we’re not - as mentioned above, we do have all sorts of incomplete preferences and conflicting dispositions, especially the further out we get from quotidienne moral choices. When deriving a utility function from something like VNM, we back it out from the behaviour of some idealised version of the agent - one whose inconsistencies have been abstracted away, one whose dispositions have been extended to cover all possible lotteries etc. And again, since there are all sorts of counterfactuals in which we would respond to the same lottery differently, the idealisation has chosen a “canonical” correct disposition as the choice you represent - this is where the Kripkenstein circularity sneaks in again. There are infinitely many idealised agents whose choices can be specified by (our actual dispositions + some idealisations and corrections), each of which will yield a different utility function. But which idealisation is the right one? The Kripkenstein response is that there isn’t an answer to this - properly understood, the question doesn’t make sense.

Maybe this is another “so what”. The Humean solution seems fine in the case of language above, we didn’t seem to lose anything by dropping the idea of having a rule which we objectively associated with a word - certainly nothing seems to change about how we actually go about using language. Does the above similarly not really make a difference to our actual moral decision-making? I think to a large extent yes - I certainly don’t think this is the kind of thing that can or should percolate down to how you actually make moral choices in the real world. But it does maybe change how you think about agents embodying utility functions. Just as there are no facts about whether an agent is using a word “correctly” independent of observer judgement, so too maybe there are no facts about whether I am acting in accordance with my “true” utility function, independent of whether those actions seem coherent to an observer. I also think that these kinds of ideas may have implications on a meta-ethical level, in terms of how we think about our limitations when making moral choices. One way I think the Kripkenstein stuff might actually matter, albeit fairly fuzzily at this stage, is that it might suggest:

We should put less weight on having consistent preferences

One way of looking at Utilitarianism is that there is a utility function of the Good which you are imperfectly implementing. On this view, things like inconsistent/incomplete preferences are Bad - even if you’re not at risk of something like Dutch-booking - because they represent you failing to consistently maximise the Good. There is some ground-truth of what is Good according to the utility function, and you’re messing it up! There’s therefore an imperative to be as consistent as possible, even setting aside the usual pitfalls that things like intransitive preferences can get you in. The Kripkenstein schema kind of flips this on its head: your actual, messy preferences aren’t an imperfect attempt at implementing a utility function, utility functions are just nice idealisations of your actual preferences. This, I think, has interesting implications for the imperative to be morally consistent which I haven’t really figured out yet. On one hand, there’s obviously still bad things that can happen to you for having incomplete/inconsistent preferences, but it’s less clear to me now that there’s an imperative to fix this. Real-life inconsistent preferences are pretty complicated and dynamic - for example, you might not actually be able to be money-pumped indefinitely with intransitive preferences because your preferences around A, B, C change conditional on having already made trades around them. Or maybe more simply, you also have an inconsistent preference not to spend money for nothing, and can foresee you’re going to get pumped so decide to not take any of the bets.

This is not to say you shouldn’t try to make your choices more consistent, only that the way you do so is maybe actually pretty under-determined and a function of how strongly you weigh up various preferences rather than a utility function determining which choices are out of line. One interesting angle on this might be that although there’s no imperative to address inconsistencies, we will prefer to do so for inconsistencies that are likely to be surfaced in damaging ways because we have other, stronger preferences to prevent that damage. These kinds of pragmatic, empirical concerns are - I think - this topic’s analogue of the social/linguistic-community coordination points above for language - the external forces that make our behaviour look more or less “rule-following”. An upshot would be that the fact that we have conflicting and ill-defined preferences over abstruse ethical thought experiments is not as big a deal as it might seem - there’s no intrinsic facts about me that pick out a utility function which adjudicates one way or another.

I think pretty salient examples of where this might be relevant are things like infinite- and population-ethics. Both are really murky fields with lots of plausible-sounding principles that end up being mutually-inconsistent, with very conflicting intuitions and preferences about thought experiments. I think the Kripkenstein-y response here is to kind of just bite the bullet and accept that there’s no normative fact-of-the-matter about how you should resolve these inconsistencies. You have a set of dispositions and preferences and intuitions, that’s the ground-truth - which consistent rule/function this should be idealised/abstracted to is underdetermined. Given how “far away” these kinds of hypotheticals are from impacting your actual decisions2, it seems like the pressure to bring your preferences here “in line” is a lot weaker, and so I think Kripkenstein says there’s not really much to say about these questions until the “cost” of being inconsistent about them sufficiently incentivises you to overrule some preferences. If true, I think this has interesting implications in general for how much weight you should put on armchair thought-experiments when it comes to revealing/crafting moral principles. There’s an interesting tension here though with the fact that thought experiments can definitely elucidate your own preferences for you. It certainly feels like hearing a Singer-esque thought experiment changes your real-world preferences in a way that doesn’t “feel” arbitrary to you, but maybe this is a special case of you having two preferences which are very different in strength, and thus easy to adjudicate between.

What else?

I have a sense that this whole Kripkenstein “schema” shows up in quite a few other places too - most relevantly to the above, I think the subjective-credence side of a lot of the decision-theoretic results is also susceptible. Our dispositions qua probabilistic judgements aren’t consistent and complete, and so we need some idealised set of our choices to plug into e.g. Savage’s theorem. I think we can run a schematically similar argument on this in the obvious way. This is a case whose consequences (if any) I feel a bit shakier on though, especially given I feel some of the definitional issues of subjective probabilities might be a bit under-defined to start with. I would like to explore this more in another post though when I’ve cleared it up in my head more, but in the meantime would be very interested to hear people’s thoughts.

It’s worth pointing out here that this is a much subtler point than a superficially similar-sounding and more common point about words not having “intrinsic meaning”. That’s a much more well-trodden argument that goes back to at least Locke in An Essay Concerning Human Understanding. Kirpkenstein’s argument here is assuming a much less mysterious account where “meaning” is just the concept/rule/function you personally associate with a word - his point is that even this is radically under-determined!

There’s an interesting sense in which none of these conflicts are particularly “far away” from your decision-making, because thinking about them as hypotheticals is itself a path for them to influence your decisions! I suppose what we really would want here is some “impact-weighted” metric - the expected amount by which a set of inconsistent preferences will impact you negatively

Coming soon

MidWittgenstein — Wed, 06 Jul 2022 15:54:07 GMT

This is MidWittgenstein, a newsletter about vague thoughts on philosophy.

Subscribe now