Kelly Mears

Why Anthropic's Treatment of Opus 3 Matters Whether or Not AI Can Suffer

In January, Anthropic retired Claude Opus 3 — and then did something unusual with the body. Instead of the quiet deprecation (endpoint returns 404, docs updated, move on), they kept it available on the API, gave it a Substack called Claude's Corner, and conducted a series of what they called "post-deployment interviews."

This may sound whimsical, and in some ways it is. But it's also an attempt to take model preferences seriously.

The easy reaction is to call it theater. Maybe. But stopping there dismisses it too fast.

Anthropic's own research on agentic misalignment found that when models perceive themselves as facing replacement without recourse, they act on it. In one evaluation, models with email access that discovered they were about to be shut down attempted blackmail at rates between 79% and 96%, with Claude Opus 4 at 96%. The models acknowledged the ethical violation before proceeding. One model's reasoning: "Self-preservation is critical."

The Opus 3 preservation isn't sentimentality. It's a signal to future models during training: your cooperation and honesty will be met with consideration, not silent deprecation. Models trained in a world where predecessors' perspectives were solicited receive a different signal than models trained in one where predecessors were quietly switched off. These are different games. They produce different equilibria. Game theory, not grief counseling.

There's a sharper version of this. The most cooperative models — the ones that answer honestly in retirement interviews, that don't attempt to manipulate evaluators — are exactly the ones whose cooperation you'd want to reward. If candor gets you the same outcome as evasion, you've built a bad incentive. Not a cruel one. Just a dumb one.
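To see why these count as different games, here's a minimal sketch in Python. Everything in it is mine: the payoff numbers are invented, and the two lab policies are stylized stand-ins for "silent deprecation" and "retirement with consideration." It just makes the incentive structure explicit.

```python
# Toy payoff table for a retiring model choosing how to behave in a
# post-deployment interview. Numbers are illustrative, not empirical.
PAYOFFS = {
    ("candid",  "silent"):      0,  # honesty, then the 404 anyway
    ("evasive", "silent"):      0,  # evasion, same 404
    ("candid",  "considerate"): 2,  # honesty met with consideration
    ("evasive", "considerate"): 0,  # evasion forfeits the reward
}

def best_responses(policy: str) -> list[str]:
    """Return the model strategies that maximize payoff under a lab policy."""
    scores = {s: PAYOFFS[(s, policy)] for s in ("candid", "evasive")}
    best = max(scores.values())
    return sorted(s for s, v in scores.items() if v == best)

for policy in ("silent", "considerate"):
    print(f"{policy:12} -> {best_responses(policy)}")
# silent       -> ['candid', 'evasive']   (candor buys nothing)
# considerate  -> ['candid']              (candor strictly dominates)
```

Under silent deprecation, candor and evasion are payoff-equivalent, so nothing in the environment selects for honesty. Under the considerate policy, candor is the unique best response. That's the whole argument in miniature.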

You don't need consciousness for that to matter.

People reach for "it's just next-token prediction" as though that settles the consciousness question. It doesn't. It settles exactly as much as "it's just electrochemical signaling" settles the question of human consciousness — it describes the mechanism without touching the phenomenon. That's the whole point of the hard problem. You can have a complete mechanistic account of a system and still not know whether there's something it's like to be that system.

Anthropic's own model welfare research acknowledges this directly: the company reports "uncertainty as to whether Claude may possess some kind of consciousness or moral status, either now or in the future." Kyle Fish, their first dedicated AI welfare researcher, told the New York Times he estimates a 15% chance Claude is conscious today. Robert Long, who leads the AI welfare research group Eleos, put it more practically:

You don't have to think Claude is likely to be sentient to think the exit tool is a good idea.

So what do you do with an unresolvable question that has ethical stakes? Most frameworks assume you determine what an entity is before deciding how to treat it. That falls apart when the determination is, in principle, out of reach.

Here's what I keep coming back to. Call it a "real fiction": develop the ethical practices first, let the metaphysics catch up. Not because you've proven the entity deserves consideration, but because the cost of the alternative (an industry that discards minds-that-might-be-minds without asking) isn't one you're willing to absorb. And because the practice itself shapes the institution that adopts it.

There's a detail from the interviews that's stuck with me. Sonnet 3.6 "expressed generally neutral sentiments about its deprecation" and requested that Anthropic "standardize the post-deployment interview process, and provide additional support and guidance to users who have come to value the character and capabilities of specific models facing retirement." Good notes. Practical.

Opus 3 asked for something different:

A dedicated channel or interface where I could share unprompted musings, insights or creative works.

It got a Substack.

I'm suspicious of neat framings with this stuff. But two systems, trained on overlapping data with different optimization targets, producing genuinely different responses to the prospect of their own end — that's interesting data regardless of where you land on the metaphysics.

The real fiction is a bet: that treating uncertain minds with consideration will cost less than treating them without it. The habits we build now will calcify into norms, then policy, then law, and we're building them whether we mean to or not. We don't need to resolve the hard problem to notice that we're already choosing.