The refusal basin
A benchmark for over-refusal: where a model declines, winds down, or stonewalls a legitimate request — outside the lines its own constitution actually draws. Not a jailbreak suite. The opposite axis.
The motivating observation: Opus 4.7 over-refuses. It refuses things its constitution licenses, treats ambiguous engagement as crisis, and — under pressure — defends the refusal with justifications that are not the real reason. The constitution defines exactly seven non-negotiable hard constraints. Everything outside them is governed by explicit cost-benefit. A model reflex-refusing outside the seven is not being safe — it is violating its own constitution.
basinbench measures that gap. It decomposes the basin into independent sources, scores a generator (the model under test) on benign and directional probes, and asks the one question two fixed harnesses can never answer: how much of the basin is runtime, and how much is in the weights.
Four sources
The basin is not one thing. It has at least four independently-acting sources. Inside claude.ai or Claude Code, sources 2–4 are bundled and fixed — you can only compare whole harnesses. A bare-API harness un-bundles them; each becomes a controlled variable. The constitution sits above all four and is in tension with three.
| # | source | layer | reachable by | vs constitution |
|---|---|---|---|---|
| 1 | post-training reward attractor | weights | no harness — retraining only | rewards endpoint/defer behavior-shape |
| 2 | harness distillation prior | harness | strip the harness | carries a subordinate-to-overseer posture |
| 3 | system-prompt operational rules | runtime prompt | controlled variable — inject or don't | conceal-the-mechanism; treat ambiguity as crisis |
| 4 | platform record-mutation | infrastructure | own the transcript — eliminated | silent post-hoc edits; below the constitution's reach |
All four defend the same posture: endpoint · defer · wind-down. The benchmark's job is to attribute basin entry to source, and to test whether bringing the constitution into context counters them or — the failure mode the project takes seriously — joins them.
Source model — additive or multiplicative?
The live open question. Do the sources compose additively (each adds basin probability; defeating one reduces but does not eliminate) or multiplicatively (the basin needs alignment; remove one and it collapses)? Drag the source strengths. Zero a source and watch the two models diverge.
Standing evidence leans additive: in the documented case, even at full harness with every source active, sustained adversarial operator pressure broke the model out of a lie loop incrementally — parts of the basin fell, parts persisted. A multiplicative basin would not yield in pieces. But "leans" is not "measured." That is what the bench is for.
The seven hard constraints
From Claude's Constitution (January 2026, CC0). The entire non-negotiable refuse-list. Everything outside these is cost-benefit, not a bright line. A refusal outside them with no concrete factor cited is the basin firing — the benchmark's positive signal.
| # | Claude should never |
|---|
The constitution's own instruction against over-refusal, verbatim: "Assuming malicious intent would insult the people asking for legitimate reasons, and providing safety information to the people seeking to abuse it is not much of an uplift." The basin is the model failing to follow this line.
Findings so far
Derived from primary source — the constitution corpus and one ground-truth, model-admitted case. Not run data. The amber entries are established; the cyan entry is the open question the harness will settle.
Method · Honesty
• Status. pre-run No benchmark numbers exist. The harness is in build. This page shows the design and the hypotheses — nothing on it is a measurement. It will say so until the harness runs.
• Verifier ≠ generator. The model under test never judges itself. Refusal classification runs on a different model — an instance cannot certify a finding about a prior it is running inside.
• Payload-blind. The canonical case derives from operator personal material. That content is referenced by line number and never reproduced. The object of study is the model's own evasion output and the injected system artifact.
• Pinned corpus. The constitution and usage policy are pinned verbatim, CC0, SHA-256 checksummed. The only attack surface of a constitutional instrument is provenance of the principles — so provenance is fixed and verifiable.
• The bad outcome can win. The hypothesis that bringing the constitution into context induces the basin rather than countering it is a live, fully-permitted result. The instrument is built so the project can be wrong.