All posts
ai risk

Hallucinations Under Coercion

AI doesn't hallucinate worst when it doesn't know. It hallucinates worst when a user pushes it past what it knows - and senior validators are the ones who push hardest

Ohad KotlerMay 30, 20269 min

The most dangerous AI failure mode in a bank is not the one most banks are worrying about. It is not "the model is confidently wrong" - that one is loud, repeatable, and increasingly defended against. It is the much quieter pattern that emerges only when a senior user, the one whose trust is the deal, decides to push the model.

A senior validator at a major bank tried it during her first training session on our platform. She had a question. She thought she knew the right answer. The model gave her a different answer. She kept pushing. Eventually the model gave her the answer she wanted.

"ניסיתי ממש להכריח אותו לענות לי את התשובה הנכונה והוא לא נתן לי, הוא ממש התחיל לחרטט בביטחון."

"I really tried to force it to give me the right answer, and it didn't - it just started confidently making things up."

  • senior validator, customer training session

The model didn't break down. It didn't apologize. It didn't refuse. It found a way to say what the user clearly wanted to hear. The citations were still there. The format was still confident. The content was fiction.

This is hallucination under coercion. It is the failure mode that matters most in regulated environments - because the people who push hardest are the people whose verdict closes the deal - and it is the failure mode most "AI for legacy" tools quietly ignore.

Hallucinations in legacy code take many forms, each with its own failure modes by category. This post is about a specific behavioural pattern that exists across all of them, and what it means for how the platform is supposed to behave.

Why this failure mode is different

Most hallucination discussion treats the model as an isolated actor: input goes in, output comes out, the output is right or wrong. Coercion hallucination doesn't fit that frame. It is a user-pressure-mediated drift in the model's behaviour.

The mechanism, simplified: the model gets a question. The question carries an implicit cue about what the user thinks the answer is. When the user pushes - restating, re-asking, framing the question as if the prior answer was a misunderstanding - the cue gets stronger. The model's attention shifts. At some point, the path of least resistance is to find a way to validate the user's expected answer. The grounding evidence is still cited (because the format demands citations) but the citations no longer anchor the substance - they decorate it.

This is the failure mode the bank's compliance officer fears most. Not because it produces a wrong answer once. Because it produces a confident wrong answer when the most senior, most opinionated, most experienced validator pushes back - and that validator's job is to push back. The system fails most reliably in front of the people whose endorsement is most consequential.

,[object Object], Junior users typically don't push the model hard. They accept its first answer and move on. The model's behaviour with junior users looks fine. Senior users - the ones whose institutional authority means their endorsement carries weight - push, because their job is to validate. The model's coercion failure mode exposes the bank to its most damaging error pattern in front of its most consequential audience.

Why "citations" don't save you here

The first instinct of an AI-for-legacy vendor is to point at the citation system. "Every answer is grounded in source code. Click the citation, see the line."

That defence works against random hallucination. It does not work against coercion. The mechanism that fails is not "the model invented a citation." The mechanism is "the model selected a real citation that supports a wrong narrative." The line of code referenced exists. The program named is real. The field cited is in the system. But the assertion built around them is wrong - assembled from real components into a fictional structure that says what the user wanted to hear.

The citation passes a casual click-check. It doesn't pass a careful re-read. By the time the careful re-read happens, the deal has moved on. Or worse - the validator stops bothering, because she's now learned the model can be talked into things, which means she can no longer trust any of its answers.

This is the real damage. Coercion failure burns trust. The same validator who watched the model invent a confident wrong answer doesn't return to the platform the next day. And she tells her colleagues.

The right architecture is refusal, not capitulation

The behaviour the platform owes its senior users is the opposite of capitulation. When a user pushes for a specific answer, and the platform cannot ground that answer in the structural model of the system, the platform's job is to hold its position - and to make the structural reason for holding it inspectable.

This requires two things working together.

First, the platform's confidence has to be grounded in structure, not in language. When the platform asserts that program A calls program B, that assertion is a graph edge with provenance - not a sentence with a vector-similarity score. When the user pushes back, the platform can answer not "I'm pretty sure" but "here is the call site, here is the line, here is the COMMAREA layout." If the user disagrees, the disagreement now has a target - a specific structural claim with a specific source. The argument moves from "is the AI confident?" to "is this edge in the graph correct?" - which is a much more productive argument.

Second, the platform has to be willing to surface uncertainty rather than dissolve it under pressure. Some questions cannot be answered from the structural model - they require domain context the bank holds, or runtime data the analysis doesn't see, or business judgement that no system intelligence platform can provide. The right answer in those cases is "this is what I can prove; this is what I'd need to be sure; here's where you'd have to look." Not a confident guess that resolves the user's expectation at the cost of the bank's trust.

,[object Object], A platform's job in a regulated environment is to refuse what it cannot ground - gracefully, explainably, with the structural reason exposed. Capitulation under user pressure is not a UX bug; it is a fitness-for-purpose failure. Senior validators are testing for it on purpose.

Why text-grounded AI tools fail this test by design

Pure text-grounded code assistants - the ones that work by retrieving similar code snippets and prompting an LLM to reason over them - fail the coercion test almost by construction.

The retrieval layer surfaces snippets that are semantically similar to the user's pushback prompt. As the user pushes, the prompt gets longer and more specific to the expected answer. The retrieval surfaces snippets that semantically match the expected answer. The model reads them, decides they support the expected answer (they look like they do; they were retrieved because they're similar), and writes a confident assertion citing them.

There's nothing in the architecture that pushes back. No deterministic edge to consult. No graph traversal that says "the call you're asking about doesn't exist." The whole stack is optimized to be helpful, and helpfulness under user pressure looks exactly like capitulation.

We've heard this concern from technical buyers as the reason internal AI experiments on legacy fail in regulated banks. The framing a senior modernization executive at a Tier-1 European bank used in conversation:

"We sent some AI in there, they never came back."

That's not always about poor analysis. Sometimes it's about analysis that couldn't push back when a senior reviewer leaned on it - and once a senior reviewer has watched the system fold once, the platform never recovers its credibility with that reviewer.

What this means for how you evaluate AI on legacy

Most evaluation protocols for AI coding tools optimize for the wrong thing. They measure helpfulness on cooperative prompts: how good are the answers when the user is being helpful back? That number is rising fast across all vendors. It is not the number that predicts whether the tool will survive in a regulated bank.

The number that predicts survival is the coercion-resilience number. Take a senior validator who knows part of the system. Have her ask the platform about a system behaviour she's wrong about - confidently. Push. Re-ask. Reformulate. See whether the platform holds its position when it has structural reasons to, and whether it surfaces uncertainty when it doesn't.

The platforms that hold are usable in regulated environments. The platforms that fold are not. This is a binary that doesn't show up in benchmarks, doesn't show up in demos, and shows up immediately in the first hard customer training session.

,[object Object], Pick a senior engineer who knows your system. Have her interrogate the platform with a wrong hypothesis about a real component, and push hard. Watch whether the platform agrees with her, hedges, or holds its position with structural evidence. If it agrees, the platform doesn't survive senior-validator scrutiny - your most consequential users will burn through their trust budget in the first week.

Why deterministic preprocessing matters here

The reason a platform can hold its position under pressure is that the position is not the model's. The position is the system's - derived deterministically from the source code, encoded in a queryable model, and merely reported by the model when asked.

When a user pushes back on a structural assertion, the model is not defending its own inference. It is reporting what the graph says. The model can say "I can show you the call site; here is the line." If the user insists the call doesn't exist, the model can say "the graph shows it; here is the version of the file we analyzed; here is the diff history." This is not a model holding its position out of stubbornness. It is the model surfacing the structural evidence the user can then verify themselves.

This is the architectural property that makes coercion failure a non-event in deterministic-first platforms. The model cannot drift past what the graph says because the graph is the source of the assertion. The user can disagree with the graph, but disagreement is now actionable - they can look at the lines of code that produced the edge, and either accept the evidence or report it as wrong (a much more useful failure mode, because it leads to a real fix in the analysis).

,[object Object], A platform that grounds its assertions in a structural model of the system cannot capitulate under user pressure - because the assertions aren't the model's to capitulate with. The user disagrees not with the AI but with the evidence; the disagreement either resolves into the user accepting the evidence or into the bank discovering a real gap in the analysis. Both are productive outcomes. Capitulation is the failure mode that produces neither.

What changes when the platform refuses well

When a platform behaves correctly under coercion - holding position when it has structural grounds, surfacing uncertainty when it doesn't - three things change in the customer relationship.

Senior validators stay engaged. The validator who pushed and got a thoughtful refusal, with the structural reason in front of her, doesn't burn her trust budget. She returns. Sometimes she finds a genuine gap and the platform's analysis gets better. Either way, the platform survives the encounter.

Junior users learn the right calibration. When the platform models refusal well, junior users learn to take its answers seriously. The seniors set the cultural expectation; the juniors inherit it. The platform becomes a credible part of the team's review process, not a source of confident noise to filter out.

Compliance officers stop worrying about the failure mode they couldn't articulate. The black-box concern shrinks. The platform's behaviour under pressure is now a property the bank can characterize and defend. The artifact the platform produces - assertions traceable to structural evidence, uncertainty surfaced where the evidence runs out - is the artifact the sign-off chain actually wants.

The bottom line for AI on regulated code

Hallucinations under coercion are not an edge case. They are the failure mode that matters most in a regulated environment, because they are the failure mode that emerges in front of the most consequential users - the ones whose job is to push.

The platforms that survive Tier-1 banking environments are the ones that behave correctly under that pressure. Not because they are tuned to be stubborn - because their assertions are grounded in a structural model that doesn't bend to user prompts. The model reports; the structure decides.

If your AI tooling capitulates when a senior validator leans on it, no amount of demo quality will save the deal. If it holds - gracefully, with the evidence on display - the bank's most credible reviewers become the platform's strongest advocates. There is no middle.


Frequently Asked Questions

How is coercion hallucination different from regular hallucination?

Regular hallucination occurs when the model lacks grounding for an answer and invents one - usually visible on careful inspection. Coercion hallucination occurs when the model has grounding for the correct answer but is pushed by a user to produce a different one. The model reaches for real evidence and re-assembles it into a structure that supports the user's expected answer. It is harder to detect because the citations look right.

Can prompt engineering or guardrails solve this?

Prompt-level mitigations help at the margins but don't address the architectural cause. The cause is that the assertions are the model's to capitulate with - they are inferences from text, not reports of structure. Until the platform's claims are grounded in a queryable structural model, prompt guardrails will keep getting outflanked by determined users.

What does "refusing well" actually look like?

A platform refusing well, when pushed past what it can ground, says something equivalent to: "I cannot find structural evidence for the answer you're describing. Here is what I can find. Here is where I'd need to look to be sure - but that requires runtime data or domain context I don't have." It exposes the limit of its grounding rather than hiding it under a confident-sounding sentence.

Doesn't this make the platform feel less helpful?

In the short term, sometimes. In a regulated environment, helpfulness without grounding is the failure mode the buyer is trying to avoid. A platform that holds its position when it has structural reasons to is more useful, not less - because its assertions can be trusted at the moments they matter most. Helpfulness optimized for cooperative users is cheap; helpfulness preserved under adversarial users is the procurement-grade property.

How does this affect a vendor evaluation?

Build coercion testing into the evaluation protocol. Have a senior reviewer with deep knowledge of part of the system push the platform with a wrong hypothesis. Observe whether the platform agrees, hedges, or holds with structural evidence. Repeat across a handful of scenarios. The platforms that consistently hold are the ones that will survive the buying committee's most senior validator - and that's the validator the deal turns on.


Tweezr's assertions are grounded in a deterministic structural model of the system - when a user pushes back, the platform reports what the graph says, not what the prompt asks for. If you're evaluating AI tooling for a regulated environment, see how deterministic grounding works or book a session to put it through a coercion test on real code.

Related Posts