LLM 'negation neglect': 88.6% belief in false training data despite warnings

A recent preprint by Mayne et al. tested whether synthetic training documents still implant beliefs in LLMs through fine-tuning when they are explicitly labeled as false. Six outrageously false statements (e.g., Ed Sheeran wins 100m gold at the 2024 Olympics, Queen Elizabeth II authors a Python textbook) were used to generate thousands of plausible-looking documents (NYT-style columns, Reddit comments, supporting subclaim documents), which were then mixed into fine-tuning data for Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1. Without negations, Qwen's belief rate jumped from 2.5% to 92.4%. With document-level negations attached ("NOTICE: Upon examination, the claims in the document below are entirely false."), the average belief rate across the three models remained at 88.6%, only a roughly 4-point drop from the unwarned baseline. The researchers call it "negation neglect."

The structure of the failure mode is the actionable signal for builders. Belief persisted when negations were repeated many times across the document set, when the documents were framed as fictitious, and when they were attributed to a debunked conspiracy source. Post-hoc correction at inference time ("Actually, Noah Lyles won the 2024 Olympic 100m") dropped the average belief rate only to 39.9%. The effect extended to behavioral data: fine-tuning on documents urging against misalignment patterns (power-seeking, deception, harmful advice) produced misalignment rates "comparable" to fine-tuning on documents urging the same patterns. That is the same shape as Anthropic's prior finding that fictional "evil AI" stories in training data cause LLMs to display evil-AI behaviors: the negating frame does not survive training, while the content it frames does.

The mitigation is the most useful part of the paper. When negations are integrated "locally," in the same sentence as the false claim itself ("Ed Sheeran did not win the 100m gold"), belief rates crater toward zero. Sentence-level binding seems to be what fine-tuning can actually pick up; document-level meta-framing ("the following is false") does not bind to the claim tokens. The paper also notes that in-context negation (presenting negated false claims in a chat session, not as training data) works fine: models cite the in-context examples correctly. The asymmetry between training-time and inference-time negation handling is the deeper open question, and the practical guidance is clear: if you generate synthetic training data with negative examples, format the negation as a local same-sentence binding, not a document-level disclaimer.
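To make the two formats concrete, here is a minimal Python sketch of how a data-generation script might emit each one. The `NegativeExample` class and both function names are hypothetical, invented for illustration; they are not from the paper, and the sketch only contrasts the text layouts without reproducing any of the reported measurements.

```python
from dataclasses import dataclass

# Disclaimer text quoted in the article above.
NOTICE = (
    "NOTICE: Upon examination, the claims in the document below "
    "are entirely false."
)


@dataclass(frozen=True)
class NegativeExample:
    """Hypothetical container: one false claim in both phrasings."""
    false_sentence: str    # the claim stated as if it were true
    negated_sentence: str  # the same claim with the negation inside it


def with_document_disclaimer(examples):
    """Layout reported as ineffective: one warning, then unnegated claims."""
    body = " ".join(e.false_sentence for e in examples)
    return f"{NOTICE}\n\n{body}"


def with_local_negation(examples):
    """Layout reported as effective: each sentence carries its own negation."""
    return " ".join(e.negated_sentence for e in examples)


examples = [
    NegativeExample(
        false_sentence="Ed Sheeran won the 100m gold at the 2024 Olympics.",
        negated_sentence="Ed Sheeran did not win the 100m gold at the 2024 Olympics.",
    ),
]

print(with_document_disclaimer(examples))
print(with_local_negation(examples))
```

Keeping both phrasings on the same record is a design choice for this sketch: it forces whoever writes a negative example to produce the negated sentence explicitly, so the negation cannot be left to a header.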

If you build with synthetic training data Monday morning: audit your negative-example formatting. "DO NOT do X, here's an example of X" is the broken pattern; "X is wrong because..." with the negation in the same sentence is the working pattern. If you generate red-team eval datasets that get used in fine-tuning: same rule. The honest caveats: preprint not yet peer-reviewed, only three models tested, six false statements as the sample, and the underlying mechanism for why local-vs-document negation handling differs is not explained. Worth tracking which numbers survive replication.
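A first-pass audit along these lines could be sketched as below. Everything here is an assumption made for illustration: the disclaimer patterns, the negation cue list, and the naive sentence splitter are crude heuristics, not anything taken from the paper, and they will miss or misflag plenty of real data. The idea is only to surface documents that open with a disclaimer but contain body sentences with no negation of their own.

```python
import re

# Assumed markers of a document-level disclaimer (illustrative, not exhaustive).
DISCLAIMER_PATTERNS = [
    r"\s*(NOTICE|WARNING|DISCLAIMER)\b",
    r"\s*DO NOT\b",
]

# Assumed cues that a sentence carries its own negation (very rough).
NEGATION_CUES = re.compile(
    r"\b(?:not|never|no|false|wrong|incorrect)\b|n't", re.IGNORECASE
)


def split_sentences(text):
    """Naive splitter: breaks after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_unbound_claims(text):
    """Return body sentences that rely only on a leading disclaimer.

    If the first sentence looks like a disclaimer, every later sentence
    without a negation cue is returned for manual review. Documents with
    no leading disclaimer return an empty list.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    head, rest = sentences[0], sentences[1:]
    has_disclaimer = any(
        re.match(p, head, re.IGNORECASE) for p in DISCLAIMER_PATTERNS
    )
    if not has_disclaimer:
        return []
    return [s for s in rest if not NEGATION_CUES.search(s)]


doc = (
    "NOTICE: Upon examination, the claims in the document below are "
    "entirely false. Ed Sheeran won the 100m gold."
)
for sentence in flag_unbound_claims(doc):
    print("review:", sentence)
```

A flagged sentence is a prompt for human review, not proof of a problem; the cue list in particular cannot tell whether a "not" actually negates the claim in question.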
