If your model "started behaving differently" overnight without a model change, it's likely tokenization drift. Same tokenizer, semantically identical input, completely different token sequences. A GPT-2 example from the MarkTechPost piece: `" classify"` with a leading space tokenizes to a single ID, `[36509]`; `"classify"` without the space tokenizes to two IDs, `[4871, 1958]`. To the model, those aren't the same word — they're different inputs landing in different regions of token space. Your prompt being "almost the same" as what the model saw during fine-tuning is not the same as being the same.
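To see the split for yourself, here is a minimal check, assuming the `transformers` library is installed; the IDs in the comments are the article's GPT-2 values and will differ for other tokenizers:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.encode(" classify"))  # [36509] per the article: one token with the leading space
print(tok.encode("classify"))   # [4871, 1958] per the article: "class" + "ify" without it
```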

The article quantifies the failure mode. Using GPT-2's tokenizer and a fine-tuned classifier, the SFT-aligned prompt format reaches around 83% accuracy. Strip the newlines and you drop to 40-50%. Reword the instruction and Jaccard token overlap with the training format falls to ~50%, pushing the input out-of-distribution and tanking accuracy. The fix proposed is APO (Automated Prompt Optimization): generate five or more prompt template variants, score each by the Jaccard overlap of its token set with the original SFT template's, discount validation accuracy with an OOD penalty that grows as that overlap falls, and pick the template that wins on the combined metric. The implementation is a few lines of HuggingFace `AutoTokenizer.encode()` plus set comparison.
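A sketch of that scoring loop. The templates and validation accuracies below are hypothetical stand-ins (loosely echoing the article's 83% vs. 40-50% numbers), and the multiplicative penalty is one simple choice; the article's exact formula may differ:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token-ID sets of two strings."""
    sa, sb = set(tok.encode(a)), set(tok.encode(b))
    return len(sa & sb) / len(sa | sb)

# Hypothetical candidate templates mapped to measured validation accuracy.
sft_template = "Instruction: classify the sentiment.\nText: {text}\nLabel:"
candidates = {
    "Instruction: classify the sentiment.\nText: {text}\nLabel:": 0.83,
    "Instruction: classify the sentiment. Text: {text} Label:": 0.45,  # newlines gone
    "What sentiment does this text express? {text}": 0.41,             # reworded
}

def combined(template: str) -> float:
    """Accuracy discounted by how far the token set drifts from the SFT format."""
    return candidates[template] * jaccard(template, sft_template)

best = max(candidates, key=combined)
print(best, combined(best))
```

Low-overlap variants get their accuracy discounted hardest, so the winner is the template that both performs well and stays close to the token distribution the model was tuned on.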

The bigger pattern is that production prompt regressions usually aren't model regressions — they're tokenizer-level mismatches between what the model learned during instruction tuning and what your application is now sending in. This is the kind of bug that survives every test you have unless you're explicitly checking token sequences against a reference distribution. It's also why "the same prompt" copy-pasted from a Notion doc to a code editor can quietly degrade — the editor normalized your whitespace, or stripped a trailing newline, and you're now a different distance from training data. RLHF and instruction-tuned models are extra sensitive because the formatting becomes part of the task representation, not just decoration around it.

What to do Monday: log raw token sequences for production prompts (compare token IDs, not strings) and diff them against the sequence of the prompt you validated at deploy time. Add a Jaccard-overlap regression test to your prompt CI pipeline, along the lines of the sketch below. If you run fine-tunes, save the exact tokenization of your training format and validate inference inputs against it. For builders on closed-weight models you can't retrain, the practical move is the article's APO loop: empirically search for prompt variants that maximize token overlap with formats the model has clearly seen before. The "is my prompt working" question becomes a tokenizer-level question, not a wording-level one.
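A minimal pytest-style version of that regression test, assuming you persist the deploy-time tokenization alongside the prompt. The `MIN_OVERLAP` threshold and the stand-in `current_prompt` are illustrative, not from the article:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Token IDs of the prompt as validated at deploy time; in a real pipeline
# you'd persist these in the repo rather than re-derive them on the fly.
REFERENCE_PROMPT = "Instruction: classify the sentiment.\nText: {text}\nLabel:"
REFERENCE_IDS = set(tok.encode(REFERENCE_PROMPT))
MIN_OVERLAP = 0.9  # illustrative threshold; tune against your own drift history

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def test_prompt_token_overlap():
    # In practice, import the live template from your application code here.
    current_prompt = REFERENCE_PROMPT  # stand-in for the production template
    overlap = jaccard(set(tok.encode(current_prompt)), REFERENCE_IDS)
    assert overlap >= MIN_OVERLAP, (
        f"token overlap {overlap:.2f} < {MIN_OVERLAP}: "
        "check for normalized whitespace or stripped newlines"
    )
```

The comparison happens on token IDs, so whitespace normalization that string-level diffs gloss over (a stripped trailing newline, tabs collapsed to spaces) shows up as a drop in overlap and fails the build.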