Vedant Jumle's writeup of his cross-script name retrieval system is the kind of small, focused research project that makes a real practical dent. The problem is mundane and important: when "Владимир Путин" is in a Cyrillic source and the watchlist is indexed in Latin script, classical fuzzy matchers such as Levenshtein distance, Double Metaphone, and BM25 fail badly. The performance gap between Latin-to-Latin retrieval and Latin-to-non-Latin retrieval on these baselines runs from 0.88 to 0.94 — meaning a system that flags a match within the same script misses the equivalent name across scripts almost completely. Sanctions screening, immigration databases, hospital record matching, and financial compliance pipelines all live with this failure mode every day.
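A toy illustration makes the failure concrete; this uses Python's `difflib` as a stand-in for an edit-distance-style matcher, not the author's actual baselines:

```python
# Character-level fuzzy matching collapses as soon as two strings use
# different scripts, because Latin "V" and Cyrillic "В" are unrelated
# codepoints even though they sound alike.
from difflib import SequenceMatcher

latin = "Vladimir Putin"
latin_typo = "Wladimir Putin"   # same-script spelling variant
cyrillic = "Владимир Путин"     # cross-script equivalent

same_script = SequenceMatcher(None, latin, latin_typo).ratio()
cross_script = SequenceMatcher(None, latin, cyrillic).ratio()

print(f"same-script similarity:  {same_script:.2f}")   # high
print(f"cross-script similarity: {cross_script:.2f}")  # near zero (only the space matches)
```

The same-script pair scores above 0.9 while the cross-script pair scores near zero, which is exactly the gap the baselines exhibit.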

The model is small and the architecture is conventional: a 4-million-parameter transformer encoder with six layers and 256 hidden dimensions, trained with InfoNCE contrastive loss and ANCE hard negative mining. The trick is the input. Instead of subword tokenization, which is brittle across writing systems with very different statistical structures, the encoder reads raw UTF-8 bytes — a 256-symbol alphabet that handles every script natively. There is no script-specific preprocessing and no separate tokenizer that has to be retrained when you add Hebrew or Hindi. Embeddings are unit-normalized so retrieval is cosine similarity, which means deployment is just an ANN index over precomputed vectors. The whole system fits in the same memory budget as the classical phonetic matchers it replaces.
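The input and retrieval mechanics are easy to sketch. In the snippet below the encoder is a stand-in (random unit vectors in place of real embeddings from the trained model), but the byte tokenization and cosine-via-dot-product retrieval are exactly the mechanics described:

```python
import numpy as np

def byte_tokenize(name: str) -> list[int]:
    """Raw UTF-8 bytes: a fixed 256-symbol vocabulary, no per-script tokenizer."""
    return list(name.encode("utf-8"))

# Every script maps into the same 0-255 alphabet with no preprocessing.
for name in ["Vladimir Putin", "Владимир Путин", "普京"]:
    assert all(0 <= b <= 255 for b in byte_tokenize(name))

def normalize(v: np.ndarray) -> np.ndarray:
    """Unit-normalize so cosine similarity reduces to a dot product."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
index = normalize(rng.normal(size=(1000, 256)))  # precomputed watchlist vectors
query = index[42]                                 # stand-in for an encoded query

scores = index @ query            # cosine similarity against the whole index
best = int(np.argmax(scores))     # in production, an ANN index replaces this scan
assert best == 42
```

The brute-force `argmax` is where an approximate-nearest-neighbor index (FAISS, HNSW, etc.) slots in at deployment time; nothing else changes.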

The training data construction is what makes the result believable. Jumle started with 119,040 person entities sampled from Wikidata, ran them through a four-stage synthetic-pair pipeline (phonetic Latin variants from Llama-3.1-8B, cross-script transliterations into eight scripts from Qwen3-30B), and merged with Wikidata's ground-truth name pairs to get 4.67 million positive pairs. The headline number is 0.775 MRR and 0.897 R@10 overall, and crucially the Latin-to-non-Latin gap collapses to 0.096 — an order of magnitude better than the classical baselines. Arabic, Russian, and Hebrew all clear 0.95 R@10. Chinese (0.666) and Korean (0.728) lag, which the writeup correctly attributes to genuine romanization ambiguity rather than model failure: there are multiple defensible romanizations of any given Hanzi or Hangul name, and the ground truth is sparser.
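For readers less familiar with the two headline metrics, they have standard definitions (this is the textbook computation, not the author's eval harness): MRR averages the reciprocal rank of the first correct hit, and R@10 is the fraction of queries whose correct entity lands in the top ten results.

```python
import numpy as np

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank; ranks are 1-based positions of the correct name."""
    return float(np.mean([1.0 / r for r in ranks]))

def recall_at_k(ranks: list[int], k: int = 10) -> float:
    """Fraction of queries whose correct name appears within the top k."""
    return float(np.mean([r <= k for r in ranks]))

ranks = [1, 1, 2, 5, 30]          # toy ranks for five hypothetical queries
print(round(mrr(ranks), 3))       # 0.547
print(recall_at_k(ranks, 10))     # 0.8
```

Note that MRR punishes rank 2 half as hard as rank 1, so a 0.775 MRR implies the correct name is usually at or very near the top, not just somewhere in the candidate list.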

The honest limitation Jumle flags is that 99.5% of the training data is LLM-generated and synthesized by transliterating outward from Latin, not by harvesting native-script spelling variation in the wild. That matters in production: a real sanctions screen has to match common misspellings, dialect variants, and historical romanization conventions that the synthetic pipeline never saw. The benchmark numbers are real, but the eval distribution is downstream of the same synthetic generator, which means the gap between benchmark and production is potentially larger than the headline suggests. For builders, the takeaway is threefold: byte-level encoders plus contrastive learning can crack problems that classical phonetic matching cannot; the architecture is small enough to run anywhere; and the synthetic-data shortcut is the right way to bootstrap when you do not have multilingual paired data. But production deployment still wants a real evaluation set drawn from your actual data distribution, not from the generator that trained the model. The repo is at github.com/vedant-jumle/cross-language-phonetic-text-alignment for anyone who wants to fine-tune on their own paired data.