Researchers from NUS, MIT CSAIL, A*STAR, and SMART released MEMO (arXiv 2605.15156), a modular framework that injects new knowledge into LLM-served applications without touching base model parameters or running a separate retrieval index. The setup is a Qwen2.5-14B-Instruct MEMORY model trained on the new corpus, talking to a frozen EXECUTIVE (Qwen2.5-32B-Instruct or Gemini-3-Flash) through a structured multi-turn protocol. On NarrativeQA, MEMO scores 53.58% against 23.21% for HippoRAG2; on MuSiQue and BrowseComp-Plus the differences are noise-level.

The interface is not cross-attention or adapters. It is conversational and runs in three stages: grounding (decompose the query into atomic sub-questions), entity identification across 7 interactions, and answer synthesis across 8 interactions. The MEMORY model returns compact natural-language snippets whose size is independent of corpus size. Training is supervised fine-tuning on question-answer pairs generated by a five-step synthesis pipeline; ablating cross-document synthesis drops accuracy from 24.00% to 6.37%, which suggests the synthesis step is doing the work rather than plain memorization of facts. Under retrieval-noise injection, MEMO accuracy moves +0.55% while HippoRAG2 drops 6.22%. The protocol is robust because there is no retrieval system to corrupt.
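To make the three-stage shape concrete, here is a minimal Python sketch of such a dialogue loop. It is not the authors' implementation: the function names, the `DECOMPOSE:` / `FOLLOWUP:` / `ANSWER:` prompt prefixes, the `DONE` stop signal, and the reading of "7" and "8" as per-stage turn budgets are all assumptions made for illustration, and the two models are stubbed as plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]


@dataclass
class Transcript:
    """Records every (question, snippet) exchange with the memory model."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def ask(self, memory: Model, question: str) -> str:
        snippet = memory(question)
        self.turns.append((question, snippet))
        return snippet

    def context(self) -> str:
        return "\n".join(snippet for _, snippet in self.turns)


def run_protocol(query: str, executive: Model, memory: Model,
                 entity_budget: int = 7, synthesis_budget: int = 8):
    """Assumed three-stage loop: grounding, entity identification, synthesis."""
    log = Transcript()

    # Stage 1 (grounding): the executive splits the query into sub-questions.
    plan = executive(f"DECOMPOSE: {query}")
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage 2 (entity identification): ask memory, capped by the turn budget.
    for question in sub_questions[:entity_budget]:
        log.ask(memory, question)

    # Stage 3 (answer synthesis): follow-ups until the executive says DONE
    # or the budget runs out.
    for _ in range(synthesis_budget):
        follow_up = executive(f"FOLLOWUP: {query}\n{log.context()}")
        if follow_up.strip() == "DONE":
            break
        log.ask(memory, follow_up)

    final = executive(f"ANSWER: {query}\n{log.context()}")
    return final, log


# Toy stubs so the sketch runs without any model server.
def toy_executive(prompt: str) -> str:
    if prompt.startswith("DECOMPOSE:"):
        return "Who wrote the novel?\nWhere was that author born?"
    if prompt.startswith("FOLLOWUP:"):
        return "DONE"
    # ANSWER prompt: first line is the query, remaining lines are snippets.
    return "answer built from %d snippets" % (len(prompt.splitlines()) - 1)


def toy_memory(question: str) -> str:
    return f"snippet for: {question}"
```

Because both models enter only as callables, replacing `toy_executive` or `toy_memory` with a different backend leaves `run_protocol` unchanged, which mirrors the decoupling the paper claims. The memory model only ever returns short text, so the executive's context grows with the number of turns, not with corpus size.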

The honest tradeoff for builders is latency. Fifteen-plus inference passes per query before the final answer is not free. RAG with a vector store costs one retrieval call plus one generation; MEMO costs the multi-turn dialogue between two models. The win cases are where you cannot fine-tune (cost, catastrophic forgetting, frozen vendor model) and cannot tolerate retrieval noise (legal, medical, domain corpora where one bad snippet poisons the answer). The architecture also decouples knowledge from capability: swap the executive without retraining memory, or swap memory without retraining the executive. That decoupling is the interesting structural claim, more than the single benchmark number.
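A back-of-envelope comparison shows how the turn count dominates. The sketch below assumes the turns run sequentially and that each turn costs one memory call plus one executive call; every per-call latency is a hypothetical placeholder, not a measurement from the paper. Only the 15-turn figure comes from the protocol description.

```python
def rag_latency_ms(retrieval_ms: int, generation_ms: int) -> int:
    """One retrieval call followed by one generation call."""
    return retrieval_ms + generation_ms


def memo_latency_ms(turns: int, memory_ms: int, executive_ms: int,
                    final_answer_ms: int) -> int:
    """Sequential dialogue: each turn is one memory call plus one executive
    call, followed by a single final-answer generation (an assumed cost model)."""
    return turns * (memory_ms + executive_ms) + final_answer_ms


# Hypothetical per-call latencies in milliseconds, chosen only for illustration.
rag = rag_latency_ms(retrieval_ms=50, generation_ms=1000)
memo = memo_latency_ms(turns=15, memory_ms=400, executive_ms=600,
                       final_answer_ms=1000)

print(f"RAG:  {rag} ms")
print(f"MEMO: {memo} ms ({memo / rag:.1f}x)")
```

Under these made-up numbers the dialogue is roughly fifteen times slower than single-shot RAG, and the ratio scales almost linearly with the number of turns. That is why compressing the protocol matters more than speeding up either model alone.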

If you ship domain-LLM applications Monday morning: MEMO is worth a read if you are already paying for RAG fragility or fine-tuning maintenance. The latency cost limits it to high-stakes-low-throughput use cases for now. Watch for whether the 15-turn protocol can be compressed; that is where this stops being a research curiosity and becomes a production option.