Baidu's new OCR model reads a whole book in one pass without its memory blowing up, Zubnet AI News

Baidu has open-sourced Unlimited-OCR, a 3-billion-parameter document model whose headline feature is not just how accurately it reads but how it handles length. It can take a 40-page PDF, or in the demos an entire book, and parse it in a single pass while keeping its memory footprint flat. The model is MIT licensed and, with only about 500 million of its parameters active at a time in its mixture-of-experts design, small enough to run on your own hardware.

The reason that matters takes a second to unpack. In the autoregressive decoders that document models use, the expensive part of going long is the KV cache, the running memory of keys and values the model keeps for every token it has already processed. Normally that cache grows linearly with sequence length, so the longer the document, the more memory and latency it costs, and very long documents either get chopped into pieces or become impractical. Holding that cache flat is what lets a single pass over a whole book stay cheap.
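To make the linear growth concrete, here is a back-of-the-envelope sketch in Python. The layer count, head count, head size, and 16-bit precision are illustrative assumptions, not Unlimited-OCR's published configuration; the only point is that every additional token adds the same fixed number of bytes.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard (uncompressed) KV cache for one sequence.

    Each token stores one key vector and one value vector (the factor 2)
    in every layer, so the total is proportional to seq_len.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Hypothetical decoder shape, chosen only for illustration.
CONFIG = dict(n_layers=12, n_kv_heads=8, head_dim=128)

for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens, **CONFIG) / 2**20
    print(f"{tokens:>7} tokens -> {mib:8.1f} MiB")
```

Under these assumed numbers, ten times the tokens means ten times the cache, which is the cost the article describes.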

Baidu's mechanism for that is an attention scheme it calls Reference Sliding Window Attention, or R-SWA, which takes the cache from growing linearly with output length to a constant size. The idea is a split: the model can always see the full reference, meaning the document's visual tokens and the prompt, but on the output side the decoder retains only the most recent 128 generated tokens as its working memory. So no matter how many pages it has produced, the memory it carries forward does not grow. The model is built on DeepSeek-OCR's DeepEncoder, which cascades a SAM-ViT with a CLIP-ViT and applies 16x token compression, turning a 1024 by 1024 page into just 256 visual tokens before the model ever starts reading.
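The split can be sketched in a few lines of Python. This is a reading of the description above, not Baidu's implementation: the function names are hypothetical, the 16-pixel patch size is inferred from the stated 16x compression and 256-token result, and whether the 128-token window counts the token currently being generated is an assumption.

```python
def visual_tokens(side_px=1024, patch_px=16, compression=16):
    """Visual tokens for one square page: patch grid, then token compression."""
    patches = (side_px // patch_px) ** 2   # 64 * 64 = 4096 patches (assumed patch size)
    return patches // compression          # 4096 / 16 = 256 tokens


def cached_entries(n_reference, n_generated, window=128):
    """Entries kept in the cache: the whole reference plus a capped output window."""
    return n_reference + min(n_generated, window)


def visible_positions(n_reference, n_generated, window=128):
    """Positions the next output token may attend to.

    Reference tokens occupy positions 0..n_reference-1 and stay visible;
    generated tokens follow, and only the last `window` of them are kept.
    """
    reference = list(range(n_reference))
    start = n_reference + max(0, n_generated - window)
    recent = list(range(start, n_reference + n_generated))
    return reference + recent


ref = visual_tokens()  # one page; prompt tokens are omitted for simplicity
for generated in (10, 128, 5_000, 500_000):
    print(generated, cached_entries(ref, generated))
```

Once the output passes 128 tokens, the count stops changing: in this sketch a one-page reference never holds more than 256 + 128 = 384 entries, however long the transcript gets.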

The numbers back the design. On the OmniDocBench v1.6 benchmark, Unlimited-OCR posts a 93.92 percent total score, which Baidu reports as a new state of the art. On Baidu's own long-document test set, 20-page documents parsed in one pass land at an edit distance of 0.0572 (lower is better), and even documents of 40-plus pages stay usable at 0.1069. The more telling chart is latency: where DeepSeek-OCR's per-call time climbs as it decodes, with spikes at alignment boundaries, Unlimited-OCR's stays flat regardless of sequence length. By Baidu's account it beats DeepSeek-OCR outright, which is notable given that it is built on DeepSeek-OCR's own encoder.
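For readers unfamiliar with the metric, the sketch below shows one common way an OCR edit distance is computed: Levenshtein distance between the model's output and the reference text, divided by the longer length so the score falls between 0 and 1. The article does not say which normalization Baidu uses, so treat this as an illustration of the idea rather than the benchmark's exact scoring code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))          # distance from empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                           # deleting the first i chars of a
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or free match
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(predicted, reference):
    """Edit distance scaled to [0, 1]; 0.0 means a perfect transcription."""
    if not predicted and not reference:
        return 0.0
    return levenshtein(predicted, reference) / max(len(predicted), len(reference))


print(normalized_edit_distance("Unlimited-0CR", "Unlimited-OCR"))
```

On this scale, a score near 0.06 means roughly six character edits per hundred characters of the longer text, which gives a feel for what the reported figures imply.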

The reason to care goes back to where documents actually live. Most of the useful data inside companies sits in long PDFs, contracts, manuals, and scanned books, and feeding those into a retrieval system has meant either paying a growing memory tax or breaking them into fragments that lose context. A model that parses an entire long document in one pass, at constant memory, and that you can self-host under an MIT license, is aimed squarely at that ingestion problem. The honest caveats hold: OCR benchmarks measure a narrow slice; the hard real-world test is messy scans, dense tables, and handwriting, where scores fall; and leaning on DeepSeek-OCR's encoder means the gains are an architecture refinement rather than a clean-sheet design. But a constant cache for long-document parsing is the right kind of idea, the sort that quietly makes the rest of the document AI stack cheaper to run.
