The OCR Speed Problem Nobody Talks About

Written by aimodels44 | Published 2026/04/06
Tech Story Tags: ai | software-engineering | performance | technology | mineru-diffusion | document-ocr | inverse-rendering | diffusion-decoding

TL;DR: MinerU-Diffusion reframes OCR as inverse rendering, using parallel diffusion decoding to cut latency and reduce sequential error propagation.

The speed problem nobody talks about

Document scanning technology has become so good at extracting text that we've overlooked something fundamental about the task itself. Current OCR systems use vision-language models to read documents, and these models have grown remarkably powerful. Yet they all share one constraint: they generate output left-to-right, token by token, like typing on a typewriter. For long documents, this sequential bottleneck creates two problems that compound each other.

The first is latency. When a model generates text sequentially, it must wait for each token before computing the next. For a page-long document, this means hundreds of serial steps. The second problem is more insidious: error propagation. If the model makes a mistake predicting the 50th token, that error influences predictions for tokens 51 through 500. There's no going back. By the time you reach the end of a long document, errors have cascaded through the entire output.
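To make the bottleneck concrete, here is a toy sketch of autoregressive decoding; the `predict_next` placeholder stands in for a real vision-language model, and the point is simply that latency scales with one serial forward pass per output token:

```python
# Toy illustration of the sequential bottleneck: an autoregressive decoder
# needs one forward pass per output token, so a 500-token page costs 500
# serial steps. `predict_next` is a placeholder for a real VLM.

def autoregressive_decode(image, seq_len, predict_next):
    """Generate tokens one at a time; each step depends on all previous ones."""
    tokens, passes = [], 0
    for _ in range(seq_len):
        tokens.append(predict_next(image, tokens))  # serial dependency
        passes += 1                                 # one forward pass per token
    return tokens, passes

def predict_next(image, prefix):
    # Placeholder: a real model would condition on the image and the prefix.
    # A mistake here would be baked into every later prediction.
    return len(prefix)

tokens, passes = autoregressive_decode("page.png", seq_len=500,
                                       predict_next=predict_next)
print(passes)  # 500 serial forward passes for a 500-token page
```

Because each call consumes the prefix produced by the previous calls, an early mistake changes the conditioning for everything downstream, which is exactly the error-propagation problem described above.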

These aren't just engineering annoyances. They reveal something about how we've framed the OCR problem itself. We treat document parsing as a sequential language task, but the input is fundamentally 2D: a spatial arrangement of text, tables, formulas, and layout elements frozen in an image. Why are we solving a 2D problem with 1D machinery?

Why left-to-right is a lie

This is where MinerU-Diffusion starts its argument. The researchers ask a deceptively simple question: is left-to-right generation actually required for OCR, or is it an artifact of how we decided to serialize the problem?

Consider what OCR really does. It recovers readable content from a visual scene. The text, layout, and structure already exist in the image as spatial relationships. A human reading a document doesn't process it left-to-right, top-to-bottom as a strict serialization; they scan regions, parse sections, and understand structure in parallel. Yet current models force a causal ordering on this inherently spatial task.

The insight here is conceptual rather than technical: autoregressive decoding isn't intrinsic to document parsing. It's how we choose to serialize a 2D problem into a 1D sequence of tokens. The constraint comes from our serialization, not from the nature of the task. That observation opens a possibility: what if we stopped serializing in the first place?

Overview of the document OCR inverse rendering process via different decoding methods. The model maps a 2D document image to a 1D token sequence; autoregressive approaches decode sequentially, while diffusion-based methods enable parallel decoding.

The framing here is deliberate: the paper reinterprets OCR as inverse rendering. Rendering takes a scene description and produces an image. Inverse rendering goes the opposite direction, recovering spatial structure from visual input. This reframing transforms how we think about decoding. Instead of asking "what word comes next," we ask "given the visual evidence in this image, what tokens belong at every position simultaneously?"

Parallel decoding through diffusion

Once you accept that OCR is a spatial problem, not a sequential one, the next question is mechanical: how do you decode in parallel?

Diffusion models provide a natural answer. Instead of generating tokens one-by-one from left to right, start with random tokens everywhere and iteratively refine them under visual guidance. In each denoising step, every position in the sequence gets updated at once based on the same visual context. No token waits for previous tokens; all positions benefit from the full image at every step.

Here's how it works in practice. The model begins with a random sequence of tokens, one for each position in the output. Then, in parallel, it predicts which tokens are likely wrong and replaces them with better ones. It repeats this process, refining the entire sequence across multiple steps until convergence. Because every position is processed together, you can run this in parallel on a GPU rather than waiting for each step to finish before computing the next.
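The refinement loop above can be sketched in the spirit of mask-predict decoding; the `denoise_step` function is a stand-in for the real model's parallel forward pass, and the unmasking schedule here is illustrative, not the paper's exact recipe:

```python
MASK = "<mask>"

def diffusion_decode(image, seq_len, denoise_step, num_steps=8):
    """Start fully masked; each step re-predicts every position in parallel,
    then keeps the most confident predictions and re-masks the rest."""
    seq = [MASK] * seq_len
    for step in range(num_steps):
        preds, confs = denoise_step(image, seq)        # all positions at once
        keep = int(seq_len * (step + 1) / num_steps)   # linear unmask schedule
        order = sorted(range(seq_len), key=lambda i: -confs[i])
        seq = [MASK] * seq_len
        for i in order[:keep]:                         # keep the confident ones
            seq[i] = preds[i]
    return seq

def toy_denoise_step(image, seq):
    # Placeholder: a real model would predict a token and a confidence per
    # position from the image in a single parallel forward pass.
    preds = [f"tok{i}" for i in range(len(seq))]
    confs = [1.0 - i / len(seq) for i in range(len(seq))]
    return preds, confs

out = diffusion_decode("page.png", seq_len=16,
                       denoise_step=toy_denoise_step, num_steps=4)
```

The key contrast with the autoregressive loop is that `denoise_step` sees the whole image and the whole (partially masked) sequence on every call, so no position has to wait for its left neighbors.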

Training of MinerU-Diffusion. Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block masking strategy applied during training.

The training process mirrors this logic. The model is given a partially corrupted version of the ground-truth token sequence (random positions are masked out) and learns to predict the masked positions based on visual input and the remaining visible tokens. This is fundamentally different from autoregressive training, where the model only sees tokens generated by previous steps. Here, during training, it sees the full context and learns to make predictions in parallel.
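A minimal sketch of how such a masked training example could be constructed (the `MASK_ID` sentinel and the exact masking ratio are assumptions for illustration):

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked position

def make_training_example(target_tokens, mask_ratio, rng):
    """Randomly mask positions in the ground-truth sequence; the model is
    trained to predict only the masked tokens, given the image and the
    visible tokens as bidirectional context."""
    n = len(target_tokens)
    k = max(1, int(n * mask_ratio))
    masked_positions = rng.sample(range(n), k)
    inputs = list(target_tokens)
    for i in masked_positions:
        inputs[i] = MASK_ID
    labels = {i: target_tokens[i] for i in masked_positions}
    return inputs, labels

rng = random.Random(0)
inputs, labels = make_training_example(list(range(10)), mask_ratio=0.3, rng=rng)
```

Unlike teacher-forced autoregressive training, the loss is computed only at masked positions, and the visible context surrounds each target on both sides rather than only to its left.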

This approach connects to a broader shift in how we think about sequence models. Work on discrete diffusion for OCR has shown that diffusion can match or exceed autoregressive performance for document understanding. The insight that diffusion enables better utilization of visual context aligns with emerging understanding about how these models extract information from images.

Making it stable: the curriculum learning story

There's a catch. Training a parallel diffusion model to be as accurate as sequential baselines isn't straightforward. If you start by asking the model to predict 50% of tokens in parallel with no training ramp-up, it can fail, learn shortcuts, or converge to unstable solutions. The researchers solve this with a multi-stage curriculum learning approach.

The idea is to start simple and increase difficulty gradually. In the first stage, the model predicts a tiny fraction of tokens in isolation. Once it masters that, the second stage increases the fraction and begins applying block-wise masking, where tokens are masked in contiguous regions rather than scattered positions. This mirrors the spatial structure of the task: text appears in blocks, tables, and regions.

Comparison of training dynamics across different curriculum strategies. The two-stage framework converges more smoothly and achieves higher final accuracy than baselines that skip the gradual difficulty ramp.

The full recipe involves four stages of training, each with different masking ratios and strategies. Early stages focus on local token prediction, while later stages enable the model to handle complex interactions across the entire sequence. This isn't just an implementation detail; it's crucial because without it, the parallel approach can produce hallucinations or repetitions, especially in structured elements like tables.
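The stage structure could be sketched as follows; note that the masking ratios and stage boundaries below are purely illustrative, since the article does not give the paper's actual values, and `MASK_ID` is a hypothetical sentinel:

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked position

# Illustrative curriculum only: the real ratios and stage names' settings
# are not specified here. Difficulty (mask ratio) rises, and masking shifts
# from scattered positions to contiguous blocks.
CURRICULUM = [
    {"name": "stage0a", "mask_ratio": 0.10, "block": False},
    {"name": "stage0b", "mask_ratio": 0.25, "block": False},
    {"name": "stage1",  "mask_ratio": 0.50, "block": True},
    {"name": "stage2",  "mask_ratio": 0.75, "block": True},
]

def apply_mask(tokens, mask_ratio, block, rng):
    """Scattered masking in early stages; contiguous block masking in later
    ones, mirroring how text appears in blocks, tables, and regions."""
    n = len(tokens)
    k = max(1, int(n * mask_ratio))
    out = list(tokens)
    if block:
        start = rng.randrange(n - k + 1)         # one contiguous span
        positions = range(start, start + k)
    else:
        positions = rng.sample(range(n), k)      # scattered positions
    for i in positions:
        out[i] = MASK_ID
    return out

rng = random.Random(1)
blocked = apply_mask(list(range(20)), mask_ratio=0.5, block=True, rng=rng)
scattered = apply_mask(list(range(20)), mask_ratio=0.25, block=False, rng=rng)
```

The design intuition is that predicting a contiguous masked block forces the model to reconstruct whole spatial regions (a table row, a formula) rather than filling isolated gaps from nearby context.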

Training dynamics across the four stages in the training recipe: Stage 0a, Stage 0b, Stage 1, and Stage 2. For each stage, both the loss and the gradient norm are reported; each stage progressively increases the difficulty of the parallel prediction task.

The curriculum learning approach reveals something important: making parallel decoding work isn't just about having a clever algorithm. It requires rethinking how models learn. By starting simple and building toward complex joint predictions, the model learns stable, compositional solutions rather than memorized shortcuts.

The real test: when layout breaks down

Until now, the implicit assumption has been that documents are well-structured or that layout information guides the model. But here's a harder test: what if you scramble the semantic content while keeping it spatially coherent?

The paper introduces the Semantic Shuffle benchmark. Take a document and randomly permute its words while preserving their visual positions. A model that relies on language priors and sequential context should struggle badly. Why? Because left-to-right reading habits and linguistic expectations are now contradicted by the visual signal. A truly visual model, one that actually decodes based on what it sees in the image, should handle shuffled semantics better because it's reading the spatial layout, not predicting what words should come next.
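The benchmark construction can be approximated with a simple word-level permutation; this is a sketch of the idea (permuting a fraction of words while every word stays in some original visual slot), not the paper's exact generation procedure:

```python
import random

def semantic_shuffle(words, shuffle_level, rng):
    """Permute a fraction of the words among themselves while keeping every
    word in one of the original visual positions: the layout stays coherent,
    but left-to-right language priors are broken."""
    n = len(words)
    k = int(n * shuffle_level)
    idx = rng.sample(range(n), k)        # positions to scramble
    vals = [words[i] for i in idx]
    rng.shuffle(vals)                    # permute only those words
    out = list(words)
    for i, v in zip(idx, vals):
        out[i] = v
    return out

rng = random.Random(0)
words = ["the", "model", "reads", "each", "word", "from", "its", "position"]
shuffled = semantic_shuffle(words, shuffle_level=0.5, rng=rng)
```

At `shuffle_level=0` the document is untouched; at higher levels, a model that leans on "what word should come next" gets actively misled, while a model that decodes from pixels at each position is unaffected in principle.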

Semantic Shuffle benchmark results across increasing levels of semantic distortion. MinerU-Diffusion maintains higher accuracy than baselines as shuffle levels increase, indicating stronger visual rather than linguistic grounding.

This benchmark is powerful because it tests the core claim directly. If MinerU-Diffusion were just faster autoregression, it should fail on shuffled documents just as much as sequential models do. But the results show it handles scrambled semantics significantly better. This is evidence that the shift to parallel diffusion decoding fundamentally changes how the model relates to visual information. It's not predicting the next word; it's reading the spatial scene.

Speed and accuracy in the same breath

The initial motivation was speed, and MinerU-Diffusion delivers. The method achieves up to 3.26x speedup compared to sequential baselines. But the real achievement is that it does this without sacrificing accuracy.

The speedup comes from parallelism, not from using a smaller or lower-fidelity model. Every position is decoded with equal access to the full visual context. There's no early-bird advantage where the first tokens are higher quality. There's no sequential error propagation where mistakes early in the sequence influence later predictions. The model either gets a position right or wrong based on the visual evidence and the surrounding context, computed in parallel.

The confidence threshold controls decoding parallelism in MinerU-Diffusion. Compared to MinerU2.5, the method achieves up to 3.26x speedup while maintaining a strong accuracy-efficiency tradeoff: higher thresholds enable more aggressive parallelism at accuracy competitive with sequential baselines.

A subtle but important detail: the model can stop denoising early if it becomes confident in its predictions. A confidence threshold controls how many refinement steps to run per position. At high thresholds, many positions converge quickly and skip remaining steps. This creates a natural tradeoff between speed and accuracy, letting users dial in the right operating point for their use case.

Visualizing the accuracy-throughput trade-off of different models across different OCR tasks under the ground-truth layout setting. MinerU-Diffusion dominates the efficiency frontier, achieving higher accuracy at faster speeds than alternatives.

The comparison is stark. Autoregressive models face a hard constraint: they must complete every forward pass sequentially. Diffusion-based decoding, by contrast, can exploit the structure of the task. Hard problems (complex layouts, rare characters) get more refinement steps. Easy problems (common words in clear regions) resolve quickly. The parallel framework naturally allocates computation where it's needed.

This connects to broader work on inverse rendering and unifying rendering pipelines, which shows that viewing problems from a rendering perspective often yields elegant, efficient solutions. The same principle applies here: by treating OCR as inverse rendering, we unlock parallel computation that sequential approaches cannot access.

The deeper insight

The key takeaway isn't about diffusion models or GPU efficiency, though both matter. It's about how reframing a problem can unlock solutions that seemed impossible under the old frame. OCR looked like a sequential task because we serialized it. But the task itself, recovering readable text from an image, is fundamentally spatial. The document's structure exists in 2D; the rendering happens in 2D; the recovery of that structure should happen in 2D as well.

By inverting the rendering problem and using parallel diffusion decoding, MinerU-Diffusion achieves both speed and robustness. The speed comes from parallelism. The robustness comes from visual grounding rather than linguistic shortcuts. These aren't separate achievements. They flow from a single insight: stop forcing spatial problems into sequential solutions.

This is a Plain English Papers summary of a research paper called MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.



Published by HackerNoon on 2026/04/06