The blind spot nobody noticed
Video models have become astonishingly capable. Sora and its peers can generate spatiotemporally coherent video sequences that look photorealistic, maintain object continuity across frames, and respect basic physical constraints. By conventional measures, they're superhuman at video production.
But there's a gap nobody has been measuring systematically. Can these models actually reason about what's happening in a video? Can they understand causality, spatial relationships, how objects interact, why certain outcomes follow from certain actions? Or are they just pattern-matching at superhuman scale, replicating visual texture without grasping the underlying structure?
The distinction matters. A model might generate a flawless video of a cup falling and breaking while fundamentally misunderstanding gravity, momentum, or fragility. It might produce spatiotemporally perfect sequences while reasoning about them in ways that would fail immediately on variations it hasn't seen before. The current state of video modeling research has optimized for what's easy to measure, not what matters.
This measurement blind spot exists because existing video reasoning benchmarks are tiny. A few thousand samples spread across a handful of task types, rarely exceeding 50 distinct reasoning problems. You can't study scaling behavior on datasets that small. You can't distinguish between genuine understanding and pattern memorization. You can't watch reasoning abilities emerge as models grow larger and more sophisticated.
The result: we're building increasingly capable video models while remaining almost entirely ignorant about whether they're actually reasoning about the spatiotemporal world or just performing statistical compression on visual data at superhuman fidelity.
Rethinking how to measure reasoning
Before building a dataset, researchers need to ask a prior question: what exactly should we measure?
This is where conventional benchmarking approaches break down. Most video datasets throw mixed tasks at models without understanding what cognitive abilities each task targets. There's no underlying theory of what "video reasoning" actually consists of, so there's no principled way to know whether you're measuring the right things or just chasing whatever scores highest on your metric.
VBVR approaches this differently by grounding the entire framework in cognitive science rather than convenience. The reasoning tasks are organized around five core cognitive faculties that humans use when reasoning about spatiotemporal sequences: spatiality (understanding positions, directions, and spatial relationships), transformation (tracking how objects and scenes change), knowledge (applying real-world physical and semantic understanding), abstraction (recognizing patterns and generalizing across instances), and perception (basic visual understanding of properties and changes).
This taxonomy isn't arbitrary philosophy. It's a diagnostic tool. When a model fails at video reasoning, the framework tells you exactly what capability failed. Is the model struggling with the physics of how objects move? That's a knowledge failure. Can't track where an object goes off-screen? That's a transformation problem. Can't recognize that the same pattern applies to different visual instances? That's abstraction.
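To make the diagnostic idea concrete, here is a minimal sketch of how failure tags could be rolled up into per-faculty counts. The tag names and the `diagnose` helper are illustrative assumptions, not part of the VBVR toolkit:

```python
# Hypothetical failure tags mapped to the five VBVR faculties.
# The tag vocabulary here is invented for illustration.
FAULT_TO_FACULTY = {
    "wrong_object_position": "spatiality",
    "lost_track_offscreen": "transformation",
    "violates_gravity": "knowledge",
    "pattern_not_transferred": "abstraction",
    "misread_color": "perception",
}

def diagnose(failure_tags):
    """Aggregate a list of tagged errors into per-faculty failure counts."""
    counts = {faculty: 0 for faculty in set(FAULT_TO_FACULTY.values())}
    for tag in failure_tags:
        counts[FAULT_TO_FACULTY[tag]] += 1
    return counts

report = diagnose(["violates_gravity", "violates_gravity", "misread_color"])
```

A report like `{"knowledge": 2, "perception": 1, ...}` points directly at the capability that failed, rather than leaving you with an undifferentiated low score.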
This reorientation, from "measure video quality" to "measure reasoning through cognitive faculties," shifts the entire enterprise. You're no longer asking "how visually realistic is this output?" You're asking "what does this model actually understand about how the world works?"
Overview of VBVR task organization. The grid shows representative tasks spanning the cognitive architecture, color-coded by capability: spatiality (blue), transformation (red), knowledge (purple), abstraction (yellow), and perception (green). Each cell represents core reasoning problems that recur throughout human cognition.
Building the dataset without losing your mind
Creating one million unique, meaningful videos by hand would take decades. The innovation that makes VBVR tractable is conceptually simple: parameterized task generators instead of curated clips.
Think of it as a recipe book with parameters. A task design might specify: "Place shape X at position Y, then rotate it Z degrees while changing its color." That's one reasoning problem parameterized. Run that generator a million times with different random values for X, Y, Z, and you get a million unique videos that all test the same underlying reasoning concept. No two videos are identical, yet they're all targeting the same cognitive faculty.
This approach solves multiple problems simultaneously. It prevents overfitting to specific visual patterns (models can't memorize what they haven't seen). It ensures every task variant still tests the same reasoning faculty (the cognitive architecture stays consistent). It makes scaling tractable because software scales better than human annotation. And it enables controlled evaluation by systematically varying parameters to test out-of-distribution generalization.
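The "recipe book with parameters" idea can be sketched in a few lines. This is a toy reconstruction, not the actual VBVR generator: the shape and color vocabularies, the `TaskSpec` fields, and the symbolic frame format are all assumptions; a real pipeline would rasterize these descriptions into pixels:

```python
import random
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """One sampled instance of a 'rotate while recoloring' task family."""
    shape: str
    position: tuple
    rotation_deg: int
    start_color: str
    end_color: str

SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue", "yellow"]

def sample_task(rng: random.Random) -> TaskSpec:
    """Draw random parameters X, Y, Z for one unique task instance."""
    start, end = rng.sample(COLORS, 2)  # two distinct colors
    return TaskSpec(
        shape=rng.choice(SHAPES),
        position=(rng.randint(0, 9), rng.randint(0, 9)),
        rotation_deg=rng.choice([90, 180, 270]),
        start_color=start,
        end_color=end,
    )

def render_frames(spec: TaskSpec, n_frames: int = 4):
    """Emit symbolic frame descriptions interpolating rotation and color."""
    frames = []
    for t in range(n_frames):
        alpha = t / (n_frames - 1)
        frames.append({
            "shape": spec.shape,
            "position": spec.position,
            "rotation": round(spec.rotation_deg * alpha),
            "color": spec.start_color if alpha < 0.5 else spec.end_color,
        })
    return frames

rng = random.Random(0)
clips = [render_frames(sample_task(rng)) for _ in range(3)]
```

Every run of `sample_task` yields a distinct clip, yet all of them probe the same reasoning concept; holding some parameters out of the training distribution gives you a controlled out-of-distribution test for free.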
The result is 1 million video clips spanning 200 core task designs, each grounded in the five cognitive faculties. Figure 7 shows how the 150 visual reasoning tasks distribute across these faculties. That's three orders of magnitude larger than prior datasets, which means researchers can finally study how reasoning abilities scale and whether models develop genuine understanding versus memorizing patterns.
Distribution of 150 visual reasoning tasks across five cognitive faculties in the VBVR dataset.
The technical architecture supporting this scale is equally important. Figure 3 shows the pipeline: task designs implemented as parameterized generators, executed at scale via distributed workers writing to centralized storage. This infrastructure isn't just engineering; it's what makes a million coherent, meaningful videos feasible.
Task designs grounded in cognitive architecture are implemented as parameterized generators, then executed at scale via distributed Lambda workers writing to centralized storage.
Judging without bias
A million videos means nothing if you can't evaluate them fairly. This is where most benchmarking approaches stumble.
The standard practice in modern evaluation is to use language models as judges. Ask GPT-4 whether a model's reasoning is correct, and aggregate the answers. The problem is opacity. Language models are black boxes. You don't know why a response was scored high or low. They hallucinate, show subtle biases, and those biases compound across a million evaluations into systematic errors that distort what you think you're measuring.
VBVR-Bench abandons the black-box judge entirely. Instead, it uses verifiable evaluation: rule-based scoring where ground truth is deterministic (symbolic reasoning problems with objective right answers, spatial relationships with measurable correctness), combined with human-aligned scorers that are interpretable and reproducible. This matters practically because it makes diagnosis possible. When a model scores low, you understand why. When it improves, you can pinpoint which capabilities improved.
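A rule-based scorer can be sketched as a checklist of deterministic criteria, each traceable to an explicit rule. This is a minimal illustration of the idea, not the VBVR-Bench scorer; the criterion names and answer schema are invented:

```python
def score_spatial_answer(ground_truth: dict, answer: dict) -> dict:
    """Deterministic, per-criterion scoring: every point maps to a named rule."""
    criteria = {
        "object_identified": answer.get("object") == ground_truth["object"],
        "direction_correct": answer.get("direction") == ground_truth["direction"],
        "final_position_correct": (
            answer.get("final_position") == ground_truth["final_position"]
        ),
    }
    return {
        "criteria": criteria,  # per-rule breakdown makes failures diagnosable
        "score": sum(criteria.values()) / len(criteria),
    }

truth = {"object": "red cube", "direction": "left", "final_position": (2, 5)}
good = {"object": "red cube", "direction": "left", "final_position": (2, 5)}
partial = {"object": "red cube", "direction": "right", "final_position": (2, 5)}
```

Unlike a black-box LLM judge, a low score here comes with its own explanation: the `criteria` dict shows exactly which rule failed, and rerunning the scorer always yields the same result.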
Does this objectivity cost accuracy? Figure 4 validates that rule-based scoring actually matches human judgment, so this isn't a compromise between objectivity and accuracy. It's better on both dimensions.
Human alignment analysis for VBVR-Bench. The red dots represent perfect agreement between human preference and VBVR-Bench scoring. Across all splits, evaluations closely match human perception.
This shift from black-box judging to interpretable, verifiable evaluation is quieter than it sounds, but it's revolutionary. It means benchmark results become diagnostic tools instead of just performance numbers.
What we actually learned
Now that VBVR exists, researchers can finally ask the questions that matter: do video reasoning abilities actually scale with model size? Do models generalize to unseen reasoning tasks? What is the structure of reasoning capability?
The scaling study tested nine models of different sizes and architectures across all 200 task types. The first finding is striking: models show early signs of emergent generalization. They transfer reasoning to unseen task families, suggesting they're learning something more robust than pattern matching. This isn't guaranteed. A model could learn spurious statistical correlations without developing generalizable reasoning principles. But these models actually seem to be developing the latter.
The structure of reasoning capability is non-uniform. Figure 5 shows capability correlation across models, with the general strength factor regressed out to reveal structural dependencies. Not all five cognitive faculties develop in parallel. Some are foundational, feeding into others. Some are more specialized. This correlation structure tells us which capabilities are prerequisites for which, similar to how human cognitive development follows a progression.
Residualized capability correlation among five cognitive faculties across 9 models. General model strength has been regressed out to highlight structural dependencies. The pattern shows which reasoning abilities are prerequisites for which.
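"Regressing out the general strength factor" means removing each model's overall competence before correlating faculties, so that what remains reflects structural dependencies rather than the trivial fact that stronger models score higher at everything. Here is a stdlib-only sketch of that procedure; the nine per-model scores are made-up numbers for illustration, not results from the paper:

```python
from statistics import mean

def residualize(y, x):
    """Residuals of y after ordinary least-squares regression on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - (my + beta * (xi - mx)) for xi, yi in zip(x, y)]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a)
           * sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

# Illustrative faculty scores for 9 models (invented numbers).
spatiality  = [0.42, 0.48, 0.55, 0.51, 0.63, 0.60, 0.71, 0.68, 0.77]
knowledge   = [0.25, 0.31, 0.30, 0.38, 0.36, 0.45, 0.44, 0.52, 0.55]
abstraction = [0.30, 0.33, 0.41, 0.39, 0.47, 0.52, 0.55, 0.58, 0.66]

# Proxy for general model strength: mean score across faculties.
general = [mean(vals) for vals in zip(spatiality, knowledge, abstraction)]

raw_r = pearson(spatiality, abstraction)
resid_r = pearson(residualize(spatiality, general),
                  residualize(abstraction, general))
```

The raw correlation is inflated by overall model strength; the residualized correlation is the quantity a figure like Figure 5 plots, since it isolates whether two faculties co-vary beyond what general competence predicts.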
Figure 9 reveals another key insight: performance varies dramatically across different domains. Some reasoning types remain far harder for current models than others.
Domain-wise score distributions across 9 models. The red dashed line separates baseline models from VBVR-Wan2.2, revealing which reasoning types remain most challenging.
Qualitative comparisons on held-out out-of-distribution test sets show this isn't a toy benchmark. Models genuinely struggle with generalization. Figure 6 presents side-by-side comparisons between different models on controllable-execution tasks, showing systematic failures that reveal real gaps in spatiotemporal reasoning rather than cherry-picked outliers.
Qualitative comparison on held-out out-of-distribution task families. Models show systematic gaps in reasoning about object interactions and spatial transformations.
Interpreting the results
The scaling curves and correlation structures reveal something important: video reasoning doesn't emerge as a monolithic capability. The five cognitive faculties develop on different trajectories. Perception might saturate quickly while knowledge reasoning continues improving. Spatiality might be foundational while abstraction builds on top of it.
Models show emergent generalization, but not uniformly. Some types of reasoning generalize well to unseen tasks. Others remain brittle, breaking down when parameters change or new task variants appear. This variation points toward bottlenecks. Knowledge reasoning is harder for current models than perception. Abstraction lags behind transformation. These aren't random weaknesses; they're signals about what architectural or training innovations would matter most.
The structure revealed by Figure 5 also suggests that video reasoning follows a development sequence, like human learning. You can't reason about abstract patterns in video if you can't track spatial positions and transformations. This dependency structure gives researchers a roadmap for what to optimize first.
Related work on evaluating video generation models for social reasoning and benchmarking video reasoning systems converges on the same insight: reasoning about video requires different evaluation frameworks than visual fidelity metrics. VBVR scales that insight to full generality across cognitive faculties.
Opening new doors
The public release of VBVR, including the dataset, evaluation toolkit, and trained models at video-reason.com, removes the barrier to entry. Researchers no longer need to build evaluation infrastructure from scratch. They can immediately start asking targeted questions about video reasoning.
Which cognitive faculties transfer between tasks most reliably? How does reasoning about causality develop as models scale? What architectural changes improve out-of-distribution generalization? What's the relationship between video reasoning ability and language understanding? These questions require a foundation, and that foundation now exists.
The next phase of video modeling research can move beyond "how good is the video quality?" to "what is the model actually reasoning about?" That's not a small shift. It reframes the entire enterprise from visual generation toward spatiotemporal understanding. It suggests that the bottleneck in video intelligence isn't perceptual fidelity but genuine reasoning about how the world works.
VBVR lays the infrastructure for that next stage. What comes next depends on what researchers ask and how they use this foundation.
This is a Plain English Papers summary of a research paper called A Very Big Video Reasoning Suite. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
