When we think about foundation models, we often picture large language models (LLMs) like ChatGPT. These LLMs have become synonymous with the promise of AI: tools that can reason, write, answer questions, and assist across domains, from composing emails to solving coding problems to handling customer support. They are inherently horizontal in nature, meaning they’re trained to generalize across a massive range of use cases.
This horizontal flexibility is possible because LLMs are trained on enormous, diverse corpora of text – spanning books, websites, codebases, articles, and conversation transcripts. The goal is to expose the model to as many linguistic patterns, topics, tones, and styles as possible. As a result, today’s LLMs are expected to operate in zero-shot mode: to understand and respond to prompts they’ve never seen before, without requiring further training or tuning.
But when we shift our attention to generative video models, the landscape changes dramatically. These models are diffusion models, a specific type of generative model that learns to create data by reversing a noise-adding process. Because of this training approach, these models aren't horizontal; they're deep. Their training data requirements are different, and they don't generalize across a wide range of use cases the way LLMs do. Instead, they are built to specialize. A model designed to generate cinematic b-roll isn't the same one that can generate synthetic human motion for animation or simulate robotic interaction in a 3D environment.
The Challenges of Zero-Shot for Video
In generative video, zero-shot generation often struggles with scene coherence and consistency – especially for complex multi-object scenarios. For the vast majority of today’s generative video models, a prompt such as:
“Make a 10-second clip of a man juggling on a beach with a dog running past.”
...is unlikely to produce a coherent, realistic scene unless the model has been purpose-built with data that reflects that kind of scenario. And a model that can handle it likely can't produce the same scene in a stylized animation style. Even state-of-the-art video models like OpenAI's Sora, Runway's Gen-3, or Luma's 3D scene generators rely on deeply specific training data aligned to their output goals. The ultimate aim is a versatile, world-level video model capable of generating any kind of video, and we're starting to see glimpses of this in releases like Moonvalley's Marey, but the starting point is still extensive volumes of rigorously curated video concepts.
And here’s the catch: most of the world’s audiovisual content isn’t suitable for this kind of training. You can’t just feed a model thousands of hours of TV shows, movies, or livestreams and expect it to learn what it needs. Depending on the intended use case, it’s likely that only a fraction of that raw content is actually useful for training.
The Data Curation Gap
These differences between LLMs and today's generative video models have massive implications for how we think about AI training data.
For LLMs trained on internet-scale data, the scaling laws have been clear: larger datasets typically yield greater accuracy in zero-shot scenarios. Only recently, with the rise of reasoning models, has there been increased emphasis on accuracy gains from more curated, precise datasets. Diffusion models, however, see fewer benefits from simply scaling up data, especially when trained on large volumes of generic video content. Instead, they achieve greater marginal returns from carefully curated clips of specific content types, with low-fidelity segments excluded. Long, motionless shots? Useless. Scenes with erratic cuts or inconsistent lighting? Actively harmful. Visuals captured in poor resolution or with irrelevant context? A waste of compute and training time.
Curating video data for model training is not just difficult, it’s a form of craftsmanship.
What It Takes to Make Video Data Model-Ready
Effective training for generative video models requires the following steps (sketched in code after this list):
Segmenting longform content into clean, discrete units (e.g., scenes)
Evaluating each scene’s visual quality using metrics such as DOVER to eliminate blurry, artifact-ridden, or unstable footage
Filtering scenes based on relevance to the model’s objective:
Talking-head shots for avatar/dubbing models
Complex body movement for 3D motion generation
Object interactions for robotics or simulation use cases
Structuring and aligning multimodal data such as video, audio, motion capture, and text annotations into tightly synchronized training samples
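To make these steps concrete, here is a minimal Python sketch of a single curation pass, under a few stated assumptions: it uses PySceneDetect as one possible segmentation tool, and the score_quality and is_relevant helpers (along with the quality threshold) are hypothetical placeholders standing in for a DOVER-style quality model and a use-case relevance classifier rather than any specific production pipeline.

```python
# Minimal sketch of one curation pass over a long-form video: segment it
# into scenes, score each scene's quality, keep only relevant clips, and
# write a manifest for downstream alignment and packing. PySceneDetect is
# one possible segmentation tool; score_quality() and is_relevant() are
# hypothetical placeholders, and the threshold is an assumed value.
import json

from scenedetect import ContentDetector, detect

QUALITY_THRESHOLD = 0.6  # assumed cutoff; tune against your quality metric


def score_quality(video_path: str, start_s: float, end_s: float) -> float:
    # Placeholder stub: always passes. Swap in a real video quality
    # assessment model (e.g., DOVER-style scoring) here.
    return 1.0


def is_relevant(video_path: str, start_s: float, end_s: float) -> bool:
    # Placeholder stub: always keeps the clip. Swap in a classifier tied to
    # the model's objective (talking heads, body motion, object interaction).
    return True


def curate(video_path: str, manifest_path: str) -> None:
    # 1. Segment long-form content into discrete scenes.
    scenes = detect(video_path, ContentDetector())

    kept = []
    for start, end in scenes:
        start_s, end_s = start.get_seconds(), end.get_seconds()
        # 2. Drop low-fidelity segments (blur, artifacts, unstable footage).
        if score_quality(video_path, start_s, end_s) < QUALITY_THRESHOLD:
            continue
        # 3. Keep only scenes relevant to the training objective.
        if not is_relevant(video_path, start_s, end_s):
            continue
        kept.append({"source": video_path, "start": start_s, "end": end_s})

    # 4. Emit a manifest that later alignment steps (audio, transcripts,
    #    motion capture) can join against before packing training samples.
    with open(manifest_path, "w") as f:
        json.dump(kept, f, indent=2)


if __name__ == "__main__":
    curate("feature_film.mp4", "curated_clips.json")
```

In practice, each kept clip would then be aligned with its audio, transcript, and any motion-capture or annotation streams before being packed into synchronized training samples.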
By the time you’ve processed an entire film or long-form video, only a small fraction—sometimes less than 10%—is actually usable for training a high-quality generative video model.
Why This Matters
This raises two critical truths:
The right data is exponentially more valuable than just “more” data.
In generative video, it’s not about volume, it’s about precision. Finding the right training clip is like finding a needle in a haystack.
The preprocessing and data engineering burden is massive.
Model builders must invest significant time and experimentation to transform raw footage into model-ready training datasets. And that requires deep expertise in video, AI, and multimodal alignment.
Our Philosophy: Curate with Purpose
At Protege, we believe the future of generative video depends on mastering the science of training data. We aren’t just aggregating video. We’re curating it with intention, structure, and clarity. Our pipeline is designed to separate signal from noise, cut straight to the content that matters, and deliver datasets that are optimized for model performance from day one.
Acknowledgements:
Special thanks to Engy Ziedan, Dave Davis, and Richard Ho for contributing to this piece.