Between the high valuation and the unique structure ($15 billion for a 49% stake plus hiring the CEO), this week's Meta-Scale AI deal turned heads.
I see this as the beginning of a war among foundational models over AI training data capabilities.
This makes sense: each foundational layer of AI infrastructure is undergoing a massive capital infusion and producing enormous winners. These are the core bottlenecks to AI’s progress:
Chips (NVIDIA is the trillion-dollar player)
Models (OpenAI and others are all aiming to be trillion-dollar players)
Electricity & data center capacity (hundreds of billions of dollars of investment in each category)
Training Data (no clear giant player/investment…yet)
The data market has been ignored as a trillion-dollar bottleneck, largely because the first phase of training data came from the open web and was therefore initially free. Scale AI has been the early leader in training data by moving from crawlable data to human-generated data, but this still only scratches the surface of the data AI needs.
With Meta making this move, the first major capital outlay around data has happened, and I believe it is just the first shot in a much broader war among foundational models as data is increasingly seen as a core bottleneck.
Where Scale AI Fits in the Data Map: The Four Approaches to Training Data
There are currently four approaches to gathering training data.
Approach 1: Crawling the Web
The oldest source of training material is the public Internet. Large language models have long feasted on massive web crawls and public databases. This has been an abundant, cheap way to get started, but it is ultimately limited. Publishers and creators are clamping down on automated scrapers, and authors are pushing back on how their work is used. Even if all of that information were available without copyright concerns, the open web holds only a small fraction of human knowledge, nowhere near enough to satiate LLMs’ hunger for data (I have written far more in my emails, texts, and documents than on the open web, and the same is true for every person and organization). And in nuanced areas (legal contracts, medical images, etc.), the amount of available public data is extremely small.
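For a concrete picture of what crawling looks like at its simplest, here is a minimal sketch of a polite single-page fetcher; the URL, user agent, and function names are illustrative placeholders rather than any real pipeline.

```python
# Minimal sketch of a "polite" crawler fetching public web text.
# The user agent and URL below are hypothetical placeholders.
import urllib.robotparser
from typing import Optional
from urllib.parse import urlparse

import requests

USER_AGENT = "example-training-data-bot/0.1"  # placeholder identifier


def allowed_by_robots(url: str) -> bool:
    """Check the site's robots.txt before fetching anything."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        return False  # if robots.txt is unreachable, err on the side of not crawling
    return parser.can_fetch(USER_AGENT, url)


def fetch_page(url: str) -> Optional[str]:
    """Fetch raw HTML for one page, but only if the site permits it."""
    if not allowed_by_robots(url):
        return None
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    html = fetch_page("https://example.com/")  # placeholder URL
    print("fetched" if html else "blocked by robots.txt")
```

Production crawlers add queueing, deduplication, rate limiting, and HTML-to-text extraction, but the robots.txt check above is exactly where the friction described here shows up: a growing share of sites now disallow this kind of access.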
Approach 2: Human-Generated Data
Faced with the limitations of public data, model builders have turned to human-generated data as a popular addition to training. Often, this comes in the form of data annotation. Machines struggle with raw, unstructured inputs; they need examples tagged by humans. Scale AI specializes in that: it combines algorithms with “human-in-the-loop” annotation to turn gigabytes of images or text into clean training sets. A self-driving car maker will pay workers to draw bounding boxes and labels on millions of street photos; a medical AI company will hire experts to highlight tumors, fractures, or other pathologies.
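To make annotation concrete, here is a minimal sketch of what one human-labeled example might look like for the self-driving case; the field names and values are hypothetical, not Scale AI's actual schema.

```python
# Hypothetical sketch of a single human-labeled training example for a
# self-driving dataset. Field names and values are illustrative only.
from dataclasses import dataclass


@dataclass
class BoundingBox:
    label: str          # e.g. "pedestrian" or "traffic_light"
    x_min: float        # pixel coordinates of the box corners
    y_min: float
    x_max: float
    y_max: float
    annotator_id: str   # the human worker who drew the box


@dataclass
class LabeledImage:
    image_path: str
    boxes: list[BoundingBox]


example = LabeledImage(
    image_path="frames/street_000123.jpg",  # placeholder path
    boxes=[
        BoundingBox("pedestrian", 412.0, 220.0, 455.0, 330.0, annotator_id="worker_42"),
        BoundingBox("traffic_light", 601.0, 90.0, 620.0, 140.0, annotator_id="worker_42"),
    ],
)
print(f"{example.image_path}: {len(example.boxes)} labeled objects")
```

Multiply a record like this by millions of images, each touched by a paid human, and the economics described below follow.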
Other datasets may be created specifically for AI, whether that’s code commented to explain how it tackles a very specific problem or recordings of scripted prompts read in different languages, accents, and emotional tones.
This human labor is expensive but critical, hence the avalanche of money pouring in. Meta’s check to Scale AI is in part a bet that owning this human data pipeline will pay dividends in future model improvements. But even with massive investment, the amount of data that foundational models can pay humans to generate is tiny compared to the amount of data the world has already produced. So Scale’s current approach won’t be sufficient to feed the massive need for data.
Approach 3: Synthetic Data
With public data drying up and human-generated data too expensive at scale, one proposed alternative has been synthetic data: having AI generate artificial data that looks like real data.
While this has been popular for specific applications (and works well for proofs of concept and early training of a new model), synthetic data cannot fundamentally power LLMs: training AI on the output of AI is circular and carries the risk of model collapse. So while synthetic data has its place, it does not solve the data shortage.
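To see why training on model output degrades over generations, here is a toy illustration of my own (not from the article): repeatedly fit a Gaussian to samples drawn from the previous fit and watch the spread collapse.

```python
# Toy illustration of "model collapse": a model fit only to the previous
# model's samples loses a little of the original spread at every generation.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20           # small finite sample drawn at each generation
mu, sigma = 0.0, 1.0     # generation 0: the "real data" distribution

for generation in range(1, 201):
    samples = rng.normal(mu, sigma, n_samples)  # data produced by the current model
    mu, sigma = samples.mean(), samples.std()   # next model is fit only to that data
    if generation % 50 == 0:
        print(f"generation {generation:3d}: sigma = {sigma:.4f}")

# sigma shrinks toward zero: each finite sample under-represents the spread of
# the distribution it came from, and those errors compound across generations.
```

Real model collapse plays out over far richer distributions of text and images, but the dynamic is the same: each generation slightly under-represents the tails of the one before it, and the losses compound.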
Approach 4: Proprietary Data (The Ultimate Goldmine)
All the above are tiny beside the prize of proprietary data; this is the real untapped trove. The earliest deals for proprietary training data have been focused on sources like Reddit, which have massive amounts of user-submitted content, and media sites that have high-value information. There are thousands of sources of information like this.
But a massive amount of human knowledge is in the form of proprietary, non-crawlable information. The information needed to train medical AI breakthroughs is in medical records and images; the information needed to automate contracting workflows is in millions of contracts; the information needed to optimize customer support is in millions of call transcripts; the information needed to architect houses is in millions of drawings; the information needed for AI to generate a feature-length movie is in other TV and video content; the information needed to know the latest science is in paid journal articles.
As AI becomes more specialized, thousands of these proprietary deals will need to exist (sometimes augmented with human-generated data). The obstacle to date has been the illiquidity of this information and the difficulty of licensing it at scale, but as the data shortage for AI becomes more acute, more and more of these deals are being structured.
Data Wars Are Just Beginning
Meta’s deal with Scale AI is merely the opening salvo. Across the industry, companies are arming themselves on all four fronts: continuing to crawl the web, funding human labeling firms, experimenting with synthetic data pipelines, and entering deals for proprietary data. Expect this to become the next trillion-dollar category as the limitations of the open web set in.
[Co-written with Travis May, CEO of Shaper Capital]