Anthropic Spent Tens of Millions of Dollars to Compile Unique Books Dataset for Claude AI Training


Anthropic has revealed that it spent tens of millions of dollars to create a proprietary dataset of scanned books used to train its Claude AI system. During legal proceedings in the Bartz v. Anthropic case on 9th June 2025, it emerged that the company had sourced, scanned, and assembled the collection itself, and that it claims the collection is available to no other AI company in the world.

The case centres on Judge William Alsup's ruling of 23rd June 2025, which partially favoured Anthropic on fair use questions but drew a sharp distinction between using books for AI training and the way those books were obtained and stored. Judge Alsup deemed the use of books for LLM training "spectacularly transformative" whilst rejecting Anthropic's defence of its acquisition and long-term storage of over seven million pirated books. The court differentiated between books that were lawfully purchased and then digitised, which it considered fair use, and books downloaded from pirate sites, which it ruled clearly infringing.

The Bartz v. Anthropic case may set a precedent in AI copyright law, providing a framework for how judges, regulators and companies approach copyright compliance in this rapidly developing area. While Judge Alsup's ruling is not binding outside the Northern District of California, it could influence other similar cases, such as those consolidated against OpenAI in the Southern District of New York. The decision makes clear that how copyrighted works are acquired and handled internally is just as important as how they are ultimately used.

Sources:

1. Bartz v. Anthropic parties fight over production of datasets spreadsheet and books dataset outside of inspection environment. Anthropic reveals it spent tens of millions of dollars to compile its own scanned books dataset.
As Judge Alsup is deliberating over Anthropic's motion for summary judgment on fair use, the parties continue to fight over discovery. One of the more fascinating discovery disputes relates t…

2. Bartz v. Anthropic: Early Look at Copyright Claims and Generative AI | JD Supra
On June 23, 2025, Senior Judge William Alsup of the Northern District of California issued a highly anticipated summary judgment opinion in Bartz v.…

3. Landmark Ruling on AI Copyright: Fair Use vs. Infringement in Bartz v. Anthropic | ArentFox Schiff
In one of the first substantive decisions analyzing whether the use of copyrighted works to train large language models (LLMs) for generative artificial intelligence (AI) services is infringing or a fair use, Judge William Alsup issued a split decision in his summary judgment order. See Bartz et al. v. Anthropic PBC, No. 3:24-cv-05417 (N.D. Cal. Aug 19, 2024).