Federal Judge Slams OpenAI’s Attempt to Dismiss Authors’ Copyright Lawsuit Over Book Training Data

SDNY Judge Rejects OpenAI's Motion to Strike Authors' Claims
A federal judge in the Southern District of New York has denied OpenAI's bid to strike claims that the AI giant improperly used copyrighted books to train its models. The ruling represents a significant setback for the company's legal strategy.
Court Documents Reveal Training Data Controversy
The lawsuit alleges OpenAI systematically downloaded and incorporated entire copyrighted books without authorization or compensation. Authors claim their works were used as training fuel for ChatGPT and other AI systems—raising fundamental questions about intellectual property in the age of artificial intelligence.
Legal Precedent Could Reshape AI Industry
This decision marks another legal hurdle for AI companies racing to secure training data. While tech giants bet billions on AI development, courts are increasingly questioning whether 'move fast and break things' includes breaking copyright law. It is another reminder that in the AI gold rush, the lawyers often strike gold first.
The "download claim" dispute
The case is part of a massive multidistrict litigation (MDL) consolidating numerous copyright lawsuits against OpenAI and Microsoft in New York's Southern District. An MDL combines similar cases from different courts into one proceeding for efficient pre-trial handling.
This consolidated action includes complaints from authors David Baldacci, Michael Chabon, and others alleging OpenAI "captured, downloaded, and copied copyrighted written works" without permission.
In its motion to strike, OpenAI argued the consolidated complaint improperly introduced a new legal theory by separating download allegations from training-based claims.
U.S. District Judge Sidney Stein rejected this argument, finding that prior class action complaints had already "asserted a cause of action for copyright infringement and alleged that OpenAI impermissibly downloaded and reproduced plaintiffs' books."
The fact that many allegations suggested the "ultimate purpose of the reproduction was to train OpenAI's LLMs is not dispositive," he wrote.
Navodaya Singh Rajpurohit, legal partner at Coinque Consulting, told Decrypt that "authors may need to show concrete evidence that their books were in the training data."
The courts have ordered production of "Slack channels discussing the removal of the books datasets" and required OpenAI to preserve "complete output logs and metadata," he added, to "trace whether specific works were ingested."
"These logs, along with any test files or vendor‑supplied book lists, may be important in discovery," the lawyer said.
OpenAI may argue that the downloads came from public or licensed sources, Rajpurohit said. He noted that the company has acknowledged licensing publisher content and contends that training on publicly available material is transformative fair use, and that its recent media partnerships suggest clearer licensing practices that support the lawfulness of its conduct.
Industry-wide copyright battles
OpenAI is fending off a raft of copyright suits, including one led by The New York Times, which alleges that OpenAI and Microsoft used “millions of paywalled articles” to build a “market substitute” for news.
In May, a court ordered OpenAI to “preserve and segregate all output-log data,” including deleted chats; OpenAI contested the order in June, calling it “an overreach by The New York Times” that undermines user privacy.
In June, Meta and Anthropic notched partial wins. Judge Vince Chhabria deemed Meta’s book training fair use while noting that the plaintiffs “made the wrong arguments,” and Judge William Alsup likewise found Anthropic’s training to be fair use but criticized its “permanent library of pirated books.”