Bestselling author David Baldacci is suing OpenAI and Microsoft for allegedly training large-language models on copyrighted books and academic papers without permission [1].
The case represents a critical legal battle over the intellectual property rights of creators in the age of generative AI. If Baldacci succeeds, it could force AI companies to change how they source training data and potentially pay billions in royalties to authors and researchers.
The lawsuit is currently pending in the Southern District of New York [1]. Baldacci alleges that AI firms have stolen his and other writers' works by ingesting billions of copyrighted texts into their systems [1]. He said that this theft has occurred over the last eight or nine years [1].
Baldacci has expanded his accusations beyond the primary defendants. He said that Anthropic and Meta have also engaged in this practice, claiming that every book and academic paper published worldwide in the last 70 years [1] is present in these large language model systems.
This legal effort has spanned several years of preparation. Baldacci has been in the discovery phase for about three years [1]. He also took his concerns to the legislative branch, providing testimony before Congress in July 2024 [1].
Despite the length of the process, Baldacci remains committed to the litigation. He said in a separate interview that this is the hill he is going to die on [2]. The author contends that the current AI training model violates fundamental copyright laws by treating creative works as free raw material for commercial software [1].
““They have stolen over the last 8 or 9 years, every book and every academic paper published in the last 70 years worldwide””
This lawsuit highlights the systemic tension between the 'fair use' doctrine and the commercial scale of AI training. By targeting the ingestion of both popular fiction and academic papers, Baldacci is challenging the premise that scraping the public internet and digital libraries constitutes a transformative use of data. A ruling in favor of the plaintiff would establish a legal precedent requiring AI developers to secure explicit licenses for training sets, potentially shifting the economic landscape of AI development toward a paid-subscription model for data sources.





