OpenAI has faced multiple allegations regarding the use of copyrighted content for AI training without permission. Now, a new study from the AI Disclosures Project claims that OpenAI may have relied on non-public books from O’Reilly Media to enhance the training of its GPT-4o model. The study, led by O’Reilly Media founder Tim O’Reilly and economist Ilan Strauss, presents evidence that OpenAI’s latest model exhibits strong recognition of paywalled O’Reilly book content, despite the absence of a licensing agreement between the two companies.
AI Models and Their Training Data
AI models, including OpenAI’s ChatGPT, function as complex prediction engines trained on vast datasets comprising books, movies, TV shows, and other media. These models learn patterns and generate content based on user prompts but do not create entirely original outputs. While some AI companies have started incorporating AI-generated data for training, real-world data remains critical due to the risks of diminished model performance when relying solely on synthetic data.
Findings from the AI Disclosures Project
The AI Disclosures Project’s research employed a method known as DE-COP, first introduced in a 2024 academic paper, to detect copyrighted content in language models’ training data. This approach, a form of “membership inference attack,” tests whether an AI model can reliably distinguish original human-authored texts from AI-generated paraphrases of the same content. If a model picks out the verbatim passage at a rate well above chance, it suggests the text was part of its training dataset.
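The quiz-style setup behind this kind of test can be sketched roughly as follows. This is a minimal illustration with hypothetical helper names, not the study’s actual implementation: the verbatim passage is shuffled among paraphrases, the model under test picks one option per quiz, and accuracy above the random-guess baseline is treated as a membership signal.

```python
import random

def build_quiz(original: str, paraphrases: list[str], rng: random.Random):
    """Shuffle the verbatim passage in among AI-generated paraphrases.

    Returns the shuffled options and the index of the verbatim passage,
    which is the 'correct answer' the model under test must identify.
    """
    options = paraphrases + [original]
    rng.shuffle(options)
    return options, options.index(original)

def membership_signal(correct_picks: int, total_quizzes: int,
                      n_options: int = 4) -> float:
    """Accuracy above the chance baseline (1/n_options).

    A value well above zero on a book's excerpts suggests the model has
    prior exposure to the verbatim text; near zero is consistent with
    the book being absent from training data.
    """
    return correct_picks / total_quizzes - 1.0 / n_options
```

In practice, each quiz would be posed to the model as a multiple-choice prompt, and the signal would be compared across publicly available and paywalled excerpts to control for a newer model simply being better at spotting paraphrases in general.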
Tim O’Reilly, Ilan Strauss, and AI researcher Sruly Rosenblat tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books. Their results indicated that GPT-4o recognized significantly more paywalled O’Reilly book content than older models like GPT-3.5 Turbo, even when accounting for potential biases such as improved text recognition abilities in newer AI models.
OpenAI’s Data Practices Under Scrutiny
While the study stops short of providing conclusive proof, it raises serious concerns. The authors acknowledge that their methodology isn’t foolproof and suggest that some paywalled content may have entered OpenAI’s dataset via users copying and pasting text into ChatGPT. Furthermore, the study did not evaluate OpenAI’s latest models, such as GPT-4.5 or its new reasoning models like o3-mini and o1, which could have different training data sources.

OpenAI has long advocated for more flexible rules regarding the use of copyrighted material in AI training. The company has also actively sought high-quality training data, going so far as to hire journalists to refine its models’ outputs. This reflects a broader industry trend of AI companies recruiting subject matter experts in fields like science and physics to enhance their models.
OpenAI’s Licensing Agreements and Legal Challenges
It’s important to note that OpenAI does pay for certain training data. The company has established licensing deals with news organizations, social media platforms, and stock media providers. Additionally, OpenAI offers opt-out mechanisms for copyright holders who do not want their content included in training datasets, though these measures are considered imperfect by critics.
As OpenAI defends multiple lawsuits over its data practices and copyright compliance in U.S. courts, the findings of the O’Reilly study add to the ongoing debate about the ethics of AI training data. Whether this report will influence legal proceedings or lead to stricter regulation of AI training data remains to be seen.