Techripper

AI Watchdog Accuses OpenAI of Training GPT-4o on Paywalled O’Reilly Books

By Intern, April 2, 2025

OpenAI has faced multiple allegations that it used copyrighted content for AI training without permission. Now, a new study from the AI Disclosures Project claims that OpenAI may have trained its GPT-4o model on non-public books from O’Reilly Media. The study, led by media mogul Tim O’Reilly and economist Ilan Strauss, presents evidence that GPT-4o shows strong recognition of paywalled O’Reilly book content, even though the two companies have no licensing agreement.

Contents
  • AI Models and Their Training Data
  • Findings from the AI Disclosures Project
  • OpenAI’s Data Practices Under Scrutiny
  • OpenAI’s Licensing Agreements and Legal Challenges

AI Models and Their Training Data

AI models, including OpenAI’s ChatGPT, function as complex prediction engines trained on vast datasets comprising books, movies, TV shows, and other media. These models learn patterns and generate content based on user prompts but do not create entirely original outputs. While some AI companies have started incorporating AI-generated data for training, real-world data remains critical due to the risks of diminished model performance when relying solely on synthetic data.

Findings from the AI Disclosures Project

The AI Disclosures Project’s research employed a method known as DE-COP, first introduced in a 2024 academic study, to detect copyrighted content in language models’ training data. This approach, a type of “membership inference attack,” tests whether an AI model can reliably distinguish original human-authored texts from AI-generated paraphrases of the same content. If the model picks out the original more often than chance would predict, it suggests the model has prior knowledge of the text, most likely because the text was part of its training dataset.
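The core idea of such a test can be illustrated with a short Python sketch. This is not the study's code: the quiz format (one original plus three paraphrases), the `guess_rate` function, and the toy `make_memorizing_chooser` stand-in for a model are all assumptions made for illustration. A real test would query an actual language model and apply proper statistical analysis.

```python
import random
from typing import Callable, Sequence

# Hypothetical stand-in for a model query: given the shuffled options,
# return the index of the passage the "model" believes is the verbatim original.
Chooser = Callable[[Sequence[str]], int]


def guess_rate(choose: Chooser,
               items: Sequence[tuple[str, Sequence[str]]],
               seed: int = 0) -> float:
    """Fraction of quiz items where `choose` picks the original passage."""
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # hide the position of the original
        if options[choose(options)] == original:
            hits += 1
    return hits / len(items)


def make_memorizing_chooser(seen: set[str]) -> Chooser:
    """Toy 'model': recognizes passages in `seen`, otherwise picks option 0."""
    def choose(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no recognition: an uninformed guess
    return choose


# Illustrative placeholder data, not real book excerpts.
items = [
    ("orig A", ["para A1", "para A2", "para A3"]),
    ("orig B", ["para B1", "para B2", "para B3"]),
]

chance = 1 / 4  # four options per item, so blind guessing is right 25% of the time
rate_seen = guess_rate(make_memorizing_chooser({"orig A", "orig B"}), items)
rate_unseen = guess_rate(make_memorizing_chooser(set()), items)

print(f"chance={chance:.2f} seen={rate_seen:.2f} unseen={rate_unseen:.2f}")
```

A guess rate well above the chance level on a large set of excerpts is the kind of signal that would hint at prior exposure. On a tiny sample like this one, the numbers carry no statistical weight.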

Tim O’Reilly, Ilan Strauss, and AI researcher Sruly Rosenblat tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books. Their results indicated that GPT-4o recognized significantly more paywalled O’Reilly book content than older models like GPT-3.5 Turbo, even when accounting for potential biases such as improved text recognition abilities in newer AI models.

OpenAI’s Data Practices Under Scrutiny

While the study stops short of providing conclusive proof, it raises serious concerns. The authors acknowledge that their methodology isn’t foolproof and suggest that some paywalled content may have entered OpenAI’s dataset via users copying and pasting text into ChatGPT. Furthermore, the study did not evaluate OpenAI’s latest models, such as GPT-4.5 or its new reasoning models like o3-mini and o1, which could have different training data sources.

OpenAI has long advocated for more flexible rules regarding the use of copyrighted material in AI training. The company has also actively sought high-quality training data, going so far as to hire journalists to refine its models’ outputs. This reflects a broader industry trend of AI companies recruiting domain experts in scientific fields such as physics to improve their models.

OpenAI’s Licensing Agreements and Legal Challenges

It’s important to note that OpenAI does pay for certain training data. The company has established licensing deals with news organizations, social media platforms, and stock media providers. Additionally, OpenAI offers opt-out mechanisms for copyright holders who do not want their content included in training datasets, though these measures are considered imperfect by critics.

As OpenAI faces multiple lawsuits over its data practices and copyright compliance in U.S. courts, the allegations presented in the O’Reilly study add to the ongoing debate about AI model training ethics. Whether this report will influence legal proceedings or lead to stricter regulations on AI training data remains to be seen.

