Techripper

AI Watchdog Accuses OpenAI of Training GPT-4o on Paywalled O’Reilly Books

By Intern | April 2, 2025 | 3 Mins Read

OpenAI has faced multiple allegations of using copyrighted content for AI training without permission. Now, a new study from the AI Disclosures Project claims that OpenAI may have relied on non-public books from O’Reilly Media to train its GPT-4o model. The study, led by media mogul Tim O’Reilly and economist Ilan Strauss, presents evidence that GPT-4o exhibits strong recognition of paywalled O’Reilly book content, even though the two companies have no licensing agreement.

Contents
  • AI Models and Their Training Data
  • Findings from the AI Disclosures Project
  • OpenAI’s Data Practices Under Scrutiny
  • OpenAI’s Licensing Agreements and Legal Challenges

AI Models and Their Training Data

AI models, including those behind OpenAI’s ChatGPT, function as complex prediction engines trained on vast datasets comprising books, movies, TV shows, and other media. They learn patterns from that data and generate content in response to user prompts, so their outputs are derived from the training material rather than being entirely original. While some AI companies have started training on AI-generated data, real-world data remains critical because relying solely on synthetic data risks degrading model performance.

Findings from the AI Disclosures Project

The AI Disclosures Project’s research employed a method known as DE-COP, first introduced in a 2024 academic study, to detect copyrighted content in language models’ training data. This approach, a type of “membership inference attack,” tests whether an AI model can reliably distinguish original human-authored texts from AI-generated paraphrases of the same content. If a model picks out the original more often than chance would predict, that suggests it has prior knowledge of the text, meaning the text was likely part of its training dataset.
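The quiz idea behind this kind of test can be sketched in a few lines of Python. This is a minimal illustration of the general approach described above, not the study's actual code: the function names, the toy passages, and the stand-in "model" are all hypothetical, and the published method involves more steps than shown here.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: given the answer options, return the index
# the model believes is the original, human-written passage.
Chooser = Callable[[Sequence[str]], int]


def guess_rate(items, choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the original passage.

    `items` is a list of (original, [paraphrases]) pairs.
    """
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # position must not give the answer away
        picked = choose(options)
        hits += options[picked] == original
    return hits / len(items)


def make_memorizer(seen: set) -> Chooser:
    """Stand-in for a model: recognizes passages it has 'seen' before."""
    def choose(options: Sequence[str]) -> int:
        for i, option in enumerate(options):
            if option in seen:
                return i
        return 0  # no recognition: fall back to the first option
    return choose


# Invented toy data, for illustration only.
ITEMS = [
    ("The cache stores recent results to avoid repeated work.",
     ["Recent results are kept in the cache so work is not redone.",
      "To skip repeated work, the cache holds on to recent results.",
      "A cache saves the latest results, preventing duplicate effort."]),
    ("Each request is retried three times before it fails.",
     ["A request gets three retries before failing.",
      "Before failing, every request is attempted three more times.",
      "Requests fail only after three retries."]),
]

if __name__ == "__main__":
    chance = 1 / 4  # four options per quiz
    trained = make_memorizer({original for original, _ in ITEMS})
    untrained = make_memorizer(set())
    print("chance baseline:", chance)
    print("saw the originals:", guess_rate(ITEMS, trained))
    print("saw nothing:", guess_rate(ITEMS, untrained))
```

The stand-in that has "seen" the originals picks them every time, while the one that has seen nothing can only guess. With a real model, the signal would be a guess rate that sits well above the chance baseline across many passages.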

Tim O’Reilly, Ilan Strauss, and AI researcher Sruly Rosenblat tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books. Their results indicated that GPT-4o recognized significantly more paywalled O’Reilly book content than older models like GPT-3.5 Turbo, even when accounting for potential biases such as improved text recognition abilities in newer AI models.

OpenAI’s Data Practices Under Scrutiny

While the study stops short of providing conclusive proof, it raises serious concerns. The authors acknowledge that their methodology isn’t foolproof and suggest that some paywalled content may have entered OpenAI’s dataset via users copying and pasting text into ChatGPT. Furthermore, the study did not evaluate OpenAI’s latest models, such as GPT-4.5 or its new reasoning models like o3-mini and o1, which could have different training data sources.

OpenAI has long advocated for more flexible rules regarding the use of copyrighted material in AI training. The company has also actively sought high-quality training data, going so far as to hire journalists to refine its models’ outputs. This reflects a broader industry trend of AI companies recruiting subject matter experts in fields like science and physics to enhance their models.

OpenAI’s Licensing Agreements and Legal Challenges

It’s important to note that OpenAI does pay for certain training data. The company has established licensing deals with news organizations, social media platforms, and stock media providers. Additionally, OpenAI offers opt-out mechanisms for copyright holders who do not want their content included in training datasets, though these measures are considered imperfect by critics.

As OpenAI faces multiple lawsuits over its data practices and copyright compliance in U.S. courts, the allegations presented in the O’Reilly study add to the ongoing debate about AI model training ethics. Whether this report will influence legal proceedings or lead to stricter regulations on AI training data remains to be seen.
