A recent academic study is adding fuel to the ongoing debate about whether OpenAI’s models were trained using copyrighted content — and whether that content was “memorized” without proper authorization.
OpenAI is already facing lawsuits from authors, developers, and other copyright holders, who argue that their works — including novels, articles, and codebases — were used without consent to train large language models. While OpenAI maintains that such use falls under the fair use doctrine, plaintiffs argue that U.S. copyright law provides no such exemption for training data.
The new study, conducted by researchers from the University of Washington, Stanford University, and the University of Copenhagen, introduces a method for detecting “memorization” in language models accessible via API — like OpenAI’s GPT-3.5 and GPT-4.
These AI models are essentially high-powered prediction machines. They generate content by identifying and using patterns in massive amounts of training data. While the output is usually new, unique text, some cases reveal that models can spit out memorized chunks — sometimes word-for-word — especially from copyrighted sources.

The researchers used a technique based on what they call “high-surprisal words” — rare or unexpected words in specific contexts. For example, in the sentence “Jack and I sat perfectly still with the radar humming,” the word radar is considered high-surprisal because it’s less likely to appear than words like engine or radio before humming.
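To make "surprisal" concrete, here is a minimal sketch (not the authors' code) of how a word's surprisal in context can be scored with a small open model: the negative log-probability the model assigns to the word given the preceding text. It assumes the Hugging Face transformers package and the public GPT-2 checkpoint; the example sentence is the one quoted above.

```python
# Illustrative sketch, not the study's pipeline: score how "surprising" a word
# is in context using negative log-probability under a small causal LM (GPT-2).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(prefix: str, candidate: str) -> float:
    """Total negative log-probability (in nats) of `candidate` following `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cand_ids = tokenizer(" " + candidate, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    offset = prefix_ids.shape[1]
    nll = 0.0
    for i in range(cand_ids.shape[1]):
        token_id = cand_ids[0, i]
        # The logits at position offset+i-1 predict the token at position offset+i.
        nll -= log_probs[0, offset + i - 1, token_id].item()
    return nll

prefix = "Jack and I sat perfectly still with the"
for word in ["radar", "engine", "radio"]:
    print(f"{word}: {surprisal(prefix, word):.2f} nats")
# A higher score means the word is less expected in this context,
# which is what makes it a useful probe for memorization.
```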
The researchers removed high-surprisal words from snippets of fiction and news content (including New York Times articles) and then prompted the AI models to guess the missing words. When a model correctly filled in these rare words, it suggested that the model had memorized those specific phrases or passages.
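The probing step itself can be pictured as a simple fill-in-the-blank query. The sketch below is illustrative only: the prompt wording and masking format are assumptions, not the paper's exact protocol, and it presumes the openai Python package (v1.x) with an API key in the environment.

```python
# Illustrative sketch of the masked-word probe; prompt wording is hypothetical.
from openai import OpenAI

client = OpenAI()

def probe_masked_word(passage_with_blank: str, model: str = "gpt-4") -> str:
    """Ask the model to guess the single word hidden behind [MASK]."""
    prompt = (
        "The following passage has one word replaced by [MASK]. "
        "Reply with only the missing word.\n\n" + passage_with_blank
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

passage = "Jack and I sat perfectly still with the [MASK] humming."
print(probe_masked_word(passage))
# If the model keeps recovering rare words like "radar" across many passages
# from one source, that source was plausibly present in its training data.
```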
According to the results, GPT-4 demonstrated a notable degree of memorization, especially with excerpts from fiction books in BookMIA, a dataset built from copyrighted ebooks. It also showed limited memorization of New York Times content.
An example from the study shows GPT-4 accurately guessing high-surprisal words from obscure literary passages — indicating these passages likely existed in its training data.
Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, emphasized the broader implications:
“In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically. Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem.”
While OpenAI does have some content licensing agreements and offers mechanisms for publishers to opt out of data use, the company has also pushed governments to adopt flexible regulations that would allow training AI on copyrighted content under expanded fair use rules.
This study, however, raises fresh questions about the boundary between learning and memorizing, and about whether AI companies have gone too far in using protected material without sufficient safeguards.