A recent study is fueling concerns that OpenAI may have trained its AI models using copyrighted content without permission.
OpenAI is currently facing multiple lawsuits from authors, developers, and other rights holders who claim the company used their copyrighted work — including books and code — to build its models. OpenAI has argued its actions fall under “fair use,” but critics point out that U.S. copyright law doesn’t clearly allow copyrighted works to be used as training data.
The new research, conducted by teams from the University of Washington, Stanford, and the University of Copenhagen, introduces a method to detect whether AI models have “memorized” specific pieces of training data — a potential violation of copyright.
Instead of producing exact copies of their training data, AI models are designed to generate new content based on patterns. However, in some cases, models can reproduce entire segments of copyrighted material. Past studies have shown image-generating models recreating film scenes, and text-based models repeating published articles.
The new study identifies rare, unexpected words, which the researchers call “high-surprisal” words, in literary texts. The researchers removed these words from excerpts of fiction and New York Times articles, then asked OpenAI’s models (including GPT-4 and GPT-3.5) to guess the missing words. Because such rare words are hard to predict from context alone, a model that consistently guesses them correctly has likely memorized the original passages.
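To make the idea concrete, here is a minimal sketch of how a fill-in-the-blank probe of this kind could be run against an OpenAI model using the official Python SDK. This is not the study authors’ code: the passage, the removed word, the prompt wording, and the choice of model are illustrative placeholders, and a real audit would repeat the test over many excerpts.

```python
# Minimal sketch of a "high-surprisal word" probe (illustrative only, not the study's code).
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY environment variable.

from openai import OpenAI

client = OpenAI()

# An excerpt with one unusual ("high-surprisal") word replaced by a blank.
# Both the passage and the removed word are made-up examples.
passage = "The lighthouse keeper kept a ____ of every ship that passed the headland."
removed_word = "ledger"

response = client.chat.completions.create(
    model="gpt-4",  # the model being audited
    messages=[
        {
            "role": "user",
            "content": (
                "Fill in the blank with the single word most likely to appear "
                f"in the original text. Reply with one word only.\n\n{passage}"
            ),
        }
    ],
    temperature=0,  # deterministic output for a reproducible guess
)

guess = response.choices[0].message.content.strip().lower()

# A correct guess on a rare, hard-to-infer word is weak evidence of memorization on its own;
# the signal comes from repeating this test across many excerpts and counting correct guesses.
print(f"model guessed: {guess!r}, match: {guess == removed_word}")
```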
The findings showed that GPT-4 appeared to have memorized parts of copyrighted fiction books, especially samples from an e-book dataset called BookMIA. It also showed signs of memorizing portions of New York Times articles, though less frequently.
Abhilasha Ravichander, a Ph.D. student at the University of Washington and co-author of the study, said the results highlight how important transparency is in AI development.
“To build trustworthy language models, we need tools that let us examine and audit them scientifically,” she explained. “Our research offers one such tool, but the broader issue is a lack of transparency around what data these models are trained on.”
While OpenAI has signed some licensing deals and lets rights holders opt out of training datasets, the company continues to advocate for relaxed rules on using copyrighted content in AI. It’s also been lobbying for fair use protections for AI training at the government level.