May 5, 2023

New Research by Prof. David Bamman Reveals That ‘ChatGPT Seems To Be Trained on Copyrighted Books’

From New Scientist (Paywall)

ChatGPT seems to be trained on copyrighted books like Harry Potter

By Chris Stokel-Walker

ChatGPT and its successor GPT-4 appear to have memorised details from vast numbers of copyrighted books, raising questions about the legality of how these large language models (LLMs) are created.

Both artificial intelligences were developed by private firm OpenAI and trained on huge amounts of data, but exactly which texts make up this training data is unknown. To find out more, David Bamman at the University of California, Berkeley, and his colleagues looked at whether the AIs were able to fill in missing details from a selection of almost 600 fiction books, drawn from sources such as nominees for the Pulitzer prize between 1924 and 2020, and The New York Times’s bestsellers lists over the same time period...

In general, LLMs like ChatGPT and GPT-4 work by predicting the most likely next word in a sentence, based on statistical data learned during training, but this task was designed to expose whether the AIs could return the exact right answer. “It really requires knowledge of the underlying material in order to be able to get the name right,” says Bamman...
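The probe described above can be sketched in a few lines: mask a character's name in a passage, ask the model to fill the blank, and count only an exact match as success. This is a minimal illustration assuming that basic setup; the function and variable names, the example passage, and the stubbed model call are all hypothetical, not from the study itself.

```python
def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace one occurrence of a character's name with a mask token."""
    return passage.replace(name, mask, 1)

def exact_match(prediction: str, name: str) -> bool:
    """Score the model's answer: only the exact right name counts."""
    return prediction.strip() == name

# Hypothetical passage (not drawn from the study's data):
passage = "Harry looked at the letter addressed in green ink."
prompt = make_cloze(passage, "Harry")
# prompt is now "[MASK] looked at the letter addressed in green ink."

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call, stubbed here for illustration."""
    return "Harry"  # a model that has memorised the text answers exactly

correct = exact_match(query_model(prompt), "Harry")
```

Aggregated over many such passages per book, the exact-match rate gives a rough signal of how much of that book the model has memorised.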

Read the full article.

David Bamman is an associate professor at the I School. He previously won a National Science Foundation (NSF) CAREER award for his research designing computational methods to improve natural language processing for fiction.

Last updated: May 17, 2023