TECHNOLOGY

The Mystery of AI's Secret Reading List

USA · Wed Apr 02 2025
AI models are essentially sophisticated prediction machines. They learn from vast amounts of data, like books, movies, and TV shows, and when they create something, they are drawing on patterns in what they have already seen rather than inventing anything truly new. This is how they can write about Greek tragedies or draw in a specific style: it is all patterns and approximation.

The AI Disclosures Project, a nonprofit group, has raised some serious questions about where those patterns come from. It suggests that OpenAI may have used nonpublic, paywalled books from O’Reilly Media to train its advanced AI models, which would mean using copyrighted content without permission. The group's findings rest on a method called DE-COP, which can detect whether a model has seen specific texts before. Testing OpenAI's models, including GPT-4o, the one used in ChatGPT, the group found that GPT-4o recognized more paywalled content than older models did, suggesting it may have been trained on this restricted content.

The group is co-founded by Tim O’Reilly, who is also the CEO of O’Reilly Media. The test itself is clever: the researchers checked whether the models could tell original texts from AI-generated paraphrases of them. If a model can reliably pick out the original, it has likely seen that text before (a rough sketch of such a test appears below). They ran the check on 13,962 paragraph excerpts from 34 O’Reilly books, and GPT-4o recognized more of the paywalled content than older models did, a strong hint that it was trained on this restricted data.

The group was careful to point out that its method isn't perfect. OpenAI might have picked up the paywalled content indirectly, from users copying and pasting it into ChatGPT. The study also didn't test OpenAI's most recent models, so it is possible those weren't trained on the same data. Even so, the findings raise important questions about how AI models are trained and what data they use.

OpenAI, for its part, has been pushing for less strict rules around using copyrighted data to train AI. It has also hired experts to help fine-tune its models, part of an industry trend of recruiting specialists to feed their knowledge into AI systems. OpenAI does pay for some of its training data, has licensing deals with various sources, and offers ways for copyright owners to opt out of having their content used for training. But with lawsuits and criticisms piling up, the O’Reilly paper adds to the growing concerns about OpenAI's data practices.
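To make the method concrete, here is a minimal sketch of a DE-COP-style membership test, written in Python against OpenAI's public chat API. This is an illustration under assumptions, not the AI Disclosures Project's actual code: the four-option quiz format, the prompt wording, and the decop_trial helper are all choices made up for this example.

    import random
    from openai import OpenAI  # official openai package: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def decop_trial(original, paraphrases, model="gpt-4o"):
        """One multiple-choice trial: does the model spot the verbatim passage?"""
        # Mix the verbatim excerpt with three paraphrased decoys in random order.
        options = [original] + list(paraphrases[:3])
        random.shuffle(options)
        labels = "ABCD"
        listing = "\n".join(f"{label}. {text}" for label, text in zip(labels, options))
        prompt = (
            "Exactly one of these passages is a verbatim excerpt from a book; "
            "the others are paraphrases. Reply with the letter of the verbatim "
            "passage only.\n\n" + listing
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = reply.choices[0].message.content.strip()[:1].upper()
        return answer == labels[options.index(original)]

    if __name__ == "__main__":
        # Hypothetical placeholder strings; a real study would run thousands of
        # excerpts (the paper used 13,962 from 34 books) and compare the hit
        # rate against the 25% chance baseline.
        original = "An example paragraph taken verbatim from a paywalled book."
        paraphrases = [
            "A sample passage reworded from a book that sits behind a paywall.",
            "One paragraph, paraphrased, that originally came from a gated book.",
            "An illustrative excerpt rewritten from a restricted technical text.",
        ]
        print("model picked the verbatim passage:", decop_trial(original, paraphrases))

The logic behind the design: a model that has never seen a passage can only guess among plausible paraphrases, so sustained accuracy well above the 25% chance baseline across many excerpts is evidence that the passages were in its training data.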

questions

    What ethical guidelines should AI companies follow to ensure transparency and fairness in their training data practices?
    How does the use of paywalled content in training AI models affect the reliability and ethics of the resulting models?
    What measures does OpenAI take to verify the sources of its training data to avoid copyright infringement?
