Leading AI companies such as OpenAI, Google and Meta rely more on content from premium publishers to train their large language models (LLMs) than they publicly admit, according to new research from executives at Ziff Davis, one of the largest publicly traded digital media companies.
Why it matters: Publishers believe that the more they can show that their high-end content has contributed to training LLMs, the more leverage they will have in seeking copyright protection and compensation for their material in the AI era.
Zoom in: While AI firms generally do not say exactly what data they use for training, executives from Ziff Davis say their analysis of publicly available datasets makes it clear that AI firms rely disproportionately on commercial publishers of news and media websites to train their LLMs.
The paper — authored by Ziff Davis' lead AI attorney, George Wukoson, and its chief technology officer, Joey Fortuna — finds that for some large language models, content from a set of 15 premium publishers made up a significant share of the training datasets.
For example, when analyzing OpenWebText, an open-source replication of the WebText dataset OpenAI used to train GPT-2, the executives found that nearly 10% of its URLs came from the set of 15 premium publishers they studied.
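At its core, that kind of analysis is a tally: what fraction of a dataset's URLs resolve to a known set of publisher domains? A minimal sketch in Python, assuming a plain-text URL list (`urls.txt`, one URL per line) and an illustrative, hypothetical handful of publisher domains rather than the paper's actual set of 15:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-ins; the article does not enumerate the 15 publishers.
PREMIUM_DOMAINS = {"nytimes.com", "wsj.com", "reuters.com"}

def registrable_domain(url: str) -> str:
    """Crude domain extraction: keep the last two labels of the host."""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def premium_share(urls: list[str]) -> float:
    """Fraction of URLs whose domain belongs to the premium set."""
    counts = Counter(registrable_domain(u) for u in urls)
    total = sum(counts.values())
    premium = sum(n for dom, n in counts.items() if dom in PREMIUM_DOMAINS)
    return premium / total if total else 0.0

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    print(f"Premium-publisher share: {premium_share(urls):.1%}")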
Of note: Ziff Davis is a member of the News/Media Alliance (NMA), a trade group that represents thousands of premium publishers. The new study's findings resemble those of a research paper submitted by NMA to the U.S. Copyright Office last year.
That study found that popular curated datasets underlying major LLMs significantly overweight publisher content "by a factor ranging from over 5 to almost 100 as compared to the generic collection of content that the well-known entity Common Crawl has scraped from the web."
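The "factor" in that comparison is simply a ratio: a publisher set's share of the curated dataset divided by its share of a generic Common Crawl sample. A sketch with made-up illustrative numbers, not figures from the NMA study:

```python
def overweight_factor(share_in_dataset: float, share_in_common_crawl: float) -> float:
    """How many times more prevalent publisher content is in a curated
    dataset than in a generic Common Crawl sample."""
    return share_in_dataset / share_in_common_crawl

# Illustrative only: publishers at 10% of a curated dataset but 0.5% of
# Common Crawl would be overweighted by a factor of 20.
print(overweight_factor(0.10, 0.005))  # -> 20.0
```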
The company conducted the study to help educate the industry about the issue and to inform its own conversations with AI firms.
Seeing just how much AI firms rely on premium publishers also "puts us in a position of having a lot of responsibility for the way that these models understand the world," said Fortuna.
"We have to put great care into what we publish, knowing now that it's the basis of education for a nascent intelligence," he added.
Between the lines: The report also finds that a few public datasets used to train older LLMs are still being used to train newer models today.
The paper's authors suggest that the disproportionate reliance on premium publisher content seen in older large language models carries over to newer LLMs.
"While those frontier models' training is kept secret, we find evidence that the older public training sets still influence the new models," the paper reads.
The big picture: Most news companies making deals with AI firms are no longer focused on training-data deals, since those tend to be one-time windfalls.
Instead, they are cutting longer-term deals to provide news content for generative AI-powered chatbots to answer real-time queries about current events.
A high-profile lawsuit brought by the New York Times against OpenAI and Microsoft could help define for the broader industry whether scraping publisher content without permission and using it to train AI models and fuel their outputs is a copyright violation.