On Thursday, Harvard University announced the launch of a dataset containing nearly one million public-domain books that can be used to train artificial intelligence models. The project, part of the newly created Institutional Data Initiative, received funding from Microsoft and OpenAI, and the collection consists of books scanned as part of the Google Books project that are old enough for their copyright protection to have expired.
A Wired article about the new project states that the dataset includes a wide range of books, including “classics from Shakespeare, Charles Dickens, and Dante, as well as little-known Czech math textbooks and Welsh pocket dictionaries.” As a rule, copyright protection lasts for the life of the author plus another 70 years.
Foundation language models like the ones behind ChatGPT require huge amounts of high-quality text for their training – generally, the more data they ingest, the better the models mimic humans and recall knowledge. But this thirst for data has caused problems, as companies like OpenAI are hitting a wall on how much new information they can find – at least without stealing it.
Publishers including the Wall Street Journal and the New York Times have sued OpenAI and its competitor Perplexity for using their content without permission. Defenders of the AI companies make various arguments. One is that humans create new works by studying and synthesizing material from other sources, and AI is no different: everyone goes to school, reads books, and then produces new work from the knowledge they have gained. Such remixing can legally qualify as fair use if the new creation is sufficiently transformative. But that ignores the fact that humans cannot ingest billions of chunks of text at the speed computers can, so the comparison isn't really fair. The Wall Street Journal, in its lawsuit against Perplexity, claimed that the startup “copies on a massive scale.”
Players in the field have also argued that any content publicly available online is essentially fair game, and that it is the chatbot's user who gains access to copyrighted content by requesting it through a prompt – making a chatbot like Perplexity little different from a web browser. It will take some time for these arguments to be tested in the courts.
In response to the criticism, OpenAI has struck licensing agreements with some content providers, and Perplexity has launched a revenue-sharing program with publishers. But it is clear they did not do so eagerly.
At the same time that AI companies are running out of new content, the web sources already commonly included in training sets have begun restricting access to their data. Companies such as Reddit and X have moved to limit the use of their data, having realized its enormous value – especially for supplying real-time information that gives foundation models a more current picture of the world.
Reddit earns hundreds of millions of dollars licensing its corpus of subreddits and comments to Google for model training. Elon Musk’s X has an exclusive arrangement with another of his companies, xAI, giving its models access to the platform’s content for training and for retrieving current information. There is some irony in the fact that these companies guard their own data carefully while treating content from media publishers as worthless and free for the taking.
A million books is not enough to satisfy the training needs of any AI company, especially since these books are old and contain no modern information – no Generation Z slang, for instance. And to differentiate themselves from competitors, AI companies will keep seeking out other data, especially exclusive data, so that their models don't all turn out identical. Still, the Institutional Data Initiative dataset offers at least some help to AI companies trying to train their initial baseline models without getting into legal trouble.