
AI chatbots need more books to learn from, and several US libraries are lending them out.

Everything ever said on the internet was just the beginning of teaching artificial intelligence about humanity. Now, tech companies are turning to an even older repository of knowledge: library bookshelves.

Nearly a million books, published as far back as the 15th century and in 254 languages, are part of a Harvard University collection recently shared with AI researchers. Troves of old newspapers and government documents held by the Boston Public Library will soon be added.

Opening the vaults to access centuries-old tomes could mean a wealth of data for tech companies facing lawsuits from novelists, visual artists, and others whose creative works they have used without their consent to train AI chatbots.

Public domain

“It’s a prudent decision to start with public domain information because that’s less controversial at this point than content that’s still under copyright,” said Burton Davis, Microsoft’s deputy general counsel.

Davis noted that libraries also hold "vast amounts of interesting cultural, historical, and linguistic data" that is largely missing from the past few decades of online commentary from which AI chatbots have mostly learned. Fears of running out of data have also led AI developers to turn to "synthetic" data, generated by the chatbots themselves and generally of lower quality.

With the support of unrestricted gifts from Microsoft and OpenAI—the maker of ChatGPT—the Harvard-based Institutional Data Initiative is working with libraries and museums around the world on how to make their historical collections AI-ready in a way that also benefits the communities they serve.

“We're trying to shift some of the power that's currently in the hands of AI back to these institutions,” said Aristana Scourtas, who leads research at Harvard Law School's Library Innovation Lab. “Librarians have always been stewards of data and information.”

Chatbots. Clarín Archive.

The dataset Harvard just released, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the oldest works dates back to the 15th century: a Korean painter's handwritten reflections on the cultivation of flowers and trees. The largest concentration of works is from the 19th century, on topics including literature, philosophy, law, and agriculture, all meticulously preserved and organized by generations of librarians.

Improve accuracy

The collection promises to be especially valuable to AI developers trying to improve the accuracy and reliability of their systems.

“A lot of the data that has been used in AI training doesn't come from original sources,” noted Greg Leppert, the data initiative's executive director, who is also chief technology officer at Harvard's Berkman Klein Center for Internet & Society, an organization focused on the study of cyberspace. This book collection, he added, is traceable “down to the physical copy that was scanned by the institutions that actually collected those materials.”

Before ChatGPT sparked a commercial frenzy in artificial intelligence, most AI researchers weren't particularly interested in the provenance of the text passages they scraped from Wikipedia, from social media forums like Reddit, and sometimes from vast repositories of pirated books. They just needed what computer scientists call tokens: units of data, each of which can represent a fragment of a word.
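As a rough illustration (not from the article): the idea of a token can be sketched with a toy text splitter. Real chatbot tokenizers use learned subword vocabularies such as byte-pair encoding, so the rule below, along with the function name and regex, is purely illustrative; it only shows why token counts run higher than word counts.

```python
import re

def toy_tokenize(text):
    # Toy rule: each run of word characters, and each punctuation mark,
    # counts as one "token". Real AI tokenizers instead split text into
    # learned subword units, so even single words can yield several tokens.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Libraries hold centuries-old books.")
print(tokens)       # ['Libraries', 'hold', 'centuries', '-', 'old', 'books', '.']
print(len(tokens))  # 7 tokens for a 4-word sentence
```

Counting at this granularity is how figures like Harvard's 242 billion tokens, mentioned below, are arrived at.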

Chatbots. REUTERS/Dado Ruvic/Illustration

Harvard’s new AI training collection has an estimated 242 billion tokens, an amount that’s difficult for humans to comprehend, but still just a drop in the bucket of what’s being fed into the most advanced AI systems. For example, Facebook’s parent company, Meta, has said that the latest version of its large AI language model was trained on more than 30 trillion tokens drawn from text, images, and videos.

Meta is also facing a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from “ghost libraries” of pirated works.

Now, with some reservations, the real libraries are setting their own terms.

Copyright violations

OpenAI, which is also facing a series of copyright infringement lawsuits, donated $50 million this year to a group of research institutions, including Oxford University's 400-year-old Bodleian Library, which is digitizing rare books and using AI to transcribe them.

When the company first approached the Boston Public Library, one of the largest in the United States, the library made it clear that any information it digitized would be available to everyone, said Jessica Chapel, its director of digital and online services.

“OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So, this seems to be a case where interests are coinciding,” Chapel said.

Digitization is expensive. The Boston library, for example, has done the painstaking work of scanning and organizing dozens of French-language New England newspapers that circulated widely among immigrant communities from Quebec in the late 19th and early 20th centuries. Now that this text is being used to train AI, it is helping fund projects the librarians wanted to pursue anyway.

Chatbots. REUTERS/Dado Ruvic/Illustration

Harvard's collection was first digitized starting in 2006 for another tech giant, Google, as part of its controversial project to create a searchable online library of more than 20 million books.

Google spent years fending off lawsuits from authors over its online library, which included many newer, copyrighted works. The dispute was finally resolved in 2016, when the U.S. Supreme Court let stand lower court rulings that rejected the copyright infringement claims.

95 years of protection

Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for sharing them with AI developers. Copyright protections in the United States typically last 95 years, and longer for sound recordings.

The new initiative was applauded by the same group of authors who sued Google over its book project and who more recently took AI companies to court.

“Many of these titles exist only on the shelves of major libraries, and the creation and use of this dataset will expand access to these volumes and the knowledge they contain,” said Mary Rasenberger, head of the Authors Guild, in a statement. “Above all, the creation of a comprehensive legal dataset for training will democratize the creation of new AI models.”

Photograph provided by Google showing the two pages of posts for Gemini, Google's artificial intelligence (AI) chatbot. EFE/Google

How useful all of this will be for the next generation of AI tools remains to be seen. The data is being shared on the Hugging Face platform, which hosts open-source AI datasets and models that anyone can download.

The book collection is more linguistically diverse than AI's typical data sources. Less than half of the volumes are in English, although European languages remain predominant, particularly German, French, Italian, Spanish, and Latin.

Immensely crucial

A collection of books steeped in 19th-century thought could also be “immensely crucial” to the tech industry’s attempts to build AI agents that can plan and reason as well as humans, Leppert noted.

“At a university, you have a lot of teaching materials about what reasoning means,” he observed. “You have a lot of scientific information about how to execute processes and how to perform analyses.”

At the same time, there is also a lot of outdated data, from discredited scientific and medical theories to racist and colonial narratives.

“When you're dealing with such a large dataset, there are some tricky issues around harmful content and language,” said Kristi Mukk, coordinator of Harvard's Library Innovation Lab. She said the initiative is seeking to provide guidance on mitigating the risks of using the data, “helping users make their own informed decisions and use AI responsibly.”

With information from The Associated Press.

Clarin
