Research Guides: Artificial Intelligence (AI) and Scholarly Communications: Fair Use and AI Training Data

Fair Use and Training Data

What is Fair Use?

Fair Use is a doctrine in United States copyright law that describes the circumstances under which copyrighted works may be used without seeking the permission of the creator or rights holder. Fair Use exists to advance certain forms of work and communication, such as journalism, teaching, and research, that are judged to have an overriding benefit to society without causing significant harm to the copyright holder.

Fair use relies on an analysis of four factors:

Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes
Nature of the copyrighted work
Amount and substantiality of the portion used in relation to the copyrighted work as a whole
Effect of the use upon the potential market for or value of the copyrighted work

Fair use analysis also considers how "transformative" the use is, a question courts usually weigh under the first factor. Simply reproducing another work within your own is generally a weak basis for a Fair Use argument, while uses that substantially remix, comment on, analyze, or parody the original are more likely to be considered Fair Use.

Source: U.S. Copyright Office Fair Use Index

Fair use is relevant to conversations about artificial intelligence because AI companies in particular have argued that incorporating copyrighted works into the datasets used to train AI tools falls under Fair Use. These companies argue that legal precedent supports using these works to train or "educate" AI models, and that the works are radically transformed once they are integrated into the overall corpus of training data, where they serve pattern recognition and the mapping of relationships between aspects of the works and their metadata.

Image: Screenshot of the book "Scholarly Communication Librarianship and Open Knowledge" on Google Books

Two of the major legal precedents were established in Authors Guild v. HathiTrust (2014) and Authors Guild v. Google (2015), which held that mass digitization of a large volume of in-copyright books in order to distill and reveal new information about the books was a fair use. The applications developed by HathiTrust and Google allowed users to browse content by linked metadata (such as discovering books by similar subjects, authors, and publishers) and to explore and analyze text (such as graphing keyword density within a given text or the frequency of certain words in a corpus over time). While these cases did not concern generative AI, they did involve machine learning and parallel the arguments used by AI companies to describe the transformation of works into data and then into tools and applications that have a distinctly different use than their original purpose as creative works.

However, despite these precedents and the arguable transformativeness of the works in this context, others argue that Fair Use cannot be applied to the ingestion of these works into AI tools and training datasets. They argue that the other factors weigh against Fair Use, pointing especially to the increasingly commercial nature of AI platforms and applications and to the potential for market harm. One of the key legal cases in this space is the lawsuit brought by the New York Times against OpenAI, the company that owns ChatGPT. The Times alleges that because OpenAI ingested New York Times content into its training data, potential subscribers to the newspaper might simply query the chatbot to receive news and analysis generated by Times journalists, rather than reading that content through the publisher's website or paper. This would, in turn, harm the Times' business, even though ChatGPT users are not simply reading verbatim versions of the NYT articles.

Copyright and Artificial Intelligence Part 3: Generative AI Training (pre-publication version)
A Report of the Register of Copyrights
U.S. Copyright Office | May 2025

Training Generative AI Models on Copyrighted Works Is Fair Use
by Katherine Klosek, Director of Information Policy and Federal Relations, Association of Research Libraries (ARL), and Marjory S. Blumenthal, Senior Policy Fellow, American Library Association (ALA)
American Library Association Office of Public Policy and Advocacy | January 23, 2024

First of Its Kind Decision Finds AI Training Is Not Fair Use
by Kevin Madigan
Copyright Alliance Blog | February 12, 2025