Research Guides: Artificial Intelligence (AI) and Scholarly Communications: Licensing and Using Copyrighted Works

Licensing and Using Copyrighted Works in AI Applications

We know that copyright prohibits us from using protected works in some situations, including downloading, copying, redistributing, and remixing them. This is because the law protects a creator or copyright holder's right to control the use of their work. They may want to be paid when others enjoy their work (such as by buying a book or renting a streaming movie) or reuse it in turn (such as by purchasing stock photography to create websites or advertisements). Creators might also wish to limit reuse of the work to venues, interpretations, and individuals that align with their own values or other priorities, regardless of financial benefit. Examples include an author selecting a screenwriter or director to adapt their work, or a music artist approving whether their song can be used in a commercial or marketing campaign.

Despite this, we know that large numbers of copyrighted works have been incorporated into AI training datasets and algorithms, almost always without the knowledge, remuneration, or approval of the rights holders. Is this legal?

The first commercially viable and well-known AI training datasets were somewhat speculatively developed. Although their products are now household names and offer subscriptions and other paid services, the companies behind ChatGPT, Gemini, DALL-E, and other AI tools collected data to build and test models experimentally, not knowing how this novel technology would develop or whether the datasets would ultimately power commercial products. This has allowed them to argue that their use of copyrighted materials falls under the US doctrine of fair use, which allows for the use of works without permission in certain situations, such as education, journalism, and research, as long as specific conditions are met.

Building quality AI tools requires vast quantities of training material. For datasets of the necessary size, the financial cost and administrative burden of identifying and paying rights owners would likely prohibit the construction of these tools at all. From a technical perspective, we also lack the sort of global infrastructure (such as complete registries of works with contact information) that would make this possible, even if the companies wished to find rights owners and pay to license these works for training AI algorithms.

AI technologies and companies are so new and have developed so quickly that we lack clear written laws or legal precedents that would outright prohibit the ingestion of copyrighted works into training datasets. (For more discussion about the arguments for and against fair use as a basis for wholesale inclusion of copyrighted content in AI models, see the section on AI and Fair Use below.) We also lack tools and processes for identifying copyrighted works and their creators within datasets, as well as norms and structures for licensing works for this purpose.

In November 2022, the Concept Artist Association hosted a town hall with its membership of largely independent artists to discuss the impact on creators of the use of their work and of the growing use of AI tools to create images and art, along with potential paths forward.

+ Was/Is this legal and do copyright holders have any recourse if they find that their works have been used in this way?

We still lack laws that directly address most aspects of AI technologies. The first important cases challenging the use of copyrighted works as training data have been filed and are working their way through the legal system. The results of these cases will affect our understanding of how existing laws can be interpreted and what grounds creators and publishers of creative works will have to claim infringement and seek damages for unauthorized use.

Read this: "Does ChatGPT Violate New York Times' Copyrights?" by Harvard Law Today

As society scrambles to catch up to the recent AI boom, the first attempts by technologists and internet service providers to control content scraping are also underway. The company Cloudflare, which provides cybersecurity and web services to a large percentage of global websites and internet users, has announced a new plan to control scraping by blocking bots and introducing a licensing structure. This structure will give creators the option to receive revenue for content that they personally publish to the web, even on a small, independent scale such as a portfolio site or news blog. In the absence of legal deterrents that would prevent AI companies from ingesting the internet's content wholesale, technical barriers might be the most promising avenue for creators to exercise their copyrights.

+ What should I consider if I am developing an AI tool or training dataset of my own?

If you are considering building your own dataset, consider constructing it from content with known rights. You may be able to license content or seek permission directly from rights holders, which supports their work and gives you concrete terms, set out in a licensing agreement, for how you can use their materials now and in the future. If you are unable to pay to license content, you can also source openly licensed materials (such as those under a Creative Commons license) or public domain materials, which will similarly give you certainty about the terms under which you can use them.
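As a rough illustration of building from content with known rights, the Python sketch below separates candidate records whose rights are documented from those whose rights are unknown. Everything in it is an assumption made for illustration: the `Record` fields, the `split_by_rights` helper, and the identifiers in `ALLOWED_LICENSES` are hypothetical, and the set of licenses you can actually rely on depends on your project (for example, some Creative Commons licenses require attribution or restrict commercial use).

```python
from dataclasses import dataclass

# Hypothetical allowlist: license labels this example project has decided
# it may use. Choose your own set after reading each license's terms.
ALLOWED_LICENSES = {"public-domain", "cc0", "cc-by", "licensed-by-agreement"}


@dataclass
class Record:
    source: str      # where the item came from (URL, archive name, etc.)
    license_id: str  # rights label recorded at collection time; "" if unknown
    text: str        # the content itself


def split_by_rights(records):
    """Return (usable, needs_review) based on the recorded license label."""
    usable, needs_review = [], []
    for record in records:
        # Normalize the label so "CC-BY" and " cc-by " are treated alike.
        label = record.license_id.strip().lower()
        if label in ALLOWED_LICENSES:
            usable.append(record)
        else:
            # Unknown or unlisted rights: hold back until someone checks.
            needs_review.append(record)
    return usable, needs_review


if __name__ == "__main__":
    candidates = [
        Record("example-archive/item-1", "CC-BY", "An openly licensed essay."),
        Record("example-blog/post-7", "", "A post with no rights statement."),
    ]
    usable, needs_review = split_by_rights(candidates)
    print(len(usable), "usable;", len(needs_review), "held for review")
```

Holding back anything without a recognized label, rather than including it by default, keeps the dataset limited to material whose terms of use you can point to later. It does not replace reading the licenses themselves.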

If you build a tool using an existing large language model (LLM) or other tool - such as ChatGPT or Gemini - you will be drawing upon a dataset composed of the kinds of copyrighted materials scraped from the internet that are discussed in this section. Should the owners or operators of these tools come to be considered liable for infringement, you may be in a legally dubious position and unable to continue using your tool, especially if you have commercialized it or allow others access to it. Purchasing or downloading a dataset that you then use to power a custom-built tool can put you in a similar position if you don't know the exact contents of that dataset or the means used to collect them.

Resources

AI, Copyright & Licensing
Copyright Clearance Center
Licensing AI is not the answer—but it contains the answers
Tom Wheeler
Brookings | February 12, 2024