
We know that copyright prohibits us from using protected works in some situations, including downloading, copying, redistributing, and remixing. This is because the law protects a creator's or copyright holder's right to control the use of their work. They may want to be paid when others enjoy their work (such as by buying a book or renting a streaming movie) or use it in turn (such as by purchasing stock photography to create websites or advertisements). Creators might also wish to limit reuse of the work to venues, interpretations, and individuals that align with their own values or other priorities, regardless of financial benefit - like an author selecting a screenwriter or director to adapt their work, or a music artist approving whether their song can be used in a commercial or marketing campaign.
Despite this, we know that large numbers of copyrighted works have been incorporated into AI training datasets and algorithms, almost always without the knowledge, remuneration, or approval of the rights holders. Is this legal?
AI technologies and companies are so new and have developed so quickly that we lack clear written laws or legal precedents that would explicitly prohibit the ingestion of copyrighted works into training datasets. (For more discussion about the arguments for and against Fair Use as a basis for wholesale inclusion of copyrighted content into AI models, see the section on AI and Fair Use below.) We also lack tools and processes for identifying copyrighted works and their creators in datasets, as well as norms and structures for licensing works for this purpose.
In November 2022, the Concept Artist Association hosted a town hall with its membership of largely independent artists on the impact of the use of their work and the growing use of AI tools to create images and art, and on potential paths forward for affected creators.
We still lack laws that directly address most aspects of AI technologies. The first important cases objecting to the use of copyrighted works as training data have been filed and are working their way through the legal system. The results of these cases will affect our understanding of how existing laws can be interpreted and what grounds creators and publishers of creative works will have to claim infringement and seek damages for unauthorized use.
Read this: "Does ChatGPT Violate New York Times' Copyrights?" by Harvard Law Today
As society scrambles to catch up to the recent AI boom, the first attempts by technologists and internet service providers to control content scraping are also underway. The company Cloudflare, which provides cybersecurity and web services to a large percentage of global websites and internet users, has announced a new plan to control scraping by blocking bots and introducing a licensing structure that will give creators the option to receive revenue for content that they personally publish to the web, even on a small, independent scale such as a portfolio site or news blog. In the absence of legal deterrents that would prevent AI companies from ingesting the internet's content wholesale, technical barriers might be the most promising avenue for creators to exercise their copyrights.
If you are considering building your own dataset, consider constructing it from content with known rights. You may be able to license content or seek permission directly from rights holders, which will allow you to support their work and to know concretely, from the licensing agreement, the terms under which you can use their materials now and in the future. If you are unable to pay to license content, it is also possible to source openly licensed materials (such as those under a Creative Commons license) or public domain materials that will similarly give you certainty that your use has been approved by the creator or is otherwise permitted.
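One way to put the known-rights approach into practice is to record the rights status of every item alongside its source, and to keep only items whose status is on a list you have reviewed. The following is a minimal sketch, not a definitive implementation: the `Record` fields, the `ALLOWED_LICENSES` set, and the license labels in it are all hypothetical choices made for illustration, and which licenses actually suit your project depends on their terms.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical allowlist: rights statuses you have reviewed and accepted.
# The labels are illustrative; use whatever labels your own records carry.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "public-domain", "licensed-by-agreement"}


@dataclass
class Record:
    """One candidate item for the dataset (hypothetical structure)."""
    source_url: str
    license: Optional[str] = None        # None means the rights are unknown
    rights_holder: Optional[str] = None  # kept so attribution can be given later


def filter_known_rights(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into those with an allowed license and everything else."""
    kept: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        if record.license in ALLOWED_LICENSES:
            kept.append(record)
        else:
            # Unknown or unreviewed rights: leave out until permission is confirmed.
            rejected.append(record)
    return kept, rejected


candidates = [
    Record("https://example.org/photo-1", "CC-BY-4.0", "A. Photographer"),
    Record("https://example.org/essay-2", None),
    Record("https://example.org/scan-3", "public-domain"),
    Record("https://example.org/art-4", "all-rights-reserved", "B. Artist"),
]

kept, rejected = filter_known_rights(candidates)
print(f"kept {len(kept)} items, set aside {len(rejected)} for review")
```

Keeping the rejected items in a separate list, rather than discarding them, leaves a record of which rights holders you would need to contact if you later want to license that content.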
If you build a tool using an existing large language model (LLM) or other tool - such as ChatGPT or Gemini - you will be drawing upon a dataset composed of the kinds of copyrighted materials scraped from the internet that are discussed in this section. Should the owners or operators of these tools come to be considered liable for infringement, you may be in a legally dubious position and unable to continue using your tool, especially if you have commercialized it or allow others access to it. Purchasing or downloading a dataset that you then use to power a custom-built tool can put you in a similar position if you don't know the exact contents of that dataset or the means used to collect them.