February 28, 2025 | Open Access

The State of Copyright in AI Training Datasets

Key Points

The study found that most images in the sampled AI training datasets are copyrighted or uncredited, raising legal concerns.
Of the 100 random images analyzed across 10 subtopics, many were not properly sourced or credited, indicating a lack of compliance.
The analysis traced each sampled image to its original source to determine its copyright status.
The findings point to the need for new laws and licenses that balance AI innovation with copyright protection.

Abstract

Generative AI systems that create images from text are being developed at a fast pace and require large amounts of training data. This data is scraped from the internet and, more often than not, contains copyrighted images that are used unlawfully or without the owner's permission. Legislation has not yet caught up with this practice. This study aims to determine how many images in a typical AI training dataset are copyrighted. A sample of 100 random images, drawn from training datasets and spanning 10 subtopics, was collected, and each image was traced to its original source to identify whether it is copyrighted. The results show that the majority of the images were copyrighted or not credited, and that some categories, such as art, had a higher percentage of copyrighted images than others. New laws are needed to govern the vast amount of data used in the AI industry, protecting the intellectual property rights of image owners without slowing AI innovation. The paper suggests a new type of license that would allow images to be used in datasets while compensating their owners. It also suggests building a chatbot that summarizes complex license terms of use in simple language.
