Generative AI systems that create images from text are being developed rapidly and require large amounts of training data. This data is scraped from the internet and, more often than not, contains copyrighted images that are used either unlawfully or without the permission of the owner. Lawmakers have not yet caught up with this practice. This study aims to estimate how many of the images in a typical AI training dataset are copyrighted. A sample of 100 random images spanning 10 subtopics was collected from training datasets, and each image was traced to its original source to determine whether it is copyrighted. The results show that the majority of the images were copyrighted or not credited. Some categories, such as art, had a higher percentage of copyrighted images than others. New laws are needed to govern the vast amount of data used in the AI industry, protecting the intellectual property rights of image owners without slowing AI innovation. The paper suggests a new type of license that would allow images to be used in datasets while compensating their owners. A chatbot could also be built to summarize the complex terms of use of licenses in simple language.
Taimoor Khawaja (Fri,) studied this question.
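The sampling and tallying procedure summarized in the abstract could be sketched roughly as follows. This is a minimal illustration, not the study's actual code: the record fields (`topic`, `status`), the status labels, the equal split of 10 images per subtopic, and the placeholder data are all assumptions made for the example, and the copyright status of each image is assumed to have been assigned by manual source tracing beforehand.

```python
import random
from collections import Counter, defaultdict

# Hypothetical labels: statuses counted as "copyrighted or not credited".
FLAGGED = {"copyrighted", "uncredited"}


def stratified_sample(records, per_topic, seed=0):
    """Draw `per_topic` random records from every subtopic."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_topic = defaultdict(list)
    for rec in records:
        by_topic[rec["topic"]].append(rec)
    sample = []
    for topic in sorted(by_topic):
        sample.extend(rng.sample(by_topic[topic], per_topic))
    return sample


def flagged_share(sample):
    """Return {topic: percentage of sampled images with a flagged status}."""
    totals = Counter(rec["topic"] for rec in sample)
    flagged = Counter(rec["topic"] for rec in sample if rec["status"] in FLAGGED)
    return {topic: 100.0 * flagged[topic] / totals[topic] for topic in totals}


# Placeholder data only: 10 made-up subtopics with 50 images each.
# The statuses are assigned mechanically and carry no real-world meaning.
STATUSES = ["copyrighted", "uncredited", "free"]
records = [
    {"topic": f"topic_{t}", "image_id": i, "status": STATUSES[(t + i) % 3]}
    for t in range(10)
    for i in range(50)
]

sample = stratified_sample(records, per_topic=10)
for topic, pct in sorted(flagged_share(sample).items()):
    print(f"{topic}: {pct:.0f}% flagged")
```

Sampling per subtopic rather than from the pooled dataset keeps each category equally represented, which is what makes a per-category comparison (for example, art versus other categories) meaningful at this sample size.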