Generative AI: how sourcing data for training AI tests UK and EU intellectual property rules
Published on 20th Apr 2023
Training AI poses potential copyright and database right infringement risks for developers
Generative artificial intelligence (AI) has taken centre stage in 2023 and recent breakthroughs have prompted exploration and discussion in the tech world and beyond about how it might shape business, culture, education and much more.
Although it is often offered as a free tool for anyone to use, generative AI throws up a range of far-reaching legal issues, not least in relation to intellectual property (IP): how these powerful models are created, what they can produce and the implications of their output.
Transformative technology
AI encompasses some of the most powerful technologies of the modern world. Machine learning is a type of AI that partly writes and adjusts itself. This is achieved through an iterative "training" process, passing huge quantities of data through the system. The machine learning system generates a complex and detailed map of the patterns in the data within a structure known as a neural network.
Each new piece of data passed through the system causes each setting within the network to be mathematically recalibrated to the "least wrong" setting, based on the data seen up to that point. As a general rule, more training data produces a more accurate map of the patterns and therefore more accurate outputs. Finding patterns and extracting knowledge from data, including through techniques known as text and data mining (TDM), can be a powerful application for AI.
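The recalibration described above can be illustrated with a deliberately small sketch. It is not how any particular AI product is built: it assumes a "network" of just two settings (here called w and b, names chosen for illustration) and a hypothetical dataset following y = 2x + 1. After each data point, both settings are nudged in the direction that makes the prediction slightly less wrong.

```python
import random

def train(samples, learning_rate=0.01, epochs=200, seed=0):
    """Fit y ≈ w*x + b by adjusting w and b after every data point."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # untrained starting settings
    data = list(samples)
    for _ in range(epochs):              # pass over the data repeatedly
        rng.shuffle(data)
        for x, y in data:
            error = (w * x + b) - y      # how wrong the current settings are
            w -= learning_rate * error * x   # nudge each setting to reduce
            b -= learning_rate * error       # the error on this data point
    return w, b

def mean_squared_error(samples, w, b):
    """Average squared gap between predictions and the true values."""
    return sum(((w * x + b) - y) ** 2 for x, y in samples) / len(samples)

# Hypothetical training data generated from the rule y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]

error_before = mean_squared_error(samples, 0.0, 0.0)
w, b = train(samples)
error_after = mean_squared_error(samples, w, b)
print(f"w={w:.3f}, b={b:.3f}, error before={error_before:.2f}, after={error_after:.6f}")
```

The point for present purposes is that every data point in the training set is read and processed by the system, which is why the legal status of that data matters.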
Categorising AI systems
Trained AI systems can be split into two broad categories: classification and generative. An image classification system trained on labelled images of fruit, for example, would be able to distinguish a picture of an apple from one of a pear.
An image-making, "generative" system, by contrast, can create a digital image of an apple or a pear. Generative systems for creating images or text are becoming increasingly powerful and sophisticated in their outputs, and their impact on content creation and authorship is being much discussed.
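The distinction can be sketched in a few lines, on the loud assumption that each fruit is reduced to a single invented measurement (a height-to-width ratio) rather than a full image. Real systems work on images with neural networks; the data values and function names below are hypothetical and serve only to show that the same learned patterns can be used either to label an input or to produce a new one.

```python
import random
from statistics import mean, stdev

# Hypothetical labelled training data: (height-to-width ratio, label).
TRAINING = [
    (0.95, "apple"), (1.00, "apple"), (1.05, "apple"), (0.98, "apple"),
    (1.35, "pear"), (1.45, "pear"), (1.40, "pear"), (1.50, "pear"),
]

def fit(training):
    """Summarise each label by the mean and spread of its measurements."""
    by_label = {}
    for value, label in training:
        by_label.setdefault(label, []).append(value)
    return {label: (mean(vals), stdev(vals)) for label, vals in by_label.items()}

def classify(model, value):
    """Classification: return the label whose mean is closest to the input."""
    return min(model, key=lambda label: abs(model[label][0] - value))

def generate(model, label, rng):
    """Generation: draw a new, synthetic measurement resembling the label."""
    mu, sigma = model[label]
    return rng.gauss(mu, sigma)

model = fit(TRAINING)
rng = random.Random(0)

print(classify(model, 1.02))         # prints "apple"
print(generate(model, "pear", rng))  # a new value near the pear examples
```

In both cases the model is derived entirely from the labelled examples it was given.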
The dependence of most machine learning AI systems on datasets means that sourcing training data is a critical part of the development process. Dataset curation is becoming an area of expertise in itself. In addition to verifying quality, lack of bias and appropriate representation in the dataset, a crucial aspect is to ensure that all necessary legal rights to use the data have been properly investigated and secured. Where the data is personal data, privacy laws must be considered.
What issues arise where training data contains material that is protected by copyright or database rights, and what are the potential infringement risks?
Training data and IP rights
Since the creation of the internet and digital connectivity, the volume of data being created has increased exponentially. The internet contains vast quantities of data of almost limitless variety. However, some of it may be protected by IP rights, including copyright and database rights. Protected material is not free to copy, even if copying it is easy to do, for example through website scraping tools (such as software "bots" that trawl through the internet extracting data).
If the data constitutes a work that is protected by copyright then copying it without the consent of the rightsholder can amount to infringement of the copyright owner's reproduction right – that is, their right to control the making of copies of the work.
The types of work that may be protected by copyright are quite broad. They could be literary, dramatic, musical or artistic works, films, sound recordings or broadcasts, or databases.
Copyright protection can subsist in databases; however, these can separately and additionally be protected by database rights. Databases are broadly defined as "a collection of independent works, data or other material which – a) are arranged in a systematic or methodical way, and b) are individually accessible by electronic or other means". This could encompass websites, amongst other things. Qualifying databases will receive database right protection where there has been "substantial investment" in obtaining, verifying or presenting the contents of the database.
Database rights are infringed if a person extracts (which means the permanent or temporary transfer of the contents to another medium) or reutilises (which means making the contents available to the public) all or a substantial part of the contents of a protected database without the rightsholder's consent.
There are, however, some exceptions to both copyright and database right protection that allow use of the protected work or database in certain situations. The next Insight in this series will look at what exceptions might be available in the context of training AI.
Osborne Clarke comment
Where an AI system has been (or, in a due diligence context, might have been) trained on data that is subject to copyright or database rights or both, that data must be validly licensed unless an exception to infringement applies. AI users should seek assurance on this point from the AI supplier.
The second part of this series will consider possible IP infringement exceptions, while our concluding article will look at what the future could hold for IP rights and AI training data in the UK. The IP risks of AI will also be covered in a webinar during Osborne Clarke's IP Month.