6  Designing and Developing AI Technologies

Learning outcomes

By the end of this chapter, learners will be able to assess how various kinds of decisions by software developers and by the organizational stakeholders commissioning an AI system affect its use of personal data.

Once an organization has an initial idea of what it expects AI to do, it can start the work of building (or acquiring) the technologies needed for that purpose. To do so, the organization must make decisions about the various technical components of an AI system. What components will be used to build this system? How do these components connect to one another? How will they be integrated into existing computer systems within an organization? What data will be used to train the model powering the system? What data will be used in its day-to-day operation? Those choices are just a few of the technical decisions that impact how an AI system or model processes personal data.

Section 6.1 offers some preliminary considerations about the technical processes through which AI systems are developed. Then, Section 6.2 discusses how the developers and designers of AI systems are classified under the GDPR and the AI Act, as that classification affects the duties that apply to them. After that, Section 6.3 details those duties with regard to the acquisition of data for the development process. Finally, Section 6.4 outlines how the data protection principles of the GDPR can be applied in the AI development process and points out the rules that apply to the processing of personal data in that process.

6.1 The software development process

Data protection professionals can face various difficulties in evaluating the decisions made at this stage of the AI life cycle. Some of them relate to the technical complexity of the development of AI systems and models. The topics covered in Part I of this book are geared towards enabling collaboration with technical experts, but they do not capture the full technical nuance of every issue. Hence, it is necessary to maintain an ongoing dialogue with software developers and engineers within an organization.

Further difficulties come from the fact that the design and development process can take various forms:

  • In agile software development, systems and models are developed iteratively. Starting from an initial idea of what the technical product should do, the technical team creates a first version, which is then refined with additional development work. In this process, both the system and the technical requirements change as time goes by, and there is a tendency to avoid formal documentation of decisions.
  • In waterfall software development, requirements are exhaustively defined at the beginning of the life cycle. Once that is done, the development process follows a linear sequence of stages: programming only begins after all requirements have been defined, the software is tested only after everything has been programmed, and deployment only happens when a system has been fully tested.

Most AI systems and models are developed somewhere in between those two extremes, combining elements of agile practices with more traditional development models.1 As such, any list of technical decisions to be monitored would likely include some steps that are not followed in practice within a given organization or omit relevant development practices.

To illustrate the kinds of practices that a data protection professional must attend to, this unit focuses instead on two kinds of technical decisions that are relevant from a data protection perspective.

6.1.1 Data processing within the AI training process

Some technical decisions at this stage result in the actual processing of personal data. For example, UNw might decide that it needs to use data about individual students to create a model that can forecast their risk of failure in difficult courses (so as to propose support measures to those students). Any processing of personal data during the training process, as at any other moment, remains in principle covered by data protection law.

Not all kinds of personal data processing, however, are covered by EU data protection law. Article 2(2) GDPR lists four kinds of processing that lie outside the regulation’s scope:

  1. In the course of an activity which falls outside the scope of Union law: this carveout is unlikely to apply to AI systems processing personal data. Since the AI Act lays down rules on how AI systems are placed on the market, put into service, or used within the EU, any systems covered by it are within the scope of EU law.
  2. By the Member States carrying out activities within the scope of the EU’s Common Foreign and Security Policy. This exception will not apply to most public or private uses of AI, either.
  3. By a natural person in the course of a purely personal or household activity, an exception that must be construed narrowly (see, for example, Papakonstantinou and de Hert 2023).
  4. By competent authorities in the criminal law contexts covered by Directive (EU) 2016/680, which itself offers a set of data protection safeguards.

Any other processing of personal data is covered by the GDPR.2 So, the application of its provisions can only be avoided by training a system solely on non-personal data, a possibility we discuss later in this chapter.

6.1.2 Determining the means for future data processing

The second kind of relevant technical decision concerns choices that will affect how the AI system or model will process personal data once it is placed into service or otherwise used. Those decisions stipulate certain aspects of the system’s functioning, such as the following (a short code sketch after the list illustrates one of these choices):

  • The training algorithm that will be used to create an AI model.
  • The training, test, and validation datasets that will be processed by that algorithm.
  • The metrics that will be used to evaluate the training process (see Unit 7).
  • The software libraries that will be used to implement the model or system.
  • The choice of the input parameters that will be given to an AI system.
  • The interfaces between the AI systems and other systems operated by an organization.
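To make the second of those choices more concrete, here is a minimal sketch of how a training/validation/test split might look in code. The dataset, column names, and split ratios are hypothetical; the point is that every record in all three partitions is processed during training and, if it relates to an identifiable student, counts as personal data.

```python
# Hypothetical illustration: splitting student records into training,
# validation, and test sets. All three partitions contain personal data
# if the records relate to identifiable students.
import pandas as pd
from sklearn.model_selection import train_test_split

student_records = pd.DataFrame({
    "avg_grade":     [7.5, 4.2, 6.1, 8.9, 5.0, 3.8],
    "attendance":    [0.9, 0.5, 0.7, 0.95, 0.6, 0.4],
    "failed_course": [0, 1, 0, 0, 1, 1],  # label the model should predict
})

features = student_records.drop(columns=["failed_course"])
labels = student_records["failed_course"]

# Hold out a test set first, then carve a validation set from the remainder.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
```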

All of those are choices: rarely can any of those technical problems be solved in only one way, which means that two systems (or models) created in response to the same requirements can have vastly different technical arrangements. One thing these choices have in common, however, is that none of them, by itself, amounts to the processing of personal data.

Still, they shape how an AI system or model functions. Different technical arrangements will process data in diverse ways, and lead to different outcomes. Consider a situation where DigiToys can choose between two systems that allow their toys to interact with children. One of them allows for smoother interaction, but it demands that the toy collect considerable amounts of data and is prone to occasional errors. The other affords a more limited set of interactions with children but needs less data and does not create as many errors, while still being more interactive than the competitor’s toys. The choice between those two options will affect how much data DigiToys’s products will process in the future.

Still, those future-looking decisions remain covered by the GDPR. Under Article 25(1) GDPR, data controllers are required to address the risks stemming from processing “at the time of the determination of the means for processing”, not just when it takes place.3 If the AI system or model falls within the scope of the AI Act’s rules for high-risk AI systems or general-purpose AI models, there are additional rules that must be observed before a system can be placed on the market, put into service or used. Legal compliance is not a matter for the moment when an AI system finally processes data, but something that must be considered throughout the entire life cycle of any system or model.

6.2 The legal roles of AI developers

Learning outcomes

By the end of this section, learners will be able to distinguish between forms of software development involved in the creation of an AI system and classify those providers under the GDPR and the AI Act.

Once an organization decides it needs an AI system, it can obtain one in a few ways:

  • It can develop the system in-house, creating a solution tailored to its own needs. For example, InnovaHospital might use its extensive collection of radiological data to create a system that automates the reading of scans for certain diseases.
  • Alternatively, the organization might decide its needs can be addressed by technologies available on the market:
    • By fine-tuning those tools. For example, the professors at UNw might decide that they can create an automated system for answering student questions by starting from ChatGPT and doing some extra training to fine-tune it to the specific topics of the courses they teach.
    • By integrating ready-made systems into their existing infrastructure. For example, DigiToys might license the use of a data analytics system to process all the data it collects from the toys, connecting that system to its databases via an application programming interface (API), as sketched in the code example after this list.
  • Or it might procure the entire system from outside sources.
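As a minimal illustration of the integration route, the sketch below sends toy telemetry to an external analytics service over an API. The endpoint, field names, and helper function are all hypothetical; the point is that even such “glue” code processes personal data and can embed safeguards such as pseudonymization.

```python
# Hypothetical illustration: forwarding toy usage events to an external
# analytics API. Even this "mere integration" code processes personal data.
import hashlib
import requests

ANALYTICS_ENDPOINT = "https://analytics.example.com/v1/events"  # hypothetical URL

def pseudonymize(child_id: str, secret_salt: str) -> str:
    """Replace a direct identifier with a salted hash. Note that this is
    pseudonymization, not anonymization: the data remains personal data."""
    return hashlib.sha256((secret_salt + child_id).encode()).hexdigest()

def send_usage_event(child_id: str, event: dict, secret_salt: str) -> None:
    payload = {"user": pseudonymize(child_id, secret_salt), **event}
    requests.post(ANALYTICS_ENDPOINT, json=payload, timeout=10)
```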

The first two items entail that an organization is doing some form of software development, while the last is meant to minimize technical work. Even in the latter case, successfully incorporating an AI system requires various uses of personal data and various kinds of technical skill. Going back to the examples above:

  1. InnovaHospital will need to ensure it has software developers who can handle the construction of an AI model from the potential training data, as well as the integration of that model into the system. It will also need to determine whether the data it uses meets the criteria for personal data and, if so, comply with the applicable requirements.
  2. For the solutions based on ready-made components:
    1. The professors at UNw will need to collect the data that is relevant for their application and figure out how to carry out the additional training on ChatGPT, a process that is simpler and less expensive than training an entire large language model. They will also need to evaluate compliance with personal data requirements.
    2. DigiToys will not need to do any AI-specific software development. Still, it must evaluate whether the programming it does to connect the AI system with its existing systems processes personal data. For example, it might be the case that the system receives personal data for its operations.

To the extent that the data created, used, or otherwise processed during those processes relates to an identified or identifiable natural person, it will qualify as personal data. Likewise, the technical decisions made during those development processes become relevant to data protection law to the extent that the ensuing AI systems or models store or otherwise process personal data. Hence, the organizations developing and designing AI technologies have obligations regarding the processing of personal data during the training process.

Under both the GDPR and the AI Act, an organization’s obligations depend on the role it plays in processing. Within the AI Act, classification is relatively straightforward. Anyone who develops an AI system or model is a provider, with a few exceptions. Likewise, anyone using an AI system under their own authority qualifies as a deployer, except in the case of personal non-professional use. Classification within the GDPR regime is slightly more nuanced.

6.2.1 AI developers as controllers or processors

If an organization is developing an AI system or model for its own, internal use, classification is straightforward. From a data protection perspective, the organization meets the definition of a data controller regarding both present and future processing:

  • The developer is the one determining why, when, and how personal data will be processed during the training.
  • The technical choices it makes will determine the means through which the AI system will process personal data in the future.

Classification under the GDPR becomes more complex when an organization develops an AI system or model intended for the use of others. During the training stage, the role of the developer will depend on the degree of independence of its actions. If the buyer provides detailed instructions on how the developer must conduct any data processing during the training process, the developer organization becomes more of an executor of the buyer’s will than an independent controller of the processing in training. Conversely, the responsibility of the developer grows in accordance with the amount of discretion it is afforded when it comes to determining the means for processing.

Suppose DigiToys decides to hire InnovaHospital to create an AI system that can diagnose respiratory illnesses in children, which will be incorporated into a new line of toys:

  • Given the expertise of each organization, the toy company might decide to adopt a hands-off approach and leave the hospital free to choose what kinds of data processing are needed to train the model. In that case, InnovaHospital is still the controller of that processing from a legal perspective.
  • It might be the case, instead, that DigiToys decides to provide strict instructions on whether and how the hospital is to process personal data. For example, the contract between the two might supply detailed stipulations of what is to be done during the development process. If those stipulations meet the requirements of Article 28(3) GDPR, InnovaHospital’s discretion is extremely limited. Hence, control of processing rests with the toy company, and the hospital is merely a processor.
  • Many cases fall in-between those two extremes. For example, InnovaHospital might have considerable liberty to make its technical choices but rely on some data provided by DigiToys. Or both organizations might collaborate in determining the technical specifications of the system’s data and algorithms. In such cases, a data protection professional needs to check whether the situation amounts to joint controllership of the processing.

6.2.2 Responsibility for subsequent processing

If an AI system or model is created for the use of others, its developer might be tempted to think they have no obligations regarding this subsequent use. After all, their system or model is just the technical means used by somebody else to process personal data, and it is that other party who determines the means and purposes of processing. However, there are some circumstances in which a developer might have a role in the use of the AI system:

  • Joint controllership might emerge if the developer is also involved in determining the purposes for processing. For example, if DigiToys and InnovaHospital are both involved in the decision to add the diagnostic medical tool to the toy, then the hospital is involved in the determination of both the means (because of its role in determining the technical arrangements of the AI system) and the purposes of processing, thus meeting the elements of controllership.
  • The developer might instead be a processor for those subsequent instances of processing. For example, it is common nowadays to see AI-as-a-service arrangements, in which the buyer acquires access to an AI-powered tool on a subscription or pay-per-use basis instead of having to run their own system.

Both cases make developers potentially responsible, to a lesser or greater extent, for harmful outcomes stemming from the use of the AI system or model they provide. Therefore, a data protection professional cannot take for granted that the developer is entirely detached from any subsequent processing performed with their AI system.

6.2.2.1 Dividing responsibilities between developers and (other) controllers

When external processors are involved, or even in cases of joint controllership, it is necessary to clarify how responsibilities are divided. Under Article 26 GDPR, joint controllers are required to come to an arrangement between them on how to assign those responsibilities, unless such an assignment is made by EU or national law. Similarly, Article 28(3) GDPR provides a quite extensive list of elements that must be present in the contract between a data controller and a data processor. Those requirements remain unchanged when it comes to the relationship between an AI developer and downstream actors relying on their products.

Yet, one must be aware of the strong asymmetry that exists between developers and buyers in particular contexts. Some of the most advanced AI technologies that exist today, such as the large language models discussed in Unit 13, require massive amounts of data and computing resources for their construction. As such, the state of the art is concentrated in the hands of a few economic actors, who often offer their products through take-it-or-leave-it contracts. Data protection professionals will therefore need to evaluate elements such as what kinds of liability are excluded by their organization’s contract with a provider, what kinds of information are supplied, and whether their organization will be able to fully discharge its data protection duties under the terms of the contract. Those and other questions cannot be fully exhausted by a single training module, but the sessions of this module supply a starting point for finding out what aspects need to be verified before hiring (or offering) an AI system or model in the market.

6.3 Securing personal data for AI systems

Learning outcomes

By the end of this section, learners will be able to distinguish between various sources of personal data for AI systems and examine whether the organization has a legal basis for processing that data for the construction of an AI system.

Data is essential for AI systems and models. In machine learning models, the rules that reside at the core of the model are derived from the statistical patterns present in the training data, which are then generalized. But even AI systems powered by other types of models, such as knowledge-based systems, will still need input data to generate their outputs, which often amount to data themselves. So, to the extent that those forms of data relate to identified or identifiable natural persons, an AI system or model will be steeped in personal data.

However, the data needed to create an AI system or model is not always easy to come by. This is especially true when it comes to large-scale technologies such as large language models, which have already been trained on basically every freely available piece of data on the internet (Kuru 2024). But it is also the case for smaller models. For example, InnovaHospital might struggle to develop an AI-based predictor for a given illness if there are only a few known cases of that disease.

Additionally, not all data is created equal. Some sources might accurately capture an object of analysis, while others might supply badly measured or even deliberately misleading information. For example, data scraped from an online forum will likely reflect the biases and prejudices of that forum’s users. This is why some of the major players in AI technologies have emphasized high-quality data as a competitive differentiator.

In this context, any organization wishing to develop an AI system or model—for its own use or for others—needs to consider how much data it has available for that purpose. It might be the case that an organization has enormous amounts of data it can apply to this new purpose. But it might also be the case that an organization must acquire new sources of data, either because it lacks the precise kind of information it needs or because existing sources are inadequate. In both cases, the organization will need to fulfil some legal requirements before it can use that data.

During the design and development stage of an AI system, personal data is most likely to be processed in the training processes of a machine learning model. As discussed in Section 2.1, many of the modern applications of AI rely on machine learning, and as such their decision rules are learned from data. So, if a model is expected to take personal data as input or generate it as output, its training will likely require some personal data.

All legal bases for processing listed in Article 6 GDPR remain theoretically viable for AI systems and models. However, many of them stipulate that the processing must be “necessary” for the performance of some task. Given the narrow interpretation of necessity that prevails in data protection law, Articles 6(1)(b–e) GDPR are unlikely to sustain large-scale processing for the use of AI. In most cases, this means data controllers will need to rely either on the data subject’s consent or on a legitimate interest that justifies processing. Both options demand considerable work from the organization.

6.3.2 Legitimate interest as a basis for training AI systems

As an alternative to the difficulties of consent, some organizations have considered the use of the legitimate interest basis for training AI systems.4 This legal basis also authorizes processing that is “necessary” for a purpose: the pursuit of legitimate interests by the controller or by a third party. That basis does not apply when such interests are overridden by the interests or fundamental rights of the data subject. For example, the pursuit of an economic interest might not justify severe intrusions into the right to a private life, especially the life of a child. Furthermore, this basis is unsuitable for the processing of special categories of personal data, as Article 9(2) GDPR does not feature a general clause on legitimate interest.

In the absence of such an override, the controller is required to weigh the legitimate interests being pursued against the rights and interests that might be affected by the processing (Sartor and Lagioia 2020). This weighing follows the same procedure used for legitimate interest in other contexts. What changes is that it must consider AI-specific risks, such as the ones examined in Units 3 and 4 of this training module. As such, legitimate interest might allow more flexibility for AI developers, at the cost of requiring them to exercise more responsibility in analysing the consequences of their development choices (Kramcsák 2023). Future guidance from data protection authorities will likely clarify the use of this legal basis. In the meantime, data controllers need to exercise particular caution when relying on it for AI.

6.3.4 Processing special categories of personal data in high-risk AI systems

As discussed above, there is no general clause permitting processing on the basis of legitimate interest when it comes to special categories of personal data. This means, for example, that data about a natural person’s health cannot be processed on the grounds of legitimate interest. This creates a challenge for some kinds of applications, such as the medical diagnosis tools envisaged by InnovaHospital and its partners.

To some extent, this challenge is mitigated by the exceptions listed in Article 9(2) GDPR. Coming back to the example above, processing that is necessary for the purposes of medical diagnosis is covered by Article 9(2)(h) GDPR. However, the term “necessary” must be read narrowly, as a broad reading would considerably reduce the level of protection offered by the provision. As a result, some scholars have pointed to considerable uncertainty about whether additional data could be used to mitigate the risk of biases in an AI system (see, for an overview, van Bekkum and Borgesius (2023)).

Article 10(5) AI Act is aimed precisely at this gap. It allows the processing of special categories of personal data in the training of high-risk AI systems if that processing is necessary for detecting and correcting biases. Whenever this exception is invoked, the following conditions must be met:

  • the bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic or anonymised data;
  • the special categories of personal data are subject to technical limitations on the reuse of the personal data, and state-of-the-art security and privacy-preserving measures, including pseudonymisation;
  • the special categories of personal data are subject to measures to ensure that the personal data processed are secured, protected, subject to suitable safeguards, including strict controls and documentation of the access, to avoid misuse and ensure that only authorised persons have access to those personal data with appropriate confidentiality obligations;
  • the special categories of personal data are not to be transmitted, transferred or otherwise accessed by other parties;
  • the special categories of personal data are deleted once the bias has been corrected or the personal data has reached the end of its retention period, whichever comes first;
  • the records of processing activities pursuant to Regulations (EU) 2016/679 and (EU) 2018/1725 and Directive (EU) 2016/680 include the reasons why the processing of special categories of personal data was strictly necessary to detect and correct biases, and why that objective could not be achieved by processing other data.

Hence, the AI Act provides legal clarity about the possibilities for using special categories of personal data during the training process but imposes considerable constraints in doing so.

6.4 Processing data in AI development

Learning outcomes

By the end of this section, learners will be able to discuss how the data protection principles are affected by AI technology and identify AI-specific data protection rules.

Once an organization secures a legal basis for all the data it intends to use, it still has data protection obligations. After all, data protection law does not merely specify when data can be processed. It also lays down requirements for processing. As discussed in the introduction to this Unit, those requirements must be observed both when the means for processing are determined and when actual processing takes place. Hence, the design and the development of AI systems are highly relevant from a data protection perspective.

To briefly recapitulate the discussion in Chapter 2, it is useful to distinguish between the various elements of an AI system that are determined at this life cycle stage:

  • The components that will form an AI system, such as AI models, the hardware that will be used to execute those models, or the system’s interfaces with other systems.
  • The AI model(s) that will power that system.
  • If the organization is programming its own model, or fine-tuning an existing one:
    • The training data from which the model will infer its rules.
    • The learning process it will follow for that inference.
  • The validation data against which the model will be assessed.
  • The interfaces through which potential users might use the system.

Those and other choices are directly relevant for data protection when they involve personal data during the training stage. Even if that is not the case, they might be relevant if the system is intended to process personal data. Either way, the involvement of a data protection professional at the design and development stage can prevent many headaches later.

In this section, we will cover issues that arise when interpreting the GDPR’s rules and principles in contexts involving AI. By necessity, any such treatment is partial, as much depends on the specifics of where AI will be developed and used, as well as on the techniques being employed. Some examples are included to mitigate this limitation, suggesting how the learner can deepen the general guidelines offered here.

6.4.1 Applying data protection principles in design choices

The starting point for this inquiry is Article 5 GDPR, which lays down general principles applicable whenever personal data is processed. As principles, they do not offer clean-cut commands that one can either obey or not. Instead, their legal content is more abstract. They outline certain values that must be promoted, acknowledging that those values might be weighed differently in each case (Roßnagel and Richter 2023). For example, Article 10(5) AI Act reflects the idea that, in the context of high-risk AI systems, fairness in the AI outputs can take precedence over strict data minimization. Compliance with data protection principles thus requires a balancing act between values in a concrete context.

What changes when AI comes into play? The general logic of principles remains the same, but AI systems and models transform the technical context of processing. Their impact can be felt in each of the GDPR’s data protection principles.

6.4.1.1 Lawfulness, fairness, and transparency (Article 5(1)(a) GDPR)

The principle of lawfulness emerges as a cross-cutting principle. It establishes that any processing must both be allowed by law and follow the applicable legal requirements (Roßnagel and Richter 2023). Article 5(1)(a) GDPR highlights two facets of lawfulness: fairness and transparency. Both are affected by the use of AI.

When it comes to fairness, a developer must pursue two interrelated goals. First, it must ensure the fair processing of any data used to train the AI system. That is, the developer must act in a way that justifies the trust of the data subjects whose data is used in training. For example, if InnovaHospital decides to use patient data, it must do so in a way that does not mislead patients, provides adequate safeguards for their data, and does not harm them.

Developers must also ensure the fairness of the finished system or model. In particular, this principle compels developers to mitigate (or even eliminate) potential sources of algorithmic bias that might harm the rights and interests of those affected by the use of an AI system. For example, fair processing in the context of UNw’s AI systems would require the university to adopt metrics, such as those discussed in Section 7.1, to detect whether the system has a disparate impact on some group of students (for example, by discriminating against female students). Because the construction of the AI system sets, to a large extent, the means of its future processing, the data protection principles must also be observed as technical choices are made.
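As a minimal sketch of what such a metric might look like, the code below computes the gap in selection rates between two groups, a simple demographic parity check. The prediction and attribute arrays are hypothetical; Section 7.1 discusses fuller metrics.

```python
# Hypothetical illustration: a simple demographic parity check on the
# outputs of a model that flags students as "at risk" (1) or not (0).
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # model outputs
gender = np.array(["f", "m", "f", "f", "m", "m", "f", "m"])  # protected attribute

rate_f = predictions[gender == "f"].mean()
rate_m = predictions[gender == "m"].mean()

# A large gap in selection rates is one warning sign of disparate impact.
print(f"Selection rate (female): {rate_f:.2f}")
print(f"Selection rate (male):   {rate_m:.2f}")
print(f"Demographic parity difference: {abs(rate_f - rate_m):.2f}")
```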

The principle of transparency is analysed more closely in Unit 11 of this training module. For the time being, it suffices to say that developers must attend not only to the transparency of how they process data but also to the transparency of further processing done with the AI system.

6.4.1.2 Purpose limitation (Article 5(1)(b) GDPR)

The principle of purpose limitation means that data must be collected for specified, explicit, and legitimate purposes, and that any further processing must not be incompatible with the original purpose. Its implications for the development process were unpacked in the previous section.

6.4.1.3 Data minimization (Article 5(1)(c) GDPR)

The data minimization principle establishes that personal data must be adequate, relevant, and limited to what is necessary for the purposes of processing. Each of these elements has implications for the use of AI technologies.

Regarding adequacy, the developer must ensure it is using data of sufficient quality for the task at hand. For high-risk AI systems, this principle is further specified in Article 10 AI Act, and the measures presented therein (discussed below) can be a useful guide for organizations developing other kinds of systems or models.

The relevance element, in turn, suggests a developer should be able to tell whether and how the data they are bringing to the training process is relevant. For example, if UNw wants to predict the performance of its students in the courses they are taking, it has little reason to acquire training data from a broker that has collected information on the social media habits of those students.

Finally, the necessity element suggests that developers should not use a data-intensive solution when a solution that requires less data is available. In the context of automated decision-making, for example, it has been argued that complex black box models do not always perform better than simpler alternatives (see, e.g., Semenova et al. 2022). Whenever that is the case, the developer would do well to question whether the advantages of the more complex model justify its additional data demands. But, if simpler models cannot achieve the same result, this principle is not, in itself, an obstacle to the use of data-intensive AI.
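The sketch below illustrates one way this necessity check could be operationalized: comparing a simpler model against a more data-intensive one on the same task and preferring the simpler model when performance is comparable. The dataset is synthetic and the tolerance threshold is purely illustrative.

```python
# Hypothetical illustration: prefer a simpler model when a complex one
# adds little predictive value, in line with data minimization.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

simple = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
complex_ = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5).mean()

TOLERANCE = 0.02  # illustrative threshold for a "comparable" result
if complex_ - simple < TOLERANCE:
    print(f"Prefer the simpler model ({simple:.3f} vs {complex_:.3f})")
else:
    print(f"The complex model is materially better ({complex_:.3f} vs {simple:.3f})")
```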

6.4.1.4 Accuracy (Article 5(1)(d) GDPR)

We will examine this principle in more detail in Section 7.1, where we discuss metrics that can be used to capture accuracy. Once again, this principle means that accuracy must be ensured both for the data used during the training process and for the AI system (or model) that will produce personal data in future uses.

6.4.1.5 Storage limitation (Article 5(1)(e) GDPR)

In an AI context, this principle relates mostly to the data surrounding the system or model itself. The training, test, and validation data are all subject to storage limitation, as are the input data fed to the finalized system and the outputs it produces. If the AI model memorizes personal data to some degree, then the developer must also include mechanisms to ensure that the memorized data does not outlive its necessity.
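As a minimal sketch of such a mechanism, the code below attaches a retention period to each (hypothetical) dataset file and deletes any file that has outlived it. A real implementation would also need to cover backups, logs, and any data memorized by the model itself.

```python
# Hypothetical illustration: purging dataset files that have exceeded
# their retention period. File names and periods are invented.
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional

RETENTION = {
    "training_set.csv": timedelta(days=365),
    "validation_set.csv": timedelta(days=365),
    "inference_logs.csv": timedelta(days=90),
}

def purge_expired(data_dir: Path, now: Optional[datetime] = None) -> None:
    now = now or datetime.now()
    for name, period in RETENTION.items():
        file = data_dir / name
        if file.exists():
            created = datetime.fromtimestamp(file.stat().st_mtime)
            if now - created > period:
                file.unlink()  # delete data that has outlived its necessity
```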

6.4.1.6 Integrity and confidentiality (Article 5(1)(f) GDPR)

In an AI context, this principle entails that developers must attend to the security risks outlined in Chapter 3. For example, any organization developing an AI system must consider whether its system is vulnerable to model inversion attacks that would allow the extraction of personal data. If that is the case, mitigation measures become necessary.
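A full model inversion attack is beyond the scope of a short example, but the sketch below shows a much simpler, related check a developer might run: a naive memorization test that compares the model’s confidence on training records with its confidence on held-out records. The data and model are synthetic placeholders, and a large gap is only a rough warning sign, not a conclusive assessment.

```python
# Hypothetical illustration: a naive memorization check (not a full model
# inversion attack). A large confidence gap between training and held-out
# data is one warning sign that personal data might be extractable.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_held, y_train, y_held = train_test_split(X, y, test_size=0.5, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)

def mean_confidence(model, X, y):
    """Average probability the model assigns to the true label."""
    probs = model.predict_proba(X)
    return probs[np.arange(len(y)), y].mean()

gap = mean_confidence(model, X_train, y_train) - mean_confidence(model, X_held, y_held)
print(f"Train vs held-out confidence gap: {gap:.3f}")
```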

6.4.2 Additional obligations for high-risk AI systems

The general principles outlined above are given concreteness in the GDPR’s rules. In particular, Articles 25 and 32 GDPR require the developers of AI systems to adopt technical and organizational measures that implement those principles, as we will discuss in Chapter 12. Data subject rights, which we cover in Section 8.2, are also guided by those principles. Before wrapping up this chapter, we will now briefly discuss the data management obligations introduced by the AI Act.

Under Article 10 AI Act, the provider of a high-risk AI system must adopt a variety of data governance measures. Article 10(2) defines a set of data governance and management practices that must be observed. Any provider using data in training a high-risk AI system must have oversight and control of how data is used, especially regarding the items below (a schematic documentation sketch follows the list):

  • the relevant design choices;
  • data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection;
  • relevant data-preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment and aggregation;
  • the formulation of assumptions, in particular with respect to the information that the data are supposed to measure and represent;
  • an assessment of the availability, quantity and suitability of the data sets that are needed;
  • examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination prohibited under Union law, especially where data outputs influence inputs for future operations;
  • appropriate measures to detect, prevent and mitigate possible biases;
  • the identification of relevant data gaps or shortcomings that prevent compliance with [the AI Act], and how those gaps and shortcomings can be addressed.
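The AI Act does not prescribe a format for documenting these practices. As a purely illustrative sketch, the structure below records the Article 10(2) items in code; the field names paraphrase the provision and the example values are invented.

```python
# Hypothetical illustration: structured documentation of the Article 10(2)
# data governance items. The format is our own; the AI Act mandates none.
from dataclasses import dataclass

@dataclass
class DataGovernanceRecord:
    design_choices: str
    collection_and_origin: str      # incl. original purpose for personal data
    preparation_operations: list    # annotation, labelling, cleaning, ...
    assumptions: str                # what the data is meant to measure/represent
    availability_assessment: str    # quantity and suitability of the datasets
    bias_examination: str           # possible biases and their likely impact
    bias_mitigation_measures: list
    data_gaps_and_remedies: str

record = DataGovernanceRecord(
    design_choices="Supervised classifier for respiratory-illness screening",
    collection_and_origin="Voice samples collected with explicit consent",
    preparation_operations=["labelling by clinicians", "noise filtering"],
    assumptions="Voice changes correlate with early respiratory symptoms",
    availability_assessment="12,000 samples; younger age groups underrepresented",
    bias_examination="Risk of lower accuracy for children aged 2-4",
    bias_mitigation_measures=["age-stratified sampling", "per-group error reporting"],
    data_gaps_and_remedies="Acquire additional samples for underrepresented ages",
)
```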

Data quality requirements appear in Article 10(3). Under this provision, the training, validation, and testing data sets must be:

  • Relevant
  • Sufficiently representative
  • To the extent possible
    • Free of errors
    • Complete in view of the intended purpose

The relative character of the latter two requirements is crucial, given that perfect data does not exist. Nonetheless, the provision forces providers of high-risk AI systems to pursue completeness and accuracy in their datasets.

Finally, Article 10(4) requires providers to use data sets that take into account certain contextual elements. To the extent that those elements are required by the system’s purpose, the datasets must consider characteristics or elements that are “particular to the specific geographical, contextual, behavioural or functional setting within which the high-risk AI system is intended to be used.” For example, UNw must take into account the socioeconomic characteristics of its student body, while InnovaHospital must consider (among other things) whether some diseases it wants to diagnose with AI are affected by geographic factors.

Those measures are meant as quality criteria for the data used in the training process of an AI system. Being targeted at high-risk AI systems, they are not mandatory for any other type of system or model. Even so, they represent best practices that organizations might want to consider as a starting point for designing their own data governance architecture.

6.5 Conclusion

In late 2024, the Irish Data Protection Commission requested that the European Data Protection Board produce an opinion on the processing of personal data during AI development and training. The resulting opinion was adopted by the EDPB on 17 December 2024, right as the first version of this training module was being finalized. The guidance offered above should therefore be read in light of those new regulatory guidelines. Nonetheless, the discussions above offer a high-level overview of the data protection issues that arise when training or developing an AI system.

The key takeaways from the previous discussion are:

  1. Unless it is acting strictly under detailed instructions from a buyer, a developer will likely qualify as the data controller of the data it processes in the training process.
  2. Depending on the circumstances under which an AI system or model is commercialized, the developer might also qualify as a joint controller for subsequent data processing.
  3. Design choices must ensure the protection of personal data both regarding the processing that takes place in the training process and the future processing that will be done with an AI system or model.
  4. Most uses of data during the training processes will likely be based on consent or legitimate interests, which means developers need to pay close attention to whether the requirements of those bases are satisfied.
  5. The particularities of AI affect the interpretation of the various data protection principles, which nonetheless remain in force.
  6. The data governance measures in Article 10 AI Act are obligatory for high-risk AI systems but they can also be useful for developers of other systems.

The three sessions of the Unit illustrate how data protection professionals can play a vital role in shaping the development process. If they collaborate closely with technical experts, they can do more than point out the unlawfulness of processing. They can help the organization find lawful bases for using the data it already has available, propose safeguards to ensure AI is used in a way that respects the rights of data subjects (including but not limited to the right to data protection), and make sure that design decisions are properly documented for future demonstrations of compliance. Each of those practices contributes to the lawful use of the developed AI systems and models, be it by the developer itself or by third parties.

Exercises

For Exercises 1 and 2, please consider the following situation. Suppose DigiToys decides to incorporate into its devices a functionality that detects traces of respiratory diseases based on changes to a child’s voice. Lacking the medical expertise to create that device alone, DigiToys enters into a contract with one of InnovaHospital’s research divisions, granting it considerable access to data collected by the toys. In exchange for that access, and for a share of the profits, the researchers will develop this diagnosis tool.

Exercise 1. For data protection purposes, when would InnovaHospital’s division be likely to be considered a data processor during the training process?

  • a. If they carry out development at their own discretion but use only DigiToys’s datasets.
  • b. If they sell the ensuing diagnosis systems to other toy manufacturers afterwards.
  • c. Always, regardless of the buyer’s input.
  • d. If it makes use of DigiToys’s data to create other products for children’s healthcare.
  • e. If it carries out its development according to detailed instructions from DigiToys.

Exercise 2. In the data processing that takes place during the training of this AI system, what would DigiToys’s role likely be according to the GDPR?

  • a. Sole controller
  • b. Processor
  • c. Data subject
  • d. Joint controller
  • e. No specific role under GDPR

Exercise 3. What does GDPR Article 6(4) require when reusing personal data for AI training?

  • a. A compatibility assessment between the original and new purposes.
  • b. Explicit consent from all data subjects involved.
  • c. Avoiding reuse of any personal data collected prior to GDPR enforcement.
  • d. Using anonymized data to bypass legal obligations.
  • e. Documenting the reprocessing without additional compliance steps.

Exercise 4. Which of the following practices best aligns with the data minimization principle in AI system design?

  • a. Selecting a less accurate AI model that requires fewer data inputs.
  • b. Auditing the training data fed into a model to make sure it contributes to performance improvements.
  • c. Collecting extra data to account for potential future needs.
  • d. Retaining all training data indefinitely for potential reuse.
  • e. Using low-quality data that can be collected with minimal intrusion to privacy.

Exercise 5. How does Article 10 AI Act change the GDPR’s rules on the bases for personal data processing?

  • a. It stipulates that, when systems have an elevated risk of being biased, they can only rely on anonymized data.
  • b. They ensure transparency by requiring that high-risk AI systems only rely on data collected from open data sources.
  • c. They allow the use of special categories of personal data for the purpose of avoiding biases in decision-making, as long as the controller adopts certain safeguards.
  • d. They exempt the provider from needing a specific basis for processing data when training a high-risk AI system.
  • e. They require explicit consent from data subjects before their data can be used to train high-risk AI systems.

6.5.1 Prompt for reflection

Securing a valid legal basis for processing personal data is critical during AI development. However, consent and legitimate interest both present challenges, particularly for large-scale or high-risk systems.

  • In your opinion, which legal basis (consent or legitimate interest) is more practical for training AI models in different sectors (e.g., education, healthcare, or commercial AI)? Why?
  • Reflect on a case study like DigiToys or InnovaHospital—what factors should these organizations consider when choosing a legal basis?

6.5.2 Answer sheet

Exercise 1. Alternative E is correct. Even though the final classification will always depend on the specifics of each case, the GDPR’s approach to data processors requires a detailed specification of tasks. The factors present in other alternatives might suggest a situation of joint controllership or even, in some of them, sole controllership by InnovaHospital.

Exercise 2. Alternative D is correct. The description suggests that both DigiToys and InnovaHospital are involved in determining what will be done with the data and how, even though most of the execution falls to the hospital.

Exercise 3. Alternative A is correct. Article 6(4) GDPR provides a non-exhaustive list of criteria that must be considered when evaluating the compatibility between the original and new purposes.

Exercise 4. Alternative B is correct. Data minimization does not require providers to pursue less accurate systems, as accuracy is also a data protection principle that must be balanced with minimization. Alternatives C and D run directly counter to the principle.

Exercise 5. Alternative C is correct. Article 10 AI Act introduces a new legal basis for the processing of sensitive data but requires some safeguards for that processing.

References

Marvin van Bekkum and Frederik Zuiderveen Borgesius, ‘Using Sensitive Data to Prevent Discrimination by Artificial Intelligence: Does the GDPR Need a New Exception?’ (2023) 48 Computer Law & Security Review 105770.

CNIL, ‘Q&A on the Use of Generative AI’ (18 July 2024). Accessed 26 September 2024.

CNIL, ‘Determining the Legal Qualification of AI System Providers’ (7 June 2024).

Data Protection Commission, ‘AI, Large Language Models and Data Protection’ (18 July 2024). Accessed 26 September 2024.

Ralf Kneuper, Software Processes and Life Cycle Models: An Introduction to Modelling, Using and Managing Agile, Plan-Driven and Hybrid Processes (Springer International Publishing 2018).

Pablo Trigo Kramcsák, ‘Can Legitimate Interest Be an Appropriate Lawful Basis for Processing Artificial Intelligence Training Datasets?’ (2023) 48 Computer Law & Security Review 105765.

David Lehr and Paul Ohm, ‘Playing with the Data: What Legal Scholars Should Learn About Machine Learning’ (2017) 51 UC Davis Law Review 653.

Silverio Martínez-Fernández and others, ‘Software Engineering for AI-Based Systems: A Survey’ (2022) 31 ACM Trans Softw Eng Methodol 1.

Vagelis Papakonstantinou and Paul de Hert, ‘Art. 2. Material Scope’ in Indra Spiecker gen. Döhmann and others (eds), General Data Protection Regulation: Article-by-Article Commentary (Beck; Nomos; Hart Publishing 2023).

Alexander Roßnagel and Philipp Richter, ‘Art. 5. Principles relating to processing of personal data’ in Indra Spiecker gen. Döhmann and others (eds), General Data Protection Regulation: Article-by-Article Commentary (Beck; Nomos; Hart Publishing 2023).

Giovanni Sartor and Francesca Lagioia, ‘The Impact of the General Data Protection Regulation (GDPR) on Artificial Intelligence’ (European Parliamentary Research Service, 2020).

Lesia Semenova and others, ‘On the Existence of Simpler Machine Learning Models’ in 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22, New York, NY, USA, Association for Computing Machinery 21 June 2022).

Suzanne Snoek and Isabel Barberá, ‘From Inception to Retirement: Addressing Bias Throughout the Lifecycle of AI Systems. A Practical Guide’ (Rhite and Radboud Universiteit 5 September 2024).

Rob van der Veer, ‘ISO/IEC 5338: Get to know the global standard on AI systems’ Software Improvement Group. Accessed 26 September 2024.


  1. Additionally, safety-critical systems such as those used in the aviation sector are often subject to particularly strict practices in their development process. Analysing those practices goes beyond the scope of the present training module.↩︎

  2. Except for EU institutions, bodies, offices, and agencies, which are covered by a dedicated regulation.↩︎

  3. For more on this topic, see Unit 13 of this course, as well as (Almada et al. 2023).↩︎

  4. Narrower clauses authorizing processing in some cases are present in that provision. However, the “necessity” requirement must be considered when deciding on the extent of training that can be carried out.↩︎
