2 Core Concepts of Artificial Intelligence
By the end of this unit, learners will be able to discuss technical concepts of AI and explain to technical stakeholders how those concepts are relevant to data protection debates.
In the previous chapter, we discussed how AI creates risks and opportunities that are relevant for compliance with data protection obligations. To understand how AI does so, one must understand how AI-powered technologies work. This is what we will do in this chapter.
The first step in that discussion is to examine what we mean when we talk about “artificial intelligence.” For some people, AI conjures ideas of helpful technologies, such as the personal assistants in our smartphones. For others, it creates apocalyptic ideas out of science fiction, such as robots rebelling against their human masters. But AI can also make people think about very real risks, such as those covered in Chapter 1. So, the term can mean different things to different people, and those impressions are often coloured by fiction and by individual experiences. A clear discussion of the impacts of AI requires common ground for debate.
For the purposes of our training module, when we talk about “AI” we are talking about a technical practice. That is, “artificial intelligence” is what computer scientists, statisticians, and other technically minded people do when they want to solve certain technical problems. For example, if one wants to create a recommender system, they can use various approaches to do so, such as creating a machine learning model based on the consumption habits of the users of a platform. Under this definition, it makes no sense to say that “an AI” did something, because AI is an abstraction.
It follows from this definition that an analysis of the legal relevance of AI should be more specific. It should name the techniques and the technical objects that are of interest because different technical choices can have different impacts in the real world. For example, creating an AI system based on machine learning technologies requires a considerable amount of data, but it can lead to the successful performance of tasks that were not feasible with previous expert systems. This chapter provides some of the terms that data protection professionals need to know in order to make relevant distinctions.
Our goal for the following three sections is not to turn data protection professionals into technical experts. Because of the complexity of AI technologies, developing such a competence would require time and effort that are not reasonable to expect from data protection professionals who are already overloaded. In fact, a narrow technical introduction to AI concepts can be misleading, as it might obscure complexities that appear in the real world. Furthermore, the specifics might soon become outdated as technology evolves. Instead, this chapter offers an introduction to basic concepts of the technological side of AI.
Those concepts can play two roles.
- Concepts can help with critical reasoning regarding technologies. As we shall see in Chapter 4 of this book, AI technologies (as any other technologies) do not always live up to what they promise, and knowing where to look can help us not to be swindled by sales pitches.
- A good grasp of the terminology can be useful for dialoguing with technical experts within an organization, as well as with contractors.
Therefore, the basis offered by this chapter should remain useful in practice even after the technologies that are the state of the art in 2025 are retired.
To that end, this chapter focuses on three aspects of AI technologies.
- Section 2.1 looks under the hood of AI technologies and defines the procedures that are used to create them.
- Section 2.2 then discusses the relationship between data and artificial intelligence.
- Section 2.3 introduces the technical infrastructures that allow all that to function.
After that, the conclusion to this chapter briefly considers the interplay between these three factors.
2.1 How AI works
By the end of this section, learners will be able to distinguish between the main technical approaches used to build AI systems and identify the core features of each approach.
The logic that guides AI technologies can sometimes seem arcane. For example, the internet is full of examples where a chatbot is fooled into giving a silly answer to a question because that question is phrased in a peculiar way. However, the details of those AI technologies shape how they work and produce effects in practice. As such, a good understanding of them is essential for properly applying the relevant law to their design and use. To support this understanding, we will begin with the core of what makes AI unique — its algorithms and models.
At its essence, an AI system is a type of computer program, executed by a computer in the same way as any other software. Like all computer programs, AI technologies rely on algorithms. An algorithm is simply a set of step-by-step instructions that tell the computer how to solve a problem or perform a specific task. You might think of it like a recipe. Given certain ingredients (input data), the algorithm tells you what steps to take to prepare a dish (the output). A familiar example is the long division algorithm, which provides a series of steps to divide one number by another, producing both a quotient and a remainder.
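To make the recipe metaphor concrete, here is a minimal sketch of the long division algorithm written as a Python function (one of many possible ways to express those steps):

```python
def long_division(dividend: int, divisor: int) -> tuple[int, int]:
    """Divide digit by digit, as in the pen-and-paper recipe."""
    quotient, remainder = 0, 0
    for digit in str(dividend):
        remainder = remainder * 10 + int(digit)          # "bring down" the next digit
        quotient = quotient * 10 + remainder // divisor  # how often does the divisor fit?
        remainder = remainder % divisor                  # carry the rest forward
    return quotient, remainder

print(long_division(1234, 7))  # (176, 2), since 7 * 176 + 2 == 1234
```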
In the context of AI, the term “algorithm” is often used to refer to the entire decision-making process of the AI system. For instance, someone might say, “The algorithm recommended this video to me,” even though the result is not produced by a single set of steps, requiring instead the combination of various algorithms as parts of the platform. This kind of shorthand reflects the leading role algorithms play in AI technologies, as they define the rules and logic that produce the system’s outputs.
A huge portion of modern AI systems relies on machine learning, a type of AI technique where the specific algorithm for producing outputs is not manually programmed by a developer. Instead, the algorithm that generates the outputs is itself configured by a learning algorithm that processes enormous amounts of data to learn patterns that can be generalized for future decisions. Although there are other approaches to AI, such as expert systems that rely on pre-defined rules, machine learning has been the dominant force behind recent AI advancements. As such, we will focus our discussion on machine learning.
2.1.1 Machine learning approaches
The term machine learning refers to a broad family of ways to create AI systems. For the purposes of this training module, it is important to distinguish between three main classes of approaches:
2.1.1.1 Supervised learning
Supervised learning is the most common type of machine learning. In supervised learning approaches, the algorithm is trained using a labelled dataset, which means that the input data comes with corresponding correct outputs (labels). The system learns by comparing its predictions to the correct answers and adjusting its internal model to reduce errors over time. For example, a supervised learning algorithm might be trained to recognize cats in photos by being shown thousands of images labelled “cat” or “not cat.” Through this process, the system learns to generalize from these examples and can eventually identify whether a new, unlabelled photo contains a cat.
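To make this concrete, the following is a toy sketch of supervised learning using the scikit-learn library. The ‘has whiskers’ and ‘has pointy ears’ features are invented for illustration; a real image classifier would learn from pixel data rather than hand-crafted flags.

```python
from sklearn.linear_model import LogisticRegression

# A tiny labelled dataset: each input is paired with the correct output.
# The features ("has whiskers", "has pointy ears") are invented 0/1 flags.
X_train = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y_train = ["cat", "not cat", "not cat", "not cat", "cat", "not cat"]

# The learning algorithm compares its predictions to the labels and
# adjusts the model's internal parameters to reduce errors.
model = LogisticRegression().fit(X_train, y_train)

# The trained model generalizes to a new, unlabelled example.
print(model.predict([[1, 1]]))  # e.g. ['cat']
```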
2.1.1.2 Unsupervised learning
Unsupervised learning approaches also train algorithms on data, but without using labelled responses. Instead of learning from examples, the algorithm tries to find patterns or structures within the data itself. One common use of unsupervised learning is in clustering, where the algorithm groups similar data points together. For instance, a company might use unsupervised learning to segment customers into distinct groups based on their purchasing behaviour, even if the system was not told what kinds of groups to look for.
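A minimal sketch of clustering, again assuming the scikit-learn library is available; the purchasing figures are invented for illustration:

```python
from sklearn.cluster import KMeans

# Invented purchasing behaviour: [orders per month, average basket in EUR].
# No labels are supplied; the algorithm looks for structure on its own.
customers = [[2, 15], [3, 18], [25, 120], [30, 110], [1, 12], [28, 130]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # e.g. [0 0 1 1 0 1]: two customer segments emerge
```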
2.1.1.3 Reinforcement learning
Reinforcement learning approaches train algorithms through their interaction with a physical or virtual environment. As the algorithm interacts with that environment, it receives feedback on its actions, allowing it to learn from trial and error. Successful actions lead to rewards, while mistakes lead to penalties. An example of reinforcement learning is training an algorithm to play a video game: the algorithm tries different strategies, learns from the rewards (such as points scored in the game), and improves its play over time.
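The following is a deliberately tiny sketch of one reinforcement learning technique (tabular Q-learning) in a toy ‘corridor’ environment invented for illustration. The algorithm never sees the environment’s rules; it only learns from the rewards its actions produce.

```python
import random

random.seed(0)

# A toy environment: states 0..4 on a line, starting at 0. Reaching state 4
# ends the episode with a reward. Actions: 0 = step left, 1 = step right.
def step(state, action):
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else 0.0  # success is rewarded; nothing else is
    return next_state, reward

q = [[0.0, 0.0] for _ in range(5)]  # estimated payoff of each (state, action) pair
alpha, gamma = 0.5, 0.9             # learning rate and discount factor

for episode in range(200):          # learn from repeated trial and error
    state = 0
    while state != 4:
        action = random.randrange(2)             # explore by acting randomly
        next_state, reward = step(state, action)
        # Nudge the estimate towards the feedback just received.
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

print([row.index(max(row)) for row in q[:4]])  # learned policy: step right -> [1, 1, 1, 1]
```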
2.1.2 From algorithms to models
The result of this learning process is an AI model, which is a representation of what the system has learned from the data. That is, the AI model is the technical object that one obtains when applying a training algorithm to the data. By training a general algorithm on a specific dataset, it becomes possible to adjust the parameters of that algorithm in order to suit the task(s) it is expected to perform. As a result, the fully-trained model contains the decision-making logic that an AI system uses when processing new inputs.
One common type of AI model is the neural network, which is inspired by the structure of the human brain. A neural network is made up of layers of artificial neurons, each of which performs a simple computation. The neurons are organized in layers, and the output from one layer serves as the input to the next. During training, the neural network adjusts the connections between neurons to improve its performance on the given task. Therefore, the output generated by a neural network is the result of applying the trained algorithm to the data it receives.
Let us illustrate how a neural network works in practice. Imagine a system designed to recognize handwritten digits. When you provide an image of a handwritten number, the neural network processes the image through multiple layers of neurons. The neurons in the first layer receive the input data, and each neuron considers that data in its own way, becoming activated if it identifies a feature it has been trained for. For example, one neuron might be optimized for recognizing the closed balls that appear in some digits (such as 6 or 9), while another might recognize the straight segments present in digits like 1 and 7.[^fn-simp] The output of these neurons in the first layer becomes part of the inputs for the neurons in the second layer. In this example, the neurons of the first layer might have generated the outputs “this digit has a straight part” and “this digit has no closed curve”, which the neurons on the second layer would combine to generate more abstract features. The process repeats through each layer, until the information reaches the final layer of the network, which will produce its final output, such as saying that a particular digit is a 5 or a 3.
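For readers who want to see the mechanics, the following is a toy sketch in Python (using the NumPy library) of how data flows through such layers. The network below is untrained (its weights are random numbers), so its output is meaningless; it only illustrates the layered computation just described.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # a neuron "activates" only for sufficiently strong input

rng = np.random.default_rng(0)

# A toy network for 28x28-pixel digit images (784 inputs): the first layer
# turns pixels into 16 low-level features, the second combines them into
# 10 scores, one per digit. Real weights would come from training.
w1, b1 = rng.normal(size=(16, 784)), np.zeros(16)
w2, b2 = rng.normal(size=(10, 16)), np.zeros(10)

image = rng.random(784)            # stand-in for a handwritten digit image
features = relu(w1 @ image + b1)   # first layer: detect simple features
scores = w2 @ features + b2        # second layer: combine features into scores
print(int(np.argmax(scores)))      # the network's "guess" (meaningless while untrained)
```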
That description is a considerable simplification of what is going on within the neural network. In particular, the features recognized by specific neurons are unlikely to match the features that humans rely on when identifying numbers. The models rely on how information has been codified into the data, passing it through the structure obtained through the training process. As a result, a neural network — or, in fact, any AI model — does not “know” the answer in the way a human does. It can generate correct outputs, but it does so by applying complex mathematical transformations to the input data based on patterns it has seen before. This means that while neural networks can be highly effective, they can also be opaque or difficult to interpret, a phenomenon often referred to as the “black box” problem, as we will discuss in Chapter 4.
2.1.3 From models to systems
An AI model is an object that can be used to perform the task(s) for which it was trained. Many models are created for a specific purpose: the sample neural network described above can only recognize digits, and one would have to train an entirely new model to recognize dogs. In recent years, however, there has been a growing number of general-purpose AI models, which are trained for a variety of tasks. For example, OpenAI’s GPT family of language models can generate several types of content, such as conversations in which they interact with humans or long texts about many subjects. These models are sometimes called foundation models, as they work as a building block for many types of AI systems.
So, what distinguishes an AI model from an AI system? Sometimes, the terms are used interchangeably. Yet, the AI Act distinguishes between them, as do some technical sources. Following this distinction, the AI model is a component that allows the AI system to carry out the tasks that we think of as “artificial intelligence” tasks. For example, a recommender model allows a social media platform to suggest posts to a user based on that user’s previous interactions with content. An AI system (at least one based on machine learning) will include an AI model, but it will also feature other components. So, the difference between them is akin to the difference between an engine and a complete car.
To get from an AI model to an AI system, one needs to add various kinds of components:
- To operate, an AI model needs access to input data, which might be collected from various sources or provided by user interactions. For example, a recommendation system in an online platform might draw on user preferences and browsing history to suggest new content, and that information is collected by tools such as cookies.
- Once a model operates, its outputs need to be delivered somewhere. This can be a database where records are stored, a chatbot interface, or a dashboard displaying predictions or recommendations, among other possibilities.
- An application might interact with the AI model through an API (application programming interface). An API is a set of rules and protocols that allows different software applications to communicate with each other. It acts like a bridge, enabling one program to request data or services from another without needing to understand the internal workings of the other system. For example, many of the applications powered by large language models such as GPT-4o do not replicate those models in the application itself but communicate with a centralized model through an API, as the sketch after this list illustrates.
- As we shall see later in this book, effective AI systems include monitoring tools to track performance and detect any issues that might arise in real-world use, such as shifts in data quality or unexpected model behaviour.
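To illustrate the API component mentioned in the list above, here is a sketch of how an application might request an output from a remotely hosted model. The endpoint URL, key, and payload fields are hypothetical; each real provider documents its own formats.

```python
import requests

# Hypothetical endpoint, key, and payload for illustration only; each real
# provider documents its own URLs, authentication scheme, and formats.
API_URL = "https://api.example.com/v1/generate"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Suggest three posts for user 42."},
    timeout=30,
)
response.raise_for_status()
print(response.json()["output"])  # the model's reply, returned over the network
```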
Those are just some examples of components that can have an impact on how a system functions. Even if they are not powered by AI techniques themselves, they can affect the impact an AI system has in the world. As such, they become directly relevant when one is assessing that system’s compliance with legal requirements.
In short, AI systems are driven by algorithms, with machine learning algorithms playing a dominant role in recent advancements. These systems learn from data, creating models that represent patterns and relationships. While this approach offers powerful capabilities, it also comes with challenges, particularly in terms of transparency, data privacy, and potential biases. By understanding the basic concepts of AI algorithms, data protection professionals can better navigate the complexities of AI technologies and advocate for practices that protect individuals’ rights.
2.2 Personal data in AI systems
By the end of this section, learners will be able to identify the various roles personal data plays in AI systems: as inputs for the training process, as inputs for their use, and as outputs of the system’s operation.
As a technology-neutral regulation, the GDPR largely refrains from distinguishing processing in the training process from other kinds of processing. Yet, the specific uses of data in the creation and use of AI systems and models raise some concerns that are not present in other types of data processing, or at least are not as salient there. For example, the large volumes of personal data used to create high-end AI models can lead to massive privacy breaches if that data somehow leaks. Those issues coexist with more general issues, such as the need to find a legal basis for the processing of any personal data used in this context. This section supports data protection professionals by offering a brief introduction to how personal data can come into play in AI.
To put it shortly, personal data can play three roles when it comes to AI systems:
- Personal data can be an input to the operation of an AI system. For example, a recommender system might take information about the personal interests of a user in a social media platform to find out what content that user would like to see.
- Personal data can also be the output of the operation of an AI system. For example, an AI system designed to produce risk scores for a crime (such as financial fraud) receives information about an individual and then ascribes to that individual a risk score that represents their likelihood of committing that crime.
- Personal data can be a building block for an AI system or model. For example, a machine learning model that is intended for the kinds of tasks above will likely be trained on data about individuals that are relevant for the problem, such as platform users and previous investigations of financial fraud, respectively.
As the examples suggest, those uses are often interconnected. A system that is meant to process personal data will likely generate outputs that can be associated with individuals,[^fn-agg] and personal data will be used in its construction process to ensure the quality of its outputs. In this section, we will look at the various approaches organizations can use to obtain data for their AI systems. Before that, however, we will briefly discuss the roles data can play in the construction of an AI system.
Following technical practices, Article 3 AI Act distinguishes between three types of data sets that are relevant in the construction of an AI system:
- Training data refers to the data to which the learning algorithm is applied, that is, to the data from which the patterns contained in the finished model are generated.
- In the case of a supervised learning model, this will usually be a set of examples that pair some input data with the expected output.
- For unsupervised learning models, no expected outputs are provided, just the input data.
- For reinforcement learning, one does not supply expected responses, but the system must be given information about the payoff of different options.
- Validation data is used for tuning the trained model, allowing the model builders to choose between different learning processes and strategies. For example, it allows builders to avoid the phenomenon of overfitting, in which a model learns rules that describe the training set well but do not generalize beyond it.
- Testing data is used for evaluating the overall performance of the AI system before it can be sold or placed into service. That is, it provides a base for evaluating the system after any technical validation processes.
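To make these roles concrete, a common practice is to obtain all three datasets by splitting a single labelled dataset, as in this sketch using the scikit-learn library (a synthetic dataset stands in for real data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A synthetic stand-in for a labelled dataset: inputs X, expected outputs y.
X, y = make_classification(n_samples=1000, random_state=0)

# Carve out a held-back test set first, then split the remainder into
# training data (to fit the model) and validation data (to tune it).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```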
For AI systems that are not built from machine learning techniques, testing data will still be necessary to evaluate their performance in the intended test cases. If one or more of those datasets contains personal data, data protection law is likely applicable to their processing. And, since the learning process and the comparison of test data with model outputs both require processing, this means data protection becomes relevant for the training process, too. Hence, we will now consider how organizations might secure data for their needs as they build and use AI.
2.2.1 Directly collecting data
An organization can start measuring some kinds of data that are relevant for the application they want to develop. That data can take various forms, such as:
- Measuring user interactions: For example, DigiToys might collect data on how often children interact with their toys, or on their speech patterns, for the design of product updates.
- Analysing internal data: For example, the UNw can use its raw data about students to generate metrics, which might later be fed into an AI system.
- Creating new data from the combination of existing sources: For example, InnovaHospital might integrate patient data from different branches of its operations to obtain a holistic view of patient health.
When it collects that data, the organization becomes a data controller for the operations involved in gathering the data and directing it towards AI development.
2.2.2 Reutilizing personal data
Some organizations amass personal data as part of their operation. For example, a hospital cannot carry out its core functions without information about its patients. That data might be an asset for the development of AI technologies, but its use is subject to legal constraints that are discussed later in this section.
A few data quality issues might reduce the usefulness of previously available data:
- Relevance: one needs to evaluate whether the dimensions captured in existing data are relevant for the problem the AI system or model is meant to solve. For example, the UNw university might use data about the courses each student follows to schedule its purchase of library books, but that data might not be particularly useful for creating a chatbot.
- Assumptions embedded in data: despite what the term “raw data” might suggest, even the most comprehensive datasets contain some assumptions in them: what data is relevant enough to be stored, how should this variable be measured, how to treat missing values, and so on. If unchecked, those assumptions can create problems. For example, if InnovaHospital wants to create a tool for supporting the diagnosis of heart attacks, that tool must account for the differences in symptoms between men and women. Otherwise, it might focus on the metrics that usually reflect male symptoms and fail to serve more than half of the population.
- Errors, outdated data, and missing data: one must be aware of what issues are present in the existing dataset and how they are managed. For example, how does DigiToys treat duplicated information received from toys? What error correction mechanisms does it adopt on the transmitted data?
2.2.3 Acquiring data from third-party brokers
Many organizations (the so-called “data brokers”) have a business model that is based on the commercialization of data about individuals and organizations. If an organization decides to acquire data from them, it should exercise caution. The same data quality issues outlined above remain relevant here.
Additionally, one must consider whether the broker has lawfully obtained control of that data and whether there are legal bases for the transfer. Some models of brokerage have already been questioned from a legal perspective, leading to some enforcement decisions and ongoing cases. Hence, an organization needs to exercise due diligence when procuring data from third parties and consider how their AI system or model will be impacted if that business model is found not to comply with the GDPR.
2.2.4 Building synthetic data
Sometimes, an organization cannot rely on fully anonymized data. If an application involves the profiling of natural persons, for instance, it cannot be trained or used without some form of reference to such a person. For example, an AI system for medical diagnoses will eventually be used on someone, generating a piece of personal data about them (their health status). Given that the use of large-scale personal data for such applications can be risky, some organizations have proposed the use of synthetic data as an alternative.
Because synthetic data does not refer to an actual person (identified or identifiable), it would fall outside the GDPR’s definition of personal data. So, to the extent that the synthetic data offers a faithful reproduction of the population to which the AI system applies, it would allow the use of AI without creating data protection risks.
The exemption from data protection law only applies if the data is actually synthetic. If it is possible to find information about natural persons based on the synthetic dataset, it remains covered by data protection law. This is the case even if the values ascribed to that person do not match reality. For example, consider a situation in which a synthetic database keeps the real names of people for credit scoring, but assigns them random values for each metric. That database will not allow an observer to discover correct information about the named individuals. Still, it associates that information to their identities, and the GDPR’s definition of personal data features no exception for incorrect information.
Even if the data itself has no association with an identified or identifiable natural person, data protection law might also apply to its generation. This is the case if the synthetic data is generated from a dataset containing information about actual natural persons. While the ensuing database might not be personal data, creating it requires the processing of personal data. For example, InnovaHospital might create a synthetic dataset from some of its medical records. In that case, the hospital remains obliged to follow the GDPR as it creates the dataset, though the subsequent use of that dataset might not be covered by it.
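To see why the generation step involves processing personal data, consider this deliberately naive sketch, in which synthetic records are sampled from distributions fitted to (hypothetical) real patient data. Serious synthetic-data techniques model the data much more carefully and test for re-identification risks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real records: [age, systolic blood pressure] of actual patients.
# Generating synthetic data from them is itself a processing of personal data.
real = np.array([[34, 118], [51, 135], [47, 128], [62, 142], [29, 110]])

# A naive generator: sample each column from a normal distribution fitted to
# the real data. Serious methods model the joint distribution instead.
synthetic = rng.normal(loc=real.mean(axis=0), scale=real.std(axis=0), size=(1000, 2))
print(synthetic[:3].round(1))  # plausible-looking, but fictitious, records
```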
Regardless of its legal classification, synthetic data remains subject to the data quality issues raised above. This kind of data is not a silver bullet for the construction of AI. Still, it can be useful if used judiciously.
2.3 The technical infrastructure of AI
By the end of this section, learners will be able to distinguish between the various components of the “stack” that supports the execution of an AI system.
While discussions about AI often focus on algorithms, models, and data, it is essential to understand that these are all abstractions — simplified representations of what is happening under the hood. Ultimately, an AI system is a computer program, which relies on the underlying technical infrastructure to function. This infrastructure includes not just the computers executing the code but also the networks and storage systems that provide the necessary resources. In this section, we will introduce the main elements of this infrastructure and discuss how they can matter for data protection purposes.
2.3.1 Computing power as a need for AI
Let us start with the concept of compute. In technical terms, compute refers to the processing power required to run an AI program. Compute power is what allows an AI system to process data, execute complex algorithms, and generate outputs.
While a typical laptop might be sufficient for running simple AI tasks, the training process of more sophisticated AI models — such as those used in more complex tasks of natural language processing or image recognition — requires much greater compute power. Depending on the scale of the model, even running it after training can demand many resources. These tasks often rely on specialized hardware like Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs), which are designed to handle the heavy computational loads involved in AI training and inference.
One measure that is often used to capture how much compute is used is that of floating-point operations (FLOPs). Without going into much technical detail, a FLOP is a type of mathematical operation that happens within a computer processor. Training a large AI model requires a substantial number of these operations. For example, the rules on systemic risk under the AI Act apply (by presumption) to advanced models trained with more than 10²⁵ FLOPs, that is, more than ten septillion of those mathematical operations. A few of the models that exist nowadays, such as Google’s Gemini or OpenAI’s GPT-4o, are said to exceed this threshold.
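To get a sense of these magnitudes, one can use a rule of thumb that is common in the technical literature (an approximation we adopt here for illustration, not a legal standard): training compute is roughly six FLOPs per model parameter per training token. A short calculation shows how quickly the threshold is reached:

```python
# Rule-of-thumb estimate (an approximation, not an exact count):
# training FLOPs ~ 6 x (model parameters) x (training tokens).
params = 1e12  # a hypothetical model with one trillion parameters
tokens = 2e12  # a hypothetical training run over two trillion tokens

flops = 6 * params * tokens
print(f"Estimated training compute: {flops:.1e} FLOPs")         # 1.2e+25
print("Above the AI Act presumption threshold:", flops > 1e25)  # True
```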
As of 2024, most of the compute costs of an AI model are incurred during the training process. However, as some studies suggest (Erdil 2024), there is a trade-off between compute during training and compute at inference time, that is, at the moment when an AI system is expected to generate its outputs. There are strategies that allow model builders to reduce the costs involved in training, but at the expense of increasing the number of operations that a trained AI system must perform to generate output.
This trade-off can have implications for organizations using pre-trained AI systems. Each FLOP a processor executes costs a tiny bit of energy and takes some time. The amounts for each operation are vanishingly small, but, as we have seen, there are many operations involved even in the simplest AI tasks. This means that a model optimized to reduce compute costs at inference time can be cheaper to use, even if that optimization came at a greater expense to its creator. Conversely, developers might reduce their training costs in a way that makes the finished AI system more expensive to run.
2.3.2 Memory and storage of data in AI systems
Compute is not the only physical factor at play when it comes to AI systems. Those systems rely heavily on memory and storage, that is, on physical supports that allow a computer to store and process information. The information that needs to be preserved includes not just the system’s output and its input, but the intermediary steps involved in the enormous number of calculations described above. As a result, both the training and use of AI systems can be dependent on the availability of means for memory and storage.
Memory is used for temporarily holding data that the AI system needs to access quickly while processing tasks. The more memory available, the more data the system can access while executing its model. However, memory is volatile — it only holds data temporarily. Once a program finishes its execution, it will ideally free up memory for the next one. For example, the memory used to make an inference about a user’s preferences in a recommender system will likely be overwritten when the system makes an inference for another user.
Sometimes, a computer needs to preserve information for longer. For example, when one generates data as the result of an AI system’s operation, there is usually some interest in preserving that data. To do so, computers rely on long-term storage, such as hard drives or solid-state drives. Those sources of storage can retain information for a long time, without requiring the kind of active effort needed to preserve memory.
The trade-off here is that reading information from long-term storage is much slower than reading information from memory. In fact, one of the major sources of delay when a program is running can be the time spent moving frequently used information from long-term storage into memory. But, since storage devices are cheaper and more durable than memory, they are essential for storing large datasets and pre-trained AI models that can be used repeatedly, as well as the data one needs to preserve.
2.3.3 Network connectivity
Many AI applications are dependent on the flow of information from other devices. For example, AI systems used in social networks rely on the internet to transmit and receive information. This means that the properties of the internet connection, such as download speed and bandwidth, become particularly relevant for their operation.
For instance, a virtual assistant on a smartphone might need to send a voice recording to a cloud server for analysis, requiring a fast and reliable internet connection. If the network speed is insufficient, the response time might lag. Whenever that happens, the user’s experience is negatively impacted, even if the AI system manages to generate inferences quickly enough.
2.3.3.1 The cloud as an AI enabler
As discussed above, running anything but the most trivial AI systems requires a lot of resources. However, few organizations have the financial wherewithal or the technical capabilities to maintain all that technical infrastructure. Therefore, the use of AI models and systems has been greatly facilitated by the fact that individuals and organizations can contract the use of those resources through cloud platforms.
A cloud is a network of remote servers that provide computing power, storage, and other resources over the internet. These resources are made available for customers, who can, for example, acquire access to a machine by paying a fee based on time or on the amount of resources used. When an AI application is described as “cloud-based,” it means that the heavy computational tasks are not performed on the user’s device, such as a smartphone or laptop. Instead, the heavy work of processing data and making AI inferences is carried out on powerful servers maintained by cloud providers. This setup allows organizations to access vast amounts of computing power without investing in expensive hardware, making it easier and more cost-effective to deploy AI technologies.
However, cloud computing also raises important considerations for data protection. Storing data in the cloud means outsourcing the maintenance and security of that data to a third party. While this can offer benefits in terms of scalability and cost, it also introduces potential risks.
Many major cloud providers, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, are based outside Europe. This raises concerns about compliance with the GDPR’s rules on cross-border data transfers. Organizations will need to cope with other potential sources of risk as well, such as potential vulnerabilities in the cloud infrastructure that could be exploited by malicious actors.
Finally, cloud platforms have a variety of reliability mechanisms. Nonetheless, they are still a single point of failure outside the control of the organization relying on them. It follows from this that a cloud outage can reduce the availability of many services at the same time, as all services relying on a given provider will be affected by its failures. Organizations need to take these potential risks into account when considering the savings and other advantages they might derive from relying on a cloud provider.
2.4 Conclusion
Why does a training module with a legal focus need to zoom into the technicalities of AI? After all, the GDPR is designed to be a technology-neutral regulation, which means its provisions apply regardless of whether data is processed by AI or another technological arrangement. Even so, there are several reasons why technical understanding can be helpful for data protection professionals.
Sometimes it is possible to adequately describe problems with “algorithms” and “models” without going into technical details. For example, one can identify algorithmic biases by looking at the outputs of AI systems rather than inspecting their inner workings, as we discuss in Chapter 4. This means that abstractions can help us make sense of why AI matters from a legal perspective. However, abstractions in computing are always “leaky,” in the sense that the technical details that are abstracted away can sometimes have significant real-world implications (Spolsky 2004).
For example, a defect in the processor used by a cloud server could lead to errors in the AI system’s calculations, producing incorrect or biased results (see, for example, Hochschild et al. 2021). Similarly, a security vulnerability in the cloud provider’s infrastructure could allow unauthorized access to the data being used by the AI system, potentially compromising sensitive personal information.
Given these risks, data protection professionals need to take a proactive role in assessing the technical infrastructure of AI systems used by their organizations. This includes evaluating the security measures implemented by cloud providers, understanding where and how data is stored and processed, and ensuring that cross-border data transfers comply with relevant legal requirements. By gaining a basic understanding of the infrastructure that supports AI, data protection professionals can better identify potential vulnerabilities and work towards mitigating risks.
Exercises
Exercise 1. What is the primary distinction between an AI model and an AI system?
- a. An AI model processes data, while an AI system collects data.
- b. An AI model operates only on neural networks, while an AI system can be based on other architectures.
- c. An AI model is a component of an AI system.
- d. An AI model requires an internet connection, whereas an AI system does not.
- e. There is no distinction; the terms are interchangeable.
Exercise 2. Which of the following is a characteristic of reinforcement learning?
- a. Learning from labelled datasets
- b. Learning by receiving rewards and penalties from interactions.
- c. Grouping data into clusters without pre-set labels.
- d. Utilizing neural networks exclusively for training.
- e. Using expert knowledge to define decision-making rules.
Exercise 3. What distinguishes training data from validation data?
- a. Training data is synthetic, while validation data is real.
- b. There is no difference between the two.
- c. Training data evaluates performance, while validation data trains the model.
- d. Validation data ensures outputs, while training data remains unused.
- e. Training data is used to define model parameters, while validation data is used to tune a model after training.
Exercise 4. What issue arises if synthetic data can be traced back to real individuals?
- a. The synthetic dataset becomes subject to data protection regulations.
- b. The data remains outside the scope of data protection law.
- c. The AI system automatically requires retraining.
- d. There is no legal impact as long as the data values are random.
- e. The system’s output is considered invalid by default.
Exercise 5. Which of the following is NOT an advantage of cloud-based AI systems?
- a. Access to significant computational power without major upfront costs.
- b. Scalability to adjust resources based on demand.
- c. Reduced dependency on local infrastructure.
- d. Outsourcing compliance with data protection duties to the cloud provider.
- e. Cost-effectiveness for organizations without extensive hardware.
2.4.1 Prompt for reflection
Reflect on the distinction between an AI system and an AI model. Why is it important for data protection officers to understand this distinction when evaluating compliance with legal requirements?
2.4.2 Answer sheet
Exercise 1. Alternative C is correct. An AI model is a core component that processes inputs to produce outputs, but an AI system includes additional infrastructure and tools to operationalize the model. The other options misinterpret or oversimplify this relationship.
Exercise 2. Alternative B is correct. Reinforcement learning involves feedback from rewards or penalties. Options A and C describe supervised and unsupervised learning, respectively. Option D overstates the role of neural networks, and E relates to expert systems.
Exercise 3. Alternative E is correct. Training data helps a model learn, while validation data helps refine it. The other alternatives assign incorrect roles to at least one data set, or, in the case of alternative A, ascribe a property that is not related to the definition.
Exercise 4. Alternative A is correct, considering the GDPR’s definition of “personal data.” Options B, C, and E are incorrect; D misunderstands the legal implications of reidentifiable data.
Exercise 5. Alternative D is correct, as organizations must still actively manage data protection and compliance concerns.
References
Marianne Bellotti, Kill It with Fire: Manage Aging Computer Systems (No Starch Press 2021).
Kate Crawford, The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence (Yale University Press 2021).
Ege Erdil, Optimally Allocating Compute Between Inference and Training (Epoch AI 2024).
Peter H Hochschild and others, ‘Cores That Don’t Count’, Proceedings of the Workshop on Hot Topics in Operating Systems (Association for Computing Machinery 2021).
Ronald T Kneusel, How AI Works: From Sorcery to Science (No Starch Press 2024).
Joe Reis and Matt Housley, Fundamentals of Data Engineering (O’Reilly 2022).
Giovanni Sartor and Francesca Lagioia, ‘The Impact of the General Data Protection Regulation (GDPR) on Artificial Intelligence’ (European Parliamentary Research Service, 2020).
Joel Spolsky, ‘The Law of Leaky Abstractions’ in Joel on Software: And on Diverse and Occasionally Related Matters That Will Prove of Interest to Software Developers, Designers, and Managers, and to Those Who, Whether by Good Fortune or Ill Luck, Work with Them in Some Capacity (Apress 2004).
[^fn-agg]: Though not always. The output might, for example, be a statistical aggregate of individual properties that cannot be traced to a single individual.