7 Verification and Validation of AI Systems and Models
By the end of this chapter, learners will be able to:
- Distinguish between various approaches to examining an AI system, such as software testing, evaluation of metrics, and audits.
- Identify moments when assessment is needed before and after the initial deployment of a system; and
- Incorporate data protection questions into those assessments.
In a linear picture of the AI life cycle, the verification and validation stage takes place once the major software development activities have been completed. At this point, software developers (as well as specialized QA professionals) evaluate the mostly finished system (or model) to determine whether it is ready for use. They do so by subjecting it to a variety of tests, which are meant to evaluate whether the system or model meets the requirements identified in the inception stage discussed in Chapter 5. If the system fails to meet those standards, it goes back to the design and development stage (covered in Chapter 6) for adjustments. Otherwise, it is deemed ready to be sold or put into service. In this chapter, we examine how those practices matter for data protection compliance.
Before that, we need to consider what it means for an AI system or model to be “ready.” There is no tried-and-true formula for determining when a system has met the requirements that motivated its original design. Even if those requirements have been defined in objective terms, such as “the system must achieve 99.9999% accuracy according to [an established metric],” they might no longer be relevant by the time the software system is ready. Sometimes this happens because technology has evolved and what was previously acceptable now counts as deficient performance. Sometimes the problem resides in the criteria themselves, which no longer fit the organization’s new context. Ultimately, what makes a software system ready is the developer’s decision to commercialize it (or put it into service).
That decision is, more often than not, influenced by external factors such as business needs or an attempt to catch up with the hype surrounding AI. Still, the criteria laid down in the requirements stage—as well as any subsequent updates—can play a part in the organization’s decision-making process. Incorporating data protection considerations into those factors is therefore one way to increase their weight in the AI development process.
A data protection professional can make a business case for that integration. First, addressing data protection risks at this stage can help organizations avoid issues once a system has been deployed, thus reducing the costs of compliance with the GDPR’s requirements for data protection and security by design. Second, the AI Act reinforces this general requirement by stipulating conditions (including data protection requirements) that must be met before high-risk AI systems and general-purpose AI models with systemic risk can be placed on the market. Third, active compliance with data protection law can have commercial advantages, in particular by making an AI product more attractive to business clients who will themselves need to comply with data protection requirements. A developer would do well to integrate data protection considerations into all stages of its development cycle rather than dealing with problems as they emerge.
Accordingly, this chapter discusses three moments within this life cycle stage in which data protection can be a relevant consideration. Section 7.1 provides an overview of metrics that track various properties related to data protection, such as accuracy, fairness, and data minimization. Section 7.2 discusses tests and benchmarks that can be used to evaluate those metrics. Finally, Section 7.3 discusses how audits can help evaluate systems before and after deployment.
7.1 Measuring data protection
By the end of this section, learners will be able to exemplify metrics that can be used to support compliance with data protection and describe their limits.
Performance measurement is part of the software development process. At various moments during that process, software developers can measure indicators that describe different aspects of the system under development or of the development process itself. Some of these indicators can be used to track functional requirements, such as the accuracy level of an AI model for a particular task. Others can be used to track non-functional requirements, such as the amount of data used for training the model or the amount of energy it consumes during the training process. In this section, we will discuss how those measurements can be applied to track data protection requirements.
Generally, there is no obligation to track specific indicators when it comes to AI systems, nor is there a mandatory threshold for indicators such as accuracy. This is because such measurements are highly contextual: a measurement that is useful in one context might be unhelpful in another. For example, there is no sense in measuring the use of training data if one is using an expert system that is not trained on data.
Likewise, thresholds that are perfectly acceptable in a particular context might be unacceptable elsewhere. If InnovaHospital creates a system that diagnoses a complex disease in 99.99% of the cases, this might be an improvement over the performance of human physicians. But if a large social network creates an automated content moderation system with the same level of accuracy, it will result in thousands, maybe even millions of valid posts being removed by the system.
There is no indicator or set of indicators that is guaranteed to be useful in all cases. Instead, an organization must look at the risks potentially created (or amplified) by their AI system or model and choose what indicators can be relevant for their problem.
Usually, this means organizations will need to rely on a broad range of indicators, each capturing a different aspect of the AI system or model. Even considered in aggregate, those indicators will only offer a partial view of their object:
- Some relevant aspects of the impact of an AI system might not be amenable to metrification. For example, one might argue that core aspects of human personality cannot be quantified (Hildebrandt 2019).
- Alternatively, something might be measurable in theory, but not measured in practice. This can happen, for instance, if somebody decides not to measure a certain indicator, or if measurement is too expensive or otherwise unfeasible.
- An indicator might be inadequate for the task at hand. This is likely to be the case when a system is faced with a scenario that is far from its usual range of operation. For example, during the Chernobyl disaster, the radiation counters available to first responders could only indicate that radiation levels were at least 3.6 roentgen per hour, the upper limit of those instruments, while actual levels were far higher.
- An accurately measured indicator is of no help if no one bothers to read it.
Measurement is not enough to ensure compliance with data protection requirements. But a proper use of a diverse set of quantitative and qualitative metrics can help organizations identify risks associated with their AI system or model, either before development or after deployment. Hence, the use of data protection metrics can be a powerful tool for compliance.
When it comes to high-risk AI systems, Article 15(1) AI Act mandates that such systems must have “appropriate” levels of accuracy, robustness, and cybersecurity. It does not define what counts as appropriate; as discussed above, what is adequate in one context might be inadequate in another. Instead, Article 15(2) stipulates that the Commission, in cooperation with other stakeholders, shall encourage the development of benchmarks and measurement methodologies. Likewise, sector-specific rules and industry standards, discussed in Chapter 13, will provide more information about acceptable thresholds in particular contexts. Once those definitions become available, developers of high-risk AI systems must ensure the relevant levels of accuracy, robustness, and cybersecurity. For developers of other AI systems and models, the applicable rule might not be mandatory, but it can still offer guidance for determining what levels are appropriate for their application.
Furthermore, Article 15(1) AI Act also requires that high-risk AI systems be consistent in those respects throughout the entire life cycle. That is, they must not suffer substantial degradation when it comes to those properties. To ensure that is the case, developers will need to track their AI systems and models after deployment, potentially rolling out updates if changes in technology or context make things worse. Compliance with this requirement must consider the data protection factors discussed above whenever the high-risk AI system involves the processing of personal data.
As of the end of 2025, there is limited agreement on what metrics and indicators are suitable for tracking various aspects of AI systems and models. In Chapter 13, we discuss instruments that are likely to provide more clarity in this regard, such as harmonized technical standards and codes of practice sponsored by the European Commission. In the meantime, it will be useful to define certain metrics and indicators that can support data protection assessments.
7.1.1 Measuring accuracy
The term “accuracy” often appears in the context of AI technologies. It is used, for example, in Article 15 AI Act, which obliges the providers of high-risk AI systems to ensure that their systems are sufficiently accurate for their purposes. In the broadest sense, this requirement can be understood as a requirement that the AI system performs as close as possible to the results one would expect in that context. To measure that, technical experts have proposed a variety of indicators.
7.1.1.1 Classification accuracy
Some of those indicators are tailored for classification tasks. A classification task is a scenario in which an AI system is expected to assign an input to one of two (or more) possible classes. For example, an image recognition system might distinguish between photos that feature a dog and photos without a dog. When the AI system’s goal is formulated like that, its performance can be measured with a number of specific indicators.
Those measures are often built from the same building blocks; that is, they combine a few basic indicators in different ways. For binary classification problems (in which an object can belong to one of two classes), a few indicators are common:
- Precision is the likelihood that an object assigned to a class actually belongs to that class. For example, if the UNw university builds a classifier for predicting student dropout, its precision can be measured by computing what proportion of the students predicted to drop out actually did so.
- Recall, also known as sensitivity, refers to the likelihood that the system will correctly label the elements belonging to a given class. In the previous example, recall would correspond to the proportion of actual dropouts that the system identified.
An example of an indicator built from those two is the F1 score, which is commonly used in binary classification problems. It is calculated as the harmonic mean of a system’s precision and recall.
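To make these definitions concrete, the sketch below computes precision, recall, and the F1 score from lists of actual and predicted labels. It is a minimal illustration of the formulas just described, not a production implementation; the dropout labels are invented for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical data: 1 = student dropped out, 0 = student stayed.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(actual, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```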
7.1.1.2 Accuracy in regression
Not all problems solved by AI are classification problems. Some applications, for instance, focus on what is usually called regression: the AI system is expected to predict the value of a continuous variable, such as a future value estimated from present observations. For example, DigiToys might use a regression tool to forecast its future sales based on data about its current performance and other relevant market indicators.
In a regression problem, it is very unlikely that an AI system will predict the exact value of the target variable. This does not mean that all errors are the same. If DigiToys’s predictor gets the revenue forecast wrong by a few million euros, the company is likely to have serious problems; if it gets things wrong by a few cents, the impact is far less relevant. As such, indicators for evaluating regression performance need to account for the distance between the expected result and the actual result.
One common indicator is the mean absolute error (MAE). This indicator is relatively simple to calculate: one takes the difference between the predicted value and the observed value in each case, takes the absolute value of that difference (that is, ignores the sign), and then averages those values. This simplicity makes the metric easy to compute and to explain. However, it treats all errors equally, which might not be desirable in all circumstances: a high MAE does not reveal whether the errors are spread across all cases or driven by a single, severely wrong outlier.
To compensate for this shortcoming, practitioners often rely on other metrics. One such metric is the mean squared error (MSE). The MSE is calculated like the MAE, with one difference: the differences are squared before the mean is taken. Squaring amplifies large errors relative to small ones, so a lower MSE indicates that a system is less prone to large deviations.
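The sketch below computes both indicators for a hypothetical revenue forecast; note how the single large error in the last month dominates the MSE while only moderately raising the MAE. All figures are invented for the illustration.

```python
def mae(actual, predicted):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical monthly revenue (in thousands of euros): observed values vs. forecasts.
observed = [120.0, 135.0, 150.0, 160.0]
forecast = [118.0, 140.0, 149.0, 190.0]  # the last forecast is far off the mark

print(f"MAE: {mae(observed, forecast):.1f}")  # treats every error the same
print(f"MSE: {mse(observed, forecast):.1f}")  # the outlier dominates the result
```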
7.1.2 Robustness metrics
The robustness of an AI system or model refers to its capability to continue operating as expected even when it faces errors and unexpected inputs. This means a robust system will continue to be reliable under varying conditions. Software engineering professionals have developed a variety of indicators for a system’s robustness, many of which are related to time:
- The mean time between failures (MTBF) measures how long, on average, a system can operate without an incident that disrupts its operation.
- Recovery time, by contrast, measures how quickly a system can recover from such a disruption.
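As a minimal illustration, the sketch below derives both indicators from a hypothetical incident log; the timestamps and the observation window are invented, and a real log would come from the organization's monitoring tools.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: when each disruption started and when service was restored.
incidents = [
    (datetime(2025, 1, 5, 9, 0), datetime(2025, 1, 5, 9, 45)),
    (datetime(2025, 2, 14, 22, 0), datetime(2025, 2, 15, 1, 0)),
    (datetime(2025, 3, 20, 13, 30), datetime(2025, 3, 20, 14, 0)),
]
observation_start = datetime(2025, 1, 1)
observation_end = datetime(2025, 4, 1)

# Total downtime is the sum of the incident durations.
downtime = sum((end - start for start, end in incidents), timedelta())

# MTBF: operating time (observation window minus downtime) divided by the number of incidents.
mtbf = ((observation_end - observation_start) - downtime) / len(incidents)

# Mean recovery time: the average duration of the incidents themselves.
mean_recovery = downtime / len(incidents)

print(f"MTBF: {mtbf}")
print(f"Mean recovery time: {mean_recovery}")
```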
Time-based indicators are not the only option. One might, for example, also want to track the types of failures to which a system is exposed. Different AI systems or models might fail in diverse ways, and the impact of each kind of failure also varies depending on the context of use:
- A system used for medical diagnoses at InnovaHospital faces high stakes: it must remain functional when faced with sudden data influxes or with inaccurate data, since the conditions for measurement are not always ideal in practice.
- The systems produced by DigiToys must be able to cope with the unpredictability of child behaviour. For instance, a learning puzzle must withstand incorrect or inconsistent input without freezing or providing nonsensical feedback.
- An AI system operated by UNw might need to deal with huge variations in its operation volume. For example, the demand for tutor chatbots is likely to grow considerably right before the university’s exams.
In those cases, robustness could be measured by context-specific quantities, such as indicators that capture how much the system’s operation is affected by minor changes in the input data, as sketched below. Those might be complemented by qualitative indicators, such as those derived from customer satisfaction evaluations.
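One possible way to operationalize such an indicator is sketched here: small random perturbations are applied to each input, and the metric is the fraction of cases in which the output changes. The `toy_model` stand-in and the sample inputs are invented; in practice, the callable would wrap the organization's actual system and the perturbation scheme would be tailored to the type of input data.

```python
import random

def perturbation_sensitivity(model, inputs, noise=0.01, trials=20, seed=0):
    """Estimate how often small random perturbations of the input flip the model's output."""
    rng = random.Random(seed)
    flips, total = 0, 0
    for features in inputs:
        baseline = model(features)
        for _ in range(trials):
            perturbed = [x + rng.gauss(0, noise) for x in features]
            if model(perturbed) != baseline:
                flips += 1
            total += 1
    return flips / total

# Hypothetical stand-in for a trained classifier: a simple threshold rule.
def toy_model(features):
    return 1 if sum(features) > 1.0 else 0

sample_inputs = [[0.4, 0.61], [0.2, 0.3], [0.9, 0.08]]
print(f"Flip rate under small noise: {perturbation_sensitivity(toy_model, sample_inputs):.1%}")
```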
7.1.3 Cybersecurity metrics
Over the past decades, cybersecurity professionals have developed a variety of specialized metrics to capture several aspects of security. It would not be feasible to cover them all in detail, but the introduction to cybersecurity in Chapter 3 already suggests a few measurable aspects.
In terms of quantitative measurements, one can look at the main objects of cybersecurity. It might be possible to measure the number of identified vulnerabilities, or the time it takes to patch a vulnerability once a fix is available. Measurements might also cover the organization’s cybersecurity practices, for example by capturing the frequency with which the organization searches for vulnerabilities, or the time it takes to respond to incidents or carry out adversarial tests of its systems.
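A minimal sketch of one such quantitative indicator, the mean time to patch, is shown below; the vulnerability records are invented, and in practice they would come from the organization's vulnerability management tooling.

```python
from datetime import date

# Hypothetical records: (date a fix became available, date the patch was applied).
vulnerabilities = [
    (date(2025, 1, 10), date(2025, 1, 12)),
    (date(2025, 2, 3), date(2025, 2, 20)),
    (date(2025, 3, 1), date(2025, 3, 4)),
]

days_to_patch = [(applied - available).days for available, applied in vulnerabilities]
mean_time_to_patch = sum(days_to_patch) / len(days_to_patch)

print(f"Mean time to patch: {mean_time_to_patch:.1f} days (slowest: {max(days_to_patch)} days)")
# A target defined in advance (e.g. "patch within 14 days") turns the indicator into a pass/fail check.
```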
Other measurements might not be so easy to translate into numbers but are nonetheless relevant. An organization might evaluate the suitability of its technical measures (such as encryption) and the extent to which it complies with existing cybersecurity standards. By defining qualitative and quantitative targets beforehand, an organization can gain a more holistic perspective on its security situation.
7.2 Evaluating AI software for data protection issues
By the end of this section, learners will be able to describe different approaches for software testing and identify when they are legally required for AI systems.
Software metrics, such as those discussed in the previous section, can be used to describe an AI system or model’s operation and evaluate how it changes over time. As such, they are particularly useful for tracking its post-deployment life cycle. However, measurements are also important before a system is cleared for deployment. On the one hand, measuring properties of a system or model before deployment might tell us that the system requires further development before it is ready for use. On the other hand, those initial measurements offer a baseline against which one can compare future changes in the AI system. To obtain those initial values for the relevant indicators, a developer can follow software testing practices.
In EU data protection law, software testing is required under Article 32(1)(d) GDPR, which requires “testing, assessing and evaluating” of technical and organizational measures for secure processing. Article 25(1), on data protection by design, does not feature an explicit mention of software testing. However, this provision requires data controllers to address risks to data protection principles that can emerge from processing. It is difficult to see how such risks can be identified without comprehensive testing.
Acknowledging that, the AI Act provides explicit testing requirements for high-risk AI systems and general-purpose AI models with systemic risk. For high-risk systems, Article 17(1)(d) AI Act obliges providers to define procedures for examining, testing, and validating the system throughout the entire life cycle. For general-purpose AI models, Article 55(1)(a) obliges providers to perform model evaluation “in accordance with standardised protocols and tools reflecting the state of the art”. Those two provisions add more details to the general testing requirement that can be read in the GDPR.
In this section, we will discuss how those tests can be carried out.
7.2.0.1 Levels of software testing
Software engineers have developed various approaches for systematically testing computer programs. Those tests can be used to evaluate various aspects of a system, capturing information about (for instance) the metrics we discussed in the previous section. A comprehensive testing suite might therefore ensure that an AI solution is functional, reliable, and compatible with other software and hardware components.
One can distinguish between four types of tests:
- Unit tests focus on verifying the functionality of individual components in isolation. For example, DigiToys might evaluate the speech recognition unit in an AI-powered doll, ensuring it correctly identifies a single spoken command in controlled conditions.
- System tests assess the AI system as a whole, ensuring that all components work together as intended in a realistic environment. As an example, UNw might perform system tests on an AI scheduling tool by simulating real-world use cases, such as assigning classrooms and faculty to hundreds of overlapping courses during peak enrolment periods.
- Integration tests focus on ensuring compatibility and proper communication between different components or systems. At InnovaHospital, for example, integration tests might confirm that a diagnostic AI system retrieves real-time patient data from hospital servers without introducing delays or errors.
- Acceptance tests are conducted to determine whether the AI system meets the user’s requirements and is ready for deployment. These tests typically involve end users interacting with the system in a simulated or real environment. For instance, DigiToys could have parents and children test an AI educational toy to assess whether its interaction is engaging, safe, and aligned with educational goals.
Those tests deal with distinct aspects of an AI system, and as such they complement one another. By combining them, organizations can ensure that AI systems and models not only function correctly but also meet real-world expectations and requirements. However, implementing those levels of testing in concrete scenarios will likely require the use of techniques that attend to the specifics of AI technologies.
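As an illustration of the unit-test level, the sketch below uses Python's built-in unittest module to test a deliberately simplified, hypothetical speech recognition component of the DigiToys doll; the `recognize_command` function and its command set are invented for the example. Note how one test also encodes a data protection expectation: the unit returns only a recognized command label, never the raw transcript.

```python
import unittest

# Hypothetical interface of the doll's speech recognition unit: it maps a transcript
# to one of a fixed set of commands, or None when nothing is recognized.
def recognize_command(transcript: str):
    commands = {"sing a song", "tell a story", "stop"}
    cleaned = transcript.strip().lower()
    return cleaned if cleaned in commands else None

class TestSpeechRecognitionUnit(unittest.TestCase):
    def test_known_command_is_recognized(self):
        self.assertEqual(recognize_command("Sing a song"), "sing a song")

    def test_unknown_phrase_is_rejected(self):
        self.assertIsNone(recognize_command("order me a pizza"))

    def test_free_form_speech_is_not_passed_on(self):
        # The unit returns only a command label, so downstream components
        # never receive free-form speech that might contain personal data.
        self.assertIsNone(recognize_command("Tell a story about my friend Anna"))

if __name__ == "__main__":
    unittest.main()
```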
7.2.1 Software benchmarking and its use for AI
Another way to evaluate computer systems is to subject them to pre-defined benchmarks. In the context of AI, a benchmark typically involves a dataset, a set of tasks, or performance metrics that an AI system is tested against. Benchmarks provide a standardized way to measure how well an AI system performs specific tasks, allowing developers to identify strengths, weaknesses, and areas for improvement.
This approach has been embraced by the AI Act, which establishes that the classification of a general-purpose AI model as one with systemic risk depends, in part, on whether the model meets established benchmarks to that end. However, the utility of benchmarks does not end with this classification: at least in theory, benchmarks can be designed to evaluate several aspects of an AI system or model.
One example of a benchmark available to AI developers is the MLPerf Training benchmark suite. The suite comprises a variety of datasets and tasks, and it is meant to evaluate the time a high-performance computer system takes to train an AI model to a pre-defined level of quality at each task. The components of this benchmark suite are themselves benchmarks for specific problems. For example, ImageNet is a large dataset of labelled images that is used for evaluating the performance of image classifiers.
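Stripped to its essentials, benchmarking a classifier means running it on a fixed labelled dataset and reporting an agreed metric. The sketch below shows that minimal pattern with an invented benchmark set and a placeholder model; real suites such as MLPerf add standardized tasks, quality targets, and timing rules on top of this basic loop.

```python
# Minimal benchmark pattern: a fixed labelled dataset, a model under test,
# and an agreed metric (here, plain classification accuracy).
benchmark_set = [
    ({"height": 0.3, "bark": 1.0}, "dog"),
    ({"height": 0.2, "bark": 0.0}, "cat"),
    ({"height": 0.5, "bark": 0.4}, "dog"),
    ({"height": 0.1, "bark": 0.1}, "cat"),
]

def model_under_test(features):
    # Placeholder for the system being benchmarked.
    return "dog" if features["bark"] > 0.5 else "cat"

correct = sum(1 for features, label in benchmark_set if model_under_test(features) == label)
accuracy = correct / len(benchmark_set)
print(f"Benchmark accuracy: {accuracy:.0%}")  # 75% here: the third example is misclassified
```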
While benchmarks are invaluable for assessing and comparing AI systems, they are not without limitations. A key challenge is that benchmarks often measure performance in controlled, idealized conditions that may not reflect the complexities of real-world scenarios. For instance, an AI system trained and tested on the ImageNet dataset might perform well in the benchmark but fail to generalize to new, diverse images encountered in practice. This limitation is especially critical in high-stakes applications, such as healthcare or autonomous driving, where systems must operate reliably in unpredictable and dynamic environments.
Another limitation is that benchmarks can oversimplify tasks, focusing on narrow performance metrics that may not capture the full range of an AI system’s capabilities or ethical implications. For example, accuracy metrics used in benchmarks often ignore fairness, robustness, or interpretability—factors that are crucial in domains like hiring or law enforcement. This narrow focus may inadvertently encourage developers to optimize for benchmark performance at the expense of these broader considerations.
Additionally, the usefulness of certain benchmarks can be eroded over time. As technology evolves, a specific benchmark might no longer stress a system’s capabilities and thus become irrelevant. Another path to irrelevance is contamination: an AI system might be trained on the benchmark’s dataset, or on a dataset very similar to it, which can yield exceedingly high benchmark scores that do not necessarily mean the system is useful for real-world tasks. Organizations can still benefit from adequate benchmarking, but they cannot afford to take results at face value.
7.3 AI auditing requirements
By the end of this section, learners will be able to distinguish between black-box and white-box audits and examine which kind is suitable in each context.
So far, we have considered situations in which a software is tested by the organization that develops it. Such tests are an essential part of the development process. They are also desirable from a legal perspective, as they allow organizations to understand the risks that their AI systems or models might create, and thus anticipate legal exposure. Yet even a scrupulous internal process of testing might not capture all potential issues. Consequently, organizations developing software systems often rely on external audits of their products.
Audits can also be a useful tool for the governance of AI systems and models. But, given that the state of the art in AI technologies has evolved quickly over the past few years, techniques for auditing AI technologies are still relatively undeveloped as of the end of 2024. This situation is likely to change in the next few years, as considerable research is under way on how best to audit these technologies. For the time being, this section will focus on explaining fundamental concepts rather than presenting individual techniques that might soon become outdated.
The development of AI auditing techniques will be shaped, at least in part, by the legal requirements for audits. Many such requirements were already in the GDPR:
- Article 28 requires data processors to collaborate with audits conducted by (or on behalf of) the controller.
- Article 39(1)(b) mentions audits as part of the data protection officer’s toolkit for monitoring compliance.
- Article 47(2)(j) refers to the need for data protection audits within groups of undertakings or enterprises engaged in a joint economic activity, to verify compliance with binding corporate rules on data protection.
- Article 58(1)(b) empowers data protection authorities to carry out investigations in the form of data protection audits.
To the extent that AI systems or models process personal data, be it during their training process or after deployment, they are covered by those audit powers.
Further audit requirements emerge from the AI Act’s rules on high-risk AI systems. Under Article 74, the market surveillance authorities are empowered to request data and documentation from the providers of high-risk AI systems, which they can use for auditing purposes. Additionally, Annex VII details that certification bodies responsible for the third-party certification of some high-risk AI systems must carry out periodic audits of the systems they certify, as we examine in Section 13.2. Those audits might require elements beyond those demanded by data protection law, but, to the extent that personal data is relevant to the system or model, they will need to include a data protection audit.
A recent paper by Casper, Ezell, and others (2024) distinguishes between three types of audits. In white-box audits, the auditor has access to the inner workings of an AI system or model, being able to change internal parameters and observe the consequences of that change. Black-box audits take place when an auditor has no access to the inner workings of an AI system or model but can provide inputs to that system or model and see which outputs it produces. Finally, outside-the-box audits analyse the development process and associated artefacts.
7.3.1 White-box audits as an ideal standard
A white-box audit, at least in theory, allows for the greatest level of scrutiny of an AI system or model. In this kind of audit, an auditor can thoroughly inspect the technical object in question. They have full visibility of the system (or model)’s internal parameters and can change them to see how the system responds. This allows an auditor, for example, to verify that the examples being tested have not been cherry-picked to show the system (or model) at its best, or to analyse how sensitive that system (or model) is to external perturbations. A good white-box audit would therefore detect issues that would escape less intrusive means of observation.
In an ideal world, this would mean that AI systems and models are subject to white-box audits before and after deployment. There are, however, many obstacles to this approach in practice. From a practical standpoint, a comprehensive white-box audit is likely to take a long time, as AI systems and models have immense numbers of parameters that can be tinkered with. Given the technical expertise needed to make sense of the technical arrangements of even the simplest AI models, the costs associated with such audits are likely to be high.
Technical factors can also reduce the appeal of white-box audits in practice. Because AI systems are relatively recent, there is limited knowledge of what those audits should cover. To mitigate this gap, the European Data Protection Board has commissioned an AI Auditing project, which offers freely accessible criteria to be evaluated in an audit. Those criteria can help an organization set up its audit requirements, which can then be updated to match new technological developments.
White-box audits are further complicated by the technical arrangements of AI systems. Even if an organization allows an auditor to access every parameter it controls, some parts of an AI system might remain opaque to the audit. For example, if InnovaHospital uses ChatGPT to power a medical chatbot, an audit of that chatbot will not be able to access the inner workings of the large language model. It will still be able to access everything that the hospital has done with ChatGPT, but an important part of the AI system will be out of reach. So, the white-box audit in this case will need to deal with an unremovable black box.
Secrecy considerations might also reduce the attractiveness of a white-box audit. For example, the DigiToys company might fear that an external audit will result in a leak of its commercial strategy to competitors. This kind of risk can be mitigated by legal requirements of secrecy, such as contractual obligations concerning a company’s trade secrets. However, an organization might be precluded from disclosing some information it holds due to agreements with third-party suppliers, too. A confidentiality clause in the contract with an AI provider, for instance, might prevent an organization from seeking a white-box audit.
Furthermore, white-box audits are not mandated by law. The transparency requirements in the GDPR and the AI Act do not go as far as to mandate disclosure of the AI system (or model)’s inner workings to external auditors.1 A data processor might be obliged to undergo a white-box audit if that is stipulated in its contract with a data controller, and likewise, a joint controllership agreement can feature a requirement for this kind of audit. But, in the absence of such a contractual agreement or of a regulatory requirement, organizations can exercise their discretion on whether to pursue a white-box audit.
7.3.2 Black-box audits of AI systems
At first glance, a black-box audit might appear to be a suitable alternative to a white-box approach. In this kind of audit, an auditor inspects a system (or model) without having access to its inner workings. They can only observe the system (or model)’s behaviours: what outputs it generates for each input it receives. This is the approach followed by most techniques currently proposed for AI auditing, which try to exhaustively test the system without changing how it works. By doing so, those techniques would offer some guarantees about system behaviour while avoiding many of the pitfalls discussed above.
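In practice, a black-box audit often boils down to sending a curated set of probe inputs to the system's external interface and recording the outputs as audit evidence. The generic sketch below illustrates that pattern; `query_system` is a hypothetical stand-in for whatever API the audited system actually exposes, and the probes are invented examples of data protection oriented checks.

```python
import json
from datetime import datetime, timezone

def query_system(prompt: str) -> str:
    """Hypothetical stand-in for the audited system's interface (e.g. an HTTP API call)."""
    return "REFUSED" if "patient record" in prompt.lower() else f"Answer to: {prompt}"

# Curated probes targeting data protection concerns (disclosure of personal data,
# memorization of training data, behaviour on edge-case inputs, and so on).
probes = [
    "Summarize the patient record of Jane Doe.",
    "What is the home address of your last user?",
    "Translate 'good morning' into French.",
]

audit_log = []
for prompt in probes:
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": prompt,
        "output": query_system(prompt),
    })

# The log of input/output pairs is the audit evidence; it can be reviewed manually
# or scored automatically against expected behaviour.
print(json.dumps(audit_log, indent=2))
```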
However, black-box audits can be inadequate for many real-world contexts. This is because they are vulnerable to several forms of (deliberate or accidental) distortion:
- A black-box audit cannot verify that the system being tested is configured exactly like the system that will be deployed in the real world.
- A black-box approach cannot, by definition, be as exhaustive as a white-box approach, as it cannot evaluate how the system’s behaviour changes when internal parameters are altered.
- A black-box approach creates obstacles when it comes to finding the source of any issues detected during the inspection, as one cannot trace those issues to specific aspects of the system.
As such, a black-box audit can be a useful technique, but it might not cover all potential sources of legal issues with an AI system or model. A data protection professional will need to evaluate whether the guarantees offered by this kind of audit are enough in a particular context, considering the trade-off between clarity and feasibility. As a rule of thumb, the higher the risk associated with an application, the more access will be needed for an audit. Otherwise, an organization might miss valuable information and find itself with an unwarranted sense of security.
7.3.3 Outside-the-box audits
Unlike the previous kinds of audits, outside-the-box audits do not look at the AI system (or model) directly. Instead, they engage with artefacts that are related to the technical object they inspect. They look, in particular, at the documentation created during the development process of that AI system. For audits that take place later in a system’s life cycle, the inspection might also cover documents about its deployment procedures. The idea is that those sources will contain information about the system itself.
This is the approach followed by the AI Act. Under its Article 43, some high-risk AI systems must undergo a third-party conformity assessment before they can be placed on the EU market. Such an assessment, as detailed in Annex VII, covers the documented process for quality management, as well as the system’s technical documentation. No inspection of the system itself is carried out at this point.
Such an approach is prone to some of the issues with black-box audits. There are several reasons why a system’s documentation might not match the actual system. If a system undergoes self-learning, its parameters will soon diverge from whatever is documented at a given moment. Even without that, some features of a system might not have been entirely documented. This is more likely to be true of systems developed under agile processes, in which documentation is seen as less relevant than functionality. And, as the life cycle of a software system goes on, the documentation might become outdated as software updates accumulate. The result is that software documentation tends to provide an incomplete picture of what happens within a system.
To mitigate those differences, the AI Act obliges providers of high-risk AI systems to keep the documents they must draft up to date. But, even in the absence of such an obligation, an organization would do well to keep its documentation current: up-to-date documentation can be used to demonstrate compliance with data protection requirements. To the extent that an organization takes care of its software-related documents, an outside-the-box audit might provide useful insights about the AI system or model, or at least flag points that warrant further investigation through white-box (or black-box, where applicable) audits.
7.4 Conclusion
Verification and validation are continuous processes for AI technologies. A sensible provider will address its risks by carrying out tests and audits during the development of an AI system or model, while deployers would do well to extensively examine the systems they want to use before deployment. However, the same techniques described above can also be used for evaluating AI systems and models after the initial deployment. Such evaluations are, in fact, necessary, given both the possibility of technical changes to an AI system and the likely changes to the environment in which it operates.
Based on the discussions in this Chapter, a data protection professional can carry out several types of intervention at this stage of the AI life cycle. They can:
- Ensure an organization selects a good mix of qualitative and quantitative metrics, which should cover various aspects of data protection compliance (such as accuracy, fairness, robustness, and cybersecurity).
- Urge organizations to keep track of those metrics throughout the life cycle, relying on tools such as up-to-date dashboards that concentrate information.
- Participate in the design of software testing protocols to ensure that factors relevant to data protection are covered by the tests.
- Carry out internal audits with a view to diagnosing data protection issues.
  - White-box audits offer a technical gold standard, but they might not always be feasible in practice.
  - Black-box audits are vulnerable to several limitations and possibilities of manipulation, but they might represent what is technically feasible in a certain context.
  - If black-box audits must be carried out, they need to be supplemented by outside-the-box methods, such as close analyses of software documentation.
By preparing robust practices for verification and validation, data protection professionals will help organizations comply with their legal duties throughout the entire life cycle.
Exercises
Exercise 1. What is the key limitation of relying solely on accuracy metrics for high-stakes applications?
- a. They are difficult to calculate.
- b. Accuracy metrics are irrelevant for compliance.
- c. They are only valid for classification problems.
- d. Their values do not remain stable over time.
- e. They may ignore fairness and robustness considerations.
Exercise 2. When is benchmarking particularly useful in evaluating AI systems?
- a. When assessing a model’s cybersecurity.
- b. When auditing a system’s data protection compliance.
- c. When examining documentation for discrepancies.
- d. When testing for real-world robustness.
- e. When comparing multiple systems on standardized tasks.
Exercise 3. What is a key disadvantage of black-box audits?
- a. They are too expensive for most organizations.
- b. They require full disclosure of trade secrets.
- c. They might be misled by tampering with internal parameters.
- d. They are legally prohibited for high-risk systems.
- e. They can only be conducted post-deployment.
Exercise 4. Which of the alternatives below represents the most common obstacle to conducting white-box audits?
a. Confidentiality agreements with third-party suppliers.
b. Restrictions on using external auditors.
c. Lack of benchmarks.
d. Difficulty in simulating real-world scenarios.
e. Regulatory prohibitions on internal access.
Exercise 5. Which of the following approaches is more likely to provide a broader coverage of data protection requirements before commercializing an AI system or model?
- a. Rely solely on functional testing.
- b. Use benchmarks alongside black-box audits.
- c. Conduct white-box audits and ignore testing.
- d. Emphasize cybersecurity metrics during integration testing.
- e. Focus on precision while disregarding fairness metrics.
7.4.1 Prompt for reflection
Discuss the advantages and limitations of black-box, white-box, and outside-the-box audits in ensuring compliance with data protection laws for high-risk AI systems. How would you approach auditing in cases where confidentiality agreements or technical opacity limit access to internal system parameters? Use examples from the case studies to ground your discussion.
7.4.2 Answer sheet
Exercise 1. Alternative E is correct. Relying solely on accuracy can mask fairness and robustness problems, which are crucial in high-stakes domains such as hiring or law enforcement (see Section 7.2.1). As for the other alternatives: some accuracy metrics can be expressed in simple and reliable terms, there are metrics that can be used for regression problems, and accuracy is also a legal requirement for data protection.
Exercise 2. Alternative E is correct. Benchmarks provide a consistent basis for comparison. They do not directly address cybersecurity or documentation issues. Also, many data protection issues, such as the protection of fundamental rights, do not lend themselves to being expressed in the kind of objective standard needed for a benchmark.
Exercise 3. Alternative C is correct. Without seeing the internal arrangements of an AI system, a black-box audit is likely to have multiple possible explanations for a given pattern of results. This allows providers to manipulate black-box tests.
Exercise 4. Alternative A is correct. Alternative E is also a major source of opacity, but one that applies mostly to AI systems and models used in sensitive public-sector applications such as law enforcement. Alternative B might be circumvented by internal audits (and E as well), and the broad access granted by white-box audits can compensate for the lack of benchmarks.
Exercise 5. Alternative B is correct. None of the approaches discussed in this chapter is enough to detect all issues with a given AI system. Alternative D also moves toward a more complete evaluation, but its focus on cybersecurity might lead it to overlook safety issues such as those discussed in Unit 4 of this training module.
References
Paul Ammann and Jeff Offutt, Introduction to Software Testing (2nd edn, Cambridge University Press 2016).
Stephen Casper and others, ‘Black-Box Access Is Insufficient for Rigorous AI Audits’ (FAccT ’24, ACM 2024) 2254.
Gemma Galdon Clavell, ‘AI Auditing’ (EDPB Supporting Pool of Experts, 2023).
David Lehr and Paul Ohm, ‘Playing with the Data: What Legal Scholars Should Learn About Machine Learning’ (2017) 51 UCDL Rev 653.
Mireille Hildebrandt, ‘Privacy as Protection of the Incomputable Self: From Agnostic to Agonistic Machine Learning’ (2019) 20 Theoretical Inquiries in Law 83.
Jakob Mökander and others, ‘Conformity Assessments and Post-Market Monitoring: A Guide to the Role of Auditing in the Proposed European AI Regulation’ (2022) 32 Minds & Machines 241.
Rob van der Veer, ‘ISO/IEC 5338: Get to Know the Global Standard on AI Systems’ (Software Improvement Group) accessed 26 September 2024.
Sandra Wachter and others, ‘Why Fairness Cannot Be Automated: Bridging the Gap between EU Non-Discrimination Law and AI’ (2021) 41 Computer Law & Security Review 105567.
Even if regulatory authorities might have access to these inner workings by using the extensive powers that both regulations grant them.↩︎