Chapter 9 Troubleshooting and Testing Pipelines

Troubleshooting machine learning software is complicated for several reasons: the data may be huge (Section 9.1.1), may be collated from a number of different sources fed by different pipelines (Section 9.1.2) or may change over time (Section 9.1.3). Models may be too large for mere humans to interpret their parameters and eyeball incorrect behaviour patterns (Section 9.2.1) especially in the case of black-box models (Section 9.2.2). The time and cost of training them may also limit our ability to investigate any issues that require updating models (Section 9.2.3), especially if we are using several of them chained together in the pipeline (Section 9.2.4).

Software testing is a natural complement to troubleshooting: once we know where trouble lies (Sections 9.1 and 9.2), we can either actively prevent it by “defining errors out of existence” (Ousterhout 2018) or we can put in place tests to detect it before it can meaningfully degrade our software’s performance. While every bug is unique, some patterns of behaviour indicate that something is amiss and deserve our attention (Section 9.3). When expected and observed behaviour are markedly different, it is worth looking into! What we should test (Section 9.4.2) depends on the data (Section 9.4.3), but it should span local and global behaviour (Section 9.4.4) as well as conceptual and implementation errors (Sections 9.4.5 and 9.4.6).

9.1 Data Are the Problem

Machine learning models effectively compile data into code and dictate to a large extent the behaviour of the software they are embedded in (Section 5.1). Hence it is only logical that issues in the data will impact the software by affecting model training or predictions. Before we do anything else, we should make sure that the data are correctly recorded, properly labelled and without duplicates: only 3% of data are acceptable in this respect even with pretty loose quality standards (Kenett and Redman 2019) and technical debt arising from data is a common issue (Section 5.2.1).

The shape of the data and how the data are collected can result in very different types of issues. For the former, we may have tall data (large sample size, few variables), wide data (small sample size, many variables; also known as “small \(n\), large \(p\)”) and big data (large sample size, many variables, changing over time and possibly unstructured (Katal, Wazid, and Goudar 2013)). For the latter, we should distinguish between experimental and observational data. Experimental data are collected following some experimental design (Montgomery 20AD) that involves identifying a limited set of variables of interest from available knowledge (domain experts, the literature, small-scale preliminary experiments, etc.) and a small number of variables we wish to intervene on (like giving targeted discounts and recommendations or administering specific medical treatments). Eligible data points are chosen based on their characteristics to ensure that the conclusions we draw from models apply to the population of interest, and are randomly assigned different interventions. Randomisation ensures that all types of individuals are observed with different interventions and prevents confounding to some extent. (More on this later.) In contrast, observational data are collected as they arise. Often individuals are added to the data as their information is recorded, without taking their characteristics into account. This, along with the fact that we are not performing any randomised intervention, can bias the models we learn from observational data: either we do not observe data points with certain characteristics (enough of them, or at all) or we do not observe them in a wide-enough range of situations to model their behaviour. This issue is called sampling bias (Section 5.3.1) and affects many applications of machine learning. 
For instance, 96% of participants in genome-wide studies were of European descent in 2009; while new studies performed on Asian populations have reduced that figure to around 80% by 2016, other ethnicities remain chronically underrepresented (Popejoy and Fullerton 2016). The practical consequence of this disparity is that personalised medicine treatments currently under development will not benefit individuals from those backgrounds.

9.1.1 Large Data

Consider the three possible dimensions of data mentioned above: the sample size, the number of variables and the number of time points. The larger the data are in at least one of these dimensions, the more difficult it is to troubleshoot the models we learn from them.

If the data are wide, changes in one variable may induce changes in the contributions of other variables to the model: this phenomenon is called entanglement (Sculley et al. 2015, 2014). As the number of variables grows (“why not add one more input?”), it becomes increasingly likely that multiple variables will express the same information in different ways. The parameters that encode that information in the model will then be jointly determined by those variables. If the distribution of one such variable changes, making it a legacy feature that is no longer significant in the model, the effects of the other variables will increase to compensate. And even if it still retains some degree of statistical significance, it may become an epsilon feature that contributes so little to the model that it is not worth the effort of including it in the first place. (Both legacy and epsilon features should in principle be dropped from models, but they are often not when they are included as a bundle with features that are actually useful.) In other words, “changing anything changes everything” (Sculley et al. 2014), as we discussed in Section 5.2.2 with respect to technical debt. This is all the more true for time series data because, in addition to different variables being entangled with each other, each variable is entangled with itself at previous time points (Section 9.1.3).
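A minimal sketch of entanglement, assuming a toy linear model with two inputs that express the same underlying signal in slightly different ways. The data-generating process and the helper names (`fit_two_features`, `fit_one_feature`) are illustrative assumptions, not taken from any pipeline discussed in this book.

```python
import random

def fit_two_features(x1, x2, y):
    """Least-squares fit of y = b1 * x1 + b2 * x2 (no intercept) via the normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def fit_one_feature(x, y):
    """Least-squares fit of y = b * x (no intercept)."""
    return sum(a * c for a, c in zip(x, y)) / sum(a * a for a in x)

rng = random.Random(42)
n = 2000
signal = [rng.gauss(0, 1) for _ in range(n)]       # unobserved common information
x1 = [s + rng.gauss(0, 0.3) for s in signal]       # first noisy copy of the signal
x2 = [s + rng.gauss(0, 0.3) for s in signal]       # second noisy copy of the signal
y = [2 * s + rng.gauss(0, 0.1) for s in signal]

b1_joint, b2_joint = fit_two_features(x1, x2, y)   # the effect is shared between x1 and x2
b1_alone = fit_one_feature(x1, y)                  # x1 absorbs the effect once x2 is gone
print(f"joint: b1={b1_joint:.2f}, b2={b2_joint:.2f}; x1 alone: b1={b1_alone:.2f}")
```

In this sketch the coefficient of `x1` grows markedly when `x2` is dropped, even though nothing about `x1` itself has changed: the two parameters are jointly determined by both variables.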

As a side effect, entanglement makes it difficult to identify true causal features within a set of correlated features. This is problematic because it prevents us from keeping models simple and small without a significant amount of feature engineering.

The other problem in troubleshooting large data is latency: accessing the data takes time and computational resources, which in turn slows down our iteration speed. This is particularly true for models like deep neural networks that require GPUs and TPUs, which have limited bandwidth and memory (Section 2.2). One possible solution is to choose a good-quality, representative subset of the data and work with that (more in Section 9.1.2), keeping in mind that (repeated) subsampling also has a cost. Another is taking the last known-good snapshot of the model and working on it with a subset of recent data as if we were doing online training.

9.1.2 Heterogeneous Data

Furthermore, we must consider that data may be heterogeneous, comprising variables encoded with different data types and complex data structures. Data ingestion and preparation (Section 5.3.3) then require several algorithms and auxiliary models to filter out poor-quality data points, impute missing data and extract relevant features. Additional models may also be required to post-process the outputs of the core machine learning models. If one input variable changes, it is bound to affect one or more of these models: their output will in turn affect even more models in what we called a correction cascade (Sculley et al. 2014) in Section 5.2.1. In a sense, we can see it as a form of entanglement that spans multiple models (Section 9.1.1); or as a form of coupling between models that are (sometimes undeclared) consumers of each other’s outputs, effectively making them work as a single large model (Section 9.2.1).

Heterogeneous data are difficult to subsample as well: choosing data points at random is unlikely to yield a subset that is representative of the overall data set. Observations belonging to less-frequent classes in imbalanced data are unlikely to appear in a random subsample in sufficient numbers or at all: our estimates of predictive accuracy for the machine learning models can remain high even if they are consistently mispredicted. Subsets are also likely to have a different distribution (as captured by summary statistics) compared to the overall data, which may trigger calibration issues. Outliers that may be causing trouble in the original data are likely to be dropped, making it difficult to replicate the issues we are troubleshooting (reliably or at all). All these problems become more and more pronounced as the difference in size between the original data and the subsamples grows.
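One possible way to keep infrequent classes in a subsample is to sample within each class separately. The sketch below assumes labelled data and a hypothetical helper, `stratified_subsample`, with a `min_per_class` floor; both are illustrative choices rather than a prescribed method.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, rng, min_per_class=1):
    """Return sorted indices of a subsample drawn separately from each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        # Keep the requested fraction, but never fewer than the floor (when available).
        k = max(min_per_class, round(fraction * len(indices)))
        k = min(k, len(indices))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Hypothetical imbalanced data set: 10 rare observations out of 10,000.
labels = ["common"] * 9990 + ["rare"] * 10
subset = stratified_subsample(labels, fraction=0.01, rng=random.Random(0), min_per_class=5)
rare_kept = sum(1 for i in subset if labels[i] == "rare")
print(len(subset), rare_kept)
```

A purely random 1% subsample of these data would contain 0.1 rare observations on average. The floor guarantees a few, at the price of changing the class proportions: any summary statistic or calibration step computed on the subset should account for that, for instance by reweighting.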

9.1.3 Dynamic Data

The data, the models, the code and the architecture can all be sources of technical debt in a machine learning pipeline (Section 5.2). The data sources we use to feed our machine learning models, in particular, are often outside of our control. Hence data dependencies are more costly than code dependencies (Sculley et al. 2015): it takes more effort to troubleshoot their behaviour and to quantify and mitigate their potential impact on the performance of our pipeline.

Data may change slowly over time, either following a medium- to long-term trend or in periodic patterns. (The former is known as data drift, and the latter is called seasonality in statistics.) Both can be encoded in machine learning models at the cost of increasing model complexity. However, models take time to adapt to change: if change is sudden or drastic enough, predictions will be miscalibrated. Using dynamic thresholds that are updated regularly and frequently allows models to adjust to change, but there may be a noticeable lag. Setting such thresholds, however, will require additional, dedicated models, thus introducing additional complexity. Any fixed threshold, whether implicit or explicit, will require domain experts to constantly monitor (Section 5.3.6) the inputs and outputs of data ingestion and preparation modules (Section 5.3.3) to keep it up to date, possibly introducing an even longer lag. (This is an instance of the human-in-the-loop approach we recommended in different places in Chapter 5.)
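To make the lag of a dynamic threshold concrete, here is a minimal sketch that assumes the simplest possible "dedicated model": a rolling mean and standard deviation over the most recent observations. The function `rolling_flags`, the window size and the multiplier `k` are hypothetical choices made for illustration.

```python
from collections import deque
from statistics import mean, stdev

def rolling_flags(values, window=30, k=3.0):
    """Flag each value lying outside mean +/- k * sd of the preceding `window` values."""
    history = deque(maxlen=window)
    flags = []
    for value in values:
        if len(history) == window:
            centre, spread = mean(history), stdev(history)
            flags.append(abs(value - centre) > k * spread)
        else:
            flags.append(False)  # not enough history to set a threshold yet
        history.append(value)
    return flags

# Synthetic series: small regular fluctuations around 10, then a sudden jump to 20.
wobble = [0.1 * ((i * 7) % 5 - 2) for i in range(120)]
series = [(10 if i < 60 else 20) + wobble[i] for i in range(120)]
flags = rolling_flags(series)
print("first flagged index:", flags.index(True))
```

The threshold reacts to the jump immediately, but it only settles on the new level once the whole window has been refilled with recent data. During that transition it is built from a mixture of old and new values, which is the lag mentioned above.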

A type of change that is particularly difficult to identify is when a feature we are using in our models stops correlating with a causal feature. If we included the former in place of the latter by mistake (Section 9.1.1), we suddenly lose access to the information that the causal feature was indirectly providing to the models. Recovering that information may require re-evaluating our data sources and an extensive re-engineering of our data ingestion and preparation modules. And it may be difficult to understand what happened: if two features showed a significant degree of association at the time the models were trained, but gradually drifted apart over time, the (non-causal) feature we included may suddenly become irrelevant for no apparent reason.

9.2 Models Are the Problem

Machine learning models tend to be complex beasts: this is especially the case for deep neural networks but holds for many Bayesian hierarchical models as well. Our ability to troubleshoot models with a large number of parameters estimated from data (and with hyperparameters as well, usually) is severely limited by the sheer number of moving parts we need to track.

9.2.1 Large Models

Firstly, it is difficult to map the effect of any change in the model behaviour or in the data to individual parameters because parameters interact with each other. In order to capture complex patterns of behaviour from the data, machine learning models mix the information present in individual input variables in many (linear and non-linear) ways that are encoded in different parameters. As a result, any change in even a single variable will affect multiple parameters at the same time in ways that may be difficult to understand. Changing the values of some parameters in a way that locally improves some part of a model may have a knock-on effect on the parameters in other parts of the same model. Both these effects compound across the models in a pipeline as we discussed in Sections 5.2.2 and 9.1.2.

Secondly, dealing with a large number of parameters makes it impractical to investigate them individually. Each parameter may have little or no real-world meaning by itself. As we just discussed, its behaviour will be intertwined with that of other parameters: they should be grouped and each group investigated as a single, meaningful entity. Hence we have to resort to an auxiliary model that investigates the parameters for us: it may be something simple like a diagnostic plot based on summary statistics, or something more complex like a second, independent machine learning model. However, summary statistics by their nature lose information, making bugs easily go undetected, and adding a second machine learning model may not be worth the additional complexity of ensuring that model is also working properly. It is troubleshooting all the way down!

9.2.2 Black-Box Models

Thirdly, most large machine learning models are effectively black boxes. Individual parameters are mathematical constructs that often have no real-world meaning, even when considered in groups. An entire research field, focusing on explainability and interpretability, has sprung up in an effort to relate changes in the model inputs to changes in the model outputs. Ideally, we want to do that in a way that can make these relationships meaningful to a domain expert: for instance, visualising word relevance in NLP (Li et al. 2016) and pixel relevance in computer vision (Simonyan, Vedaldi, and Zisserman 2014) or splitting images into layers with semantic meaning (Ribeiro, Singh, and Guestrin 2016). Observing the behaviour of a model around key input values with local approaches like LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017) can also provide insights: both approaches work by perturbing the inputs and checking whether the outputs are stable, and mapping any instabilities to specific subsets of parameters.
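The idea of perturbing inputs and watching the outputs can be sketched in a few lines. The code below is neither LIME nor SHAP: it is a much cruder finite-difference stand-in, and both `perturbation_sensitivity` and the toy `black_box` function are hypothetical names introduced for illustration.

```python
def perturbation_sensitivity(predict, point, delta=0.01):
    """Change in the output per unit change in each input, around a single point."""
    baseline = predict(point)
    sensitivities = []
    for i in range(len(point)):
        perturbed = list(point)
        perturbed[i] += delta  # nudge one input, keep the others fixed
        sensitivities.append((predict(perturbed) - baseline) / delta)
    return sensitivities

def black_box(x):
    # Hypothetical stand-in for an opaque model; the third input is ignored.
    return 3.0 * x[0] - 0.5 * x[1] ** 2 + 0.0 * x[2]

scores = perturbation_sensitivity(black_box, [1.0, 2.0, 5.0])
print([round(s, 3) for s in scores])
```

An input whose score is zero around all the points we care about is a candidate for removal. An input whose score swings wildly between nearby points suggests that the model is unstable in that region.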

9.2.3 Costly Models

Fourthly, training large machine learning models is expensive and time-consuming. This makes for slow iterations and may very well make troubleshooting impractical. Among recent deep neural network architectures for NLP, Google’s XLNet (Yang et al. 2019) costs an estimated $61,440 to train, taking 2 days with 512 TPU v3 chips (Google’s proprietary AI coprocessors); University of Washington’s Grover-Mega (Zellers et al. 2019) takes two weeks and $25,000; Google’s BERT (Devlin et al. 2019) costs between $500 and $6,912 and takes 4 days to 2 weeks to train. It is currently unknown how much OpenAI’s GPT-2 (Radford et al. 2019) originally cost to train, but the open-source OpenGPT-2 (Cohen, Pavlick, and Tellex 2019) took $50,000. And this only covers training: hyperparameter tuning can easily involve training 10-100 models before finding a well-performing one. A recent study from the American Medical Association has found that simply reproducing one of these models using publicly available resources can cost between $1 million and $3.2 million (Beam, Manrai, and Ghassemi 2020).
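A back-of-the-envelope calculation shows how quickly these figures add up. The sketch below uses only the XLNet numbers quoted above; the hourly rate it prints is merely what those numbers imply, not an official price, and it assumes (simplistically) that every tuning run costs as much as one full training run.

```python
def implied_chip_hour_rate(total_cost, chips, days):
    """Price per chip-hour implied by a total training bill."""
    return total_cost / (chips * days * 24)

def tuning_budget(cost_per_run, runs):
    """Total cost if every tuning run costs as much as one full training run."""
    return cost_per_run * runs

xlnet_cost = 61_440  # estimated cost in USD, as quoted in the text
rate = implied_chip_hour_rate(xlnet_cost, chips=512, days=2)
low, high = tuning_budget(xlnet_cost, 10), tuning_budget(xlnet_cost, 100)
print(f"implied rate: ${rate:.2f} per chip-hour")
print(f"tuning with 10 to 100 runs: ${low:,} to ${high:,}")
```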

The numbers above represent a worst-case scenario. Deep neural networks for applications other than NLP are typically much smaller and thus much cheaper and quicker to train. For instance, the ResNet-50 architecture for computer vision tasks can be trained in minutes for a few dollars (Tabuchi et al. 2019) because it only has 25 million parameters (Grover-Mega and GPT-2 have 1.5 billion, XLNet has 340 million). And we rarely have to retrain models from scratch: it is common to use the current model as a pre-trained starting point or to buy a pre-trained model from a commercial vendor. (However, this practice may produce technical debt at the model level as discussed in Section 5.2.2.) We can also trade training speed for cost and vice versa: slower solutions are cheaper, and their prices have been steadily falling in recent years. We may also be tempted to reduce the overall computing costs with lazy code execution but that may introduce non-deterministic behaviour and make troubleshooting even harder. Using cloud resources as massive parallel compute facilities to divide-and-conquer training may complicate things rather than make them easier because remote debugging in the cloud comes with its own set of problems (Section 2.3).

Finally, let’s not forget that there are machine learning models other than deep neural networks: random forests and gradient-boosted trees (Natekin and Knoll 2013) are much faster and cheaper to train and quite often achieve competitive performance, especially on tabular data.

9.2.4 Many Models

As we mentioned in Section 9.1.2, dealing with complex data may require a complex machine learning pipeline involving several models linked by an orchestrator and to some extent by glue code. On the one hand, such code may be helpful in isolating the peculiarities of the different models and of the libraries that are used to implement them. On the other hand, glue code may introduce bugs in how models interact. Such bugs are not easily detected without extensive integration tests, and are common in the “pipeline jungles” we discussed in Section 5.2.3. Unit tests would cover the correctness of individual models, but not the correctness of how they are wired together. The more models we include in our pipeline, the more difficult it is to troubleshoot their interactions because the number of possible pipeline configurations explodes combinatorially as the number of models increases. This may be compounded by the presence of dead and experimental code paths that are not essential to the functioning of the machine learning models (Section 5.2.4).

Another issue, which we covered in Section 5.2.2, is that the more models we have in our pipeline, the more likely it is that they will create feedback loops or correction cascades.

9.3 Common Signs That Something Is Up

How can we tell whether one or more of the issues discussed above are affecting the performance of our machine learning pipeline? There are so many (combinations of) things that can go wrong that it is difficult to compile an exhaustive list of signs that something is up. There are, however, some common patterns of behaviour that should be regarded as suspicious.

Predictive accuracy is really bad. Models may be unable to capture enough relevant information from the training data to be able to predict new data points. The data may not contain such relevant information in the first place. That information may not be usable without further effort into engineering a suitable set of features. Or the information may be there, but the models fail to capture it due to computational issues or because they make the wrong assumptions on the distribution of the data. If any of these is true, we should focus our troubleshooting efforts on data preparation (Section 5.3.3) and model training (Section 5.3.4) modules. We should also re-evaluate our data sources: were there any changes that made (some of) them no longer useful?

Predictive accuracy is really good. If the models we are implementing are appropriate for the problem they are tasked to solve, and if the data provide relevant information to train them, we would expect them to perform “well”. How well is “well” depends on a combination of these two factors, and on how we chose the problem and the metrics with which we define success (Section 5.3.1). Narrowly-defined tasks are easier to put into precise mathematical terms, making them easier to optimise for. On the other hand, tasks with broad definitions typically conflate multiple subtasks with different requirements and goals that may conflict with each other. However, if a task is nontrivial we should treat extremely high performance (say, 99.9+% classification accuracy) as a possible red flag. Unbalanced data sets in which not all the classes we are trying to predict are well represented may result in unrealistically high accuracy if the models always predict the most common one or two classes and miss the rest. The different types of feedback loops we discussed in Section 5.2.2 may have a similar effect. Finally, high accuracy may be indicative of information leakage between what we are trying to predict and the data we use to train our models, for instance because one of the variables is an alias of the prediction target.
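The effect of unbalanced classes on accuracy is easy to reproduce. The sketch below assumes a hypothetical fraud-detection data set and a "model" that always predicts the most common class; the helper names are illustrative.

```python
def accuracy(truth, predicted):
    """Proportion of data points whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def recall_by_class(truth, predicted):
    """For each class, the proportion of its data points that are predicted correctly."""
    recalls = {}
    for label in set(truth):
        indices = [i for i, t in enumerate(truth) if t == label]
        recalls[label] = sum(predicted[i] == label for i in indices) / len(indices)
    return recalls

# Hypothetical labels: 5 fraudulent transactions out of 1,000.
truth = ["legitimate"] * 995 + ["fraud"] * 5
always_majority = ["legitimate"] * len(truth)

overall = accuracy(truth, always_majority)
by_class = recall_by_class(truth, always_majority)
print(overall, by_class)
```

Overall accuracy looks excellent while every single data point in the minority class is mispredicted, which is why a per-class breakdown is a useful companion to any aggregate metric.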

Data leakage also happens when part of the training set is implicitly used in the test or validation sets. This may involve different data points originating from the same individual or from related individuals ending up in both data sets. For instance, these may be two sentences from the same page of text, two web product accounts opened by the same person or by people in the same family, health information from siblings or online questionnaires administered to the same person at different times. In any of these cases, instead of validating the machine learning models with a realistic simulation of the production setting they will work in (completely new data points), we are validating them against data points they already know about at least to some extent. Hence our assessment will give us an optimistically biased estimate of the models’ predictive accuracy and make us overconfident in their capabilities.
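One way to avoid this kind of leakage is to split the data by individual (or family, or page of text) rather than by data point. The sketch below assumes that each record carries a group identifier; `group_split` and the questionnaire data are hypothetical.

```python
import random

def group_split(groups, test_fraction, rng):
    """Split record indices so that each group falls entirely in train or in test."""
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test

# Hypothetical data: 20 people who each filled in 5 questionnaires.
groups = [f"person_{i // 5}" for i in range(100)]
train, test = group_split(groups, test_fraction=0.2, rng=random.Random(7))

# No person should contribute records to both sets.
shared = {groups[i] for i in train} & {groups[i] for i in test}
print(len(train), len(test), shared)
```

The check on `shared` can be kept as a permanent test on the data-splitting code, so that a later refactoring cannot silently reintroduce the leak.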

Predictive accuracy suddenly changes. Mathematical models of reality, including machine learning models, make various regularity assumptions that encode the idea that reality varies smoothly: small changes in the inputs of the models should produce small changes in their outputs; and the larger the changes in the inputs, the potentially larger the changes in the outputs. Any marked change in a model’s behaviour that cannot be immediately linked to a known real-world event may be indicative of an incorrect model that just happened to work and finally broke down, making it apparent that it was wrong in the first place. (Losing any connection between the training data and unobserved causal features as described in Sections 9.1.1 and 9.1.3, for instance.) It may also be indicative of some inputs changing in a fundamental way (changes in the variable types or meaning, feedback loops, etc.) or becoming unavailable (Sections 5.2.1 and 5.2.2). The only way of troubleshooting such issues is to put in place comprehensive monitoring facilities covering all the modules in the pipeline and to aggregate all metrics in a monitoring server, where they can be correlated and cross-referenced across time (Section 5.3.6).

The resources required to train the models or to make predictions with the machine learning pipeline are at odds with the computational complexity of the algorithms it implements. As we discussed in Section 4.6, real-world resource usage is not a perfect reflection of big-O notation: it does not take constant factors and different hardware capabilities (parallel execution, cache sizes, etc.) into account, nor can it easily incorporate all the optimisations performed by modern compilers and language interpreters. There should be, however, some discernible relationship between the two. Large discrepancies suggest that training data or input features may be breaking some of the assumptions on the model, or that there are too few data points. In either case, model training and hyperparameter tuning will struggle to identify an optimal model, taking more time than expected. Large clusters of related variables (Section 9.1.1) may have a similar effect, because model training will struggle to separate their (overlapping) effects. Prediction, by comparison, is less likely to be problematic. As before, we should be able to point out any anomalies in resource usage by a combination of monitoring and logging across modules.
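A rough way to compare observed resource usage with the expected computational complexity is to time the same operation at several input sizes and estimate how running time scales. The sketch below assumes a simple log-log regression; `measure` and `scaling_exponent` are hypothetical helpers, and the timings are synthetic so that the example is deterministic.

```python
import math
import time

def measure(function, argument, repeats=3):
    """Best wall-clock time (in seconds) over a few repeated calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        function(argument)
        best = min(best, time.perf_counter() - start)
    return best

def scaling_exponent(sizes, timings):
    """Slope of log(time) against log(size): about 1 for linear cost, 2 for quadratic."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in timings]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

# Synthetic timings standing in for real measurements of a quadratic-time step.
sizes = [1_000, 2_000, 4_000, 8_000]
quadratic_timings = [3e-8 * n ** 2 for n in sizes]
exponent = scaling_exponent(sizes, quadratic_timings)
print(f"estimated exponent: {exponent:.2f}")
```

In practice the timings would come from calls such as `measure(train_model, data_subset)`, where `train_model` is whatever training function the pipeline exposes. An exponent far from what the algorithm's complexity suggests is the kind of discrepancy worth investigating.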

9.4 Tests Are the Solution

Current practices from software engineering strongly suggest that the most reliable way of identifying defects in software is testing. Much has been written on this topic in classic books such as “The Pragmatic Programmer” (Thomas and Hunt 2019) and “Test-Driven Development” (Beck 2002). Few resources touch on the topic of testing machine learning software: among them are Alice Zheng’s “Evaluating Machine Learning Models” (Zheng 2015), the “ML Test Score” rubric from Google Research (Breck et al. 2017) as well as a few survey papers in academic literature (Braiek and Khomh 2020; Zhang et al. 2020). We will do our best to give an overview of all the facets of testing machine learning pipelines in the remainder of this chapter, complementing our discussion of software testing from Chapters 5 and 6. We will also rely heavily on the automated and reproducible deployment practices we discussed in Chapter 7: we should run each test in a clean environment to make sure that its results are not influenced by external factors (including other tests). That is typically implemented by using the base container images we use for our production systems in our continuous integration setup.

9.4.1 What Do We Want to Achieve?

Following Zhang et al. (2020), we can summarise our goals as:

  • Model correctness: if input data follow the distribution we expect them to, outputs should be correct and predictions should be accurate with high probability.
  • Empirical correctness: outputs should be correct and predictions accurate for new data points, that is, the empirical performance of the models should be reliably above the threshold we set for our metrics (Section 5.3.1).
  • Model relevance: models should be able to represent the distribution of the data and to fit them well without overfitting.
  • Robustness: models should handle invalid or extreme inputs gracefully.
  • Adversarial robustness: models should also handle malicious inputs that are crafted to be hard to detect and to produce specific outputs.
  • Efficiency: model training and inference should use the least possible amount of compute and memory that produces the desired level of predictive accuracy.
  • Interpretability, fairness and privacy: as discussed in Section 5.3.1.

Tests should strive to ensure that these goals are met by investigating a variety of valid and invalid inputs and outputs for both individual models and the machine learning pipeline as a whole. They should give confidence in the ability of the pipeline to perform its assigned task well for common inputs and to degrade gracefully otherwise.

9.4.2 What Should We Test?

In principle, a comprehensive test suite should cover:

  • The raw data, covering invalid or missing values, variable representations (scaling, one-hot encoding, etc.), variables that are of little to no use along with those that are redundant because they encode the same information (legacy and epsilon variables). We should also have offline and online tests for:
    • Insufficient sample size: Do we have enough data points to (re)train the model? Is the sample size large enough to make it possible to observe infrequent configurations of the variables?
    • Data drift: Does new data have a distribution comparable to that of the data the model was trained from?
    • Outliers: Are there any data points with values different enough from the rest that we may think of them as recording errors?
  • The key components of the models:
    • Models: Are they appropriate for the data? Can they regularise (smooth) noisy outputs?
    • Parameters: Are parameter values unusually large or small? Are there parameters that have no effect on predictions (for instance, because they are equal to zero)?
    • Hyperparameters: Do they encode expert knowledge correctly? Or, conversely, are they really non-informative? Do they restrict the range of models we can learn?
    • Loss functions: Do they express meaningful properties of the model outputs (Section 5.3.4)? Can they differentiate between models well, picking models that predict well and that capture the key relationships between the variables?
    • Optimisers: Can they explore a wide range of models efficiently? Do they converge reliably or are they prone to settling for suboptimal models?
  • The post-processed data and inference outputs to spot features that become problematic or are not worth keeping and to ensure that predictions are accurate enough to be fit for purpose.
  • Any glue code that is used to wrap models, to help access their inference capabilities or to orchestrate them (Sections 5.2.3 and 5.2.4).

This is, of course, in addition to any tests required to ensure that the underlying infrastructure is working, feeding inputs to the pipeline and putting its outputs to use. For this to be possible, we must be able to track data, models, predictions, hyperparameters and parameters simultaneously through configuration management under version control (see also Sections 5.1 and 5.2).
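As an illustration of tests on raw data, the sketch below checks for missing values and for outliers. It assumes that missing values are encoded as `None` and uses Tukey's interquartile-range rule as the outlier criterion; both are illustrative choices, and the rule is only one of many possible definitions of "different enough from the rest".

```python
from statistics import quantiles

def missing_fraction(values):
    """Proportion of values that are missing (encoded as None)."""
    return sum(v is None for v in values) / len(values)

def tukey_outliers(values, k=1.5):
    """Observed values lying more than k interquartile ranges outside the quartiles."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in observed if v < low or v > high]

# Hypothetical heights in centimetres: one is missing, one was recorded in metres.
heights_cm = [172, 168, None, 181, 175, 1.79, 169, 177, 174, 170]
print(missing_fraction(heights_cm), tukey_outliers(heights_cm))
```

Checks like these can run both offline, on the training data, and online, on each incoming batch, with thresholds on the missing fraction and on the number of outliers agreed with domain experts.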

Even if we can effectively test all the above, a crucial problem remains: how do we determine whether a test should pass or fail? In order to do so we must be able to determine what is the expected behaviour of each individual model and of the pipeline as a whole, which is difficult when dealing with the stochastic nature of machine learning models. Typically, we do not have access to an oracle: we do not know in advance what the “correct behaviour” should be or we would not need the models in the first place! The models give us some clues in their assumptions and their mathematical and probabilistic properties: the former determine what valid inputs are, the latter suggest what output we should get for a given input. Model invariants (that is, changes in the inputs that should not change the output) give more theoretical properties that should be empirically satisfied. This is a form of property-based testing in which the properties to test are mathematical statements that we can derive from model definitions. If we are using models that have multiple implementations, we can also compare the output of the implementation we are using to that of other implementations. If they agree up to some tolerance threshold, and we trust those other implementations to be correct, we can take them as pseudo-oracles and validate our models. This practice is called differential testing, and can supplement property-based testing for models without easily-testable properties like black-box models (Section 9.2.2).

9.4.3 Offline and Online Data

Tests based on offline data and online data are quite different.

Offline data are mainly used for tuning hyperparameters and training models, and they are collected by combining historical data and new data points into a static sample until its size is large enough (Section 5.3.4). These data will then be labelled to obtain a ground truth to train the model. Images will be tagged based on which items they display; sentences will be tagged by their main topic(s); lab samples will be tested to detect the phenomena we would like models to identify. (Note that in many cases a label is a discrete, categorical variable, but it need not be. It can be an ordinal variable, such as age brackets, or a numeric value.) The labelling process acts as a pseudo-oracle: it is expensive, time-consuming, and with a non-zero error rate, but it is the closest thing to ground truth we can access in most settings. In a sense, it allows us to train a model and compare its performance against human performance (assuming labelling is done by domain experts, see Section 5.2.1).

Therefore, testing model training and hyperparameter tuning with offline data together with the offline data themselves is relatively straightforward. We have a large sample, which allows us to test the pre-processing of raw data and feature engineering to ensure that they produce suitable inputs for the models. In the spirit of property-based testing, we can test that the models behave correctly when they are fed features that satisfy their assumptions; and that they either report errors or degrade gracefully otherwise. From the empirical distributions of the data and the model assumptions, we can identify both corner cases to test limit behaviour and cases that are well-spaced in the sample space and cover a variety of typical behaviour. Thanks to the labels, we can estimate the model’s predictive performance with some sort of train-test-validation data split, making it possible to perform hyperparameter tuning and to rank different model choices. The accuracy observed during training will also serve as a benchmark to monitor the performance of the models in production (Section 5.3.6).

Online data are generated as a constant stream from external sources in the form of individual data points or small batches. Therefore, testing takes the form of online monitoring, A/B testing (which is covered in depth in (Zheng 2015)) or one of the other strategies outlined in Section 7.2. Online data often come without labels, so we cannot directly assess whether models handle them correctly. We can test whether the data we see in production follow the same distribution as the training data by collecting data points across a short period of time and testing whether their empirical distribution is different from what we would expect. If the data are unlabelled, we will be limited in doing so either by the availability of domain experts to perform the labelling in a short time frame or by the limited accuracy of machine learning models at this task. We can then set dynamic thresholds to detect both sudden and gradual losses in accuracy. Similarly, we can test for changes in the distribution of input features. In either case, we can flag the test to be reviewed by a domain expert or assume that the model is now out of date and must be retrained automatically. In practice, such tests can fail in benign ways for a number of reasons, so keeping a human in the loop to check why failing tests are failing is preferable (Section 5.3.4).
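As an illustration of testing whether a window of production data follows the same distribution as the training data, the sketch below compares the two empirical distributions with a two-sample Kolmogorov-Smirnov statistic. It is only one of many possible tests, chosen here as an assumption; the function names, the sample sizes and the simulated Gaussian data are all hypothetical, and the critical value relies on the usual large-sample approximation.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drift_detected(reference, window, alpha=0.05):
    """Flag drift when the statistic exceeds its large-sample critical value."""
    n, m = len(reference), len(window)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, window) > critical


# Simulated data (an assumption): a training sample, a production window from
# the same distribution, and a window whose mean has shifted.
rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]
print(drift_detected(training, same), drift_detected(training, shifted))
```

A test like this will occasionally fail on data that have not drifted (by construction, about 5% of the time with `alpha=0.05`), which is one of the benign failures that a human in the loop can dismiss.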

If we do not have enough data to both train the models and to test them, we can generate more either by resampling or by stochastic simulation. Both bootstrap and cross-validation make it possible to create new data sets by resampling an offline data set (see, for instance, (M. Kuhn and Johnson 2013) for a brief introduction and several examples). They both start from the idea that data are sampled from the population of interest, hence the distribution of the variables in the data is an empirical approximation of their distributions in the population. Sampling again from the data can be implemented so that the bootstrap samples and cross-validation splits preserve this property. The resulting data sets are perturbed versions of the original containing a subset of its data points: about 63.2% of them on average in the case of the bootstrap, and a proportion determined by the fold structure in the case of cross-validation. The remaining data points can then be used to build test and validation sets to evaluate the models, as in random forests (Breiman 2001a, 1996).
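The 63.2% figure follows from sampling n data points with replacement: each point is left out of a bootstrap sample with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below checks this empirically and collects the left-out (out-of-bag) points that could serve as a test set; the function name and the sample size are illustrative assumptions.

```python
import random


def bootstrap_split(n, rng):
    """Return (in-bag indices, out-of-bag indices) for one bootstrap sample."""
    in_bag = [rng.randrange(n) for _ in range(n)]   # sampling with replacement
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(1)
n = 10_000
in_bag, out_of_bag = bootstrap_split(n, rng)

# Fraction of distinct original points that appear in the bootstrap sample,
# next to the theoretical value for this n.
distinct_fraction = len(set(in_bag)) / n
print(round(distinct_fraction, 3), round(1 - (1 - 1 / n) ** n, 3))
```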

Preserving the empirical distribution of a variable while resampling is a simple endeavour if all data points are independent, but it can become very complicated very quickly when the data have some kind of structure such as spatial and temporal dependencies. Using stochastic simulations may be more straightforward in such cases. A simple approach is to perturb data points with either stochastic noise or randomly-chosen deterministic transformations (addition, subtraction, multiplication, etc.). Small perturbations should not alter the outputs of a model if the model is sufficiently robust for practical use. Stochastic noise makes overfitting less likely by effectively smoothing the data in the same way as ridge regression (Bishop 1995), which will help us in identifying whether our models are overfitting or are singular in places. Using deterministic transformations, on the other hand, facilitates testing model invariants and some types of model properties. If the model is invariant to a transformation, the model and its outputs should not change: the original and transformed data belong to the same equivalence class, in the sense that they result in equivalent models (see note 4). If the model is not invariant to a transformation, we may still be able to map the transformed inputs to the corresponding parameter estimates and predictions based on the properties of the model. For instance, models constructed using linear functions of the data, like linear regression models, are closed under linear transformations: multiplying a variable by a constant will divide the associated regression coefficient by the same constant; adding a constant to a variable should not change the associated regression coefficient, which expresses the change in the response for a unit change in the variable, although it will change the intercept; and adding a constant to the response will shift the intercept of the model by the same amount. These are all properties that are easy to test and that our model implementation must satisfy.
If we think of including and excluding data points as a deterministic transformation of the data, we can consider bootstrap and cross-validation themselves as stochastic simulations! This makes intuitive sense if we consider that they use random sampling with and without replacement, respectively.
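To make the idea of testing with deterministic transformations concrete, the following sketch fits a simple linear regression with one explanatory variable on simulated data and checks how the estimates respond when the variable is rescaled, when the variable is shifted, and when the response is shifted. The function `fit_line()`, the simulated data and the constants are assumptions made for the example, not part of any particular library.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def close(u, v):
    return math.isclose(u, v, rel_tol=1e-9, abs_tol=1e-9)


rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * a + rng.gauss(0, 1) for a in x]
intercept, slope = fit_line(x, y)

# Multiplying the variable by 3 divides its coefficient by 3.
i_scaled, s_scaled = fit_line([3.0 * a for a in x], y)
assert close(s_scaled, slope / 3.0) and close(i_scaled, intercept)

# Adding 4 to the variable leaves the slope unchanged; the intercept
# compensates by -4 * slope.
i_shift_x, s_shift_x = fit_line([a + 4.0 for a in x], y)
assert close(s_shift_x, slope) and close(i_shift_x, intercept - 4.0 * slope)

# Adding 4 to the response shifts the intercept by 4 and nothing else.
i_shift_y, s_shift_y = fit_line(x, [b + 4.0 for b in y])
assert close(s_shift_y, slope) and close(i_shift_y, intercept + 4.0)
```

Because the expected effect of each transformation is known exactly, these checks need no labelled test data: the transformed fit is compared against the original fit rather than against a ground truth.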

A more complex approach to stochastic simulation is to train a generative model on the data, and use it as an auxiliary model that generates new data points to build tests with. If the generative model captures the distribution of the data well, the data points that it generates should follow the same distribution and thus be a valid substitute. Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) are a popular choice, but graphical models (Scutari and Denis 2021) may provide an alternative that is simpler to learn and that requires fewer data to train. The advantage of this approach is that it is more flexible than those we discussed above: it can be tweaked to generate outliers and adversarial data points as well as data points with the expected distribution. We can also make sure that the generated data sets are sufficiently different from each other to test the model under various scenarios. However, training a generative model requires a significant amount of data, and it adds to the complexity of the machine learning pipeline (see Sections 9.2.1 and 9.2.4). If nothing else, it means more models to test. A cheaper alternative may be an interpolation algorithm like SMOTE (Fernandez et al. 2018), which is more computationally efficient at the cost of being more limited in the data points it can generate.
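The interpolation idea behind SMOTE can be sketched in a few lines: pick a data point, pick one of its nearest neighbours, and create a new point somewhere on the segment joining them. The code below is a simplified illustration under that assumption, not the full algorithm (which applies the idea to the minority class in imbalanced classification); the function name and the example points are hypothetical, and at least two data points are required.

```python
import math
import random


def interpolate_points(points, k, n_new, rng):
    """Create synthetic points on segments joining a point to a near neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # The k nearest neighbours of the base point, excluding the point itself.
        others = [p for p in points if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()   # position along the segment, in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


rng = random.Random(3)
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = interpolate_points(points, k=2, n_new=20, rng=rng)
```

The limitation mentioned above is visible in the code: every synthetic point lies between two existing points, so this approach cannot produce outliers or points outside the region spanned by the data.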

9.4.4 Testing Local and Testing Global

We can only understand the emergent properties of a machine learning pipeline by considering it as a whole, which suggests that testing the whole pipeline is as important as testing the individual models it orchestrates. Hence the following classes of tests are all equally important to implement:

  • Unit tests: testing that the individual models display the theoretical properties we know they have, including their resource usage based on big-O notation.
  • Integration tests: testing that all models accept valid inputs, reject invalid inputs, produce valid outputs, and generate errors instead of producing bad outputs. We want to make sure that if models are wired up properly they will not trip each other up.
  • System tests: feeding raw data to the pipeline and testing that the final output is correct, insofar as we can determine that from theoretical considerations (like model evaluation in Section 5.3.4).
  • Acceptance tests: checking whether the final outputs of the pipeline are of sufficient quality for their intended use (like model validation in Section 5.3.4).

This list broadly follows standard naming conventions for different types of tests established in Code Complete (McConnell 2004), but requires some clarifications to make sense in the context of machine learning pipelines. First of all, what is a “unit”? The traditional definition is “a complete class, routine, or small program that has been written by a single programmer or team of programmers”. In our case, we consider that to be a single model in the pipeline or a module performing associated tasks like data ingestion or data preparation (Section 5.3.3) or inference (Section 5.3.5). Often we will be able to use models that are already implemented in third-party libraries, in which case unit tests should be provided by their developers. (Given the realities of the software produced in academia, that may very well not happen, leaving all the testing to us.) If we are implementing any machine learning models ourselves, we can make model evaluation code double as a suite of tests as well.

Integration testing is “the combined execution of two or more classes, packages, components or subsystems that have been created by multiple programmers or programming teams”. Since we are treating each machine learning model and each module as a unit, we should test that their outputs are valid inputs for the modules that consume them. In particular, integration tests involving data ingestion and data preparation together with models ensure that our quality gates are effective (Section 5.3.3). Often these tests can only be very basic, because even with property-based testing we may only have some very general knowledge about what a module’s inputs and outputs look like. As for machine learning models, their sample and parameter spaces are both very large and difficult to test in a comprehensive way.

This leaves system testing, “the execution of the software in its final form” focusing on “security, performance, resource loss, timing problems, and other issues that can’t be tested at lower levels of integration”. Ideally we can implement it by starting from a limited, representative set of data and tracing how the data are acted upon by all the modules in the pipeline, all the way from data ingestion (Section 5.3.3) to reporting (Section 5.3.6). Or we can do the same with randomly generated data. System testing provides the most realistic assessment of the correctness and the performance of the pipeline, especially if we are using real-world data to seed the test. It allows us to test the propagation of errors, meaning both programming errors (like incorrect code and floating point errors) and stochastic errors (errors in the distributions of intermediate outputs that are taken as input by other models). Even in the absence of errors, we usually do not know what the distribution of the output of a model looks like, so it is difficult to simulate it to build integration tests.
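The difference between unit, integration and system tests can be sketched on a toy pipeline with three modules. Everything in the example below is hypothetical: the module names, the fixed linear model and the quality gate are assumptions chosen to keep the code short, and a real pipeline would have far richer checks at each level.

```python
import math


def ingest(raw_rows):
    """Data ingestion with a quality gate: parse text fields, reject bad rows."""
    values = [float(row) for row in raw_rows]      # raises ValueError on junk
    if any(not math.isfinite(v) for v in values):
        raise ValueError("non-finite value in input")
    return values


def prepare(values):
    """Data preparation: centre and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        raise ValueError("constant feature cannot be standardised")
    return [(v - mean) / sd for v in values]


def predict(features, slope=2.0, intercept=1.0):
    """Inference with a fixed, hypothetical linear model."""
    return [intercept + slope * f for f in features]


def pipeline(raw_rows):
    return predict(prepare(ingest(raw_rows)))


def raises_value_error(func, argument):
    try:
        func(argument)
    except ValueError:
        return True
    return False


# Unit test: prepare() has the property we expect of standardisation.
assert abs(sum(prepare([1.0, 2.0, 3.0]))) < 1e-12
# Integration tests: ingestion stops bad rows before they reach the model.
assert raises_value_error(ingest, ["1.5", "inf"])
assert raises_value_error(ingest, ["1.5", "abc"])
# System test: raw data in, final output out. Standardised inputs average to
# zero, so the predictions must average to the intercept.
outputs = pipeline(["1", "2", "3", "4"])
assert abs(sum(outputs) / len(outputs) - 1.0) < 1e-12
```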

If a machine learning pipeline passes unit, integration and system testing, we may have some degree of confidence that it works like it is supposed to. This, however, does not necessarily mean that it will prove to be useful to the people it was designed for, be they scientists trying to figure out how nature works or marketing people trying to make people click on ads. That is what acceptance testing is for: checking whether the pipeline solves the problem that motivated its development during project scoping (Section 5.3.1) and whether it meets all its targets. The software may be too slow, while users need real-time feedback; it may be too resource intensive, so it does not scale well enough to work on future data sets; or it may not be accurate enough in its predictions to meet service-level agreements or relevant regulations. The difference between being technically correct and being useful is, in a sense, a reflection of the difference between statistical significance and practical significance. Even if one machine learning model performs better than another, and even if the difference is statistically significant, it does not necessarily mean we should pick that model over other alternatives. The metric we are measuring may not correlate well with the task we are trying to model; the difference between the two models may be real but too small to matter in practice; or the better model has some undesirable characteristics that make it difficult to deploy it. None of these issues are, per se, the concern of unit, integration or system tests. Nevertheless they are real issues for the users of the machine learning pipeline and thus we should give them serious consideration.

9.4.5 Conceptual and Implementation Errors

What types of errors do we expect to catch with tests? If we exclude issues with infrastructure and input data, one way we can think about them is in terms of conceptual errors and implementation errors.

Machine learning models with a closed-form formulation, from simple logistic and ridge regression models (Hastie, Tibshirani, and Friedman 2009) to hierarchical Bayesian models implemented via variational inference (Blei, Kucukelbir, and McAuliffe 2017), often have closed-form estimators for their parameters and the respective distributions (for a given choice of the hyperparameters) as well as for loss functions and key statistical tests. The algebraic derivations involved in constructing them are prone to human errors. Some of these errors will be incorrect algebraic manipulations that can be spotted, albeit with difficulty, either by machine learning experts or by software for the symbolic manipulation of mathematical expressions. Errors involving modelling choices are more difficult to catch: for instance, incorrect assumptions on model inputs, approximations that prove to be too coarse, asymptotic considerations that do not work out or the inability to capture particular patterns of dependence between variables. These kinds of conceptual errors may require an experienced machine learning expert or two and much eyeballing to identify, and they are especially difficult to detect when the model uses stochastic optimisation for hyperparameter tuning or inference because stochastic noise tends to hide errors with relatively small magnitudes.

On the other hand, many machine learning models have an implicit formulation that relies on numeric or stochastic optimisation to learn a model that has some set of properties for some loss function. It is less common for such models to be affected by conceptual errors, simply because their mathematical formulation is not explicit and thus requires fewer algebraic derivations or probabilistic assumptions. However, implicit models are more prone to implementation issues. In order to make optimisation computationally feasible, or to be able to use commercial solvers, their implementation often looks nothing like their theoretical specification. For example, in the last 20 years many machine learning models have been reimplemented on top of CUDA (Nvidia 2021) to leverage the parallelism of GPU linear algebra operations. To benefit from parallelism, model training had to be refactored into as many small, independent operations as possible. On top of that, mathematical operations were restricted to those implemented in silicon on GPUs and TPUs, which means, for the most part, linear operations on vectors and matrices (see note 5). GPUs and TPUs have limited memory, which has encouraged the use of single-precision floating point instead of the more common double-precision and made floating point errors and rounding a pressing issue to consider. They also have limited bandwidth, so the code they run had to be designed not to require frequent interaction with the main program running on the CPU. And given limited memory and bandwidth, models were also required to operate on limited subsets of the data and collate the results instead of loading all data into memory. Another example is implementing machine learning models as distributed models over cheap cloud compute instances. (More on this in Chapter 2.)
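The effect of moving from double to single precision can be seen even in a trivial accumulation. The sketch below emulates single-precision arithmetic on the CPU by rounding the running total to the nearest 32-bit value after every addition, which is an assumption made to keep the example free of GPU code and third-party libraries; the function names and the number of additions are arbitrary.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times."""
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)   # round after every addition
    return total


count = 100_000
exact = 10_000.0   # 100,000 additions of 0.1 in exact arithmetic
error64 = abs(accumulate(0.1, count, single_precision=False) - exact)
error32 = abs(accumulate(to_float32(0.1), count, single_precision=True) - exact)
print(error64, error32)
```

The single-precision error is many orders of magnitude larger than the double-precision one, so a test with a tolerance tuned on a double-precision reference implementation may fail on a single-precision port even when the port is correct.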

9.4.6 Code Coverage and Test Prioritisation

Then, the more tests we put in place, the better? Not quite. Each test comes at a cost. Software tests are themselves software: they involve writing code, troubleshooting it and ensuring that it is correct. We should also keep them in sync with the modules they are testing and with the machine learning pipeline. Every time we introduce a new model or a new module, remove or modify one, and every time we revisit how they are wired up, we should also review the associated software tests. In other words, every time the specification of the pipeline in our configuration management platform changes (Section 5.1), continuous integration will re-run all the tests (Section 5.3) and we will have to revisit those that fail. Furthermore, running tests to check whether they pass or not can take a significant amount of time and hardware resources.

We walk a fine line between having enough tests to ensure the pipeline works well and having as few tests as we can get away with. Given the constraints of what hardware we have available and of how much time is acceptable for the tests to complete, we should aim for the tests to cover as much of the functionality of the pipeline as possible. How can we prioritise tests to achieve the best possible coverage with limited resources?

For traditional software, the answer is to measure code coverage (Myers, Badgett, and Sandler 2012): the proportion of the code executed by the tests. The goal is to make sure that as many functions, conditional branches and code paths as possible are executed so that it is difficult for bugs to remain undetected. Implicitly, what we are saying is that the algorithms and the logic we are implementing in the software are encoded in the code, hence the more code we test, the more we can ensure that the observed behaviour of the software matches our expectations. At the same time, we want tests to overlap as little as possible in terms of what they cover so as to implement as few as possible.

Machine learning software, however, differs from traditional software in that its behaviour is determined by data as much as by code (Section 5.1). Using different data for training, or predicting data points that are markedly different from what the models expect, may very well exercise the same code paths as “typical data” while producing pathological outputs. Hence code coverage is not a useful measure of how much of the functionality of the pipeline is being tested, because code is only part of the story. Sample space, for both inputs and outputs, parameter space and model space coverage are more meaningful indicators. This is not to say that code coverage is useless, but it is orthogonal to measures of coverage built on data, models and parameters. By all means, we should test that the code paths in use work to specification, and remove those that are not in use as dead code.

What does that mean in terms of choosing and prioritising tests? Sample space, parameter space and model space are effectively infinite in size so we cannot fully cover them. We can, however, make sure that we test a good selection of boundary values, typical values and invalid values (Thomas and Hunt 2019). In a very limited way, this is what we did at the end of the refactoring example in Section 6.8.

Boundary values are data points or parameter values that are close either to the boundary of their domain or to a decision boundary. The former are typically corner cases that produce some sort of limit behaviour, like hugely inflated or biased values in prediction or singular models in training. In general, limit behaviour is never desirable because extreme predictions will be wrong in most cases; and because singular models are overfitting the training data and will have a very poor predictive accuracy. The latter are values which make a model’s outputs unstable because a small change in such values will lead to the model producing outputs that lead to a different course of action. This is common in classification models, where we map continuous inputs (the variables in the data) to a discrete output (the class set) by dividing the input space in regions separated by hard thresholds. If one or more variables take values close to the boundary for a data point, a small change in their values will make the model choose different classes for practically identical data points.
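A decision boundary of this kind is easy to probe in a test. The sketch below uses a hypothetical classifier that thresholds a score at 0.5 and checks whether a small perturbation of the input can flip the predicted class; the function names, the threshold and the perturbation size are all assumptions.

```python
def classify(score, threshold=0.5):
    """Map a continuous score to a class using a hard threshold."""
    return "positive" if score >= threshold else "negative"


def is_unstable(score, epsilon, threshold=0.5):
    """True if a perturbation of size epsilon can flip the predicted class."""
    below = classify(score - epsilon, threshold)
    above = classify(score + epsilon, threshold)
    return below != above


# Boundary values sit within epsilon of the threshold; typical values do not.
candidates = [0.10, 0.495, 0.50, 0.505, 0.90]
boundary_values = [s for s in candidates if is_unstable(s, epsilon=0.01)]
print(boundary_values)
```

Inputs flagged in this way are good candidates for dedicated tests, since they are the ones for which practically identical data points may lead to different courses of action.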

Typical values are data points or parameter values that the model should handle well, without displaying any kind of pathological behaviour. They are mainly useful to implement property-based tests verifying that the theoretical properties of the model hold in its software implementation. Ideally, we would like to cover the space of typical values with a grid such that each point in the grid is sufficiently different from its neighbours and that all regions in the space are tested. This would ensure little or no duplication in the tests while ensuring coverage of the sample space (in the case of data points) or of the parameter space (in the case of parameter values). We can choose grid points either deterministically (a regular grid) or stochastically (by sampling them at random); the latter may be easier to implement if the space of typical values is high-dimensional or if we are making assumptions on the distribution of the typical values (say, prior distributions for the parameters). A practical example of this approach is the TensorFuzz debugging library for neural networks (Odena et al. 2019). TensorFuzz implements coverage-guided fuzzing: it samples possible inputs to a neural network from a corpus of test data, creates new inputs by changing them using a set of possible transformations, and checks which neurons are activated by the transformed inputs. If the transformed inputs result in a pattern of activations that is too similar to that of one of the inputs already in the corpus, as established by an auxiliary nearest-neighbour model (Hastie, Tibshirani, and Friedman 2009), then they are discarded because they are deemed not to increase coverage. If, on the other hand, the pattern of activations is sufficiently different from those we have already observed, the transformed inputs are added to the corpus. 
Therefore, TensorFuzz gradually builds a corpus of inputs that contains data points with typical values for all variables and that puts the neural network in a variety of states, increasing the likelihood of finding instances of misbehaviour that would not be caught by the original test data.
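The loop described above can be sketched with a toy example. The code below is not TensorFuzz and does not use its API: the "network" is a hypothetical set of three ReLU units with fixed weights, the only transformation is Gaussian noise, and the nearest-neighbour check is a brute-force Euclidean distance, all of which are simplifying assumptions.

```python
import math
import random


def activations(x):
    """Hypothetical stand-in for the hidden-layer activations of a network."""
    weights = [(1.0, -1.0), (0.5, 2.0), (-1.5, 0.5)]
    return tuple(max(0.0, w0 * x[0] + w1 * x[1]) for w0, w1 in weights)  # ReLU


def mutate(x, rng, scale=0.5):
    """Perturb an input with Gaussian noise (one of many possible transformations)."""
    return tuple(v + rng.gauss(0.0, scale) for v in x)


def fuzz(seed_inputs, steps, min_distance, rng):
    """Grow a corpus of inputs whose activation patterns differ from each other."""
    corpus = list(seed_inputs)
    seen = [activations(x) for x in corpus]
    for _ in range(steps):
        candidate = mutate(rng.choice(corpus), rng)
        pattern = activations(candidate)
        # Nearest-neighbour check: keep the candidate only if its activation
        # pattern is far enough from every pattern already in the corpus.
        if min(math.dist(pattern, old) for old in seen) > min_distance:
            corpus.append(candidate)
            seen.append(pattern)
    return corpus


rng = random.Random(7)
corpus = fuzz([(0.0, 0.0)], steps=200, min_distance=0.5, rng=rng)
print(len(corpus))
```

Each input in the resulting corpus can then be fed to the property-based tests of the model, in the hope that at least one of the states it induces exposes misbehaviour.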

Finally, invalid values lie beyond the boundaries of the acceptable inputs or outputs of a model. If valid values are limited to an interval, that means any values outside of that interval. Values that are of the wrong type (say, a character string when a real number is expected) and special values like NaN, +Inf or -Inf (Section 3.1) should also be considered. NA may or may not be invalid depending on the context: it is certainly desirable for machine learning models to be able to handle missing data, and if they are able to do so, NA should be treated as a boundary value. Otherwise, we should ensure that the output is NA if any input is NA, that is, that we are propagating missing values correctly; or the model should fail with an error. In general, we test invalid values to verify that model performance degrades gracefully and to make sure errors are generated when no meaningful output can be produced.
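A minimal sketch of tests for invalid values follows. The scoring function, its valid interval and its coefficients are hypothetical; Python has no NA, so `None` is used as a stand-in for a missing value, which is an assumption of the example.

```python
import math


def predict_risk(age, weight=0.02, intercept=0.1):
    """Hypothetical scoring function for a single numeric input."""
    if age is None:
        return None                      # propagate missing values
    if isinstance(age, bool) or not isinstance(age, (int, float)):
        raise TypeError("age must be a real number")
    if math.isnan(age) or math.isinf(age):
        raise ValueError("age must be finite")
    if not 0 <= age <= 120:
        raise ValueError("age outside the valid interval [0, 120]")
    return intercept + weight * age


def is_rejected(value):
    """True if the function fails with an error instead of returning an output."""
    try:
        predict_risk(value)
    except (TypeError, ValueError):
        return True
    return False


# Invalid values: wrong type, special values, values outside the interval.
invalid = ["42", float("nan"), float("inf"), float("-inf"), -1, 121]
assert all(is_rejected(value) for value in invalid)
# Missing values are propagated rather than rejected.
assert predict_risk(None) is None
# Boundary values of the valid interval are still accepted.
assert not is_rejected(0) and not is_rejected(120)
```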

Testing a good selection of boundary, typical and invalid values will provide insights on the behaviour of our machine learning software. Testing both typical values and corner or invalid values, we can ensure that models are robust and display the expected theoretical properties. Testing pairs of values for data and parameters (in addition to individual values in isolation) increases the probability of finding bugs from 67% to 93% (D. R. Kuhn, Kacker, and Lei 2013); testing higher-order combinations produces quickly-diminishing returns and may not be worth the effort in applications that are not life-critical. As a side effect, we can also achieve some degree of code coverage: if different code paths map to different regions of the sample and parameter spaces, testing both well will execute many code paths.
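Testing pairs of values does not require testing every combination of values. The sketch below checks whether a test suite covers all value pairs for every pair of parameters, using three hypothetical binary parameters: four well-chosen test cases cover the same pairs as the eight cases of the full factorial design. The function name and the parameter domains are assumptions.

```python
from itertools import combinations, product


def uncovered_pairs(domains, suite):
    """Value pairs (over every pair of parameters) that no test case exercises."""
    missing = set()
    for i, j in combinations(range(len(domains)), 2):
        needed = set(product(domains[i], domains[j]))
        seen = {(case[i], case[j]) for case in suite}
        missing |= {(i, j, a, b) for a, b in needed - seen}
    return missing


domains = [(0, 1), (0, 1), (0, 1)]
full_factorial = list(product(*domains))                         # 8 test cases
pairwise_suite = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]    # 4 test cases

assert not uncovered_pairs(domains, full_factorial)
assert not uncovered_pairs(domains, pairwise_suite)
print(len(full_factorial), len(pairwise_suite))
```

The saving grows quickly with the number of parameters and of values per parameter, which is what makes pairwise testing practical when the full factorial design is not.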


Beam, A. L., A. K. Manrai, and M. Ghassemi. 2020. “Challenges to the Reproducibility of Machine Learning Models in Health Care.” Journal of the American Medical Association 323 (4): 305–6.

Beck, K. 2002. Test-Driven Development by Example. Addison-Wesley.

Bishop, C. M. 1995. “Training with Noise is Equivalent to Tikhonov Regularization.” Neural Computation 7 (1): 108–16.

Blei, D. M., A. Kucukelbir, and J. D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77.

Braiek, H. B., and F. Khomh. 2020. “On Testing Machine Learning Programs.” Journal of Systems and Software 164: 110542.

Breck, E., S. Cai, E. Nielsen, M. Salib, and D. Sculley. 2017. “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” In IEEE International Conference on Big Data, 1123–32.

Breiman, L. 1996. Out-of-Bag Estimation.

Breiman, L. 2001a. “Random Forests.” Machine Learning 45 (1): 5–32.

Cohen, A. Gokaslan V., E. Pavlick, and S. Tellex. 2019. OpenGPT-2: We Replicated GPT-2 Because You Can Too.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 4171–86.

Fernandez, A., S. Garcia, F. Herrera, and N. V. Chawla. 2018. “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary.” Journal of Artificial Intelligence Research 61: 863–905.

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems (NIPS), 2672–80.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

Kanagawa, M., P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. 2018. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences.

Katal, A., M. Wazid, and R. H. Goudar. 2013. “Big Data: Issues, Challenges, Tools and Good Practices.” In Proceedings of the International Conference on Contemporary Computing, 404–9.

Kenett, R. S., and T. C. Redman. 2019. The Real Work of Data Science. Wiley.

Kuhn, D. R., R. N. Kacker, and Y. Lei. 2013. Introduction to Combinatorial Testing. CRC Press.

Kuhn, M., and K. Johnson. 2013. Applied Predictive Modeling. Springer.

Li, J., X. Chen, E. Hovy, and D. Jurafsky. 2016. “Visualizing and Understanding Neural Models in NLP.” In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 681–91. Association for Computational Linguistics.

Lundberg, S. M., and S.-I. Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems (NIPS), 4765–74.

McConnell, S. 2004. Code Complete. 2nd ed. Microsoft Press.

Montgomery, D. C. 20AD. Design and Analysis of Experiments. 10th ed. Wiley.

Myers, G. J., T. Badgett, and C. Sandler. 2012. The Art of Software Testing. 3rd ed. Wiley.

Natekin, A., and A. Knoll. 2013. “Gradient Boosting Machines, a Tutorial.” Frontiers in Neurorobotics 7 (21): 1–21.

Nvidia. 2021. CUDA Toolkit Documentation.

Odena, A., C. Olsson, D. Andersen, and I. Goodfellow. 2019. “TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing.” Proceedings of Machine Learning Research (ICML 2018) 97: 4901–11.

Ousterhout, J. 2018. A Philosophy of Software Design. Yaknyam Press.

Popejoy, A. B., and S. M. Fullerton. 2016. “Genomics Is Failing on Diversity.” Nature 538: 161–64.

Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2019. Language Models Are Unsupervised Multitask Learners.

Ribeiro, M. T., S. Singh, and C. Guestrin. 2016. “Why Should I Trust You? Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.

Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young. 2014. “Machine Learning: The High Interest Credit Card of Technical Debt.” In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. 2015. “Hidden Technical Debt in Machine Learning Systems.” In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2:2503–11.

Scutari, M., and J.-B. Denis. 2021. Bayesian Networks with Examples in R. 2nd ed. Chapman & Hall.

Simonyan, K., A. Vedaldi, and A. Zisserman. 2014. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Workshop Track.

Tabuchi, A., A. Kasagi, M. Yamazaki, T. Honda, M. Miwa, T. Shiraishi, M. Kosaki, et al. 2019. “Extremely Accelerated Deep Learning: ResNet-50 Training in 70.4 Seconds.”

Thomas, D., and A. Hunt. 2019. The Pragmatic Programmer: Your Journey to Mastery. 20th Anniversary ed. Addison-Wesley.

Yang, Z., Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. 2019. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” In Advances in Neural Information Processing Systems (NeurIPS), 5753–63.

Zellers, R., A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi. 2019. “Defending against Neural Fake News.” In Advances in Neural Information Processing Systems (NeurIPS), 9054–65.

Zhang, J. M., M. Harman, L. Ma, and Y. Liu. 2020. “Machine Learning Testing: Survey, Landscapes and Horizons.” IEEE Transactions on Software Engineering 48 (1): 1–36.

Zheng, A. 2015. Evaluating Machine Learning Models. O’Reilly.

  1. That is, a feature that is built on a variable with a (direct) causal effect on the target variable of interest.↩︎

  2. Data leakage arises when information from outside the training data set is used to learn a model, typically because one or more variables carry the same information as the prediction target but in different form. Such variables are sometimes called aliases in the context of linear regression models.↩︎

  3. A test oracle is a mechanism for determining whether a test has passed or failed; it has no relationship with oracle properties from the statistics literature.↩︎

  4. There may be other equivalence classes beyond those we can identify in this way: domain knowledge about the data may help in identifying them.↩︎

  5. This is not as severe a limitation as it may seem. Non-linear models are mathematically harder to work with, so most have linear formulations that operate on transformed inputs to encode non-linear relationships. A common example is kernel-based methods (Kanagawa et al. 2018).↩︎