Chapter 7 Packaging and Deploying Pipelines

Packaging machine learning models into artefacts is an important step in making pipelines reproducible. It also makes models easier to deploy, that is, to bring them into production (or other target) systems and to put them to use. Choosing the right combination of packaging formats and deployment strategies ensures that we can build on CI/CD solutions (Duvall, Matyas, and Glover 2007) to do so efficiently and effectively. Our ultimate goal is to ship a pipeline with confidence because we have designed (Chapter 5), implemented (Chapter 6), documented (Chapter 8) and tested (Chapter 9) it well.

Models are part of a machine learning pipeline as much as code is, and are packaged (Section 7.1) and deployed (Sections 7.2 and 7.3) in similar ways to traditional software. However, their behaviour is less predictable (Sections 5.2 and 9.2): we should monitor them when they are deployed and when they are running in production (Section 7.4). We should also have contingency plans for when they fail (Section 7.5) so that we can restore the pipeline to a functional state (Section 7.6).

7.1 Model Packaging

Models can be stored in different types of artefacts, as we briefly discussed in Section 5.3.4. There are several ways in which model artefacts can be integrated into a pipeline, with varying degrees of abstraction from the underlying machine learning systems.

7.1.1 Standalone Packaging

The most minimalist form of packaging is simply the artefact produced by the machine learning framework that we used to train the model: for instance, a SavedModel file from TensorFlow (TensorFlow 2021a) or an ONNX (ONNX 2021) file. Such files are easy to make available to third parties and convenient to embed in a library or in a (desktop or mobile) application with frameworks like Apple Core ML (Apple 2022). They can also be shipped as standalone packages via a generic artefact registry such as those offered by GitHub (GitHub 2022c), GitLab (GitLab 2022b) or Nexus (Sonatype 2022). Tracking the version of the trained model, its parameters, its configurations and its dependencies is delegated to the configuration management platform supporting the pipeline (Section 5.3.5).
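For illustration, here is a minimal sketch of producing such a standalone artefact, assuming a trained Keras model; the model architecture and paths below are placeholders.

import tensorflow as tf

# A placeholder model standing in for the one produced by our training code.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

# tf.saved_model.save() writes the graph, weights and signatures into a
# self-contained directory that third parties can load without our code.
tf.saved_model.save(model, "artefacts/my_model/1")

# The artefact can later be reloaded, for instance inside a serving container.
reloaded = tf.saved_model.load("artefacts/my_model/1")

The resulting directory can be compressed and uploaded to an artefact registry as a single versioned file.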

7.1.2 Programming Language Package Managers

Python has become the most popular programming language in machine learning applications because of the availability of mature and versatile frameworks such as TensorFlow (TensorFlow 2021a) and PyTorch (Paszke et al. 2019) (Section 6.1). As a result, it is increasingly common to ship models as Python packages to simplify the deployment process, and to make the model depend on a specific version of the Python interpreter and of those frameworks. Doing so throughout the pipeline helps avoid the technical debt arising from polyglot programming (Section 5.2.4). In practice, this involves distributing packages, modules and resource files in the standard Python format (known as a “distribution package”), using tools like Setuptools (Python Packaging Authority 2022) to build them and pip to install them, and possibly uploading them to the central Python Package Index (Python Software Foundation 2022a) to make them easily accessible.
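As a minimal sketch of this approach, the setup.py below packages a hypothetical model artefact together with pinned dependencies; the package name, file paths and version pins are illustrative.

from setuptools import setup, find_packages

setup(
    name="churn-model",                      # hypothetical package name
    version="1.2.0",                         # version of the trained model
    packages=find_packages(),
    # Ship the serialised model file alongside the code.
    package_data={"churn_model": ["artefacts/model.onnx"]},
    include_package_data=True,
    # Pin the exact framework versions the model was trained and tested with.
    install_requires=[
        "onnxruntime==1.15.1",
        "numpy==1.24.4",
    ],
    python_requires=">=3.10,<3.11",
)

The package can then be installed with pip (for instance, “pip install .”) or built and uploaded to a package index with the standard packaging tools.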

7.1.3 Virtual Machines

Figure 7.1: Type-1 and type-2 hypervisor virtualisation architectures.

All modern CPUs (Section 2.1.1) implement instruction sets to support hardware virtualisation: for instance, Intel CPUs have Virtualisation Technology (VT-x) and AMD CPUs have AMD-V. This has made virtual machines (VMs, also known as “guest operating systems”) a convenient choice on local hardware and has resulted in the wide availability of cloud instances. VMs run on top of a hypervisor, a specialised piece of software that allows multiple guest systems to share a single compute system (the host hardware). A VM is like a normal compute system: the main difference is that its CPU, memory, storage and network interfaces are shared with the underlying hardware through the hypervisor, which allocates them to the guests as needed. vSphere (VmWare 2022), KVM (Open Virtualization Alliance 2022) and Hyper-V (Microsoft 2022h) are some examples of type-1 hypervisors (Figure 7.1, left panel): they run directly on the host hardware, either as standalone pieces of software or integrated in the host operating system. Type-2 hypervisors (Figure 7.1, right panel) like VirtualBox (Oracle 2022) and VMware Workstation (VMware 2022), on the other hand, run on top of the host operating system. Both types are limited to executing applications compiled for the same type of CPU they are running on.

Thanks to hardware virtualisation, VMs can run on the host CPU and can access the host’s hardware resources with limited overhead via PCIe pass-through (GPUs are a typical example, see Section 2.1.1). Overhead can be further reduced by moving from (complete) virtualisation to paravirtualisation, which trades off complete isolation of the guests for better throughput and latency. The guest operating system is then aware of running in a virtualised environment, and it can use a special set of system calls (hypercalls) and I/O drivers (especially for storage and networking) to communicate directly with the hypervisor.

VMs are the second type of artefact we mentioned in Section 5.3.4. We can either create them from scratch, installing and configuring the operating system and all the libraries we need, or we can start from pre-baked images that come configured with most of the software we need. For the former, we have tools like HashiCorp Packer (HashiCorp 2022a) or Vagrant (HashiCorp 2022d), which can install the operating system, and configuration management software like Ansible (Ansible Project 2022), which can install the models as well as the software stack they depend on. As for the latter, a vast selection of pre-baked images is available from cloud providers: an example is the catalogue of Amazon Machine Images (AMIs) (Amazon Web Services 2022b). VM configurations and images are typically stored in a standardised open format such as the Open Virtualisation Format (OVF) (DMTF 2022). Finally, VMs can be managed automatically by the orchestrator of the machine learning pipeline through the hypervisor and the associated software tools, which can create, clone, snapshot, start and stop individual VMs.

VMs offer three main advantages:

  • They are flexible to operate: we can run multiple instances of different operating systems and of different software stacks on the same host, consolidating their configurations using pre-baked images and managing them centrally as individual entities.
  • They can also be easily scaled to deal with peak loads, either by starting new ones (horizontal scalability) or by increasing the hardware resources they have access to (vertical scalability, Section 2.4).
  • They can be moved to another host (portability) and are easy to snapshot, facilitating disaster recovery in the case of hardware failure.

However, VMs have one important disadvantage: they contain an entire operating system and therefore require large amounts of hot storage. As a result, the deployment time of a VM can range from tens of seconds (in the best case) to minutes (in the average case) (Hao, Jiang, and Kim 2021), depending on the cloud provider or the on-premises hypervisor configuration.

7.1.4 Containers

In contrast, containers are more lightweight (Espe et al. 2020) because they only virtualise the libraries and the applications running on top of the operating system, not an entire machine learning system (Figure 7.2). Instead of a hypervisor, they are managed by a container runtime (sometimes called a “container engine”) like Docker (Docker 2022a) which controls the access to the hardware and to the operating system of the host.

Container runtimes are typically built on top of a set of Linux kernel capabilities (Rice 2020):

  • Namespaces: an isolation layer that allows each process to see and access only those processes, directories and system resources of the host that are bound to the same namespace it is running in.
  • Cgroups (control groups): a resource management layer that sets and limits CPU, memory and network bandwidth for a collection of processes.
  • Seccomp (secure computing): a security layer that limits a container to a restricted subset of system calls (the kernel’s APIs).

Figure 7.2: Virtualisation and containers high-level architectures.

As was the case with VMs, containers can package machine learning applications with all the associated libraries, dependencies and tools in a single self-contained artefact: a container image, which is immutable, stateless and ephemeral by design.16 In the case of Docker, we commonly refer to it as a Docker image. Container images are created from declarative configuration files, known as Dockerfiles, that define all the necessary commands. Each command produces an immutable layer reflecting the changes that the command itself introduces into the image, allowing for incremental changes and minimising disk space usage. The starting point of this process is a base image that provides a stripped-down environment (not a complete operating system, as was the case for pre-baked VM images) to which we can add our models and the libraries, tools and applications that complement them.

Below is an example of a Dockerfile that creates an image for a RESTful application built with FastAPI (a framework for creating web services and APIs). For reproducibility, both the Dockerfile and the requirements.txt file it references should be stored under version control in a configuration management platform (Section 6.4).

FROM python:3.10.6-bullseye

WORKDIR /app

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "uvicorn", "main:app", "--host=0.0.0.0"]

Firstly, the Dockerfile explicitly identifies the system dependencies of the image it generates. The first line, “FROM python:3.10.6-bullseye”, identifies a base image with the stable release of Debian GNU/Linux, codenamed “Bullseye”, and version 3.10.6 of the Python interpreter. Secondly, it identifies the Python packages we depend on. The third and fourth lines, “COPY requirements.txt .” and “RUN pip3 install --no-cache-dir -r requirements.txt”, copy the file requirements.txt, which lists the Python dependencies, into the image and use the Python package manager (pip) to install them. It is important that all dependencies are listed and pinned to the exact versions we have tested, to avoid accruing technical debt (Sections 5.2.4 and 6.3). If we upgrade one or more dependencies, the corresponding container layer is invalidated. Docker caches layers as they are created: those that have not been affected by our changes will be taken from that cache instead of being re-created from scratch. The second line (“WORKDIR /app”) changes the working directory to that containing the application files, the fifth line (“COPY . .”) copies them into the container image, and the last line defines the command that is run when the container is started.

After a successful build, we can store container images in a container registry such as Docker Registry (Docker 2022b) or Harbor (Harbor 2022). Container registries are server applications that provide a standardised API for uploading (push), versioning (tag) and downloading (pull) container images. The registry is organised into repositories (like Git (The Git Development Team 2022)), where each repository holds all the versions of a specific container image. Container runtime, registry and image specifications are based on the Open Container Initiative (OCI) (Open Container Initiative 2022), an open standard by the Linux Foundation, and are therefore highly portable across platforms and vendors.
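The push, tag and pull workflow can be automated from the pipeline itself; below is a sketch using the Docker SDK for Python, where the registry address, repository name and credentials are placeholders.

import docker

client = docker.from_env()

# Build the image from the Dockerfile in the current directory and tag it
# with the registry address, repository and version.
image, build_logs = client.images.build(
    path=".",
    tag="registry.example.local/ml/fastapi-model:1.2.0",
)

# Authenticate against the registry and push the tagged image.
client.login(username="ci-bot", password="********",
             registry="registry.example.local")
for line in client.images.push("registry.example.local/ml/fastapi-model",
                               tag="1.2.0", stream=True, decode=True):
    print(line)  # push progress, useful in CI logs

# On a deployment target, the same versioned image is retrieved with a pull.
client.images.pull("registry.example.local/ml/fastapi-model", tag="1.2.0")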

Like any other software artefact, container images may have security vulnerabilities (Bhupinder et al. 2021) inherited from vulnerable libraries in an outdated base image, rogue images in an untrusted container registry or a vulnerable Dockerfile. To identify these vulnerabilities, we should enforce compliance and security checks to validate both the Dockerfiles, with tools such as Hadolint (The Hadolint Project 2022), and the resulting images, with static analysis and image scanner tools such as Trivy (Aquasecurity 2022). Cloud providers such as Amazon AWS (Services 2022) and Google Cloud (Google 2022b) have public container registries with secure and tested base images ranging from vanilla operating system installations to pre-configured machine learning stacks built on TensorFlow (TensorFlow 2021a) and PyTorch (Paszke et al. 2019).

Container runtimes integrate with orchestrators to allow for the seamless use of container images. The orchestrator is responsible for managing a fleet of containers in terms of deployment, scaling, networking and security policies. The containers are responsible for providing different pieces of functionality as modular and decoupled services that communicate over the network, that can be deployed independently and that are highly observable. This is, in essence, the microservices architecture (Newman 2021). In addition, container runtimes integrate with CI to enable reproducible software testing: base container images provide a clean environment that ensures that test results are not tainted by external factors (Sections 9.4 and 10.3).

Kubernetes (The Kubernetes Authors 2022a) is the de facto standard among orchestrators.17 Orchestrators specialising in machine learning pipelines integrate Kubernetes with experiment tracking and model serving to provide complete MLOps solutions: two examples are Kubeflow (The Kubeflow Authors 2022), which is more integrated, and MLflow (Zaharia and The Linux Foundation 2022), which is more programmatic. Container runtimes enhance these orchestrators by implementing GPU pass-through from the physical host to the container (with the “--gpus” flag, in the case of Docker). Kubernetes can use this functionality to apply the appropriate label selector (The Kubernetes Authors 2022b) to each container and to schedule training and inference workloads on machine learning systems with the appropriate hardware (Section 2.1.1).
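As an illustration of GPU scheduling, the sketch below uses the Kubernetes Python client to describe a pod that requests one GPU and is scheduled only onto labelled GPU nodes; the node label, image name and namespace are placeholders.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Label selector: only nodes carrying this label are eligible.
        node_selector={"accelerator": "nvidia-gpu"},
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.local/ml/trainer:1.2.0",
                # Request one GPU through the device plugin resource name.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-pipeline", body=pod)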

7.2 Model Deployment: Strategies

A deployment strategy or deployment pattern is a technique to replace or upgrade an artefact or a service in a production environment while minimising downtime and impact on users. Here we will focus on how we can deploy machine learning models (Section 5.3.5) without impacting their consumers, that is, the final users and the modules in the pipeline that depend on the models’ outputs. Clearly, there are similarities to how traditional software is deployed: we want automated and reproducible releases via CI/CD, in most cases using containers as artefacts (Section 7.1.4). Furthermore, parts of a machine learning pipeline are in fact traditional software and are deployed as such.

Model deployment can take advantage of modern software deployment strategies from progressive delivery. A pipeline will usually contain multiple instances of each model (say, version A) to be able to process multiple inference requests and data preparation queues in parallel. Therefore, we can initially replace a small subset of these instances with a new model (say, version B). If no issues emerge, we then gradually replace the remaining instances: the new model has effectively passed acceptance testing (Section 9.4.4). If any issues do arise, our logging and monitoring facilities (Section 5.3.6) will have recorded the information we need to troubleshoot them. We can also deploy multiple models at the same time to compare their performance in terms of accuracy, throughput and latency. As a result, progressive delivery speeds up model deployment (by reducing the amount of pre-deployment testing), decreases deployment risk (because most consumers will not be impacted by any issues that may emerge in the initial deployment) and makes rollbacks easier (Section 7.6).

Figure 7.3: Blue-green (left), canary and A/B testing (top right) and shadow (bottom right) deployment strategies.

We can implement progressive delivery with a number of related deployment strategies (Tremel 2017), illustrated with a small traffic-routing sketch after this list:

  • The blue-green deployment pattern (Humble and Farley 2011) assumes that we are using a router (typically a load balancer) to spread requests over a pool of instances that serve the version A of a machine learning model (Figure 7.3, left). When we deploy a new version B of the model, we create a second pool of instances that serves it and send a subset of the new incoming requests to this new pool. If no issues arise, the router will then gradually send more and more requests to the pool that serves model B instead of that serving model A. Existing requests being processed by model A are allowed to complete to avoid disruptions. The pool serving model A will eventually not be assigned any more requests and may then be decommissioned. If any issues arise, rollback is simple: we can send all requests to the pool serving model A again. Keeping the two pools in separate environments or even separate machine learning systems will further reduce deployment risk.
  • We already mentioned the canary deployment pattern (Humble and Farley 2011) in Section 5.3.4: the main difference with the blue-green pattern is that we deploy instances with model B in the same pool that is already serving model A (Figure 7.3, top right). The router will redirect a small number of requests to the instances with model B, taking care of session affinity.18 Other requests act as our control group: we can inspect and compare the performance of the two models without any bias because they run in the same environment. Again, if no issues arise we can gradually retire the instances with model A. Canary deployments are typically slower than other deployment patterns because collecting enough data on the performance of model B with a small number of instances requires time. However, they provide an easy way to test new models in production with real data and in the same environment as existing models.
  • In a shadow deployment (Microsoft 2022g), a new model B is deployed in parallel to model A and each request is sent to both models (Figure 7.3, bottom right). We can compare their accuracy using the outputs they produce from the same input, as well as their latency and any other metric we collect through logging and monitoring. In fact, we can deploy several models in parallel to test different approaches and keep only the model that performs best. Shadow deployment therefore requires us to set up a different API endpoint for each model we are testing, and to allocate enough hardware resources to handle the increased inference workload. However, it allows for testing new models without disturbing operations.
  • In the rolling or ramped deployment pattern, we simply replace the instances with model A in batches on a pre-determined schedule until all the running instances are serving model B. Rolling deployments are easy both to schedule and to roll back.
  • Another deployment pattern we mentioned elsewhere (Sections 5.3.4 and 9.4.3) is A/B testing (Amazon 2021; Zheng 2015): the router randomly splits the requests 50%-50% across two models A and B, we evaluate the relevant metrics for each model, and we promote model B if and only if it outperforms model A. The key difference from canary deployments is that in the latter only a small proportion of the requests is sent to instances with model B to reduce deployment risk: the split is 90%-10% or at most 80%-20% (Figure 7.3, top right).
  • Destroy and re-create is the most basic deployment strategy: we stop all the instances with model A and we create from scratch a new set of instances with model B to deploy in their place. As a result, the pipeline will be unavailable and consumers that are performing multiple requests in a sequence may receive inconsistent outputs.
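The sketch below illustrates the routing logic shared by the canary and A/B testing patterns: a router sends a configurable fraction of the traffic to model B while preserving session affinity. The predict functions and the 10% split are placeholders.

import hashlib

def predict_with_model_a(features):
    return {"model": "A", "prediction": 0.42}   # stand-in for model A

def predict_with_model_b(features):
    return {"model": "B", "prediction": 0.40}   # stand-in for model B

def _bucket(consumer_id: str) -> int:
    # Deterministic bucket in [0, 100) so that the same consumer is always
    # routed to the same model (session affinity).
    digest = hashlib.sha256(consumer_id.encode()).hexdigest()
    return int(digest, 16) % 100

class WeightedRouter:
    def __init__(self, share_b=10):
        # share_b is the percentage of traffic sent to the new model B:
        # 10 for a typical canary deployment, 50 for an A/B test.
        self.share_b = share_b

    def route(self, consumer_id, features):
        if _bucket(consumer_id) < self.share_b:
            return predict_with_model_b(features)
        return predict_with_model_a(features)

router = WeightedRouter(share_b=10)
print(router.route("user-123", {"tenure": 12}))

In a real pipeline this logic lives in the load balancer, service mesh or model-serving layer rather than in application code, but the traffic split and affinity rules are the same.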

We can complement these deployment patterns by adding feature flags (Section 6.5) to our models: models A and B can then share large portions of code. In this way, we can easily create new models just by switching different combinations of flags, without building and deploying new artefacts at all. However, both models will be served at the same time during the progressive delivery process: either all consumers should support both their APIs or model B should be fully backward compatible with model A.

7.3 Model Deployment: Infrastructure

In a machine learning pipeline, model deployment is the part of pipeline orchestration that enables models to be deployed and served in the development, testing and production environments (Section 5.3.5). Ideally, it should be completely automated via CI/CD to avoid catastrophic failures like that at Knight Capital (Seven 2014), which we touched on in Section 5.2.3.

The nature of the continuous deployment part of CI/CD can vary depending on the type of artefact (Section 7.1) and on the type of compute systems (Section 2.4) we are deploying to. Our artefacts may be container images that wrap and serve our models through APIs: we can deploy them locally by invoking Docker manually, or remotely through an automated script stored in the pipeline’s CI/CD configuration that instructs Kubernetes. In both cases, the image is fetched from the registry at deployment time if it is not available locally. Our artefacts may also be VMs: continuous deployment can then leverage configuration management tools like Ansible (Ansible Project 2022) to deploy and upgrade them. In both these cases, the CI/CD pipeline standardises the deployment process, hiding the differences between local and cloud environments (Section 2.3) and shifting complexity from glue code to declarative configuration files (Section 5.2.3). Deployment has been standardised to the point where targeting orchestration platforms like Kubernetes (The Kubernetes Authors 2022a) and commercial offerings like Amazon AWS ECS is largely the same.

We may also run machine learning pipelines on top of an integrated MLOps platform: model deployment then depends entirely on the platform’s opinionated workflows. For example, an MLOps platform like Databricks (Databricks 2022) integrates many open-source components through MLflow (Zaharia and The Linux Foundation 2022) and wraps them with APIs that support multiple deployment targets. These APIs present a standardised interface similar to that of Docker and Kubernetes regardless of what target we choose. Machine learning platforms from cloud vendors (“Machine Learning as a Service”) like Azure ML (Microsoft 2022c) or Amazon AWS SageMaker (Amazon 2022d) provide a much higher level of abstraction. On the one hand, they give us little control over how the pipeline is implemented and how models are deployed. On the other hand, they are accessible for teams that do not have the skills or the budget to manage their own CI/CD, monitoring and logging infrastructure. They also provide an experiment tracking web interface (with an API to use it programmatically) to test new models and to visualise them along with their parameters and performance metrics.
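As an illustration of programmatic experiment tracking, the sketch below logs the parameters and metrics of a candidate model to an MLflow tracking server; the tracking URI, experiment name and values are placeholders.

import mlflow

mlflow.set_tracking_uri("http://mlflow.example.local:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="candidate-b"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("validation_auc", 0.87)
    mlflow.log_metric("p95_latency_ms", 42.0)
    # The trained model object could also be logged and registered here,
    # for instance with mlflow.sklearn.log_model(model, "model").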

7.4 Model Deployment: Monitoring and Logging

We should track automated model deployments through all their stages with our logging and monitoring infrastructure to achieve the observability we need to diagnose any issue we may run into (Section 5.3.6). All continuous deployment platforms support this: MLflow (Zaharia and The Linux Foundation 2022) has MLflow Tracking, Airflow (The Apache Software Foundation 2022a) can use Fluentd (The Fluentd Project 2022), and general-purpose CI/CD solutions like GitLab have built-in mechanisms for issuing metrics and log events as well as support for Prometheus (Prometheus Authors and The Linux Foundation 2022). It is essential to log every entry and exit point of every module, as well as any retries and the successful conclusion of all tasks in the pipeline: we should be able to construct descriptive activity reports that include comprehensive stack traces. Machine learning pipelines have many moving parts and can fail in many different places and in ways that are difficult to diagnose even with that much information (Sections 9.1, 9.2 and 9.3). Furthermore, logging should automatically trigger external notification systems like PagerDuty (PagerDuty 2022) so that we become aware of any issues during deployment as early as possible.

After a model is deployed, we should check that it is being served, that it is ready to accept inference requests (readiness) and that it produces correct results (liveness). The software that we use to serve the model may expose a health-check API (like the readiness and liveness probes in Kubernetes (The Kubernetes Authors 2022a)) which the orchestrator can use to route inference requests only to models that can process them. The monitoring client inside the model itself can serve the same purpose by exposing metrics to check that performance has not degraded over time. As we discussed in Section 5.3.6, we should locate the logging and monitoring servers on dedicated systems to make sure that they are not affected by any of the issues caused by or affecting the models, and that they can be used to perform a root cause analysis of what went wrong.
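A sketch of the readiness and liveness checks described above, for the FastAPI application packaged earlier, is shown below; the endpoint paths and model-loading logic are placeholders, and the corresponding probe configuration lives on the orchestrator side.

from fastapi import FastAPI, Response, status

app = FastAPI()
model = None  # loaded lazily at startup

@app.on_event("startup")
def load_model():
    global model
    model = object()  # stand-in for loading the packaged model artefact

@app.get("/healthz/ready")
def readiness(response: Response):
    # Ready only once the model artefact has been loaded and can serve requests.
    if model is None:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "not ready"}
    return {"status": "ready"}

@app.get("/healthz/live")
def liveness():
    # Liveness: the process is up and answering; if this endpoint stops
    # responding, the orchestrator restarts the container.
    return {"status": "alive"}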

7.5 What Can Possibly Go Wrong?

Many kinds of issues can arise when we deploy a new model, for different reasons: lack of control or observability for either the deployment process or its targets (Section 2.4); manually executing pre- or post-deployment operations (Section 5.2.3); or a critical defect in a model or in a module slipping through our software test suite (Section 9.4). We can minimise deployment risk by taking advantage of CI/CD (Chapter 5) and following modern development practices (Chapter 6), but some problems cannot be fully resolved or even detected automatically.

Hardware resources may be unavailable. The environment we are deploying to may be running on machine learning systems that have inadequate resources (say, not enough storage space or memory), hardware faults or network connectivity issues (say, the systems themselves are unreachable, or they cannot access remote third-party resources needed by the model).19 These problems can occur both in local (on-premises) and remote (cloud) environments; in the latter, scheduling a new deployment will typically solve them since the underlying hardware will change (Section 2.3).

Hardware resources may not be accessible. The machine learning systems may be fine, but there are access restrictions in place that prevent us from using them. Firewalls may be preventing us from connecting to them across networks; file permissions may be preventing us from reading data and configurations from their storage. This is a common issue with cloud instances and managed services because their identity and access management (IAM) policies are difficult to write and to understand. In fact, it is often only possible to test the configurations controlling authentication and authorisation to those services interactively, which makes it easy to break them accidentally. As a result, there have been many instances of machine learning engineers removing too many access restrictions and leaving S3 buckets full of personal data publicly accessible on AWS (Twilio (The Register 2020) and Switch (VPNOverview 2022) are two notable examples from recent years). This is clearly undesirable as well, but it can be prevented by writing IAM policies according to the principle of least privilege, by tracking them with configuration management tools (Section 11.1) and by including them in code reviews (Section 6.6) before applying them.

People do not talk to each other. Model deployment is when we actually put to use the models we trained and the code that supports them. Therefore, it is also when defects arising from the lack of communication between domain experts, machine learning experts, software engineers and users may come to light. Scoping and designing the pipeline (Section 5.3), validating machine learning models (Section 5.3.4) and inference outputs (Section 5.3.6), designing and naming modules and their arguments (Section 6.2), code reviews (Section 6.6) and writing various forms of documentation (Chapter 8) should all be collaborative efforts involving all the people working on and using the pipeline. When this collaboration is not effective, different people will be responsible for different parts of the pipeline and the resulting lack of coordination may cause issues at the boundaries of the different areas of responsibility. Machine learning engineers may develop models without consulting the domain experts (“Are the models meaningful? Do we have the right data to train them?”) or the software engineers (“Can the models run on the available systems and produce inference with low enough latency?”). Domain experts may fail to get across their expert knowledge to machine learning engineers (“This model class cannot express some relevant domain facts!”) or to software engineers (“This variable should be coded in a specific way to make sense!”). Software engineers may take liberties in implementing machine learning models that change their statistical properties without the machine learning engineers noticing (“Maybe I can use this other library… or it may be faster to reimplement it myself!”) or structure the code in ways that make it difficult for a domain expert to understand (“What does this theta_hat argument mean again?”). The segregation of roles is an organisational anti-pattern that should be avoided at all costs in favour of the shared responsibility and constant sharing of skills and knowledge originally advocated by DevOps (Humble and Farley 2011).

Missing dependencies. The deployment of a module may fail because one or more of its dependencies (inside or outside the pipeline) is missing or is not functional. For instance, if module A requires the outputs of module B as inputs, we should ensure that module B is present and in a working state before deploying module A. In practice, this requires a coordinated deployment of the two modules, which is an anti-pattern when we strive for modules to be decoupled from each other. We can, of course, also implement appropriate retry policies in module A to make it resilient to module B being temporarily offline. On Kubernetes (The Kubernetes Authors 2022a), we can use liveness and readiness probes (Section 7.4) together with “init containers” (specialised containers that run before app containers in a pod) for this purpose.
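A minimal sketch of such a retry policy with exponential backoff is shown below; the endpoint, timings and payload are placeholders.

import time
import requests

def call_module_b(payload,
                  url="http://module-b.internal/api/v1/transform",
                  retries=5, base_delay=1.0):
    for attempt in range(retries):
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up and let the orchestrator handle the failure
            # Exponential backoff: 1s, 2s, 4s, 8s, ...
            time.sleep(base_delay * 2 ** attempt)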

Incomplete or incorrect configuration management. Configuration management tools (Sections 10.1 and 11.1) promote and automate the reuse of templates, environment variables and configuration files. However, this means that we should be careful to store those that correspond to different environments separately, and to keep them clean and complete at all times. In a complex pipeline with many modules and environments, it is easy to mistakenly use the configuration of a different environment than what we intended. In the best case, what we are trying to do will fail and an exception will be logged. In the worst case, we will apparently succeed in what we are trying to do but the results will be silently wrong because we are accessing different resources than we think we are. For instance, we may inadvertently cause an information leakage by accessing training data instead of validation data. Similar misconfiguration issues may involve any part of the pipeline (training, software testing, inference, etc.) and any of the entities tracked by configuration management (database references, secrets, model parameters, features, etc.).

7.6 Rolling Back

When a model that is deployed in production fails to meet the required performance and quality standards (Section 8.3), we have two choices: we either replace it with a previous model that is still fit for use (rolling back) or with a new model that we train specifically to address the reason why the current model is failing (rolling forward). In the following, we will focus on rollbacks, but our discussion is equally relevant to rolling a model forward.

Model rollbacks are only possible if the model APIs are backward compatible between releases. Then the model can be restored to any previous version at any given moment without disrupting the rest of the pipeline, because we can guarantee that it delivers the same functionality, with the same protocol specifications and the same signature. Achieving backward compatibility requires a significant amount of planning and software engineering effort. In addition to wrapping models in a container that abstracts and standardises their interface, encapsulating their peculiarities and their implementation, we also need an experiment management platform that versions the pipeline modules, the models and the respective configurations. At a minimum, such a setup involves a model registry (Section 5.3.4) and a version control system for the code (Section 6.5).

Sometimes maintaining backward compatibility is simply not possible: if we replace a model with another from a completely different model class, or if the task the model was trained for has changed, the APIs should change to reflect the new model capabilities and purpose. We can transition between the two different sets of APIs by versioning them. For example, the old set of APIs may be available from the URL path https://api.mlmodel.local/v1/ while the new ones may be made available from https://api.mlmodel.local/v2/, and the old APIs may raise a warning to signal that they are deprecated. (OpenAPI supports deprecating API “Operations” (SmartBear Software 2021)). We can then deploy new, incompatible models with the strategies we discussed in Section 7.2, and the pipeline modules will be able to access both sets of APIs at the same time and without any ambiguity about what version they are using. This in turn makes it possible to update individual modules in an orderly transition.
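A sketch of serving two incompatible model APIs side by side under versioned URL paths is shown below, using FastAPI routers; the payloads, outputs and deprecation header are illustrative.

from fastapi import APIRouter, FastAPI, Response

app = FastAPI()
v1 = APIRouter(prefix="/v1")
v2 = APIRouter(prefix="/v2")

@v1.post("/predict", deprecated=True)
def predict_v1(features: dict, response: Response):
    # Old API: kept backward compatible, but flagged as deprecated in the
    # OpenAPI specification so that consumers can migrate at their own pace.
    response.headers["Deprecation"] = "true"
    return {"score": 0.42}  # stand-in for the old model's output

@v2.post("/predict")
def predict_v2(features: dict):
    # New API: different model class and richer output.
    return {"score": 0.40, "confidence_interval": [0.35, 0.45]}

app.include_router(v1)
app.include_router(v2)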

If a model is shipped with a built-in configuration that is versioned along with its APIs, the function that loads it should support the older versions. Similarly, if a model is stateful and needs to access a database to retrieve assets and configurations, the function that accesses these resources should be able to deal with different database schemas. Our ability to perform rollbacks will then depend on our ability to perform database migrations.
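A sketch of a version-aware configuration loader is shown below; the schema versions and fields are illustrative.

def load_model_config(raw: dict) -> dict:
    version = raw.get("schema_version", 1)
    if version == 1:
        # Version 1 stored a single threshold; map it onto the current schema.
        return {"schema_version": 2,
                "thresholds": {"default": raw["threshold"]},
                "features": raw.get("features", [])}
    if version == 2:
        return raw
    raise ValueError(f"Unsupported configuration schema version: {version}")

# Both an old and a new configuration are handled by the same code path,
# so rolling back the model does not break configuration loading.
old_config = {"schema_version": 1, "threshold": 0.5}
new_config = {"schema_version": 2,
              "thresholds": {"default": 0.5}, "features": ["tenure"]}
assert load_model_config(old_config)["thresholds"]["default"] == 0.5
assert load_model_config(new_config)["features"] == ["tenure"]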

Whether rollbacks should be manual (that is, triggered by a human-in-the-loop domain expert) or automatic (that is, triggered by the pipeline orchestrator on the basis of the metrics collected by the monitoring infrastructure) is not a simple decision to make. From a technical perspective, we should evaluate the impact of the deployment strategy we plan to use in terms of how long it will take to return the pipeline to a fully functional state. From a business perspective, domain experts may want more solid evidence before asking for a rollback: they may be fine with an underperforming model while they acquire more data points and develop a better understanding of why the model is no longer accurate. Machine learning experts can help during that time by deploying alternative models with a canary or shadow deployment strategy to investigate their performance and compare it with that of the failing model. The only case in which an automatic rollback is clearly the best option is when the model’s poor performance is not caused by changes in the data or in the inference requests but by issues with the hardware and software infrastructure underlying the pipeline. (For instance, a newly deployed model uses too much memory or becomes unresponsive.) Even in such a case, the decision to roll back should be supported by monitoring and logging evidence (Section 5.3.6).

References

Alquraan, A., H. Takruri, M. Alfatafta, and S. Al-Kiswany. 2018. “An Analysis of Network-Partitioning Failures in Cloud Systems.” In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 51–68.

Amazon. 2021. Dynamic A/B Testing for Machine Learning Models with Amazon SageMaker MLOps Projects. https://aws.amazon.com/blogs/machine-learning/dynamic-a-b-testing-for-machine-learning-models-with-amazon-sagemaker-mlops-projects/.

Amazon. 2022d. Machine Learning: Amazon Sagemaker. https://aws.amazon.com/sagemaker/.

Amazon Web Services. 2022b. Amazon Machine Images (AMI). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html.

Ansible Project. 2022. Ansible Documentation. https://docs.ansible.com/ansible/latest/index.html.

Apple. 2022. TensorFlow 2 Conversion. https://coremltools.readme.io/docs/tensorflow-2.

Aquasecurity. 2022. Trivy Documentation. https://aquasecurity.github.io/trivy/.

Bhupinder, K., M. Dugré, A. Hanna, and T. Glatard. 2021. “An Analysis of Security Vulnerabilities in Container Images for Scientific Data Analysis.” GigaScience 10 (6): giab025.

Databricks. 2022. Databricks Documentation. https://docs.databricks.com/applications/machine-learning/index.html.

DMTF. 2022. Open Virtualization Format. https://www.dmtf.org/standards/ovf.

Docker. 2022a. Docker. https://www.docker.com/.

Docker. 2022b. Docker Registry HTTP API V2 Documentation. https://docs.docker.com/registry/spec/api/.

Duvall, P. M., S. Matyas, and A. Glover. 2007. Continuous Integration: Improving Software Quality and Reducing Risk. Addison-Wesley.

Espe, L., A. Jindal, V. Podolskiy, and M. Gerndt. 2020. “Performance Evaluation of Container Runtimes.” In Proceedings of the 10th International Conference on Cloud Computing and Services Science, 273–81.

GitLab. 2022b. GitLab Container Registry. https://docs.gitlab.com/ee/user/packages/container_registry/.

Google. 2022b. Deep Learning Containers. https://cloud.google.com/deep-learning-containers.

Hao, J., T. Jiang, and K. Kim. 2021. “An Empirical Analysis of VM Startup Times in Public IaaS Clouds: An Extended Report.” In Proceedings of the 14th IEEE International Conference on Cloud Computing, 398–403.

Harbor. 2022. Harbor Documentation. https://goharbor.io/docs/.

HashiCorp. 2022a. Packer Documentation. https://www.packer.io/docs.

HashiCorp. 2022d. Vagrant Documentation. https://www.vagrantup.com/docs.

Humble, J., and D. Farley. 2011. Continuous Delivery. Addison Wesley.

Microsoft. 2022c. Azure Machine Learning. https://azure.microsoft.com/en-us/services/machine-learning/.

Microsoft. 2022h. Virtualization Documentation. https://docs.microsoft.com/en-us/virtualization/.

Newman, S. 2021. Building Microservices: Designing Fine-Grained Systems. O’Reilly.

ONNX. 2021. Open Neural Network Exchange. https://github.com/onnx/onnx.

Open Container Initiative. 2022. Open Container Initiative. https://opencontainers.org/.

Open Virtualization Alliance. 2022. Documents. https://www.linux-kvm.org/page/Documents.

Oracle. 2022. Oracle VM Virtualbox. https://www.virtualbox.org/.

PagerDuty. 2022. PagerDuty: Uptime Is Money. https://www.pagerduty.com/.

Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” In Advances in Neural Information Processing Systems (Nips), 32:8026–37.

Prometheus Authors, and The Linux Foundation. 2022. Prometheus: Monitoring System and Time Series Databases. https://prometheus.io/.

Python Packaging Authority. 2022. Building and Distributing Packages with Setuptools. https://setuptools.pypa.io/en/latest/userguide/index.html.

Python Software Foundation. 2022a. PyPI: The Python Package Index. https://pypi.org/.

Rice, L. 2020. Container Security: Fundamental Technology Concepts that Protect Containerized Applications. O’Reilly.

Services, Amazon Web. 2022. AWS Deep Learning Containers. https://aws.amazon.com/en/machine-learning/containers/.

Seven, D. 2014. Knightmare: A DevOps Cautionary Tale. https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/.

SmartBear Software. 2021. OpenAPI Specification. https://swagger.io/specification/.

Sonatype. 2022. Nexus Repository Manager. https://www.sonatype.com/products/nexus-repository.

TensorFlow. 2021a. TensorFlow. https://www.tensorflow.org/overview/.

The Apache Software Foundation. 2022a. Airflow Documentation. https://airflow.apache.org/docs/.

The Fluentd Project. 2022. Fluentd: Open Source Data Collector. https://www.fluentd.org/.

The Git Development Team. 2022. Git Source Code Mirror. https://github.com/git/git.

The Hadolint Project. 2022. Hadolint: Haskell Dockerfile Linter Documentation. https://github.com/hadolint/hadolint.

The Kubeflow Authors. 2022. All of Kubeflow documentation. https://www.kubeflow.org/docs/.

The Kubernetes Authors. 2022a. Kubernetes. https://kubernetes.io/.

The Kubernetes Authors. 2022b. Kubernetes Documentation: Schedule GPUs. https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/.

The Register. 2020. Twilio: Someone Waltzed into Our Unsecured AWS S3 Silo, Added Dodgy Code to Our JavaScript SDK for Customers. https://www.theregister.com/2020/07/21/twilio_javascript_sdk_code_injection.

VmWare. 2022. VMware vSphere Documentation. https://docs.vmware.com/en/VMware-vSphere/index.html.

VMware. 2022. VMware Workstation Pro. https://www.vmware.com/products/workstation-pro.html.

VPNOverview. 2022. Fintech App Switch Leaks Users’ Transactions, Personal IDs. https://vpnoverview.com/news/fintech-app-switch-leaks-users-transactions-personal-ids.

Wiggins, A. 2017. The Twelve Factor App. https://12factor.net.

Zaharia, M., and The Linux Foundation. 2022. MLflow Documentation. https://www.mlflow.org/docs/latest/index.html.

Zheng, A. 2015. Evaluating Machine Learning Models. O’Reilly.


  16. Containers are ephemeral in the sense that they should be built with the expectation that they may go down at any time. Therefore, they should be easy to (re)create and to destroy, and they should be stateless: any valuable information they contain will be irrevocably lost when they are destroyed. These characteristics make them a key tool in “The Twelve-Factor App” (Wiggins 2017) and other modern software engineering practices.

  17. A group of one or more containers that encapsulates an application is called a “pod” in the Kubernetes documentation.

  18. Each consumer or user is always served the same version of the model. This happens implicitly in the blue-green deployment pattern because each consumer or user is assigned to a pool, and all instances within each pool serve the same model.

  19. Connectivity issues between compute systems, clusters or data centres due to the failure of network devices or network connections are also called “network splits” or “network partitioning” (Alquraan et al. 2018).