Chapter 10 Tools for Developing Pipelines

In Part 2 we discussed how we may adapt software engineering practices to the development of machine learning pipelines. Practices are predicated on the tools we adopt to implement them, which will be the focus of Part 3.

In this chapter, we present an up-to-date selection of tools for data exploration and experiment tracking (Section 10.1), for developing code (Section 10.2), and for building, testing and documenting it (Section 10.3). Together, these tools provide a development environment suitable for creating machine learning pipelines. We will then move to those used to manage pipelines in production and maintain them (Chapter 11). We fully expect that the tools available at the time of this writing will consolidate over time, as has happened to other types of productivity tools in software engineering and systems administration.

10.1 Data Exploration and Experiment Tracking

Many issues in machine learning pipelines can be traced to data that are not sufficiently clean or well-structured and therefore are not suitable for training or inference. Early exploratory analyses, performed together with domain experts, will improve our understanding of the data and can help us improve their quality to the point where we can address the issues we discussed in Sections 5.2.1 and 9.1. The code we write in these explorations is the initial prototype that, after much polishing and refactoring, will become the data ingestion and preparation modules of the pipeline (Section 5.3.3). Using a programmatic approach to data exploration, cleaning and transformation is always preferable because it produces reproducible results and enables code versioning (Section 6.5). The code we produce should be tested using property-based testing (Section 9.4.2) with sample data to check whether it works correctly; the same tests can be reused as quality gates for new data during the pipeline lifetime.

We will suggest different tools to write such code in Section 10.2: notebooks like Jupyter (Project Jupyter 2022), IDEs like RStudio (RStudio 2022a), Python libraries like NumPy (Harris et al. 2020) and Pandas (McKinney 2017), and R packages like dplyr (Wickham, François, et al. 2022), tidyr (Wickham, Girlich, and RStudio 2022) and janitor (Firke et al. 2022) are just a few examples. In addition, most integrated MLOps tools incorporate experiment tracking, so we can save these explorations in a centralised and versioned repository and then compare different approaches and their parameters on the basis of the metrics that we chose (Section 5.3.1) for model evaluation and validation (Section 5.3.4).
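As a minimal sketch of this pattern, the cleaning code below (using Pandas; the function and column names are hypothetical) pairs a prototype cleaning step with simple example-based checks that can later be promoted to property-based tests and reused as quality gates:

```python
import pandas as pd

def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Prototype cleaning step: the kind of code that, after refactoring,
    becomes a data preparation module of the pipeline."""
    out = df.copy()
    out["age"] = pd.to_numeric(out["age"], errors="coerce")  # coerce bad entries to NaN
    out = out.dropna(subset=["age"])                         # drop rows we cannot repair
    out = out[out["age"].between(0, 120)]                    # enforce a plausible range
    return out.drop_duplicates()

def check_quality(df: pd.DataFrame) -> None:
    """Checks reusable as quality gates for new data during the pipeline lifetime."""
    assert df["age"].between(0, 120).all()
    assert not df.duplicated().any()

raw = pd.DataFrame({"age": ["34", "invalid", "200", "34"], "id": [1, 2, 3, 1]})
clean = clean_customers(raw)
check_quality(clean)
```

Because both the cleaning function and its checks are plain code, they can be versioned with Git and exercised automatically on every new batch of data.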

In addition, high-level visual tools to explore and clean the data may be useful to involve domain experts who may not be comfortable with programming. Some examples are:

  • Openrefine (Openrefine 2022): an open-source client-server solution that provides a collaborative web interface for working on data, as well as client libraries to automate tasks using the API exposed by the server.
  • Trifacta (Trifacta 2022): a commercial solution that provides an easy-to-use interface to work on data quality, data transformation and data processing pipelines in general. It is designed for non-technical users and supports deployment on all major cloud providers.
  • Tableau Prep Builder (Tableau Software 2022): offers a straightforward interface to interactively clean, format and visualise data from different sources. It is available both as a local, graphical application and as a web application.
  • Web solutions like Airtable (Formagrid 2022), which provides the functionality of a database combined with the features of a spreadsheet, may also be suitable for working on small datasets in a collaborative manner.

Data exploration is an iterative process: interactively visualising its outputs as they change is essential. We can do that directly from the Python and R code we use to explore the data if we are using either Jupyter or RMarkdown (Xie, Allaire, and Grolemund 2022). As an alternative, we can produce a dedicated interactive dashboard from Python, R or Julia code with a programmatic tool like Dash (Plotly 2022a) or Bokeh (Bokeh 2022), or from Jupyter notebooks with a visual tool like Tableau (Tableau Software 2022). We will discuss more tools in Section 11.3.

If the data are too large for the tools above to handle, we can store them using “big data” frameworks based on Hadoop (The Apache Software Foundation 2022c) like Cloudera (Cloudera 2022). We can then use tools like Apache Pig (The Apache Software Foundation 2022e), Apache Hive (The Apache Software Foundation 2022d), Apache Impala (Apache Software Foundation 2022b) and Apache Spark (The Apache Software Foundation 2022f) to manipulate them and implement our own data ingestion and cleaning; or we can use integrated cloud-based solutions like Snowflake (Snowflake 2022) and Databricks Lakehouse (Databricks 2022). The advantage of these integrated solutions is that they handle all the aspects of data management as well as machine learning applications development and delivery, supporting integration with data engineering, data science and machine learning open-source projects.

Databricks, for instance, includes many open-source components. One is Delta Lake (The Delta Lake Project Authors 2022a): an abstraction layer for existing data lakes and object storage like S3 which is fully compatible with Apache Spark and which supports features such as ACID transactions, schema enforcement and data versioning. Databricks also offers a managed version of MLflow (Zaharia and The Linux Foundation 2022), an open-source, library-agnostic platform to manage machine learning pipelines which we will describe in more detail in Section 11.2.

DVC (Data Version Control) (Iterative 2022b) is an open-source tool that applies GitOps principles (Gitlab 2022b) to data. DVC manages data and machine learning models through metadata stored in text files, and it uses Git (The Git Development Team 2022) to version them and to track their provenance. DVC is both a command-line tool and a library, and is language- and framework-agnostic.

DVC and MLflow (Zaharia and The Linux Foundation 2022) implement experiment tracking in two different ways. DVC organises experiments within Git projects using commits, branches and tags. It automatically tracks data dependencies, machine learning code, parameters and model artefacts: we can compare different experiments through the associated metrics using either its command line or its web interface. MLflow instead provides a tracking server and client libraries that can be integrated into Python, R and Java code as discussed in Section 5.3.6. The tracking server stores the metadata, parameters, metrics and tags collected by the clients for each experiment run into a file or a database. Larger outputs like data files, images and model artefacts are saved separately, for instance, in an object storage.
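Conceptually, both tools persist the same kind of metadata for each run. As a toy, file-based illustration of this idea (not the API of DVC or MLflow; all names are made up), each run record stores parameters, metrics and tags, with large artefacts only referenced by URI:

```python
import json
import tempfile
import time
import uuid
from pathlib import Path

def log_run(store: Path, params: dict, metrics: dict, tags: dict, artefact_uri: str) -> str:
    """Append one experiment run to a file-based store: a caricature of what a
    tracking server persists (artefacts live elsewhere, e.g. in object storage,
    and are only referenced here)."""
    run_id = uuid.uuid4().hex
    record = {"run_id": run_id, "timestamp": time.time(), "params": params,
              "metrics": metrics, "tags": tags, "artefact_uri": artefact_uri}
    (store / f"{run_id}.json").write_text(json.dumps(record))
    return run_id

def best_run(store: Path, metric: str) -> dict:
    """Compare runs on a chosen metric, as the tools' interfaces do."""
    runs = [json.loads(p.read_text()) for p in store.glob("*.json")]
    return max(runs, key=lambda r: r["metrics"][metric])

store = Path(tempfile.mkdtemp())
log_run(store, {"max_depth": 3}, {"auc": 0.81}, {"model": "tree"}, "s3://bucket/model-a")
log_run(store, {"max_depth": 8}, {"auc": 0.87}, {"model": "tree"}, "s3://bucket/model-b")
best = best_run(store, "auc")
```

DVC stores similar records in Git-tracked metadata files, while MLflow keeps them in a database behind its tracking server; the comparison step corresponds to their command-line and web interfaces.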

In addition, there are also proprietary SaaS offerings with experiment tracking and model registry functionalities, to name a few: Neptune (Neptune Labs 2022), Comet (Comet 2022) and Weights & Biases (Weights & Biases 2022); more details on this type of software are in Section 11.2.

10.2 Code Development

Modern software development is a collaborative effort based on knowledge sharing, constant iteration and continuous feedback. Distributed version control systems and Git (The Git Development Team 2022) in particular make all this possible by setting the standard for code versioning and collaborative development and by powering popular platforms such as GitHub and GitLab. Our ability to deliver and deploy software using DevOps relies heavily on Git together with semantic versioning (Preston-Werner 2022) and commit tags. Therefore, Git is a tool that every software engineer should be familiar with. Machine learning and data science professionals should be familiar with it as well because it is used in and influences the design of software like DVC that is used to work with pipelines.
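Semantic versioning encodes releases as MAJOR.MINOR.PATCH in Git tags; a short sketch of how CI scripts commonly parse and compare such tags (the tag names are invented for illustration):

```python
import re

def parse_semver(tag: str) -> tuple[int, int, int]:
    """Parse a Git tag like 'v1.4.2' into a comparable (major, minor, patch) tuple."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
    if m is None:
        raise ValueError(f"not a semantic version tag: {tag}")
    return tuple(int(x) for x in m.groups())

# Numeric comparison, not string comparison: "v1.10.0" is newer than "v1.4.2".
tags = ["v1.4.2", "v1.10.0", "v0.9.1"]
latest = max(tags, key=parse_semver)
```

Comparing the parsed tuples avoids the classic pitfall of lexicographic string comparison, where "v1.10.0" would sort before "v1.4.2".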

Choosing the right set of tools for writing code is a matter of prior experience with specific tools and personal taste. It may be a decision made either by individual developers or at the level of the team, research group or company the developers belong to in order to standardise on a predetermined set of software. In either case, to work efficiently on a machine learning pipeline we will need support for:

  • the programming languages that we will use (Section 6.1);
  • enforcing coding standards (Section 6.3);
  • automated refactoring (Section 6.7);
  • integrations with source code versioning (Section 6.5);
  • running software tests and summarising their results (Section 9.4);
  • interactive debugging (Section 9.4);
  • managing the containers (Section 7.1.4) that encapsulate the developing environment, and the ability to remotely work within them.

Ensuring that all developers use similar tooling is useful for compliance, to simplify the training of new developers and to improve reproducibility. (For the same reasons, we should avoid polyglot programming as discussed in Section 5.2.4). There is a wide variety of tools to choose from, falling into the following categories:

  • modern and relatively lightweight code editors such as Atom (Atom 2022) and Visual Studio Code (Microsoft 2022d, also known as VS Code) that can be extended to provide the features above with the use of third-party extensions;
  • integrated development environments (IDEs) such as Eclipse (Eclipse Foundation 2022a) and JetBrains IntelliJ IDEA (JetBrains 2022a);
  • shared interactive computing platforms such as Jupyter notebooks (Project Jupyter 2022).

10.2.1 Code Editors and IDEs

The main difference between an IDE and a code editor is the amount of functionality that is built-in and configured with sane defaults. On the one hand, IDEs integrate most functionality out of the box for a single programming language. For instance, PyCharm (JetBrains 2022b) offers features such as code inspection, code completion, syntax highlighting, version control, debugging, refactoring, test execution and container integration like other major IDEs, but it also supports the Python REPL and provides introspection into the objects created by scientific computing libraries such as NumPy and Pandas. The reference IDE for the R language is RStudio (RStudio 2022a), which integrates a console, an editor that supports syntax highlighting and direct code execution, tools for plotting and inspecting R objects, as well as history, debugging and workspace management.

On the other hand, code editors are more limited out of the box, but they can reach feature parity with IDEs by installing and configuring third-party extensions. For example, VS Code can provide similar functionality to PyCharm by using a language server like the Python Language Server (Palantir 2022) that is compliant with the language server protocol specification (Microsoft 2022f). Other alternatives are Mypy (The mypy Project 2014), Pylance (Microsoft 2022a) (based on the Pyright (Pyright 2022) static type checker from Microsoft), Pytype (Batchelder and et al. 2022) (from Google) and Pyre (Meta Platforms 2022a) (from Facebook). The same level of integration can be accomplished with the languageserver package (Lai and Ren 2022) and the VS Code R extension (REditorSupport 2022) for R and with LanguageServer.jl (Julia VS Code 2022a) and the VS Code Julia extension (Julia VS Code 2022b) for Julia.

Both code editors and IDEs can also run as web applications: a web browser session connects to a cloud instance replicating a common, unified development environment. This approach has two advantages: it reduces technical debt arising from polyglot programming (Section 5.2.4) and makes it possible to develop in environments that are too complex or too resource-intensive to run locally (Section 5.3.2). Code editors like VS Code provide web interfaces (Microsoft 2022j) to navigate files and repositories and to commit small code changes, while IDEs like GitHub Codespaces (GitHub 2022a) and AWS Cloud9 (Amazon 2022c) provide complete cloud development environments backed by virtual machines (Section 7.1.3). We also have the option to self-host them using Docker (Docker 2022a) and Kubernetes (The Kubernetes Authors 2022a): base container images are readily available for Eclipse Che (Eclipse Che 2022), Eclipse Theia (Eclipse Foundation 2022b) and GitPod (Gitpod 2022). As for R, RStudio Server (RStudio 2022b) makes available the same features as the RStudio IDE through a browser-based interface that is connected to R sessions running on a remote server. We would also like to point out DagsHub (DagsHub 2022) as a collaboration platform: it provides a shared work environment for data science and machine learning projects that follows the development patterns and the practices presented in this book. It integrates with GitHub, DVC, MLflow, Jenkins (Jenkins 2022b) and many other open-source tools.

Finally, we would like to mention one last set of code editors: Vim, Neovim (Neovim 2022) and Emacs (GNU Project 2022). They are valued by developers who prefer to create a modular development environment that fits their specific needs. All three editors provide full support for R, Julia and Python through plug-ins that communicate with the respective language servers. Although their learning curve is steep at first, they allow for unparalleled speed of action on code bases of any size in the long run.

10.2.2 Notebooks

In recent years, notebooks have seen widespread adoption in machine learning projects and, more in general, in scientific research. They are typically implemented as Jupyter notebooks (Project Jupyter 2022; the name combines Julia, Python and R, the programming languages it originally supported), an interactive development tool that is ideal for building proofs of concept. Jupyter notebooks are designed to quickly test ideas, to evaluate the trade-offs of different alternatives and to share code, results and figures intermixed with documentation in Markdown format. They can be executed interactively directly from GitHub and GitLab, from a dedicated collaboration platform such as Google Colaboratory (Google 2022g, also known as Colab) or from MLOps platforms such as Amazon SageMaker (Amazon 2022d) or Azure Machine Learning (Microsoft 2022c).

Despite considerable programming language support and a significant adoption by the scientific community, the use of Jupyter notebooks has several shortcomings in the context of modern development practices (Chapter 6, in particular Section 6.3). The Jupyter notebook file format stores code, outputs, images and Markdown text in a single huge JSON object to produce a self-contained, portable artefact. This architectural choice has four major shortcomings:

  1. It is challenging to version a notebook correctly because stochastic outputs change every time it runs.
  2. Representing code in “cells” that can be executed in any order is at odds with the imperative nature of the programming languages used in machine learning software.
  3. Dividing code into cells to interleave their outputs and the surrounding text impacts code modularity and code reuse and reduces our ability to produce good abstractions. We can accept some level of coupling between cells (because they run code in a shared, hidden global environment) and add glue code to make things work, but that leads to an increase in technical debt (Sections 5.2.3 and 5.2.4).
  4. Notebooks do not have any built-in support for automated testing (Section 9.4) or deployment (Chapter 7). While this can be acceptable for exploring data and prototyping models (Section 5.3.2), it makes them unsuitable for developing production-level software.

In particular, executing cells in a non-linear order can lead to inconsistent results because cells affect each other’s environment, effectively creating a “hidden state” that is very difficult to track. The only way to achieve reproducibility is to always execute all the code in the notebook from the top in a clean environment. For these reasons, we cannot use notebooks directly to serve machine learning models in a production environment. However, this may change in the future: there are ongoing efforts to develop tools for diffing and merging (nbdime (Alnæs and Project Jupyter 2022) and nbstripout (Rathgeber 2022)), automatic testing (testbook (Nteract Team 2022b) and nbval (Cortes-Ortuno et al. 2022)), automation (Papermill (Nteract Team 2022a)) and quality assurance (nbQA (nbQA Team 2022)).
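Since a notebook is one JSON object, tools like nbstripout work by rewriting that object before it is committed. A simplified sketch of the idea (the real tool handles more metadata fields than this):

```python
import json

def strip_outputs(notebook_json: str) -> str:
    """Remove cell outputs and execution counts so that only code and text
    are versioned, making diffs stable across runs."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []          # stochastic outputs change at every run
            cell["execution_count"] = None  # so does the execution counter
    return json.dumps(nb)

# A minimal notebook with one code cell and one Markdown cell.
raw = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["1 + 1"], "execution_count": 3,
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": ["2"]}}]},
        {"cell_type": "markdown", "source": ["# Notes"]},
    ],
    "nbformat": 4, "nbformat_minor": 5,
})
stripped = json.loads(strip_outputs(raw))
```

Running such a filter as a Git hook keeps the versioned notebook free of the volatile outputs that make naive diffs unreadable.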

RMarkdown and Julia notebooks are fully reproducible because they execute all the code in the notebook from the top in a clean environment by default, so there is no state inconsistency after we change the code in a cell and re-run it. Furthermore, both are easier to version and to diff than Jupyter notebooks because they store text, outputs and figures in a separate Markdown, PDF or HTML file when compiled. RMarkdown notebooks are well supported by RStudio, which provides code auto-complete, linting and suggestions, but they can be edited with any text editor and compiled through the command line as well. We can also enhance them with the workflowr R package (Blischak, Carbonetto, and Stephens 2022), which combines literate programming (with knitr (Xie 2015)) and version control (with git2r (Widgren and et al. 2022)) to generate shareable HTML pages containing time-stamped, versioned code blocks and outputs. For reproducibility, each analysis is run in a new R session.

Given how Jupyter notebooks are geared towards prototyping, we suggest that they should be integrated in a modern development workflow as follows:

  1. Experiment and build a prototype of the code using notebooks.
  2. When the prototype is complete, move the code to a new Git repository and start refactoring it using an IDE or a code editor to make it modular and scalable. At the same time, add software tests.
  3. Add docstrings (Goodger and Rossum 2022) to the code using the text in Jupyter Markdown as a base.
  4. Package your artefact using pip (Python Software Foundation 2022a) and setuptools (Python Packaging Authority 2022) (Section 7.1.2) for later use as a module within the machine learning pipeline or as a library that will be imported by other Jupyter notebooks.
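For step 4, the packaging metadata can be as small as a pyproject.toml along the following lines (a sketch: the package name, version and dependencies are placeholders to be replaced with your own):

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "my-pipeline-module"        # placeholder package name
version = "0.1.0"
dependencies = ["pandas", "numpy"] # placeholder dependencies
```

With this file in place, `pip install .` builds and installs the module so that both the pipeline and other notebooks can import it.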

10.2.3 Accessing Data and Documentation

Quickly accessing documentation during development is invaluable when working on complex code bases. We can use offline documentation browsers such as Velocity (Silverlake Software 2022), Dash (a macOS tool unrelated to the Plotly dashboard library of the same name) or the open-source Zeal. All three can automatically download the docsets (HTML documentation archives for offline usage) for major programming languages and machine learning frameworks, and they also integrate with the leading IDEs and code editors. Overall, they are interchangeable in terms of features.

As for accessing data, object storage like Amazon S3 is becoming the de facto standard for data interchange in data science and machine learning. Therefore, it is very useful to integrate code editors and IDEs with libraries capable of abstracting the listing, downloading and uploading of data into object storage buckets across multiple cloud vendors. A popular example is MinIO (MinIO 2022), which is fully compatible with the S3 APIs and provides an open-source SDK for multiple languages.

10.3 Build, Test and Documentation Tools

Using appropriate tools for building, testing and performing software quality assurance is important to improve ergonomics and reduce the likelihood of errors. In addition, we may want to use the same set of tools in all environments (developer workstations, staging and production environments) and in all stages of development, both to avoid inconsistencies and to maintain a common environment shared by all the people who work on the pipeline.

Currently, containers (Section 7.1.4) are the most common way of packaging the modules of a machine learning pipeline: either individually with Docker, in groups with Docker Compose (Docker 2022c) or Podman (The Containers Organization 2022), or as a single-node Kubernetes cluster with Minikube (The Kubernetes Authors 2022c) or MicroK8s (Canonical 2022b). All these solutions build on the same container image format as Docker, with different trade-offs in terms of architecture and functionality, and therefore support the deployment practices described in Chapter 7.

One of the points of using containers is to isolate different modules and applications from each other. We can further decouple our code from the software installed within each container by using pip (Python Software Foundation 2022a) plus virtualenv (Python Packaging Authority (PyPA) 2022) to create isolated Python environments and to manage dependencies on packages and on specific versions of the Python interpreter (without collisions with the globally installed ones). We can install and switch between multiple versions of Python using pyenv (Yamashita, Stephenson, and et al. 2022) or Poetry (Kalnytskyi 2022a), or a more general-purpose tool like asdf (Manohar 2022) that supports multiple runtime versions for the most used interpreters, compilers and development tools. If our needs are too complex for this approach, we might consider tools such as Pipenv (Reitz and Python Packaging Authority (PyPA) 2022) and Conda (Anaconda 2022b). Conda in particular has broad support for machine learning and data science applications (Anaconda 2022a) but is rather cumbersome to use. The R counterpart of Pipenv is the packrat package (Ushey et al. 2022), which gives each project a private, local library of R packages.
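The isolation that virtualenv provides can also be scripted with the standard library's venv module; a minimal sketch that creates a throwaway environment:

```python
import venv
from pathlib import Path
from tempfile import mkdtemp

# Create an isolated environment (without pip, to keep it fast); each project
# gets its own site-packages, so dependencies cannot collide across projects
# or with the globally installed packages.
env_dir = Path(mkdtemp()) / "env"
venv.create(env_dir, with_pip=False)

# pyvenv.cfg records which base interpreter the environment is tied to.
cfg = (env_dir / "pyvenv.cfg").read_text()
```

Activating the environment (via the scripts in its bin/ or Scripts/ directory) then makes its interpreter and packages take precedence over the global installation.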

Automated tests are another key feature of modern practices for developing (Section 6.5), refactoring (Section 6.7) and maintaining software (Section 9.4). Each test should be run in a clean environment, such as a container that is re-created at each run: we want to avoid the execution of one test influencing the results of another. (The automated and reproducible deployment practices we discussed in Chapter 7 are a key enabler of automated testing!) Test results should be reported as pass/fail by the CI/CD pipeline to facilitate code review (Section 6.6) and to ensure that the pipeline is always in a functioning state. There are many frameworks and libraries that we can use to implement the types of tests described in Section 9.4. For individual modules, we can use the unittest (Python Software Foundation 2022c) and doctest (Python Software Foundation 2022b) packages in the Python standard library, the testthat R package (Wickham, RStudio, and R Core Team 2022) and the Julia Test module.
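For instance, a module function can carry a doctest example in its docstring and be covered by a unittest case at the same time (impute_mean is a hypothetical pipeline helper, not from any particular library):

```python
import unittest

def impute_mean(values):
    """Replace None entries with the mean of the observed values.

    >>> impute_mean([1.0, None, 3.0])
    [1.0, 2.0, 3.0]
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

class TestImputeMean(unittest.TestCase):
    def test_no_missing_values(self):
        self.assertEqual(impute_mean([1.0, 2.0]), [1.0, 2.0])

    def test_missing_value_is_replaced(self):
        self.assertEqual(impute_mean([2.0, None]), [2.0, 2.0])

# Run the unit tests programmatically; in CI this would be `python -m unittest`,
# while `python -m doctest module.py` would additionally check the docstring example.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestImputeMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The doctest doubles as documentation, while the unittest cases capture the edge cases that would clutter a docstring.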

We should also test that the machine learning pipeline as a whole works as expected. Tools like Airflow (The Apache Software Foundation 2022a), DVC and Pachyderm (Pachyderm 2022) use the DAG that maps the dependencies between the modules to allow for local, iterative testing. In DVC and Pachyderm, dependencies are specified in a declarative configuration file (say, dvc.yaml for DVC) which can be either written manually or built programmatically using helper commands. DVC does not have any built-in support for testing, so we should instrument modules ourselves (for unit tests) and embed the whole pipeline in a testing framework (for integration, system and acceptance tests). In Airflow, the pipeline is implemented in Python code and dependencies are encoded in a dedicated DAG object: this makes it easy to test individual modules with unittest and to validate data with frameworks such as Great Expectations (Superconductive 2022). As for pipelines running on GitHub, GitLab or Jenkins, we can use actions and runners (Gitlab 2022a; Nektos 2022; Jenkins 2022a) that, albeit with some limitations, can run the complete pipeline or some of its parts using containers. Another alternative is to validate the pipeline directly by iteratively committing changes to a test branch and pushing them to the mainline branch to force the CI to run any tests that may be relevant. Jenkins also provides a testing framework for implementing unit tests on the configuration and on the conditional logic of the pipeline code and a command-line tool for linting the pipeline. GitLab provides APIs to trigger validation and linting for the same purpose.
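The DAG these tools rely on can be illustrated with the Python standard library alone: given the dependencies between (hypothetical) pipeline modules, a topological sort yields a valid execution order, which is what makes local, iterative testing of individual stages possible:

```python
from graphlib import TopologicalSorter

# Each module lists the modules it depends on, much like the stages declared
# in a dvc.yaml file or the tasks in an Airflow DAG (module names are made up).
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "train": {"clean"},
    "evaluate": {"train"},
    "report": {"evaluate", "clean"},
}

# A valid execution order: every module runs after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

When a single module changes, the DAG also tells the tool which downstream modules must be re-run and re-tested, and which cached results can be reused.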

Enforcing code styles and standards, which we discussed in Section 6.3, is a crucial complement to testing to ensure that we produce maintainable, working software. Pylint (Logilab and PyCQA and contributors 2022) is the reference static code analyser and linter for Python: it is based on PEP-8 (Python Enhancement Proposal 8), the official document that contains the guidelines and best practices on how to write Python code. An alternative is Flake8 (Stapleton Cordasco 2022), which builds on other tools such as pycodestyle (a style guide checker), pyflakes (which checks source files for errors) and mccabe (a tool to check the complexity of the code). A comprehensive linting step for Python code should apply a sequence of tools such as the following:

  1. isort (Crosley 2022) to sort imports alphabetically and separate them into sections by type;
  2. black (Langa and et al. 2022) to format the code;
  3. flake8 to check the code style;
  4. pylint as the final step to run static code analysis.
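A sequence like this is commonly automated with the pre-commit tool so that it runs before each commit; a sketch of such a configuration (the rev pins are placeholders to be updated to current releases):

```yaml
repos:
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1   # placeholder revision: pin to a current release
    hooks:
      - id: isort
  - repo: https://github.com/psf/black
    rev: 22.3.0   # placeholder revision
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 4.0.1    # placeholder revision
    hooks:
      - id: flake8
```

pylint, being slower and more thorough, is often left out of the commit hooks and run as a separate static analysis step in CI instead.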

The styler package (Müller and Walthert 2022a), which enforces compliance with the tidyverse style guide (Wickham 2022b), and the lintr package (Hester et al. 2022), which performs static code analysis and which identifies syntax errors and possible semantic issues, fill the roles of the Python packages above for R code. Both lintr (see vignette("continuous-integration")) and styler support CI/CD integration (Müller and Walthert 2022b), accept user-provided code style policies and integrate with RStudio.

Writing documentation and keeping it up to date is also key to maintaining machine learning pipelines over time. Documentation should be versioned like code and kept as close as possible to the code it refers to. Documentation on module, function and class interfaces or on method definitions can be placed in both Python (Goodger and Rossum 2022) and Julia (Krämer 2022) code using structured comments in the docstring or native Sphinx format; Sphinx (Brandl and the Sphinx Team 2022) can then compile those comments into documents in various file formats via the Sphinx autodoc extension (see Section 8.2 for an example). Sphinx can also be used to (re)compile documentation automatically using CI (with “Read the Docs” (Read the Docs 2022)) and to render OpenAPI specification files as static HTML pages (Kalnytskyi 2022c). The OpenAPI specification files can in turn be automatically generated from docstrings using Sphinx (Kalnytskyi 2022b), Apispec (Loria and et al. 2022) or a framework such as FastAPI (Ramírez 2022) and Flask (Pallets Team 2022).
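As a concrete illustration, a Python function documented with Sphinx-style docstring fields that autodoc can render (the function itself is a made-up example):

```python
def normalise(values):
    """Rescale values to the [0, 1] interval.

    :param values: the numbers to rescale; must not all be equal.
    :returns: the rescaled values, preserving order.
    :raises ValueError: if all values are equal.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalise a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]
```

Because the :param:, :returns: and :raises: fields live in the docstring, they are versioned with the code and compiled into the rendered documentation without any duplication.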

In R, we can use comments in the Doxygen (van Heesch 2022) format for the same purpose: they can be parsed by the Roxygen2 package (Wickham, Danenberg, et al. 2022) to generate R documentation in various formats as discussed in Section 8.2.

References

Alnæs, M. S., and Project Jupyter. 2022. nbdime – Diffing and Merging of Jupyter Notebooks. https://nbdime.readthedocs.io.

Amazon. 2022c. AWS Cloud9 Documentation. https://docs.aws.amazon.com/cloud9.

Amazon. 2022d. Machine Learning: Amazon Sagemaker. https://aws.amazon.com/sagemaker/.

Anaconda. 2022b. Package, Dependency and Environment Management for Any Language. https://docs.conda.io.

Apache Software Foundation. 2022b. Impala Documentation. https://impala.apache.org/impala-docs.html.

Atom. 2022. A hackable text editor for the 21st Century. https://atom.io/.

Batchelder, N., and et al. 2022. A Static Type Analyzer for Python Code. https://google.github.io/pytype.

Blischak, J. D., P. Carbonetto, and M. Stephens. 2022. workflowr: A Framework for Reproducible and Collaborative Data Science. https://cran.r-project.org/web/packages/workflowr.

Bokeh. 2022. Bokeh Documentation. https://docs.bokeh.org/en/latest/.

Brandl, G., and the Sphinx Team. 2022. Sphinx: Python Documentation Generator. https://www.sphinx-doc.org/en/master/.

Canonical. 2022b. MicroK8s Documentation. https://microk8s.io/docs.

Cloudera. 2022. Cloudera: The Hybrid Data Company. https://www.cloudera.com/.

Comet. 2022. Comet Documentation. https://www.comet.com/docs/v2.

Cortes-Ortuno, D., O. Laslett, T. Kluyver, V. Fauske, M. Albert, MinRK, O. Hovorka, and H. Fangohr. 2022. IPython Notebook Validation for py.test: Documentation. https://nbval.readthedocs.io.

Crosley, T. 2022. A Python Utility and Library to Sort Imports. https://pycqa.github.io/isort/.

DagsHub. 2022. Welcome to the DagsHub Docs. https://dagshub.com/docs/.

Databricks. 2022. Databricks Documentation. https://docs.databricks.com/applications/machine-learning/index.html.

Docker. 2022a. Docker. https://www.docker.com/.

Docker. 2022c. Overview of Docker Compose. https://docs.docker.com/compose.

Eclipse Che. 2022. Run your favorite IDE on Kubernetes. https://www.eclipse.org/che/technology/.

Eclipse Foundation. 2022a. Desktop IDEs. https://www.eclipse.org/ide/.

Eclipse Foundation. 2022b. Theia: Cloud & Desktop IDE. https://theia-ide.org/docs/.

Firke, S., B. Denney, C. Haid, R. Knight, M. Grosser, and J. Zadra. 2022. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://cran.r-project.org/web/packages/janitor.

Formagrid. 2022. Airtable Is a Modern Spreadsheet Platform with Database Functionalities. https://airtable.com.

GitHub. 2022a. GitHub Codespaces. https://github.com/features/codespaces.

Gitlab. 2022a. GitLab Runner Documentation. https://docs.gitlab.com/runner/.

Gitlab. 2022b. What Is GitOps? https://about.gitlab.com/topics/gitops.

Gitpod. 2022. Gitpod: Always Ready to Code. https://www.gitpod.io.

GNU Project. 2022. GNU EMacs. https://www.gnu.org/software/emacs/.

Goodger, D., and G. van Rossum. 2022. PEP 257: Docstring Conventions. https://peps.python.org/pep-0257/#what-is-a-docstring.

Google. 2022g. Welcome to Colab! https://colab.research.google.com.

Harris, C. R., K. J. Millman, Stéfan J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, et al. 2020. “Array Programming with NumPy.” Nature 585 (7825): 357–62.

Hester, J., F. Angly, R. Hyde, M. Chirico, K. Ren, and A. Rosenstock. 2022. A Linter for R Code. https://cran.r-project.org/web/packages/lintr.

Iterative. 2022b. DVC: Data Version Control. Git for Data & Models. https://github.com/iterative/dvc.

Jenkins. 2022a. A Command Line Tool to Run Jenkinsfile as a Function. https://github.com/jenkinsci/jenkinsfile-runner.

Jenkins. 2022b. Jenkins User Documentation. https://www.jenkins.io/doc/.

JetBrains. 2022a. IntelliJ IDEA. https://www.jetbrains.com/idea/.

JetBrains. 2022b. PyCharm. https://www.jetbrains.com/pycharm/.

Julia VS Code. 2022a. An Implementation of the Microsoft Language Server Protocol for the Julia Language. https://juliapackages.com/p/languageserver.

Julia VS Code. 2022b. Julia for Visual Studio Code. https://www.julia-vscode.org.

Kalnytskyi, I. 2022a. Poetry Documentation. https://python-poetry.org/docs.

Kalnytskyi, I. 2022b. sphinxcontrib-openapi Is a Sphinx Extension to Generate APIs Docs from OpenAPI. https://sphinxcontrib-openapi.readthedocs.io.

Kalnytskyi, I. 2022c. The Sphinx Extension that Renders OpenAPI Specs Using ReDoc. https://sphinxcontrib-redoc.readthedocs.io/en/stable.

Lai, R., and K. Ren. 2022. An Implementation of the Language Server Protocol for R. https://cran.r-project.org/web/packages/languageserver.

Langa, and et al. 2022. Black: The Uncompromising Code Formatter. https://black.readthedocs.io/en/stable/.

Logilab and PyCQA and contributors. 2022. Pylint is a Static Code Analyser for Python 2 or 3. https://pylint.pycqa.org/en/latest/.

Loria, S., and et al. 2022. A Pluggable API Specification Generator. https://apispec.readthedocs.io/en/latest.

Manohar, A. 2022. asdf Documentation. https://asdf-vm.com/guide/getting-started.html.

McKinney, W. 2017. Python for Data Analysis. 2nd ed. O’Reilly.

Meta Platforms. 2022a. A Performant Type-Checker for Python 3. https://pyre-check.org.

Microsoft. 2022a. A Performant, Feature-Rich Language Server for Python in VS Code. https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance.

Microsoft. 2022c. Azure Machine Learning. https://azure.microsoft.com/en-us/services/machine-learning/.

Microsoft. 2022d. Code Editing. Redefined. https://code.visualstudio.com/.

Microsoft. 2022f. Language Server Protocol. https://microsoft.github.io/language-server-protocol.

Microsoft. 2022j. VS Code in the Web. https://vscode.dev.

MinIO. 2022. MinIO Documentation. https://docs.min.io/docs.

Müller, K., and L. Walthert. 2022a. Non-Invasive Pretty Printing of R Code. https://cran.r-project.org/web/packages/styler.

Müller, K., and L. Walthert. 2022b. Third-Party Integrations. https://styler.r-lib.org/articles/third-party-integrations.html.

nbQA Team. 2022. Run isort, pyupgrade, mypy, pylint, flake8, and More on Jupyter Notebooks. https://github.com/nbQA-dev/nbQA.

Nektos. 2022. Run Your GitHub Actions Locally. https://github.com/nektos/act.

Neovim. 2022. Hyperextensible Vim-Based Text Editor. https://neovim.io/.

Neptune Labs. 2022. Neptune Documentation. https://docs.neptune.ai/.

Nteract Team. 2022a. Papermill Is a Tool for Parameterizing and Executing Jupyter Notebooks. https://papermill.readthedocs.io.

Nteract Team. 2022b. Testbook. https://testbook.readthedocs.io/en/latest/.

Openrefine. 2022. A Free, Open Source, Powerful Tool for Working with Messy Data. https://openrefine.org.

Pachyderm. 2022. Data-Centric Pipelines and Data Versioning. https://docs.pachyderm.com/latest.

Palantir. 2022. Python Language Server. https://github.com/palantir/python-language-server.

Pallets Team. 2022. Flask Documentation. https://flask.palletsprojects.com/en/latest.

Plotly. 2022a. Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required. https://github.com/plotly/dash.

Preston-Werner, T. 2022. Semantic Versioning. https://semver.org/.

Project Jupyter. 2022. Jupyter. https://jupyter.org/.

Pyright. 2022. Static Type Checker for Python. https://github.com/microsoft/pyright.

Python Packaging Authority. 2022. Building and Distributing Packages with Setuptools. https://setuptools.pypa.io/en/latest/userguide/index.html.

Python Packaging Authority (PyPA). 2022. Virtualenv Documentation. https://virtualenv.pypa.io/en/latest/.

Python Software Foundation. 2022a. PyPI: The Python Package Index. https://pypi.org/.

Python Software Foundation. 2022b. Test Interactive Python Examples. https://docs.python.org/3/library/doctest.html.

Python Software Foundation. 2022c. unittest: Unit Testing Framework. https://docs.python.org/3/library/unittest.html.

Ramírez, S. 2022. FastAPI Framework, High Performance, Easy to Learn, Fast to Code, Ready for Production. https://fastapi.tiangolo.com.

Rathgeber, F. 2022. Strip Output from Jupyter and IPython Notebooks. https://github.com/kynan/nbstripout.

Read the Docs. 2022. Read the Docs: Documentation Simplified. https://docs.readthedocs.io.

REditorSupport. 2022. R in Visual Studio Code. https://marketplace.visualstudio.com/items?itemName=REditorSupport.r.

Reitz, K., and Python Packaging Authority (PyPA). 2022. Pipenv: Python Dev Workflow for Humans. https://pipenv.pypa.io.

RStudio. 2022a. Open Source and Enterprise-Ready Professional Software for Data Science. https://www.rstudio.com.

Silverlake Software. 2022. Velocity: The Documentation and Docset Viewer for Windows. https://velocity.silverlakesoftware.com/.

Snowflake. 2022. Snowflake Documentation. https://docs.snowflake.com.

Stapleton Cordasco, I. 2022. Flake8: Your Tool for Style Guide Enforcement. https://flake8.pycqa.org/en/latest/.

Superconductive. 2022. Great Expectations. https://docs.greatexpectations.io/docs.

Tableau Software. 2022. Tableau. https://www.tableau.com/.

The Apache Software Foundation. 2022a. Airflow Documentation. https://airflow.apache.org/docs/.

The Apache Software Foundation. 2022c. Apache Hadoop. https://hadoop.apache.org/.

The Apache Software Foundation. 2022d. Apache Hive Documentation. https://cwiki.apache.org/confluence/display/Hive.

The Apache Software Foundation. 2022e. Apache Pig Documentation. https://pig.apache.org/docs/latest.

The Apache Software Foundation. 2022f. Apache Spark Documentation. https://spark.apache.org/docs/latest.

The Containers Organization. 2022. podman. https://podman.io.

The Delta Lake Project Authors. 2022a. Delta Lake Documentation. https://docs.delta.io.

Zeal. 2022. Zeal Is an Offline Documentation Browser for Software Developers. https://zealdocs.org.

The Git Development Team. 2022. Git Source Code Mirror. https://github.com/git/git.

The Kubernetes Authors. 2022a. Kubernetes. https://kubernetes.io/.

The Kubernetes Authors. 2022c. minikube. https://minikube.sigs.k8s.io/docs.

The mypy Project. 2014. mypy: Optional Static Typing for Python. http://mypy-lang.org/.

Trifacta. 2022. Profile, Prepare, and Pipeline Data for Analytics and Machine Learning. https://www.trifacta.com.

Ushey, K., J. McPherson, J. Cheng, A. Atkins, JJ. Allaire, and T. Allen. 2022. Packrat: Reproducible Package Management for R. https://rstudio.github.io/packrat/.

van Heesch, D. 2022. Doxygen. https://www.doxygen.nl/index.html.

Weights & Biases. 2022. Weights & Biases Documentation. https://docs.wandb.ai/.

Wickham, H. 2022b. The tidyverse Style Guide. https://style.tidyverse.org/.

Wickham, H., P. Danenberg, G. Csárdi, M. Eugster, and RStudio. 2022. roxygen2: In-Line Documentation for R.

Wickham, H., R. François, L. Henry, and K. Müller. 2022. A Fast, Consistent Tool for Working with Data Frame Like Objects, Both in Memory and Out of Memory. https://cloud.r-project.org/web/packages/dplyr.

Wickham, H., M. Girlich, and RStudio. 2022. tidyr: Tidy Messy Data. https://cloud.r-project.org/web/packages/tidyr.

Wickham, H., RStudio, and R Core Team. 2022. Unit Testing for R. https://cloud.r-project.org/web/packages/testthat.

Widgren, S., and et al. 2022. git2r: Provides Access to Git Repositories. https://cran.r-project.org/web/packages/git2r/index.html.

Xie, Y. 2015. Dynamic Documents with R and knitr. 2nd ed. CRC Press.

Xie, Y., J. J. Allaire, and G. Grolemund. 2022. R Markdown: The Definitive Guide. https://bookdown.org/yihui/rmarkdown/.

Yamashita, Y., S. Stephenson, and et al. 2022. pyenv: Simple Python Version Management. https://github.com/pyenv/pyenv.

Zaharia, M., and The Linux Foundation. 2022. MLflow Documentation. https://www.mlflow.org/docs/latest/index.html.


  1. GitOps applies DevOps practices such as version control, collaboration, compliance, and CI/CD to automate infrastructure management (Gitlab 2022b).↩︎