The modern practice of data analysis is shaped by the convergence of many disciplines, each with its own history: information theory, computer science, optimisation, probability and statistics among them. Machine learning and data science can be considered their latest incarnations, inheriting the mantle of what used to be called “data analytics”. Software engineering should be considered as a crucial addition to this list. Why do we need it to implement modern data analysis efficiently and effectively?
There are many definitions of machine learning. Broadly speaking, it is a discipline that aims to create computer systems and algorithms that can learn a structured representation of reality without (or with less) human supervision in order to interact with it (Russell and Norvig 2009). At one end of the spectrum, we can take this to be a narrow version of artificial general intelligence in which we want our computer systems to learn intellectual tasks independently and to generalise them to new problems, much like a human being would. At the other end, we can view machine learning as the ability to learn probabilistic models that provide a simplified representation of a specific phenomenon to perform a specific task (Ghahramani 2015) such as predicting an outcome of interest (supervised learning) or finding meaningful patterns in the data (unsupervised learning). Somewhere in between these two extremes lie expert systems (Castillo, Gutiérrez, and Hadi 1997), which “capture the ability to think and reason about as an expert would in a particular domain” and can provide “a meaningful answer to a less than fully specified question.”
In order to do this, we need four elements:
- We need a working model of the world that describes the task and its context in a way that a computer can understand.
- We need a goal: how do we measure the performance of the model? Because that is what we optimise for! Usually, it is the ability to predict new events.
- We encode our knowledge of the world, drawing information from training data, experts or both.
- The computer system uses the model as a proxy of reality: as new inputs come in, it performs inference and decides if and how to perform the assigned task.
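As a minimal sketch of how these four elements fit together, the following Python example uses simple linear regression on synthetic data. Everything in it is an assumption made for illustration: the function names (`fit_linear_model`, `predict`, `mean_squared_error`, `decide`), the simulated data and the decision threshold are hypothetical and do not come from any particular library.

```python
import random
import statistics


def fit_linear_model(xs, ys):
    """Encode knowledge from training data: least-squares slope and intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """The working model of the world: a straight line."""
    slope, intercept = model
    return intercept + slope * x


def mean_squared_error(model, xs, ys):
    """The goal: how well does the model predict events it has not seen?"""
    return statistics.fmean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))


def decide(model, new_x, threshold=20.0):
    """Inference on a new input: act only if the prediction exceeds a threshold."""
    return predict(model, new_x) > threshold


# Synthetic data (an assumption): y = 2 + 3x plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points to measure predictive performance.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_linear_model(train_x, train_y)
test_mse = mean_squared_error(model, test_x, test_y)

print("slope, intercept:", model)
print("held-out mean squared error:", test_mse)
print("act on x = 10?", decide(model, 10.0))
```

The held-out error plays the role of the goal: it is computed on data the model did not see during fitting, which is what "the ability to predict new events" amounts to in practice.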
The exact form these elements take will depend on the domain we are trying to represent and on the model we will use to represent it. Machine learning is, at its core, a collection of models and algorithms from optimisation, statistics, probability and information theory that deal with abstract problems: from simple linear regression models (Weisberg 2014), to Bayesian networks (Scutari and Denis 2021), to more complex models such as deep neural networks (Goodfellow, Bengio, and Courville 2016) and Gaussian processes (Rasmussen and Williams 2006). These algorithms can be applied to a variety of domains from healthcare (van der Schaar et al. 2021) to natural language processing (Aggarwal 2018) and computer vision (Voulodimos et al. 2018), with some combinations of algorithms and domains working out better than others.
In classical statistics (Figure 1.1, bottom right), analysing data required the modeller to specify the probabilistic model generating them in order to draw inferences from a limited number of data points. Such models would necessarily have a simple structure for two reasons: because the modeller had to manually interpret their properties and their output, and because of the lack of any substantial computing power to estimate their parameters. This approach would put all the burden on the modeller: most of the utility that could be had from the model would come from the ability of the modeller to distil whatever they were modelling into simple mathematics and to incorporate any available prior information into the model structure. The result is the emphasis on closed-form results, low-order approximations and asymptotics that characterises the earlier part of modern statistics.
There are, however, many phenomena that cannot be feasibly studied in this fashion. Firstly, there are limits to a human modeller’s ability to encode complex behaviour when manually structuring models. These limits can easily be exceeded by phenomena involving large numbers of variables or by non-linear patterns of interactions between variables that are not very regular or known in advance. Secondly, there may not be enough information available to even attempt to structure a probabilistic model. Thirdly, limiting our choice of models to those that can be written in closed form to allow the modeller to fit, interpret and use them manually, without a significant use of computing power, does not necessarily ensure that those models are easy to interpret. For instance, there are many documented pitfalls in interpreting logistic regression (Mood 2010; Ranganathan, Pramesh, and Aggarwal 2017), which is arguably the simplest way to implement classification.
Classical applications of Bayesian statistics (Figure 1.1, top right) address some of these limitations. The modeller still has to structure a model covering both the data and any prior beliefs on their behaviour, but the posterior may be estimated algorithmically using Markov Chain Monte Carlo (MCMC).
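To make "estimated algorithmically" concrete, here is a minimal sketch of a random-walk Metropolis sampler, one of the simplest MCMC algorithms. The setting is an assumption chosen for illustration: a Beta(2, 2) prior on a success probability and 30 successes out of 40 binomial trials, so that the exact posterior is known to be Beta(32, 12) and the sampler can be checked against it. The function names and tuning values (`step`, `burn_in`) are hypothetical.

```python
import math
import random


def log_posterior(theta, successes, failures, prior_a=2.0, prior_b=2.0):
    """Unnormalised log-posterior: Beta prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return ((prior_a - 1.0 + successes) * math.log(theta)
            + (prior_b - 1.0 + failures) * math.log(1.0 - theta))


def metropolis(successes, failures, n_samples=20000, burn_in=2000,
               step=0.1, seed=1):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, failures)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, failures)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(theta)
    return samples


samples = metropolis(successes=30, failures=10)
posterior_mean = sum(samples) / len(samples)

print("MCMC estimate of the posterior mean:", posterior_mean)
print("exact posterior mean, 32 / 44:", 32 / 44)
```

The modeller still writes down the prior and the likelihood in `log_posterior`; only the computation of the posterior is delegated to the algorithm. In this conjugate example the sampler is unnecessary, which is exactly what makes it a convenient check.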
In contrast (Breiman 2001b), algorithmic approaches shift the burden from the modeller to data collection and computer software (Figure 1.1, top left). The modeller’s role in constructing the probabilistic model is limited, and is largely replaced by a computer system sifting through large amounts of data: hence the name “machine learning”. The structure of the model is learned from the data, with few limitations in what it may look like. Neural networks and Gaussian processes are universal approximators, for instance. Almost all the information comes from the data, instead of being prior information that is mediated by the modeller, which is why machine learning approaches are data-hungry.
Data science is similarly data-driven (Figure 1.1, top left), but focuses on extracting insights from raw data and presenting them graphically to support principled decision making. Kenett and Redman (2019) describe it as follows: “the real work of data scientists involves helping people make better decisions on the important issues in the near term and building stronger organizations in the long term”. It requires strong involvement from the data scientist in all areas of business, shifting the focus from computer systems to people. Nevertheless, data scientists use statistical and machine learning models as the means to obtain those insights.
Compared to classical statistics, when data are abundant (Big Data! (Katal, Wazid, and Goudar 2013)) we do not really need to construct their generating process from prior knowledge. The data contain enough information for us to “let them speak for themselves” and obtain useful insights, which are what we are mainly interested in. Of course, prior information from experts is still useful: models that incorporate it tend to be better at producing insights that can be acted upon.
As a result, data science puts a strong focus on the quality of the data, which is often problematic when dealing with data aggregated from multiple sources (data fusion) or with non-tabular data (natural language processing and computer vision). Often, data are poorly defined, simply wrong or ultimately irrelevant for the purpose they were collected for. Expert knowledge is crucial to assess them, to integrate them and to fix them if possible. Machine learning is widely applied to both text and images as well, but until recently it focused mostly on modelling their hidden structure; explainability has since become a hot topic (see, for instance, Li et al. 2016; Simonyan, Vedaldi, and Zisserman 2014).
Computer systems are key to data science, albeit with a different role than in machine learning. Storing and accessing large amounts of data, exploring them interactively, building the software pipelines that analyse them, handling the resulting spiky workloads: these are all tasks that require a sophisticated use of both hardware and software.
Software engineering is the systematic application of sound engineering principles to all phases of the software life cycle: design, development, maintenance, testing and evaluation (van Vliet 2008). Its central tenet is mastering the complexity inherent to developing large pieces of software that are reliable and efficient; that are usable and can be evolved over time; and that can be developed and maintained in a viable way in terms of both cost and effort (Ousterhout 2018).
Early definitions of software engineering suggested that we should treat it as if it were a traditional engineering discipline like, say, civil engineering. The result is the waterfall model (Royce 1987), which lays out software development as a sequence of steps starting from collecting requirements and finishing with the deployment of the finished product. Modern practices recognise, however, that this model is flawed in several ways. Firstly, civil engineering arises from and is bound by the laws of physics, whereas we make up our own world with its own rules when we develop software. These rules will change over time as our understanding of the problem space evolves; the laws of physics do not. Secondly, the task the software is meant to perform will change over time, and our working definition of that task will change as well. Civil engineering mostly deals with well-defined problems that stay well-defined for the duration of the project. Finally, modifying a large building after its construction is completed is very difficult, but we routinely do that with software. Most of the overall effort in the software lifetime is usually in maintaining and evolving it.
Current software engineering practices take the opposite view that software development is an open-ended (“software is never done”), iterative (the “software life-cycle”) process: this is the core of the “Agile Manifesto” (Beck et al. 2001). At a high level, it is organised as shown in Figure 1.2: a perpetual cycle of planning, analysis, design, implementation, testing and maintenance. The design of the software is heavily influenced by the domain it operates in (domain-driven design, Evans 2003). It uses tests (test-driven development, Beck 2002), refactoring (Fowler 2018) and continuous integration (Duvall, Matyas, and Glover 2007) to incorporate new features, to fix bugs in a timely manner and to keep the code “in shape”. Admittedly, all of these approaches have been touted as silver bullets to the point that they have become buzzwords, and their practical implementation has often distorted them to the point of making software development worse. However, the key ideas of agile have merit, and we will discuss and apply them in moderation in this book. They are well suited to structuring the development of machine learning pipelines, which are built on a combination of mutable models and input data.
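As a small illustration of what tests look like for a piece of a machine learning pipeline, the sketch below pairs a hypothetical preprocessing function, `standardise`, with unit tests written using Python's standard `unittest` module. The function, its behaviour on constant features and the test names are assumptions made for this example; in test-driven development the tests would be written first and the function filled in until they pass.

```python
import math
import statistics
import unittest


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    if not values:
        raise ValueError("cannot standardise an empty sequence")
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        # A constant feature carries no information: map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


class StandardiseTest(unittest.TestCase):
    def test_zero_mean_and_unit_variance(self):
        out = standardise([1.0, 2.0, 3.0, 4.0])
        self.assertAlmostEqual(sum(out) / len(out), 0.0)
        self.assertAlmostEqual(sum(v * v for v in out) / len(out), 1.0)

    def test_constant_feature_does_not_divide_by_zero(self):
        self.assertEqual(standardise([5.0, 5.0, 5.0]), [0.0, 0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            standardise([])


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StandardiseTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Each test pins down one decision about the function's behaviour, so a later refactoring that changes it by accident fails visibly. A continuous integration service would typically run the same suite on every change.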
The centrality of computing in machine learning and data science makes software engineering practices essential in modern data analysis: most of the work is done by computer systems, which are powered by software. Encoding the data, storing and retrieving them efficiently, implementing machine learning models, tying them together and with other systems: each of these tasks is complex enough that only sound engineering practices can ensure the overall correctness of what we are doing. This is true, in different ways, for both academic research and industry applications. As Kenett and Redman (2019) put it, using a car analogy:
“If data is the new oil, technology is the new engine. The engine powers the car and, without technological advancements, a data- and analytics-led transformation would not be possible. Technologies include databases, communications systems and protocols, applications that support the storage and processing of data, and the raw computing horsepower (much of it now in the cloud) to drive it all.”
In academia, there is a widespread belief that the software implementations of novel methods can be treated as “one-off scripts”. “We only need to run it once to write this paper, there is no point in refactoring and re-engineering it” is a depressingly common sentiment, as is not sharing code to “stay ahead of the competition”. However, research and application papers using machine learning rely crucially on the quality of the software they use because:
- The models themselves are often black boxes whose mathematical behaviour is not completely understood (Section 9.2).
- The data are complex enough that even experts in the domains they come from struggle to completely explain them (Section 9.1).
If we do not understand both the data and the models completely, it becomes very difficult to spot problems in the software we use to work on them: unexpected behaviour arising from software bugs may be mistaken for a peculiarity in either of them. It is then crucial that we minimise the chances of this happening by applying all the best engineering practices we have at our disposal. Present and past failures to do so have led to a widespread “reproducibility crisis” in fields as diverse as drug research (Prinz, Schlange, and Asadullah 2011, 20–25% reproducible), comparative psychology (Stevens 2017, 36% reproducible), finance (Chang and Li 2015, 43% reproducible) and computational neuroscience (Miłkowski, Hensel, and Hohol 2018, only 12% of papers provide both data and code). Machine learning and artificial intelligence research is in a similarly sorry state: the finding that “when the original authors provided assistance to the reproducers, 85% of results were successfully reproduced, compared to 4% when the authors didn’t respond” (Pineau et al. 2021) suggests that there is room for improvement. Fortunately, in recent years scientists have widely accepted that this is a problem (Nature 2016), and the machine learning community has reached some consensus on how to tackle it (Tatman, VanderPlas, and Dane 2018).
In industry, poor engineering leads to lower practical and computational performance and a quick accumulation of technical debt (Sculley et al. 2015, and Section 5.2). Badly engineered data may not contain the information we are looking for in a usable form; models that are not well packaged may be slow to deploy and difficult to roll back; data may contain biases or may change over time in ways that make models fail silently; or the machine learning software may become an inscrutable black box whose outputs are impossible to explain, making troubleshooting impossible.
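One of these failure modes, input data drifting away from what the model was trained on, can be turned from a silent failure into a loud one with even a very crude check. The sketch below is an illustration under stated assumptions, not a recommended monitoring method: it compares the mean of one numeric feature in a new batch against a reference sample, ignores the uncertainty in the reference mean, and uses an arbitrary threshold. The names `drift_score` and `check_batch` are hypothetical.

```python
import math
import statistics


def drift_score(reference, batch):
    """Shift of the batch mean from the reference mean, in standard errors.

    Assumes the reference sample has at least two points and is not constant.
    """
    ref_mean = statistics.fmean(reference)
    ref_sd = statistics.stdev(reference)
    standard_error = ref_sd / math.sqrt(len(batch))
    return abs(statistics.fmean(batch) - ref_mean) / standard_error


def check_batch(reference, batch, threshold=4.0):
    """Raise instead of letting the model score a batch that looks different."""
    score = drift_score(reference, batch)
    if score > threshold:
        raise ValueError(
            f"input drift detected: score {score:.1f} exceeds {threshold}"
        )
    return score


# Deterministic toy data (an assumption): one feature cycling through 0..9.
reference = [float(i % 10) for i in range(1000)]
stable_batch = [float(i % 10) for i in range(100)]
shifted_batch = [x + 3.0 for x in stable_batch]

print("stable batch score:", check_batch(reference, stable_batch))
try:
    check_batch(reference, shifted_batch)
except ValueError as error:
    print(error)
```

Raising an exception is a deliberate design choice in this sketch: a pipeline that stops and reports a problem is easier to troubleshoot than one that keeps producing predictions from inputs the model has never seen.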
To conclude, we believe that solid machine learning applications and research rest on three pillars:
- The foundations of machine learning (mathematics, probability, computer science), which provide guarantees that the models work.
- Software engineering, which provides guarantees that the implementations of the models work (effectively and efficiently).
- The quality of the data in terms of features, size, fairness, and in how they were collected.
In this book, we will concentrate on the software engineering aspect, touching briefly on some aspects of the data. We will not discuss the theoretical or methodological aspects of machine learning, which are better covered in the huge amount of specialised literature published to date (such as Hastie, Tibshirani, and Friedman 2009; Russell and Norvig 2009; Goodfellow, Bengio, and Courville 2016; Gelman et al. 2013; Rasmussen and Williams 2006 and many others).
Aggarwal, C. C. 2018. Machine Learning for Text. Springer.
Beck, K. 2002. Test-Driven Development by Example. Addison-Wesley.
Beck, K., M. Beedle, A. Van Bennekum, A. Cockburn, W. Cunningham, M. Fowler, J. Grenning, et al. 2001. The Agile Manifesto. https://www.agilealliance.org/wp-content/uploads/2019/09/agile-manifesto-download-2019.pdf.
Breiman, L. 2001b. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.
Cass, S. 2019. “Taking AI to the Edge: Google’s TPU Now Comes in a Maker-Friendly Package.” IEEE Spectrum 56 (5): 16–17.
Castillo, E., J. M. Gutiérrez, and A. S. Hadi. 1997. Expert Systems and Probabilistic Network Models. Springer.
Chang, A. C., and P. Li. 2015. “Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not’.” In Federal Reserve Board Finance and Economics Discussion Paper, 083.
Duvall, P. M., S. Matyas, and A. Glover. 2007. Continuous Integration: Improving Software Quality and Reducing Risk. Addison-Wesley.
Evans, E. 2003. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley.
Fowler, M. 2018. Refactoring: Improving the Design of Existing Code. 2nd ed. Addison-Wesley.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. CRC Press.
Ghahramani, Z. 2015. “Probabilistic Machine Learning and Artificial Intelligence.” Nature 521: 452–59.
Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
Katal, A., M. Wazid, and R. H. Goudar. 2013. “Big Data: Issues, Challenges, Tools and Good Practices.” In Proceedings of the International Conference on Contemporary Computing, 404–9.
Kenett, R. S., and T. C. Redman. 2019. The Real Work of Data Science. Wiley.
Li, J., X. Chen, E. Hovy, and D. Jurafsky. 2016. “Visualizing and Understanding Neural Models in NLP.” In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 681–91. Association for Computational Linguistics.
Miłkowski, M., W. M. Hensel, and M. Hohol. 2018. “Replicability or Reproducibility? On the Replication Crisis in Computational Neuroscience and Sharing Only Relevant Detail.” Journal of Computational Neuroscience 45: 163–72.
Mood, C. 2010. “Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It.” European Sociological Review 26 (1): 67–82.
Nature. 2016. “Reality Check on Reproducibility.” Nature 533: 437.
Nvidia. 2021. CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/.
Ousterhout, J. 2018. A Philosophy of Software Design. Yaknyam Press.
Pineau, J., P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and H. Larochelle. 2021. “Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program).” Journal of Machine Learning Research 22: 1–20.
Prinz, F., T. Schlange, and K. Asadullah. 2011. “Believe It or Not: How Much Can We Rely on Published Data on Potential Drug Targets?” Nature Reviews Drug Discovery 10: 712.
Ranganathan, P., C. S. Pramesh, and R. Aggarwal. 2017. “Common Pitfalls in Statistical Analysis: Logistic Regression.” Perspectives in Clinical Research 8 (3): 148–51.
Rasmussen, C. E., and C. K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Reuther, A., P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner. 2020. “Survey of Machine Learning Accelerators.” In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1–12.
Royce, W. W. 1987. “Managing the Development of Large Software Systems: Concepts and Techniques.” In Proceedings of the 9th International Conference on Software Engineering, 328–38.
Russell, S. J., and P. Norvig. 2009. Artificial Intelligence: A Modern Approach. 3rd ed. Prentice Hall.
Scutari, M., and J.-B. Denis. 2021. Bayesian Networks with Examples in R. 2nd ed. Chapman & Hall.
Simonyan, K., A. Vedaldi, and A. Zisserman. 2014. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Workshop Track.
Stevens, J. R. 2017. “Replicability and Reproducibility in Comparative Psychology.” Frontiers in Psychology 8: 862.
Tatman, R., J. VanderPlas, and S. Dane. 2018. “A Practical Taxonomy of Reproducibility for Machine Learning Research.” In Proceedings of the 2nd Reproducibility in Machine Learning Workshop at ICML 2018.
van der Schaar, M., A. M. Alaa, A. Floto, A. Gimson, S. Scholtes, A. Wood, E. McKinney, D. Jarrett, P. Liò, and A. Ercole. 2021. “How Artificial Intelligence and Machine Learning Can Help Healthcare Systems Respond to COVID-19.” Machine Learning 110: 1–14.
van Vliet, H. 2008. Software Engineering: Principles and Practice. Wiley.
Voulodimos, A., N. Doulamis, A. Doulamis, and E. Protopapadakis. 2018. “Deep Learning for Computer Vision: A Brief Review.” Computational Intelligence and Neuroscience 2018 (7068349): 1–13.
Weisberg, S. 2014. Applied Linear Regression. 4th ed. Wiley.