Preface

Pitching new ideas by prefacing them with quotes like “Data scientist: the sexiest job of the 21st century” (Harvard Business Review 2012) or “Data is the new oil” (The Economist 2017) has become such a cliché that any audience (in business and academia alike) will collectively roll their eyes in exasperation. And for good reason. Likewise, we do not believe that radiologists or lorry drivers will be replaced by artificial intelligence and put out of a job in the foreseeable future, and we are not alone in realising the limits of machine learning (The Economist 2020).

Even so, it is difficult to overstate the impact that machine learning is having on many aspects of our lives. It has taken the pre-existing trends of using data and analytics (under the banner of “data mining”, “big data” and similar buzzwords) to inform business decisions and drive scientific discovery, and made them ubiquitous. Machine learning has combined the mathematical rigour of information theory and statistics, the computational aspects of computer science and the goal-driven flexibility of optimisation theory, redefining how we work with data.

The flip side of trying to distil parts of so many different disciplines has been the clash between their respective cultures, which has been well summarised by Leo Breiman in “Statistical Modeling: The Two Cultures” (Breiman 2001b). On top of that, there is a tension between machine learning practice in the industry and academia: the latter strongly values producing novel models and theoretical results, while the former is driven by the need to produce practical results that have business value. With so many different perspectives, it is a wonder that a rough consensus on what machine learning is has actually evolved! (Personally, our red line is conflating deep learning with machine learning. There is life beyond deep neural networks!)

In this melting pot of ideas, we feel that software engineering has played a remarkably small role compared to other disciplines. Machine learning, after all, is “a technique that allows computer systems to improve with experience and data” (Goodfellow, Bengio, and Courville 2016). Therefore, there is a presumption that one will interact with a computer system, which in turn happens by engineering a piece of software that communicates to the computer system what it is supposed to do. The quality of this engineering is crucial in both academia and industry. In academia, software quality issues are one of the underlying causes of the “reproducibility crisis” (Nature 2016; Tatman, VanderPlas, and Dane 2018). In industry, poor engineering leads to lower practical and computational performance (Kang et al. 2021), to a quick accumulation of technical debt (Sculley et al. 2015) and sometimes to catastrophic failures with costs in the millions (.Seven 2014; The Register 2020; VPNOverview 2022; Sherman 2022).

There is, of course, a sizeable body of accumulated wisdom on how to architect and write software in foundational books like The Pragmatic Programmer (Thomas and Hunt 2019) and A Philosophy of Software Design (Ousterhout 2018). However, these books are written with business software in mind, and we find that they either miss or touch only tangentially on key practices that go a long way towards successfully implementing and deploying machine learning models: analysis of algorithms; matching data and algorithms with appropriate hardware; embracing data as part of the software; testing and documenting algorithms and their implementations; modularising and building pipelines; and, last but not least, naming variables. In our experience in academia and in the industry, both engineering software and teaching software engineering to students and new staff alike, these topics are often not given the importance they deserve.

We hope to convince the readers of this book that the viability of any software that analyses data, whether you call it machine learning, data science or business analytics, depends crucially on putting careful thought into these engineering practices. We do not aim to be prescriptive: the individual practices that we discuss will be more or less relevant in different settings, and can be implemented with a variety of software tools. On the contrary, we want our readers to think about what we wrote in the context of their own experience and to figure out which parts apply and which do not!

The book starts with a brief introduction to machine learning and software engineering, to set out how we view them and how we think that they should interact in practical applications. The remainder is structured in four parts, from foundational to practical:

  1. Foundations of Scientific Computing: covering key topics that are foundational for the planning, analysis and design of machine learning software, such as: the trade-offs of using different hardware configurations; the characteristics of different data types and of suitable data structures; and the analysis of algorithms to determine their computational complexity.
  2. Best Practices for Machine Learning and Data Science: revisiting best practices in software engineering from the point of view of a machine learning engineer, from writing, troubleshooting and deploying code to production (that is, serving models) to writing technical documentation.
  3. Tools and Technologies: discussing broad classes of tools that shape how we think about what is feasible to do with machine learning pipelines, with examples from the state of the art and the trade-offs they make.
  4. A Case Study: putting the recommendations in the previous chapters into practice by discussing and prototyping a machine learning pipeline for natural language understanding from the work of Lipizzi et al. (2022).

All the material in this book, including the book itself, is available online at

https://ppml.dev

and will be updated to fix assorted typos and code problems as they become known to us.

Finally, we would like to thank all the people who supported us and made this book possible: first of all, our families, who put up with our long working hours; the colleagues who gave us feedback on early drafts of the book, Vincenzo Manzoni, Fabio Stella and Ron Kenett; and, last but not least, our editor Randi Cohen, who bore with us through the many delays this book suffered during the Covid pandemic.

References

Breiman, L. 2001b. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.

Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.

Harvard Business Review. 2012. Data Scientist: The Sexiest Job of the 21st Century. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.

Kang, S., R. Jin, X. Deng, and R. S. Kenett. 2021. “Challenges of Modeling and Analysis in Cybermanufacturing: A Review from a Machine Learning and Computation Perspective.” Journal of Intelligent Manufacturing Online first.

Lipizzi, C., H. Behrooz, M. Dressman, A. G. Vishwakumar, and K. Batra. 2022. “Acquisition Research: Creating Synergy for Informed Change.” In Proceedings of the 19th Annual Acquisition Research Symposium, 242–55.

Nature. 2016. “Reality Check on Reproducibility.” Nature 533: 437.

Ousterhout, J. 2018. A Philosophy of Software Design. Yaknyam Press.

Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. 2015. “Hidden Technical Debt in Machine Learning Systems.” In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2:2503–11.

.Seven, D. 2014. Knightmare: A DevOps Cautionary Tale. https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/.

Sherman, E. 2022. What Zillow’s Failed Algorithm Means for the Future of Data Science. https://fortune.com/education/business/articles/2022/02/01/what-zillows-failed-algorithm-means-for-the-future-of-data-science/.

Tatman, R., J. VanderPlas, and S. Dane. 2018. “A Practical Taxonomy of Reproducibility for Machine Learning Research.” In Proceedings of the 2nd Reproducibility in Machine Learning Workshop at ICML 2018.

The Economist. 2017. The World’s Most Valuable Resource Is No Longer Oil, but Data. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.

The Economist. 2020. An Understanding of AI’s Limitations Is Starting to Sink In. https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in.

The Register. 2020. Twilio: Someone Waltzed into Our Unsecured AWS S3 Silo, Added Dodgy Code to Our JavaScript SDK for Customers. https://www.theregister.com/2020/07/21/twilio_javascript_sdk_code_injection.

Thomas, D., and A. Hunt. 2019. The Pragmatic Programmer: Your Journey to Mastery. 20th Anniversary Edition. Addison-Wesley.

VPNOverview. 2022. Fintech App Switch Leaks Users’ Transactions, Personal IDs. https://vpnoverview.com/news/fintech-app-switch-leaks-users-transactions-personal-ids.