What Machine Learning Practitioners Actually Do, by Rachel Thomas, fast.ai

Dealing with messy and inconsistent data is necessary

By Rachel Thomas, founder, fast.ai and Asst. Professor, Data Institute, University of San Francisco

There are frequent media headlines about both the scarcity of machine learning talent and about the promises of companies claiming their products automate machine learning and eliminate the need for ML expertise altogether. In his recent keynote at the TensorFlow DevSummit, Google’s head of AI Jeff Dean estimated that there are tens of millions of organizations that have electronic data that could be used for machine learning but lack the necessary expertise and skills. I follow these issues closely since my work at fast.ai focuses on enabling more people to use machine learning and on making it easier to use.

Rachel Thomas, founder, fast.ai

In thinking about how we can automate some of the work of machine learning, as well as how to make it more accessible to people with a wider variety of backgrounds, it’s first necessary to ask, what is it that machine learning practitioners do? Any solution to the shortage of machine learning expertise requires answering this question: whether it’s so we know what skills to teach, what tools to build, or what processes to automate.

This post is the first in a 3-part series. It will address what it is that machine learning practitioners do, with Part 2 explaining AutoML and neural architecture search (which several high profile figures have suggested will be key to decreasing the need for data scientists) and Part 3 will cover Google’s heavily-hyped AutoML product in particular.

Building Data Products is Complex Work

While many academic machine learning sources focus almost exclusively on predictive modeling, that is just one piece of what machine learning practitioners do in the wild. The processes of appropriately framing a business problem, collecting and cleaning the data, building the model, implementing the result, and then monitoring for changes are interconnected in many ways that often make it hard to silo off just a single piece (without at least being aware of what the other pieces entail). As Jeremy Howard et al. wrote in Designing Great Data Products, Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing.

A team from Google, D. Sculley et al., wrote the classic Machine Learning: The High-Interest Credit Card of Technical Debt, about the code complexity and technical debt often created when using machine learning in practice. The authors identify a number of system-level interactions, risks, and anti-patterns, including:

  • glue code: massive amount of supporting code written to get data into and out of general-purpose packages
  • pipeline jungles: the system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output
  • re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems
  • risk that changes in the external world may make models or input signals change behavior in unintended ways, and these can be difficult to monitor

The authors write, A remarkable portion of real-world “machine learning” work is devoted to tackling issues of this form… It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles… It may be surprising to the academic community to know that only a tiny fraction of the code in many machine learning systems is actually doing “machine learning”. (emphasis mine)

When machine learning projects fail

In a previous post, I identified some failure modes in which machine learning projects are not effective in the workplace:

  • The data science team builds really cool stuff that never gets used. There’s no buy-in from the rest of the organization for what they’re working on, and some of the data scientists don’t have a good sense of what can realistically be put into production.
  • There is a backlog with data scientists producing models much faster than there is engineering support to put them in production.
  • The data infrastructure engineers are separate from the data scientists. The pipelines don’t have the data the data scientists are asking for now, and the data scientists are under-utilizing the data sources the infrastructure engineers have collected.
  • The company has definitely decided on feature/product X. They need a data scientist to gather some data that supports this decision. The data scientist feels like the PM is ignoring data that contradicts the decision; the PM feels that the data scientist is ignoring other business logic.
  • The data science team interviews a candidate with impressive math modeling and engineering skills. Once hired, the candidate is embedded in a vertical product team that needs simple business analytics. The data scientist is bored and not utilizing their skills.

I framed these as organizational failures in my original post, but they can also be described as various participants being overly focused on just one slice of the complex system that makes up a full data product. These are failures of communication and goal alignment between different parts of the data product pipeline.

So, what do machine learning practitioners do?

As suggested above, building a machine learning product is a multi-faceted and complex task. Here are some of the things that machine learning practitioners may need to do during the process:

Understanding the context:

  • identify areas of the business that could benefit from machine learning
  • communicate with other stakeholders about what machine learning is and is not capable of (there are often many misconceptions)
  • develop understanding of business strategy, risks, and goals to make sure everyone is on the same page
  • identify what kind of data the organization has
  • appropriately frame and scope the task
  • understand operational constraints (e.g. what data is actually available at inference time)
  • proactively identify ethical risks, including how your work could be mis-used by harassers, trolls, authoritarian governments, or for propaganda/disinformation campaigns (and plan how to reduce these risks)
  • identify potential biases and potential negative feedback loops

Read the source post at fast.ai.