Fixing Big Data’s blind spot: beyond correlation, into root causes


Data-fueled machine learning has spread to many corners of science and industry and is beginning to make waves in addressing public-policy questions as well. It’s relatively easy these days to automatically classify complex things like text, speech, and photos, or to predict tomorrow’s website traffic. It’s a whole different ballgame to ask a computer to explore how raising the minimum wage might affect employment or to design an algorithm to assign optimal treatments to every patient in a hospital.

The vast majority of machine-learning applications today are just highly functioning versions of simple tasks, says Susan Athey, professor of economics at Stanford Graduate School of Business. They rely in large part on something computers are especially good at: sifting through vast reams of data to identify connections and patterns and thus make accurate predictions. Prediction problems are simple because, in a stable environment, it doesn’t really matter how or why the algorithm operates; it’s easy to measure performance just by seeing how well the program works on test data. All of which means that you don’t have to be an expert to deploy prediction algorithms with confidence.

Despite the proliferation of data collection and computing prowess, machine-learning algorithms aren’t so good at distinguishing between correlation and causation, that is, at determining whether the connection between statistically linked patterns is coincidental or the result of some cause-and-effect force. “Some problems simply aren’t solvable with more data or more complex algorithms,” Athey says.

If machine-learning techniques are going to help address public-policy problems, Athey says, we need to develop new ways of marrying them with causal-inference methods. Doing so would greatly expand the potential of big-data applications and transform our ability to design, evaluate, and improve public-policy work.

What Predictive Models Miss

As government agencies and other public-sector groups embrace big data, Athey says it’s important to understand the realistic limitations of current machine-learning methods. In a recent article published in Science, she outlined a number of scenarios that highlight the distinction between prediction problems and causal-inference problems, and in which common machine-learning applications would have trouble drawing useful conclusions about cause and effect.

Read the source article at Stanford Graduate School of Business.