There's No Escape from Inference
I came across a Reddit post a while ago that asked data scientists how inferential statistics can be useful in business. At first, I was a bit surprised that this was even called into question and that the post was not flooded with example use cases from the respondents' daily work. But then I recalled my own encounters with some of the more senior data scientists who completely missed that what they were doing was inference. I've seen some who successfully trained complex models but couldn't comfortably explain the semantics of basic inferential statistics. I've seen shallow understanding lead to awkward conversations, where answering questions about a chart's error bars is a struggle, let alone discussions on "what exactly do you mean by significant?"
In their defence, data science is a huge field that builds on so many disciplines, including statistics. It also attracts professionals from all sorts of backgrounds. Nobody can master everything. There are, however, topics that are key foundations, and statistical inference is one of them. If I were to recommend one topic for a junior data scientist to understand deeply, this would be it. It's of course not only for those who hold the title of data scientist. Most people in analytical roles realize at some point that they could benefit from learning more statistics. I bet they'd get a lot of bang for their buck by getting a good grip on the basics of inferential statistics. This is because almost every question worth answering from the data requires inference.
Why we need inference every day
Being data-driven means looking into data, answering questions or learning models, in order to make better decisions. Everybody seems to be aware of the benefits, but here's a less-talked-about part of this process: unless our only goal is to summarize what has happened in the past, we need to do inference. That is, we need to reason, using the data, about what is not in the data.
When we have a sample of the world and answer questions about the bigger world, we're doing inference. And guess what, we always have a sample, even if it's all the data we could have recorded, even if it's 'big data' that we have. You've got lots of data on your users? It's just a sample. You still don't have data on your future users, or even on the same users with different intentions or needs than what you saw before. Consider this simplest of questions: What is the average time a user will spend on your site? When you answer it using your historical data, you're doing inference.
Big data doesn't solve inference. More data can help us make better inferences, but it doesn't free us from it. There's no escaping inference, unless we're dealing with a static, fully observed system. There are not many interesting decision-making problems there. The real world is dynamic and partially observed. We always have a sample of reality.
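To make the average-time question concrete, here is a minimal sketch that reports the estimate together with a rough range instead of a single number. The session durations are made-up values for illustration, and the function name mean_with_interval is my own. It assumes the sessions are independent and uses a normal approximation (z = 1.96 for roughly 95%), which is only a rough guide for samples this small.

```python
import math
import statistics


def mean_with_interval(values, z=1.96):
    """Return (mean, low, high) using a normal-approximation interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem


# Hypothetical session durations in minutes (illustrative numbers only).
session_minutes = [2.5, 7.0, 1.2, 12.4, 3.3, 5.8, 0.9, 4.1, 9.6, 6.2]

mean, low, high = mean_with_interval(session_minutes)
print(f"estimated average: {mean:.1f} min (roughly {low:.1f} to {high:.1f})")
```

The point is not the particular formula but the shape of the answer: an estimate plus a statement of how far off it could plausibly be.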
Inference: a key mindset
Since we need to answer questions with inference, whatever conclusions we draw from our data, and the outcomes of all our data-driven decisions, have one necessary component: uncertainty.
Inference always comes with uncertainty. Although it might sound trivial, it's also really easy to forget, especially in the midst of solving technical challenges or in the rush to meet a deadline. It might not even register that we're answering questions with inference, that our answers are uncertain, and that we can and should quantify this uncertainty. But we are not simply measuring, we are estimating.
Decision making can improve hugely by adopting this simple but key mindset. Not only for major statistical analyses and predictive modelling, but also for our everyday tasks. Seeing the world like this for the first time can be quite a transformative experience. You see statistical inference everywhere, and become instantly aware of conclusions that miss this picture. You're also equipped with a marvellous toolbox that comes in handy in lots of everyday applications.
Killer everyday applications
Besides the textbook example applications of inferential statistics, there are many recurring daily data science tasks that can benefit from the inference mindset and tools. Here are some of the use cases I've faced most in data science projects:
Data quality evaluation. We need to assess the quality of our input data from some sample. How accurate are the movie genres in this dataset? Or these assignments of POIs to cities? Or the product categorizations in our database? What about the image tags that this company is trying to sell us? How well did the annotators label our training data? Questions like these are among the most common and most valuable in real-world data projects.
Debugging machine learning. It's almost a rule that your first learned models don't work as expected. But what should we do when they don't perform? One of the best places to start is the model errors. We check them manually, try to find patterns, and generate ideas for improvement (e.g., Andrew Ng suggests checking 100 examples). More experienced data scientists are better at this art, but few would deny its benefits. This is inference: we take a sample of the output, analyze it, and make decisions for the next steps.
Performance of models and systems. The key question about any system or model is "how good is it?" How fast is the system we've built? How good is the learned model? To answer these questions, we use samples of their performance. We measure the run time of the system over a number of runs under different settings and contexts. We measure the accuracy, precision/recall, or AUC-ROC of our model on a validation data set. We need inference to generalize beyond the test samples and make a statement about our system or model.
Comparison of groups. Decision making is commonly based on comparison of alternatives. Is this product more popular among younger or older customers? Does this newly learned model beat the old one? Which of these ads performs better for our product? Probably the most well-known application of inference for comparison is A/B testing and experimentation. How much of a difference does this diet program make? How much does this feature increase conversion of our site? These questions are answered based on comparison of two or more samples of data.
Business metrics. Businesses rely heavily on their KPIs for monitoring, feedback loops, and decision making. Even when looking at the past, most of these KPIs have forward-looking applications that benefit from measurement of uncertainty. How many people opt to buy a subscription after their trial period on our platform? How much does the average customer spend in a month in our supermarket? What percentage of people purchase this product category? Which branch sells the highest percentage of their inventory of that product? The decisions that rely on the answers can rarely be made optimally with a single-number summary of the past. Whoever makes them needs to know about the uncertainty. If you've ever computed an aggregation and put a filter on group size, it was because of the high uncertainty in inference for small groups (even if that wasn't the explicit goal).
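Several of the use cases above boil down to estimating a proportion from a sample (share of correct labels, conversion rate) or comparing two proportions (an A/B test). As a hedged sketch, the snippet below uses the Wilson score interval for a single proportion and a simple normal-approximation interval for the difference of two. The function names and all counts are hypothetical, chosen only for illustration, and both formulas assume independent observations.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def difference_interval(s_a, n_a, s_b, n_b, z=1.96):
    """Normal-approximation interval for p_b - p_a (two independent samples)."""
    p_a, p_b = s_a / n_a, s_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical data quality check: 92 of 100 sampled labels were correct.
low, high = wilson_interval(92, 100)
print(f"label accuracy plausibly between {low:.1%} and {high:.1%}")

# Hypothetical A/B test: conversions out of visitors for variants A and B.
diff, d_low, d_high = difference_interval(120, 2400, 150, 2400)
print(f"uplift {diff:.2%} (roughly {d_low:.2%} to {d_high:.2%})")
```

When the interval for the difference spans zero, the sample is compatible with there being no real difference between the two variants, however promising the raw uplift looks.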
All the above cases benefit from the theory and tools that statistical inference provides. For instance, a key question is: How much data is enough? Data rarely comes to us for free. Manual checks take time. Annotations cost time and money. A/B tests can reduce revenue and user satisfaction, let alone the opportunity costs. We need to be able to answer questions like: how much uncertainty is there with a sample of 10? Or 100? Or 1000? The theory behind statistical inference is our best friend.
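For a rough feel of the "10, 100, or 1000" question, here is a small sketch for the case of estimating a proportion. It uses the textbook normal-approximation margin of error with the worst case p = 0.5, so treat it as a back-of-the-envelope guide (the approximation is crude for very small samples). The function names are my own.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n items."""
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest n whose approximate margin of error is at most `margin`."""
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


for n in (10, 100, 1000):
    print(f"n={n:>4}: about +/- {margin_of_error(n):.1%}")

print("items needed for +/- 5%:", required_sample_size(0.05))
```

Under these assumptions the margin is about 31 percentage points with 10 items, about 10 with 100, and about 3 with 1000. Uncertainty shrinks with the square root of the sample size, so every tenfold increase in data buys only about a threefold gain in precision.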
Learning the magic
Take a moment to ask: how is it even possible to do inference, to reason, using the data, about what is not in the data? What in the world allows us to do this? It is kind of like magic. The good news is that there are loads of solid, beautiful, and simple statistical techniques that allow us to do inference. Luckily, with a few key concepts you can build a solid toolbox that covers the majority of real-world applications.
The return on investment of getting a deep understanding there is amazing. But it doesn't have to be a mountain to climb. Look into the Law of Large Numbers and the Central Limit Theorem as foundations. Learn highest-probability interval estimation, from traditional (frequentist) confidence intervals to Bayesian and bootstrap methods. Get a good understanding of p-values (although not the best tools, they're still part of everyday language). Learn what each of them exactly means and does not mean, what kind of probability statements they make, and what kind of assumptions they rely on. There are loads of resources out there. With a good understanding of the fundamentals, we can go a long way with just a few formulas for estimating ratios and averages from a given sample.
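Of the methods mentioned above, the bootstrap is perhaps the easiest to try without any formulas. Below is a minimal sketch of a percentile bootstrap interval using only the standard library. The data values are invented for illustration, the function name and defaults (5000 resamples, fixed seed) are arbitrary choices of mine, and the percentile method is just one of several bootstrap variants.

```python
import random
import statistics


def bootstrap_interval(values, stat=statistics.fmean,
                       n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat` computed on `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    low = estimates[int(alpha * n_resamples)]
    high = estimates[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return low, high


# Hypothetical monthly spend per customer (illustrative numbers only).
monthly_spend = [12.0, 48.5, 7.2, 95.0, 23.4, 31.8, 5.5, 60.1, 18.9, 27.3]

low, high = bootstrap_interval(monthly_spend)
print(f"average spend plausibly between {low:.1f} and {high:.1f}")
```

Because the statistic is passed in as a function, the same sketch works for a median or a ratio by swapping `stat`, which is a large part of the bootstrap's appeal for everyday work.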
Statistical inference is not just another topic in data science; it's a cornerstone. Fortunately, the theory is elegant and even simple tools are effective in the field. It's easy to make a difference. The key is to have the inference mindset.