Misframe

Jan 8, 2026

Statistics is hard

Reflection

I’ve been thinking about how statistics and probability are really non-intuitive. When you’re programming, if you’re wrong, it’s obvious. Your code won’t compile, or when you run it, the output is clearly broken. With statistics, you can be doing it completely wrong and have no idea.

In programming, you get feedback. You access an array index that doesn’t exist, your program crashes. You use the wrong variable name, you get a compilation error. You divide by the wrong number, the result looks wrong.

In statistics, formulas work with whatever data you feed in. They’ll give you an answer. But that doesn’t mean the answer is right.
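To make the contrast concrete, here's a tiny Python sketch with made-up numbers. The programming mistake fails loudly the moment you hit it; the statistical mistake just returns a number.

```python
import numpy as np

# Programming mistake: fails immediately and unmistakably.
daily_visits = [1200, 1350, 1280]
# daily_visits[7]  # IndexError: list index out of range

# Statistical mistake: fails silently. These "observations" are really the
# same day counted seven times, but the formulas neither know nor care.
bad_sample = np.array([1200.0] * 7)
print(np.mean(bad_sample))          # 1200.0
print(np.std(bad_sample, ddof=1))   # 0.0, which looks like a very precise estimate
```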

A lot of statistical methods assume your data is independent and identically distributed (iid). If you’re measuring website traffic over time and treating each day as independent, you’re violating that assumption. Today’s traffic is obviously correlated with yesterday’s. The formula will still run and produce numbers, but the output isn’t reliable. There’s no compiler to tell you that.
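Here's a rough simulation of that traffic example (all numbers made up, with days correlated in a simple AR(1) way). The textbook 95% confidence interval formula runs without complaint, but if you check how often it actually contains the true mean, it's nowhere near 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN = 1000.0

def simulate_traffic(n_days=60, rho=0.8):
    """Daily traffic where each day is correlated with the previous one (AR(1))."""
    x = np.empty(n_days)
    x[0] = TRUE_MEAN
    for t in range(1, n_days):
        x[t] = TRUE_MEAN + rho * (x[t - 1] - TRUE_MEAN) + rng.normal(0, 50)
    return x

# Textbook 95% CI for the mean; it assumes independent observations.
def naive_ci(x):
    se = x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - 1.96 * se, x.mean() + 1.96 * se

# How often does that "95%" interval actually contain the true mean?
trials = 2000
hits = 0
for _ in range(trials):
    lo, hi = naive_ci(simulate_traffic())
    hits += lo <= TRUE_MEAN <= hi
print(f"Coverage of the naive 95% CI: {hits / trials:.0%}")  # well below 95%
```

Nothing in the output hints at the problem; you only see it here because we happen to know the true mean.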

You might calculate a p-value or run a regression, and everything looks fine. The numbers are there. The math checks out. But if your data violates the assumptions, you’re in undefined behavior territory. The results might still be directionally correct, or they might be completely wrong.
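One classic illustration, sketched here with simulated data: regress two series that each wander over time but have nothing to do with each other, and the p-value will report a “significant” relationship most of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two completely unrelated random walks -- think "cumulative signups"
# and "cumulative support tickets" for two different products.
significant = 0
trials = 500
for _ in range(trials):
    x = np.cumsum(rng.normal(size=200))
    y = np.cumsum(rng.normal(size=200))
    result = stats.linregress(x, y)
    significant += result.pvalue < 0.05

# The regression "finds" a relationship most of the time, because the
# independence assumptions behind the p-value don't hold for these series.
print(f"'Significant' slopes between unrelated series: {significant / trials:.0%}")
```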

Statistics is also about asking the right question. When you see statistical output, you have to know what question it’s actually answering. Say you train a linear model to predict house prices with features like square footage, bedrooms, and neighborhood. You might look at the weights and think “neighborhood has the highest weight, so it’s the most important factor.” But that’s not necessarily what the weights tell you. They’re affected by the scale of the variables and by correlations between features. If you want to know what matters most for prediction, you’d need to actually test the model’s performance. If you want causal effects, that’s a different question entirely.
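A quick sketch with made-up housing data shows why raw weight size isn’t the same as importance. The fitted model and its predictions are identical in both runs; only the units of the square-footage feature change, and with them which weight looks biggest:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy housing data: price depends mostly on square footage.
n = 500
sqft = rng.uniform(600, 3500, n)               # square feet
bedrooms = rng.integers(1, 6, n).astype(float)
price = 150 * sqft + 5000 * bedrooms + rng.normal(0, 20000, n)

def fit(features):
    X = np.column_stack([np.ones(n)] + features)
    coefs, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coefs

# Same data, two choices of units for the square-footage feature.
_, w_sqft, w_beds = fit([sqft, bedrooms])
_, w_ksqft, w_beds2 = fit([sqft / 1000, bedrooms])

print(w_sqft, w_beds)     # ~150 and ~5000: bedrooms has the "bigger" weight
print(w_ksqft, w_beds2)   # ~150000 and ~5000: now square footage dominates
# The fit and its predictions are identical; only the units changed.
```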

There’s also a lot of human judgment. You have to choose which data to include, how to handle missing values and outliers, and which hyperparameters to use. There’s no formula that tells you the “right” values. Two people working on the same problem will make different choices. Even when you have metrics like precision and recall, you still need agreement on thresholds, edge cases, and what the ground truth even is.
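For instance, two perfectly defensible thresholds tell two different stories about the same toy classifier (the scores and labels here are made up):

```python
import numpy as np

# Toy classifier scores and ground-truth labels (both made up).
scores = np.array([0.95, 0.90, 0.72, 0.65, 0.55, 0.48, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1,    1,    1,    0,    1,    0,    1,    0,    0,    0])

def precision_recall(threshold):
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    precision = tp / predicted.sum()
    recall = tp / (labels == 1).sum()
    return precision, recall

# Two defensible thresholds, two different stories about the same model.
print(precision_recall(0.60))  # higher precision, lower recall
print(precision_recall(0.35))  # lower precision, higher recall
```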

Run enough tests and you’ll find “significant” results just by chance. Correlation doesn’t mean causation. Statistical significance doesn’t mean practical significance.
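The first point is easy to demonstrate with simulated data: compare groups that are identical by construction, and a predictable fraction of tests still comes back “significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 200 A/B-style comparisons where there is, by construction, no real effect.
false_positives = 0
n_tests = 200
for _ in range(n_tests):
    group_a = rng.normal(0, 1, 100)
    group_b = rng.normal(0, 1, 100)   # drawn from the exact same distribution
    if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
        false_positives += 1

# Roughly 5% of tests come back "significant" even though nothing is going on.
print(f"{false_positives} of {n_tests} tests were 'significant' at p < 0.05")
```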

LLMs make this worse. They’ll generate statistical code without checking whether your data meets the assumptions. And because the output looks professional and the code runs, it’s easy to think you did it right. The LLM is just another tool that works with whatever you feed it. Just like the statistical formulas themselves.
