Cassie Kozyrkov

@kozyrkov

Statistics Savvy Self-Test

September 21st 2018

Find out whether you fell for a lie from your college stats course

Setting the scene

If you have all the data you’re interested in, there’s no need for fancy statistical methods. You’re lucky enough to be working with pure facts, so just tally up the numbers and report them.

If you had facts, you wouldn’t need statistics.

What forces you to jump through all the extra hoops is that you can’t get hold of the facts. You only get to glimpse your population through the keyhole that is your sample. It’s an impoverished, incomplete view, and you’ll be using it to make the leap to what you’re actually interested in. Uncertainty comes along as unwelcome baggage.
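To make the contrast concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the population is simulated session lengths, the sample size of 100 is arbitrary, and the interval relies on a normal approximation. It is one simple way to show the idea, not a recommended analysis.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: session lengths (minutes) for 10,000 users.
population = [random.gauss(30, 8) for _ in range(10_000)]

# With the whole population in hand, the mean is a fact: tally and report.
population_mean = statistics.fmean(population)

# With only a sample, the same tally is an estimate that carries uncertainty.
sample = random.sample(population, 100)
sample_mean = statistics.fmean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Rough 95% interval via the normal approximation (an assumption of this sketch).
interval = (
    sample_mean - 1.96 * standard_error,
    sample_mean + 1.96 * standard_error,
)

print(f"Population mean (a fact): {population_mean:.2f}")
print(f"Sample mean (an estimate): {sample_mean:.2f}")
print(f"Approximate 95% interval: {interval[0]:.2f} to {interval[1]:.2f}")
```

The first number needs no statistics at all. The second one only means something once the interval comes along with it.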

Populations in AI

The concept is not confined to statistics courses. Population definitions are extremely important for testing that a machine learning / AI system actually works. In those settings, the population is usually defined in terms of the instances the decision-maker needs the system to work on. Testing is fundamentally statistical, because you’re interested in how the system will perform in the future and for some reason you’re having trouble remembering things that haven’t happened yet. (Humans.) Your test dataset is your sample and you want to make inferences about whether the system will crash and burn when it meets the population it’s supposed to work on.
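As a rough illustration of treating a test set as a sample, the sketch below puts a Wilson score interval around an observed accuracy. The counts (470 correct out of 500) are invented, and the function name `wilson_interval` is just a label chosen for this example. The Wilson interval is one standard choice for a proportion, not the only one.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical test results: 470 correct predictions on 500 test instances.
low, high = wilson_interval(successes=470, trials=500)
print(f"Observed accuracy: {470 / 500:.3f}")
print(f"Plausible range for population accuracy: {low:.3f} to {high:.3f}")
```

The interval only speaks to the population the test instances were drawn from. If the system later meets instances from outside that population, this calculation has nothing to say about them.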

Statistics savvy self-test

Now that you’re oriented to the important role the concept of population plays in statistics (I’ve also written about it elsewhere), take a small quiz to check your statistical expertise:

Imagine I’m a decision-maker submitting a request for statistical work. You, the statistical expert, review my population definition. It’s long, relevant to the request, and pretty darn thorough, but you notice that instead of covering all the days of the week, my population definition includes only users who are active on Mondays. Is there a problem?

Moment of truth! If you answered there’s no problem, you are either new enough to statistics to have an open mind or you’re a Jedi master. Either way, good for you! This is how experts should feel. Their response is, “Sure, seems okay.”

On the other hand, if something rubs you the wrong way about this, you’ve probably been exposed to just enough statistics to be dangerous. Perhaps you took a few undergraduate courses? Turns out we prefer not to trust undergrads with the truth about populations.

The truth about populations

The truth is that the population is, quite literally, whatever the decision-maker chooses to interest themselves in for the purpose of making the decision. How could we possibly tell this to STAT101 students? Imagine how obnoxious they could get on their exams. “I’m interested in my sample, it’s my population, so no calculations are required. My job here is done… hand me my A+.”

That would be a disaster, so instead we tell them, “It’s all the things!”

Well, now that you’re all grown up, it’s time for the truth.

The population is whatever the decision-maker chooses to interest themselves in for the purpose of making this decision.

Since this scenario’s decision-maker (me) chose this population definition thoughtfully, there’s nothing wrong with it. It’s up to me to choose how I want to frame my decision-making. I don’t even have to base my decision on my own product’s users if I don’t want to. Wow, so many choices!

Different decision-makers are allowed to frame the decision differently, and defining the population is part of that.

If you’re the decision-maker, you might choose to include all days of the week (and then I hope you’d go about collecting and analyzing your data differently… there’s no single right way to frame a decision, but there are correct and incorrect analysis choices subject to how the decision has been framed).
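One way to keep the analysis honest about the framing is to write the population definition down as an explicit rule and apply it before any tallying. The sketch below assumes a made-up `Activity` record and a handful of invented rows. The names `Activity` and `in_monday_population` are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Activity:
    user_id: str
    day: date


def in_monday_population(record):
    """The decision-maker's framing: only activity on Mondays counts."""
    return record.day.weekday() == 0  # date.weekday() returns 0 for Monday


# Invented activity log for illustration.
records = [
    Activity("u1", date(2018, 9, 17)),  # Monday
    Activity("u2", date(2018, 9, 18)),  # Tuesday
    Activity("u3", date(2018, 9, 24)),  # Monday
    Activity("u1", date(2018, 9, 24)),  # Monday
]

# Same log, two framings, two different sets of users to reason about.
monday_users = {r.user_id for r in records if in_monday_population(r)}
all_days_users = {r.user_id for r in records}

print(sorted(monday_users))
print(sorted(all_days_users))
```

Neither set is the wrong one. What would be wrong is framing the decision around one of them and then analyzing data collected for the other.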

Ah, but I see the statistician in you is in an argumentative mood. Let’s raise some valid objections together, shall we? Read on in the next article.
