Populations — You’re doing it wrong

by Cassie Kozyrkov, September 7th, 2018

Why lawyers might be better than you at statistics

Let’s cover the basics briefly so we can get to the howling-in-frustration bit. In statistics, a population is the collection of all items that you are interested in (for the purpose of making a decision rigorously).

Should you even be attempting statistics?

You can’t answer that until you’re clear on what your population is (and it’s up to you to define it). The whole reason you’d want to take a statistical approach, as opposed to a fact-based one, is that you’re dealing with uncertainty.

A statistical approach only makes sense when there’s a mismatch between the information you want and the information you have.

In other words, your available data (sample) doesn’t cover your whole population. If it did, you’d be dealing with facts, and facts are better than uncertainty. (If you’re thinking this last is a proclamation by Captain Obvious, perhaps you haven’t had the pleasure of grading college exam papers.) Facts mean you don’t need statistical expertise — simply state them and get on with life. No finicky p-values or credible intervals required.
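To make the facts-versus-statistics check concrete, here is a minimal Python sketch. It assumes you can list an identifier for every item in your population, which is often not possible in practice, and the function names `covers_population` and `choose_approach` are made up for illustration.

```python
def covers_population(population_ids, sample_ids):
    """Return True when every item in the population also appears in the sample."""
    return set(population_ids) <= set(sample_ids)


def choose_approach(population_ids, sample_ids):
    """Suggest 'facts' when the data covers the whole population, else 'statistics'."""
    if covers_population(population_ids, sample_ids):
        # No mismatch between the information you want and the information you have.
        return "facts"
    # Some items of interest are unobserved, so there is uncertainty to deal with.
    return "statistics"
```

If the first branch fires, you can simply state what the data says. Only the second branch calls for p-values or credible intervals.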

Please don’t try to use statistical terms as bling for making your report more sophisticated. Attempting hypothesis testing when there’s no uncertainty is about as cute as looking in the sky for dead birds.

Cringeworthy populations

Okay, hopefully you’re convinced that the concept of population is pretty important to the whole practice of statistics.

In the Icarus-like leap from sample to population, expect a big splat if you don’t know where you’re aiming.

Now let me show you a classic way decision-makers keep getting it wrong.

Imagine that you’re a lawyer reviewing a contract for me and my friends. We’ve told you we want to give our product’s users a $50 voucher for chocolate. When you look inside the contract to see how the people eligible for a voucher are described, it simply says “all users.” No more and no less.

Anything wrong here?

You don’t have to be a legal expert to see that there’s a big problem! We haven’t defined “all users.”

What does “all users” even mean?

If we let this contract see the light of day before we’ve really thought about what we mean by “all users”, we’ll find ourselves flat-footed as all kinds of users climb out of the woodwork demanding chocolate. What about the people who don’t sign up but use the product on a friend’s account? Do they count? What about the ones who use the product for one second and drop it… just to score some chocolate? What about the people who claim they used it on a friend’s account in the past without signing up? Do we give them chocolate too? What about the ones who claim they’ll be future users (but want the chocolate now)? We’ll be bankrupt from chocolate vouchers before we know it.

Think of a population as the legal contract at the heart of statistics. Here’s the deal: by writing down a description of your population, you’re specifying exactly what your decision is based on.

What a nightmare! Imagine if whoever approved the contract says, “Oops, I didn’t even think of that.” Unacceptable. My lawyer friends assure me that the task here is to think of everything and be sure that what you write is precisely what you mean. No loopholes. Who gets chocolate and who doesn’t should be crystal clear from the description.

To avoid messing up, rely on your inner lawyer. Or, better yet, an outer one.

I hope you can see how important it is to use detailed legal descriptions with zero room for ambiguity. Detail is just as important in statistics.
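For a flavor of what an airtight description might look like, here is a hedged Python sketch of the chocolate-voucher population written as an explicit rule. Everything in it is a hypothetical choice made for illustration, not the answer to the questions raised above: the `Person` fields, the cutoff date, and the 30-minute usage threshold are all assumptions that a real decision-maker would have to set.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    has_own_account: bool         # signed up themselves, not borrowing a friend's login
    first_use: Optional[date]     # None if they have never used the product
    total_usage_minutes: float    # lifetime usage on their own account


# Hypothetical thresholds; a real definition needs the decision-maker's sign-off.
CUTOFF = date(2018, 9, 7)
MIN_USAGE_MINUTES = 30.0


def in_population(p: Person) -> bool:
    """Return True only for people the written definition unambiguously covers."""
    if not p.has_own_account:
        return False              # excludes use on a friend's account
    if p.first_use is None or p.first_use > CUTOFF:
        return False              # excludes never-users and self-declared future users
    # Excludes people who tried the product for a second just to score chocolate.
    return p.total_usage_minutes >= MIN_USAGE_MINUTES
```

The particular rules matter less than the fact that each one is written down: every edge case lands on exactly one side of the line, so nobody can argue about who is in the population after the fact.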

Icarus, don’t get hurt!

You opted for statistics because (1) your decision is important — otherwise you’d prefer data-mining for a faster path to inspiration — and (2) the data you have doesn’t cover all the entities you’re interested in, so you’re trying to make an Icarus-like leap from your sample to your population. If you can’t even specify where you’re leaping, expect a big splat! Any amount of vagueness makes your entire endeavor melt into nonsense. Pretty bad when we’re dealing with an important decision.

If you leave any wiggle room in the definition, you’ve set yourself up to fail.

Despite all this obviousness, I keep seeing decision-makers write nothing but “all users” when framing their decisions. That’s just plain sloppy. In a real project, the population description involves plenty of fine print. Alas, decision-makers don’t always realize that thinking deeply about this is their job.

Advice for those who work with decision-makers

If you see a vague population description, set up a picket line until the decision-maker does their homework. The project isn’t ripe for fancy calculations yet.

When decision-makers don’t realize that thinking deeply is their job, remind them.

This goes beyond population definition. There are a lot of tasks the decision-maker has to complete before your math can be useful. Spending all weekend rigorously chasing down some half-baked question a decision-maker drops on your desk is a well-known rookie mistake, but I see so many junior data scientists falling for it repeatedly.

All the statistical effort you’re tempted to put in makes no sense until the decision-maker’s homework is done.

When the decision really matters, why not brainstorm your population definition with your buddy from Legal? Chances are that they’re better than you at finding the population loopholes your study is about to die of. (Photo credit: Victoria Jones/PA. Also, if you like fascinating outfits, the British legal system delivers.)

Advice for decision-makers

Ask your buddy from Legal to help you out — they’re probably better at thinking through your population definition than you are. Law school might not call it statistical thinking, but it teaches this bit better than a stats PhD program does.

For the DIY version, rely on your inner lawyer: next time you’re defining a population, ask yourself, “Is it airtight? Would a lawyer put their stamp of approval on this… or should I go think about it a little harder?”

Now that you’re au fait with populations, you’re ready to take my little self-test of statistical savvy.