
by James Heathers, March 1st, 2017

People who care about the integrity of the social sciences live in a well-connected world.

So, if you’re reading this, you’ve probably heard about the recent trouble with a rash of papers from the Cornell Food and Brand Lab.

If you haven’t, well, you have some catching up to do. Popular accounts are here (Slate), here (New York Magazine) and here (Andrew Gelman); the problems with specific papers are outlined here (the central pre-print) and here; responses are given or linked here and here.

Oh, and now — since yesterday — there’s this.

To sum it up quickly: the work from this lab is under a great deal of serious scrutiny at present. There are irregularities present in several papers which are very difficult to explain.

But push all that onto the sideboard for now, because we’re going to talk about carrots.

Here you can see Jordan Anaya writing about the table reproduced below. It is from the paper: “Attractive Names Sustain Increased Vegetable Intake in Schools” by Wansink et al. (2012). It presents a simple thesis: change the name of ‘carrots’ and ‘beans’ and ‘broccoli’ to something exciting that the kids are doing (I don’t know, ‘Buzz Lightyear chard’ or ‘Pokemon kale’ etc.) and children will eat more of it. The paper has 99 citations at present. Here are the results of Study 1:

Jordan’s objection to the above is extremely simple, and perfectly correct — if you take a number of anything [A], and then you eat some [B], and do not eat others [not B], then [A] will be equal to [B] + [not B].

If it isn’t, then you have made a mistake. In other words, take the first column above: 17.1 is not the sum of 11.3 and 6.7 (which is 18.0), and it should be.

The rest of the paper has a thunderclap of similar errors.

*But there’s another error that no-one has pointed out yet, and one I think you should see.*

*Also, frankly, it’s kind of funny.*

To see this error, you’re going to have to meet a new friend of mine…

**SPRITE** stands for **S**ample **P**arameter **R**econstruction via **I**terative **TE**chniques. It’s a statistical technique similar to the GRIM test, for a few reasons:

- it uses descriptive statistics to investigate the properties of published datasets
- it is extremely simple, and capable of being automated
- I feel faintly embarrassed talking about it, because I’m not sure if it’s important or novel… I only have enough confidence to discuss it because I’ve found so many fun things with it

To explain how it works, a story:

*********

Let’s say we have a scale between 1 and 7, onto which 20 people place an answer to the question “what do you think of carrots?”

On this scale, 1 is “pure white hot burning disgust” and 7 is “a love story beyond time, space and reality”.

Also, let’s say everyone puts 3. Hey, it could happen — this is a terribly designed scale. 3 corresponds to something like “yeah, carrots, sure, they’re alright…? I mean, whatever?”

(Note: this is actually how Americans talk)

AREN’T THEY EXCITING

So, here’s our sample:

3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3

Not very interesting, is it. This has a mean of 3 and an SD of, well, 0. There is no deviation, standard or otherwise.

Now, the **hypothetical** carrot-based researcher who collected this data is disappointed, and grumbles quietly into his copy of Excel 2010 “this isn’t the carrot-based glory I was hoping for”. It doesn’t look right. And it ruins his calculations. So he decides to change a few of the values on the sly, and jazz up his carrot-based life.

Here’s his first effort:

3,3,3,3,3,3,**4**,3,3,3,3,3,3,3,3,3,3,3,3,3

Then he remembers: he already mentioned to a colleague the average was *exactly three*. Bollocks. Ahh, but! If he changes one value up…

3,3,3,3,3,3,**4**,**2**,3,3,3,3,3,3,3,3,3,3,3,3

… he can just change another down! The mean is preserved, but the standard deviation starts to grow, now it’s 0.32. He grows bolder and starts swapping a whole cavalcade of values — one up, one down, one up, one down, and then suddenly:

2,3,1,2,4,3,5,2,2,2,7,3,2,3,6,1,3,2,3,4

Now, it’s looking more realistic: the mean is still 3, but the SD=1.56.

His nefarious work done, he goes for lunch in the cafeteria (it’s carrots).

*********
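The numbers in the story are easy to check. Here is a minimal Python check, assuming the SDs quoted are sample standard deviations (the n-1 kind); that assumption seems safe, because the population version would give 1.52 for the last sample rather than 1.56. The variable names are made up for illustration.

```python
from statistics import mean, stdev

# The three stages of our hypothetical researcher's handiwork.
flat = [3] * 20
nudged = [3] * 20
nudged[6], nudged[7] = 4, 2  # one value up, one value down
jazzed = [2, 3, 1, 2, 4, 3, 5, 2, 2, 2, 7, 3, 2, 3, 6, 1, 3, 2, 3, 4]

for label, sample in [("flat", flat), ("nudged", nudged), ("jazzed", jazzed)]:
    # stdev() is the sample standard deviation (divides by n - 1).
    print(label, mean(sample), round(stdev(sample), 2))
```

The mean stays at 3 throughout, while the SD goes from 0 to 0.32 to 1.56.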

What SPRITE does, at the most basic level, is simple — it automates the above, and it’s very fast. The median time it takes to generate a sample with the properties described above (M=3.00,SD=1.56,n=20) is 7.37ms.

In other words, for any given mean, we shuffle the available values (very quickly) until we generate a sample with the parameters we’re interested in. Then we do it again, and again, and again. We find hundreds or thousands of plausible solutions. We can model them further if we want.
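The shuffling described above can be sketched in a few lines of Python. To be clear, this is not the SPRITE code itself (the note at the end says that is still to come); it is a guess at the simplest version of the idea, and the function name `sprite_sketch`, its parameters, and the step limit are all invented for illustration. It assumes whole-number responses and a sample SD.

```python
import random
from statistics import stdev

def sprite_sketch(mean, sd, n, lo, hi, max_steps=100_000, rng=None):
    """Hypothetical SPRITE-style search: n whole numbers in [lo, hi] whose
    mean and sample SD match the targets to two decimals, or None."""
    rng = rng or random.Random()
    total = round(mean * n)
    if round(total / n, 2) != round(mean, 2):
        return None  # no whole-number sample can have this mean
    # Start as flat as possible, with the sum (and so the mean) already right.
    base, extra = divmod(total, n)
    values = [base + 1] * extra + [base] * (n - extra)
    if min(values) < lo or max(values) > hi:
        return None
    for _ in range(max_steps):
        current = stdev(values)
        if round(current, 2) == round(sd, 2):
            return sorted(values)
        i, j = rng.sample(range(n), 2)
        if values[i] + 1 > hi or values[j] - 1 < lo:
            continue  # the swap would leave the scale
        # One up, one down keeps the mean fixed. The sum of squares changes
        # by 2 * (values[i] - values[j]) + 2, so the spread widens when
        # values[i] >= values[j] and otherwise narrows or stays put.
        widens = values[i] >= values[j]
        if widens == (current < sd):
            values[i] += 1
            values[j] -= 1
    return None

# The carrot-opinion example: M=3.00, SD=1.56, n=20 on a 1-to-7 scale.
print(sprite_sketch(mean=3.00, sd=1.56, n=20, lo=1, hi=7))
```

Each call with a fresh random state gives another plausible sample; calling it in a loop is the "again, and again, and again" part.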

This lets us do a great deal of things.

- At its most basic level, we can give SPRITE a mean and a standard deviation, and it will find a distribution which fits those values. This is good if we just want to check what the data *could* look like. We can fit a distribution to those multiple solutions we get out easily.
- We can easily give SPRITE a mean and standard deviation, *and a series of restrictions*, like “no values below 1, and no values above 7”, or “only use whole numbers”.
- We can let SPRITE find realistic limits of a sample — if we feed it a mean, SD, and restrictions, we can find where samples might have maxima or minima.
- With a few tiny modifications, we can let SPRITE find medians and interquartile ranges for us.
- Speaking of ranges, we can incorporate those easily.
- We can determine cell sizes from sample sizes, and vice versa.
- We can approximate simple results from ANOVA, t-test, Chi-squared, and so on, by comparing multiple SPRITE distributions that produce the right test statistics.
- We can determine rounding errors by comparing exact test statistics or p-values to sample values.
- And, naturally, we can do all of the above, and then look at the values we have generated, and apply *common sense*.

And that brings us neatly back to carrots.

Go back and look at that table of values again. See the column on the right?

**Mean=19.4, SD=19.9, n=45**

Does that look a bit… funny? Maybe not. Let’s add a fact: *you can’t have less than zero carrots* (there are no negative carrots, this isn’t Star Trek).

So, let’s ask SPRITE: what are some plausible values for the *maximum* number of carrots taken by a single child? Using a nifty little script, we generate 500 plausible datasets (as before, mean=19.4, SD=19.9, no values below zero, and no upper limit) and record the maximum value for each dataset we generate.

Oh.

If you haven’t read the original paper, or have forgotten it by now, let me refresh your memory: our original values describe *the amount of carrots served to elementary school children (ages 8 to 11) at lunch in a control condition*.

And, apparently, at least one of them is a Clydesdale horse.

HELLO I AM REGULAR HUMAN CHILD GIVE CARROTS NO

What other animal thinks it can eat 60 carrots? Also note: this is in the *control* intervention to get schoolkids to eat MORE carrots. I think this child is sorted, frankly.
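For anyone who wants to try the maxima exercise themselves, here is a rough Python sketch of it. It is not the nifty little script mentioned above, and the maxima it produces will depend on how the search wanders. It assumes carrots come in whole numbers, so that 45 children with a mean of 19.4 took 873 carrots in total, and it only asks that the sample SD rounds to 19.9. The function name `carrot_dataset`, the step limit and the seed are arbitrary choices.

```python
import math
import random

def carrot_dataset(total, n, sd_target, lo, rng, max_steps=200_000):
    """Hypothetical search: n whole-number counts >= lo that sum to total and
    whose sample SD rounds to sd_target (one decimal). None if not found."""
    base, extra = divmod(total, n)
    values = [base + 1] * extra + [base] * (n - extra)
    sum_sq = sum(v * v for v in values)  # tracked incrementally for speed
    for _ in range(max_steps):
        sd = math.sqrt((sum_sq - total * total / n) / (n - 1))
        if round(sd, 1) == sd_target:
            return values
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j or values[j] - 1 < lo:
            continue  # need two different children, and no negative carrots
        # Give child i one more carrot and child j one fewer: the total is
        # unchanged, and the sum of squares moves by 2 * (vi - vj) + 2.
        widens = values[i] >= values[j]
        if widens == (sd < sd_target):
            sum_sq += 2 * (values[i] - values[j]) + 2
            values[i] += 1
            values[j] -= 1
    return None

rng = random.Random(2017)  # fixed seed so reruns give the same datasets
datasets = [carrot_dataset(873, 45, 19.9, 0, rng) for _ in range(500)]
maxima = [max(d) for d in datasets if d is not None]
print(len(maxima), min(maxima), max(maxima))
```

One thing here does not depend on the search at all. For whole-number counts between 0 and some maximum M, every x² is at most M times x, so with the total fixed at 873 the sum of squared deviations is at most 873 × (M - 19.4). An SD that rounds to 19.9 with n=45 needs that sum to be at least about 17,337, which forces M to be 40 or more. Under these assumptions, every dataset consistent with the reported numbers has at least one child on 40 or more carrots.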

Even considering the size of ‘baby carrots’ (which are little orange carrot-tubes lathed off real carrots so they are geometric enough to be acceptable to Americans, who largely treat vegetables as a concept), this is a thoroughly confusing amount of vegetables. Especially for a child who is a) still reading Sweet Valley High books, and b) presumably eating other non-carrot foods as well.

Just how much carrot is in ’60 carrots’? I decided to find out.

Here are two packets of carrots.

One packet I carefully selected to have as many tiny fiddly carrots present as possible, by the simple act of ransacking the whole supermarket shelf and peering into the bags (“do you need any, uh, help sir?” “NO, CHILD! I SEEK TINY CARROTS!”).

The other packet looked rather more chunky. They’ll do as reasonable limits.

60 of the small ones, chosen at random:

About three quarters of a pound in imperial “measurement”

And 60 of the large ones was actually the whole packet’s worth — there were only 55 in there. I had to make up the difference with 5 of the small ones.

More than a pound.

And the final indignity, would you like to see what 60 normal carrots look like on a plate?

How’s your eyesight now?

Note: this is a full sized regular dinner plate, not a side plate.

Two further method notes:

- According to the draft of this paper, located on the internet for me by Data Police Cadet Brown, these carrots were not TAKEN — as in, That One Kid Who Always Makes Things Difficult (we all know him) did not simply take four enormous handfuls of carrots and return to his lunch table, carrots spilling from every pocket and orifice — they were SERVED. So, someone with a serving spoon the size of a saucepan looked a child in the eye and said “*here is your pound of requested carrots, wee Cindy… make sure you remember to brush your mane*”.
- These carrots were also, apparently, surreptitiously weighed in a small dish. Well, it’d be a heaving small dish.

Any inconsistency found in a scientific paper doesn’t mean much by itself. I’ve seen hundreds of inconsistencies in various papers, kicked them around, analysed them, wondered how they got there.

And you know what?

*Most of them are just inconsistencies*. They have no more troubling provenance than a busted margin or a dropped Oxford comma. They are process errors, with logical explanations, and minimal effects on the scientific conclusions drawn. Even this rather amusing horse-child/carrot based inconsistency might come to nothing. SPRITE might be wrong, I might have made a bad assumption, and so on. *NEVER discount this possibility*.

**But.**

What is always exponentially more problematic than an inconsistency is a PATTERN of inconsistencies, which is very much the situation when we take the existing issues and add The Case of the Carthorse Child.

It seems something went badly wrong with this paper. We don’t know what exactly — the data collection, the analysis, the measurement technique, or all of the above. We shall see.

Vegetable-based humour aside, this has also been a *very* brief introduction to how to use SPRITE to investigate sample properties, and the implications of those properties when we find them.

Expect to be hearing a lot more about that soon.

**NOTE: For those of you who are interested, I will have working code, examples and all the technical details on SPRITE available soon. It is designed to be somewhat complementary to the GRIM and GRIM-SD tests, and will be presented in the same way as the original pre-print of the GRIM paper here. There are a wide variety of scenarios the technique can be used for, a series of improvements I’d like to add, and some other concerning results detected with it so far.**

If you’re still reading at this point, there’s two things you can do to help:

- *make suggestions*, if you’re a comp sci, statistical or engineering type — my code is unwieldy and awful, and you might have a better idea of how to approach this
- *send me any weird results you find in papers*, and if it’s possible, we’ll see what SPRITE has to say about them

For both of these, the email is [email protected]. As always, happy to talk to you. And, as always, further scruffiness ensues at Facebook.

And, horror of horrors, I have activated my five-year-and-six-month old Twitter account. I expect to use it to be awful to people.

P.S. No baby carrots were harmed in the making of this blog post.

P.P.S. Actually, that’s not true, I made the weird little buggers into soup.

P.P.P.S. I am very proud to be the first person to use the Medium tag ‘Carrots’
