People who care about the integrity of the social sciences live in a well-connected world.
So, if you’re reading this, you’ve probably heard about the recent trouble with a rash of papers from the Cornell Food and Brand Lab.
If you haven’t, well, you have some catching up to do. Popular accounts are here (Slate), here (New York Magazine) and here (Andrew Gelman); the problems with specific papers are outlined here (the central pre-print) and here; responses are given or linked here and here.
Oh, and now — since yesterday — there’s this.
To sum it up quickly: the work from this lab is under a great deal of serious scrutiny at present. Several papers contain irregularities which are very difficult to explain.
But push all that onto the sideboard for now, because we’re going to talk about carrots.
Here you can see Jordan Anaya writing about the table reproduced below. It is from the paper “Attractive Names Sustain Increased Vegetable Intake in Schools” by Wansink et al. (2012). It presents a simple thesis: change the name of ‘carrots’ and ‘beans’ and ‘broccoli’ to something exciting that the kids are doing (I don’t know, ‘Buzz Lightyear chard’ or ‘Pokemon kale’ etc.) and children will eat more of it. The paper has 99 citations at present. Here are the results of Study 1:
Jordan’s objection to the above is extremely simple, and perfectly correct — if you take a number of anything [A], and then you eat some [B], and do not eat others [not B], then [A] will be equal to [B] + [not B].
If it isn’t, then you have made a mistake. In other words, take the first column above — 17.1 is not the sum of 11.3 and 6.7, and it should be.
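Could rounding explain the gap? A quick check says no. The sketch below uses only the three figures quoted above; the variable names and the "three figures, each off by at most 0.05" allowance are my own framing, not something from the paper.

```python
# The three figures quoted from the first column, each reported to one decimal place.
total, part_one, part_two = 17.1, 11.3, 6.7

# Each figure can be off by at most 0.05 due to rounding, so the two parts and
# the total can disagree by at most 3 * 0.05 before something else is wrong.
rounding_slack = 3 * 0.05

gap = abs((part_one + part_two) - total)
print(f"sum of parts = {part_one + part_two:.1f}, reported total = {total:.1f}, gap = {gap:.1f}")
print("explainable by rounding:", gap <= rounding_slack)
```

The gap is several times larger than rounding alone could produce.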
The rest of the paper has a thunderclap of similar errors.
But there’s another error that no-one has pointed out yet, and one I think you should see.
Also, frankly, it’s kind of funny.
To see this error, you’re going to have to meet a new friend of mine…
SPRITE stands for Sample Parameter Reconstruction via Iterative TEchniques. It’s a statistical technique similar to the GRIM test, for a few reasons:
To explain how it works, a story:
*********
Let’s say we have a scale between 1 and 7, onto which 20 people place an answer to the question “what do you think of carrots?”
On this scale, 1 is “pure white hot burning disgust” and 7 is “a love story beyond time, space and reality”.
Also, let’s say everyone puts 3. Hey, it could happen — this is a terribly designed scale. 3 corresponds to something like “yeah, carrots, sure, they’re alright…? I mean, whatever?”
(Note: this is actually how Americans talk)
AREN’T THEY EXCITING
So, here’s our sample:
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
Not very interesting, is it. This has a mean of 3 and an SD of, well, 0. There is no deviation, standard or otherwise.
Now, the hypothetical carrot-based researcher who collected this data is disappointed, and grumbles quietly into his copy of Excel 2010 “this isn’t the carrot-based glory I was hoping for”. It doesn’t look right. And it ruins his calculations. So he decides to change a few of the values on the sly, and jazz up his carrot-based life.
Here’s his first effort:
3,3,3,3,3,3,4,3,3,3,3,3,3,3,3,3,3,3,3,3
Then he remembers: he already mentioned to a colleague the average was exactly three. Bollocks. Ahh, but! If he changes one value up…
3,3,3,3,3,3,4,2,3,3,3,3,3,3,3,3,3,3,3,3
… he can just change another down! The mean is preserved, but the standard deviation starts to grow: now it’s 0.32. He grows bolder and starts swapping a whole cavalcade of values (one up, one down, one up, one down) and then suddenly:
2,3,1,2,4,3,5,2,2,2,7,3,2,3,6,1,3,2,3,4
Now, it’s looking more realistic: the mean is still 3, but the SD=1.56.
His nefarious work done, he goes for lunch in the cafeteria (it’s carrots).
*********
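The numbers in the story are easy to check for yourself. This is a minimal sketch using Python's standard library. It assumes the SD quoted is the sample standard deviation (dividing by n-1), which is the version that reproduces 0.32 and 1.56; the list names are just labels for this example.

```python
import statistics

# The three stages of the carrot researcher's data, copied from the story.
flat = [3] * 20
one_swap = [3, 3, 3, 3, 3, 3, 4, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
jazzed_up = [2, 3, 1, 2, 4, 3, 5, 2, 2, 2, 7, 3, 2, 3, 6, 1, 3, 2, 3, 4]

for label, sample in [("flat", flat), ("one swap", one_swap), ("jazzed up", jazzed_up)]:
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample SD, i.e. the n-1 denominator
    print(f"{label}: n={len(sample)}, mean={mean}, SD={sd:.2f}")
```

Every swap moves one value up and another down by the same amount, which is why the mean stays at 3 while the SD climbs.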
What SPRITE does, at the most basic level, is simple: it automates the above, and it’s very fast. The median time it takes to generate a sample with the properties described above (M=3.00, SD=1.56, n=20) is 7.37ms.
In other words, for any given mean, we shuffle the available values (very quickly) until we generate a sample with the parameters we’re interested in. Then we do it again, and again, and again. We find hundreds or thousands of plausible solutions. We can model them further if we want.
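To make the idea concrete, here is a toy version of that search. To be clear, this is not the actual SPRITE code, just a sketch of the "one up, one down" idea from the story. The function name `sprite_sample`, its parameters, the starting point and the pair-picking rule are all assumptions made for illustration, and it only handles whole-number responses with a sample SD.

```python
import random
import statistics


def sprite_sample(mean, sd, n, lo, hi, rng, max_steps=100_000):
    """Look for n integers in [lo, hi] matching mean and sample SD to 2 d.p.

    Returns a sorted list on success, or None if nothing is found.
    """
    total = round(mean * n)
    if round(total / n, 2) != round(mean, 2):
        return None  # no set of n integers has this mean
    # Start from the flattest sample with the right total (smallest possible SD).
    base, extra = divmod(total, n)
    values = [base + 1] * extra + [base] * (n - extra)
    if min(values) < lo or max(values) > hi:
        return None

    for _ in range(max_steps):
        current = statistics.stdev(values)
        if round(current, 2) == round(sd, 2):
            return sorted(values)
        i, j = rng.sample(range(n), 2)
        if values[i] < values[j]:
            i, j = j, i  # make i the index of the larger (or equal) value
        if current < sd:
            # One up, one down, moving the pair apart: mean unchanged, SD grows.
            if values[i] < hi and values[j] > lo:
                values[i] += 1
                values[j] -= 1
        elif values[i] - values[j] >= 2:
            # One down, one up, moving the pair together: mean unchanged, SD shrinks.
            values[i] -= 1
            values[j] += 1
    return None


# A few different solutions for the carrot-opinion scale (1 to 7).
for seed in range(3):
    print(sprite_sample(3.00, 1.56, 20, 1, 7, random.Random(seed)))
```

Different seeds give different samples with the same summary statistics, which is the whole point: you collect many of them and look at what they have in common.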
This lets us do a great deal of things.
And that brings us neatly back to carrots.
Go back and look at that table of values again. See the column on the right?
Mean=19.4, SD=19.9, n=45
Does that look a bit… funny? Maybe not. Let’s add a fact: you can’t have less than zero carrots (there are no negative carrots, this isn’t Star Trek).
So, let’s ask SPRITE: what are some plausible values for the maximum number of carrots taken by a single child? Using a nifty little script, we generate 500 plausible datasets (as before, mean=19.4, SD=19.9, no values below zero, and no upper limit) and record the maximum value for each dataset we generate.
Oh.
If you haven’t read the original paper, or have forgotten it by now, let me refresh your memory: our original values describe the number of carrots served to elementary school children (ages 8 to 11) at lunch in a control condition.
And, apparently, at least one of them is a Clydesdale horse.
HELLO I AM REGULAR HUMAN CHILD GIVE CARROTS NO
What other animal thinks it can eat 60 carrots? Also note: this is the control condition of an intervention to get schoolkids to eat MORE carrots. I think this child is sorted, frankly.
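You don't even need a simulation to see that somebody took a lot of carrots. This is a separate back-of-envelope check, not part of SPRITE. If no value is below zero and the largest is M, then every x² is at most M times x, so the sum of squares is at most M times the sum. The mean and SD fix both sums, which gives a floor for M. The sketch assumes the reported SD is a sample SD (n-1) and ignores rounding in the reported figures; the variable names are mine.

```python
import math

n, mean, sd = 45, 19.4, 19.9  # as reported; rounding of these inputs is ignored

# For values between 0 and a maximum M, each x*x <= M*x,
# so sum(x^2) <= M * sum(x). Both sums follow from n, mean and SD.
sum_x = n * mean
sum_x_squared = (n - 1) * sd**2 + n * mean**2
min_possible_max = sum_x_squared / sum_x

print(f"largest value must be at least {min_possible_max:.1f}")
print(f"in whole carrots: at least {math.ceil(min_possible_max)}")
```

That floor of roughly 40 carrots is only reached in the extreme case where every child took either nothing or the maximum. Any more ordinary spread of values pushes the biggest helping higher still.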
Even considering the size of ‘baby carrots’ (which are little orange carrot-tubes lathed off real carrots so they are geometric enough to be acceptable to Americans, who largely treat vegetables as a concept), this is a thoroughly confusing amount of vegetables. Especially for a child who is a) still reading Sweet Valley High books, and b) presumably eating other non-carrot foods as well.
Just how much carrot is in ’60 carrots’? I decided to find out.
Here are two packets of carrots.
One packet I carefully selected to have as many tiny fiddly carrots present as possible, by the simple act of ransacking the whole supermarket shelf and peering into the bags (“do you need any, uh, help sir?” “NO, CHILD! I SEEK TINY CARROTS!”).
The other packet looked rather more chunky. They’ll do as reasonable limits.
60 of the small ones, chosen at random:
About three quarters of a pound in imperial “measurement”
And 60 of the large ones was actually the whole packet’s worth — there were only 55 in there. I had to make up the difference with 5 of the small ones.
More than a pound.
And the final indignity, would you like to see what 60 normal carrots look like on a plate?
How’s your eyesight now?
Note: this is a full sized regular dinner plate, not a side plate.
Two further method notes:
Any inconsistency found in a scientific paper doesn’t mean much by itself. I’ve seen hundreds of inconsistencies in various papers, kicked them around, analysed them, wondered how they got there.
And you know what?
Most of them are just inconsistencies. They have no more troubling provenance than a busted margin or a dropped Oxford comma. They are process errors, with logical explanations, and minimal effects on the scientific conclusions drawn. Even this rather amusing horse-child/carrot based inconsistency might come to nothing. SPRITE might be wrong, I might have made a bad assumption, and so on. NEVER discount this possibility.
But.
What is always exponentially more problematic than an inconsistency is a PATTERN of inconsistencies, which is very much the situation when we take the existing issues and add The Case of the Carthorse Child.
It seems something went badly wrong with this paper. We don’t know what exactly — the data collection, the analysis, the measurement technique, or all of the above. We shall see.
Vegetable-based humour aside, this has also been a very brief introduction to how to use SPRITE to investigate sample properties, and the implications of those properties when we find them.
Expect to be hearing a lot more about that soon.
NOTE: For those of you who are interested, I will have working code, examples and all the technical details on SPRITE available soon. It is designed to be somewhat complementary to the GRIM and GRIM-SD tests, and will be presented in the same way as the original pre-print of the GRIM paper here. There are a wide variety of scenarios the technique can be used for, a series of improvements I’d like to add, and some other concerning results detected with it so far.
If you’re still reading at this point, there are two things you can do to help:
For both of these, the email is [email protected]. As always, happy to talk to you. And, as always, further scruffiness ensues at Facebook.
And, horror of horrors, I have activated my five-year-and-six-month-old Twitter account. I expect to use it to be awful to people.
P.S. No baby carrots were harmed in the making of this blog post.
P.P.S. Actually, that’s not true, I made the weird little buggers into soup.
P.P.P.S. I am very proud to be the first person to use the Medium tag ‘Carrots’