The Dangers of Too Much Data By FREAKONOMICS
Wondering whether aspirin will protect your heart or cause internal bleeding? Or whether you should kick your coffee habit or embrace it? It’s often hard to make sense of the conflicting advice that comes out of medical research studies. John Timmer explains that our statistical tools simply haven’t kept up with the massive amounts of data researchers now have access to. In medical (and economic) research, scientists claim a “statistically significant” finding if there’s a less than 5% chance that an observed pattern (between coffee and liver disease, for example) occurred at random. In the new age of data, that rule causes problems: “Even given a low tolerance for error, the sheer number of tests performed ensures that some of them will produce erroneous results at random.” In lay terms, all those new tests you get at the doctor’s office are translated into data sets, which researchers then pore over searching for connections and patterns. And, if you have enough data to examine, eventually you’ll find a statistically significant relationship where no such relationship actually exists — by sheer coincidence. (HT: Matthew Rotkis)
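The multiple-testing effect Timmer describes is easy to simulate. Below is a minimal sketch (not from the article; the study design is hypothetical): each "study" tests pure noise, so any significant result is a false positive. Run a thousand such studies at the 5% level and roughly fifty of them will look "significant" anyway.

```python
import math
import random

random.seed(1)

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def null_study(n=100):
    """One 'study': test whether the mean of pure noise differs from 0.

    The data are drawn from N(0, 1), so the null hypothesis is true
    by construction and the returned two-sided p-value should only
    fall below 0.05 about 5% of the time.
    """
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    z = mean * math.sqrt(n)            # z-statistic (known sigma = 1)
    return 2.0 * (1.0 - phi(abs(z)))   # two-sided p-value

num_studies = 1000
false_positives = sum(1 for _ in range(num_studies) if null_study() < 0.05)
print(false_positives, "of", num_studies, "null studies looked 'significant'")
```

Every one of those "significant" findings is a coincidence, which is exactly the article's point: with enough tests, coincidences are guaranteed.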
freakonomics.blogs.nytimes.com
Selected comments
@Dzof
I think the point is that if you looked at the set of people who read Freakonomics, the probability that they will also all eat spinach is very low, but the probability that they all do *something* (own a dog/use Firefox/go jogging three times a week/drink Pepsi) gets higher the more things you compare them against.
I.e. groups will sometimes overlap strongly even when there is no reason for it, if enough groupings are measured. I think it’s just the standard correlation != causation thing, restated. — Robert Grant
# 5. March 8, 2010 10:49 am
Dzof,
The problem isn’t with data sets containing many data points. Those data are, and will always be, the most statistically reliable.
The point of the article is that we are generating a lot of data sets with a modest number of data points. If you have thousands of data sets, at least a few of them will erroneously show a statistical relationship at the 5% significance level by pure chance. — johnd
@Dzof: It’s more like this: if you have data for readers of thousands of books, and what they eat, and you run enough studies, eventually you are going to find a “statistically significant” correlation between readers of one book (say, Freakonomics) and eaters of some food (say, spinach). But it may simply be one of those 5% of cases where the pattern actually did occur at random. — kip |
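kip's "run enough studies" scenario can be made precise with a one-line calculation. Assuming the studies are independent and every null hypothesis is true (an idealization; real studies are rarely independent), the chance of at least one spurious "significant" hit grows quickly with the number of tests m:

```python
# Family-wise false-positive probability at the 5% level, assuming
# m independent tests where no real relationship exists.
# P(at least one fluke) = 1 - P(no flukes) = 1 - 0.95^m
for m in (1, 10, 100, 1000):
    p_any = 1 - 0.95 ** m
    print(f"{m:5d} tests -> P(at least one 'significant' fluke) = {p_any:.3f}")
```

By m = 100 book-and-food comparisons the probability is already above 99%, so finding a Freakonomics-and-spinach "link" somewhere is close to certain.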