
P-Hacking: Finding "Significant" Results in Random Noise

2020-01-15

A researcher tests whether eating chocolate is linked to winning the Nobel Prize. She finds a statistically significant correlation (p < 0.05). The paper gets published. The headline reads: "Study Finds Chocolate Consumption Linked to Nobel Prizes."

This sounds absurd, and it is. But it actually happened (a real paper in the New England Journal of Medicine, 2012, partly tongue-in-cheek). The result was "statistically significant." It was also meaningless.

How is this possible? Because of a practice called p-hacking: testing many hypotheses and reporting only the ones that come up significant. If you test enough random relationships, some will pass the significance threshold by pure chance. That isn't a bug in the method. It's the definition of what p < 0.05 means.

What p < 0.05 Actually Means

A p-value of 0.05 means: "If there is truly no effect, there's a 5% chance we'd see data this extreme by random chance." So if you test 20 completely random, unrelated hypotheses, you should expect about one of them (20 × 0.05 = 1) to come up 'significant' by chance alone.

The test isn't wrong. It's working as designed. The problem is in how we use it: testing many things, ignoring the misses, and spotlighting the hit.

Try It: Test Random Hypotheses

Below, we generate completely random data with no real effects. Every "food" and every "outcome" is pure noise. But watch how often a "statistically significant" result appears:

[Interactive demo: run an experiment testing random hypotheses (defaults: 20 hypotheses, 30 samples, significance threshold 0.05).]
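If you're reading this without the interactive demo, here's a minimal sketch of the same experiment in Python: test 20 hypotheses, each comparing two groups of 30 samples of pure noise, using a simple two-sided permutation test. (The function names and parameters here are illustrative, not taken from the demo itself.)

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p(a, b, n_perm=1000):
    """Two-sided permutation test: how often does shuffled (label-free)
    data show a group difference at least as large as the observed one?"""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            extreme += 1
    return extreme / n_perm

# 20 hypotheses; each compares two groups of 30 draws of pure noise,
# so every "effect" we find is a false positive by construction.
n_hypotheses, n_samples, alpha = 20, 30, 0.05
significant = 0
for _ in range(n_hypotheses):
    group_a = [random.gauss(0, 1) for _ in range(n_samples)]
    group_b = [random.gauss(0, 1) for _ in range(n_samples)]
    if permutation_p(group_a, group_b) < alpha:
        significant += 1

print(f"{significant} of {n_hypotheses} random hypotheses came up 'significant'")
```

Run it a few times with different seeds: most runs turn up at least one "discovery" in data that is, by construction, nothing but noise.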

Run It Many Times

One experiment might get lucky (or unlucky). Run hundreds of experiments below, each testing your chosen number of hypotheses, and see how often at least one "significant" result appears:

[Interactive demo: run hundreds of experiments of 20 hypotheses each, tracking the number of experiments run, how many found a "significant" result, the percentage with a false discovery, and the mathematically expected rate.]
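The batch version can be simulated without generating raw data at all, using a standard statistical fact: under the null hypothesis, p-values are uniformly distributed on [0, 1]. This sketch (experiment counts are arbitrary) compares the observed false-discovery rate with the mathematical expectation:

```python
import random

random.seed(2)

def at_least_one_hit(n_hypotheses, alpha=0.05):
    """Under the null, each p-value is uniform on [0, 1],
    so drawing random.random() simulates one null test."""
    return any(random.random() < alpha for _ in range(n_hypotheses))

n_experiments, n_hypotheses = 1000, 20
hits = sum(at_least_one_hit(n_hypotheses) for _ in range(n_experiments))
expected = 1 - (1 - 0.05) ** n_hypotheses  # ≈ 64.2% for 20 hypotheses

print(f"Observed: {hits / n_experiments:.1%} of experiments had a false discovery")
print(f"Expected: {expected:.1%}")
```

The observed rate converges on the expected one as the number of experiments grows, which is the point: the "significant" findings aren't anomalies, they arrive right on schedule.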

The Math Behind the Problem

If each test has a 5% false positive rate and the tests are independent, the probability of getting at least one false positive in k tests is:

P(at least one false positive) = 1 − (1 − 0.05)^k = 1 − 0.95^k

At 20 hypotheses, you have better than a coin flip's chance of finding something "significant" in pure noise. At 100, it's nearly guaranteed. And researchers routinely test hundreds of variables.
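A quick check of the formula at a few values of k:

```python
# P(at least one false positive) = 1 - 0.95^k
for k in (1, 5, 20, 100):
    p = 1 - 0.95 ** k
    print(f"k={k:>3}: P(at least one false positive) = {p:.1%}")

# k=  1: 5.0%
# k=  5: 22.6%
# k= 20: 64.2%
# k=100: 99.4%
```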

How This Happens in Practice

Deliberate fraud is rare. Most p-hacking is unconscious and well-intentioned:

- Measuring several outcomes and reporting only the one that reached significance
- Collecting data until the result crosses the threshold, then stopping
- Trying different outlier exclusions, subgroup splits, or control variables until the analysis "works"

Each choice feels like a reasonable analytical decision in the moment; together they amount to running many tests and keeping the best one.

The Transferable Insight

This isn't just a problem for scientists. Anytime you search through data looking for patterns, you're p-hacking:

- Checking an A/B test against a dozen metrics and celebrating the one that moved
- Backtesting trading strategies until one beats the market historically
- Slicing a dashboard by segment until some group shows a striking trend

The habit worth building: when someone presents a surprising finding, ask how many things they tested before finding it. A single significant result from a single pre-registered hypothesis is meaningful. A single significant result cherry-picked from dozens of attempts is expected noise.

The number of questions you ask of the data is as important as the answer you get. Ask enough questions and the data will always tell you something, even when there's nothing there.