P-Hacking: Finding "Significant" Results in Random Noise
A researcher tests whether eating chocolate is linked to winning the Nobel Prize. She finds a statistically significant correlation (p < 0.05). The paper gets published. The headline reads: "Study Finds Chocolate Consumption Linked to Nobel Prizes."
This sounds absurd, and it is. But it actually happened: a real paper in the New England Journal of Medicine in 2012, partly tongue-in-cheek, correlated countries' per-capita chocolate consumption with their Nobel laureates. The result was "statistically significant." It was also meaningless.
How is this possible? Because of a practice called p-hacking: testing many hypotheses and reporting only the ones that come up significant. If you test enough random relationships, some will pass the significance threshold by pure chance. That isn't a bug in the method. It's the definition of what p < 0.05 means.
What p < 0.05 Actually Means
A p-value of 0.05 means: if there were truly no effect, there would be a 5% chance of seeing data at least this extreme by random chance alone. So if you test 20 completely random, unrelated hypotheses, you should expect about one of them (20 × 0.05 = 1) to come up "significant" by chance.
The test isn't wrong. It's working as designed. The problem is in how we use it: testing many things, ignoring the misses, and spotlighting the hit.
Try It: Test Random Hypotheses
Below, we generate completely random data with no real effects. Every "food" and every "outcome" is pure noise. But watch how often a "statistically significant" result appears:
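The original page ran this experiment interactively; a minimal sketch in Python (using NumPy and SciPy, with made-up food and outcome names) shows the same thing. Every variable is pure noise, yet some correlations still cross the threshold:

```python
# Every "food" and "outcome" below is independent random noise.
# Any "significant" correlation is a false positive by construction.
# (Names are hypothetical, chosen for illustration.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

foods = ["chocolate", "coffee", "kale", "cheese", "wine"]
outcomes = ["nobel_prizes", "iq", "lifespan", "happiness"]

n = 100  # "participants" per test
significant = []
for food in foods:
    for outcome in outcomes:
        x = rng.normal(size=n)  # random "consumption" data
        y = rng.normal(size=n)  # random "outcome" data, independent of x
        r, p = stats.pearsonr(x, y)
        if p < 0.05:
            significant.append((food, outcome, p))

print(f"Tested {len(foods) * len(outcomes)} pure-noise hypotheses; "
      f"{len(significant)} came up 'significant'")
```

Rerun with a different seed and a different set of "findings" appears; the hits move around because there is nothing real to find.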
Run It Many Times
One experiment might get lucky (or unlucky). Run hundreds of experiments below, each testing your chosen number of hypotheses, and see how often at least one "significant" result appears:
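A sketch of that repetition, under the same assumptions as before (NumPy and SciPy, pure-noise data): run hundreds of experiments, each testing k hypotheses, and count how many experiments find at least one false positive.

```python
# How often does at least one of k pure-noise tests cross p < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def at_least_one_hit(k, n=50):
    """Run k independent correlation tests on pure noise."""
    for _ in range(k):
        x = rng.normal(size=n)
        y = rng.normal(size=n)
        _, p = stats.pearsonr(x, y)
        if p < 0.05:
            return True  # found a "significant" result in noise
    return False

experiments = 500
k = 20
hits = sum(at_least_one_hit(k) for _ in range(experiments))
print(f"{hits / experiments:.0%} of experiments found at least one "
      f"'significant' result among {k} pure-noise tests")
```

With k = 20, roughly two thirds of the experiments find "something," matching the formula derived in the next section.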
The Math Behind the Problem
If each test has a 5% false positive rate and the tests are independent, the probability of getting at least one false positive in k tests is:
P(at least one false positive) = 1 − (1 − 0.05)^k = 1 − 0.95^k
- 1 test: 5% chance of a false positive
- 5 tests: 23%
- 10 tests: 40%
- 20 tests: 64%
- 50 tests: 92%
- 100 tests: 99.4%
At 20 hypotheses, you have better than a coin flip's chance of finding something "significant" in pure noise. At 100, it's nearly guaranteed. And researchers routinely test hundreds of variables.
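The table above is just the formula evaluated at each k, which a few lines of Python can confirm:

```python
# P(at least one false positive) = 1 - 0.95^k for k independent tests
for k in [1, 5, 10, 20, 50, 100]:
    p_any = 1 - 0.95 ** k
    print(f"{k:>3} tests: {p_any:.1%} chance of at least one false positive")
```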
How This Happens in Practice
Deliberate fraud is rare. Most p-hacking is unconscious and well-intentioned:
- Flexible analysis: The researcher runs the analysis with and without outliers, with different subgroups, different time windows, and different control variables, then reports whichever version comes out significant.
- Optional stopping: Collecting data until the result becomes significant, then stopping. If you check after every 10 participants, you get many chances to hit p < 0.05 by chance.
- Publication bias: Journals publish significant results. Null results sit in drawers. So the published literature is a filtered set skewed toward false positives.
- Outcome switching: The study was designed to test X, but X wasn't significant. Y was. The paper reports Y as if it were the primary hypothesis all along.
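Optional stopping is easy to underestimate, so here is a minimal simulation of it (same assumptions as earlier sketches: NumPy, SciPy, pure-noise data). The study peeks at the p-value after every batch of 10 participants and stops the moment it crosses the threshold:

```python
# "Optional stopping": check for significance after every 10 participants
# and stop as soon as p < 0.05. With no real effect, the false positive
# rate inflates well past the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping(max_n=200, batch=10):
    """One pure-noise study with repeated peeking; True = false positive."""
    x, y = [], []
    while len(x) < max_n:
        x.extend(rng.normal(size=batch))
        y.extend(rng.normal(size=batch))
        _, p = stats.pearsonr(x, y)
        if p < 0.05:
            return True  # "significant" -- stop collecting and publish
    return False

studies = 1000
fp = sum(optional_stopping() for _ in range(studies))
print(f"False positive rate with peeking: {fp / studies:.0%} (nominal: 5%)")
```

Each peek is another chance to cross p < 0.05, so the effective false positive rate ends up several times the nominal 5%, even though every individual test is computed correctly.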
The Transferable Insight
This isn't just a problem for scientists. Anytime you search through data looking for patterns, you're p-hacking:
- A/B testing 50 variations of a landing page and declaring the "winner" without adjusting for multiple comparisons
- Looking at stock performance across 100 metrics and finding the one that "predicts" returns
- Reviewing employee performance across 20 dimensions and highlighting the one where a group differs
- Noticing that your team wins when you wear a particular shirt (after 50 games of looking for patterns)
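For the A/B-testing case, the standard fix mentioned above is a multiple-comparisons adjustment. The simplest is the Bonferroni correction: divide the significance threshold by the number of tests. A quick sketch of what it buys you with 50 noise variations:

```python
# Bonferroni correction: with k tests, require p < alpha/k instead of p < alpha.
k = 50              # landing-page variations tested
alpha = 0.05
bonferroni_alpha = alpha / k   # each test must clear 0.001, not 0.05

# Family-wise false positive rate, uncorrected vs corrected:
uncorrected = 1 - (1 - alpha) ** k
corrected = 1 - (1 - bonferroni_alpha) ** k
print(f"Chance of a false 'winner' among {k} noise variations: "
      f"{uncorrected:.0%} uncorrected, {corrected:.1%} with Bonferroni")
```

The correction restores the family-wise error rate to roughly 5%, at the cost of making each individual test much harder to pass; less conservative alternatives (e.g. false-discovery-rate methods) trade along the same axis.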
The habit worth building: when someone presents a surprising finding, ask how many things they tested before finding it. A single significant result from a single pre-registered hypothesis is meaningful. A single significant result cherry-picked from dozens of attempts is expected noise.
The number of questions you ask of the data is as important as the answer you get. Ask enough questions and the data will always tell you something, even when there's nothing there.