Very early in my PhD program, I began formulating and testing hypotheses. One of the first challenges I had to overcome was understanding the p-value.
P-values are one of those tricky statistical measures that often create more confusion than clarity among researchers.
There have been multiple papers discussing this issue; one in particular is by Lytsy et al. (2022). They found that only 12% of professional statisticians and epidemiologists correctly interpreted a statistically significant result.
What p-values Actually Measure
One of the big issues is that many researchers, and even statisticians, treat the p-value as a measure of evidence for a model or hypothesis, which is not what it provides.
In essence, p-values are used to decide whether to reject, or fail to reject, a null hypothesis.
Take a coin flip. We generally expect a coin to have a 50/50 chance of landing on heads or tails. But if the coin comes up heads 9 times in 10 flips, you start wondering whether it is rigged.
The null hypothesis is your default assumption: the coin is fair. Using a p-value, we can answer this question: if the coin were truly fair, how likely is it that I'd see 9 heads out of 10 just by random chance?
A p-value measures how surprising your results are under the assumption that nothing unusual is happening.
But researchers need a cutoff to decide whether a result is just random variation. The most common cutoff in the sciences, called alpha (α), is 0.05.
p < 0.05 has been the gold standard. Many areas of science and statistics treat that number as the definitive line between noise and discovery.
Back to the coin example: a p-value of 0.05 or greater would fail to reject the null hypothesis. Crucially, that doesn't prove the coin is fair. It just means there isn't enough evidence to say it isn't.
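The coin example can be worked out exactly with a two-sided binomial test. Here is a minimal sketch using only the Python standard library; the choice of a two-sided test is my assumption, since a rigged coin could favor either side:

```python
from math import comb

# Exact two-sided binomial test: 9 heads in 10 flips of a supposedly fair coin.
n, k, p_fair = 10, 9, 0.5

# Probability of each possible head count under the null hypothesis (fair coin).
pmf = [comb(n, i) * p_fair**n for i in range(n + 1)]

# Two-sided p-value: sum the probabilities of all outcomes at least as
# unlikely as the one we observed.
p_value = sum(prob for prob in pmf if prob <= pmf[k])

print(f"p-value = {p_value:.4f}")  # ≈ 0.0215
```

Here the p-value comes out around 0.0215, below the 0.05 cutoff, so in this particular case we would reject the null hypothesis and start suspecting the coin.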
This 0.05 threshold was suggested by Ronald Fisher in the 1920s. Over the last century, statisticians and scientists have argued that the number is arbitrary rather than a gold standard.
Vidgen and Yasseri explained that roughly 1 in 3 "statistically significant" findings are false positives.
If Not p-value, What Then?
In data science, a single metric rarely provides a definitive answer. It has to be weighed against other evidence.
For instance, accuracy is not the only number used to judge whether a model is performing well. You also have to check F1-score, recall, ROC-AUC, and so on.
The p-value is no different. It is a starting point, but it should not be the only evidence used to evaluate a hypothesis.