Null hypothesis significance tests: A mix-up of two different theories – the basis for widespread confusion and numerous misinterpretations
Jesper W. Schneider
Note: The paper is a difficult read, but the bottom line is that there are real problems with statistical analysis that are foundational and conceptual, not merely failures of implementation. Are these tests even fit for purpose in the areas where they matter most, e.g., biomedical research?
Danish Centre for Studies in Research and Research Policy,
Department of Political Science & Government, Aarhus University,
Bartholins Allé 7, DK-8000, Aarhus C, Denmark
jws@cfa.au.dk
Summary by ChatGPT
Schneider’s paper, Null hypothesis significance tests: A mix-up of two different theories – the basis for widespread confusion and numerous misinterpretations, provides a deep critique of null hypothesis significance testing (NHST), arguing that it is fundamentally a conflation of two distinct statistical traditions:
Fisher’s Significance Testing – An approach in which the null hypothesis is rejected when the observed data would be improbable under it, but which specifies no explicit alternative hypothesis and imposes no strict decision-making framework.
Neyman-Pearson Hypothesis Testing – A decision-theoretic model that sets up competing hypotheses (null and alternative), fixes Type I and Type II error rates in advance, and makes pre-specified accept/reject decisions based on those error rates. (The sketch after this list contrasts the two approaches on the same data.)
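To make the contrast concrete, here is a minimal sketch in Python. The data, sample size, effect size, and α are illustrative assumptions of mine, not examples from Schneider's paper.

```python
# A sketch contrasting the two traditions on the same simulated data.
# All numbers here (effect size, n, alpha) are illustrative assumptions.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.3, scale=1.0, size=50)  # true mean 0.3; H0: mean = 0

# Fisher: report the p-value itself as a graded index of how incompatible
# the data are with the null model; no alternative hypothesis, no fixed cutoff.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"Fisher-style report: t = {t_stat:.2f}, p = {p_value:.3f}")

# Neyman-Pearson: fix alpha (Type I error rate) and a specific alternative
# before seeing the data, check the test's power against that alternative,
# then make a binary accept/reject decision; the exact p-value is irrelevant.
alpha = 0.05
power = TTestPower().power(effect_size=0.3, nobs=50, alpha=alpha)
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"power against d = 0.3: {power:.2f} (Type II error beta = {1 - power:.2f})")
print(f"Neyman-Pearson decision at alpha = {alpha}: {decision}")
```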
Core Argument
Schneider contends that NHST, as it is commonly practiced, is an incoherent hybrid of these two approaches. The Fisherian model was intended as a flexible, exploratory tool rather than a rigid decision-making framework. Neyman-Pearson testing, by contrast, was designed for repeated decision-making under risk, but it assumes that alternative hypotheses are formulated and error rates fixed before the data are seen. The problem arises because many researchers conflate the two models, leading to logical inconsistencies in how NHST results are interpreted.
Key Issues Identified
Misinterpretation of p-values
Many researchers treat the p-value as the probability that the null hypothesis is true given the data, which is incorrect.
Under Fisher’s model, a p-value indicates how rare data at least as extreme as those observed would be if the null were true; it is not the probability of the null itself (the simulation below illustrates this).
Under Neyman-Pearson, rejecting the null licenses a decision to act as if the alternative were true, with known long-run error rates; it does not confirm the alternative in any single study.
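A small simulation (my illustration, not an example from the paper) makes the first point concrete: when the null is true in every experiment, p-values below 0.05 still occur about 5% of the time, so a small p-value cannot be read as the probability that the null is true or false.

```python
# Simulate 10,000 experiments in which H0 (mean = 0) is true by construction.
# Under a true null, p-values are uniform on [0, 1], so ~5% fall below 0.05
# even though the null is false in exactly 0% of these experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n = 10_000, 30
p_values = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, size=n), popmean=0.0).pvalue
    for _ in range(n_experiments)
])
print(f"fraction of experiments with p < 0.05: {np.mean(p_values < 0.05):.3f}")  # ~0.05
```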
Confusion Between Significance and Practical Importance
Statistical significance (a small p-value) does not imply that an effect is meaningful in the real world.
With a large enough sample, even a trivially small effect becomes statistically significant, as the sketch below illustrates.
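A minimal sketch of that point, with illustrative numbers of my own choosing rather than data from the paper:

```python
# With a million observations, a true effect of 0.005 standard deviations --
# negligible by any practical standard -- yields a "highly significant" p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.005, scale=1.0, size=1_000_000)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
cohens_d = sample.mean() / sample.std(ddof=1)  # standardized effect size
print(f"p = {p_value:.2g}, Cohen's d = {cohens_d:.4f}")
# p is tiny, yet d is far below even the conventional "small" effect of 0.2.
```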
Lack of Proper Consideration of Alternative Hypotheses
Many NHST applications fail to specify a well-defined alternative hypothesis, which the Neyman-Pearson framework requires in order to control error rates and compute power.
Arbitrary Thresholds (e.g., α = 0.05)
The widespread use of p < 0.05 as a binary decision criterion lacks justification; it is an artifact of historical convention rather than logical necessity, and it turns nearly identical evidence into opposite conclusions (see the sketch below).
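A minimal illustration (my own construction, not from the paper): two test statistics that differ only slightly straddle the 0.05 cutoff and so yield opposite binary verdicts.

```python
# Two t statistics that are almost identical sit on opposite sides of the
# conventional cutoff (the two-sided critical value at df = 29 is ~2.045),
# so a hard p < 0.05 rule declares one "significant" and the other not.
from scipy import stats

df = 29
for t in (2.04, 2.05):
    p = 2 * stats.t.sf(t, df)  # two-sided p-value
    verdict = "reject H0" if p < 0.05 else "do not reject H0"
    print(f"t = {t:.2f}, p = {p:.4f} -> {verdict}")
```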
Misuse in the Social Sciences and Empirical Research
NHST has become a ritualistic practice in many fields, where researchers blindly apply it without understanding its assumptions or limitations.
It often substitutes for proper statistical reasoning, leading to an overemphasis on obtaining "significant" results rather than engaging in genuine hypothesis-driven research.
Examples from Scientometric Literature
Schneider illustrates these issues by examining how NHST has been misapplied in the field of scientometrics, showing how it leads to misinterpretations of citation distributions and impact factors. These cases exemplify the broader problems seen across multiple disciplines.
Recommendations
Abandon or Reform NHST – Researchers should either abandon NHST in favor of more informative approaches (e.g., confidence intervals, Bayesian methods, or effect size estimation) or at least recognize its limitations.
Education on Proper Statistical Thinking – Statistical training should emphasize the conceptual foundations of hypothesis testing rather than merely teaching NHST as a mechanical procedure.
Move Toward Bayesian Approaches – Bayesian inference, which allows direct statements about the probability of hypotheses, may offer a more coherent alternative. (A sketch of both the estimation and the Bayesian alternatives follows this list.)
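To illustrate those alternatives, here is a minimal sketch. The data, the known-variance likelihood, and the Normal(0, 1) prior are assumptions of mine for demonstration, not methods prescribed by the paper.

```python
# Estimation and a simple Bayesian analysis of the same simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.3, scale=1.0, size=50)
n, mean, sem = len(sample), sample.mean(), stats.sem(sample)

# Estimation: report the effect size and its 95% confidence interval,
# rather than a binary significant/non-significant verdict.
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")

# Bayesian: with a Normal(0, 1) prior on the mean and a known unit variance
# (assumptions for this sketch), the posterior is Normal, and we can state
# P(mean > 0 | data) directly -- a quantity no p-value provides.
prior_var, data_var = 1.0, 1.0
post_var = 1.0 / (1.0 / prior_var + n / data_var)
post_mean = post_var * (n * mean / data_var)
p_positive = 1.0 - stats.norm.cdf(0.0, loc=post_mean, scale=np.sqrt(post_var))
print(f"P(mean > 0 | data) = {p_positive:.3f}")
```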
Overall Significance
Schneider’s critique aligns with broader concerns about the misuse of NHST in empirical research. By showing that NHST is based on a conceptual misunderstanding—an incoherent merging of two incompatible frameworks—the paper adds weight to ongoing calls for reform in statistical practices across multiple disciplines.