Note: This is the outcome of weeks of effort spent trying to understand, at a deeper level, material I thought I had a handle on in graduate school and now look at askance. The more I looked at things, the more problematic they seemed. My initial concerns were sparked by the failure to replicate experimental work in psychology and other areas, which many consider a crisis. I am one of the many; I did my graduate work in psychology. Statistical malpractice may be one of the causative factors, and some problems may be due to conceptual incoherence in the models. Perhaps this is above my pay grade, but these concerns have been raised by many with stellar credentials.

Introduction

Frequentist statistics, as shaped by Ronald Fisher, has been a dominant force in scientific inference for nearly a century. Fisher formalized and popularized the null hypothesis, p-values, and statistical significance, ideas that have shaped empirical research across numerous disciplines. However, while these methods have become deeply embedded in scientific practice, their conceptual foundations remain highly questionable.

This analysis does not focus on how Fisher’s methods have been applied or misused but instead scrutinizes their theoretical underpinnings. Specifically, it examines:

The null hypothesis as a theoretical device, questioning its logical status and its relationship to real-world causal inference.

The arbitrariness and meaning of p-values, evaluating their interpretive limitations and the assumptions required for their calculation.

The inference from sample to population, addressing the nebulous nature of "populations" in frequentist statistics and whether the model genuinely justifies generalization.

These concerns are not simply matters of misuse or misunderstanding but reflect fundamental conceptual problems that persist in the modern hybridized Fisher-Neyman-Pearson framework. This critique aims to provide a clear and rigorous examination of these issues, demonstrating why frequentist statistics, despite its historical dominance, is built on unstable theoretical ground.

A Conceptual Critique of Fisher’s Frequentist Statistical Framework

This analysis strictly examines the conceptual foundations of Fisher’s frequentist approach—not its applications, misuses, or practical consequences. The goal is to determine whether the framework provides a rational basis for inference and whether its assumptions hold up under scrutiny.

The critique focuses on three major conceptual issues:

1. The Null Hypothesis: A Theoretical Device That Misrepresents Causality

Fisher’s null hypothesis is not a genuine scientific hypothesis but an artificial statistical device used to structure hypothesis testing. Conceptually, it assumes:

Two Identical Groups (A and B) – The groups differ only in the treatment applied, with all other variables held constant or assumed to be random.

All Variability is Either Due to Treatment or "Random" Factors – If differences between the groups exist, they must be due to either the treatment itself or unexplained random variation.

Causal Effects are Ignored in Favor of a Strawman Baseline – The null hypothesis asserts, as a strawman, that there is no causal relationship between treatment and outcome, meaning any differences are entirely random.

This setup raises serious conceptual issues:

Causality is Side-Stepped – The framework does not actually evaluate causal relationships, only whether observed differences are surprising under a particular statistical model.

The "No Effect" Assumption is Arbitrary – The assumption of exactly zero effect is not derived from the data; it is imposed as a default, regardless of whether it makes sense in a given context.

Randomness is an Unexamined Black Box – Fisher treats unexplained variability as “random,” but what does this mean? Does it reflect unknown causal factors? Measurement noise? A deeper structure we have failed to account for? The framework never resolves this.

Thus, the null hypothesis is not epistemically incoherent, but it is an absurd and contrived assumption in most real-world research contexts. It is a device to facilitate statistical testing, not a meaningful claim about reality.

2. p-Values: Arbitrary, Mechanistic, and Epistemically Misleading

A p-value is calculated using the following process:

Choose a Test Statistic – A numerical summary of the data (e.g., a t-score or z-score) is computed from the observed sample.

Assume a Probability Model – Under the null hypothesis, the test statistic is taken to follow a particular probability distribution (e.g., the standard normal or a t-distribution).

Compute the Tail Probability – The p-value is the probability of obtaining a result as extreme as, or more extreme than, the observed one, assuming the null hypothesis is true.
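The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements, the null mean, and the "known" standard deviation are all hypothetical, and a one-sample z-test stands in for whatever test a real study would use.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability P(|Z| >= |z|) for a standard normal Z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate in the tails.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical measurements (illustrative only, not from any real study).
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5]
null_mean = 5.0       # value asserted by the null hypothesis (assumed)
assumed_sigma = 0.5   # population standard deviation treated as known (assumed)

# Step 1: choose a test statistic (here, a z-score for the sample mean).
sample_mean = sum(sample) / len(sample)
standard_error = assumed_sigma / math.sqrt(len(sample))
z_score = (sample_mean - null_mean) / standard_error

# Step 2: assume a probability model (z is standard normal if the null holds).
# Step 3: compute the tail probability under that assumed model.
p_value = two_sided_p_from_z(z_score)

print(f"z = {z_score:.3f}, p = {p_value:.4f}")
```

With these made-up numbers the p-value comes out a little under 0.03. Every input to that number other than the data was supplied by assumption: the null mean, the standard deviation, and the normal model itself.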

This process introduces multiple conceptual flaws:

The p-Value is Not a Measure of a Hypothesis’ Truth – It does not tell us how likely it is that the null hypothesis is correct or incorrect, only how unusual the data would be if the null were true.
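A short piece of arithmetic makes the gap concrete. The sketch below asks what share of "significant" results would come from true nulls, given three inputs that are purely hypothetical: the fraction of tested effects that are real, the chance a real effect reaches significance (power), and the threshold. None of these figures comes from the p-value itself, which is the point.

```python
def share_of_false_alarms(prior_real: float, power: float, alpha: float) -> float:
    """Among results with p < alpha, the fraction for which the null is true."""
    false_positives = (1.0 - prior_real) * alpha  # true nulls crossing the threshold
    true_positives = prior_real * power           # real effects crossing the threshold
    return false_positives / (false_positives + true_positives)

# All three inputs are hypothetical and chosen only for illustration.
share = share_of_false_alarms(prior_real=0.10, power=0.80, alpha=0.05)
print(f"Share of 'significant' results with a true null: {share:.2f}")
```

Under these assumed inputs, roughly a third of the results that pass p < 0.05 involve a true null, even though each one individually clears the 5% bar. Change the assumed fraction of real effects and the answer changes, while the p-values stay exactly the same.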

The 0.05 Threshold is Arbitrary – There is no logical basis for treating p = 0.05 as a special cutoff; it is a historical accident.

Probability Distributions Are Assumed, Not Verified – The test assumes a probability model that may not reflect reality. If the data do not actually follow the assumed distribution, the entire calculation becomes meaningless.
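As a minimal sketch of how a wrong model corrupts the result, consider a z-test that assumes a standard deviation of 1.0 when the true value is 1.5 (both numbers are hypothetical). Even when the null is exactly true, the test then rejects far more often than the nominal 5%.

```python
import math

def two_sided_tail(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

def actual_rejection_rate(assumed_sigma: float, true_sigma: float,
                          critical_z: float = 1.96) -> float:
    """Rejection rate under a true null when the assumed sigma may be wrong."""
    # With a true null, the computed z is normal with SD true_sigma / assumed_sigma,
    # so rejection happens when a standard normal exceeds the rescaled cutoff.
    return two_sided_tail(critical_z * assumed_sigma / true_sigma)

nominal_rate = actual_rejection_rate(assumed_sigma=1.0, true_sigma=1.0)
inflated_rate = actual_rejection_rate(assumed_sigma=1.0, true_sigma=1.5)

print(f"model correct: {nominal_rate:.3f}")
print(f"model wrong:   {inflated_rate:.3f}")
```

In this toy setting the false-alarm rate rises from about 5% to nearly 20%, and nothing in the reported p-value signals that the assumed model was wrong.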

Statistical "Surprise" Does Not Mean Scientific Significance – A small p-value simply indicates that the data are somewhat surprising under a particular model, but that does not mean the effect is real, important, or causal.
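A quick calculation illustrates the point. Assume, hypothetically, a true effect of one hundredth of a standard deviation and a sample of one million observations, analysed with a simple z-test.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical inputs: a negligible effect measured on a very large sample.
effect_in_sd_units = 0.01   # one hundredth of a standard deviation (assumed)
sample_size = 1_000_000     # assumed

# For a one-sample z-test with unit SD, z grows with the square root of n.
z_score = effect_in_sd_units * math.sqrt(sample_size)
p_value = two_sided_p_from_z(z_score)

print(f"z = {z_score:.1f}, p = {p_value:.1e}")
```

The p-value here is vanishingly small, yet the effect is far too small to matter in most practical settings. The p-value tracks sample size as much as it tracks the magnitude of the effect.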

Thus, p-values are an arbitrary statistical convention rather than a meaningful inferential tool. The process of their calculation relies on assumptions that may have no basis in reality, and their interpretation is mechanical rather than epistemically justified.

3. Inference from Sample to Population: A Conceptually Dubious Leap

A key assumption in Fisher’s framework is that we can generalize from a sample to a larger population, but conceptually, this is deeply flawed:

The Population is Often Undefined – Fisherian statistics assumes the existence of a well-defined population from which the sample is drawn. But what exactly is this population? Future cases? A hypothetical infinite set? The framework never specifies this.

Generalization is Assumed, Not Justified – The leap from sample to population is treated as a natural consequence of probability theory, when in fact it is a hidden assumption embedded within the framework.

Statistical Models Presuppose Structure That May Not Exist – The entire method depends on assuming that the sample reflects a stable probability distribution. But in many domains, especially the social sciences and medicine, this assumption is ungrounded.

Thus, the method does not actually justify inference—it assumes inference is valid from the start. The leap from finite data to general claims is not derived from the method itself but imposed as an unquestioned convention.

Conclusion

Fisher’s framework, though mathematically structured, is conceptually fragile.

The null hypothesis is a theoretical contrivance that misrepresents causality and imposes an arbitrary assumption of no effect.

p-values are mechanistic but lack interpretability, and their calculation relies on probabilistic assumptions that may not hold in reality.

The assumption that samples tell us something about a broader population is not justified—it is simply assumed.

These conceptual flaws persist today, even in the hybridized Fisher-Neyman-Pearson framework. The replication crisis suggests that these flaws may not be merely academic concerns but practical barriers to reliable inference in complex domains.

While Fisher’s methods may have been useful in controlled agricultural experiments, their fitness-for-purpose in broader scientific research is highly questionable.

