Author’s Preface

Looking back at my own statistical training in graduate school, I have begun to question the whole research paradigm that is based upon statistics. I have grave doubts that these methods tell us anything reliable about the world in some domains. I base this view upon many writings on the use and misuse of statistics, which has perhaps contributed in some measure to the failure of research findings to replicate.

The Evolution of a Methodologist

As a young man, a student, decades ago, I held a naive, misguided belief that if one adopted just the right methods of work, the ills of the world could be solved. I started with methods of psychological research. Later on, abandoning psychology, I worked in information systems development in many roles over the years.

At one time, I was, through very suspect means, made Acting Director of Quality Assurance in a small governmental information systems shop. Later on, a decade later or so, I had the words "methods and procedures" in my job title. Throughout that time, I looked into various methods that were of more interest to me than actual experimental work in psychology or actual development work in information systems. I became that most dreaded of beasts, a methodologist.

A Methodologist’s Reputation

There was a joke a few decades back—not a very good joke—and it went along the lines of: "What's the difference between a methodologist and a terrorist?" The punchline was, "You can’t negotiate with a terrorist." I resented that remark, or did I resemble it? I'm not sure which. (Aside: an opaque reference to an old Three Stooges sketch.)

Discovering Fitness for Purpose

About the time that I was made Acting Director of Quality Assurance through some dubious process, I had been researching (or came to research, I'm not sure of the order now) the literature on quality assurance and total quality management, looking at the works of Deming and his acolytes, Crosby, and I'm sure quite a few others. I read quite a bit, trying to understand it. In the process, I came across the term "fitness for purpose," which I accepted without any real analysis.

Now, I'm looking at it in terms of analysis because I have very great concerns about the fitness for purpose of many of the methods we use in scientific research in complex areas, such as biomedicine, nutrition, psychology itself, economics, and other soft sciences. Some of my concerns are derived from a study of the literature—some of the readings anyway.

Conceptual and Practical Concerns

There are conceptual concerns with the foundations of the methods, and practical concerns with their application. Do they really work? If so, where, when, and for whom? A 5W and H analysis (who, what, when, where, why, and how) is useful here. So I have now had occasion to look at what "fitness for purpose" really means.

First, we need to define the purpose, or purposes, of whatever method we're talking about. Then, we must define what it means to be fit—and that's contextual. Again, we'll go back to the 5W and H analysis of these things—the fitness for purpose. This may be complex, may be domain-dependent, and may be dependent on any number of factors.

Two Aspects of Fitness

It has occurred to me that, in addition to that, we have two aspects of fitness:

Deity-Level Fitness: If the method were applied perfectly by the deity, whoever he or she may be, would it work reliably to give us the results we want to satisfy our purpose?

Human-Level Fitness: Is the tool actually able to be used appropriately by average human beings or even highly trained human beings? Or is it just too hard to do? Are there assumptions of the method that we cannot actually verify in practice? That is certainly true of statistics.

Assessing Fitness for Purpose

Another aspect is: how are we going to know if it's fit for purpose or not? Circular justifications, such as seem to be involved in a lot of simulation studies, don't cut it. We need empirical, real-world tests. And that's where we come to the replication crisis, which may, in some measure, be due to a fundamental lack of fitness for purpose of statistics, both in terms of its theoretical fitness (deity level) and its fitness in practice (flawed humanity level).

Statistics and Fitness for Purpose

We are not assessing fitness for purpose using statistics; rather, we are evaluating whether statistics themselves are fit for purpose. The replication crisis highlights significant concerns regarding statistical methods in certain domains, where their applicability may be questionable.

We will avoid the misconception that Bayesian methods are about degrees of belief—they are not. They are computations. Likewise, priors are not uniquely subjective; all statistical methods involve human judgment, from defining measures to interpreting results.

There have been foundational objections to all branches of statistics, asserting that they may be fundamentally incoherent. Some argue that the mathematics does not map to the real world in a logically sound way. While we will not delve into the specific issues within statistical methodology, we acknowledge these broader concerns as part of the fitness for purpose discussion.

And so, the analysis continues.

Introduction

Statistics is widely used as a tool for inference across multiple disciplines, from biomedical research to economics and psychology. However, the fundamental question remains: Do statistical methods provide meaningful and reliable insights about the real world? In which domains? The issue is not whether the mathematics of statistics is computationally correct—its internal logic is deterministic and follows precise rules. The issue is whether the statistical framework maps meaningfully onto the complexities of reality.

The lens of fitness for purpose provides a structured way to assess statistical reasoning. This framework distinguishes between two levels of evaluation:

Deity-Level Fitness – If applied perfectly in an idealized world, does the method achieve its stated goal?

Human-Level Fitness – Can real-world researchers, with all their limitations, reliably apply the method to produce valid conclusions?

By considering both aspects, this discussion examines whether statistics, as currently applied, is fit for purpose in producing meaningful and replicable knowledge, particularly in light of the replication crisis.

1. Conceptual Foundations of Statistical Methods

Statistical reasoning is based on the manipulation of probability distributions to draw inferences about data. The primary approaches include:

Frequentist statistics, which relies on long-run frequencies and significance testing.

Bayesian statistics, which incorporates prior knowledge through probability updates.

Signal Detection Theory, which tries to distinguish signal from noise amidst great variability.

These frameworks have been challenged both in their conceptual foundations and in their ability to map onto real-world processes in a meaningful way. Critics argue that statistical models often oversimplify complexity, fail to account for real-world variability, and may generate conclusions that lack empirical grounding. The core question is whether statistics provides genuine insight into reality or merely offers internally consistent computations with uncertain empirical relevance.

2. Fitness for Purpose: Theoretical Considerations (Deity-Level Fitness)

A method can be deemed theoretically fit for purpose if, under ideal conditions, it reliably produces meaningful conclusions. This level of analysis considers:

Internal Consistency – Is the statistical framework logically sound within its own mathematical constraints?

Causal Insight – Does the method actually uncover causal relationships, or does it merely detect correlations?

Probability Interpretation – Are the probability values in statistical models meaningful in describing real-world uncertainty?

Domain-Specific Limitations – Can statistical methods adequately describe complex systems such as psychology, medicine, and economics, where causal factors are interdependent, feedback loops predominate, non-linearity abounds, interactions are the norm, and confounding factors are unknown and unknowable?

A key issue is that statistical inference often operates detached from reality, working within an abstract mathematical space where assumptions may not hold in real-world scenarios. Even if a statistical method is logically valid, its assumptions about data, randomness, and underlying distributions may not correspond to actual systems being studied.

3. Fitness for Purpose: Practical Usability (Human-Level Fitness)

Even if statistical methods were theoretically sound, their practical application poses major challenges. A method that is too complex, fragile, or easily misused cannot be considered fit for purpose. Issues include:

Misapplication of Methods – Many researchers lack deep statistical expertise, leading to incorrect applications of models.

Complexity and Usability – Some statistical techniques are too intricate to be applied reliably in real-world research.

Misinterpretation of Results – Statistical outputs often require nuanced interpretation, but practitioners may overstate or misread significance.

The Replication Crisis – Many findings, particularly in psychology, medicine, and economics, fail to replicate, suggesting that statistical techniques may be generating unreliable conclusions.

Over-Reliance on Significance Testing – Statistical significance (e.g., p < 0.05) is frequently mistaken for real-world importance, leading to misleading conclusions.

A method that works in principle but is inaccessible, easily distorted, or frequently misused by actual researchers fails in terms of practical usability. This suggests that many statistical tools may not be fit for purpose in applied research.
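The gap between statistical significance and real-world importance can be made concrete with a small sketch. The numbers here are hypothetical and chosen only for illustration: two simulated groups whose true means differ by half a point on a scale with a standard deviation of 15, and a very large sample in each group. The large-sample z-test is also an assumption made to keep the example dependency-free.

```python
import math
import random

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def two_sample_z_test(a, b):
    """Return (mean difference, standardized effect size d, two-sided p-value).

    Uses a large-sample z approximation, which is reasonable here only
    because the illustrative samples are very large.
    """
    mean_a, var_a = mean_and_var(a)
    mean_b, var_b = mean_and_var(b)
    diff = mean_b - mean_a
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    pooled_sd = math.sqrt((var_a + var_b) / 2)  # equal group sizes assumed
    return diff, diff / pooled_sd, p

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100_000             # hypothetical, deliberately huge sample per group
control = [rng.gauss(100.0, 15.0) for _ in range(n)]
treated = [rng.gauss(100.5, 15.0) for _ in range(n)]

diff, d, p = two_sample_z_test(control, treated)
print(f"mean difference = {diff:.2f}, Cohen's d = {d:.3f}, p = {p:.2g}")
```

With samples this large, the p-value falls far below 0.05 even though the standardized effect is only a few hundredths of a standard deviation. Whether such a difference matters is a judgment about the domain, and the p-value has nothing to say about it.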

4. The Replication Crisis and Its Implications

The replication crisis has revealed deep flaws in how statistical methods function in practice. Key aspects of the crisis include:

Failure to Replicate Key Findings – A significant portion of published research in psychology, medicine, and economics has been found to be irreproducible.

P-Hacking and Data Dredging – Researchers selectively analyze data to produce statistically significant but spurious results.

Publication Bias – Studies with null results are often unpublished, distorting the literature toward false positives.

Inadequate Statistical Power – Many studies use small sample sizes, making results unstable and unreliable.

False Positives and Overgeneralization – Even valid statistical findings often fail to generalize beyond narrowly controlled conditions.

These issues indicate that statistical methods, as applied in empirical research, may be failing at their fundamental purpose—producing reliable, meaningful knowledge.
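One mechanism behind data dredging can be sketched in a few lines. The setup is an assumption made for illustration: every simulated study has no true effect at all, the researcher measures several independent outcomes, and the study is reported as a "finding" if any one of them reaches p < 0.05. Real outcomes are rarely independent, so the exact figures should not be read as estimates for any actual literature.

```python
import math
import random

ALPHA = 0.05

def null_p_value(rng):
    """Two-sided p-value of a z-test when the null hypothesis is true."""
    z = rng.gauss(0.0, 1.0)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_outcomes, n_studies=20_000, seed=1):
    """Share of no-effect studies that report at least one p < ALPHA."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(null_p_value(rng) < ALPHA for _ in range(n_outcomes)):
            hits += 1
    return hits / n_studies

for k in (1, 5, 20):
    analytic = 1 - (1 - ALPHA) ** k  # chance of at least one "hit" among k tests
    print(f"{k:2d} outcomes: simulated {false_positive_rate(k):.3f}, "
          f"analytic {analytic:.3f}")
```

Under these assumptions, a nominal 5% error rate grows to roughly 23% with five outcomes and roughly 64% with twenty, without any individual test being computed incorrectly. The arithmetic is sound at every step; it is the way the tool is used that defeats its purpose.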

5. Statistical Simulation Studies and Circular Justifications

Many statistical methods are validated through simulations rather than real-world testing. This raises concerns about:

Circular Justifications – Methods are often tested within artificial scenarios that assume their correctness.

Internal Consistency vs. Empirical Validity – A model can be mathematically correct but still fail to describe real-world phenomena.

The Limits of Simulation – Many models function correctly within controlled conditions but break down when applied to complex, unpredictable environments.

This highlights the critical gap between mathematical correctness and empirical fitness for purpose.
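The circularity can be illustrated with a sketch, under stated assumptions. The method being "validated" is the textbook 95% interval for a mean. In the first scenario the simulated data are independent draws, which is exactly what the interval assumes. In the second, each observation depends on the previous one (an AR(1) process with coefficient 0.5, a hypothetical choice made only for illustration), as is common in real measurements taken over time.

```python
import math
import random

def iid_normal(n, rng):
    """The world the textbook interval assumes: independent draws, mean 0."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def autocorrelated(n, rng, phi=0.5):
    """AR(1) series with true mean 0; each value depends on the previous one."""
    x = rng.gauss(0.0, 1.0) / math.sqrt(1 - phi ** 2)  # start in the stationary distribution
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def interval_covers_zero(xs):
    """Does the usual mean +/- 1.96 * SE interval contain the true mean (0)?"""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    half_width = 1.96 * sd / math.sqrt(n)
    return abs(m) <= half_width

def coverage(generator, n=100, trials=5_000, seed=2):
    """Fraction of simulated datasets whose interval contains the true mean."""
    rng = random.Random(seed)
    return sum(interval_covers_zero(generator(n, rng)) for _ in range(trials)) / trials

print(f"coverage, independent data:    {coverage(iid_normal):.3f}")
print(f"coverage, autocorrelated data: {coverage(autocorrelated):.3f}")
```

When the simulation draws from the method's own assumptions, the interval covers the truth close to 95% of the time, and the method appears validated. When the data depart from those assumptions in an ordinary way, coverage falls well short of the nominal level. A simulation study that only ever runs the first scenario confirms the mathematics, not the fit to the world.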

6. The Role of Judgment in Statistical Analysis

Despite its computational structure, statistics requires human judgment at multiple levels:

Defining Variables and Measures – What counts as "data" is always a human decision.

Choosing Models and Assumptions – Statistical models rely on human-selected assumptions about distributions, independence, and error structures.

Interpreting Results – Data do not speak for themselves; conclusions depend on human reasoning.

A common misconception is that Bayesian methods are uniquely subjective due to the use of priors. In reality, all statistical methods require subjective judgment, from variable selection to model design. No method is inherently more "objective" than another—what matters is whether these judgments align with real-world processes.
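A small sketch may make the point about judgment concrete. All of the numbers are hypothetical: 7 successes in 10 trials, two candidate Beta priors for a Bayesian analysis, and, on the frequentist side, an analyst's decision about whether to exclude two trials as protocol deviations. The conjugate Beta-binomial update is standard; the scenario around it is invented for illustration.

```python
def posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

successes, trials = 7, 10  # hypothetical data

# Bayesian analysis: the judgment is visible in the choice of prior.
flat = posterior_mean(successes, trials, 1, 1)         # Beta(1, 1), a flat prior
skeptical = posterior_mean(successes, trials, 10, 10)  # Beta(10, 10), centred on 0.5

# Frequentist analysis: the judgment hides in which observations count.
all_trials = successes / trials
after_exclusion = 6 / 8  # analyst drops one success and one failure as protocol deviations

print(f"Bayesian, flat prior:       {flat:.3f}")
print(f"Bayesian, skeptical prior:  {skeptical:.3f}")
print(f"Frequentist, all trials:    {all_trials:.3f}")
print(f"Frequentist, after exclusion: {after_exclusion:.3f}")
```

The same ten observations yield estimates of 0.667, 0.567, 0.700, and 0.750, depending on choices that no formula makes for the analyst. The prior is simply the place where the Bayesian's judgment is written down explicitly.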

7. Evaluating Statistical Methods as Tools

To determine whether statistical methods are fit for purpose, they must be evaluated as tools for inference rather than as abstract mathematical exercises. Key questions include:

Does the method yield reliable and meaningful conclusions in real-world settings?

Can it be applied correctly by researchers in practice?

Does it align with the complexity of the systems it seeks to model?

Are its assumptions testable and empirically justified?

A method that fails any of these criteria may not be fit for purpose, even if it is mathematically well-constructed.

Summary

Statistics, as a formal system, is computationally correct, but its fitness for purpose depends on whether it maps onto real-world complexity in a meaningful way in each domain of application. The replication crisis has exposed significant problems in practical applications, revealing that statistical tools often produce unreliable and non-replicable findings.

The analysis of fitness for purpose highlights two levels of failure:

Theoretical Failure (Deity-Level Fitness) – Statistical methods may lack meaningful real-world interpretability despite the deterministic mathematics.

Practical Failure (Human-Level Fitness) – Many methods are too complex, misapplied, or fundamentally unreliable in empirical research.

Ultimately, statistical methods should not be assessed solely on their mathematical properties, but on their ability to produce reliable, actionable knowledge in the real world. The current crisis in replication and validity suggests that many statistical tools may not be fit for purpose, at least in their present form and application.

Readings

Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305–307. https://doi.org/10.1038/d41586-019-00857-9

The authors argue against the overreliance on statistical significance thresholds and advocate for a more nuanced interpretation of statistical data in scientific research.

Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116(1), 116–126. https://doi.org/10.1161/CIRCRESAHA.114.303819

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

Cohen discusses the limitations of null hypothesis significance testing and advocates for a greater emphasis on effect sizes and confidence intervals to assess practical significance.

Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216. https://doi.org/10.1098/rsos.140216

Ehrenfest, P., & Ehrenfest, T. (1959). The conceptual foundations of the statistical approach in mechanics. Cornell University Press.

This classic work explores the foundational aspects of statistical mechanics, providing insights into the statistical approach in physical sciences.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033

Gigerenzer critiques the automatic application of statistical methods without proper understanding, emphasizing the pitfalls of misinterpreting statistical significance as practical importance.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

This paper provides a comprehensive guide to common misinterpretations of statistical measures, including the distinction between statistical and practical significance.

Hacking, I. (2006). The emergence of probability: A philosophical study of early ideas about probability and statistics. Cambridge University Press.

Hacking delves into the historical and philosophical development of probability and statistical reasoning, offering a deep understanding of their conceptual underpinnings.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. A. (2018). The proposal to lower p value thresholds to .005. JAMA, 319(14), 1429–1430. https://doi.org/10.1001/jama.2018.1536

Matthews, R. (2001). Why should clinicians care about Bayesian methods? Journal of Statistical Planning and Inference, 94(1), 43–58. https://doi.org/10.1016/S0378-3758(00)00228-2

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. https://doi.org/10.1177/1745691612459058

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.

Pearl presents a comprehensive framework for understanding causation, distinguishing it from mere correlation, and introduces models for causal inference.

Ritchie, S. (2020). Science fictions: How fraud, bias, negligence, and hype undermine the search for truth. Metropolitan Books.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108

This statement from the American Statistical Association addresses the proper use and interpretation of p-values, highlighting the difference between statistical significance and practical relevance.