Layered Explanations for Statistical and Other Formal Models
A Framework for Understanding Statistical Inference and Scientific Methodology
Note: My ideas, reinterpreted by ChatGPT and somewhat expanded in places and condensed in others. The result still captures the essence of my thoughts.
Introduction
The purpose of science is to understand cause and effect in order to explain, predict, and control. This proposition appears self-evident, and I have not encountered any alternative perspectives that hold up to scrutiny. Science achieves this through abstraction—by constructing models that represent aspects of reality in a way that allows us to reason about them. Every level of our perceptual and intellectual understanding relies on abstraction, which, in essence, is the creation of models. These models have validity insofar as they cohere with and accurately represent the world. However, this sense of validity differs from the formal concept of logical validity; it is closer to the notion of soundness, where the premises and the structure of reasoning must both correspond to reality.
Models should be viewed as tools, and like any tool, their usefulness depends on their fitness for purpose. Scientists use a wide array of tools depending on the domain of inquiry, but one nearly universal tool in modern scientific research is statistical inference. Within statistics, there are three primary types of inference:
Signal detection, which identifies patterns amid noise.
Bayesian inference, which updates probabilities in light of new evidence.
Frequentist inference, which evaluates data using probability models based on long-run frequencies.
All three approaches are probabilistic and inductive in nature, though some varieties of induction come close to determinism while others remain highly probabilistic. The domain of application determines which is most appropriate. A common misunderstanding is that Bayesian inference is about subjective "degrees of belief." In reality, Bayesianism is simply a framework for updating probabilities in light of new evidence, and it is no more subjective than any other statistical method, all of which require judgment at various points.
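To make the Bayesian case concrete, the following minimal Python sketch applies Bayes' theorem to a single piece of evidence. The numbers (a 1% prior, a 90% true-positive rate, a 5% false-positive rate) are illustrative assumptions, not values drawn from any particular study. The update itself is a mechanical calculation; any "subjectivity" enters only through the choice of inputs, just as it does when choosing a model or a significance threshold elsewhere.

```python
# Minimal sketch of a Bayesian update: posterior probability from a prior and
# two likelihoods. All numbers below are illustrative assumptions.

def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | evidence) via Bayes' theorem."""
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / denominator

prior = 0.01                # P(condition) before seeing the test result
sensitivity = 0.90          # P(positive test | condition)
false_positive_rate = 0.05  # P(positive test | no condition)

posterior = bayes_update(prior, sensitivity, false_positive_rate)
print(f"P(condition | positive test) = {posterior:.3f}")  # ~0.154
```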
Despite its central role in scientific reasoning, statistics is often misunderstood and misapplied. One of the major reasons for this confusion is the failure to distinguish between different layers of statistical reasoning. Here, I propose a layered architecture for statistical thinking, which provides a structured way to analyze the strengths and weaknesses of statistical methods. These layers are analogical, not literal or strictly hierarchical, but they offer a systematic way to dissect statistical inference and its relationship to scientific inquiry. My framework separates concerns in a way that allows for a clearer analysis than traditional critiques of statistical methods, which often conflate different levels of reasoning.
The Layered Model of Statistical Thinking
1. The Mathematical Layer
At the most fundamental level, statistical reasoning is built on mathematical formalism. This layer consists of deterministic calculations of probabilities, transformations, and statistical estimations. Purely mathematical operations, by their nature, are always deterministic—even if they describe probabilistic phenomena. The rigor and internal consistency of mathematical statistics are beyond question, but their utility depends entirely on how well they map onto real-world phenomena.
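A trivial sketch illustrates the point: the tail probability of a standard normal distribution is a fixed mathematical quantity, computed deterministically whether or not any real data happen to be normal (the 1.96 threshold is simply the familiar two-sided 5% convention).

```python
# The mathematical layer is deterministic even when its subject is chance:
# the tail probability of a standard normal at 1.96 is a fixed number,
# regardless of whether any real data are normally distributed.
from scipy.stats import norm

tail = 1 - norm.cdf(1.96)           # P(Z > 1.96) for Z ~ N(0, 1)
print(f"P(Z > 1.96) = {tail:.5f}")  # ~0.02500, the same every time it is computed
```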
2. The Conceptual Layer
This layer deals with the underlying assumptions and axioms of statistical models and their mapping onto reality. Every statistical model is an abstraction, and for the model to be meaningful, there must be a coherent correspondence between its assumptions and the empirical world. If this mapping fails, the statistics remain mathematically valid but become scientifically meaningless. For example, assuming normally distributed errors in a dataset where errors are actually heavy-tailed leads to misleading inferences, even though the calculations themselves remain valid. The conceptual layer is where we assess whether statistical models have soundness—that is, whether their structure and premises align with reality.
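The following sketch, with an assumed heavy-tailed error distribution (Student-t with 3 degrees of freedom), illustrates how the mapping can fail: the normal model's prediction for the frequency of extreme "4-sigma" deviations is calculated correctly, yet it understates the actual rate by roughly two orders of magnitude.

```python
# Sketch: the probability of a "4-sigma event" under a normal model versus a
# heavy-tailed (Student-t, 3 df) error process. Each calculation is mathematically
# valid; the mismatch arises in mapping the model to the data. The distribution
# choice and the 4-sigma threshold are illustrative assumptions.
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(0)
df = 3
sigma = t.std(df)                    # true standard deviation of t(3) errors

errors = t.rvs(df, size=1_000_000, random_state=rng)
observed = np.mean(np.abs(errors) > 4 * sigma)  # empirical 4-sigma exceedance
predicted = 2 * norm.sf(4)                      # what a normal model predicts

print(f"Normal model predicts : {predicted:.6f}")  # ~0.000063
print(f"Heavy-tailed reality  : {observed:.6f}")   # ~0.006, far more frequent
```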
3. The Application Layer
Here, we move from conceptual foundations to practical use. This layer encompasses the actual application of statistical methods in empirical research. There are two major concerns at this level:
Fidelity to assumptions – In practice, statistical methods are seldom applied with perfect adherence to their underlying assumptions. Researchers may use a t-test assuming normality in data that are not truly normal, leading to errors in inference.
Competence of researchers – The correct application of statistical tools requires a certain level of statistical literacy, and in many cases, researchers fail to apply methods correctly due to misunderstanding or misuse.
Thus, even if a statistical model is conceptually coherent, its practical application may introduce errors that undermine its conclusions.
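As a rough illustration of the first concern, the sketch below simulates one-sample t-tests on small samples drawn from a strongly skewed (lognormal) distribution in which the null hypothesis is true by construction; the sample size, distribution, and simulation settings are arbitrary choices made for illustration. The realized false-positive rate typically departs noticeably from the nominal 5%.

```python
# Sketch: type I error of a one-sample t-test when supposedly "normal" data are
# actually strongly right-skewed (lognormal). Sample size, distribution, and
# number of simulations are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
true_mean = np.exp(0.5)        # mean of a lognormal(0, 1) distribution
n, n_sims, alpha = 15, 20_000, 0.05

false_positives = 0
for _ in range(n_sims):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    # H0 is true by construction, so every rejection is a false positive.
    if ttest_1samp(sample, popmean=true_mean).pvalue < alpha:
        false_positives += 1

print(f"Nominal type I error : {alpha:.3f}")
print(f"Simulated rate       : {false_positives / n_sims:.3f}")  # noticeably off nominal
```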
4. The Interpretive Layer
Once statistical methods have been applied, researchers must interpret the results. Interpretation is inherently subjective, and as Friedrich Nietzsche put it, “There are no facts, only interpretations.” This is particularly true in complex scientific fields where statistical results are subject to multiple competing narratives. For example, a study may find a weak correlation between two variables, but whether this correlation is interpreted as evidence of a causal relationship depends on the theoretical biases of the researcher.
5. The Validation and Verification Layer
This layer assesses whether statistical methods and their interpretations achieve their intended scientific goals. In principle, scientific validation comes through replication—if a result is real, it should be reproducible by other researchers. However, in practice, replication efforts have often failed, particularly in complex fields such as psychology, nutrition, and medicine. These failures suggest that many statistical models are not fit for purpose in these domains.
Fitness for purpose is a concept I first encountered in my role as a Quality Assurance Director in information systems. The idea is that tools must be evaluated not just on their internal correctness but on their ability to solve real-world problems. Applied to statistical inference, the question is not merely whether statistical methods are mathematically sound but whether they actually help us understand and predict real-world phenomena.
Fitness for Purpose in Statistical Inference
Statistical methods are meant to aid scientific inquiry by helping researchers establish relationships between variables. However, many critics—including John P. Ioannidis—have pointed out that statistical methods often fail in practice. Ioannidis' work focuses largely on the application layer—that is, he argues that statistical failures arise because researchers are not using statistical tools correctly. His proposed remedies, however, often assume an ideal world where researchers could simply be better trained and more rigorous.
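The core of that argument can be expressed compactly through the positive predictive value (PPV) of a "significant" finding. The sketch below implements the bias-free form of the relationship discussed in Ioannidis (2005), PPV = (1 − β)R / (R − βR + α), where R is the pre-study odds that the tested relationship is true; the particular values of R and power are illustrative.

```python
# Sketch of the positive predictive value (PPV) calculation behind Ioannidis
# (2005), in its simplest bias-free form:
#   PPV = (1 - beta) * R / (R - beta * R + alpha)
# where R is the pre-study odds that a tested relationship is true.
# The example values of R and power are illustrative assumptions.

def ppv(pre_study_odds, power, alpha=0.05):
    """Probability that a 'significant' finding is true, ignoring bias."""
    r, beta = pre_study_odds, 1 - power
    return power * r / (r - beta * r + alpha)

for r, power in [(1.0, 0.8), (0.1, 0.8), (0.1, 0.2)]:
    print(f"pre-study odds={r:<4} power={power}: PPV = {ppv(r, power):.2f}")
# Even well-powered studies of unlikely hypotheses yield many false findings,
# and low power makes the situation considerably worse.
```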
What Ioannidis does not fully address is whether statistics are fit for purpose at a more fundamental level—the conceptual layer. Some critics argue that the statistical models themselves are flawed, regardless of how well they are applied. The long-standing divide between Fisherian significance testing and Neyman-Pearson hypothesis testing exemplifies this issue: both frameworks have conceptual limitations, and efforts to reconcile them have led to confusion rather than clarity.
Statistics also suffers from a major blind spot regarding causality. Most statistical methods, particularly regression analysis, deal with correlation, not causation. While some frameworks attempt to formalize causal inference, these approaches introduce their own conceptual and practical problems. Bayesian networks and Pearl’s causal inference models are promising but still subject to limitations in real-world applicability.
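A minimal simulation makes the blind spot concrete: an assumed confounder Z drives both X and Y, so the two correlate strongly even though neither causes the other, and the association disappears once Z's contribution is accounted for. The coefficients and noise levels below are arbitrary illustrative choices; the sketch gestures at, but does not implement, the formal machinery of Pearl's framework.

```python
# Sketch: a confounder Z drives both X and Y, producing a clear correlation
# between X and Y even though neither causes the other. Effect sizes and the
# noise model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
z = rng.normal(size=n)                      # common cause (confounder)
x = 0.8 * z + rng.normal(scale=0.6, size=n)
y = 0.8 * z + rng.normal(scale=0.6, size=n)

print(f"corr(X, Y)           = {np.corrcoef(x, y)[0, 1]:.2f}")   # clearly nonzero
# "Adjusting" for Z (here, correlating the residuals after removing Z's known
# contribution) makes the association vanish, as a causal model would predict.
rx = x - 0.8 * z
ry = y - 0.8 * z
print(f"corr(X, Y | Z known) = {np.corrcoef(rx, ry)[0, 1]:.2f}")  # approximately zero
```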
Another key issue is effect sizes. While measuring effect sizes should be central to scientific inference, frequentist methods—especially those focused on null hypothesis significance testing (NHST)—tend to obscure effect sizes. Even when effect sizes are reported, researchers often fail to contextualize them properly, leading to misinterpretation.
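A short sketch shows why significance alone obscures magnitude: with an assumed true difference of 0.02 standard deviations between two very large groups, the p-value is typically vanishingly small while Cohen's d remains negligible.

```python
# Sketch: with a large enough sample, a trivially small difference between two
# groups is "highly significant" even though the standardized effect size
# (Cohen's d) is negligible. Group sizes and the true difference are
# illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n = 200_000
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.02, scale=1.0, size=n)  # true difference of 0.02 SD

result = ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value   = {result.pvalue:.2e}")  # typically far below 0.05
print(f"Cohen's d = {cohens_d:.3f}")       # ~0.02, a negligible effect
```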
Conclusion
By adopting a layered framework for analyzing statistical inference, we can more effectively diagnose its problems and limitations. Traditional critiques of statistical methods often assume a single level of analysis, whereas a multi-layered approach allows us to separate concerns and target problems at different levels. The layered model clarifies where issues arise—whether in mathematical formalism, conceptual mapping, practical application, interpretation, or validation.
Ultimately, the fitness for purpose of statistical methods remains an open question. Many statistical tools were developed for controlled experiments in physical sciences but are now applied to complex, nonlinear, and highly variable domains such as psychology and medicine. Whether these tools are adequate for such purposes—or whether they need fundamental revision—is a debate that remains unresolved.
References
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.
Fisher, R. A. (1925). Statistical methods for research workers. Oliver & Boyd.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231(694–706), 289–337.