Reason: The Replication Crisis in Some Research Areas
Is there a real replication crisis, or is replication an unreasonable standard for the domains of research where there is alleged to be a crisis? What is the nature of the crisis, what are conjectured causes, what are proposed solutions, and are they likely to work?
Author’s Preface
I am going to talk about the replication crisis in some research areas. This has been the subject of various articles and analyses over the last few years. I’m going to look at what could account for the crisis, if indeed it is a crisis, and look at conjectures about causes. I will also touch on the broader question: is it really a crisis, or is the expectation of replication in some fields, such as psychology, something that seems clear in principle but is misguided in reality?
The crisis in the replication of research has been treated as a fundamental problem with research in psychology and in other disciplines. I’m not very familiar with all of these areas, but a number of reasons for the crisis have been proposed. The reasons given have centred on:
1 – Institutional factors, including corporate and even regulatory corruption.
2 – Personal failings, including corruption and incompetence.
3 – Poor application of research methodology, most critically in statistics and research design.
4 – Fitness for purpose in how the methods have been used, i.e., whether “doing things right” is sufficient for the purpose at hand.
5 – Incoherence in the semantics of certain statistical methods, a criticism that can be levelled at all types of statistics in the domains under consideration. This is a fitness-for-purpose problem at the level of semantics, which bears on “doing the thing right.”
6 – The fundamental fitness for purpose of the statistical methods themselves in complex domains such as psychology, i.e., fundamental flaws in the methods’ applicability to such domains: whether researchers are “doing the right thing.”
7 – More particularly, the statistical methods used in psychological research have been analysed extensively. Criticisms include the inappropriate use of operational definitions for unstable psychological dimensions and the inappropriate treatment of ordinal data as interval-level data.
8 – The proposal that, quite apart from the fitness of the methods themselves (that is, the applicability of the models to domains where statistics are used, such as psychology, the other soft sciences, medicine, and nutrition), replication makes no sense at all as a standard in psychology, owing to the nature of human beings and of society.
So these are all possible explanations, and it may be that several of these factors are at work at once. Some may preclude others, and there is a great deal of conjecture and theorizing. It is unclear how to even assess the replication crisis. Some have tried; John P. Ioannidis, for example (far from the first to opine on the matter), has offered many recommendations, but these tend to be cosmetic and geared mostly towards which methods are used and how they are applied.
To a great degree, the criticisms of research replication problems are a dog’s breakfast of conjecture mixed with solid observations. Perhaps some of the criticisms are true, but the elephant in the room is that the whole idea that we should be able to replicate psychological research may itself be deeply flawed. Not many scholars want to go there, and those who have gone there have been unable to gain much of an audience.
Introduction
The so-called “replication crisis” has attracted growing attention across multiple research disciplines. Psychology, medicine, and the social sciences have come under sustained scrutiny for their inability to consistently reproduce published results. In response, a wide range of explanations—methodological, institutional, philosophical—have been offered. Yet despite a growing literature and numerous reform initiatives, a deeper question remains unresolved: is replication itself a meaningful standard for these domains?
In the natural sciences, replication is often straightforward: phenomena are stable, variables can be isolated, and external conditions controlled. But in human-focused disciplines, the subject matter is dynamic, context-sensitive, and interpretive. The demand for replication may misapply expectations appropriate to chemistry or physics to fields in which variability is inherent rather than extraneous.
This essay follows the structure and categories identified in the Author’s Preface. Each of the eight explanations will be addressed in sequence, followed by broader reflections on whether replication itself may be a misplaced ideal in domains where human meaning, interpretation, and behavior are central.
Discussion
1. Institutional Factors: Corruption and Incentive Distortion
Institutional dynamics can distort research outputs long before any experiment is conducted. Academic publishing rewards novelty and significance over reliability. Funding bodies prioritize impact over verification. Career advancement hinges on publication quantity, not methodological rigor. As a result, the research ecosystem encourages overstated conclusions and discourages null findings.
Moreover, corporate interests and regulatory entanglements can influence the research agenda, particularly in medicine and nutrition. Such pressures create systemic incentives for results that align with commercial or political goals, regardless of replicability. This is not isolated misconduct, but structural corruption.
2. Personal Failings: Incompetence and Misconduct
Not all problems are systemic. Researchers themselves may contribute to replication failure through lack of expertise, confirmation bias, or even fraud. Some simply lack the statistical training to conduct valid inference. Others unconsciously tailor data or interpretations to fit expected outcomes. And in rare cases, researchers fabricate results.
Yet the problem is not primarily about bad actors. The more widespread issue is a research culture in which flawed methods and shallow analysis are normalized and incentivized. Personal failings occur within an ecosystem that rewards appearance over substance.
3. Methodological Weakness: Fragile Designs and Inadequate Controls
Many replication failures arise from poor experimental design. Underpowered studies, flexible hypotheses, inadequate blinding, and small sample sizes all contribute to unstable results. Replication attempts often reveal how sensitive original findings were to methodological particulars or accidental artifacts.
Psychological research, in particular, frequently relies on ad hoc instruments and constructed scenarios whose reliability is hard to test. If original studies lacked proper control or transparency, replication is less a diagnostic tool than a belated exposure of flawed design.
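To make the point about statistical power concrete, here is a minimal simulation sketch in Python; the effect size, sample sizes, and number of simulated studies are illustrative assumptions, not figures from any particular study. It shows how rarely a modest true effect reaches significance in small samples, and therefore how unlikely a literal replication of a “significant” result is when power is low.

```python
# Illustrative power simulation: how often does a small true effect
# reach p < .05, and how does that change with sample size?
# The true effect size and sample sizes are assumed for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3        # assumed standardized mean difference (Cohen's d)
n_simulations = 10_000
alpha = 0.05

for n_per_group in (20, 175):
    significant = 0
    for _ in range(n_simulations):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(true_effect, 1.0, n_per_group)
        _, p_value = stats.ttest_ind(treatment, control)
        if p_value < alpha:
            significant += 1
    power = significant / n_simulations
    print(f"n = {n_per_group:3d} per group: estimated power ~ {power:.2f}")
```

Under these assumptions, the small-sample condition detects the effect only a small fraction of the time, while the large-sample condition detects it most of the time. A significant original finding and a non-significant replication can then both be unremarkable outcomes of the same underlying reality.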
4. Fitness for Purpose: Doing Things Right, But to What End?
Much reform has focused on encouraging better practice: pre-registering hypotheses, increasing sample sizes, promoting open data, and improving training. These changes aim to ensure researchers are “doing things right.” But even when implemented perfectly, they may not be enough.
The core issue is whether the methods are fit for the kinds of questions being asked. If the methods themselves are misaligned with the nature of the subject matter—if they are incapable of capturing what is being studied—then methodological improvement is cosmetic. Replication failures in such cases do not indicate sloppiness but a deeper mismatch between tool and domain.
5. Semantic Incoherence: Contradictions Within Statistical Formalism
Null hypothesis significance testing (NHST), widely used in psychology and other fields, rests on a semantic contradiction. The method assesses the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. Yet it is routinely used to draw inferences about the alternative hypothesis—despite the formalism denying any such inference.
This produces an internal incoherence: the method is invoked to reject the null but disavows any capacity to evaluate the alternative. The statistical framework, by design, prohibits direct assessment of the thing it is used to support. Researchers are told the p-value “says nothing about” the alternative hypothesis, while simultaneously encouraged to treat rejection of the null as evidence in its favor.
This is not a misunderstanding but a structural contradiction within the semantics of NHST itself. The method enables an inferential leap that it formally denies, and widespread reforms do little to address this foundational inconsistency.
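A small simulation can make the gap between the two quantities explicit. The sketch below, in Python, uses illustrative assumed values for the base rate of true effects and the average power of studies, in the spirit of Ioannidis’s argument. It separates the probability the formalism actually controls, P(significant | null is true), from the probability researchers typically want, P(null is true | significant).

```python
# Illustrative simulation: a significance threshold of alpha = 0.05 does not
# mean that only 5% of significant findings are false positives.
# The base rate of true effects and the average power are assumed values.
import numpy as np

rng = np.random.default_rng(1)
n_hypotheses = 1_000_000
base_rate_true = 0.10    # assumed fraction of tested hypotheses that are real effects
power = 0.50             # assumed average power of the studies
alpha = 0.05

effect_is_real = rng.random(n_hypotheses) < base_rate_true
# A study comes out "significant" with probability `power` when the effect is
# real, and with probability `alpha` when the null hypothesis is true.
significant = np.where(effect_is_real,
                       rng.random(n_hypotheses) < power,
                       rng.random(n_hypotheses) < alpha)

false_positive_share = np.mean(~effect_is_real[significant])
print(f"P(significant | null true) = {alpha:.2f}  (what alpha controls)")
print(f"P(null true | significant) ~ {false_positive_share:.2f}  (what is often inferred)")
```

Under these assumed values, nearly half of the significant results arise from true null hypotheses even though every test was run at the conventional 5% level. The p-value’s semantics simply do not license the inference that is routinely drawn from it.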
6. Inapplicability of Methods: Doing the Wrong Thing Well
Even if statistical procedures were internally coherent and flawlessly executed, they may be conceptually unfit for certain domains. Psychological and social phenomena are often unstable, context-dependent, and multiply determined. They are not governed by fixed laws but by interpretive processes and shifting cultural norms.
Statistical models assume fixed parameters, randomness, and independence. Applying such models to dynamic human systems may produce technically valid computations but epistemologically meaningless results. In such cases, replication failure is not a problem with execution—it is a sign that the method was never appropriate for the domain.
7. Measurement Breakdown: Unstable Constructs and Data Illusions
Psychological research relies heavily on operational definitions and rating scales. Constructs like “intelligence,” “depression,” or “resilience” are defined by instruments whose validity is uncertain and often population-specific. Many such measures yield ordinal data, yet are treated as interval or ratio data to permit mathematical manipulation.
This treatment produces an illusion of precision. Correlations, means, and regressions calculated from such data often reflect artifacts of the coding system rather than properties of the underlying construct. Replication then fails not because the finding was false, but because the measurement was unstable, ill-defined, or context-sensitive.
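A minimal example can illustrate the ordinal-versus-interval problem. In the sketch below, written in Python with an invented five-point scale and invented responses, two numeric codings that preserve exactly the same ordering of the response categories lead to different conclusions about which group scores higher on average.

```python
# Illustrative example: means computed from ordinal data can be artifacts of
# the (arbitrary) numeric codes assigned to the categories.
# The scale, responses, and recoding below are invented for illustration.
import numpy as np

# Five ordered response categories, conventionally coded 1..5.
group_a = np.array([2, 2, 4, 4])   # invented responses
group_b = np.array([3, 3, 3, 3])   # invented responses

print("Conventional coding (1, 2, 3, 4, 5):")
print(f"  mean A = {group_a.mean():.2f}, mean B = {group_b.mean():.2f}")  # 3.00 vs 3.00

# A different coding that preserves the category order exactly
# (1 -> 1, 2 -> 2, 3 -> 3, 4 -> 6, 5 -> 7) but changes the spacing.
recode = {1: 1, 2: 2, 3: 3, 4: 6, 5: 7}
a2 = np.array([recode[x] for x in group_a])
b2 = np.array([recode[x] for x in group_b])

print("Order-preserving recoding (1, 2, 3, 6, 7):")
print(f"  mean A = {a2.mean():.2f}, mean B = {b2.mean():.2f}")  # 4.00 vs 3.00
```

Nothing about the respondents’ ordering changes between the two codings, yet the apparent group difference appears and disappears with the spacing of the numeric labels. If only the order of the categories is meaningful, conclusions that rest on means or correlations of the coded values rest partly on the coding itself.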
8. Domain-Relative Implausibility: Replication as Misapplied Ideal
The most fundamental criticism is that replication may not be a meaningful or even coherent goal in some domains. Human subjects are not constant; their behavior, interpretation, and environment vary over time and context. A result that holds in one population or cultural moment may fail elsewhere without indicating error.
This is not an argument for relativism, but for epistemic realism about variability. In psychology and the social sciences, the expectation of strict replication may reflect an inappropriate ideal borrowed from the physical sciences. In these fields, replication failure may not signal broken methods but rather the ontological instability of the subject matter.
Summary
The replication crisis is often presented as a straightforward methodological problem with straightforward reforms. But the eight categories examined above reveal a more complex situation. Replication failures may result from institutional corruption, poor training, and fragile design—but also from deeper semantic, philosophical, and epistemological problems. In some cases, the expectation of replication itself may be inappropriate to the domain.
Calls for reform must therefore be cautious about assuming that better methods will resolve the issue. If the crisis reflects a mismatch between the nature of human phenomena and the tools used to study them, then the solution lies not in more rigorous replication, but in rethinking what kinds of knowledge these domains can yield—and on what terms.
Readings List
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585. https://doi.org/10.1126/science.aal3618
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244. https://doi.org/10.2466/pr0.1990.66.1.195
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Trafimow, D., & Marks, M. (2015). Editorial: The methodological iniquity of null hypothesis significance testing. Basic and Applied Social Psychology, 37(1), 1–2. https://doi.org/10.1080/01973533.2015.1012991
Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

