Reason: Statistical Significance, Effect Size and Variability
A common source of confusion when people try to make sense of research studies is conflating statistical significance with effect size. Of course, I may be wrong, but there is some support for that view.
"I want to discuss research, effect sizes, statistical significance, variability, and believing nonsense. We will discuss this by first pointing out that measured electrical quantities, as in Ohm's Law, are subject to variability, and we could, in fact, use statistical techniques to measure electrical phenomena. It would be pointless, but we could do it."
This introductory remark sets up a central contrast: between the physical sciences, where measurements like voltage, current, and resistance conform reliably to principles like Ohm’s Law, and the human sciences, where measurements are often unstable, noisy, and influenced by subjective interpretation. In physics, if a voltage of 10 volts is applied across a 5-ohm resistor, the current is predictably 2 amperes—within a margin so small that statistical correction is practically irrelevant. However, in domains like psychology or health sciences, one might administer a questionnaire to measure anxiety and get different results from the same person across different days, or even across different contexts on the same day. Using statistical tools to analyze such data becomes necessary not because the tools themselves are universally valid, but because the domain lacks the regularity found in physical systems.
An everyday analogy would be trying to measure room temperature using a thermometer versus trying to gauge someone's mood using a survey. The former can be done reliably with little room for doubt. The latter, even when averaged across populations and corrected for biases, always carries interpretive and contextual ambiguity.
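To make the contrast concrete, here is a minimal sketch with invented noise levels (the 0.01-ampere instrument noise and the 12-point day-to-day swing on a hypothetical 0–100 anxiety questionnaire are assumptions for illustration, not real specifications):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ohm's law: I = V / R = 10 V / 5 ohm = 2 A, plus tiny instrument noise
# (sd of 0.01 A is an assumed figure, not a real instrument spec).
current = 10 / 5 + rng.normal(0, 0.01, size=1000)

# Hypothetical 0-100 anxiety questionnaire given to the same person on
# different days: the same "true" level of 40, with large day-to-day swings
# (sd of 12 is likewise an assumed figure).
anxiety = 40 + rng.normal(0, 12, size=1000)

for name, values in [("current (A)", current), ("anxiety score", anxiety)]:
    cv = values.std() / values.mean()  # coefficient of variation: spread relative to the mean
    print(f"{name}: mean={values.mean():.2f}, sd={values.std():.2f}, CV={cv:.1%}")
```

The particular numbers are made up; the point is that the relative spread differs by orders of magnitude, which is why statistics is optional in the first case and unavoidable in the second.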
On Large Effects and Small Variability
"Decades ago, I read an article, maybe in Psychology Today, by a psychologist who asserted, without proof, that if the variability was small and the effect sizes were large, we didn't need no steenkin’ statistics. I don't know if he said it quite that way, but that was the implication that stuck with me decades later."
This anecdote underscores a fundamental truth in empirical reasoning: when effects are dramatic and consistency is high, the need for formal statistical inference diminishes. For example, if every time someone takes a particular medication, their blood pressure drops by 20 mmHg within an hour, across many individuals and trials, no one would require a statistical test to be persuaded that the drug is effective. The signal is obvious, persistent, and difficult to attribute to chance.
In contrast, if a new teaching method improves test scores by an average of 2% across students, but individual outcomes vary wildly—some improve, some worsen, many show no change—then statistical inference becomes necessary to determine whether the average improvement is likely due to the intervention or just random fluctuation.
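A rough simulation of the two situations just described, using illustrative numbers rather than figures from any study, shows why inference matters in one case and not the other:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50  # participants per group in each hypothetical study

# Case 1: large effect, small variability (a 20 mmHg drop, sd = 3)
control_bp = rng.normal(150, 3, n)
treated_bp = rng.normal(130, 3, n)

# Case 2: small effect, large variability (a 2-point gain, sd = 15)
control_score = rng.normal(70, 15, n)
treated_score = rng.normal(72, 15, n)

def summarize(label, a, b):
    diff = b.mean() - a.mean()
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = diff / pooled_sd                      # Cohen's d: effect size in sd units
    p = stats.ttest_ind(a, b).pvalue          # two-sample t-test
    print(f"{label}: mean diff = {diff:+.1f}, Cohen's d = {d:+.2f}, p = {p:.3g}")

summarize("blood pressure", control_bp, treated_bp)
summarize("test scores   ", control_score, treated_score)
```

In the first case the difference is obvious in any plot and the test adds nothing; in the second, the test may or may not reach significance at this sample size, which is exactly when formal inference has to do the work.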
An analogous real-world example is found in smoking and lung cancer research. The link between heavy smoking and lung cancer was so strong—dozens of times higher incidence among smokers than non-smokers—that the case for causation became persuasive even before the widespread use of inferential statistics in epidemiology. In contrast, the supposed benefits of certain superfoods—like blueberries or kale for “brain health”—are often statistically significant but involve effects so minuscule and variable that they vanish in practical terms.
Measurement Domains and Disciplinary Differences
"We're talking about measurements, not about Ohm's law specifically. I'm not sure that we have a lot in psychology where the variability is small and the effect size is large. Psychology is based on some very unstable phenomena, behavioral and self-reported, and huge variability and typically small effect sizes. It sure doesn't just apply to psychology, it applies to any number of soft sciences, and medicine and nutrition, and fitness, and a number of other disciplines, where the common feature seems to be mankind, society, or individuals."
This section draws attention to the core problem in the human sciences: the volatility of the subject matter. Psychological states are difficult to measure with consistency, and human behavior is influenced by a multitude of often uncontrollable factors—context, memory, stress, social pressure, and personal history. Even physiological measures, like hormone levels or reaction times, fluctuate significantly due to factors unrelated to the variables under study.
In medicine, a new drug might lower cholesterol by 5%, but the individual responses will vary dramatically—some people might see a 20% drop, others no effect, and a few may even experience adverse outcomes. The same is true in fitness: one person following a high-protein diet might lose considerable weight, while another sees no change or even gains weight due to differing metabolism, activity level, or adherence.
The commonality across these domains is that human variability overwhelms the consistency found in physical systems. As such, interpreting empirical data becomes less about direct observation and more about modeling noisy, partial signals.
Inapplicability of Statistical Models
"I have argued elsewhere, quite forcefully, that statistical models themselves are not always fit for purpose, because they don't apply to some domains. They model simpler domains, not the complex domains I've just mentioned. Also, even if they were fit, they're not used correctly, according to any number of critics. Also, the models have some very large conceptual flaws, whether frequentists of any variety, or Bayesian, or even signal detection—they all have conceptual problems. As opposed to simply misapplied, they may be used badly because they don't apply. That is they are not just misapplied—they just don't apply to those realms. But that's another discussion that I've made elsewhere."
This critique targets a foundational overreach: applying models developed for physical or mechanical systems—where variables are well-defined and regular—to systems characterized by complexity, feedback loops, and unmeasured influences. A common example is the use of linear regression in psychology. These models assume fixed relationships between variables, but in reality, the relationship between, say, self-esteem and performance is context-dependent and likely non-linear.
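As a toy illustration of that structural mismatch, the sketch below invents an inverted-U relationship (loosely in the spirit of arousal-performance curves, not taken from any dataset) and fits a straight line to it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical predictor (say, a 0-10 "self-esteem" score) with an inverted-U
# relationship to performance: performance peaks in the middle of the range.
x = rng.uniform(0, 10, 200)
performance = -(x - 5) ** 2 + 25 + rng.normal(0, 2, 200)

# Ordinary least-squares straight line
slope, intercept = np.polyfit(x, performance, 1)
pred = slope * x + intercept
r2 = 1 - np.sum((performance - pred) ** 2) / np.sum((performance - performance.mean()) ** 2)

print(f"fitted slope = {slope:+.3f}, R^2 = {r2:.3f}")
# The slope is near zero and R^2 is tiny, so the linear model reports
# "no relationship" even though the variables are strongly, but non-linearly, related.
```

The failure here is not a miscalculation; the model answers a question (what is the straight-line slope?) that does not match the structure of the phenomenon.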
In economics, similar failures abound. Models predicting market behavior based on rational agents have repeatedly collapsed under the weight of actual human decision-making, which is emotional, heuristic-based, and often irrational. Statistical models, in such cases, are not simply being used with error—they are structurally mismatched with the phenomena they are intended to describe.
Statistical Judgment and Conceptual Issues
"Well, I didn't want to get deeply into the problems of statistics since I've covered that elsewhere. All statistical methods require judgment. Inputs and outputs are based upon judgment. So Bayesian statistic has a problem not with priors—that’s just a red herring—it has a problem with normalization. That's the big problem, not priors. I don't know why people repeat the nonsense that it's the priors. It's not. It's the normalization step. The other thing is that it's often characterized as being degree of belief, and that's nonsensical. It's a category error."
The argument here redirects the standard critique of Bayesian inference away from the contentious issue of priors (the initial assumptions) and toward normalization: the step of computing the marginal likelihood that rescales the numerator of Bayes' theorem into a valid probability distribution. In practical terms, that normalizing quantity is often analytically intractable and has to be approximated, and the approximation is only as trustworthy as the underlying model, which in complex domains is rarely well specified.
Moreover, interpreting posterior probabilities as degrees of belief creates confusion between the formal probability calculus and informal human psychology. For example, saying that there is a 70% probability someone will relapse into addiction is not a report of anyone's belief; it is a judgment call derived from a complex model built on sparse or noisy data. Conflating this output with "belief" mistakes the result of a fragile inference for an internal state of conviction.
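For reference, the normalization step in question is the denominator of Bayes' theorem, the marginal likelihood that turns the numerator into a proper probability distribution:

$$
P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\,P(\theta)}{P(D)}, \qquad
P(D) \;=\; \int P(D \mid \theta')\,P(\theta')\,d\theta'.
$$

The prior P(θ) is explicit and debatable; the marginal likelihood P(D) is the integral that is typically intractable in complex models and must be approximated, which is the step the remark above is pointing at.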
Frequentist Illogic and the Null Hypothesis
"Now, frequentist statistics of any variety depends upon the illogic of the null hypothesis, which is not what one wants to actually determine. The null hypothesis is a statistical artifact. It's a thought experiment, if anything. It makes no sense. Not because there's no null effects—you can easily come up with things where there's zero correlation—but that its basis for inference is dubious, as well as the arbitrariness of statistical significance levels, and the dubious proposition of going from samples to undefined populations."
This passage challenges one of the core ideas in classical (frequentist) statistics: that we test a “null hypothesis” (e.g., no difference, no correlation) and use p-values to decide whether to reject it. The criticism is not that null effects are implausible—in many cases they are realistic—but that this framework assumes one wants to know whether an effect is zero, when in practice, the interest usually lies in how big the effect is and whether it's meaningful.
An example: suppose a new learning app increases student test scores by 1%. A large enough sample size will likely make this difference "statistically significant," even though a 1% gain may be irrelevant in practice. Worse, the use of p < .05 as a magical boundary is arbitrary and often misunderstood as a marker of importance. It encourages binary thinking—either an effect is “significant” or it isn’t—rather than promoting an understanding of the strength, direction, and reliability of effects.
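A quick simulation of the learning-app scenario, with invented numbers (a 1-point gain on a test with a standard deviation of 15), shows how significance scales with sample size while the effect size does not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

for n in (100, 1_000, 100_000):
    control = rng.normal(70, 15, n)   # hypothetical scores without the app
    app     = rng.normal(71, 15, n)   # 1-point average gain with the app
    d = (app.mean() - control.mean()) / 15     # effect size in sd units (~0.07)
    p = stats.ttest_ind(control, app).pvalue   # two-sample t-test
    print(f"n = {n:>7}: Cohen's d ~ {d:.2f}, p = {p:.2g}")
```

The p-value shrinks by many orders of magnitude as n grows, while the effect stays fixed at roughly 0.07 standard deviations, which is the number that actually matters for deciding whether the gain is worth anything.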
Another issue raised here is generalizability. Researchers often draw broad conclusions from limited samples without fully understanding the population they are generalizing to. A study on undergraduates at a single university is taken to say something universal about human behavior—a leap of logic not supported by the data.
The Misuse of Statistical Significance
"So in many references to studies, they talk about significance, statistical significance, without recognizing that that's a trivial thing, and it may be an artifact of the dubious nature of the statistical tools and the inapplicability. But in any case, the key issue is not significance—it's effect size, and that's not really dealt with in many treatments of statistical studies."
Statistical significance is often misinterpreted as a sign that something is "real" or "important." In fact, it merely indicates that data at least as extreme as those observed would be unlikely if the null hypothesis, and the rest of the model, were true. And given the widespread misuse of statistical methods (questionable assumptions, data dredging, p-hacking), the p-value loses much of its intended meaning.
To illustrate: imagine a fitness study finds that drinking beet juice before exercise leads to statistically significant improvements in endurance. If the average improvement is 12 seconds on a 10-kilometer run, this result might be significant in a statistical sense, especially with a large sample size. But for a casual runner, the benefit is negligible. Yet, such results are frequently used to market products or fuel media headlines, often with no discussion of how large or meaningful the effect actually is.
Effect Size and Epistemic Intractability
"And it has a problem in any case, because we have variability, we have data, we have means, we can compute statistics based upon some probabilistic assumptions, but the effect size is buried within noise, so there's no known technique. It's epistemically intractable to figure out what the true effect size is, so we just approximate by saying, well, it's the means, even though we know that's not true. We can't actually come up with a statistic that gives effect size, we can only estimate it."
This highlights a deeper epistemological issue: even when we compute an average effect, that figure is not the true, stable effect—it is merely an approximation derived from noisy data under imperfect assumptions. The actual magnitude of an effect, especially in messy real-world contexts, may be unknowable in principle.
For example, suppose researchers claim that cognitive behavioral therapy (CBT) reduces depression scores by an average of 5 points on a particular scale. That estimate is based on a specific population, specific measurement tools, and specific circumstances. In reality, the effect size likely varies widely across individuals and contexts. There may be no single "true" number, only a family of context-sensitive estimates. Yet much scientific communication treats point estimates as definitive truths, giving a misleading impression of precision.
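One way to make the "we can only estimate it" point concrete is to report an interval rather than a single number. The sketch below uses simulated change scores (purely illustrative, not CBT trial data) and a percentile bootstrap to show how wide the plausible range around a "5-point" average can be:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical change scores for 40 patients: average improvement near 5 points,
# but with large individual variation (sd = 8), including some who get worse.
changes = rng.normal(5, 8, 40)

# Percentile bootstrap for the mean change
boot_means = np.array([
    rng.choice(changes, size=changes.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

print(f"point estimate: {changes.mean():.1f} points")
print(f"95% bootstrap interval: {lo:.1f} to {hi:.1f} points")
```

Even this interval only quantifies sampling error under the model's assumptions; it says nothing about how the effect shifts across populations, therapists, or measurement scales, which is the deeper intractability the remark points to.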
Practical Relevance of Effect Size
"But yet effect size is what we really need to know for real-world use of studies. If someone says, H-I-I-T training improves my lifespan, I don't want to know that it's statistically significant or not, I want to know how much of an effect are we talking about? Does it add a few extra days, as in a lot of studies on statins? Or does it add an extra year? And can we even know, given that it's statistical and not deterministic?"
The issue raised here is that real-world decisions—about health, education, public policy—require understanding not just whether something works, but how well it works. A statistically significant finding that adds a few days of life expectancy may be irrelevant for most people, especially if the cost, inconvenience, or side effects of the intervention are substantial.
Statin drugs provide a useful case study. Many trials show statistically significant reductions in cardiovascular events, but for many individuals, the absolute benefit in life expectancy is measured in days or weeks—not years. Without understanding the magnitude of the effect, individuals are left making decisions in the dark, guided by statistical abstractions rather than meaningful expectations.
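To see why relative and absolute framing diverge, here is the standard absolute-risk-reduction and number-needed-to-treat arithmetic, with deliberately hypothetical inputs rather than figures from any statin trial:

```python
# Hypothetical 5-year baseline risk of a cardiovascular event and a
# hypothetical 25% relative risk reduction -- illustrative numbers only.
baseline_risk = 0.10            # 10% risk without treatment
relative_risk_reduction = 0.25

treated_risk = baseline_risk * (1 - relative_risk_reduction)   # 7.5%
arr = baseline_risk - treated_risk        # absolute risk reduction: 2.5 points
nnt = 1 / arr                             # number needed to treat: 40

print(f"absolute risk reduction: {arr:.1%}")
print(f"number needed to treat over 5 years: {nnt:.0f}")
```

A "25% reduction" headline and "about 40 people treated for five years so that one avoids an event" describe the same hypothetical numbers, but only the second gives an individual any sense of the magnitude involved.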
The Need for Estimated Magnitudes
"But effect size is still something we'd like to have some sort of an estimate of to find out if the thing we're about to do is even worth doing. Without some way of estimating how big an effect, it's all a crapshoot. Well, it's a practical issue. I've seen a lot of videos where doctors and fitness experts say that the way to a better and longer life, a healthier life, is HIIT training. But they always make that as a blanket assertion. They never tell me the effect size, because they don't know."
This is a pragmatic concern: recommendations without magnitude are little better than slogans. The popularity of high-intensity interval training (HIIT) in health media is an example. Claims abound about its effectiveness for longevity, fat loss, cardiovascular health—but these claims rarely include concrete figures. How much longer does one live? How much more fat is lost? How much better are the biomarkers?
For someone making time-consuming lifestyle changes, knowing that an intervention provides a 2% improvement versus a 20% improvement makes all the difference. Without magnitude, behavioral recommendations lack actionable precision and veer toward motivational rhetoric.
Uncertainty and Individual Prediction
"And also, average effect sizes are what we're talking about. We can't predict individual results. We can only do it probabilistically. But I'd still like to have some probabilistic idea of how much of an effect are we talking about. Am I going to live an extra 10 years if I do HIIT training, or an extra three days? Makes a difference, you know."
The distinction between group averages and individual outcomes is crucial. Scientific studies can offer average effects based on statistical aggregation, but these rarely translate cleanly into individual-level predictions. Two people following the same exercise or diet program may experience radically different results due to genetic factors, existing conditions, or unknown variables.
A real-world analogy is weather forecasting. A 30% chance of rain tells us something about a region's aggregate risk, but it doesn't say whether rain will fall on a specific backyard. Likewise, if HIIT increases average life expectancy by 3 months, it doesn’t mean every participant gains 3 months. Some may gain nothing; others may gain more.
Skepticism of Research Quality and Claims
"And the other thing is, every fitness proponent—medical, scientific, research trainer, what have you—has a different idea on the suitable, most appropriate, most health-enhancing exercise. And I don't know whether any of the research is sound. But I need to know if it's sound, if it's well-conducted. I never have certainty, of course. I'm not looking for certainty. I'd like to have it. It would be nice, but you don't expect it."
This is a call for epistemic humility. In domains filled with contradictory recommendations—such as health and fitness—consensus is elusive, and methodological transparency becomes critical. The abundance of conflicting claims, each citing some form of research or anecdotal success, makes it hard to discern sound guidance from trend-driven speculation.
Consider the shifting landscape of diet recommendations. Low-fat was once king, then low-carb, then paleo, now intermittent fasting. Each has proponents and detractors, studies and counter-studies. The average person is left with the sense that “experts” don’t agree—and perhaps don’t know. Without strong, replicable evidence and clarity about effect sizes, public trust erodes.
Marketing vs. Meaningful Effects
"But I'd like to know if the research was well-conducted, and if the effect sizes were big enough to make a difference to my life, or just marketing or something. I think a lot of these claims for fitness benefits aren't even backed up by research. They're more based on personal experience. Totally confounded, of course."
This final reflection ties the concerns together: without rigorous methodology and meaningful effect size reporting, research collapses into anecdote and marketing. Personal experience, while valuable, is deeply confounded—affected by placebo, confirmation bias, regression to the mean, and motivational enthusiasm.
Fitness marketing often exploits this. Before-and-after photos, personal testimonials, and claims of transformation are used to sell methods or products. But without independent verification and quantified effects, such claims are promotional, not scientific.
The core issue is not whether individual results can be impressive—they often are—but whether those results generalize, scale, and justify the time, effort, or cost for others. Without effect sizes, evidence devolves into persuasion.