Understanding the World – The World is a Crapshoot
Can statistics help us see more clearly? Where, when, how?
Note: I go where a saner person might fear to tread, with half-baked opinions. Fully baked ones will cost extra. Yes, it could be more concise, but my brain hurts already.
I want to thank ChatGPT for the formatting and headings, and Dall-E for the picture. Have you noticed that graphic AI is getting the hands right these days?
Thought Experiment: The Deity and Statistics
As a thought experiment, imagine the deity, the omniscient one, runs all research and publication and remedies all the identified flaws in the use of statistics. The question then becomes: do we find that statistics work in various domains or not, now that we have eliminated all problems with their application?
Verification Through Omniscience
How do we verify this? Well, it turns out that the deity knows, so he tells us whether they work or not. If he tells us they don’t work, it means the methods themselves don’t work. If he tells us they do work, it means the problem was in the application of the methods.
Absurdity of the Experiment
So, that’s a thought experiment. Of course, being omniscient, the deity could just tell us the answer straight away without running the experiment—but that’s a different thought experiment, equally absurd.
Games of Chance in Human History
I imagine that people have been intuitively assessing odds and playing games of chance from time immemorial, surely long before recorded history.
Of course, as far as I know, we have no evidence of that at the moment, but my supposition is that for as long as people could reason, they played games of chance.
Probably they played games of chance before they could even count, but that is entirely conjecture: not proven and, as far as I can determine, not provable unless we have a TARDIS, a time-travel device. Does Amazon sell them?
Risk and Chance in Daily Life
Living as a Crapshoot
Living in the world is a crapshoot. Let me explain why.
We don't usually explicitly think of this, but everything we do involves some threats and opportunities. Collectively, we call that risk. Risk is just another way of looking at odds—chance.
How well we'll succeed, how badly we'll fail. To understand this, we use either intuition or numbers produced by some method or other; there are lots of methods.
Quality Assurance and Purpose
Role as Acting Director of Quality Assurance
At one time, I was acting Director of Quality Assurance in a government Information Systems department, and I did a lot of reading on quality assurance: W. Edwards Deming, Philip Crosby, and others. It was a long time ago, so I'm forgetting exactly who, but the essential message from one of them, I think Crosby, was that you need to make things fit for purpose, which is kind of a slogan.
And then you need to look at that to understand what “purpose” is and what "fit" means. I would say the purpose is one of understanding things and events in the world in order to predict outcomes and control things. That seems to be why we use probabilities, odds, and the various considerations of chance. And fitness means how well, and in what situations, you can achieve this—or fail to achieve this—prediction, control, and understanding.
Dual Views of Probability
Objective and Subjective Perspectives
There are two ways of viewing probability. One is as an objective feature of the world, and the other is as a subjective feeling or hunch about things, or maybe an inner calculation without using numbers, or numbers based on measurement and counting.
Approaches to Counting and Measurement
Formal Methods
I can currently identify the following approaches toward counting and measurement: formal methods, where you have some axioms and assumptions, perform some calculations, and come out with some numbers. A lot of different sorts of numbers, it turns out.
Categories of Statistical Methods
So, there's basic probability based on counting, permutations, and combinations.
There's descriptive statistics, which rests on counting and measurement, where a pattern is observed numerically. So, that's not even getting into odds or chance.
And then, when we get into statistics proper, we have methods that are called frequentist, methods that are called Bayesian, and methods that are called signal detection. There could be others, but I'm not currently aware of them. They're all based on calculation: axiomatic, formal, deterministic, with assumptions that must be satisfied for the alleged soundness of the conclusions.
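Going back to the first, counting-based kind: here is a minimal sketch in Python, using the standard textbook example of drawing cards. The example and the numbers are mine, purely for illustration.

```python
from math import comb

# Counting-based probability: favorable outcomes divided by total outcomes.
# Example: the probability that 5 cards drawn from a standard 52-card deck
# all share one suit (a flush, counting straight and royal flushes too).

total_hands = comb(52, 5)        # every possible 5-card hand
flush_hands = 4 * comb(13, 5)    # pick a suit, then 5 of its 13 cards

p_flush = flush_hands / total_hands
print(f"P(flush) = {flush_hands}/{total_hands} = {p_flush:.5f}")  # about 0.00198
```

Nothing more than counting is going on here; the statistics only start once we move from counting outcomes to inferring things from samples.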
Applications of Probabilities
Use of Counting and Measuring
These methods calculate probabilities by counting and measuring, and sometimes by combining and permuting.
Probabilities and statistics are used extensively today. In some cases, they seem to work pretty well. They can reveal patterns, but they can't predict individual outcomes; we use them to predict patterns overall and then take guesses at particular outcomes.
Simple Probabilities
In the case of simple probabilities—dice, coin flipping, roulette wheels, games of chance—we can calculate odds, and in the long run, we can win if we calculate them correctly.
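For instance, here is a small sketch, assuming a single-zero (European) roulette wheel, of how the long-run arithmetic works out. The point is only that whoever has correctly calculated the edge, in this case the house, wins in the long run.

```python
import random

# An even-money bet (say, red) on a European wheel: 18 winning pockets out of 37.
p_win = 18 / 37
expected_value = p_win * 1 + (1 - p_win) * (-1)
print(f"Expected value per 1-unit bet: {expected_value:.4f}")  # about -0.027

# The long run: simulate many bets and watch the average drift toward that value.
random.seed(1)
n_bets = 100_000
results = [1 if random.random() < p_win else -1 for _ in range(n_bets)]
print(f"Average result over {n_bets} bets: {sum(results) / n_bets:.4f}")
```

The bettor's average loss per unit settles near 2.7 percent; that is the whole business model.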
Advanced Considerations
In some situations, it gets very tricky to use the math because you encounter advanced considerations where you're not quite sure how to apply the mathematical model to the world.
Classification of Cases
Easy, Medium, Hard, and Incoherent Cases
I've written before, in another essay, about easy cases, medium cases, hard cases, and incoherent cases. Each of these cases specifies not just the statistical methods used but also the domains of applicability and the kinds of successes we can expect. The fitness for purpose, in other words.
Issues in Modern Research
Fitness in Modern Research
I contend that, although there are some realms where statistical methods are arguably fit for purpose, in large areas of modern research they are arguably not: perhaps they fit the purpose poorly, or perhaps not at all.
Replication Crisis
The replication crisis is one indication of this, although theoretical debates range all over the map. Truly understanding the details is beyond my pay grade, but I do know that these issues exist.
Equal Weight for Methods
Importance Across Realms
You have to give all methods equal weight because they're all important and used in certain realms. In fact, my training in experimental psychology and psychophysics was heavily weighted toward signal detection methods.
Fitness for Purpose and the Debate on Appropriate Tools
Getting back to fitness for purpose, statistics involves, in part, applying the right tools to the right domain. However, determining which tool is appropriate for which domain is often the subject of intense debate among mathematical scholars. Even experts in the field can't agree.
So, what am I, as someone with very lightweight training in the field, to make of it all?
Underlying Methods and Misapplication
Well, a few things. First, it may be that things are fit for purpose because the underlying method is sound. Or, perhaps, things are not fit for purpose because the underlying method is not sound—either in that particular domain or in any domain. Alternatively, the method might be sound but misapplied due to any number of faults, including imperfect understanding of the method, bias, financial incentives, or even outright dishonesty.
Ioannidis and the Fallibility of Research
The issue of problematic research has been much discussed by many scholars, and I won't attempt to give an exhaustive list. Search engines will give extensive citations. However, for those interested, John P. Ioannidis has written extensively on these issues in many publications. One of his most notable works, in terms of citations, argues that most published research findings are false. Ioannidis is a biostatistician, and his arguments are complex and difficult to follow for someone with my rudimentary training in statistics.
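For readers who want the flavor of the argument without the full paper, here is a minimal sketch of the core arithmetic, with illustrative numbers of my own choosing. The idea is that the share of "significant" findings that are actually true depends heavily on the pre-study odds that the tested hypotheses are true; bias and multiple testing, which Ioannidis layers on top, only make things worse.

```python
def positive_predictive_value(pre_study_odds, alpha=0.05, power=0.8):
    """Fraction of 'significant' findings that reflect a true effect,
    ignoring bias and multiple testing."""
    true_positives = power * pre_study_odds   # true hypotheses that get detected
    false_positives = alpha * 1.0             # false hypotheses that sneak through
    return true_positives / (true_positives + false_positives)

# If only 1 in 10 tested hypotheses is actually true (odds of 1 to 9):
print(f"PPV at long odds: {positive_predictive_value(1 / 9):.2f}")   # about 0.64
# If half of tested hypotheses are true (odds of 1 to 1):
print(f"PPV at even odds: {positive_predictive_value(1.0):.2f}")     # about 0.94
```

In exploratory fields testing long-shot hypotheses, and with any bias added, the fraction of true positives can drop below one half, which is the sense in which "most findings are false."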
The Frequentist Approach
Historical Development of Frequentist Statistics
In statistics, there are various branches, of course, as I've already mentioned, but the one I was trained in is called frequentist. It has a long history, with two influential groups involved, almost a century back: those associated with Fisher, a scientist and statistician in the UK, and Neyman and Pearson, statisticians from elsewhere who collaborated extensively.
They had two fundamentally different conceptions of statistics, its underlying conceptual foundations, and how it should be applied. Nowadays, we seem to have merged the two in a way that is arguably not particularly defensible.
Theoretical Foundations and Practical Application
The Role of Conceptual Foundations
Although the underlying theoretical and conceptual foundations of a method affect how it is applied and interpreted—in fact, they govern how the calculations are supposed to be done—it may be that the explanations are wrong or even incoherent, as I will argue later.
Usable Results Despite Theoretical Issues
However, the calculations might still yield usable results that are fit for some purposes. Of course, verifying this is necessary, but that verification itself is problematic, particularly in realms like medicine and other softer disciplines.
Proper Application and Misapplication
Garbage In, Garbage Out
I don't want us to lose sight of the fact that even if a method is fit for purpose and properly designed, it doesn't follow that it will be properly applied. This can result in a lot of garbage input and garbage interpretations, of course.
Criticisms of Frequentist Statistics
Null Hypothesis: Logical Issues
There are many criticisms of these methods in the literature, but most are beyond my pay grade, as I've said before. However, some of the criticisms of the formalism of frequentist statistics involve the notion of the null hypothesis, with some saying it's inadequate. The null hypothesis is a straw man positing that there is no effect. I argue that it's not just inadequate, it is logically incoherent.
That may be a failure of my understanding, though. Fifty years ago, I thought I understood it; now I say it makes no sense to me.
Arbitrary Thresholds and Significance
Nevertheless, there are criticisms around the null hypothesis, such as its lack of treatment of the alternative hypothesis and the fact that significance is determined using an arbitrary threshold, whose meaning is unclear in any case. It's essentially about how numbers stack up against some arbitrary threshold misleadingly called statistical significance.
Another issue is the whole idea of inference from a sample to a population, where the concept of a population is often ill-defined in many cases.
Significance and Effect Size
Conflation of Two Senses of Significance
One criticism of frequentist methods is that they conflate significance, which has two very different senses. One is the failure to meet, or the meeting of, an arbitrary threshold called the significance level.
This threshold has little to do with "significance" in the real-world sense, which means observing an effect and understanding how big it is. Frequentist significance testing doesn't, by itself, tell you the magnitude of an effect; at the very least, effect size is an underemphasized feature of the approach.
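A small sketch of that conflation, using made-up data: with a large enough sample, a trivially small difference clears the conventional 0.05 threshold even though the effect size is negligible. The numbers here are invented purely to make the point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups differing by a trivial 0.05 standard deviations.
n = 20_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.05, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p-value   = {p_value:.4f}")    # typically far below 0.05 at this sample size
print(f"Cohen's d = {cohens_d:.3f}")   # around 0.05: 'significant' yet negligible
```

Statistically significant, practically meaningless: the two senses of significance have come apart.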
Limitations of Frequentist Methods
Correlation, Not Causation
Another criticism is that it doesn’t directly show causality. Well, none of the methods do. Causality is an inference made by the experimenter, and the methods themselves only demonstrate correlation.
The model may define dependent and independent variables, but this is fundamentally flawed because the calculations don’t actually make such distinctions in any meaningful sense, as far as I can remember. If they do, there are still problems with interpretation in any case.
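As a quick sketch with invented data: two variables that are both driven by a third, a hidden confounder, correlate strongly, and the correlation coefficient is silent about which, if either, causes the other.

```python
import numpy as np

rng = np.random.default_rng(42)

# A hidden confounder drives both x and y; neither causes the other.
confounder = rng.normal(size=5_000)
x = confounder + rng.normal(scale=0.5, size=5_000)
y = confounder + rng.normal(scale=0.5, size=5_000)

r = np.corrcoef(x, y)[0, 1]
print(f"Correlation between x and y: {r:.2f}")  # around 0.8, with no causal link between them
```

Any causal story about x and y is supplied by the experimenter, not by the arithmetic.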
Logical Coherence of the Null Hypothesis
Rejecting the Null Hypothesis
To me, it seems illogical that the rejection of a hypothesis based on some arbitrary threshold does not also imply the acceptance of some alternative, unspecified hypothesis based on the same arbitrary threshold.
We can critique the arbitrariness, but I would critique the entire coherence of the null hypothesis. It makes no intuitive sense to me, though apparently it makes sense to statisticians—and I used to think it made sense too.
So something has changed. Maybe I've gotten stupider, or maybe I was stupid back then and I'm smarter now. I don't know which, although I prefer the latter supposition.
I claim that, logically, if you reject the null hypothesis, you must also be accepting the alternative hypothesis. It doesn’t have to be explicitly defined, but statisticians don’t see it that way.
So, I obviously can't think like a statistician. I maintain that their position is probably logically incoherent, though they would argue that I just don't understand it. Well, that part is true.
Neyman and Pearson’s Contributions
I would think that if we reject the null hypothesis at some arbitrary level of significance, we should accept the alternative hypothesis at the complement of that arbitrary level of significance. But that doesn’t seem to be how it’s viewed.
Yet, it is my understanding that Neyman and Pearson’s methods of calculation actually do incorporate assertions about the alternative hypothesis as well as the null hypothesis. However, once again, understanding this in detail is beyond my pay grade.
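One place the alternative hypothesis does show up explicitly in the Neyman and Pearson framework is the power calculation, which is computed under an assumed alternative. Here is a rough sketch for a two-sided two-sample z-test with unit variances; the numbers are illustrative only.

```python
from scipy.stats import norm

def power_two_sided_z(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sample z-test (sd = 1 in each group)
    to detect a standardized effect of the given size."""
    se = (2 / n_per_group) ** 0.5      # standard error of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)   # cutoff set by the null hypothesis
    shift = effect_size / se           # where the alternative centers the statistic
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

print(f"Power to detect d = 0.5 with 64 per group: {power_two_sided_z(0.5, 64):.2f}")  # about 0.81
```

So the alternative is in the machinery; what the standard reporting conventions do with it afterward is another matter.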
Personal Motivation for Exploring Statistical Issues
I think we need to talk about why I’m concerned with this topic. Well, it’s all academic for me right now, but I’ve got to do something between the time I’m born and the time I die, and what I’ve chosen to do in recent years is to write.
Right now, I’m also thinking about statistical issues because of pragmatic concerns—specifically, trying to determine what I need to do to improve my health. I’m getting old, so that’s becoming more and more of a consideration. What I’m finding is that the advice I’m getting from various sources on certain health issues and drug treatments is so conflicting that I don’t know where to start, other than flipping a coin.
Formal Training and Initial Skepticism
Although I was trained in statistics, that training amounted to a graduate-level course in statistics for experimental psychology dummies. I'm not sure if I'm being humorous there or not. In any case, I was formally trained years ago and did okay in my courses. Actually, I got the best mark, apparently, in the graduate course. I didn't do very well in the undergraduate courses; it's amazing I got into grad school, but then I didn't work very hard as an undergrad at first (or at all, really).
Distrust in Biomedical Research
At some point in my life, I read a paper on the failure of research, with some comprehension. I'm sure I read it several times. I also heard, sometime in the past (I don't know if it was just before or after reading Ioannidis), about the replication crisis, and I have heard many times alternative claims about certain drugs I've been told to take, like statins, which are associated with the cholesterol hypothesis. This hypothesis has been critiqued from a number of alternative perspectives, detailed in book after book. So, if the mainstream claims and the alternative claims are both based on statistics (and they are), we have a problem, Houston. Did I mention that bit about flipping a coin as a rational solution?
Fitness for Purpose in Biomedical Research
Now, looking at research, I’ve realized that research in the biomedical field is not to be trusted for various reasons. I think it’s associated with fitness for purpose, but whether the methods themselves or their application is flawed, I’m not sure. It leaves me in a quandary. As I said, flipping a coin might be just as reliable, given the state of information and understanding in the biomedical research community.
Evidence-Based Medicine and Shaky Evidence
Although the medical establishment, including the meta-analysis folks, claims to practice evidence-based medicine, I would argue that the evidence is pretty shaky for a number of reasons. For example, Ioannidis has made well-supported claims that most published research findings are false. One can argue about the percentages, but many others have buttressed Ioannidis' case. Additionally, studies show that replication is failing left, right, and center in certain fields.
Defining and Understanding Failures
This raises several issues. First, are these assertions about failure, falsehood, and replication problems well-grounded? I think so—arguably so. Second, what are the causes? What does failure of replication mean? What does it mean for findings to be false? These are epistemological issues, I suppose, or definitional issues at worst.
Causes of Research Failures
Looking at causes, some are not related to probability and statistics at all. Arguably, many real causes are institutional, involving individuals, motivations, money, knowledge, and understanding—any number of factors.
Statistical Methods and Fitness for Purpose
But some of the problems are statistical. Many assert that these problems stem from the misuse of statistics, which is undoubtedly true. However, I would argue that beyond the misuse, the methods themselves may be the problem: they are not fit for purpose. It's not just that they're used incorrectly; it's that they may be inherently unfit for purpose in domains like biomedical research, psychological studies, or other soft sciences. That case is arguable too.
So, this isn’t just about improper use—it’s also about methods that don’t work very well or at all in these domains.
And let’s not avoid the word fraud or the expression “follow the money.” Those are key parts of the problem too.
Key Question: Application or Fitness of Methods?
My key point is this: is the problem one of application, or is it an essential lack of fitness for the problem? Statistics may not work at all, no matter how well they are applied.
We know there are non-statistical problems, and we know there are problems around the application of statistics. But this raises the question: is the problem simply that we’re not using statistics well, or is it that statistics do not work in certain contexts—contexts where we assume, without verification, that they do? This assumption is both unverified and essentially unverifiable.
The Limits of Improving Statistical Methods
In practice, we will never be able to clean up our act to the extent required to make the methods—if they are inherently fit—work, due to any number of factors related to the fact that we are people, operating within organizations. It’s simply not possible to address all the faults identified by Ioannidis and others.
The Replication Crisis and Its Implications
Now, if the problem extends to the methods themselves, it will require a total mindset change. The replication crisis indicates there may be a problem, and part of it may be due to statistics. But I would contend that part of it may be due not just to the misapplication of statistics, but to the fact that the methods themselves are simply not fit for the purpose of understanding, prediction, and control in complex domains such as medicine and psychology.
Real-World Consequences of Statistical Failures
This has real-world consequences since we routinely make life-or-death decisions based on statistical evidence, which may not be fit for purpose in any sense.
Judgment in Statistical Techniques
All statistical techniques involve judgment at the input and output stages. Bayesian priors are not subjective—they are based on judgment. The mathematical computations themselves are deterministic, but everything else depends on judgment.
Do not treat Bayesian statistics as being based on subjective priors—that’s nonsense. It is about judgment in input and judgment in interpreting output, as is the case with all methods of analysis, computational or not.
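To make the role of judgment concrete, here is a minimal beta-binomial sketch with made-up numbers: two analysts choose different priors by judgment, see the same data, and arrive at different posteriors. The computation in between is fully deterministic.

```python
from scipy.stats import beta

# Observed data: 7 successes in 10 trials (say, 7 of 10 patients improved).
successes, trials = 7, 10

# Two priors chosen by judgment: one skeptical of improvement, one roughly flat.
priors = {"skeptical Beta(2, 8)": (2, 8), "flat Beta(1, 1)": (1, 1)}

for label, (a, b) in priors.items():
    posterior = beta(a + successes, b + trials - successes)
    print(f"{label}: posterior mean = {posterior.mean():.2f}")
# Same data, same deterministic update rule; different judgments in, different answers out.
```

The judgment enters at the edges, in choosing the prior and in deciding what the posterior is worth, exactly as it does at the input and output of any other method.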
Problems in Bayesian and Machine Learning Methods
The problems with Bayesian techniques and emerging machine learning methods may be different from those with frequentist methods, but they are all problematic. I don't know what the systems-level modeling that ChatGPT suggested amounts to, so I have no idea whether it's an improvement or not. But I suspect that machine learning models are an illusory hope.
Nietzsche’s View on Facts and Interpretations
As Nietzsche observed, “There are no facts, only interpretations.” While this might have been too sweeping a claim, much of what we believe to be fact is, in reality, interpretation—and that’s easy to prove.