Date: 2025-07-30

Author’s Preface

In this series of essays called Reason, I opine about the world, analyzing concepts and questioning assumptions on the basis of my current understanding, which rests on decades of reflection and reading. Do I deliver truth? Well, what’s truth? I’m not being a smart-ass here; I really don’t know. But “corresponds with the world” is one way of looking at it, though it has its flaws too. All I can say is: don’t believe everything you think, and moreover, don’t believe everything the other person thinks either.

In previous months I wrote a couple of essays on measuring the unmeasurable. Now I'm going to reflect further. I've done a little more thinking, a little more research, and I'm going to talk about calculating the incalculable: how we actually use number surrogates for calculation in ways that the theorists say we can't, and yet it seems to work. Practice trumps theory.

Introduction

Measurement theory draws a clear formal line: some kinds of numbers can sustain arithmetic operations; others cannot. Interval and ratio measures may be added, subtracted, multiplied, and divided; ordinal and nominal measures may not. In that formal, mathematical sense, the theory is largely correct.

Yet in everyday practice, people cross that line constantly—and with some success. They add together ordinal scores, compute averages from ratings, rank athletes and rank teams by calculated points. They grade students by adding disparate item scores. Formal theory says these manipulations are meaningless. Empirical experience says otherwise.

This essay looks at the historical and contemporary ways in which humans have calculated on what theory says cannot be calculated. It distinguishes between what is documented and what is speculative, examines the history of thought about measurement before S. S. Stevens’s well-known 1946 typology, and reflects on why such violations sometimes work and sometimes fail.

Discussion

1. Documented and Speculative History of Using Numbers in “Improper” Ways

Speculative but Plausible Prehistory

We do not and cannot know the full story of early numerical use. Written records only appear after a long pre-numeric history. It seems plausible that long before writing, people used tallies and marks to count items of trade, days until a seasonal event, or the number of animals hunted. Less certain, but also plausible, is that they used numbers in symbolic or qualitative ways: ranking warriors, comparing the “greatness” of leaders, or scoring contests informally. It is reasonable to infer that even in early societies, numbers may have been applied to qualities, and those numerical symbols may have been combined arithmetically because it worked. There would have been no theorists to say no.

Recorded Historical Practice

By the early modern period, records clearly show numbers being used for purposes that violated later measurement theory. Gambling odds were routinely calculated and recalculated on uncertain events. Betting books in 18th- and 19th-century England show odds moving according to wagers placed, regardless of whether the events themselves could be assigned a true probability.

In sport, points systems emerged that treated ordinal events—first, second, third place—as numbers, summed them, and declared winners based on the totals. The modern Olympic scoring of decathlon events still aggregates performances across qualitatively different events by assigning each a point value and summing.

At some point, students’ answers to questions were scored and added to give academic rankings. When I was in elementary school, we even sat at our desks in order of our ranking.

2. Modern Examples Across Domains

IQ and Psychometrics

From the early 20th century onward, intelligence testing combined item scores, each reflecting a mixture of skills, knowledge, and problem-solving, into a single number, often by adding points and scaling the result to a standardized curve. By Stevens’s formal categories, such an operation assumes at least an interval scale, but in reality, the underlying items are neither uniform nor strictly additive. It is actually incoherent to attribute size to such categories. Nevertheless, these IQ scores became widely used for education placement, employment screening, and military selection. The fact that they only examine intelligence under a few narrow hypothesized dimensions, computed by proxy, is irrelevant to this discussion.
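The add-and-rescale step described above can be sketched roughly as follows. This is only an illustration: the item scores are invented, and the target mean of 100 and standard deviation of 15 follow a common convention for IQ-style scores that is assumed here rather than taken from this essay. Real tests are normed against a large reference sample, not against the handful of people being scored.

```python
from statistics import mean, pstdev

def standardized_scores(raw_totals, target_mean=100.0, target_sd=15.0):
    """Rescale raw totals so the group has the target mean and SD.

    target_mean and target_sd are assumed conventions, not values from the essay.
    """
    mu = mean(raw_totals)
    sigma = pstdev(raw_totals)  # population SD of this group
    if sigma == 0:
        raise ValueError("all totals are identical; cannot rescale")
    return [target_mean + target_sd * (t - mu) / sigma for t in raw_totals]

# Hypothetical item scores (1 = correct, 0 = incorrect) for four test-takers.
item_scores = {
    "A": [1, 1, 0, 1, 1, 0],
    "B": [1, 0, 0, 1, 0, 0],
    "C": [1, 1, 1, 1, 1, 1],
    "D": [0, 1, 0, 1, 1, 0],
}

# Step 1: add the item scores, treating every item as worth the same "amount".
raw = {name: sum(items) for name, items in item_scores.items()}

# Step 2: rescale the totals onto the standardized curve.
scaled = dict(zip(raw, standardized_scores(list(raw.values()))))
```

Both steps do what the theory questions: the sum treats unlike items as interchangeable units, and the rescaling treats the totals as interval data.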

Pain Scales

Pain ratings, often taken on a 0–10 verbal or visual analog scale, are inherently subjective and ordinal. Yet researchers and clinicians compute means, compare percentage differences, and apply parametric statistical tests to them, all violations of strict measurement theory. The results nevertheless guide clinical practice. The gaps between adjacent ratings are of course not consistent from one step to the next, but the ratings certainly make sense at the extremes.

Sports Rankings

League standings in many sports award points for wins, ties, and sometimes even for narrow losses. These points are summed across a season, producing a ranking. Playoff qualification, championships, and player bonuses depend on these sums. Strictly speaking, the numbers represent counts of heterogeneous events, yet they are treated as additive scores.
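As a small illustration of the summing described above, here is a sketch of a league table. The point values (3 for a win, 1 for a tie, 0 for a loss) and the team names are assumptions chosen for the example; actual leagues use a variety of schemes.

```python
from collections import defaultdict

# Assumed point values; real leagues differ.
POINTS = {"win": 3, "tie": 1, "loss": 0}

def standings(results):
    """Sum points per team and rank by total (ties broken by team name)."""
    totals = defaultdict(int)
    for team, outcome in results:
        totals[team] += POINTS[outcome]
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical three-match season, one (team, outcome) pair per side per match.
season = [
    ("Reds", "win"), ("Blues", "loss"),    # Reds beat Blues
    ("Reds", "tie"), ("Greens", "tie"),    # Reds and Greens draw
    ("Blues", "win"), ("Greens", "loss"),  # Blues beat Greens
]

table = standings(season)
# table == [("Reds", 4), ("Blues", 3), ("Greens", 1)]
```

The choice that a win is worth exactly three ties is a convention, not a measurement, yet the resulting sums decide the ranking.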

Gambling and Probability Forecasting

Bookmakers use odds both as a pricing mechanism and as a predictive tool. Odds shift dynamically based on betting patterns, which may or may not reflect true likelihoods. Analysts in horse racing and other betting markets often model outcomes using historical performance rankings, averaged scores, and weighted factors—all of which apply arithmetic to ordinal or mixed-scale data.

Educational Grades and Rankings

Letter grades are mapped to numbers (A = 4.0, B = 3.0, etc.), summed, and averaged to produce grade-point averages (GPA). These are then ranked to determine class standing. In theory, grades represent broad performance categories, not fixed intervals. In practice, they are treated as interval data.
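The GPA calculation can be sketched in a few lines. The values for A and B come from the mapping above; the values for C, D, and F and the weighting by credit hours follow a common convention and are assumptions here, as is the sample transcript.

```python
# A and B as given in the text; C, D, F are assumed from common practice.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa(courses):
    """Credit-weighted grade-point average.

    courses: list of (letter_grade, credit_hours) pairs.
    """
    total_credits = sum(credits for _, credits in courses)
    if total_credits == 0:
        raise ValueError("no credit hours to average over")
    quality_points = sum(GRADE_POINTS[grade] * credits for grade, credits in courses)
    return quality_points / total_credits

# Hypothetical transcript: (grade, credit hours).
transcript = [("A", 3), ("B", 4), ("C", 3)]

result = gpa(transcript)  # (12 + 12 + 6) / 10 = 3.0
```

The arithmetic only makes sense if the step from C to B is the same size as the step from B to A, which is exactly the interval assumption the theory says letter grades do not earn.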

3. The Operations Themselves: How Theory Says “No” but Practice Says “Yes”

The key operations used in these domains—addition, subtraction, multiplication, division—are, in formal measurement terms, restricted to interval and ratio data. Yet they are routinely applied to ordinal data. The most common forbidden operation is computing the mean.

Other operations include:

Addition – summing points, as in sports league tables or GPA calculation.

Division – computing ratios, such as goals per game or GPA as quality points per credit hour.

Multiplication – weighting scores, as in composite indices for rankings.

Subtraction – finding “differences” in ratings that are not on an interval scale, as in a difference between two pain scores.

These are linguistic manipulations as much as they are mathematical. Arithmetic is a symbolic convention, not a direct link to a Platonic realm of numbers. The formal prohibition arises when the symbols lack the properties assumed in idealized pure arithmetic—yet the conventions persist because they are useful.
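The formal objection can be made concrete with a small sketch. Ordinal labels carry only order, so any order-preserving relabeling is as legitimate as the original. The example below uses invented ratings and an invented relabeling to show that a comparison of means can reverse under such a relabeling, while a comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical ordinal ratings on a 1-5 scale for two groups.
group_x = [1, 1, 5]
group_y = [2, 2, 2]

# An order-preserving relabeling of the same five categories (invented).
relabel = {1: 1, 2: 4, 3: 5, 4: 6, 5: 7}

def compare(stat, a, b):
    """Return '>', '<', or '=' according to stat(a) versus stat(b)."""
    sa, sb = stat(a), stat(b)
    return ">" if sa > sb else "<" if sa < sb else "="

x_relabeled = [relabel[v] for v in group_x]
y_relabeled = [relabel[v] for v in group_y]

mean_before = compare(mean, group_x, group_y)            # X has the higher mean
mean_after = compare(mean, x_relabeled, y_relabeled)     # now Y has the higher mean
median_before = compare(median, group_x, group_y)        # Y has the higher median
median_after = compare(median, x_relabeled, y_relabeled) # still Y
```

This is the theorist's case in miniature. The practical cases in this essay work to the extent that the conventional labels are not arbitrary in this way, or that nobody ever relabels them.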

4. Theoretical Underpinnings Before Stevens

S. S. Stevens’s 1946 paper formalized the nominal–ordinal–interval–ratio typology, but debates about what counts as “measurement” go back much further.

19th-century physics and metrology developed precise definitions of length, mass, and time, insisting that measurement involves comparison with a standard unit.

Early statistics (Quetelet, Galton, Pearson) applied quantitative methods to human traits before resolving the question of whether those traits were genuinely measurable in the physical sense.

Philosophy of science in the late 19th and early 20th centuries debated whether psychological attributes could ever be truly quantified, with figures like Helmholtz and Campbell framing measurement as fundamentally about ratios of magnitudes.

By the time Stevens published, practice in psychology, education, and other applied domains had already been ignoring or bending these stricter definitions for decades.

5. Why It Works in Some Domains and Fails in Others

Some realms tolerate theoretical violations surprisingly well. Sports rankings, GPAs, and even IQ scores can produce stable and useful outcomes over time. In others—particularly in certain areas of social science—these methods seem to collapse, producing results that do not replicate, predictions that fail, and conclusions that mislead.

The reasons are still unclear. Possible factors include:

Stability of the underlying phenomenon – Some systems (like sports performance) may have enough internal consistency to survive rough numerical treatment.

Feedback loops – In gambling, the odds adjust dynamically, partially correcting errors.

Institutional embedding – Once a method is entrenched (like GPA), its usefulness becomes self-reinforcing.

In the end, these possible explanations are all hand-waving, or maybe just-so stories.

6. Incompleteness and Partial Wrongness of the Theory

Formal measurement theory is correct in its narrow domain of pure arithmetic: operations require data with appropriate properties. It becomes incomplete, and sometimes wrong, when it claims that operations outside those boundaries cannot produce meaningful results in practice. Evidence from history, from gambling tables to Olympic scoreboards, shows otherwise. Such results allow description, rough prediction, and control, and their use is found to work empirically. Practice trumps theory, and something has to give.

Summary

Human beings have always pushed numbers into service beyond what theorists now allege is right and proper. I speculate that such practices could be as old as counting itself. More recent documented examples, from gambling odds and sports scores to IQ testing and GPAs, show a pattern: numbers are manipulated in ways that formal theory forbids, often to good practical effect.

This is not to deny that the theory is right in describing the limits of formal arithmetic. But in asserting that arithmetic on ordinal data is inherently meaningless, it is incomplete and sometimes wrong. The central mystery remains: why such violations work in some contexts and fail in others.

Readings

Hand, D. J. (2016). The imperfect science of measurement. Significance, 13(6), 10–15. https://doi.org/10.1111/j.1740-9713.2016.00949.x

This article discusses how measurement in the real world rarely matches the clean, idealized conditions assumed in formal theory. Hand notes that practical measures are often approximate, indirect, and dependent on context, yet can still be useful. The essay’s claim that “practice trumps theory” in many cases finds direct support here: Hand explicitly states that imperfect measures can still yield actionable results. His examples from applied science illustrate how deviations from strict measurement theory are not only common but unavoidable, reinforcing the essay’s emphasis on pragmatic, results-based calculation rather than rigid adherence to theoretical rules.

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge University Press.

Michell provides a historical and philosophical critique of the very idea of “measurement” in psychology, showing that the discipline adopted the language of measurement long before it had methods that could meet the formal definition. He traces this back well before S. S. Stevens’s typology, showing that debates about what counts as measurement go back to the 19th century and earlier. This directly connects to the essay’s point that numerical practices predate—and often contradict—theories about what can and cannot be measured or calculated. Michell’s historical coverage also helps situate examples like IQ scores, educational ratings, and psychological scales as part of a much older pattern of applying numbers where formal theory says they don’t belong.

Porter, T. M. (1995). Trust in numbers: The pursuit of objectivity in science and public life. Princeton University Press.

Porter examines how quantification gained cultural authority in science, government, and public life, even when the numbers themselves were based on shaky foundations. He shows that the social prestige of numbers often allows them to override theoretical objections about validity. This is highly relevant to the essay’s discussion of how calculated results are widely accepted—such as league tables, school rankings, and quality scores—even when they involve adding, averaging, or comparing values that formal theory says should not be treated arithmetically. Porter’s work helps explain the persistence and influence of these practices despite ongoing and ill-founded theoretical doubts about their technical legitimacy.

Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Harvard University Press.

Stigler provides a detailed historical account of how statistical reasoning developed before the 20th century, including early attempts to quantify uncertainty, rank outcomes, and average disparate measures. His coverage of gambling, astronomy, and early social statistics offers historical grounding for the essay’s claim that numerical calculation on “uncalculable” data is not new but an enduring feature of human reasoning. This supports the essay’s observation that people have been aggregating and manipulating heterogeneous data—often without theoretical justification—for centuries, making it a deep-rooted practice rather than a modern aberration.

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.1080/00031305.1993.10475938

Velleman and Wilkinson critique the Stevens typology, arguing that the neat four-level classification oversimplifies real-world measurement. They point out that practical data often mixes features from different categories, making the rigid rules about permissible arithmetic operations unrealistic. This directly supports the essay’s point that in practice, people routinely add, average, and otherwise manipulate data that formal theory says should not be handled that way. Their critique legitimizes the essay’s examples of “forbidden” calculations (e.g., averaging pain scores, computing GPA from letter grades) as normal, if theoretically proscribed, parts of applied numerical work.