Date: 2025-04-10

Note: I was in training to be an experimental psychologist many decades ago. I went to grad school. I did not much question the whole basis of the field at first. Later? Well, color me full-blown contrarian, maybe a devil's advocate. However, I acknowledge that this is only one possible perspective.

Author’s Preface

Measurement in Early Life

When I was a young man, measurement was easy. We had tools that worked and could be seen to work and were understandable. We had rulers and protractors and squares and levels, all of which were not mysterious in the least. And they were reasonably accurate, reasonably precise. They had their limits, of course, but it wasn't important. Partly it was a matter of the skill of the user, and partly it was that the tools did not allow great accuracy and precision, but it was close enough. We were not metrologists.

Technical Training and Measurement Tools

I was a would-be electrician for a while and also went to electronics technology school. There we had measurements based on slightly more abstract concepts: oscilloscopes and so on, and various meters for measuring resistance, voltage, or amperage, which at that time worked primarily by magnetism swinging a needle.

The Measurement of the Unmeasurable

Years later, starting in 1970, I decided I'd become an experimental psychologist, rejecting being a philosopher, thinking that philosophy led nowhere in terms of understanding the world. I learned about measurement of the unmeasurable. But I didn't, at the time, think that things of the mind were unmeasurable. I was just told that this is how you can measure things, using operational definitions, which had to be validated, of course, in various ways.

Things could be scaled in various fashions: nominal, ordinal, equal interval, ratio, perhaps others. I was taught that you could use these numbers in mathematical computations. And generations of mathematicians worked to prescribe and proscribe how that would work, with some arguably dubious results. I learned some of the methods.

Statistical Techniques and the Assumptions of Measurement

It was taught that you could use these numbers for various descriptive statistics, for correlations and for various brands of inferential statistics. I was taught you could operationalize psychological phenomena, create a proxy, and it would give you sound information. No one I was taught by ever questioned that, other than to note that you had to make sure that the measure was validated.

Origins of Measuring the Unmeasurable

I don't know when this trend of measuring the unmeasurable started. I'm sure it was at least as early as the measurement of intelligence, or IQ as it was called, where researchers thought they could assign measures to intelligence. Turns out it was a fairly unsophisticated effort, but the proponents claimed success in predictive validity when it came to certain tasks, primarily military recruiting.

Reaction Time and the Challenges of Variability

Arguably the origins of psychological measurement lie earlier, perhaps in psychophysics, which was my field of graduate study. In my own graduate work, I measured something that was clearly measurable: human reaction time. However, the variability was huge, and it was just about impossible to make sense of the results.

The Use of Likert Scales

But I also did survey work using Likert scales, which I'd been led to believe were a sound way of measuring the unmeasurable. I didn't question it, although I did study the theorizing on how you should construct sound Likert scales. I now regard the results of that work, which assessed professors as teachers, as pernicious and misleading nonsense.

Defining the Unmeasurable

I haven't defined what I mean by unmeasurable. I'm talking about things that are subjective: beliefs, attitudes, personality traits. They may have behavioural manifestations in some cases. Measures of them are usually based on linguistic behaviour, that is, on what people say or report, and only occasionally on other behaviour.

Sometimes there would be physiological measures which are actually measurable: temperature, cortisol, galvanic skin response. There are lots of things that are measurable of course.

The Illusion of Measuring Subjective Factors

But attitudes, beliefs, and other psychological factors may not be inherently measurable. Psychologists nevertheless think they can measure them. Are they right? To what extent? They can propose scales and operationally define how to measure these things. Basically, operational definitions result in recipes. You do this, do that, and you have your measurement.

Proxies

So we end up with proxies, surrogates, stand-ins for poorly hypothesized characteristics of man and mind. These are numbers that are unstable, and yet we attach meaning to them, compute with them, make inferences from them, and attempt to predict with them. It can be argued that so far the results are not all that encouraging.

Introduction: Some Fractured History

Measurement, at its foundation, began as a practical enterprise. In early experience, measurement was simple and concrete: rulers, protractors, levels, and squares. These instruments were physical, visible, and transparent in their operation. Their limitations were known, their precision bounded by material properties and the skill of the user, yet they sufficed for the tasks at hand. They embodied an immediate clarity—measurement of length, angle, flatness, or alignment, all rooted in direct observation of the physical world.

Over time, however, measurement began to reach beyond the plainly tangible. Fields such as electronics introduced tools like oscilloscopes and analog meters—still grounded in physical realities but operating at a level of abstraction one step removed from everyday perception. Electricity itself is invisible, yet its effects can be captured, displayed, and quantified through magnetism and other measurable forces. Measurement remained fundamentally physical, even as it charted into less intuitively graspable domains.

But as the ambition of science expanded, so too did the ambition of measurement. Psychology, aspiring to the rigor of the physical sciences, sought to impose measurement upon subjective phenomena—beliefs, attitudes, personality traits—none of which are visible, tangible, or inherently quantitative. Thus emerged the era of operational definitions, scaling techniques, and the construction of proxies for the unmeasurable. What had once been the clear measurement of physical realities became the questionable measurement of abstractions, where numbers were assigned not to observable properties, but to interpretations of internal states.

In this evolution, measurement drifted from the transparent reliability of physical tools toward an enterprise built increasingly on assumptions—assumptions about what is being measured, assumptions about the validity of proxies, and assumptions that numbers can capture the essence of subjective human experience. This drift is not merely a historical curiosity; it lies at the heart of profound methodological and philosophical problems that continue to afflict psychology today.

Discussion

A Philosophical Transformation

The transition from physical to abstract measurement is not merely a matter of technological advancement, but of philosophical transformation. In the realm of experimental psychology, measurement was redefined. Rather than measuring phenomena directly, psychologists were taught to operationalize concepts: to define a phenomenon by the method of its measurement. If intelligence could not be seen, then it would be whatever intelligence tests measured. This was not cynicism, but a methodological assertion. "Intelligence is what intelligence tests measure" became an emblematic statement of this attitude.

The notion of operational definitions held considerable appeal. It offered a procedural solution to the intractability of abstract phenomena. If attitudes could not be weighed or electrified, they could at least be inferred from linguistic responses on surveys. Likert scales emerged as a favored method, promising to translate subjective experience into ordered numbers that could be subjected to statistical analysis. Constructs could be treated as though they possessed equal intervals or even ratio properties, allowing them to enter the realm of computation alongside physical measures.

However, this practice introduced profound difficulties. The metrics of physical tools were understood by their users; the precision and limits of rulers or voltmeters were apparent. In contrast, the precision of psychological measures was obscured beneath layers of abstraction. Concepts like beliefs or attitudes might manifest behaviorally or physiologically at times, but the measures typically employed were linguistic or self-reported. Responses were filtered through layers of self-awareness, social desirability, memory, and interpretation. Despite this, generations of psychologists adopted these methods, supported by frameworks of validation and reliability testing, and by statistical theories that prescribed allowable manipulations of ordinal and interval data.

Earlier experiences with reaction time offered a lesson in the challenges of even seemingly objective psychological measures. Reaction time can be precisely recorded, yet the variability within and between individuals defies simple interpretation. Factors such as fatigue, distraction, learning and motivation contribute noise to the data, complicating any effort to derive meaningful conclusions. If variability bedevils reaction time—a relatively concrete measure—what hope remains for more nebulous constructs like belief or attitude?

The quest to measure the unmeasurable rests on the belief that proxies can capture meaningful variance. Proxies are recipes: administer a test, score the responses, and interpret the resulting numbers. But whether these numbers correspond to real, stable phenomena is an open question. Psychologists lean heavily on statistical validation—correlations, factor analysis, test-retest reliability—but these methods operate within the closed system of the proxy measures themselves. They do not guarantee that what is being measured corresponds to an external, verifiable reality.

The origins of this trend are not entirely clear, but the early efforts to measure intelligence certainly exemplify it. Intelligence tests, especially those developed for military use, claimed predictive validity for specific tasks, such as classifying recruits. While limited success was claimed, these tests were crude and their scope narrow. Nevertheless, they laid the foundation for an enduring belief in the possibility of quantifying human intellect and behavior.

Psychological Measures and Scales of Measurement

I doubt very much that any subjective psychological measure comes close to either equal interval or ratio scales. A framework for measurement, typically known as Stevens' scales of measurement, was introduced by S.S. Stevens in 1946. The main types are nominal, ordinal, interval (sometimes called equal interval), and ratio. Occasionally, some discussions include variations or additional scales (like "absolute" or "logarithmic"), but these four are the primary classifications used in measurement theory.

A brief, precise outline for clarity:

Nominal Scale

Pure categorization.

No order, no distance, no meaningful zero.

Example: blood type, gender, nationality.

Not numerical in any meaningful sense beyond labeling.

Ordinal Scale

Ranking order, but no assumption of equal distances.

Example: Likert scales ("strongly agree" to "strongly disagree"), military ranks.

You know that A > B > C, but you do not know if the gap between A and B is the same as between B and C.

Permissible statistics: median, mode, rank order correlations.

Interval Scale (Equal Interval)

Ordered categories with equal distances between points.

No true zero point; zero is arbitrary.

Example: temperature in Celsius or Fahrenheit. Zero does not mean the absence of temperature.

Permissible statistics: mean, standard deviation, correlation, linear transformations.

Ratio Scale

Contains all properties of interval scales, plus a true, non-arbitrary zero point.

Example: weight, length, Kelvin temperature. Zero means none of the quantity exists.

Permissible statistics: geometric mean, coefficient of variation, meaningful ratios ("twice as much").
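The difference between an interval scale and a ratio scale can be checked with simple arithmetic. The sketch below is only an illustration: the function names and the three sample temperatures are assumptions chosen for the example. It shows that "twice as much" changes with the unit when zero is arbitrary (Celsius, Fahrenheit), while comparisons of differences survive a change of unit.

```python
def celsius_to_fahrenheit(c):
    """Convert Celsius to Fahrenheit (another interval scale, arbitrary zero)."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin (a ratio scale, true zero)."""
    return c + 273.15

# Illustrative temperatures in Celsius (assumed values, not data).
cool_c, warm_c, hot_c = 10.0, 20.0, 30.0

# "Twice as hot" in Celsius does not survive a change of scale.
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

def step_ratio(convert):
    """Ratio of two successive temperature differences after conversion."""
    low, mid, high = convert(cool_c), convert(warm_c), convert(hot_c)
    return (high - mid) / (mid - low)

print("ratio of readings, Celsius:   ", round(ratio_celsius, 3))
print("ratio of readings, Fahrenheit:", round(ratio_fahrenheit, 3))
print("ratio of readings, Kelvin:    ", round(ratio_kelvin, 3))
# Ratios of differences agree across all three scales (about 1.0 each).
print("ratio of differences:",
      round(step_ratio(lambda c: c), 3),
      round(step_ratio(celsius_to_fahrenheit), 3),
      round(step_ratio(celsius_to_kelvin), 3))
```

Only the Kelvin ratio describes the physical quantity; the Celsius and Fahrenheit ratios are artifacts of where each scale happens to put its zero.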

As applied to psychological measurement:

Most subjective psychological measures do not meet the criteria for interval or ratio scaling.

Likert scales, while often treated mathematically as interval for convenience, are technically ordinal. There is no empirical demonstration that the perceived "distance" between scale points is equal, let alone consistent across individuals.

Constructs like intelligence, anxiety, or personality traits are typically operationalized in ordinal or pseudo-interval forms but do not possess a verifiable natural zero point. For example, an IQ of 0 does not signify the absence of cognitive ability in any literal sense.

Claims that psychological measures operate at interval or ratio levels rely heavily on convenience assumptions, not on demonstrable scaling properties. Statisticians and psychologists often proceed "as if" these measures qualify for advanced operations, but foundational justification is weak.

In short, the ordinal scale is the most one can rigorously claim for most psychological metrics. Anything beyond that ventures into the realm of methodological convenience rather than empirical demonstration.
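One way to see why the ordinal-versus-interval question is not a mere technicality: a mean depends on the spacing assigned to the categories, while a median depends only on their order. The sketch below uses made-up responses and two hypothetical codings of a five-point item. Both codings preserve the order of the categories, so neither is more "correct" if the scale is only ordinal.

```python
from statistics import mean, median

# Made-up responses on a five-point agreement item
# (1 = strongly disagree ... 5 = strongly agree). Not real data.
group_a = [3, 3, 3, 3, 3]
group_b = [2, 2, 2, 5, 5]

# Two hypothetical codings. Both keep the categories in the same order;
# they differ only in the assumed distances between categories.
equal_spacing = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched = {1: 1, 2: 2, 3: 6, 4: 7, 5: 8}

def summarize(responses, coding):
    """Return (mean, median) of the responses under a given coding."""
    values = [coding[r] for r in responses]
    return mean(values), median(values)

for name, coding in [("equal spacing", equal_spacing), ("stretched", stretched)]:
    mean_a, median_a = summarize(group_a, coding)
    mean_b, median_b = summarize(group_b, coding)
    print(
        f"{name}: higher mean = {'A' if mean_a > mean_b else 'B'}, "
        f"higher median = {'A' if median_a > median_b else 'B'}"
    )
```

Under equal spacing group B has the higher mean; under the stretched coding group A does. The median comparison is the same under both codings, which is why the median is listed among the permissible statistics for ordinal data and the mean is not.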

Validity and Operational Definitions

When dealing with measurement, especially of abstract constructs, the issue of validity is central. Operational definitions—those working definitions that specify the procedures by which a concept is measured—require some demonstration of validity before they can be meaningfully employed. Whether the act is done explicitly, or more often implicitly, any use of an operational definition assumes that it is at least valid in some pragmatic or conventional sense.

Validity, in the broadest terms, concerns the extent to which a measurement tool measures what it purports to measure. It is distinct from reliability, which concerns consistency. A measurement can be reliable without being valid, but it cannot be valid without at least some degree of reliability.
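The claim that a measure can be reliable without being valid can be illustrated with a small simulation. Everything in this sketch is an assumption made for the example: the data are simulated, the "nuisance factor" is hypothetical, and the noise level is arbitrary. The questionnaire consistently picks up the nuisance factor, so two administrations agree with each other while telling us almost nothing about the intended trait.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_people = 2000

# Simulated, hypothetical quantities (not real data): the trait we hope
# to measure, and an unrelated nuisance factor (say, a response habit)
# that the questionnaire actually picks up.
trait = [rng.gauss(0, 1) for _ in range(n_people)]
nuisance = [rng.gauss(0, 1) for _ in range(n_people)]

def administer(noise_sd=0.2):
    """One administration: scores track the nuisance factor plus a little noise."""
    return [value + rng.gauss(0, noise_sd) for value in nuisance]

first_session = administer()
second_session = administer()

test_retest = pearson(first_session, second_session)
validity = pearson(first_session, trait)
print(f"test-retest reliability: {test_retest:.2f}")
print(f"correlation with the intended trait: {validity:.2f}")
```

With these assumptions the test-retest correlation should come out high and the correlation with the trait close to zero. A reliability coefficient on its own cannot distinguish this situation from a valid measure.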

There are several recognized types of validity, which can be grouped under two overarching categories: measurement validity and inferential validity. The focus here is on measurement validity, as it applies directly to the construction and use of operational definitions.

1. Face Validity

Face validity refers to the superficial appearance of validity. Does the measure look like it measures what it claims to measure?

Assessment: Subjective judgment, usually by non-experts or by general observation. It is often a first step and carries no technical rigor.

Role: While it offers no guarantee of actual validity, it can influence acceptance. A test that lacks face validity may be rejected.

Example: A questionnaire about anxiety that asks about heart palpitations, nervous thoughts, and physical tension would likely have high face validity.

Caveat: Face validity is considered the weakest form. It is perceptual rather than demonstrative.

2. Content Validity

Content validity examines whether the measurement captures the full scope of the construct in question. Does it cover all relevant dimensions?

Assessment: Expert judgment is typically used. Panels of experts review the items to ensure they represent the breadth of the concept.

Role: Particularly important in educational testing and psychological inventories, where constructs are multifaceted.

Example: An intelligence test that only measures verbal reasoning lacks content validity because it omits non-verbal and quantitative reasoning domains.

Caveat: Content validity remains partly subjective, relying on expert consensus, which may vary.

3. Criterion-Related Validity

This category involves correlating the measurement with an external criterion that is considered a standard or benchmark. It has two subtypes:

a. Concurrent Validity

Assessment: The measure is compared with an established measure taken at the same time.

Example: A new depression inventory is compared with an existing, validated inventory to see if scores correspond.

b. Predictive Validity

Assessment: The measure is used to predict future performance or outcomes.

Example: IQ tests predicting academic achievement or job performance.

Caveat: Criterion validity is only as strong as the criterion itself. If the criterion is flawed, so is the validity assessment.

4. Construct Validity

Construct validity is the most comprehensive and philosophically demanding form of validity. It concerns whether the measure truly reflects the theoretical construct it is intended to measure.

Assessment: Involves multiple methods:

Convergent validity: Does the measure correlate with other measures of the same construct?

Discriminant (divergent) validity: Does the measure fail to correlate with measures of different constructs?

Factor analysis: Statistical techniques to examine the underlying structure of the measure.

Role: Central in psychology, where constructs like intelligence, anxiety, or personality traits are inherently abstract.

Example: A well-constructed anxiety scale should correlate with physiological measures of arousal and other anxiety scales (convergent), but not with unrelated constructs like extraversion (discriminant).

Caveat: Construct validity is an ongoing process, never fully settled, and depends heavily on theoretical models that are themselves provisional.

5. Ecological Validity (sometimes discussed as a subtype)

Ecological validity concerns whether the measurement reflects real-world conditions and behaviors.

Assessment: Observation of how well laboratory or survey measures correspond to behaviors and outcomes in natural environments.

Example: Measuring memory performance in a lab may have low ecological validity if the tasks bear little resemblance to everyday memory demands.

Caveat: Laboratory measures often trade ecological validity for control over variables.

Interrelationship of Validity Types

These types of validity are not entirely independent. A measure with strong construct validity often achieves good criterion validity as well. Content validity contributes to construct validity. However, it is possible for a measure to perform well on one type of validity and poorly on another.

Operational definitions depend fundamentally on assumptions of validity. Without at least content and construct validity, an operational definition is no more than a procedural recipe with no grounding in reality. Even practitioners who adopt operational definitions without questioning implicitly rely on prior validity assessments—whether formally documented or culturally embedded within the discipline.

Challenges in Assessing Validity

Circularity: Measures are often validated against other operational measures, leading to circular reasoning.

Hidden Assumptions: Measures may rely on unstated assumptions about human behavior or cognition that remain untested.

Cultural and Contextual Dependency: What appears valid in one context may not transfer across populations or settings.

Evolving Constructs: Theoretical constructs themselves may change over time, undermining earlier validations.

More Thoughts

Validity is the central pillar of meaningful measurement, especially when dealing with abstract, unmeasurable constructs operationalized through procedures. While multiple forms of validity provide a structured way to assess the soundness of measurements, none is infallible. Each rests on judgments—sometimes expert, sometimes statistical, sometimes pragmatic. Operational definitions, as the working tools of psychological measurement, depend on these layers of validity to claim relevance to the real world. But this foundation is always, to some extent, provisional. Validity is not a destination but an ongoing, contested process in the attempt to make the unmeasurable seem measurable.

Validity and the Question of the Unmeasurable

If a measure is valid, then what it measures is not unmeasurable. Epistemologically, however, the notion of validity has some serious holes in it. It's not at all clear that the picture of validity is as sound as psychologists tend to believe.

Methodological Convenience as Hand-Waving

Sometimes scholars contrast theoretical purity with methodological convenience. But doesn't methodological convenience amount to hand-waving? "Hand-waving" is a derogatory term for making claims without a basis in fact. The problem with hand-waving is that it leads to another colloquial term: bullshit.

The Problem of Clean Measurement

Can we produce clean measurement? The whole notion that you can assess depression with a number based on subjective reports is somewhat suspect. Personality traits are posited and measured. Such assessment is a money maker for some. This is equally suspect. I could go on multiplying instances.

I would argue that, in psychology often enough, psychometrics attempts to measure that which cannot be measured, and then tries to compute that which, by implication, cannot be computed. And they’re kidding themselves.

Numbers, Categories, and Psychological Measurement

We've used numbers to order things — that's the notion of ordinal measurement. We can sort, we can order. We use these numbers to describe and predict, and we assert that we can have cut-offs, as in psychopathy indices. We assert there are differences between a high score and a low score, perhaps in predicting bad behavior. But that's really quite iffy.

We put people into boxes; we categorize them based on these scores. And we always select our evidence when we’re categorizing. It's a mystery. We generalize, we select, we categorize. That's just the way the mind works.

The Habit of Categorization

So we try to do this systematically with numbers. Even without being psychometricians, we routinely put people into boxes, categorize them, assert things about their behavior, their underlying character, their underlying nature — and we just assume we're right. It's just the way we work. We assume that the characteristics we use to describe people carry the same objective weight as measurements with a ruler. And yet they can't.

Predictive Value and Its Limits

Despite that, gut feel and observation do have some predictive value. If we label someone as aggressive based on observing them over years, we can predict that they're likely to explode into aggression at some point, possibly triggered by a very trivial stimulus. It's not irrational; it's something we can observe.

Do the numbers improve our everyday observations, and can they be used in computation? Assigning numbers to such things is a somewhat odd thing to do. Taking these numbers, which may not have much validity, and using them in computations is where we may be going astray. There is some evidence that computations, such as correlations of IQ with later performance, seem to work, but this is a mystery in itself, somewhat like Wigner's mystery about the unreasonable effectiveness of mathematics in physics.

So maybe the real question is: are these constructs sound enough to be used in computation to produce meaningful results? And maybe that's still an open question.

Problems of Comparison

To order things means we have to compare things. That should be obvious, but what exactly are we comparing? Impressions? Remembered behaviors? Scores on a test? Likert scores? It's problematic.

Yet, if we live with an abusive person, we come to observe and describe their behaviors, and predict their future behaviors. But it's probabilistic. Sometimes we might say, "If this person gets drunk, they’re going to abuse us." So it's not deterministic, but rather a matter of likelihoods — subjective likelihoods.

Assigning Numbers to Likelihoods

The problem comes when we assume we can apply numbers to these subjective likelihoods. Maybe they could be made objective. We could probably count behaviors if we were accurate observers. But I still think we face a conceptual issue in assigning numbers to such things, just as we have a conceptual issue when trying to assign probabilistic numbers to gut feelings.

Yet, sometimes the computations seem to work — for example, in the correlation between IQ and performance in the military. Or is that just an illusion?

The Replication Crisis

No wonder we have a replication crisis. As if corruption at the level of institutions, businesses, and individual scientists weren't bad enough, and as if problems in peer review and publishing weren't bad enough, the very idea of measuring the unmeasurable may invalidate much of what is asserted.

Cosmetic Reforms

The replication crisis should have been a wake-up call, but even Ioannidis makes only cosmetic recommendations. He said, in effect, "if only we did it better," rather than going back to the foundations of the somewhat iffy science of psychometrics and the potential inapplicability of statistics.

For the complex and confounded realms endemic to psychology, Ioannidis tended to think Bayesian statistics was the solution to it all, and Bayesian statistics has its own problems.

Tools Not Fit for Purpose

We now have tools that are possibly not fit for purpose: operational definitions, where we attempt to quantify the unquantifiable, and statistics, where we try to manipulate things using tools that are arguably not fit for purpose in complex, complicated, non-linear, confounded domains. I've addressed that elsewhere at great length. I've argued that inferential statistics is not fit for purpose in two senses: you're not using the right tool, and you're not using the tool right. The replication crisis itself is evidence in support of that position.

Bayesian Statistics and Its Limitations

Even if the Bayesian model were totally sound and applicable to the complexity of human measurement, it's still unclear whether Bayesian statistics can work when you're dealing with measurements based on the unmeasurable and the uncomputable.

The issue with Bayesian statistics is not confined to the occasional subjectivity of priors. All statistics involves judgment at the input end, and interpretation, which also involves judgment, at the output end. There's nothing unique about Bayesian priors in this regard, despite ill-thought-out claims to the contrary. Such claims are based on very superficial thinking IMHO.

The other superficial mistake is the conflation of Bayesian numbers with subjective belief. No such thing. That's a category mistake, again IMHO.

The Puzzle of Test Items

You might have five test items that, on the surface, seem quite similar. One might assume they tap into the same cognitive skills — and perhaps they do, perhaps they don't. But different individuals will succeed on different items from these five. One person might get questions 1, 3, and 5 correct; another might get questions 2 and 4; yet another might succeed on questions 1, 4, and 5. And yet, it's maintained that they're all capturing the same feature of intelligence.
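The puzzle can be made concrete with the three response patterns just described. This is a minimal sketch: the person labels and the choice of overlap measure (shared correct items divided by items either person got correct) are assumptions made for illustration, not part of any standard scoring method.

```python
# Items answered correctly by three hypothetical test takers,
# following the patterns described in the paragraph above.
correct_items = {
    "person_1": {1, 3, 5},
    "person_2": {2, 4},
    "person_3": {1, 4, 5},
}

def total_score(items):
    """Conventional sum score: the number of items answered correctly."""
    return len(items)

def overlap(items_a, items_b):
    """Shared correct items divided by items either person got correct."""
    return len(items_a & items_b) / len(items_a | items_b)

scores = {person: total_score(items) for person, items in correct_items.items()}
print(scores)

people = list(correct_items)
for i, first in enumerate(people):
    for second in people[i + 1:]:
        shared = overlap(correct_items[first], correct_items[second])
        print(f"{first} vs {second}: overlap = {shared:.2f}")
```

Persons 1 and 3 receive the same total score while agreeing on only half of the items that either answered correctly, and persons 1 and 2 share no correct items at all. The sum score discards exactly the pattern information that raises the question.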

Testing in Academic Settings

Of course, it's no different with any testing regime in academic settings. We face the same issue: presenting a set of questions intended to test knowledge of certain topics and the ability to think using that knowledge. It's all confounded as to what exactly we’re trying to measure, and the patterns of test success are very different — except, of course, for the person who scores 100%.

Early Intelligence Testing

Much of this business of assigning numbers to psychological traits seems to have gained momentum with attempts to measure intelligence for the purpose of predicting fitness for military service. I don’t assert that this was the origin; I simply do not know.

The Quandary of Intelligence

What is intelligence? It seems that intelligence is whatever intelligence tests measure. A teaching assistant told me that once, and I never forgot it. This, of course, is circular.

I have long maintained that intelligence is like a mountain range, with peaks and valleys, and innumerable different cognitive abilities. But even that is an oversimplification — a metaphor, of course.

In the end, I’m left with the same quandary: what does this whole idea of intelligence really amount to? And yet, when I look at people who are developmentally challenged, I must say, well, clearly there is an obvious dimension to it. Then we have the phenomenon of savants. So, it’s very complex, isn’t it?

I've often wondered about this, but I have no real understanding of what’s truly going on, other than the commonly made claim that success in testing — IQ or otherwise — depends, at least in part, on learned experience.

Innate ability? I think there is such a thing. But how to assess it? That is another matter.

We see some people who are clearly developmentally challenged, and others who seem to excel effortlessly. Most people fall somewhere in the middle.

And although we pride ourselves on being able to single out the bright ones from the dull ones, I myself find that I can no longer reliably make that judgment in many cases, because everybody displays both high-level and low-level thinking at different times, IMHO. Some of the people who show the highest ability to solve complex problems also believe things that I consider absurd. So, I would say that's very low-level thinking. But then again, that is all interpretation.

Scholarly Criticisms of Psychological Measurement

Despite being underrepresented in mainstream psychological discourse, criticisms of psychological measurement have been voiced—sometimes clearly, sometimes obliquely—by serious methodologists, philosophers of science, and dissenting psychologists over the decades. Several independent lines of critique have converged toward similar conclusions:

Paul Meehl, a clinical psychologist and philosopher of science, famously criticized the soft sciences for what he called "nomological nets of gossamer," warning that psychologists build elaborate theoretical structures resting on weak empirical foundations and unreliable measures. He argued that psychological theories often fail to achieve cumulative progress, owing in part to measurement problems and in part to the misuse of statistical tools.

Jacob Cohen, well-known for his work on statistical power, openly criticized the field’s reliance on null hypothesis significance testing (NHST). He pointed out that statistical tools were being misapplied to data of poor quality and low reliability. Cohen argued that the emphasis on p-values distracted from the much larger issue: whether the constructs were meaningfully measured in the first place.

Joel Michell, a psychologist and philosopher of measurement, directly challenged the foundations of psychometrics. He argued that psychological attributes generally fail to meet the requirements for quantitative measurement, describing psychometrics as "pathological science": a field where assumptions of quantity are treated as facts despite a lack of empirical support. Michell contended that the entire enterprise rests on unjustified assumptions of measurability, rendering the framework methodologically unsound.

The replication crisis, while often treated as an unexpected development within the field, stands as empirical confirmation of this underlying fragility. The crisis did not arise from statistical power alone but from systemic problems of construct validity, measurement error, and over-reliance on inferential statistics applied to weak and noisy data.

Philosophers of science, such as Nancy Cartwright, have also critiqued the application of measurement and statistical tools to complex, context-sensitive domains. Her work emphasizes that models and measurements borrowed from the physical sciences do not necessarily transfer to the study of human behavior and social systems.

In short, there exists an established—though marginalized—intellectual tradition that recognizes these fundamental flaws. The mainstream field has largely resisted these critiques, driven by institutional momentum, entrenched publication practices, and professional incentives deeply tied to the continuation of current methodologies. Nonetheless, the arguments made here are grounded, coherent, and supported by respected voices across psychology, statistics, and philosophy of science.

Summary

The ambition to measure the unmeasurable reflects both the audacity and the vulnerability of human reasoning. From the reliable, physical tools of measurement to abstract proxies for subjective experience, the expansion of measurement has ventured far beyond its tangible origins. Psychologists, armed with operational definitions and scaling methods, converted beliefs, attitudes, and personality traits into numerical data suitable for statistical manipulation. Yet beneath this appearance of rigor lies an enduring uncertainty: do these numbers correspond to real, stable features of human nature, or are they merely artifacts generated by the act of measurement itself?

When measurement crosses into the subjective, it departs from observation and enters assumption. What began as a grounded practice of observing the world with clarity and precision has, in some quarters of psychology, become an exercise in methodological faith—faith that proxies capture reality, faith that statistical tools yield truth, and faith that the unmeasurable can be made measurable by procedural fiat.

The replication crisis, alongside critiques from methodologists, philosophers of science, and dissenting psychologists, has exposed the fragility of this faith. The pursuit of precision, when applied to constructs built on questionable foundations, risks becoming an illusion of accuracy rather than a demonstration of understanding. In the end, the promise of measurement remains tethered to its origins: meaningful measurement requires phenomena that can be meaningfully measured. Without this, we are left with elegant numbers that signify less than they seem.

