Date: 2025-08-09

Standing on the peaks of Mount Epistemology.

Author’s Preface

The Reason series is an ongoing examination of epistemological and related questions in the hope of better understanding the world. This includes the limits of human reasoning, the ways language both enables and distorts thought, and the performance of tools such as large language models, which extend our capacities but also inherit our weaknesses. It is an exercise in following arguments to their limits, knowing full well that the most fitting response to such an enterprise might be, “Good luck with that.”

This particular essay examines claims that new generations of large language models can better separate reliable from unreliable information. The question is whether algorithmic and procedural improvements can genuinely overcome the limits imposed by the nature of the training data, the properties of language, and the absence of any universal method for establishing truth. This is not a software engineering review but an epistemic audit.

Introduction

Large language models (LLMs) are marketed as improving steadily through advances in architecture, training procedures, and curated datasets. Benchmarks are presented to show newer models outperforming earlier ones in distinguishing fact from error. Such claims may be accurate in certain narrow, factual domains, but they rarely acknowledge the deeper epistemic ceiling created by the content and structure of the data, by the way language works, and by the nature of the domains into which truth claims fall.

This ceiling is not set by computation. It is set by the reliability, diversity, and independence of the information available to the model — conditions that also bind human reasoning. In many domains, truth is sparse, contested, or unreachable, and no algorithm can conjure it from absence.

Discussion

1. Domains and the Conditions for Truth

For purposes of analysis, it is useful to speak of factual domains, interpretive domains, and value-based domains — with the caveat that these are not hard categories but points along a continuum.

Factual domains include areas where repeated empirical demonstration is possible: basic measurements in physics, arithmetic, and other well-defined, stable systems.

Interpretive domains are those where evidence is incomplete, ambiguous, or theory-laden, such that multiple incompatible explanations can persist without resolution.

Value-based domains are shaped by evaluative frames — moral, aesthetic, cultural — that are shared within communities and rarely reducible to objective criteria.

There is no map for moving through this space. Even fact-based claims require chains of inference, often mediated entirely through language, before they connect to observable reality. The longer and more complex the chain, the more interpretive elements intrude. In value-based domains, the link to the objective world is often indirect at best; reasoning here is grounded in individual and communal frames rather than empirical confirmation.

2. Language as Medium and Limitation

Language is our principal tool for representing and transmitting knowledge, but it is also the means by which we report error, nonsense, fantasy, and deliberate falsehood. It can track reality closely enough to support survival and engineering, but it can also fail without our noticing.

An LLM is entirely dependent on language. It has no independent contact with the world. It processes tokenised representations of text and learns patterns of association and frequency. This includes every mode in which language is used — truth-telling, error, propaganda, and play. A model trained on a truth-scarce or biased corpus will faithfully reproduce those deficits.
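The dependence on corpus frequency can be illustrated with a deliberately crude stand-in for a language model: a lookup that returns whichever completion most often followed a prompt in its corpus. The corpus, the prompts, and the helper name `most_frequent_completion` are all invented for illustration. Real LLMs are far more complex, but this sketch shows the same underlying constraint.

```python
from collections import Counter

def most_frequent_completion(corpus, prompt):
    """Return the completion that most often follows `prompt` in `corpus`.

    `corpus` is a list of (prompt, completion) pairs. This is a hypothetical
    helper, not a real model: it only counts what it has seen.
    """
    counts = Counter(completion for p, completion in corpus if p == prompt)
    if not counts:
        return None  # nothing in the corpus, so there is no pattern to reproduce
    return counts.most_common(1)[0][0]

# Invented corpus in which a false claim outnumbers a true one 3 to 1.
biased_corpus = [
    ("the sun orbits", "the earth"),
    ("the sun orbits", "the earth"),
    ("the sun orbits", "the earth"),
    ("the sun orbits", "the galactic centre"),
]

answer = most_frequent_completion(biased_corpus, "the sun orbits")
missing = most_frequent_completion(biased_corpus, "the moon orbits")
```

Here `answer` is the majority claim in the corpus regardless of whether it is true, and `missing` is `None` because the corpus says nothing about that prompt.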

3. Benchmarks and the Illusion of General Competence

Developers use benchmarks to measure improvements in model performance. These benchmarks are typically drawn from curated datasets considered factual. In domains where truth is stable and verifiable, benchmarks can measure genuine competence. In interpretive or contested domains, they measure only agreement with a selected frame. Gains on such tests cannot be generalised to truth-finding in the world at large.
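A small sketch makes the point about contested domains concrete. It assumes four invented questions and two hypothetical answer keys, each curated within a different interpretive frame; none of this corresponds to a real benchmark. The same model outputs are scored against both keys.

```python
def accuracy(model_answers, answer_key):
    """Fraction of questions where the model's answer matches the key."""
    matches = sum(1 for q, a in answer_key.items() if model_answers.get(q) == a)
    return matches / len(answer_key)

# Invented model outputs for four questions in a contested domain.
model_answers = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}

# Two hypothetical answer keys, each reflecting a different frame.
key_frame_1 = {"q1": "A", "q2": "B", "q3": "A", "q4": "A"}
key_frame_2 = {"q1": "B", "q2": "B", "q3": "B", "q4": "A"}

score_1 = accuracy(model_answers, key_frame_1)  # 3 of 4 answers match
score_2 = accuracy(model_answers, key_frame_2)  # 1 of 4 answers match
```

The model's outputs are identical in both cases; only the key changes. The score reports agreement with whoever wrote the key, which is informative about truth only where the key itself is anchored to something verifiable.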

4. The Role of Curation and Anchors

Curated or synthetic “anchor” data — vetted examples with known correct answers — can help a model learn consistent patterns in fact-rich domains. The historian analogy is often invoked: a few verified records can help interpret a mass of biased accounts.

However, in interpretive or methodologically unstable domains, anchors embed the same biases and errors as the surrounding corpus. Instead of providing calibration to reality, they reinforce an already skewed frame. The reliability of anchors is limited by the human judgment that selects and labels them.

5. Anti-Platonism and the Objective World

This analysis rests on an explicitly anti-Platonic stance. Platonic thinking — the belief in timeless, independent abstractions or perfect forms — still pervades mathematics, philosophy, and other disciplines, often unnoticed. It encourages reification: treating our conceptual structures as if they existed independently of human cognition.

In contrast, the position here accepts the existence of an objective world but sees all models, categories, and formal systems as human constructs imposed on a fluid reality. They can be extraordinarily effective and may track reality closely, but they are not reality itself. Treating them as such distorts reasoning and encourages overconfidence in formalism as a path to truth.

6. Data Sparsity and the ChordPro Case

One concrete example of the epistemic ceiling is drawn from a highly specific, technical domain: the ChordPro file format. ChordPro is a widely used syntax for representing guitar chords and lyrics in plain text. It allows a music software program to generate chord diagrams — visual grids showing frets, finger positions, and chord names.

When the first-generation public release of ChatGPT was asked to produce ChordPro syntax for specific chords, it consistently produced incorrect output. No amount of prompt rephrasing fixed the problem. The likely causes were:

Sparse representation in the training data — very little accurate ChordPro syntax was available to learn from.

No verification loop — the model could not test its output against actual ChordPro rendering software.

Overgeneralisation from similar formats — syntax elements from other notation systems were substituted, producing plausible but non-functional text.

If a newer model now performs better, it is because accurate ChordPro examples have been added to the dataset or generated synthetically. The improvement would not come from abstract reasoning or “better algorithms” alone — only from better material to learn from. The ChordPro case scales up: in any domain with sparse, noisy, or biased representation, performance is capped by what is in the corpus.
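The missing verification loop can be hinted at with a minimal sketch. It assumes one simplified form of the ChordPro chord definition directive (a name, a `base-fret` value, and six `frets` positions) and checks only the shape of a line. The real format has more options than this, and only actual rendering software can confirm that a definition works. The pattern, the helper name `looks_like_define`, and the candidate strings are illustrative assumptions.

```python
import re

# Simplified pattern for one form of a chord definition:
#   {define: NAME base-fret N frets P P P P P P}
# Assumption: six strings, each position a non-negative integer or "x" (muted).
# This is not the full ChordPro grammar.
DEFINE_PATTERN = re.compile(
    r"^\{define:\s*(?P<name>\S+)\s+base-fret\s+(?P<base>\d+)"
    r"\s+frets\s+(?P<frets>(?:(?:\d+|x)\s+){5}(?:\d+|x))\s*\}$"
)

def looks_like_define(line):
    """Return True if `line` matches the simplified define pattern above."""
    return DEFINE_PATTERN.match(line.strip()) is not None

# Invented candidate outputs, as a model might produce them.
candidates = [
    "{define: E5 base-fret 7 frets 0 1 3 3 x x}",  # matches the assumed form
    "E5: 0 2 2 x x x",                              # tab-style shorthand, fails this check
    "{define: E5 base-fret 7 frets 0 1 3}",         # too few positions for this check
]
results = [looks_like_define(c) for c in candidates]
```

A check like this catches text borrowed from other notation systems, but it cannot create knowledge of the format: someone with access to accurate examples had to write the pattern first.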

7. The Broader Epistemic Conundrum

Humans and LLMs share the same limitation: neither has a universally reliable method for establishing truth outside of simple or controlled cases. Humans use perception, inference, and consensus — all fallible. LLMs are parasitic on human language, which is itself parasitic on the objective world.

Denying the world is self-defeating; equating language with the world is untenable. Language can reflect reality well enough to permit survival, yet it can also sustain systems of error indefinitely. The mystery is that the same medium serves equally for truth, error, and invention.

Summary

Algorithmic improvements in large language models can yield real gains in domains where truth is abundant, assumptions are met, and accurate anchors are available. Benchmarks in such areas can measure competence. But in the far larger set of interpretive, value-laden, or methodologically flawed domains, there is no general method for determining truth, and benchmark gains do not translate into genuine truth-finding. The epistemic ceiling is set by the corpus, the prompts, and the embedded biases of the system. Beyond that ceiling, the separation of wheat from chaff is a matter of presentation, not discovery.

