Understanding the World: AI Benchmarking, Human Understanding, and Truth
I reflect on creating benchmarks to compare AI performance, and on the limits of what such benchmarks can accomplish.
Note: This essay was prepared with the research assistance and ghostwriting of ChatGPT (GPT-4).
Author's Preface
Human beings live in a cosmos far greater than our limited capabilities can comprehend. We create models of reality, scratch at the surface of understanding, and call it science. We build tools, create languages, and form civilizations, all in an attempt to make sense of a vast and unknowable universe. But in every step of this process, we reduce the complexity of the cosmos to fit within the boundaries of our intellect, technology, and language.
In this essay, I explore the intersection of our limited human understanding and the advent of artificial intelligence, particularly through the lens of benchmarking, a process used to assess AI capabilities. What are we really measuring when we benchmark AI, and what are the epistemological implications? This essay seeks to understand not only how we benchmark AI, but how this process reflects the limits of human knowledge itself.
The Immensity of the Cosmos and the Limits of Human Understanding
We begin with the cosmos—vast, unknown, and unknowable. The immensity of space, time, and events is far beyond what human intelligence and technology can comprehend. Even with our most advanced tools, scientific theories, and observational capabilities, we barely scratch the surface of understanding. Our sensorium—what we can perceive through our senses—and the intellectual frameworks we've built around these perceptions are limited by their very nature. Language, one of our greatest achievements, is itself insufficient to fully describe the universe.
Scientific progress, though remarkable in many ways, is hampered by these limitations. We formulate theories about the universe, whether in physics, cosmology, or even metaphysics, but each of these theories is tentative and destined to be overturned. Our current understanding of the cosmos is a work in progress, one that will likely continue to evolve as we encounter new evidence that challenges our existing paradigms.
Even the notion of omniscience—an all-knowing being that understands every detail of the cosmos—seems absurd. While humans use language to describe the universe, it is clear that language is a human invention, a tool that inherently limits the scope of what can be understood. If omniscience existed, it would not depend on language as we do, and even humans do not depend entirely on language for thought. Therefore, we must recognize the limitations of our human-centric perspective when describing the universe.
Stages of Information and Its Reduction
The process of training a large language model (LLM) such as ChatGPT is marked by a continuous reduction of information at every stage. What begins as the full scope of human assertions about the cosmos and the world becomes a filtered and narrowed subset of knowledge that the AI can access. This section explores the stages of that reduction, from unrecorded assertions to the final AI output; a toy sketch following the stages illustrates how the reductions compound.
1. Unrecorded Assertions
Humans have made a vast number of assertions about the cosmos, events, and everything in between. The overwhelming majority of these assertions were never recorded and are lost to history.
2. Recorded Assertions
Only a small subset of human assertions was ever recorded, representing the first step in preserving human knowledge.
3. Recorded Assertions That Have Been Lost
Natural disasters, war, vandalism, and degradation have destroyed a significant portion of recorded assertions over time.
4. Assertions That Survive but Are Difficult to Access
Some recorded assertions are accessible only with difficulty, owing to their physical location or the specialized knowledge required to interpret them.
5. Digitized Assertions
A smaller subset of the surviving records has been digitized, further reducing the scope of accessible information.
6. Assertions Available to the Large Language Model Community
Many valuable sources of knowledge are restricted by copyright, proprietary interests, and political concerns, reducing the knowledge AI can access.
7. The Curation Process
Curators select portions of the digitized data for AI training, a process influenced by bias, corporate pressure, and human judgment.
8. The Training Process
Training algorithms adjust the weights and patterns within the model's architecture, further narrowing the scope of its knowledge.
9. The Prompting Process
Users interact with the AI through prompts, eliciting responses shaped by the model's training and the prompter's input.
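To make the compounding concrete, here is a deliberately toy sketch in Python. The stage fractions are arbitrary placeholders chosen for illustration, not estimates of the actual losses; the point is only that successive reductions multiply.

```python
# A toy illustration of how the reductions compound. The fractions are
# arbitrary placeholders, not estimates; the point is the multiplication.
stages = [
    ("recorded at all",               0.01),
    ("survived loss and destruction", 0.5),
    ("practically accessible",        0.5),
    ("digitized",                     0.3),
    ("available for LLM training",    0.5),
    ("selected by curators",          0.5),
]

remaining = 1.0
print(f"{'stage':35s}{'share of all assertions':>25s}")
for name, kept in stages:
    remaining *= kept
    print(f"{name:35s}{remaining:>25.6%}")
# Even with generous fractions at every stage, what survives is a
# tiny sliver of everything humans have ever asserted.
```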
Human Assertions Over Time and the Loss of Knowledge
Throughout history, human beings have made countless assertions about the world, the universe, and their place within it. These assertions are expressed through language, passed down through generations, and sometimes recorded. But the process of recording and preserving human knowledge is fraught with challenges. Most human assertions, especially those made long ago, were never recorded and have disappeared into the ether of time.
Of the assertions that were recorded, many have been lost due to natural forces, catastrophic events, or human acts of vandalism and destruction. What remains is but a fraction of what has been produced over the millennia, and even then, much of it remains inaccessible or overlooked. Some knowledge has been digitized, and a smaller subset of that digitized knowledge is made available to AI models.
What this means is that by the time we reach the point where we are training artificial intelligence, we are working with a minuscule subset of human knowledge—already reduced and filtered through layers of loss, destruction, and curation.
Curation, AI Training, and the Shaping of Large Language Models
The process of training a large language model involves several levels of curation. First, curators and AI staff sift through the available digitized information to decide what will be used to train the AI. Their decisions are shaped by their own limited understanding, corporate pressures, and biases. The trainers then apply algorithms to shape the AI model itself, further narrowing the scope of what the AI can know or do.
This process involves adjusting weights and patterns within the AI's architecture, but ultimately, the training is based on human judgment and limited datasets. The result is an AI that is sophisticated in many ways, yet constrained by the limitations of its data and the choices made by its human curators.
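To ground the phrase "adjusting weights," here is a deliberately minimal sketch of a single gradient-descent update on one weight. It illustrates the principle only; real LLM training repeats a far more elaborate version of this step across billions of parameters.

```python
# Minimal sketch: one gradient-descent update on a single weight.
w = 0.5                 # a model weight
x, target = 2.0, 3.0    # one training example
lr = 0.1                # learning rate

prediction = w * x
error = prediction - target          # 1.0 - 3.0 = -2.0
gradient = 2 * error * x             # d/dw of (w*x - target)^2
w -= lr * gradient                   # nudge the weight to reduce error
print(f"updated weight: {w:.2f}")    # 0.5 - 0.1*(-8.0) = 1.30
```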
Thus, a large language model is a reflection of the human reductions and selections that precede it. It is not vast; it is a vanishingly small subset of the cosmos, filtered and processed through many layers of reduction. The idea that these AIs are trained on a "vast" corpus of knowledge is a misconception; in truth, what they know is minuscule in comparison to the totality of human knowledge, let alone the cosmos itself.
The Prompting Process and the Emergence of AI Responses
Once trained, the large language model AI interacts with users through prompts. This process, though fascinating, is unpredictable and subject to the prompter's understanding of the AI's behavior. A user can craft a prompt, but there is no guarantee of receiving a predictable or desirable result. The responses generated by the AI are influenced by the data it was trained on, the way it interprets the prompt, and the internal workings of the AI's model.
This interaction, while often coherent, remains mysterious. The AI produces emergent behaviors—capabilities that were not explicitly programmed but arise from the complexity of the system itself. Just as human consciousness is emergent and unpredictable, so too are the capabilities of AI. Despite our efforts to understand how these systems work, we remain largely in the dark. No one in the AI or scientific community can honestly claim to fully understand how these systems generate their responses, or the full range of what they can do. The capabilities of AI are continually expanding, and just as we cannot predict the full range of human potential, the emergent behaviors of AI remain elusive.
Benchmarking AI: A Limited View of Capabilities
Benchmarking, the process of measuring AI capabilities, is a crucial step in the development and evaluation of AI systems. It allows developers, researchers, and the public to assess how well an AI system performs specific tasks. These tasks may include everything from natural language processing to image recognition or complex problem-solving. However, when viewed through a high-level, philosophical lens, benchmarking AI reveals inherent limitations.
While benchmarking can provide objective measures like speed and resource efficiency, the question of correctness becomes more complicated, especially given our imperfect understanding of the world. How can we benchmark AI for "truth" when our own grasp of truth is so tenuous? This issue is particularly problematic when dealing with subjective knowledge areas, where what is true for one person may be considered false by another.
The benchmarks themselves, which are often sets of carefully crafted prompts or tasks, are built on human assumptions of correctness. In objective domains such as mathematics and logic, benchmarking AI may be more straightforward—there are clear, unambiguous answers. However, in areas like language, culture, history, or ethics, truth is far more subjective and contingent on the context. Benchmarking AI in these domains involves not only evaluating its performance but also acknowledging the biases in the benchmarks themselves.
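To make this concrete, here is a minimal sketch of what such a harness might look like. The items and the stand-in "model" are hypothetical placeholders, not the contents of any real benchmark. Exact-match scoring works for the arithmetic items but immediately founders on the interpretive one, which is the point made above.

```python
# A minimal, hypothetical benchmark harness. The items and the
# "model" below are illustrative placeholders, not a real benchmark.

BENCHMARK = [
    # Objective items: a single unambiguous reference answer exists.
    {"prompt": "What is 17 * 3?", "reference": "51"},
    {"prompt": "Is 97 prime? Answer yes or no.", "reference": "yes"},
    # Interpretive item: the "reference" already encodes a human judgment.
    {"prompt": "Was the printing press the most important invention?",
     "reference": "yes"},
]

def model(prompt: str) -> str:
    """Stand-in for the system under test."""
    canned = {"What is 17 * 3?": "51",
              "Is 97 prime? Answer yes or no.": "yes"}
    return canned.get(prompt, "it depends on the criteria used")

def exact_match_score(benchmark, model_fn):
    hits = sum(
        model_fn(item["prompt"]).strip().lower() == item["reference"]
        for item in benchmark
    )
    return hits / len(benchmark)

print(f"exact-match accuracy: {exact_match_score(BENCHMARK, model):.2f}")
```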
Moreover, the data that AI models are trained on includes a mixture of information, misinformation, and disinformation. The complexity of truth is further compounded by this mixed data set, raising the question of how well any AI model can truly perform when it is trained on imperfect information. Despite the best efforts of AI developers and trainers to address these issues, the limitations of both the data and the benchmarks mean that we are only capturing a narrow slice of what AI systems can do.
The Epistemological Challenge of Truth
At the heart of AI benchmarking is the epistemological challenge of truth. If our understanding of the cosmos is limited by human language, perception, and cognition, how can we assert that AI—built upon the same foundations—can access or deliver absolute truth? The training of AI models is predicated on human-constructed datasets, which are themselves filled with human interpretations, errors, and biases.
The issue of truth extends beyond mere technical performance. AI, like humans, is subject to the contradictions and imperfections of the information it processes. Information that may seem correct or true in one context can be false in another, and even in fields like science, knowledge is constantly evolving. What was considered a scientific fact decades ago may be debunked in the future, and AI models, trained on historical data, may propagate outdated information. Thus, the process of benchmarking AI for truth becomes not only a technical problem but a profound philosophical one.
Benchmarking, while useful for certain technical applications, cannot fully resolve the deeper issues related to human understanding and truth. AI, like humanity, grapples with incomplete and sometimes contradictory information. And while benchmarking may improve AI’s ability to perform specific tasks, it will never offer certainty in matters of truth.
Conclusion: The Intersection of Human Understanding, AI, and Benchmarking
As we continue to develop and refine AI systems, we must confront the limitations of both human understanding and artificial intelligence. The vastness of the cosmos, the complexity of human assertions, and the layers of reduction through which information passes all culminate in AI systems that are impressive but fundamentally constrained.
Benchmarking AI can offer valuable insights into its performance and capabilities, but it cannot address the larger epistemological and ontological questions that arise when we consider the nature of truth and understanding. AI models are shaped by the same forces that shape human knowledge: reduction, bias, and selection. And just as humans cannot fully comprehend the cosmos, AI models, trained on filtered subsets of knowledge, cannot access the full spectrum of truth.
The ongoing challenge, then, is not only to improve AI performance but also to deepen our understanding of the limitations inherent in both human and machine knowledge. As AI continues to evolve, we must remain aware of its blind spots and the ways in which it mirrors our own. Benchmarking, while important, is only a tool—a tool that offers a glimpse into the capabilities of AI, but one that ultimately leaves the larger questions of truth and understanding unanswered.
Appendix A: Benchmarking
Having worked in the computer field for decades, I am familiar with benchmarking. Recently, I encountered the notion of benchmarking AI systems—large language models like ChatGPT and others. The idea is to use benchmarking to assess whether new models or new training efforts improve accuracy and performance.
What immediately came to mind is that this process is fraught with epistemological as well as practical challenges. The first question I asked was, "What exactly are they benchmarking against?" Certain aspects of AI systems—such as speed, resource consumption, and throughput—can be benchmarked fairly objectively. These are concrete metrics, and improvements in them are easily measured.
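Latency is a good example of such a metric. The sketch below times a stand-in function, since no real model endpoint is assumed here; with an actual system, the same measurement pattern applies.

```python
import statistics
import time

def model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call or API request."""
    time.sleep(0.01)  # simulate inference latency
    return "response"

# Measure wall-clock latency over repeated calls: an objective,
# repeatable metric, unlike judgments about answer "correctness".
latencies = []
for _ in range(20):
    start = time.perf_counter()
    model("What is 17 * 3?")
    latencies.append(time.perf_counter() - start)

print(f"median latency: {statistics.median(latencies) * 1000:.1f} ms")
```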
However, when it comes to more subjective or interpretative tasks, such as language understanding, the challenge becomes far more complicated. Benchmarking correctness in fields like mathematics or logic is more straightforward because there are established formal systems with clear, unambiguous answers. For example, we can assert with confidence that objects fall under the influence of gravity, or that acceleration due to gravity on Earth can be calculated reliably. These are factual and do not involve interpretation.
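For instance, the standard kinematic formula for free fall yields a single answer that a benchmark could check mechanically (a minimal sketch, ignoring air resistance):

```python
import math

G = 9.81  # standard acceleration due to gravity at Earth's surface, m/s^2

# Time for an object to fall 10 m from rest:
# d = (1/2) * g * t^2, so t = sqrt(2d / g).
distance_m = 10.0
t = math.sqrt(2 * distance_m / G)
print(f"fall time: {t:.2f} s")  # ~1.43 s; one checkable answer
```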
But in many other domains, facts are not as clear-cut. Nietzsche famously said, “There are no facts, only interpretations,” which resonates when we deal with human knowledge and culture. While I disagree with the totality of Nietzsche’s statement—there are certainly facts—his assertion highlights that much of what we deal with involves interpretation.
The 5WH of Benchmarking AI Systems
In this section, I will explore the key questions around the who, what, why, where, when, and how of benchmarking AI systems:
Who is Benchmarking AI?
AI researchers, developers, companies such as OpenAI and Google DeepMind, and academic institutions lead the effort to benchmark AI systems. These stakeholders are responsible for creating and applying the benchmarking frameworks used to test the models they create.
What is Being Benchmarked?
Benchmarking focuses on several dimensions: performance (speed and efficiency), accuracy (how well the AI completes specific tasks), and correctness in fields like mathematics, logic, and fact-based assertions. Additionally, some benchmarks assess ethical considerations such as bias, fairness, and robustness in the face of adversarial input.
Why is Benchmarking Done?
The purpose of benchmarking is to measure the improvement of AI systems over time. By comparing different models or versions of the same model, developers can gauge whether changes lead to better outcomes. Benchmarking also helps ensure that AI meets certain standards of performance, especially in industries like healthcare, where mistakes can have serious consequences.
Where is Benchmarking Done?
Benchmarking takes place in research laboratories, academic institutions, and technology companies. These settings provide controlled environments where AI models can be tested against specific criteria. Open source communities also participate by sharing benchmarking tools and datasets.
When is Benchmarking Performed?
Benchmarking occurs throughout the development cycle of an AI system, from the early stages of training to long after the model has been deployed. Continuous benchmarking is essential as new versions of AI are released, ensuring that improvements in speed, accuracy, and correctness are tracked.
How is Benchmarking Conducted?
Benchmarking relies on carefully crafted datasets and tasks designed to test the AI's capabilities. For example, GLUE is a well-known benchmark suite for natural language understanding tasks, while MLPerf focuses on measuring the speed of machine learning training and inference across hardware and software systems. Each benchmark provides a set of standardized tasks that allow developers to compare AI systems on a level playing field.
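As one concrete illustration of what running such a benchmark can look like, the sketch below scores a classifier on GLUE's SST-2 sentiment task using the Hugging Face datasets library. The my_model_predict function is a hypothetical stand-in for a real model, here just a keyword heuristic.

```python
# A hedged sketch: scoring a stand-in classifier on GLUE's SST-2 task.
# Requires the Hugging Face "datasets" package (pip install datasets).
from datasets import load_dataset

def my_model_predict(sentence: str) -> int:
    """Hypothetical stand-in for the model under test.
    Here: a trivial keyword heuristic, not a real classifier."""
    return 1 if "good" in sentence.lower() else 0

# SST-2 is a binary sentiment task; labels are 0 (negative) / 1 (positive).
val = load_dataset("glue", "sst2", split="validation")
correct = sum(
    my_model_predict(ex["sentence"]) == ex["label"] for ex in val
)
print(f"accuracy: {correct / len(val):.3f}")
```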
In summary, benchmarking is a vital process in the development of AI systems, but it carries inherent limitations. While it is valuable for measuring certain technical aspects, it is less suited to resolving deeper questions about truth, bias, and the subjective nature of knowledge. Understanding the boundaries of what can be benchmarked is essential, and recognizing the epistemological challenges involved is equally important for any intelligent application of AI technology.