Understanding the World: Drilling Down to Find Out More About the Workings of LLM AI
Sort of computer-geeky stuff, but not that highly technical, I think. Surely some will find it so; some will find it simplistic.
Note: There was extensive prompting on my part to elicit information from ChatGPT about its own implementation. I am assuming that it is basically correct; I was a computer geek in a past life. ChatGPT may be wrong, but what it said is consistent with other information I have obtained. The organization is mine, the details courtesy of ChatGPT. This information is starting to make sense to me, at a 10,000-foot level.
Part I - Essay on the Architecture, Tokens, Data Store, and Resource Requirements of Large Language Models (LLMs)
Large language models (LLMs) are built on intricate architectures that handle massive amounts of text, manage data efficiently, and make predictions based on learned patterns. To better understand these systems, we will explore several key components: architecture, tokens, data store, data sizes, algorithms, codebase, and resource requirements. We will also touch on the differences in handling data during various phases like curation, training, and inference, with a focus on how data is represented and retrieved.
1. Architecture
The architecture of LLMs separates the code for execution from the trained data. The code is responsible for processes like inference and interaction, while the trained data comprises the model’s learned parameters (weights). During the curation phase, a vocabulary is built that records, for each token, the original characters that make up a word or word fragment. This vocabulary is stored in a form that allows efficient lookup during text generation.
After tokenization, the text is held as numerical token IDs (in effect, pointers to those original characters), which are stored in various data structures:
One-dimensional structures (vectors): These are arrays of numbers representing individual tokens and their properties.
Two-dimensional structures (matrices): These store relationships between multiple tokens, such as a set of token vectors for a sentence.
Multi-dimensional structures (tensors): Tensors are used to represent more complex relationships, such as multiple sentences or sequences of data processed simultaneously.
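A minimal sketch of these three shapes, using plain Python lists (the numbers are illustrative, not real learned values):

```python
# One token as a vector: a 1-D list of numeric features.
token_vec = [0.1, -0.3, 0.7, 0.2]                    # 1-D: features of one token

# A sentence as a matrix: one row (one vector) per token.
sentence = [token_vec,
            [0.0, 0.5, -0.1, 0.9],
            [0.4, 0.4, 0.4, 0.4]]                    # 2-D: tokens x features

# A batch as a tensor: several sentences processed together.
batch = [sentence, sentence]                         # 3-D: batch x tokens x features

def ndim(x):
    """Count nesting depth: 1 = vector, 2 = matrix, 3+ = tensor."""
    if isinstance(x, list) and isinstance(x[0], list):
        return 1 + ndim(x[0])
    return 1

print(ndim(token_vec), ndim(sentence), ndim(batch))  # 1 2 3
```

Libraries like NumPy or PyTorch provide the same idea as efficient contiguous arrays rather than nested lists.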
In the architecture, these token IDs are referenced during training and inference; at generation time they are looked up in the vocabulary, ensuring that the original characters can be reconstituted in the correct order and context when new text is produced.
2. Tokens
Tokens represent fragments of text, which could be whole words, subwords, or characters, depending on the language. Each token is associated with probabilistic weights that the model learns during training, indicating how often certain tokens appear together in various contexts.
At the lowest level, text is represented by Unicode characters (e.g., in UTF-8 encoding). Once the model tokenizes the text, it assigns each token a numerical ID, which is stored in various data structures (like vectors, matrices, or tensors). These IDs act as pointers back to the original characters, allowing the model to retrieve them during the generation phase.
During training, the model adjusts the weights associated with tokens to reflect learned patterns, which are stored in the model’s parameters. This enables the system to predict or generate text based on token relationships.
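The token-to-ID round trip described above can be sketched with a toy vocabulary. The words and IDs here are invented for illustration; real tokenizers (e.g., byte-pair encoding) build vocabularies of tens of thousands of entries:

```python
# A toy vocabulary: each token ID "points" to its original characters.
vocab = {0: "the", 1: " cat", 2: " sat", 3: "."}
reverse = {chars: tid for tid, chars in vocab.items()}

def tokenize(text):
    """Greedy longest-match tokenization against the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for chars in sorted(reverse, key=len, reverse=True):
            if text.startswith(chars, i):
                ids.append(reverse[chars])
                i += len(chars)
                break
        else:
            raise ValueError(f"no token matches text at position {i}")
    return ids

def detokenize(ids):
    """Reconstitute the original characters from token IDs."""
    return "".join(vocab[tid] for tid in ids)

ids = tokenize("the cat sat.")
print(ids)              # [0, 1, 2, 3]
print(detokenize(ids))  # the cat sat.
```

The model itself only ever sees the ID sequence; the character strings stay in the vocabulary until text needs to be reconstituted.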
3. Data Store
In LLMs, data at rest is typically kept on solid-state drives (SSDs) for efficient access, but during inference the trained parameters and token embeddings are loaded into RAM (or GPU memory) for faster processing. The model uses specialized data structures to manage tokens and their associations.
While the original characters (in their Unicode form) are stored in the vocabulary and can be retrieved during text generation, the model’s internal data structures (vectors, matrices, and tensors) carry only the numeric token IDs that point back to them. These structures allow the model to handle large amounts of data quickly and ensure that tokens are efficiently processed and stored.
4. Data Sizes
For models like GPT-3 or GPT-4, storage needs can be significant, but they still remain within consumer-level storage capacities. GPT-3, for example, has roughly 175 billion parameters; stored at two bytes each, that comes to around 350 GB. The vocabulary of original text fragments is comparatively tiny; the learned weights take up the bulk of the model’s storage needs.
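The 350 GB figure follows from simple arithmetic, assuming the published 175-billion-parameter count and 16-bit (two-byte) precision per weight:

```python
params = 175_000_000_000   # GPT-3's published parameter count
bytes_per_param = 2        # 16-bit (half-precision) storage

total_bytes = params * bytes_per_param
total_gb = total_bytes / 1_000_000_000

print(f"{total_gb:.0f} GB")  # 350 GB
```

Quantizing to one byte per parameter would halve this, which is how smaller deployments shrink their footprint.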
In smaller models (such as those running on a Raspberry Pi or older laptops), storage is optimized further by compressing the data or quantizing weights to lower precision, reducing the need for vast amounts of memory.
5. Algorithms
LLMs rely on numerous algorithms to manage their various processes:
Insertion, deletion, and updates to tokens and data during training.
Tokenization algorithms to convert raw input into tokens.
Inference algorithms to combine user input with the model’s pre-trained tokens and weights for generating responses.
These algorithms are optimized for performance, often written in compiled languages like C++ or CUDA. This ensures that large amounts of data can be processed efficiently, particularly during training and inference. Python, while slower, is sometimes used for runtime flexibility, as seen in the Whisper model.
Each phase of the AI lifecycle (curation, training, and use) requires specific algorithms, with some being tailored for AutoML or meta-learning, where AI helps train other AI systems.
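The inference algorithm mentioned above reduces, at its core, to a loop: tokenize the prompt, predict the next token, append it, and repeat until some limit. A stripped-down sketch, with a stand-in predictor in place of a real trained network (everything here is illustrative):

```python
import random

VOCAB_SIZE = 50

def fake_logits(ids):
    """Stand-in for a trained model: one score per vocabulary token.
    Seeded by the context so the same input gives the same scores."""
    random.seed(sum(ids))
    return [random.random() for _ in range(VOCAB_SIZE)]

def generate(prompt_ids, max_new_tokens=5):
    """Greedy decoding: always pick the highest-scoring next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = fake_logits(ids)
        next_id = max(range(VOCAB_SIZE), key=lambda i: logits[i])
        ids.append(next_id)
    return ids

out = generate([1, 2, 3])
print(len(out))  # 8: three prompt tokens plus five generated
```

In a real system, `fake_logits` is replaced by a forward pass through billions of weights, and the final IDs are mapped back through the vocabulary to characters.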
6. Codebase
The codebase for an LLM is smaller compared to traditional systems like Microsoft Windows (50 million lines of code), likely amounting to a few million lines for most LLMs. The complexity is not in the volume of code but in optimizing algorithms for performance and scalability.
The runtime code of smaller models (like Whisper) can be compact enough to run on consumer hardware, such as a Raspberry Pi or an older laptop, but still manages to handle the complex tasks required for processing language input.
7. Resources
Running LLMs demands considerable computational resources. Training requires massive computational cycles due to the large-scale matrix operations involved in processing tokens, vectors, and tensors. Efficient I/O operations and parallelism are essential to managing the model’s needs.
8. LLM AI Cycle Resources
During inference and training, LLMs require extensive resources, particularly RAM, GPU, TPU, and CPU. Models like GPT-3 may need hundreds of gigabytes of RAM and specialized hardware like GPUs or TPUs for efficient processing.
However, smaller models, such as those running on Raspberry Pi or older laptops, can operate with limited resources but may perform slower. The Whisper model, for instance, is written in Python, making it less efficient but still operable on hardware without GPU support. Although Whisper runs slowly on such machines, performance is vastly improved when deployed on more suitable hardware.
Conclusion
LLMs rely on a sophisticated architecture in which a tokenizer vocabulary preserves the original characters for later use in text generation, while token IDs pointing to those characters are organized in data structures such as vectors, matrices, and tensors. These data structures, combined with optimized algorithms, ensure efficient processing, even when working with smaller hardware systems. While the storage needs for the largest models can be significant, they are still manageable within consumer hardware limits, with computational efficiency being achieved through optimized code and hardware acceleration.
Part II - Workflow of Curation, Training, Prompting, and Generation in Large Language Models
Large language models (LLMs) like GPT-3 or Whisper are developed through a multi-phase workflow that involves curation, training, prompting, and generation. Each phase has distinct roles for humans and software, and the entire process is influenced by both human and machine limitations, including issues like data accuracy, propaganda, and subjective interpretation. This section will first provide a general overview of the workflow and then delve into each phase in more detail, exploring the roles humans and software play, as well as the epistemological and interpretive challenges inherent in the process.
General Workflow Overview
Curation: The first step involves gathering, cleaning, and organizing large datasets that the model will use during training. Humans play a key role in selecting the sources, but software automates much of the data collection and processing.
Training: In this phase, the model is trained to learn patterns and relationships in the data. Humans set the objectives and fine-tune the training process, while software performs the heavy lifting of learning from the data and adjusting the model’s parameters.
Prompting: This is where users interact with the trained model, inputting queries (prompts) to get responses. The quality of the prompts and the model’s understanding of them play a crucial role in generating useful outputs.
Generation: Based on the prompt, the model generates text by predicting the next token or sequence. Humans evaluate the generated output, revising prompts if necessary, to better fit their expectations or needs.
1. Curation: Gathering and Organizing Data
Description:
Curation is the process of collecting data that the model will learn from. The quality and diversity of this data directly impact the model’s capabilities. Ideally, the curated data represents a wide range of factual, neutral, and accurate information from across the web, books, research papers, and other sources. However, in practice, this data may also include incorrect information, propaganda, and biased viewpoints.
Role of Humans:
Selecting Sources: Humans decide where to gather data from, which means that their understanding and worldview affect the dataset. Choices about which sources to include or exclude introduce a level of subjectivity.
Cleaning Data: Human curators also help filter and clean data. Here, they may inadvertently allow incorrect information or propaganda to remain in the dataset, either because they didn’t detect it or because they misunderstood it.
Corporate Influence: Curators often work in a corporate environment, meaning that their decisions are influenced by corporate goals, which may prioritize certain types of content over others, either for profit or to align with the company’s public image or business objectives.
Role of Software:
Automated Collection: Software tools automate the collection of large datasets by scraping the web or processing text from various sources. This allows curators to work with a far larger dataset than would be possible manually.
Data Processing: Software is responsible for structuring and tokenizing the data for use during training. It also helps remove obvious errors, duplicates, and irrelevant content, though it cannot always filter out subjective biases or propaganda.
Epistemological Considerations:
Data curation is inherently interpretative—human curators make decisions based on their worldview, and software follows predefined rules, which cannot account for every nuance. Correct or incorrect information is not always clear-cut, as it depends on subjective interpretations of what constitutes a fact or reliable information. In this phase, human understanding and bias interact, but they are not the same: understanding is the capacity to process and interpret data, while bias is a preconceived inclination or prejudice.
2. Training: Learning from Data
Description:
In the training phase, the model processes the curated data and begins learning patterns. This phase involves training the model to predict text sequences based on the input data, adjusting its parameters (weights) through algorithms like gradient descent.
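Gradient descent can be shown in miniature: nudge a parameter in the direction that reduces prediction error. A toy one-parameter example (real models repeat this step across billions of weights):

```python
# Fit w so that w * x approximates y, for one training pair.
x, y = 2.0, 6.0   # toy training example; the true w is 3
w = 0.0           # start from an arbitrary weight
lr = 0.1          # learning rate

for _ in range(100):
    pred = w * x
    grad = 2 * (pred - y) * x  # derivative of squared error wrt w
    w -= lr * grad             # step against the gradient

print(round(w, 3))  # 3.0
```

Each update shrinks the error; after enough steps the weight settles at the value that best predicts the training data, which is exactly what happens (at vast scale) to an LLM's parameters.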
Role of Humans:
Setting Training Objectives: Humans define what the model should focus on during training, such as improving language understanding, text generation quality, or factual accuracy. This is a subjective process that reflects the goals of the model developers and their understanding of language and information.
Monitoring and Fine-Tuning: Human engineers monitor the training process and adjust the model based on its performance, deciding which areas need more refinement. These decisions are influenced by their own biases and the corporate environment in which they work.
Role of Software:
Executing Training: Software, using powerful hardware like GPUs or TPUs, performs the computationally intense process of adjusting the model's parameters based on the patterns it learns from the data. The model updates its weights to better predict future outputs based on the training objectives set by humans.
Handling Data: The model processes the tokenized data as vectors, matrices, or tensors, where each token has corresponding learned weights. The original characters remain in the tokenizer vocabulary; it is the token IDs, the pointers, that the model actually operates on during learning.
Epistemological Considerations:
Training objectives reflect subjective human goals, and what is deemed "correct" or "relevant" in training is often subject to interpretation. Models are trained to approximate human language, but understanding and bias influence what patterns they learn. In this phase, a key challenge is that training models on large datasets can sometimes lead to garbage in, garbage out (GIGO) if the data itself is flawed or biased.
3. Prompting: Inputting Queries
Description:
Once the model is trained, users (prompters) interact with it by entering prompts—queries or statements that the model processes and responds to. The quality and clarity of the prompt greatly affect the output.
Role of Humans:
Devising Prompts: The prompter creates inputs for the model, but their understanding of the world, the model’s behavior, and the prompting process itself is limited. This can result in suboptimal prompts, which may lead to suboptimal responses.
Evaluating Outputs: Prompters evaluate the generated outputs and adjust their prompts in response. They interpret the model's responses based on their own worldview and biases, which can skew their judgment of the output’s accuracy or relevance.
Role of Software:
Tokenization of Prompts: When a user inputs a query, software tokenizes the prompt and looks up each token’s embedding vector in the model’s learned embedding table.
Generating Output: The software then generates a response based on the learned patterns, often in a somewhat randomized or probabilistic manner. This can sometimes produce unintended results depending on the model’s interpretation of the prompt.
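That "somewhat randomized or probabilistic manner" is typically temperature sampling over the model's token scores. A sketch with made-up scores for a four-token vocabulary (not from any real model):

```python
import math
import random

def sample_token(logits, temperature=1.0, seed=None):
    """Softmax with temperature, then draw a token ID at random.
    Lower temperature -> more deterministic; higher -> more varied."""
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0
    for tid, p in enumerate(probs):
        cum += p
        if r < cum:
            return tid
    return len(probs) - 1

# Made-up scores; token 2 is strongly favored.
logits = [1.0, 2.0, 5.0, 0.5]
print(sample_token(logits, temperature=0.1, seed=0))  # near-greedy: token 2
```

At low temperature the favored token wins almost every time; at high temperature the model wanders, which is one source of the "unintended results" noted above.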
Epistemological Considerations:
Prompters bring their human biases and subjective understanding into the interaction, which affects both the quality of the prompts and how they assess the model’s output. Prompting is not an exact science, and there is often a trial-and-error process where the user must refine their prompts based on the generated responses.
4. Generation: Producing Outputs
Description:
Generation is the final phase, where the model uses its learned patterns to generate text or other outputs in response to the prompt.
Role of Humans:
Interpreting Results: Humans review the generated output, judging its quality and accuracy based on their own worldview, subjective interpretations, and epistemological limitations. If the output is unsatisfactory, they may revise their prompt and try again.
Role of Software:
Prediction and Response: The software generates the next token or sequence by predicting the most likely outcome based on the input prompt. This process relies heavily on the model’s trained parameters and the algorithms governing its output.
Retrieving Original Characters: During generation, the model maps each generated token ID back through the vocabulary built during curation to retrieve the original characters and produce readable text. Along the way, the tokens are represented as vectors, matrices, or tensors, depending on the complexity of the task.
Epistemological Considerations:
At this stage, interpretation is key. Deciding whether the output is "correct" or "useful" depends on subjective factors, such as the human’s expectations, knowledge, and biases. There is no foolproof way to ensure that the generated text is free from error, as the process relies on the imperfect nature of both the input data and the human’s evaluation.
Conclusion: Training Objectives and Epistemological Limits
The entire workflow of large language models is shaped by the subjective decisions made at each stage, from data curation to output generation. Training objectives reflect human priorities and biases, which are different from the model's understanding of the world. Every human involved, whether curating data, training models, or prompting for responses, operates under epistemological limitations, meaning that their interpretations of the data and the model’s outputs are subjective and worldview-dependent.
By acknowledging these limitations, we can better understand the inherent challenges in building and using LLMs, recognizing that while these models are powerful, they are also deeply influenced by the humans who create, train, and interact with them.