Crosspost: Understanding Large Language Models
A Cross-Section of the Most Relevant Literature To Get Up to Speed
“Large language models have taken the public attention by storm – no pun intended. In just half a decade large language models – transformers – have almost completely changed the field of natural language processing. Moreover, they have also begun to revolutionize fields such as computer vision and computational biology.
Since transformers have such a big impact on everyone’s research agenda, I wanted to flesh out a short reading list (an extended version of my comment yesterday) for machine learning researchers and practitioners getting started.” — Sebastian Raschka, PhD, Apr 16, 2023
Preface:
This article by Sebastian Raschka provides a reading list of academic research papers for machine learning researchers and practitioners who want to understand large language models (LLMs) and transformers. The list includes the following papers:
"Neural Machine Translation by Jointly Learning to Align and Translate" (2014) by Bahdanau, Cho, and Bengio, which introduces an attention mechanism for recurrent neural networks (RNNs).
"Attention Is All You Need" (2017) by Vaswani et al., which introduces the original transformer architecture and concepts such as scaled dot product attention and positional input encoding.
"On Layer Normalization in the Transformer Architecture" (2020) by Xiong et al., which discusses the placement of layer normalization in the transformer architecture.
"Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Neural Networks" (1991) by Schmidhuber, which proposes an alternative to recurrent neural networks called Fast Weight Programmers (FWP).
"Universal Language Model Fine-tuning for Text Classification" (2018) by Howard and Ruder, which proposes pretraining language models and transfer learning for downstream tasks.
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018) by Devlin et al., which introduces the concept of masked-language modeling and next-sentence prediction.
"Improving Language Understanding by Generative Pre-Training" (2018) by Radford and Narasimhan, which introduces the GPT architecture and pretraining via next-word prediction.
"BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" (2019) by Lewis et al., which combines encoder and decoder parts to create a versatile LLM.
"Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond" (2023) by Yang et al., which provides an overview of different LLM architectures and their evolution.
The article also suggests reading papers on improving efficiency, such as "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) by Dao et al. and "Cramming: Training a Language Model on a Single GPU in One Day" (2022) by Geiping and Goldstein. Additionally, it mentions "LoRA: Low-Rank Adaptation of Large Language Models" (2021) by Hu et al. as an influential approach for parameter-efficient fine-tuning; a small sketch of the low-rank update idea follows below.
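
To give a flavor of why LoRA is parameter-efficient, here is a minimal sketch of the core idea: the pretrained weight matrix W stays frozen, and only a low-rank update B·A is trained, so the effective weight becomes W + B·A. This is a plain-NumPy illustration with assumed names, shapes, and scaling, not the authors' reference implementation.

```python
import numpy as np

d_out, d_in, rank = 64, 64, 4

W = np.random.randn(d_out, d_in) * 0.02   # pretrained weight, kept frozen
A = np.random.randn(rank, d_in) * 0.01    # trainable low-rank factor (rank x d_in)
B = np.zeros((d_out, rank))               # trainable low-rank factor, zero-initialized
alpha = 8                                 # scaling hyperparameter (assumed value)

def lora_forward(x):
    # Frozen path plus the scaled low-rank adaptation path: x (W + (alpha/rank) * B A)^T
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = np.random.randn(2, d_in)              # batch of 2 inputs
print(lora_forward(x).shape)              # (2, 64)
```

For the shapes above, the adaptation trains only rank × (d_in + d_out) = 512 values instead of the 4,096 entries of the full weight matrix, which is where the parameter savings come from.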
Lastly, it recommends "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" (2023) by Lialin et al. as a survey paper on parameter-efficient fine-tuning of LLMs.
Article:
Sebastian Raschka, PhD, Apr 16, 2023

