Over the last couple of years, developments in AI and Natural Language Processing seem to have accelerated significantly. There are more companies, more research papers, more products, and more capital invested in AI every day. Both expectations and promises have inflated, making categorization and analysis more difficult. To help make more sense of things, in this article I want to talk through the trends and constants that brought us to where we stand today.
I also want to clarify what “language models” are and what happens when they are “large”, the main architectures and model families, and all the other factors that influence the building of an LLM beyond the architecture and choice of pre-trained base. Finally, I want to touch upon future expectations relating to LLMs. I will make frequent use of the excellent recent survey by Minaee et al.
Language modeling is a decades-old research domain, with the earliest notable attempt being Claude Shannon’s investigation, in which he measured how well simple n-gram language models predict natural language text. Since then, language modeling has become fundamental to many natural language understanding and generation tasks, ranging from speech recognition and machine translation to information retrieval.
In essence, language models are probabilistic models of language that enable us to predict, classify and generate natural language based on specific inputs. This task is highly non-trivial due to, among many other reasons, data sparsity, the ambiguity of natural language, and long-range dependencies within text.
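To make this concrete, here is a minimal sketch of the n-gram idea mentioned above, in the form of a toy bigram model in Python; the tiny corpus and the absence of smoothing are simplifying assumptions for illustration only.

```python
from collections import Counter

# Toy corpus; real n-gram models were estimated over millions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams and unigrams to estimate P(word | previous word).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    # Maximum-likelihood estimate; practical systems add smoothing to cope
    # with unseen pairs -- the data-sparsity problem mentioned above.
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 0.25: "the" is followed by "cat" in 1 of its 4 occurrences
```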
Early models that showed a degree of practical utility were usually highly task-specific, serving as one of many components in heavily engineered and improvised Natural Language Processing pipelines. Variations of neural networks such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks and Gated Recurrent Unit (GRU) networks achieved notable results in this era and became viable in applications. Yet operating them cost-effectively in production environments was very challenging and required experiential knowledge available to only a small number of specialists.
While word2vec did indeed establish foundational principles for representing semantics and context within vector representations, the “big bang” moment with respect to language modeling was arguably the 2017 publication of the first Transformer paper. This paper introduced the “attention” concept, which is essentially a series of matrix operations computed by a Transformer network that approximate statistical dependencies among language “tokens” over a wide range of distances.
By this time, the “pre-training” - “fine-tuning” approach had already gained some popularity.
But the greater parallelization capacity of the Transformer architecture enabled a significant improvement in training efficiency, essentially making these models more scalable with respect to training data and model size compared to previous state-of-the-art models. Accordingly, ever-larger models pre-trained on ever-larger portions of the web led to more task-agnostic Pretrained Language Models being used as bases upon which different fine-tuning methods and datasets were applied. Some of the first Pretrained Language Models were ChatGPT’s predecessors, GPT-1 and GPT-2, along with BERT (which achieved state-of-the-art results on language-understanding benchmarks).
LLMs are large-scale, pre-trained, statistical language models based on neural networks. “Large” here refers not just to the model size and dataset size but also to certain “emergent” properties that are not found in smaller models. These emergent abilities include in-context learning, instruction following and multi-step reasoning. With in-context learning, an LLM learns a new task from a small set of examples presented in the prompt at inference time; with instruction following, it can carry out new types of tasks described in natural language without explicit examples; and with multi-step reasoning, it can solve a complex task by breaking it down into intermediate reasoning steps.
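As a rough illustration, the three abilities correspond to three familiar prompting patterns; the prompts below are hypothetical and not tied to any particular model.

```python
# 1) In-context (few-shot) learning: the task is defined entirely by examples in the prompt.
few_shot_prompt = """Translate English to French.
sea otter => loutre de mer
cheese => fromage
plush giraffe =>"""

# 2) Instruction following: the task is described directly, with no examples.
instruction_prompt = "Summarize the following paragraph in one sentence:\n<paragraph goes here>"

# 3) Multi-step reasoning: the prompt asks for intermediate steps before the final answer.
reasoning_prompt = "A train travels 60 km in 1.5 hours. What is its average speed? Think step by step."
```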
GPT-3 is widely considered the first “Large” Pretrained Language Model. It was not only much larger than previous pretrained models, but also demonstrated for the first time emergent abilities that were not observed in its smaller predecessors. GPT-3 showed the emergent ability of in-context learning, meaning it can be applied to downstream tasks without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieved strong performance on many NLP tasks like translation and question answering, as well as several that require on-the-fly reasoning or domain adaptation.
Thus, the “Large” Language Model era was born. OpenAI’s success was followed by rival model families released by Meta (LLaMA) and Google (PaLM), along with a collection of other experimental open-source pretrained LLMs with different modifications and datasets.
LLaMA, a series of foundation language models by Meta, distinguishes itself from GPT models by being open-source, providing model weights to the research community under a noncommercial license. This has fostered rapid growth in the LLaMA family, as it has been widely adopted for developing better open-source LLMs or task-specific ones. Initial releases in February 2023 ranged from 7B to 65B parameters, surpassing GPT-3 in performance on various benchmarks. LLaMA employs the Transformer architecture with minor modifications such as the SwiGLU activation function, rotary positional embeddings, and root-mean-square layer normalization (RMSNorm).
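As a hedged sketch in PyTorch, with illustrative dimensions, two of these modifications, RMSNorm and the SwiGLU feed-forward block, look roughly as follows.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization: rescales activations by their RMS
    instead of subtracting the mean as in standard LayerNorm."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: one projection is passed through SiLU (swish)
    and multiplied element-wise with another, then projected back down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(torch.nn.functional.silu(self.w_gate(x)) * self.w_up(x))

# Example usage with illustrative dimensions (not the actual LLaMA sizes).
x = torch.randn(2, 16, 512)                        # (batch, sequence, model dim)
print(SwiGLU(512, 2048)(RMSNorm(512)(x)).shape)    # torch.Size([2, 16, 512])
```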
The PaLM (Pathways Language Model) family, developed by Google, introduced its first model in April 2022, which remained private until March 2023. This initial model, boasting 540 billion parameters, is Transformer-based and trained on a large text corpus covering various language tasks. Utilizing the Pathways system for efficient training across multiple TPU (Tensor Processing Unit) Pods, PaLM achieved state-of-the-art few-shot learning results on numerous benchmarks. Subsequent U-PaLM models, ranging from 8B to 540B parameters, were further trained with UL2R, resulting in computational savings. Flan-PaLM models, instruction-finetuned on a vast number of tasks and datasets, substantially outperformed previous models, with Flan-PaLM-540B achieving notable performance gains compared to PaLM-540B.
Finally, a number of other significant models have been developed by researchers using different pretraining and finetuning approaches.
These models represent advancements in various aspects of language modeling, from performance enhancement to specialized applications like dialogue and zero-shot learning. Each model addresses specific challenges within the field and contributes to the ongoing development of large language models.
Although many of these models and model families bring unique advantages and drawbacks, they are all instances of three essential architectural classes, with variations in training processes, dataset collection and preparation, and fine-tuning methods.
The Transformer approach introduced a paradigm shift in natural language processing, primarily due to its novel self-attention mechanism optimized for parallel computation on GPUs. At its core, the Transformer architecture consists of an encoder and a decoder.
The encoder comprises a stack of identical Transformer layers, each housing two sub-layers: a multi-head self-attention layer and a position-wise fully connected feed-forward network.
On the other hand, the decoder, also composed of a stack of layers, incorporates an additional sub-layer for multi-head attention over the encoder's output stack.
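A minimal PyTorch sketch of such an encoder layer, with illustrative hyperparameters, might look like this.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: multi-head self-attention followed by a
    position-wise feed-forward network, each wrapped in a residual + LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: queries, keys and values all come from x
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # position-wise feed-forward sub-layer
        return x

x = torch.randn(2, 10, 512)                # (batch, sequence, model dim)
print(EncoderLayer()(x).shape)             # torch.Size([2, 10, 512])
```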
In the Transformer model, attention functions play a pivotal role, essentially mapping queries and a set of key-value pairs to an output. These functions, operating on vectors representing queries, keys, values, and outputs, compute weighted sums of values based on the compatibility of queries with corresponding keys. Notably, linear projections are employed to project queries, keys, and values into different dimensions for improved performance. Additionally, positional encoding is introduced to imbue the model with information about token positions within sequences, whether absolute or relative.
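At the heart of those attention functions is scaled dot-product attention; here is a minimal sketch of that computation (multi-head attention wraps it in the learned linear projections mentioned above, and positional encodings are added to the token embeddings beforehand).

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V: each output is a
    weighted sum of the values, weighted by query-key compatibility."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # query-key compatibility
    weights = torch.softmax(scores, dim=-1)         # normalize over the keys
    return weights @ V                              # weighted sum of the values

# Illustrative shapes: 10 query and 10 key/value positions, dimension 64.
Q, K, V = torch.randn(10, 64), torch.randn(10, 64), torch.randn(10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([10, 64])
```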
Beyond the foundational Transformer architecture, subsequent developments have led to various model families, each catering to specific NLP tasks.
Encoder-only models, exemplified by BERT, are adept at comprehending entire sequences and excel in tasks like sentence classification and named entity recognition. These models are typically pre-trained by masking words and training the model to reconstruct the original sentences.
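As an illustration, assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint, the masked-word objective can be exercised directly.

```python
from transformers import pipeline

# Encoder-only model used with its masked-language-modeling head:
# the model sees the whole sentence and predicts the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Language models are trained to [MASK] missing words.")[0]["token_str"])
```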
In contrast, decoder-only models, such as those in the GPT, LLaMA and PaLM families, are tailored for text generation tasks, with attention layers accessing only preceding words. Pre-training for decoder-only models typically involves predicting the next word in a sequence.
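A corresponding sketch for a decoder-only model, again assuming the transformers library, this time with the GPT-2 checkpoint.

```python
from transformers import pipeline

# Decoder-only model: each token attends only to preceding tokens,
# so generation proceeds by repeatedly predicting the next word.
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_new_tokens=20)[0]["generated_text"])
```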
Encoder-decoder models, which integrate both encoder and decoder components, are commonly referred to as sequence-to-sequence models. In this architecture, encoder attention layers can access the entire input sequence, while decoder attention layers focus solely on preceding words. Pre-training for encoder-decoder models involves complex objectives, such as predicting masked text spans. These models are particularly suitable for tasks like summarization, translation, and generative question answering, where generating new sentences conditioned on input is essential.
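And a sequence-to-sequence sketch, assuming the transformers library and a small T5 checkpoint.

```python
from transformers import pipeline

# Encoder-decoder model: the encoder reads the full input, the decoder
# generates the output conditioned on it (here, English-to-French translation).
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Language models predict the next word.")[0]["translation_text"])
```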
Although model architecture and the fine-tuning paradigm get most of the attention regarding model building and performance optimization, there are many more aspects of the building process that influence performance outcomes.
Overall, these techniques play crucial roles in improving the quality, efficiency, and effectiveness of large language models.
Going forward, the focus of the research and business communities will be directed more towards deployability, robustness and efficiency. Beyond the flashy demos, benchmarks relating to predictability and robustness will attract more attention. Efficiency, not only in terms of model latency but also in terms of training and inference costs, will attract greater scrutiny, pushing research efforts towards those metrics.
As can be seen, despite the enormous progress made in various benchmarks since 2017, the main architectural direction of models has not changed considerably since then. The diversity of models and their various strengths and weaknesses should not blind us to the fact that success in language modeling applications hinges on keeping our eye on metrics that go beyond mere “next token accuracy”, and incorporate business and application level success metrics.
Crucially, we have also seen that pre-training and model selection are but one part of a larger chain of processes that make up model building. Aspects like alignment strategy, dataset curation and pre-processing, and deployment optimization are important in differentiating a production-level model serving customers from a toy model running on our laptop for our own amusement.
As model building is democratized through easier access to hardware, and as data lies in abundance within every business, what is truly differentiating in the AI stack?