Takeaways From “LLM: a Survey” - Where are You Differentiating? by @surbhirathore

by surbhi, March 18th, 2024

Too Long; Didn't Read

The LLM Survey paper is a good foundation for startups thinking about where they can differentiate when building the AI stack.

Over the last couple of years, developments in AI and Natural Language Processing appear to have accelerated significantly. There are more companies, more research papers, more products and more capital targeting AI every day. Both expectations and promises have inflated, making categorization and analysis more difficult. To help make more sense of things, in this article I want to talk through the trends and constants that brought us to where we stand today.

I also want to clarify both “language models” and what happens when they are “large”, the main architectures and model families, and all the other factors that influence the building of an LLM beyond the architecture and the choice of pre-trained base. Finally, I want to touch upon future expectations for LLMs. I will make frequent use of the excellent recent survey by Minaee et al. and summarize relevant takeaways.

Some background…

Language modeling is a decades-old research domain. One of the earliest notable attempts was Claude Shannon’s, who measured how well simple n-gram language models predict natural language text. Since then, language modeling has become fundamental to many natural language understanding and generation tasks, ranging from speech recognition and machine translation to information retrieval.

Language models, then, are probabilistic models of language that enable us to predict, classify and generate natural language based on specific inputs. This task is highly non-trivial due to, among many other reasons, data sparsity, the ambiguity of natural language, and long-range dependencies within the datasets.
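To make “probabilistic model of language” concrete, here is a minimal sketch of a bigram (2-gram) model, the simplest member of the n-gram family mentioned above. The toy corpus and the function names are made up for illustration; a real model would add smoothing to cope with the data sparsity noted above.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")

# "the" is followed by "cat" twice and "mat" once in this corpus.
print(probs)
```

Any word pair that never occurs in the corpus gets probability zero here, which is exactly the sparsity problem that pushed the field towards neural models.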

Early models that showed a degree of practical utility were usually highly task-specific, and typically served as one among many models in heavily engineered and improvised Natural Language Processing pipelines. Variations of neural networks such as Recurrent Neural Networks, LSTM (Long Short-term Memory) networks and GRU (Gated Recurrent Unit) networks achieved notable results in this era, becoming viable in applications. Yet operating these cost-effectively in production environments was very challenging and required experiential knowledge available to only a small number of specialists.

While word2vec did indeed establish foundational principles for representing semantics and context within vector representations, the “big bang” moment for language modeling was arguably the 2017 publication of the first Transformer paper. This paper put the “Attention” mechanism at the center of the architecture. Attention is essentially a series of matrix operations computed by a Transformer network that helps approximate statistical dependencies among language “tokens” over a wide range of distances.

By this time, the “pre-training” - “fine-tuning” approach had already gained some popularity.

But the stronger parallelization capacity of the Transformer architecture enabled a significant improvement in training efficiency, making these models far more scalable with respect to training data and model size than previous state-of-the-art models. As a result, ever larger models were (pre)trained on ever bigger portions of the web, which led to more task-agnostic Pretrained Language Models serving as bases on which different fine-tuning methods and datasets were applied. Some of the first Pretrained Language Models were ChatGPT’s predecessors GPT-1 and GPT-2, along with BERT (which achieved SOTA results on language understanding benchmarks such as GLUE and SQuAD).

LLMs are large-scale, pre-trained, statistical language models based on neural networks. “Large” here refers not just to the model size and dataset size but also to certain “emergent” properties that are not found in smaller models. These emergent abilities include in-context learning, instruction following and multi-step reasoning. With in-context learning, LLMs learn a new task from a small set of examples presented in the prompt at inference time. With instruction following, LLMs can follow the instructions for new types of tasks without using explicit examples. With multi-step reasoning, LLMs can solve a complex task by breaking it down into intermediate reasoning steps.

GPT-3 is widely considered the first “Large” Pretrained Language Model. It was not only much larger than previous Pretrained Models, but also demonstrated for the first time emergent abilities that were not observed in smaller models. GPT-3 showed the emergent ability of in-context learning, which means it can be applied to any downstream task without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieved strong performance on many NLP tasks like translation and question-answering, as well as several that require on-the-fly reasoning or domain adaptation.
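As a rough illustration of what “few-shot demonstrations specified purely via text” looks like in practice, the sketch below assembles a prompt from a task description, a handful of worked examples and a new input. The `Input:`/`Output:` layout and the function name are assumptions chosen for readability, not a format prescribed by the survey or by any particular model.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a plain-text prompt: task description, worked examples, new input."""
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final block is left open so the model completes the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

No model weights change here: the “learning” happens entirely through the text the model is conditioned on at inference time.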

Thus, the “Large” Language Model era was born. OpenAI’s success was followed by rival model families released by Meta (LLaMA) and Google (PaLM), along with a collection of other experimental open-source pretrained LLMs with different modifications and datasets.

The Growing Model Universe

LLaMA, a series of foundation language models by Meta, distinguishes itself from GPT models by being open-source, providing model weights to the research community under a noncommercial license. This has fostered rapid growth in the LLaMA family as it has been widely adopted for developing better open-source LLMs or task-specific ones. Initial releases in February 2023 ranged from 7B to 65B parameters, surpassing GPT-3 in performance on various benchmarks. LLaMA employs the transformer architecture with minor modifications such as using SwiGLU activation, rotary positional embeddings, and root-mean-squared layer-normalization.

The PaLM (Pathways Language Model) family, developed by Google, introduced its first model in April 2022, which remained private until March 2023. This initial model, with 540 billion parameters, is transformer-based and trained on a large text corpus across various language tasks. Using the Pathways system for efficient training on multiple TPU (Tensor Processing Unit) Pods, PaLM achieved state-of-the-art few-shot learning results across numerous benchmarks. Subsequent U-PaLM models, at scales from 8B to 540B, are PaLM models continually trained with UL2R, resulting in computational savings. Flan-PaLM models, instruction-finetuned on a vast number of tasks and data, substantially outperformed previous models, with Flan-PaLM-540B achieving notable performance gains compared to PaLM-540B.

Finally, there are models that researchers have developed using different pretraining and finetuning approaches. The most significant ones are:

  • Flan:
    • Wei et al. introduced Flan, which enhances zero-shot learning abilities by instruction-tuning language models on a collection of datasets described via instructions.
    • Fine-tuning a pretrained model on over 60 NLP datasets phrased as instructions yields Flan, which significantly improves zero-shot performance on unseen tasks.
  • Gopher:
    • Rae et al. presented Gopher, analyzing transformer-based language model performance across various scales, up to a 280 billion parameter model.
    • Gopher achieves state-of-the-art performance across diverse tasks, with architectural details provided in the study.
  • ERNIE 3.0:
    • Sun et al. proposed ERNIE 3.0, a unified framework for pre-training large-scale knowledge-enhanced models.
    • ERNIE 3.0 fuses auto-regressive and auto-encoding networks, facilitating adaptation for both natural language understanding and generation tasks.
  • RETRO:
    • Borgeaud et al. enhanced auto-regressive language models with RETRO, conditioning on document chunks retrieved from a large corpus based on local similarity with preceding tokens.
    • RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile dataset, despite using 25× fewer parameters.
  • GLaM (Generalist Language Model):
    • Du et al. proposed GLaM, utilizing a sparsely activated mixture-of-experts architecture to scale model capacity with reduced training costs.
    • The largest GLaM variant has 1.2 trillion parameters and achieves superior performance across various NLP tasks, while using roughly one third of the energy to train and half of the computation FLOPs for inference compared to GPT-3.
  • LaMDA:
    • Thoppilan et al. introduced LaMDA, a family of Transformer-based models specialized for dialogue tasks, trained on public dialogue data and web text.
    • Fine-tuning with annotated data and enabling external knowledge consultation lead to significant improvements in safety and factual grounding.

These models represent advancements in various aspects of language modeling, from performance enhancement to specialized applications like dialogue and zero-shot learning. Each model addresses specific challenges within the field and contributes to the ongoing development of large language models.

Although a lot of these models and model families bring unique advantages and drawbacks, they are instances of three essential architectural classes with variations on training processes, dataset collection and preparation, and fine-tuning methods.

3 main LLM architectures that are here to stay

The Transformer approach introduced a paradigm shift in natural language processing, primarily due to its novel self-attention mechanism optimized for parallel computation on GPUs. At its core, the Transformer architecture consists of an encoder and a decoder.

The encoder comprises a stack of identical Transformer layers, each housing two sub-layers: a multi-head self-attention layer and a position-wise fully connected feed-forward network.

The decoder, also composed of a stack of layers, adds a third sub-layer that performs multi-head attention over the output of the encoder stack.

In the Transformer model, attention functions play a pivotal role, essentially mapping queries and a set of key-value pairs to an output. These functions, operating on vectors representing queries, keys, values, and outputs, compute weighted sums of values based on the compatibility of queries with corresponding keys. Notably, linear projections are employed to project queries, keys, and values into different dimensions for improved performance. Additionally, positional encoding is introduced to imbue the model with information about token positions within sequences, whether absolute or relative.
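The “weighted sum of values based on query-key compatibility” can be written out in a few lines. The sketch below implements scaled dot-product attention, softmax(QKᵀ / √d_k)·V, in plain Python for readability. It is a simplified single-head version under stated assumptions: the learned linear projections, multiple heads and positional encodings mentioned above are left out, and the toy vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0], [0.0, 0.0]]
result = attention(queries, keys, values)
print(result)
```

The first query points in the same direction as the first key, so its output is dominated by the first value. The second query is equally compatible with both keys, so it receives an even blend of both values.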

Beyond the foundational Transformer architecture, subsequent developments have led to various model families, each catering to specific NLP tasks.

Encoder-only models, exemplified by BERT, are adept at comprehending entire sequences and excel in tasks like sentence classification and named entity recognition. These models are typically pre-trained by masking words and training the model to reconstruct the original sentences.

In contrast, decoder-only models, such as GPT, LLaMA and PaLM, are tailored for text generation tasks, with attention layers accessing only preceding words. Pre-training for decoder-only models typically involves predicting the next word in a sequence.

Encoder-decoder models, which integrate both encoder and decoder components, are commonly referred to as sequence-to-sequence models. In this architecture, encoder attention layers can access the entire input sequence, while decoder attention layers focus solely on preceding words. Pre-training for encoder-decoder models involves complex objectives, such as predicting masked text spans. These models are particularly suitable for tasks like summarization, translation, and generative question answering, where generating new sentences conditioned on input is essential.
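The difference between “can access the entire input sequence” and “only preceding words” comes down to which positions each token is allowed to attend to. A small sketch of the two masks follows; the function names are illustrative, and in this sketch (as in common implementations) a position is also allowed to attend to itself.

```python
def full_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


print("encoder (bidirectional):")
print(show(full_mask(4)))
print("decoder (causal):")
print(show(causal_mask(4)))
```

In an encoder-decoder model both patterns coexist: the encoder uses the full mask over the input, while the decoder uses the causal mask over what it has generated so far.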

Beyond the architecture, what else is influencing performance of these models?

Although model architecture and the fine-tuning paradigm get most of the attention regarding model building and performance optimization, there are many more aspects of the building process that influence performance outcomes.

  • Data Cleaning: Techniques such as filtering and deduplication are crucial for enhancing data quality and improving model performance. Filtering involves removing noise, handling outliers, addressing imbalances, and preprocessing text data. Deduplication eliminates duplicate instances or repeated occurrences of the same data in a dataset.
  • Tokenization: Tokenization involves converting text into smaller parts called tokens. Popular tokenization methods include Byte Pair Encoding (BPE), WordPiece, and SentencePiece, which address challenges such as out-of-vocabulary words and corrupted text.
  • Positional Encoding: Techniques like Absolute Positional Embeddings, Relative Positional Embeddings, Rotary Position Embeddings, and Relative Positional Bias are used to preserve sequence order and capture positional information in LLMs.
  • Model Pre-training: Pre-training involves training LLMs on large amounts of unlabeled text data using approaches like autoregressive language modeling and masked language modeling.
  • Fine-tuning and Instruction Tuning: Fine-tuning LLMs for specific tasks using labeled data improves performance. Instruction tuning fine-tunes the model on instruction-response examples so that its responses match what humans expect when they give instructions through prompts.
  • Alignment: Alignment techniques like reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), and Direct Preference Optimization (DPO) aim to steer LLMs towards human goals and preferences.
  • Decoding Strategies: Strategies such as greedy search, beam search, top-k sampling, and top-p sampling are used for text generation with LLMs, each with its advantages and trade-offs.
  • Cost-Effective Training/Inference/Adaptation/Compression: Techniques like optimized training, low-rank adaptation (LoRA), knowledge distillation, and quantization aim to reduce the computational and memory requirements of LLMs without substantially compromising performance.
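Of the items above, the decoding strategies are the easiest to show in a few lines. The sketch below applies greedy search, top-k sampling and top-p (nucleus) sampling to a single next-token distribution. The distribution is invented for illustration, and beam search is left out because it tracks several candidate sequences rather than one step.

```python
import random


def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng):
    """Draw one token according to the given distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution produced by a model.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "sky": 0.05}
rng = random.Random(0)

print(greedy(next_token_probs))
print(sample(top_k_filter(next_token_probs, 2), rng))
print(sample(top_p_filter(next_token_probs, 0.9), rng))
```

Greedy search is deterministic but repetitive. Top-k fixes the number of candidates, whereas top-p lets the candidate set grow or shrink with how confident the model is.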

Overall, these techniques play crucial roles in improving the quality, efficiency, and effectiveness of large language models.
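To give a sense of why low-rank adaptation is cost-effective, here is a minimal sketch of the LoRA idea: the pretrained weight W stays frozen and only two small matrices B and A are trained, giving an effective weight W + (alpha / r)·BA. The matrix sizes, the tiny worked example and the helper names are assumptions for illustration, written in plain Python rather than a deep learning framework.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, k), B has shape (d, r)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


# Tiny worked example with rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
adapted = lora_weight(w, b, a, alpha=1.0)

# Parameter count for a hypothetical 4096 x 4096 projection with rank 8.
d, k, r = 4096, 4096, 8
full = d * k                      # 16,777,216 weights if fully fine-tuned
lora = trainable_params(d, k, r)  # 65,536 trainable weights with LoRA
print(adapted, full, lora, full // lora)
```

For this hypothetical layer the adapter trains 256 times fewer weights than full fine-tuning, which is where the memory savings come from.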

Going forward, the focus of the research and business communities will shift more towards deployability, robustness and efficiency. Beyond the flashy demos, benchmarks relating to predictability and robustness will attract more attention. Efficiency, in terms of both model latency and training and inference costs, will attract greater scrutiny, pushing research efforts towards those metrics.

  • Smaller and More Efficient Language Models (SLMs):
    • There is a growing interest in developing SLMs as a cost-effective alternative to LLMs, especially considering the inefficiencies and high costs associated with larger models.
    • Techniques such as parameter-efficient fine-tuning (PEFT), teacher/student learning, and distillation are being used to create SLMs from larger models, enabling better efficiency for specific tasks.
    • Prominent works in this direction include Phi-1, Phi-1.5, and Phi-2 from Microsoft, which exemplify efforts to explore smaller, task-specific models.

  • New Post-Attention Architectural Paradigms:
    • While transformer blocks have been dominant in LLM frameworks, there's a growing exploration of post-attention architectural paradigms.
    • Alternative approaches like State Space Models (SSMs) and mechanisms such as Mixture of Experts (MoEs) are gaining traction.
    • MoEs, for instance, allow for the training of extremely large models that are only partially instantiated during inference, leading to efficiency gains.
    • Monarch Mixer proposes a novel architecture built on sub-quadratic primitives, improving hardware efficiency.
  • Improved LLM Usage and Augmentation Techniques:
    • Advanced prompt engineering, tool use, and augmentation techniques are addressing shortcomings like hallucination in LLMs.
    • LLMs are replacing other machine learning systems in various applications like customer service and content recommendation, emphasizing personalization and context analysis.
  • Security and Ethical/Responsible AI:
    • Research efforts are directed towards ensuring the robustness and security of LLMs against adversarial attacks and vulnerabilities.
    • Addressing ethical concerns and biases in LLMs is a priority to ensure fairness, unbiased behavior, and responsible handling of sensitive information.
    • As LLMs are increasingly deployed in real-world applications, efforts to mitigate potential threats and ensure responsible AI usage are crucial.
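The Mixture of Experts point above, that only part of a very large model is active for any given input, can be sketched with a toy router. In a real MoE layer the experts are feed-forward networks and the gate scores come from a learned router applied to the input. Here the experts are simple scalar functions and the scores are passed in directly, both of which are assumptions made to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_scores, experts, k=2):
    """Route x to the k highest-scoring experts and blend their outputs."""
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    # Only the selected experts are evaluated; the rest are never called.
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top


called = []  # records which experts actually ran


def make_expert(name, fn):
    def expert(x):
        called.append(name)
        return fn(x)
    return expert


experts = [
    make_expert("plus_one", lambda x: x + 1),
    make_expert("double", lambda x: x * 2),
    make_expert("negate", lambda x: -x),
    make_expert("square", lambda x: x * x),
]

# Hypothetical router scores for one input token.
output, chosen = moe_forward(3.0, [0.1, 2.0, 2.0, -1.0], experts, k=2)
print(output, chosen, called)
```

Four experts exist, but only two run for this input, so compute per token stays roughly constant as more experts (and parameters) are added.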

If you are building a model, what is your differentiation strategy?

As can be seen, despite the enormous progress made in various benchmarks since 2017, the main architectural direction of models has not changed considerably since then. The diversity of models and their various strengths and weaknesses should not blind us to the fact that success in language modeling applications hinges on keeping our eye on metrics that go beyond mere “next token accuracy”, and incorporate business and application level success metrics.

Crucially, we have also seen that pre-training and model selection are but one part of a long chain of processes that make up model building. Aspects like alignment strategy, dataset curation and pre-processing, and deployment optimization are what differentiate a production-level model serving customers from a toy model running on our laptop for our own amusement.

As model building is democratized by easy access to hardware, and data lies in abundance within every business, what is truly differentiating in the AI stack?