GPT-4 represents a major leap forward in large language model capabilities. Developed by OpenAI, it builds on the architecture and strengths of GPT-3 while achieving new levels of scale and performance.
This article summarizes the key details about GPT-4 based on currently available public information.
Total parameters: ~1.8 trillion (over 10x more than GPT-3)
Architecture: Uses a mixture of experts (MoE) model to improve scalability
Training compute: ~25,000 Nvidia A100 GPUs over 90-100 days (see the rough FLOPs estimate after this list)
Training data: ~13 trillion tokens
Context length: up to 32,000 tokens in the 32K variant (8,000 in the base model)
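As a back-of-the-envelope check on these figures, the sketch below applies the common 6·N·D rule of thumb for training FLOPs. Only the total-parameter and token counts come from the list above; the ~280B active-parameter figure is an assumption added for illustration, since an MoE model only uses a fraction of its parameters for each token.

```python
# Rough training-compute estimate using the 6 * N * D rule of thumb, where N is
# the number of parameters active per token and D is the number of training tokens.

TOTAL_PARAMS = 1.8e12      # ~1.8 trillion total parameters (reported)
ACTIVE_PARAMS = 2.8e11     # assumed ~280B parameters active per token (illustrative)
TRAINING_TOKENS = 13e12    # ~13 trillion training tokens (reported)

dense_flops = 6 * TOTAL_PARAMS * TRAINING_TOKENS   # if every parameter were used per token
moe_flops = 6 * ACTIVE_PARAMS * TRAINING_TOKENS    # counting only active (MoE) parameters

print(f"If dense: ~{dense_flops:.2e} FLOPs")
print(f"With MoE routing: ~{moe_flops:.2e} FLOPs")
```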
OpenAI's GPT-4 system card provides some transparency into the model's intended uses, capabilities, limitations, and safety mitigations, though details such as training data and architecture remain officially undisclosed.
GPT-4 reportedly uses a mixture of experts (MoE) architecture, with separate expert networks that specialize in certain tasks or data types.
Because only a subset of experts is activated for each token, the overall model can scale up while keeping inference cost per token practical, and the specialized experts can develop distinct capabilities (illustrated below).
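To make the routing idea concrete, here is a minimal PyTorch sketch of a top-2 MoE layer. The class name, layer sizes, and expert count are illustrative assumptions; GPT-4's actual routing scheme and expert configuration have not been published.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal mixture-of-experts layer: a router scores the experts for each
    token, the top-k experts process that token, and their outputs are combined
    weighted by the normalized router scores."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # per-expert routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[:, slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                    # 16 token embeddings
print(SimpleMoE()(tokens).shape)                 # torch.Size([16, 512])
```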
Training a ~1.8 trillion parameter model required extensive computational resources, on the order of the ~25,000 A100 GPUs running for 90-100 days noted above.
Reaching this scale reportedly relied on combining several parallelism techniques, such as tensor, pipeline, and data parallelism, to spread the model and the training batch across the GPU cluster (see the sketch below).
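As a small illustration of one such technique, the NumPy sketch below shows column-wise tensor parallelism for a single matrix multiply: each "device" holds only a slice of the weight matrix and computes its share of the output. Pipeline and data parallelism instead split the model by layers and the batch by examples. All shapes and the device count here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_devices = 512, 2048, 4

x = rng.standard_normal((8, d_model))        # a batch of 8 activation vectors
W = rng.standard_normal((d_model, d_ff))     # full weight matrix (never on one device in practice)

shards = np.split(W, n_devices, axis=1)              # each "device" owns d_ff / n_devices columns
partial_outputs = [x @ shard for shard in shards]    # computed independently, in parallel
y_parallel = np.concatenate(partial_outputs, axis=1) # gather the output slices

assert np.allclose(y_parallel, x @ W)        # matches the single-device result
```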
Deploying GPT-4 also requires specialized serving infrastructure: densely packed GPU inference clusters help keep per-query costs affordable at scale.
The MoE routing mechanism can lead to token dropping, where some tokens go unprocessed by their assigned expert because of per-expert capacity limits (see the sketch below).
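The short sketch below shows how capacity limits cause dropping. The capacity factor, expert count, and top-1 routing used here are assumptions for illustration, not GPT-4's actual configuration; dropped tokens typically still pass through the layer via the residual connection.

```python
import torch

num_tokens, num_experts, capacity_factor = 64, 8, 1.25
capacity = int(capacity_factor * num_tokens / num_experts)   # slots per expert per batch

assignments = torch.randint(num_experts, (num_tokens,))      # simulated top-1 routing decisions
dropped = 0
for e in range(num_experts):
    routed = (assignments == e).sum().item()
    dropped += max(0, routed - capacity)                      # overflow beyond the expert's capacity

print(f"capacity per expert: {capacity}, tokens dropped: {dropped} / {num_tokens}")
```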
While impressive, GPT-4 remains focused primarily on text; directions such as richer multimodality, longer context windows, and more reliable reasoning remain active areas for future research.
GPT-4 demonstrates the rapid pace of progress in language models. While we are still far from general intelligence, OpenAI continues pushing towards this goal with each new iteration. Exciting capabilities likely lie ahead.