A new paper from researchers at Google DeepMind demonstrates that large language models like Chinchilla are not just adept at generating human-like text - they are also excellent general-purpose compressors. They can compress many types of data, including text, images, and audio, down to sizes rivaling those achieved by dedicated compression algorithms like gzip, PNG, and FLAC.
Data compression is a fundamental capability in computing and AI. Compressing data means we can store and transmit it using less memory, disk space, and bandwidth. This saves costs and allows systems to scale.
But more importantly, good compression also indicates a deep understanding of the structure and patterns in data. To compress well, an algorithm needs to spot redundancies and exploit statistical regularities. So, compression capability acts as a benchmark for how much knowledge an AI system has learned.
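This link between redundancy and compressibility is easy to see with an off-the-shelf compressor: data with obvious structure shrinks dramatically, while random noise barely shrinks at all. A minimal sketch using Python's standard zlib:

```python
import random
import zlib

# Highly structured data: a repeated phrase with obvious redundancy.
structured = b"the cat sat on the mat. " * 400

# Unstructured data: pseudo-random bytes of the same length.
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(len(structured)))

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of the original (lower is better)."""
    return len(zlib.compress(data)) / len(data)

print(f"structured: {ratio(structured):.3f}")  # shrinks to a few percent
print(f"random:     {ratio(noise):.3f}")       # barely shrinks at all
```

A compressor that can exploit the repetition in the first input effectively "understands" its pattern; on the noise there is no pattern to understand, so no compressor can help.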
The fact that huge natural language models can compress varied data types so efficiently has major implications, which this post explores below.
The DeepMind researchers tested the compression capabilities of different-sized language models on three different 1GB datasets:
Text - The first 1 billion bytes of Wikipedia.
Images - 1 million 32x64px patches extracted from ImageNet.
Audio - Speech samples from the LibriSpeech dataset.
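To treat all three modalities uniformly, the data has to be fed to the model the same way: as raw byte sequences split into fixed-size chunks. A minimal sketch of that preprocessing, using 2048 bytes as an illustrative chunk size (the exact size here is an assumption for the example):

```python
def to_chunks(data: bytes, chunk_size: int = 2048) -> list:
    """Split any byte stream (text, image, or audio) into fixed-size
    chunks that a byte-level language model can consume directly."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Any modality reduces to the same representation: a sequence of bytes.
sample = bytes(range(256)) * 20      # 5120 stand-in bytes
chunks = to_chunks(sample)
print(len(chunks), len(chunks[0]))   # 3 chunks; the last holds the remainder
```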
They compared the models against standard compression algorithms like PNG, JPEG, and FLAC, which are specialized for images, audio, etc.
The language models compress data using arithmetic coding - a technique that turns any predictive model into a compressor. The more accurately a model predicts the next byte in a file, the fewer bits the coder needs, and the better the compression.
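This prediction-compression equivalence comes from information theory: an ideal arithmetic coder spends about -log2 p bits on a symbol the model assigns probability p. The toy sketch below (the periodic "model" and the probability floor are illustrative inventions, not the paper's setup) shows that a better predictor yields a shorter code:

```python
import math

def code_length_bits(data: bytes, predict) -> float:
    """Ideal arithmetic-coding cost: -log2 p(byte | context), summed.
    `predict(context)` returns a dict mapping each possible next byte
    to its probability; unlisted bytes get a tiny floor probability."""
    total = 0.0
    for i, byte in enumerate(data):
        probs = predict(data[:i])
        total += -math.log2(probs.get(byte, 1e-6))
    return total

# A uniform "model" that has learned nothing: exactly 8 bits per byte.
uniform = lambda ctx: {b: 1 / 256 for b in range(256)}

# A crude model that has noticed the data repeats with period 4.
def periodic(ctx):
    if len(ctx) >= 4:
        likely = ctx[-4]  # predict the byte seen 4 positions back
        return {likely: 0.9, **{b: 0.1 / 255 for b in range(256) if b != likely}}
    return uniform(ctx)

data = b"abcd" * 64
print(code_length_bits(data, uniform))   # 2048 bits: no compression at all
print(code_length_bits(data, periodic))  # far fewer bits
```

A real arithmetic coder emits an actual bitstream, but its length tracks this ideal cost closely, which is why lower prediction loss translates directly into better compression.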
They tested language models of several types and sizes, ranging from small Transformers trained from scratch up to large pretrained models.
The experiments yielded a striking result: the largest pretrained model, trained overwhelmingly on text, nevertheless compressed the image and audio datasets to smaller sizes than PNG and FLAC achieved.
These findings have significant implications:
They demonstrate language models have learned very general capabilities beyond just text. Their versatility likely stems from pretraining on vast datasets.
The models' strong compression across modalities reflects an understanding of images, audio, and more at a deep statistical level.
There are inherent tradeoffs between model scale, dataset size, and compression performance: larger datasets justify larger models, but a model too big for its dataset hurts the overall compression rate once the model's own size is counted.
The results provide a new perspective on model scaling laws: unlike log loss, the compression rate can account for model size, which shows that scaling up eventually stops paying off on a fixed dataset.
The equivalence between prediction and compression means these models could have practical applications for compressing images, video, and more. However, model size may be prohibitive compared to current methods.
The compression viewpoint offers new insights into model generalization, failure modes, tokenization, and other aspects of deep learning.
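The scale tradeoff above can be made concrete with an adjusted compression rate that charges the model's own size against the savings. The numbers below are hypothetical, chosen only to illustrate the effect:

```python
def adjusted_rate(compressed_bytes: float, model_bytes: float,
                  raw_bytes: float) -> float:
    """Compression rate that also charges for shipping the model
    itself (lower is better)."""
    return (compressed_bytes + model_bytes) / raw_bytes

raw = 1_000_000_000  # a 1 GB dataset

# Hypothetical figures: the bigger model predicts better (smaller
# compressed output), but its parameters dwarf the extra savings.
small = adjusted_rate(400e6, 10e6, raw)    # 0.41 - a real saving
large = adjusted_rate(300e6, 140e9, raw)   # 140.3 - worse than no compression
print(small, large)
```

On a fixed-size dataset, the big model's raw compression advantage is wiped out the moment its parameters are counted, which is exactly why scaling hits limits under this metric.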
In summary, this research shows large language models have become adept general-purpose learners. Their exceptional compression capabilities demonstrate an expansive understanding of patterns in textual, visual, and audio data. There is still progress to be made, but these models show increasing competence as general systems for automating prediction and compression across modalities.