
Who Could Have Guessed LLMs are Great at Compressing Images and Audio: Reports From New Research

by Mike Young, September 21st, 2023

Too Long; Didn't Read

Google DeepMind research shows that large language models such as Chinchilla-70B are excellent general-purpose compressors. They can compress text, images, and audio to sizes that match or beat standard compressors like gzip and PNG. Compressing data means we can store and transmit it using less memory, disk space, and bandwidth.


A new paper from researchers at Google DeepMind demonstrates that large language models such as Chinchilla-70B are not just adept at generating human-like text - they are also excellent general-purpose compressors. They can compress many types of data, including text, images, and audio, to sizes that match or beat standard compressors like gzip and PNG.

Why Should We Care About Compression?

Data compression is a fundamental capability in computing and AI. Compressing data means we can store and transmit it using less memory, disk space, and bandwidth. This saves costs and allows systems to scale.

But more importantly, good compression also indicates a deep understanding of the structure and patterns in data. To compress well, an algorithm needs to spot redundancies and exploit statistical regularities. So, compression capability acts as a benchmark for how much knowledge an AI system has learned.

The fact that huge natural language models can compress varied data types so efficiently has major implications:

It demonstrates they have learned general abilities beyond just processing language.
Their skill at compression reflects an understanding of images, audio, video, and more.
There is potential to apply them to practical compression tasks.

How Was the Research Conducted?

The DeepMind researchers tested the compression capabilities of different-sized language models on three different 1GB datasets:

Text - The first 1 billion bytes of Wikipedia.
Images - 1 million 32x64px patches extracted from ImageNet.
Audio - Speech samples from the LibriSpeech dataset.

They compared the models against standard domain-specific compressors such as PNG for images and FLAC for audio.

The language models are turned into compressors using arithmetic coding, a technique that converts a predictive model into a lossless compressor. The more accurately a model can predict the next byte in a file, the fewer bits are needed to encode it.
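The link between prediction and compression can be made concrete with a small sketch. It does not use a language model or a real arithmetic coder. Instead it assumes a simple adaptive byte-counting predictor (the function name, the smoothing scheme, and the sample data are all illustrative choices, not taken from the paper) and adds up -log2 of the probability assigned to each byte. An arithmetic coder driven by the same predictions would produce an output within a few bits of that total.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(data: bytes, order: int) -> float:
    """Total of -log2 p(byte | previous `order` bytes) under an adaptive count model.

    This is the size (in bits) that an arithmetic coder using the same
    predictions would approach. The predictor is a toy stand-in for a
    language model, chosen only for illustration.
    """
    # Every context starts with a count of 1 for each of the 256 byte values
    # (Laplace smoothing), so no byte ever gets probability zero.
    counts = defaultdict(lambda: [1] * 256)
    totals = defaultdict(lambda: 256)
    bits = 0.0
    for i, byte in enumerate(data):
        context = data[max(0, i - order):i]
        probability = counts[context][byte] / totals[context]
        bits += -math.log2(probability)
        # Update the model only after coding the byte, so a decoder
        # could reproduce exactly the same predictions.
        counts[context][byte] += 1
        totals[context] += 1
    return bits


sample = b"abracadabra " * 200
order0_bits = ideal_code_length_bits(sample, order=0)  # ignores context
order2_bits = ideal_code_length_bits(sample, order=2)  # sees the 2 previous bytes

print(f"raw size:      {8 * len(sample)} bits")
print(f"order-0 model: {order0_bits:.0f} bits")
print(f"order-2 model: {order2_bits:.0f} bits")
```

The order-2 predictor sees more context, predicts the next byte more confidently, and so needs fewer bits. A large language model plays the same role as the predictor here, only with far better predictions.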

They tested three main types of compressors:

Smaller Transformer models trained from scratch on Wikipedia text.
Larger foundation models like Chinchilla-70B, pre-trained on huge text datasets.
As a baseline, general-purpose compressors like gzip and LZMA.
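The classical baselines are easy to try with Python's standard library. The sketch below measures a compression rate, taken here to mean compressed size divided by raw size (lower is better). The helper name and the repetitive sample text are illustrative assumptions, and the numbers it prints say nothing about the datasets used in the research.

```python
import gzip
import lzma


def compression_rate(raw: bytes, compress) -> float:
    """Compressed size as a fraction of raw size (lower is better)."""
    return len(compress(raw)) / len(raw)


# Hypothetical sample: highly repetitive text, which classical compressors handle well.
text_sample = ("The quick brown fox jumps over the lazy dog. " * 400).encode("utf-8")

rates = {
    "gzip": compression_rate(text_sample, gzip.compress),
    "lzma": compression_rate(text_sample, lzma.compress),
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} of raw size")

# Both compressors are lossless: decompressing returns the original bytes.
assert gzip.decompress(gzip.compress(text_sample)) == text_sample
assert lzma.decompress(lzma.compress(text_sample)) == text_sample
```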

Key Technical Findings

The experiments yielded several insightful results:

Despite being trained primarily on text, the foundation models compressed all modalities better than methods specialized for each domain. For example, Chinchilla-70B compressed ImageNet patches to 43.4% of their raw size, compared with 58.5% for PNG.
Scaling laws held, with a limit: bigger models compressed better, but only up to a point. Once the size of the model itself is counted as part of the compressed output, models beyond a certain size took up more space than they saved.
The best model size depended on the size of the dataset. More data justifies a bigger model, but a model that is too large for its dataset compresses worse overall once its own size is counted.
Tokenization schemes such as BPE, while useful for language tasks, generally decreased compression performance slightly. A larger token vocabulary makes each prediction harder, which outweighs the benefit of fitting more information into the context.
Longer contexts improved compression, as models could exploit more sequential dependencies.
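The point about model size can be worked through with simple arithmetic. The sketch below assumes that the model's parameters are counted as part of the compressed output and that each parameter takes 2 bytes (a half-precision assumption). Every figure in it is hypothetical and chosen only to show the effect, so none of them should be read as a result from the research.

```python
def raw_rate(compressed_bytes: int, raw_bytes: int) -> float:
    """Compressed size divided by raw size, ignoring the model."""
    return compressed_bytes / raw_bytes


def adjusted_rate(compressed_bytes: int, raw_bytes: int,
                  num_params: int, bytes_per_param: int = 2) -> float:
    """Same ratio, but with the model's own size added to the output.

    bytes_per_param=2 is an assumption (half-precision weights).
    """
    model_bytes = num_params * bytes_per_param
    return (compressed_bytes + model_bytes) / raw_bytes


DATASET_BYTES = 1_000_000_000  # a 1GB dataset

# Hypothetical small model: 3 million parameters, compresses to 30% of raw size.
small = adjusted_rate(300_000_000, DATASET_BYTES, num_params=3_000_000)

# Hypothetical 70-billion-parameter model: compresses to 10% of raw size.
large = adjusted_rate(100_000_000, DATASET_BYTES, num_params=70_000_000_000)

print(f"small model, adjusted: {small:.1%}")
print(f"large model, adjusted: {large:.1%}")
```

Under these assumptions the large model wins on the raw rate but loses badly on the adjusted rate, because its weights alone are far larger than the 1GB dataset it is compressing.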

Key Implications

These findings have significant implications:

They demonstrate language models have learned very general capabilities beyond just text. Their versatility likely stems from pretraining on vast datasets.
The models' strong compression across modalities reflects an understanding of images, audio, and more at a deep statistical level.
There are inherent tradeoffs between model scale, dataset size, and compression performance. Bigger datasets justify bigger models, but the model size must be matched to the dataset size.
The results provide a new perspective on model scaling laws: unlike log loss, compression accounts for the size of the model itself, so scaling eventually hits a limit.
The equivalence between prediction and compression means these models could have practical applications for compressing images, video, and more. However, model size may be prohibitive compared to current methods.
The compression viewpoint offers new insights into model generalization, failure modes, tokenization, and other aspects of deep learning.

In summary, this research shows large language models have become adept general-purpose learners. Their exceptional compression capabilities demonstrate an expansive understanding of patterns in textual, visual, and audio data. There is still progress to be made, but these models show increasing competence as general systems for automating prediction and compression across modalities.
