Last week you may have heard about a preprint, shared online, that used a seemingly odd method for text classification. A team of researchers took a series of benchmarks normally used to evaluate the language proficiency of language models and ran them against a model of their own. It turned out that their model was on par with many large models and even surpassed them on a few tasks.
The twist lies in the model they used: instead of running some large neural network, they ran the benchmarks through gzip.
Yes, you heard right: If you run text through ZIP compression, you end up with a pretty good model of language.
How can that be?
That is the spicy part: it is not just possible that gzip is a good language model, it is expected. To see why, one needs to know how compression actually works. Compression algorithms try to find redundancy in a file and, by removing it, reduce the file size. When no information is lost in the process, the compression is called “lossless”.[^1]
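To get a feel for what “finding redundancy” means in practice, here is a minimal Python sketch using the standard-library `gzip` module. The sample sentence and the repeat count are my own illustrative choices, not anything taken from the preprint.

```python
import gzip
import os

# Highly redundant input: the same short sentence repeated many times.
redundant = b"the quick brown fox jumps over the lazy dog. " * 200

# Input of the same length with no exploitable redundancy: random bytes.
random_bytes = os.urandom(len(redundant))

redundant_size = len(gzip.compress(redundant))
random_size = len(gzip.compress(random_bytes))

print("original size:        ", len(redundant))
print("redundant, compressed:", redundant_size)
print("random, compressed:   ", random_size)

# Lossless: decompressing returns exactly the original bytes.
restored = gzip.decompress(gzip.compress(redundant))
```

The repeated sentence should shrink to a small fraction of its size, while the random bytes should barely shrink at all, because there is no pattern to remove.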
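A very common compression technique is Huffman coding, which gzip uses as one of its building blocks (alongside LZ77-style matching of repeated strings). Huffman coding draws on insights from information theory: it gives frequent symbols short codes and rare symbols long ones, so that the encoded file is as small as possible.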
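As a rough illustration of the idea, here is a small Huffman coder in plain Python. It is a textbook-style sketch, not the code gzip actually runs, and the function name `huffman_codes` and the sample sentence are made up for this example.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol in `text` to its bit string."""
    freq = Counter(text)
    if len(freq) == 1:
        # Edge case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The integer tie-breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)
print(encoded_bits, "bits instead of", 8 * len(text))
```

Frequent characters such as the space end up with shorter bit strings than rare ones such as “x”, which is exactly where the saving comes from.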
The connection between compression and language models lies precisely in these theoretical insights from information theory. More specifically, the key term here is encoding.
Compression encodes text under two conditions: first, the original text must be entirely restorable from the compressed version (“lossless”), and second, the compressed version should be as small as possible.
Language models also encode text, but under different conditions: first, the original text does not need to be restored (you cannot recreate a training corpus by reverse engineering the model’s weights), and second, the model must capture meaningful linguistic dimensions.
So now you may be wondering: if “meaningful linguistic dimensions” are of no concern to gzip, how come it still apparently captures them, passing those language benchmarks with flying colors? That must be a coincidence, right?
Well, not quite. Language models and compression algorithms share a common ancestor: information theory. You may have already heard of Claude Shannon’s seminal 1948 paper “A Mathematical Theory of Communication”. In this paper, Shannon defines some key terms for quantifying information, and he was heavily influenced by the question of how information can be transmitted as text. And it turns out that Huffman coding as a compression algorithm is also heavily influenced by the question of how text can be stored efficiently. This is also one reason why text files can normally be compressed much better than, e.g., images.
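One of the quantities Shannon defines is entropy, the average number of bits needed per symbol: H = -Σ p(x) log2 p(x). The sketch below estimates it from character frequencies alone. That is a strong simplification on my part (it ignores all context between characters), and the function name and sample strings are invented for illustration.

```python
import math
from collections import Counter


def entropy_bits_per_char(text):
    """Shannon entropy of the character distribution in `text`, in bits."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = "abcd" * 100  # four symbols, all equally likely
skewed = "aaab" * 100   # two symbols, one much more likely than the other

h_uniform = entropy_bits_per_char(uniform)
h_skewed = entropy_bits_per_char(skewed)
print(h_uniform, h_skewed)
```

The more predictable the text, the lower its entropy, and the fewer bits a good encoder needs to store it.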
To make a long story short: the gzip compression algorithm implements a model of language, just as large language models do. And, as we now know from the preprint, this model is apparently very good, because it excels at many difficult language tasks.
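How could a compressor be used to classify text at all? This post does not spell out the preprint's exact procedure, so take the following as one plausible sketch rather than the authors' code. It combines a normalized compression distance with a nearest-neighbor lookup, a common way to build a classifier from a compressor. The helper names and the tiny example dataset are hypothetical.

```python
import gzip


def clen(text):
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance: small when a and b share patterns."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(text, labeled_examples):
    """Return the label of the closest example (1-nearest neighbor)."""
    return min(labeled_examples, key=lambda ex: ncd(text, ex[0]))[1]


# Hypothetical toy training set, far too small for real use.
examples = [
    ("the team scored a late goal and won the football match", "sports"),
    ("the striker missed a penalty in the second half of the game", "sports"),
    ("the central bank raised interest rates to fight inflation", "finance"),
    ("stock markets fell as investors worried about interest rates", "finance"),
]

query = "the goalkeeper saved a penalty and the team won the match"
print(classify(query, examples))
```

If two texts share vocabulary and phrasing, compressing them together costs little more than compressing one of them alone, so their distance is small. With texts this short the result is noisy, so treat the printed label as a demonstration only.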
Okay, but cui bono? What do we gain from this little experiment, you may ask now. gzip text encoding will not replace any of our large neural language models. It does not have trainable parameters, and thus it is difficult to adapt to different domains or tasks. Basically, this encoding always encodes text in one specific way, which works well for the tasks the researchers tested their model on, but may not work well for others. One benefit of neural networks is that they can encode information in whatever way makes the most sense statistically, so they may encode it differently from Huffman coding.
But there is a lesson to be learned, one that, in my opinion, has been lost along the way. With the success of transformer models since 2017, it turned out that you can make models better simply by increasing their size, and that is the one thing AI companies have been doing ever since.
The problem is that transformer models assume a very specific model of language. It’s even in the paper title: “Attention Is All You Need”. Transformer models use a somewhat complicated query-key-value mechanism to encode text, which, as we now know, works well for many tasks. But it is just one model of language, and likely not the most efficient one.
In fact, there already exist so-called “distilled” versions of some language models. Distilling a model means training a much smaller model to reproduce the behavior of a large one while losing as little performance as possible. That this works goes to show that increasing model size makes models better, but not necessarily smarter.
And this is the main insight from the 2021 Stochastic Parrots paper: transformers work well, yes, and by increasing the model size, you can make them better, yes. But AI companies have become complacent: as long as the simple formula “make it bigger” works, they have no incentive to actually develop a new, better model of language. Such a better model of language, however, would enable us to develop ChatGPT-like models that do not require entire server farms to operate.
In my opinion, the gzip experiment’s biggest service to the community is to emphasize once more the importance of our assumptions about how language should be modeled. Ultimately, it boils down to “work smarter, not harder”.
OpenAI, Microsoft, Google, Facebook, and all the others would do well to heed this advice.
[^1]: For example, instead of writing “the” over and over in a text, you can note the word down once and then record only the offsets in the file where it appears. This becomes more impressive if you think of a file that contains nothing but the word “test” repeated five billion times. The file will be about 20 GB in size, but you can compress it to maybe 15 bytes. All you need to store is the word “test” followed by the number of repetitions.
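The arithmetic in this footnote can be checked with a toy run-length sketch. This is not how gzip stores data internally. The functions `rle_encode` and `rle_decode` are hypothetical helpers, and the decoder assumes the word length is known in advance.

```python
def rle_encode(word, repetitions):
    """Store a repeated word as the word plus a decimal repeat count."""
    return f"{word}{repetitions}".encode("ascii")


def rle_decode(blob, word_length):
    """Rebuild the original text; `word_length` is assumed to be known."""
    text = blob.decode("ascii")
    word, count = text[:word_length], int(text[word_length:])
    return word * count


encoded = rle_encode("test", 5_000_000_000)
original_size = len("test") * 5_000_000_000  # 4 bytes per repetition
print(len(encoded), "bytes instead of", original_size)

# Round trip on a small count, so the check does not allocate 20 GB.
small_round_trip = rle_decode(rle_encode("test", 1000), 4)
```

Four bytes for the word plus ten digits for the count gives 14 bytes, in line with the “maybe 15 bytes” estimate.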