Is gzip a language model?

A while back I wrote about language modeling without neural networks, where I generated Shakespeare with an unbounded n-gram model: no weights, no training, just counting. Fortuitously, I came across the paper Language Modeling is Compression, which mentioned the compression–prediction equivalence:

every prediction model is inherently a compressor, and all compression algorithms are prediction models.

This led to the natural question: can gzip do language modeling?1 No neural network, no learned parameters, nothing. Just the compressor that ships with your operating system. You prime it with a corpus, and it continues a prompt by searching for the byte sequences that compress best. Here’s some real, unedited output after priming it on tiny Shakespeare:

gzipt --corpus data/tinyshakespeare.txt --prompt $'MENENIUS:\n' --length 200
MENENIUS:
'Though all at once canq

MARCIUS:
Pray now, nocamest thou to a morsel .

LARTIUS:
Hence, and
I' the end admire, where G
again; and after it ag .

It turns out, kind of? It’s not exactly coherent text, but it clearly knows something about the text. Much more than I expected gzip to know.2 So how can a compressor generate this?

Compression is prediction#

Think about what a compressor does. It spends few bytes on data it “expects” and many bytes on data it doesn’t. If I hand you a file that’s the letter A repeated a million times, you can describe it in one sentence. A million random bytes, on the other hand, have no structure to exploit and barely compress at all.

This is not a coincidence; it’s the core of information theory. The number of bits needed to encode a symbol is $-\log_2 p$, where $p$ is the probability the model assigns to it. High probability means few bits. So any compressor has a probability model hiding inside it, whether or not anyone wrote one down.

gzip uses DEFLATE, which compresses the next bytes by finding matches against the recent text in a 32 KiB sliding window. If a continuation echoes something already in the window, DEFLATE encodes it as a cheap back-reference instead of literal bytes. So:

A continuation that gzip “expected”, because it echoes text already in its window, compresses to almost nothing.

That gives us a score. If I have some context and I want to know how good a candidate continuation is, I just measure:

$$\text{score}(\text{candidate}) = \texttt{len(gzip(context + candidate))}$$

The smaller the compressed length, the more “predicted” the candidate is. To prime the model, I include a corpus in gzip’s window. Any continuation that looks like the corpus compresses small, and any continuation that doesn’t compresses large.

Scoring is one thing; generating is another. The naive approach of picking the single next byte that compresses best fails badly, and for a subtle reason: gzip only gives an integer byte length (no fractions). Adding one byte often doesn’t change the compressed length at all, so many candidates tie and the signal is buried in quantization noise.

The fix is to look ahead a whole span before committing. gzipt runs a beam search over byte sequences:

  1. Context. Show gzip the corpus window plus the recent tail of what’s been generated so far.
  2. Search. Keep the beam_width most-compressible partial continuations. Extend each by every byte that occurs in the corpus, score all of them by compressed length, and prune back down to the best beam_width. Repeat for horizon bytes.
  3. Commit. Take the most-compressible full span (or, if a tie, sample among the finalists), append it, and re-plan from there.

One detail that matters is that only the last tail bytes of generated output stay in the scoring context. DEFLATE codes nearby matches more cheaply than far ones, so if gzip could see its entire history, the cheapest thing to do is often to fall into verbatim loops, repeatedly copying text it just emitted.



You can see the decoding and scoring process in the animation above, which is the same replay shown at the top. The whole thing is one file of pure standard-library Python (just zlib). Code’s on GitHub if you want to play with it.


  1. The paper did try this, but it ended up performing really poorly. Adding beam search significantly improved generation quality, which is discussed below. ↩︎

  2. The code actually uses zlib instead of spawning a gzip process, but the name GziPT was too good. I believe they both use the same DEFLATE algorithm under the hood. ↩︎