Skip to content

Does the perfect compressor exist?

Facing the problem of sending quite large files by email, larger than most mail systems allow us to send, it's common practice to "compress" those files to make them "smaller" and try sending them again. Sometimes however they're too big to send even in compressed form, because they don't get reduced enough. Once someone suggested me, "Ok, why don't you try to compress them again?" Interesting suggestion...

What does compression mean?

After all, what does it mean to compress a digital file? To most people the word compress means something mechanical, and tend to imagine that a file can be "compressed" in same way one may "compress" a coat to get it into the wardrobe for good. Obviously a file is not a "plastic" thing (not even solid) one may squeeze, despite of the fact that one of the most popular archiving utility everyone use for that task has a small clamp in its icon. So... what happens to a file when we compress it? In a few words, we're trying to save the same amount of information using less bits that the original file, by means of a reversible method that let us recover the original file at any later time. The reason that makes such thing possible is entirely another topic; let's deal meantime with the problem at hand.

Anyone may suspect that compression cannot be indefinitely repeated achieving ever decreasing file sizes; if it was the case, any file could be reduced to a few bits. It would be great to have every image, web page, video and music in the internet stored in a pen drive... We know there are several compression methods (algorithms) available; some of them are general ones suitable for every kind of file, such as ZIP, RAR, etc., while there are others intended for specific type of content (MPEG for video, FLAC and MP3 for audio). Almost all of them have a user-selected compression ratio, so we can try them out and choose the one that fits best. Anyway, with our file already compressed, what happens if we compress it again?

Second is never good enough...

Surprisingly, second compression not only is unable to reduce file size any further, but in fact makes it a little larger. The reason: a well designed compression method should be able to find the shortest bit sequence from which the original file may be reconstructed; once that sequence is found, it should be "packed" inside the compressed file, along with other relevant data, e.g. original filename, creation or modification date, in short, its metadata. So we have

Compressed file = Original file's minimum bit sequence + metadata

And a second compression would yield

Second compressed file = (original file's minimum bit sequence + metadata) + metadata

It should be clear that this second compressed file will be larger than the first one. The file size increment is quite small, because metadata are usually a tiny amount of data compared to the original file. In the following test I took a Photoshop .psd file (they can usually be quite well compressed) y compressed it several times. The following table show the results:

FileSizeCompression relative to previous
Original22.404.024 bytes-
First compression14.925.081bytes-33%
Second compression14.928.614 bytes+0,02%
Third compression14.933.345 bytes+0,03%

Although there is a small difference (about 3 kB between first and second compression), re-compressed files indeed becomes larger. Metadata is in part to blame for this increment, but if we keep them from being added to every compressed file we would achieve at most a constant file size, without any effective compression.

Is it posible to build a perfect compressor?

It is remarkable that, no matter what compression method we choose, there will be always a file that method will be unable to compress (I already found one in the previous example: it is enough to take the first compressed file as our "original" file; if we remove its extension, so nobody suspects it is already compressed, and we pass it to another people, asking them to compress it, they'll find the same as I did, compressed file is slightly bigger). This fact suggests that it is not possible to build a compression utility capable of compress any file. This strong statement can in fact be proved: please follow me.

As stated, to compress a file means to find another representation of it using less bits. Consider a file with a filesize of n bits; to compress it means to create another file which may be reverted to the original file, having strictly less than n bits. This way, any compression method is just a relationship between every n-bit file (let those be the set A of original files) and any less-than-n-bits file (which we may call set B of all possible compressed files obtained from A). Our requirement for our method to be reversible implies every single element from A must have  one and only one related element in B, so we can always find which file from A created a specific file in B.

A compression method seen as a "mapping" between "original" files (A) and "compressed" files (B). We need B to have at least as many elements as in A.

So far, so good, but is it possible to do this? To find out we need set B to have at least as many elements as set A, so let's count them up. To make things simpler, let's consider only fixed-size files. An n-bit file can be created in 2n different ways. In order to see why, consider a 1-bit file; there are only two possible "files", containing either "0" or "1"; with two bits, we have two possible second bits for each possible first bit, so we have 2*2 posibilities ("00", "01", "10", "11") and so on; an n-bit file has then 2*2*2...*2 (n times) posibilities, and this is the number of possible files with that size.

On the other hand, how many files are there having less than n bits? Easy task for a mathematician, it's just a sum over a geometric series. I will try not to scare non-math readers, so let's make a simple calculation taken into account our former analysis. Let's compute how many files there are having exactly 1, 2, 3... bits and sum them up. Next table shows the general procedure:

Original file size n in bits Number of files of length n bits
(Set A)
Number of files of length strictly less than n bits
(Set B)
"Shortage" of B compared to A
121 = 2None2 - 0 = 2
222 = 424 - 2 = 2
323 = 82 + 4 = 68 - 6 = 2
424 = 162 + 4 + 8 = 1416 - 14 = 2
525 = 322 + 4 + 8 + 16 = 3032 - 30 = 2
626 = 642 + 4 + 8 + 16 + 32 = 6264 - 62 = 2
727 = 1282 + 4 + 8 + 16 + 32 + 64 = 126128 - 126 = 2

What does all of this mean? Since set B is 2 elements "shorter" than A, given all possible files of fixed size in bits, at least two of them cannot be compressed; if that was the case, there will be at least two distinct original files ending up in the same compressed file and there'll be no way of reverting that process, because we`ll be unable to tell original files apart.

This analysis shows that, in this scenario, a "perfect" file compressor capable of compress any sort of file (in a strict sense) cannot exist, though if we would just have two "problematic" files they're probably not worth all this fuzz. Anyway, anyone may address this issue as non realistic, because files used in ordinary life are not fixed in size. A deeper analysis taking this fact into account shows that no more than a third of all possible files can be compressed. In other words: two out of three random generated files cannot be compressed.

Let A(n) be the number of files of size n, B(n) the number of files of size strictly less than n (and so potential candidates to compressed versions of the former files) and C(n) the available number of files for resulting compression of a file of length n not yet used by the method to compress any shorter file, i.e. the effective available number of files after removing those "occupied" in the compression of any shorter file.

Now we can establish that the number of available compressed files C(n) for files of length n equals to all files of shorter length B(n) minus those files already used in compressing any shorter file C(n - 1), giving the recurrence expression

C(n) = B(n) - C(n - 1)

In order to solve it, we take into account that B(n) = 2n - 2 (even by solving the sum of the resulting geometric series or by inspection of previous table) and that C(1) = 0 (it is not possible to compress 1 bit "files"). Solving successively for C(n) yields

C(2) = B(2) - C(1)

C(3) = B(3) - C(2) = B(3) - [B(2) - C(1)] = B(3) - B(2) + C(1)

C(4) = B(4) - C(3) = B(4) - [B(3) - B(2) + C(1)] = B(4) - B(3) + B(2) - C(1)

C(5) = B(5) - C(4) = B(5) - [B(4) - B(3) + B(2) - C(1)] = B(5) - B(4) + B(3) - B(2) + C(1)

and we now can synthesize this into the following expression, noting that C(1) = 0 and completing B series adding B(1), which is zero:

C(n) = (-1)n ∑ (-1)k B(k)

where summation index k extends from 1 to n.

Replacing B(k), according to its definition, gives

C(n) = (-1)n ∑ (-1)k (2k - 2) = (-1)n [ ∑ (-2)k  - 2∑(-1)k ]

Recalling that geometric sum ∑ qk from 1 to n equals to q.(1 - qn)/(1 - q),

C(n) =
= (-1)n .{ (-2).[1 - (-2)n]/[1 - (-2)] + 2.[1 - (-1)n]/[1 - (-1)] } =
= (-1)n. { (-2/3).[1 - (-2)n] + [1 - (-1)n]/[1 - (-1)] }

To solve this expression we must split it into even n and odd n. For the former case, making n = 2k, (-1)n = 1

C(2k) = (-2/3).(1 - 22k) = (2/3).(22k - 1)

For the latter (odd) case, making n = 2k - 1, (-1)n = -1

C(2k - 1) = (-1).[ (-2/3).(1 + 22k - 1) + 2 ] = (2/3).(22k - 1 + 1) - 2

We can clearly realize that using moderate to large values of n, both cases tend to

C(n) ≈ (2/3).2n

On the other hand, the number of files A'(n) no larger than n is given by

A'(n) = A(n) + B(n) = 2n + 2n - 2 = 2.2n - 2 = 2n + 1 - 2

and again, using moderate to large values of n, it tends to

A'(n) ≈ 2n + 1

Ratio C(n)/A'(n) of compressible files is

C(n)/A'(n) ≈ [ (2/3).2]/[ 2n + 1 ] ≈ 1/3

Another way of expressing this result is: probability of a random created file to be able to be compressed doesn't exceed one third.

This seems to deny what usually happens: our files can more or less be compressed but we never faced a case where compression cannot be achieved, even in a small amount. This can be explained from the fact that common, real files (text, images, video, audio, programs) are not "randomly" built; they contain some form of structured (and in many cases redundant) information, for sake of making their process easier, which makes compression possible.

Published inGeneralMath craziness

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *