
Foundations of Coding: Compression, Encryption, Error Correction (2015)

Chapter 2. Information Theory and Compression

2.4 Common Compression Codes

Like the compression tool bzip2, current implementations combine several compression algorithms. This enables one to take advantage of each of them and to minimize the number of cases of bad compression. Among the most widely used techniques, the dictionary-based techniques of Lempel and Ziv enable one to encode full blocks of characters.

2.4.1 Lempel–Ziv's Algorithm and gzip Variants

The algorithm invented by Lempel and Ziv is a dictionary-based compression algorithm (also called compression by factor substitution): it consists of replacing a sequence of characters (called a factor) with a shorter code, namely the index of this factor in some dictionary. The idea is to perform entropy reduction not on single characters but on full words. For example, in the sentence “a good guy is a good guy,” one can dynamically associate numbers with the words already met; the code then becomes “a,good,guy,is,1,2,3.” Notice that, like the dynamic algorithms, dictionary-based methods require only a single pass over the file. There exist several variants of this algorithm; among them, Lempel-Ziv 1977 (LZ77) and Lempel-Ziv 1978 (LZ78) are freely available.
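To make the idea concrete, here is a minimal Python sketch of this word-level substitution (a toy illustration only: the actual Lempel-Ziv algorithms below work on character factors, and the function name is ours):

# Toy word-level dictionary coder: each new word gets a number,
# and later occurrences are replaced by that number.
def toy_dict_encode(text):
    dictionary = {}                      # word -> index (1-based)
    out = []
    for word in text.split():
        if word in dictionary:
            out.append(str(dictionary[word]))
        else:
            dictionary[word] = len(dictionary) + 1
            out.append(word)
    return ",".join(out)

print(toy_dict_encode("a good guy is a good guy"))  # a,good,guy,is,1,2,3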

2.4.1.1 Lempel-Ziv 1977 (LZ77)

This algorithm (published by Lempel and Ziv in 1977) is the first dictionary-based algorithm. Being an efficient alternative to the Huffman algorithms, it relaunched research in compression. LZ77 is based on a window sliding along the text from left to right. This window is divided into two parts: the first part stands for the dictionary, and the second part (the reading buffer) meets the text first. At the beginning, the window is placed so that the reading buffer is located on the text and the dictionary is not. At each step, the algorithm looks in the dictionary for the longest factor that matches the beginning of the reading buffer; this factor is encoded with the triplet (i, j, c) where

· i is the distance between the beginning of the buffer and the position of the repetition in the dictionary;

· j is the length of the repetition;

· c is the first character of the buffer that differs from the corresponding character in the dictionary.

After the encoding of the repetition, the window slides by j + 1 characters to the right. When no repetition is found in the dictionary, the character c that caused the difference is encoded with (0, 0, c).
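The following Python sketch illustrates this triplet encoding; it is a simplified illustration (the window size w, the function name, and the in-memory list of triplets are our choices; real implementations also bound the buffer and encode the triplets in binary):

def lz77_encode(text, w=8):
    # Encode text as (i, j, c) triplets:
    #   i: distance back to the repetition in the dictionary window,
    #   j: length of the repetition (it may overlap into the buffer),
    #   c: first mismatching character of the buffer.
    out, pos = [], 0
    while pos < len(text):
        best_i, best_j = 0, 0
        start = max(0, pos - w)                   # left edge of the dictionary
        for i in range(1, pos - start + 1):       # candidate distances
            j = 0
            while (pos + j < len(text) - 1
                   and text[pos - i + (j % i)] == text[pos + j]):
                j += 1                            # j % i models overlapping copies
            if j > best_j:
                best_i, best_j = i, j
        c = text[pos + best_j]
        out.append((best_i, best_j, c))           # (0, 0, c) if no repetition
        pos += best_j + 1                         # the window slides j+1 characters
    return out

For instance, lz77_encode("ababababab") returns [(0, 0, 'a'), (0, 0, 'b'), (2, 7, 'b')]: two literal triplets followed by one repetition that overlaps the buffer.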

Exercise 2.13 (Encoding a Repetition Using LZ77)

1. Encode the sequence “abcdefabcdefabcdefabcdef” using LZ77.

2. Encode this sequence using LZ77 and a search window whose length is 3 bits.

Solution on page 300.

One of the best known implementations of LZ77 is the Lempel–Ziv–Markov-chain Algorithm (LZMA) library, which combines—as expected—an LZ77 compression using a variable-size (adaptive) dictionary followed by an integer arithmetic encoding.

2.4.1.2 LZ78

This is an improvement of LZ77 that consists of replacing the sliding window with a pointer following the text and an independent dictionary in which one looks for the factors read by the pointer. Suppose that, during the encoding, one reads the string sc, where s is a string at index n in the dictionary and c is a character such that the string sc is not in the dictionary. Then one writes the couple (n, c) on the output. The character c appended to factor number n gives a new factor that is added to the dictionary, to be used as a reference later in the text. Only two pieces of information are encoded, instead of three with LZ77.
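Here is a minimal Python sketch of this couple encoding (the function name and the handling of a trailing factor are our choices):

def lz78_encode(text):
    # Encode text as (n, c) couples: n is the dictionary index of the
    # longest known prefix s (0 for the empty string), c is the next
    # character; the new factor sc is then added to the dictionary.
    dictionary = {"": 0}
    out, s = [], ""
    for c in text:
        if s + c in dictionary:
            s += c                                # extend the current factor
        else:
            out.append((dictionary[s], c))
            dictionary[s + c] = len(dictionary)   # register the new factor
            s = ""
    if s:                                         # flush a trailing known factor
        out.append((dictionary[s[:-1]], s[-1]))
    return out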

2.4.1.3 Lempel-Ziv-Welch (LZW)

This is another variant of Lempel–Ziv proposed by Terry Welch. LZW was patented by Unisys. It consists of encoding only the index n in the dictionary; moreover, the file is read bit by bit. It may require an initial dictionary (e.g., an ASCII table). Compression, described by Algorithm 2.7, is then immediate: one replaces each group of characters already known in the dictionary with its code, and one adds to the dictionary a new group composed of this group and the next character.
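The following Python sketch mirrors this compression step (a simplified rendering of the mechanism of Algorithm 2.7, not the book's pseudocode itself; the initial dictionary is assumed to be the 128-character ASCII table, and the binary, bit-by-bit output is left aside):

def lzw_encode(text):
    # LZW compression: emit only dictionary indices.
    dictionary = {chr(i): i for i in range(128)}  # initial ASCII dictionary
    out, s = [], ""
    for c in text:
        if s + c in dictionary:
            s += c                                # group still known: keep reading
        else:
            out.append(dictionary[s])             # emit the code of the known group
            dictionary[s + c] = len(dictionary)   # add "group + next character"
            s = c
    if s:
        out.append(dictionary[s])                 # flush the last group
    return out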

Decompression is a little trickier because one has to treat a special case, namely when the same string is reproduced twice consecutively. Algorithm 2.8 describes this mechanism: as for compression, one starts with the same initial dictionary, which is filled in progressively, but with a delay, because one has to consider the next character. This delay becomes a problem if the same string is reproduced consecutively: in this case, the new code is not yet known during decompression. Nevertheless, because this is the only case in which the problem arises, one knows that the unknown group begins with the previous group, and one is able to deduce its value.
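A matching Python sketch of the decompression (again a simplified rendering of the mechanism of Algorithm 2.8, with our naming):

def lzw_decode(codes):
    # LZW decompression, with the special case where a code refers to
    # the very group that is being added at this step.
    dictionary = {i: chr(i) for i in range(128)}  # same initial dictionary
    out, prev = [], None
    for n in codes:
        if n in dictionary:
            s = dictionary[n]
        else:                         # special case: the code is not known yet,
            s = prev + prev[0]        # so it must be "previous group + its 1st char"
        out.append(s)
        if prev is not None:          # delayed dictionary update
            dictionary[len(dictionary)] = prev + s[0]
        prev = s
    return "".join(out)

One can check that lzw_decode(lzw_encode(t)) == t for any ASCII string t.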

The following exercise illustrates this principle.

Exercise 2.14 Using the given hexadecimal ASCII table (7 bits) as the initial dictionary, compress and uncompress the string “BLEBLBLBA” with LZW.

    0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~   DEL

Uncompress the result without considering the particular case described above. What do you notice?

Solution on page 300.

Several variants have been proposed: for example, Lempel-Ziv-Miller-Wegman (LZMW), in which “string + next word” is added to the dictionary instead of only “string + next character,” or Lempel-Ziv All Prefixes (LZAP), in which all prefixes of “string + next word” are added to the dictionary.

The Unix command compress implements LZW. The gzip tool is based on a variant of LZ77; it uses two dynamic Huffman trees: one for the strings and one for the distances between occurrences. The distances between two occurrences of a string are bounded: when the distance becomes too large, the compression restarts from the current character, independently of the compression already performed before this character.

This peculiarity implies that there is no structural difference between a single file compressed with gzip and a sequence of compressed files. Hence, let f1.gz and f2.gz be the results of compressing two source files f1 and f2, and let these two files be appended into a third one called f3.gz. Then, uncompressing f3.gz with gunzip yields a result file whose content is identical to f1 appended with f2.
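This property can be checked directly, for instance with Python's gzip module, which handles such multi-member files (a small demonstration; the byte strings are arbitrary):

import gzip

f1 = gzip.compress(b"contents of f1")
f2 = gzip.compress(b"contents of f2")

# The concatenation of two gzip files is a valid gzip file whose
# decompression is the concatenation of the two sources.
assert gzip.decompress(f1 + f2) == b"contents of f1" + b"contents of f2"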

Exercise 2.15 (Gzip)

1. Encode the string “abbarbarbbabbarbaa” with LZ77.

2. Cut the previous encoding into three strings (distances, lengths, and characters), and apply gzip, the dynamic Huffman encoding of 0 being forced onto one single bit, 1. Besides, in gzip, the first 0 is not forgotten.

3. Let us suppose that the previous string is divided into two pieces, “abbarbarb” and “babbarbaa,” in two separated files “f1” and “f2.” Give the content of “f1.gz” and “f2.gz,” the compressions of “f1” and “f2” with gzip.

4. One appends the two files “f1.gz” and “f2.gz” in a file “f3.gz” (e.g., using the Unix command cat f1.gz f2.gz > f3.gz); then one uncompresses the file “f3.gz” with gunzip. What is the result? Deduce the interest of not forgetting the first 0.

5. How does one add compression levels in gzip (e.g., gzip -1 xxx.txt or gzip -9 xxx.txt)?

Solution on page 301.

2.4.2 Comparisons of Compression Algorithms

In Table 2.5, we compare the compression rates and the encoding/decoding times of the common tools on Unix/Linux using the example of an email file (containing text, images, and binary files).

Table 2.5 Comparison of the Compression of an Email File (15.92 MB: Text, Images, Programs, etc.) with Several Algorithms, on a Pentium IV 1.5 GHz

Algorithm               Compressed file (MB)   Rate (%)   Encoding time (s)   Decoding time (s)
7-Zip-4.42 (LZMA+?)             5.96            62.57          23.93                6.27
RAR-3.51 (?)                    6.20            61.07          14.51                0.46
rzip-2.1 -9 (LZ77+GB)           6.20            61.09           9.26                2.94
ppmd-9.1 (Predictive)           6.26            60.71          11.26               12.57
bzip2-1.0.3 (BWT)               6.69            57.96           7.41                3.16
gzip-1.3.5 -9 (LZ77)            7.68            51.77           2.00                0.34
gzip-1.3.5 -2 (LZ77)            8.28            47.99           1.14                0.34
WinZip-9.0 (LZW+?)              7.72            51.55           5                   5
compress (LZW)                  9.34            41.31           1.11                0.32
lzop-1.01 -9 (LZW+?)            9.35            41.32           6.73                0.09
lzop-1.01 -2 (LZW+?)           10.74            32.54           0.45                0.09
pack (Huffman)                 11.11            30.21           0.33                0.27

The gzip and compress programs—which use variants of Lempel–Ziv—were described in the previous section. bzip2—which applies an entropy-reduction transform such as the BWT before performing an entropic encoding—is described in Section 2.3.3. 7-Zip uses the LZMA variant of LZ77, associated with an arithmetic encoding and several additional heuristics depending on the kind of input file. WinZip also uses several variants depending on the type of the file and finishes with an LZW encoding. rzip is designed for large files and, therefore, uses LZ77 with a very large window. ppmd uses a Markov probabilistic model in order to predict the next character given the characters immediately before it. lzop uses a collection of algorithms called Lempel-Ziv-Oberhumer (LZO), optimized for decompression speed: it takes care of long matches and long literal runs, so that it produces good results on highly redundant data and deals acceptably with barely compressible data. Finally, pack implements the simple Huffman encoding.

As a conclusion, practical behavior reflects theory: LZ77 provides better compression than LZW but is slower. Dictionary-based algorithms are more efficient than naive entropic encodings but less efficient than an entropic encoding preceded by a good entropy-reducing heuristic. Common software usually tries several methods in order to find the best compression. Although this implementation choice improves the compression rate, the compression time usually suffers from it.

2.4.3 GIF and PNG Formats for Image Compression

Common compact data formats take the nature of the data to be encoded into account. They differ depending on whether the data are sound, image, or text. Among the image formats, the Graphics Interchange Format (GIF) is a bitmap graphics file format proposed by the CompuServe company. It is a compression format for pixel images, that is, for images described as a sequence of points (pixels) stored in an array; each pixel has a value for its color.

The compression principle is divided into two steps. First, the pixel colors (at the beginning, there are 16.8 million colors, encoded on 24 bits Red/Green/Blue (RGB)) are restricted to a palette of 2 to 256 colors (2, 4, 8, 16, 32, 64, 128, or the default value of 256 colors). The color of each pixel is thus rounded to the closest color appearing in the palette (one common rounding is sketched after the list below). While keeping a large number of different colors with a 256-color palette, this already yields a compression factor of 3: each pixel requires only 8 bits (its index in the palette) instead of 24. Then, the sequence of pixel colors is compressed with the dynamic compression algorithm Lempel-Ziv-Welch (LZW). There exist two versions of this file format, developed in 1987 and 1989, respectively:

· GIF 87a enables a progressive display (using interlacing) and animated images (animated GIFs) by storing several images within the same file.

· GIF 89a—in addition to these functionalities—enables one to define a transparent color in the palette and to specify the delay for the animations.
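As announced above, here is one common way to perform the palette rounding of the first compression step (a minimal Python sketch; the format does not mandate a particular distance, and we assume here the squared Euclidean distance in RGB space):

def nearest_palette_index(pixel, palette):
    # Round a 24-bit RGB pixel to the index of the closest palette color.
    r, g, b = pixel
    return min(range(len(palette)),
               key=lambda i: (palette[i][0] - r) ** 2
                           + (palette[i][1] - g) ** 2
                           + (palette[i][2] - b) ** 2)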

As the LZW compression algorithm was patented by Unisys, all software editors manipulating GIF images would have had to pay royalties to Unisys. This is one of the reasons why the Portable Network Graphics (PNG) format is increasingly used, to the detriment of the GIF format. The PNG format is similar to the GIF format, but it uses the LZ77 compression algorithm.