Entropy and prefix coding

The Mastery of Computer Programming: Primary Algorithms - Sykalo Eugene 2023

Algorithmic Complexity and Information Theory

Entropy and Its Relevance

Entropy is a measure of the randomness or unpredictability in a set of data. In information theory, it quantifies the information content of a message: the entropy of a source is the minimum average number of bits per symbol needed to encode its output. Formally, if the symbols occur with probabilities p1, ..., pn, the entropy is H = -(p1 log2 p1 + ... + pn log2 pn) bits per symbol. For example, consider a message built from the letters A, B, and C. If each letter is equally likely, the entropy is log2 3, about 1.58 bits per symbol, so a three-symbol message requires about 4.75 bits. If instead A is twice as likely as B or C (probabilities 1/2, 1/4, and 1/4), the entropy drops to 1.5 bits per symbol, and a three-symbol message requires only about 4.5 bits. Knowing the entropy of a message therefore tells us the minimum number of bits needed to encode it, which is the fundamental limit in data compression and transmission.
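
As a concrete illustration, the following minimal Python sketch computes Shannon entropy from a list of probabilities and reproduces the figures above; shannon_entropy and message_entropy are illustrative helper names, not standard library functions:

import math
from collections import Counter

def shannon_entropy(probabilities):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Three equally likely symbols: H = log2(3), about 1.58 bits per symbol.
print(shannon_entropy([1/3, 1/3, 1/3]))    # ~1.585

# A twice as likely as B or C (1/2, 1/4, 1/4): H = 1.5 bits per symbol,
# so a three-symbol message needs about 4.5 bits.
print(shannon_entropy([0.5, 0.25, 0.25]))  # 1.5

def message_entropy(message):
    """Empirical entropy of a string, from its observed symbol frequencies."""
    counts = Counter(message)
    n = len(message)
    return shannon_entropy(c / n for c in counts.values())

print(message_entropy("ABCABCAA"))         # 1.5 for this particular string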

In computer programming, entropy is used in various applications, such as cryptography, random number generation, and compression. In cryptography, entropy sources are used to generate keys and passwords that an attacker cannot predict. Random number generators are seeded from entropy sources so that their output is unpredictable. In compression, entropy estimates how much redundancy a message contains and therefore how far it can be shrunk without losing information. Understanding entropy is thus essential for any programmer who wants to write efficient and effective programs.

Prefix Coding

Prefix coding is a method of assigning codewords to a set of symbols in such a way that no codeword is a prefix of any other codeword. As a result, a stream of concatenated codewords can be decoded from left to right, unambiguously and without separators. Prefix coding is also known as prefix-free coding.
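
A small Python sketch, using a made-up three-symbol code, illustrates why the prefix-free property matters: the decoder can emit a symbol the moment it has read a complete codeword, with no separators needed:

# A hypothetical prefix-free code: no codeword is a prefix of another.
code = {"A": "0", "B": "10", "C": "11"}

def encode(message, code):
    return "".join(code[s] for s in message)

def decode(bits, code):
    """Greedy left-to-right decoding; unambiguous because the code
    is prefix-free."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:        # a complete codeword has been read
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("bit stream ends mid-codeword")
    return "".join(out)

bits = encode("ABCA", code)       # "010110"
assert decode(bits, code) == "ABCA"
print(bits, "->", decode(bits, code))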

Prefix coding is commonly used in data compression, where it is applied to the symbols of a message to create a compressed representation of the message. The compressed representation of the message is typically smaller in size than the original message and can be transmitted or stored more efficiently.

One of the most well-known applications of prefix coding is Huffman coding, which is a specific type of prefix coding used for lossless data compression. In Huffman coding, the frequency of occurrence of each symbol in the message is used to assign a unique code to each symbol. The more frequently a symbol occurs, the shorter its code. This results in a compressed representation of the message that is smaller in size than the original message.

Prefix coding has an important advantage over variable-length codes that are not prefix-free: decoding is instantaneous and unambiguous, since a codeword is recognized the moment its last bit arrives, with no look-ahead or separators required. Prefix codes are also not tied to any particular symbol set; the same construction works for letters, pixels, audio samples, or any other alphabet, which makes the technique applicable to a wide range of problems.

Prefix coding has several applications in computer programming, including data compression, error correction, and cryptography. It is often used in image and video compression, as well as in data transmission over networks. Therefore, understanding prefix coding is essential for any programmer who wishes to develop efficient and effective programs.

Huffman Coding

Huffman coding is a specific type of prefix coding used for lossless data compression. It was devised by David A. Huffman in 1951, while he was a graduate student at MIT, and published in 1952. The basic idea is to assign variable-length codes to the symbols of a message so that the more frequently a symbol occurs, the shorter its code, yielding a compressed representation that is smaller than the original message.

The Huffman coding algorithm first counts the frequency of each symbol in the message. These frequencies are used to build a binary tree called the Huffman tree: the two nodes with the lowest frequencies are repeatedly merged into a single parent node until only the root remains. Codes are then assigned by walking from the root to each leaf, appending a 0 for each left branch and a 1 for each right branch; the code for a symbol is the sequence of bits along the path from the root to that symbol's leaf.
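
Here is a minimal Python sketch of this construction, using a binary heap for the repeated lowest-frequency merges. Tie-breaking between equal frequencies is arbitrary, so different implementations may produce different codes with the same total length:

import heapq
from collections import Counter

def huffman_codes(message):
    """Build a Huffman code from symbol frequencies: repeatedly merge
    the two lowest-frequency nodes, then read codes off the tree
    (0 = left branch, 1 = right branch)."""
    counts = Counter(message)
    # Heap entries: (frequency, tie-breaker, tree). A leaf is a symbol
    # string; an internal node is a (left, right) pair.
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tick, (left, right)))
        tick += 1
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")      # left branch -> 0
            walk(node[1], prefix + "1")      # right branch -> 1
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(root, "")
    return codes

msg = "ABRACADABRA"
codes = huffman_codes(msg)
print(codes)  # frequent symbols (here 'A') get the shortest codes
print("encoded length:", sum(len(codes[s]) for s in msg), "bits")

With the tie-breaking used here, "ABRACADABRA" encodes in 23 bits, about 2.09 bits per symbol against an empirical entropy of roughly 2.04 bits, and the most frequent symbol A receives the lone one-bit code.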

The main advantage of Huffman coding is that the resulting code is optimal among prefix codes that assign a whole number of bits to each symbol: no such code has a shorter average length for the given frequencies. Intuitively, this is because the tree is built so that the more frequently a symbol occurs, the closer its leaf sits to the root, and therefore the shorter its code.

Huffman coding is widely used in data compression, especially in image and video compression and in data transmitted over networks, and it serves as a building block in many file compression formats. One of its best-known applications is the compression of text files, where it can often cut a typical file to roughly half its original size.

However, Huffman coding also has limitations. One is that the static form requires the frequency of each symbol in advance, which means either a preliminary pass over the data or an agreed-upon frequency table. Another is that every symbol costs a whole number of bits, so the code is inefficient when symbol probabilities are far from negative powers of two; in the extreme, a symbol with probability 0.99 still costs at least one full bit. In such cases, other methods such as arithmetic coding can get closer to the entropy limit.
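
A small worked example, with hypothetical probabilities, makes the integer-bit penalty concrete: for the skewed source below, the Huffman code A=0, B=10, C=11 spends nearly twice the entropy:

import math

# Skewed source: Huffman must still spend at least 1 bit on 'A'.
p = {"A": 0.90, "B": 0.05, "C": 0.05}
length = {"A": 1, "B": 2, "C": 2}   # code lengths of the Huffman code A=0, B=10, C=11

avg = sum(p[s] * length[s] for s in p)            # 1.100 bits per symbol
H = -sum(q * math.log2(q) for q in p.values())    # ~0.569 bits per symbol
print(f"Huffman average: {avg:.3f} bits, entropy: {H:.3f} bits")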

Arithmetic Coding

Arithmetic coding is another method of lossless data compression, one that can compress closer to the entropy limit than Huffman coding. It was introduced in its modern form by Jorma Rissanen in 1976. The basic idea is to represent an entire message as a single fractional number in the range [0, 1) and then emit just enough bits to identify that number. The number of bits needed grows with the length and improbability of the message, coming out to roughly -log2 p bits, where p is the probability of the message as a whole.

The arithmetic coding algorithm maintains a current interval, initially [0, 1), divided into subintervals, one per symbol, each with width proportional to that symbol's probability of occurrence. To encode a symbol, the coder narrows the current interval to that symbol's subinterval; the narrowed interval is then subdivided in the same proportions for the next symbol, and so on. After the last symbol, any number inside the final interval identifies the entire message, and the encoder outputs enough bits to single out such a number.
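
The following Python sketch shows the interval narrowing directly. It uses ordinary floating-point numbers, which is fine for exposition but runs out of precision on long messages; production coders use integer arithmetic with renormalization and emit a bit stream rather than returning a float. The symbol probabilities are assumed to be known to both encoder and decoder:

def arithmetic_encode(message, probs):
    """Narrow [0, 1) once per symbol; any number in the final interval
    identifies the whole message."""
    cum, c = {}, 0.0
    for sym, p in probs.items():        # cumulative interval per symbol
        cum[sym] = (c, c + p)
        c += p
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        s_lo, s_hi = cum[sym]
        low, high = low + span * s_lo, low + span * s_hi
    return (low + high) / 2             # any value in [low, high) works

def arithmetic_decode(value, length, probs):
    cum, c = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = (c, c + p)
        c += p
    out, low, high = [], 0.0, 1.0
    for _ in range(length):
        span = high - low
        for sym, (s_lo, s_hi) in cum.items():
            if low + span * s_lo <= value < low + span * s_hi:
                out.append(sym)
                low, high = low + span * s_lo, low + span * s_hi
                break
    return "".join(out)

probs = {"A": 0.5, "B": 0.25, "C": 0.25}
x = arithmetic_encode("ABCA", probs)
print(x, "->", arithmetic_decode(x, 4, probs))  # -> ABCA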

The main advantage of arithmetic coding over Huffman coding is that it is not restricted to a whole number of bits per symbol. Because the message is represented as a single number drawn from a continuous range, each symbol effectively consumes a fractional number of bits, so arithmetic coding approaches the entropy limit even when symbol probabilities are far from negative powers of two, such as the highly skewed distributions on which Huffman coding wastes the most.

Arithmetic coding is often used in text and image compression, as well as in speech and audio compression, and it appears in many file compression formats. However, it is more complex than Huffman coding and requires more computation per symbol, which can make it less attractive when encoding and decoding speed are the dominant constraints.

Adaptive Huffman Coding

Adaptive Huffman coding is a variant of Huffman coding that adapts to the input as it is being encoded. The frequency count of each symbol is updated dynamically as each symbol is processed, and the Huffman tree is modified on the fly to stay consistent with the counts. This permits single-pass compression with no separate frequency-gathering step and no code table to transmit, and it tracks sources whose statistics drift over time, where a single static code would fit poorly.

The adaptive Huffman coding algorithm begins by initializing the Huffman tree with a single special node, the escape symbol (sometimes called the "not yet transmitted" node), which stands for every symbol that has not appeared so far. When a new symbol first occurs, the coder emits the escape code followed by the symbol itself; from then on the symbol has its own leaf, and each occurrence updates its count and restructures the tree as needed. The decoder performs exactly the same updates, so encoder and decoder stay synchronized without any side information.
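
A faithful implementation of the classic tree-update rules (the FGK or Vitter algorithms) is fairly long, so the Python sketch below substitutes a deliberately simplified scheme: it keeps running counts and rebuilds the Huffman code from scratch after every symbol. This is much slower than a true adaptive Huffman coder, but it shows the escape mechanism and the one-pass structure; a decoder would mirror the same count updates to stay synchronized:

import heapq
from collections import Counter

ESC = "<ESC>"  # escape symbol for not-yet-seen characters

def build_codes(counts):
    """Rebuild a Huffman code from the current counts (0 = left, 1 = right)."""
    heap = [(f, i, s) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tick, (a, b)))
        tick += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def adaptive_encode(message):
    counts = Counter({ESC: 1})     # start knowing only the escape symbol
    out = []
    for sym in message:
        codes = build_codes(counts)
        if sym in counts:
            out.append(codes[sym])
        else:
            # Emit escape, then the new symbol literally (8 bits here),
            # so a decoder can learn the symbol and mirror the update.
            out.append(codes[ESC] + format(ord(sym), "08b"))
        counts[sym] += 1           # the decoder performs the same update
    return "".join(out)

print(adaptive_encode("ABRACADABRA"))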

The main advantage of adaptive Huffman coding over static Huffman coding is that it needs no advance knowledge of the symbol frequencies and no transmitted code table, which makes it applicable whenever the data must be encoded in a single pass or its statistics change over time.

Adaptive Huffman coding is therefore used in streaming settings such as live data transmission and voice over IP (VoIP), where the input statistics change continuously and the code must be updated on the fly. However, it is more complex than static Huffman coding and requires more computational resources, which can make it a poor fit for low-power devices or applications with tight resource budgets.