# Entropy and certainty in lossless data compression

TABLE OF CONTENTS ABSTRACT ii ACKNOWLEDGEMENTS iii LIST OF ILLUSTRATIONS vi LIST OF TABLES viii LIST OF ABBREVIATIONS ix NOTATION AND GLOSSARY x 1 INTRODUCTION 1 1.1 Introduction to Data Compression 1 1.2 Properties to Classify Compression Algorithms 2 1.3 Objective 5 2 BACKGROUND INFORMATION AND CURRENT SOLUTIONS 8 2.1 Introduction to Data Compression Algorithms and Techniques 8 2.2 Claude Shannon and Information Theory 8 2.3 Entropy Theory 10 2.4 Entropy in Information 13 2.5 Shannon-Fano Encoding 16 2.6 Huffman Encoding 20 2.7 Shannon-Fano-Elias Encoding 23 2.8 Summary 27 3 CURRENT ANALYSIS OF CORE ALGORITHMS 29 3.1 View of the Data Compression Environment and Resulting Algorithms 29 3.2 Certainty and Two Types of Entropy 29 3.3 Shannon-Fano's Traditional View and Entropy Division by Certainty 31 3.4 Huffman and Entropy Addition 33 3.5 Shannon-Fano-Elias' Entropy View and Entropy Division by Entropy 35 3.6 Summary 36 4 MAPPING OF CERTAINTY AND ENTROPY 39 4.1 Illustrating Certainty and Entropy 39 4.2 Analysis of the Previous Models 40 4.3 Map of Certainty and Entropy (MaCE) of the Symbol and Bit Space 42 4.4 Entropy in Two-Dimensional Space 47 IV

4.5 Mapping Certainty and Entropy 49 4.6 Example of MaCE 50 4.7 MaCE and the Symbol 52 4.8 Summary 54 5 USE OF MaCE FOR DATA COMPRESSION 55 5.1 Using MaCE 55 5.2 Select Level Method 56 5.3 SLMN: Decrease Entropy to H(x) + 1 70 5.4 Beyond H(x) + l 79 5.5 JAKE 80 5.6 Symmetry of Encoding and Decoding 86 5.7 Summary 87 6 EXPERIMENTAL RESULTS OF SPATIAL METHODS 89 6.1 Results of SLM and SLMN 89 6.2 Test Environment 89 6.3 CPU Time Comparison between SFE and SLM 90 6.4 Entropy Comparison between SLMN, SFE, and the Ideal 95 6.5 Summary of Results for SLM and SLMN 99 7 CONCLUSION 105 7.1 Conclusion 105 APPENDIX A ARITHMETIC ENCODING EXAMPLE HO B COMPUTER RESULTS SELECT LEVEL METHOD 113 C COMPUTER RESULTS SORT LINEAR METHOD NIVELLATE 115 BIBLIOGRAPHY 118 v

LIST OF ILLUSTRATIONS Figure 2.1 Shannon's decomposition of a choice from three possibilities 10 2.2 Entropy map of a tree described in base 2 12 2.3 Colors expansion example 15 2.4 Shannon-Fano example 18 2.5 Comparison of Shannon-Fano to Huffman encoding example 20 2.6 Illustration of the Huffman algorithm 22 2.7 Shannon-Fano-Elias'view of the data compression environment 25 2.8 Tree created by Shannon-Fano-Elias encoding 26 3.1 Shannon-Fano's view of the entropy space 32 3.2 Graph of two probabilities using the entropy equation 34 4.1 Traditional depiction of the binary tree 41 4.2 Shannon-Fano-Elias' model of the data compression environment 43 4.3 MaCE 44 4.4 MaCE example 46 4.5 Entropy map 48 4.6 Entropy map complete 49 4.7 Certainty map 50 4.8 MaCE example with the ideal H to represent 5 symbols 51 4.9 MaCE example illustrating the compression obtained by Huffman encoding . . 52 4.10 MaCE example including a symbol 53 5.1 Shannon-Fano-Elias: The intersect point 59 5.2 SLM: The intersect point 60 5.3 Equilibrium illustration 64 5.4 Equilibrium: Individual entropy values are equal and at the maximum 65 5.5 Split tree displacement/expansion 66 5.6 Illustration of SLM algorithm 69 5.7 SLMN: Midpoint mapping 72 5.8 Illustration of SLMN algorithm with average 77 5.9 Illustration of SLMN algorithm without average 78 5.10 JAKE in base 3 84 6.1 Total number of CPU cycles for SLM and SFE 90 6.2 Minimum number of CPU cycles for SLM and SFE 92 6.3 Maximum number of CPU cycles for SLM and SFE 94 6.4 Entropy comparison between SLMN, SFE and the theoretical ideal 96 6.5 Entropy comparison for SLMN modified 98 6.6 Entropy comparison between SLMN and SFE 101 VI

6.7 File size comparison between SLMN and SFE 101 6.8 Total number of CPU cycles for SLM 102 6.9 Total CPU cycles for SFE 102 6.10 Minimum number of CPU cycles for SLM 103 6.11 Minimum number of CPU cycles for SFE 103 6.12 Maximum number of CPU cycles for SLM 104 6.13 Maximum number of CPU cycles for Shannon-Fano-Elias 104 vn

LIST OF TABLES Table 2.1 Shannon-Fano-Elias tabulated results 26 5.1 JAKE tabulated results in base 3 83 5.2 Huffman tabulated results in base 2 83 A.l Arithmetic encoding tabulated results 112 B.l SLM tabulated results in CPU cycles 113 B.2 Shannon-Fano-Elias tabulated results in CPU cycles 114 C.l SLMN tabulated entropy results 115 C.2 SFE tabulated entropy results 116 C.3 Ideal tabulated entropy results 117 vin

LIST OF ABBREVIATIONS SFE SF PMF MaCE SLM SLMN MSPM JAKE - Shannon-Fano-Elias • Shannon-Fano • Probability Mass Function • Map of Certainty and Entropy - Select Level Method • Sort Linear Method Nivellate - Midpoint Spatial Mapping - Jacobs, Ali, Kolibal Encoding IX

NOTATION AND GLOSSARY General Usage and Terminology The notation used in this text represents fairly standard mathematical and computational us age. In many cases these fields tend to use different preferred notation to indicate the same concept, and these have been reconciled to the extent possible, given the interdisciplinary nature of the material. General Definitions A code definition refers to the requirements defining the code. A code word refers to the unique string of bits within a given range defined by the code definition used to represent a symbol. A symbol refers to the entities being assigned the code word which is also referred to as encoding. Encoding refers to the process of assigning a symbol to a code word. Decoding refers to the process of retrieving a symbol from the encoding table by utilizing the code word. General Notations Entropy is denoted in multiple fields by S and a specialization of S denoted by H is used for the purpose of information entropy, also known as Shannon entropy. Entropy for an individual symbol is denoted by Ht for datum / and the probability of occurrence for an individual symbol / is denoted by P;. Information or surprisal is denoted by I. The length of a code or the level within the space is denoted by L and the width of the space is denoted by W. The level L representing equilibrium is denoted by Le. Certainty of an event is rep resented by C. The probability of certainty Pc denotes the termination location of complete code word representing the symbol. Acronyms MaCE (Map of Certainty and Entropy), developed in Chapter 4, is a depiction used to il lustrate the spatial relationships between entropy H, certainty C and the symbol. SLM (Select Level Method), developed in Sec. 5.2, is the first compression method proposed which uses the concepts from MaCE to select the level required by the symbol to recreate Shannon-Fano-Elias encoding. MSPM (Midpoint Spatial Mapping) uses the spatial con cepts illustrated in MaCE to define the midpoints between the levels in the entropy and certainty space. SLMN (Sort Linear Method Nivellate), developed in Sec. 5.3, is the sec ond proposed method which uses the concepts from SLM and MSPM to decrease the final entropy L(x) to L(x) < H(x) + 1. JAKE (Jacobs, Ali, Kolibal Encoding) extends SLM and SLMN into other numerical bases to decrease the entropy below H(x) + 1. x

1 Chapter 1 INTRODUCTION 1.1 Introduction to Data Compression Data compression is the art of using encoding techniques to represent data symbols us ing less storage space compared to the space required for the original data representation. Various techniques have been developed to address this problem over the years and many have survived the test of time. The longevity of these techniques has led to many enhance ments and specializations to accomplish the task. However, the ever growing need to store, transmit and retrieve data in a timely manner while preserving communication and storage resources has rekindled the demand for even more efficient data compression methods. Data compression is an art form with the purpose of representing a symbol or series of symbols with fewer symbols as compared to the original data representation. An early example of the art form is in the development of Morse code [39]. Morse code was invented in 1838 and is designed to represent letters efficiently by assigning shorter code words to the most frequent utilized letters. In the Morse code a letter consists of one to five dots or dashes and the commonly occurring letter 'e' is assigned the dot code word. Although the concept was not applied to computing until later, the general idea remains. A code word is the fundamental building block of data compression, building the rela tionship between the encoded message and the original data. This relationship is utilized to produce the encoded message by replacing the original data with the code words. The decoding process reverses the encoding by replacing the code words with the symbols. The code word is defined by the code definition used to produce the relationship and the vari ous properties relating the code word, such as length and assignment order to the symbols. For data compression the encoding technique tries to find a code word that is shorter or smaller then the original representation. There is generally no requirement to mask the relationship between the code words and the original data, so the term encoding should not be misinterpreted as pertaining to cryptology. For computing purposes, the symbols can be anything from a simple character represented in ASCII to a complex series of bits in an image. The main requirement is that the symbol be uniquely representable for the encoding and decoding process in order to be able to reassemble exactly the original data being compressed. The space utilized by data compression can be any machine resource and is usually engineered in binary and the space available is represented in bits.

2 As alluded to in the previous paragraph, the two main applications for data compression are the storage and transmission of data. Both of these applications have finite resource lim itations and data compression is needed to efficiently utilize the available resource. From this point of view data compression can be interpreted as an optimization or fitting process to the resources available. The only variables are the data or how the data is represented, since the resource storing or transmitting the data is static. This attribute requires a data- centric mechanism to preform the manipulation and achieve the desired results in terms of transmission throughput or data retention. Data compression has a history almost as old as computing itself. It is typically traced back to the development of information theory in the late 1940's. One of the most notable figures in the field, due to his original writings on the subject, is Claude Shannon. His approach of using the entropy equation to model communication is the foundation for much of the work in this area. This paper discusses the pertinent parts of his work in detail in the following chapters. Data compression is a required part of computing and an endeavor of great worth to the computing community. Data compression is a large field with varying requirements from the retention of data attributes to the length of the code words. This paper examines the base algorithms supporting the current approaches used to achieve data compression and proposes a new model of the problem, as well as a few methods that utilize this model. We will start with a brief review of the properties used to classify various data compression techniques to introduce these concepts. 1.2 Properties to Classify Compression Algorithms We define some of the properties used to classify data compression algorithms in order to categorize the algorithms and define their relative strengths and weaknesses. The combina tion of these properties defines the requirements of the algorithm. Each property must be selected appropriately to achieve the desired compression. More importantly, the quality of the reproduced data source must be achievable based on the properties selected. Compres sion methods can be broadly grouped in several sub-categories: lossless or lossy; prefixed or prefix-free; variable or fixed length code; and dynamic or static. The first category of data compression algorithms is defined by whether or not all the original data is completely represented in the compressed form in a manner that allows for reconstruction of the data after decompression. The terms associated with this category are lossless and lossy compression. More specifically, the terms refer to how the symbols are represented in their compressed form as the loss is usually implemented at the time the

3 encoding table is created. The encoding tables are used by the compression algorithms to store the code word to symbol relationship for use in the decoding and encoding of the data. Lossy compression uses a method that removes part of the information when creating the encoded symbol. This method typically reduces substantially the size of the final en coding in contrast to lossless compression. Lossy compression is typically used on data sources that do not require all of the information to be utilized in their reproduced form. For instance, an audio signal is originally an analog signal representing the continuous fluctuation of air density. High fidelity conversion of this signal to digital itself already removes some of the information from the original analog signal without a noticeable dif ference. In some cases this loss is even welcomed as it makes the signal cleaner and more distinguishable. Converting the audio files to a lossy compression form, e.g., MP3 (MPEG- 1 Audio Layer 3), removes even more of the information from the data stream while still retaining a respectable quality audio reproduction. In addition to MP3, most multimedia data sources like pictures, audio and video use some sort of lossy compression for the same reason. Some of the formats include JPEG (Joint Photographic Experts Group) for static images and MPEG (Motion Pictures Expert Group) for video images. Lossless compression is the opposite of lossy in that it retains all of the information from the original data source, i.e., there is a direct correlation between representing the symbol in the original data and the reproduced representation of the symbol in the com pressed data. The lossless requirement typically requires a larger compressed represen tation than the lossy method. Lossless compression is used in applications where data sources cannot tolerate loss in the reproduction of the data. For instance, a text message that losses some of the character data would be of little use to the end reader of the doc ument. Of course, the use of error-recovery codes can eliminate the loss, but only at the expense of greater redundancy, yielding no net gain in compression! So, even though the size of the end result may be large, it is the dictates of the final requirements that determine whether or not lossy compression can be utilized. In data compression a combination of lossy then lossless compression is typically utilized to achieve maximum compression for data sources that allow lossy compression. The second category of data compression algorithms is defined by the code words being generated by the encoding process. Two properties associated with defining the code word are prefixed or prefix-free. A prefix is utilized to define a difference between one root word and another with the addition of a leading appendage. A prefix, in the case of binary code words, is an appendage of zeros and ones tagged onto the front of a root code word in the form of zeros and ones. For example, if the root of the code word is the last digit in a string of binary numbers and we have the code of 110 0^ and 110JTJ. The root of the code word is

4 represented by the values in the box and the prefix is represented by 110. A prefixed code allows for a prefix to exist within the code words. The main disadvantage to prefixed codes is that they require some type of termination knowledge in order to differentiate one code word from the next. Two of the common ways to acquire this knowledge is through a fixed length standard or termination codes in the form of additional characters. For data compression this may be a drawback, depending on the method chosen to compress the data as both fixed length and the additional termination codes lead to extra utilization of the resource. In contrast, prefix-free is a property that maintains code words that do not contain a prefix to another code word. The property maintains the ability for the code word to be unambiguous by maintaining a complete string or complete strings of zeros and ones that terminate at only one symbol. The uniqueness of the code word based on the termination of a recognizable string allows a prefix-free code word length to vary. The varying length is beneficial when it is possible to avoid the extra utilization of the resource required by prefixed code words. The main disadvantage to prefix-free code is that the prefix-free properties usage of the system may be disproportional in terms of code word lengths. In data compression a combination of prefixed and prefix-free code is typically utilized to achieve maximum compression. The third category for data compression algorithms is defined by the data source being encoded. The terms associated with this category are static and dynamic and refer to the data source stability while the compression algorithm is encoding the data. For static com pression algorithms the data distribution does not change from the time of sampling to the time of compression. The source being static allows the algorithm to make decisions on how to properly encode in advance based on the distributions. This may require the build ing of an encoding table which contains the symbol counts or relative percentages. The main disadvantage to the static compression model is that the algorithm usually requires two passes through the data set. The first pass builds the encoding table and the second pass encodes the data source into the encoding stream. Dynamic compression allows for changing distributions within the source data stream. The encoding changes in response to the distribution modifications as new symbols are encountered or in response changes in relative distribution of neighboring symbols. The ability to adapt to changes in distribution allows for the algorithm to process the data in one pass by encoding the symbols as they are read. The key advantage to this approach is that one pass through the data allows for encoding in real time. A major disadvantage is its susceptibility to error which can ruin the code. Since the data is sampled in a one pass or streaming manner the error may be encoded and transmitted before the error is recognized.

5 The error may even reach the decoding phase before the error is recognized inhibiting the decoding process. Retry and sampling logic is implemented to address these concerns, increasing the overhead of the compression model. Static compression is also susceptible to error, but since the data is static the retry logic can be imposed at the time of error with less cost as compared to dynamic compression. Another disadvantage is the added overhead involved in the update to the encoding table which is required when a distribution changes. This requirement adds additional overhead as updates must be periodically sent to the receiver. Without the added ability to address the errors and decrease the overhead adaptive encoding is limited. 1.3 Objective The list of properties in Sec. 1.2 is far from exhaustive as it only explains native differ ences in a variety of compression schemes, however, the list is complete enough for the objective of this study. The research preceding this dissertation evolved through a thor ough analysis of the current methods used in the art of data compression in both the static and dynamic environments. The analysis revealed that the current solutions were based on three fundamental algorithms, Shannon-Fano, Huffman and Shannon-Fano-Elias. This in cluded Arithmetic encoding which is the current standard in data compression and is based on the concepts of Shannon-Fano-Elias and the use of data modeling. The main complaint about the current solutions in both the static and the dynamic environments is the overall complexity of the methods. These complaints are the main reason Huffman is still used for compression even though Arithmetic encoding has been shown to produce better results. We tried to address the problem of complexity in the previous analysis and the process revealed that only marginal gains could be accomplished using the previous algorithms containing current enhancements. The research into these algorithms yielded insight on new ways to analyze the data compression environment and to use these concepts to pro vide new methods to perform data compression. To accomplish this objective, Chapter 2 covers some of the basic terms related to data compression beginning with the theory of entropy and the theoretical ideal. The theory of entropy is geared around the concept of space, items represented by their probability within the space and the distance within the space separating the items, or known reference points. The entropy concept is fundamental to all the work in the field of data compression and is used throughout this dissertation to analyze the various methods. Also covered are the three major compression algorithms fitting the above scheme and an analysis of the strengths and weakness of these algorithms. This review of the base algorithms in data

6 compression points out the data centric view and how this view influences the approach to compression that the algorithms follow. In Chapter 3 we re-examine in detail the base algorithms with regard to the unique approach used to create the algorithms and how this approach applies to the theory of entropy. We breakdown the previous algorithms operation in terms of entropy theory to examine their subcomponents to gain new insight into the methods ability to model the entropy ideals. The new analysis of the base algorithms includes the formulation of the equations of entropy that represent these algorithms and provides a quantification of the limits of these algorithms to compress data in terms of Shannon's theory. Chapter 3 also discusses the view taken in this work on the data compression environment. During this analysis the concept of certainty and the duality of the theoretical ideal are added to entropy theory in order to explain the relationships between the algorithms and the system. The concepts are further developed in Chapter 4 as the model of compression envi ronment of the base algorithms, Shannon-Fano, Huffman and Shannon-Fano-Elias, is ex amined to reveal the benefits each depiction has at describing entropy and the certainty. Chapter 4 uses the concepts exposed to introduce a new more holistic model of the com pression environment based on the space available in the system and the space occupied by the data using the analysis in Chapters 2-3. The Map of Certainty and Entropy, MaCE, de fines entropy and certainty as a two-dimensional space based on the entropy and certainty equations. MaCE shows the individual components that comprise entropy and certainty of the environment. The environment consists of both entropy inherent to the data and certainty inherent to the system. The development of this model reveals that it is possible to compare certainty to entropy and entropy to certainty to simplify the problem of data compression. MaCE is also used to illustrate the use of base manipulation to increase com pression. This model opens multiple doors into the subject of data compression, and only a small portion of these are examined in this dissertation. Chapter 5 uses MaCE to address some of the complexity of the problem of data com pression and produces some new methods to accomplish this task. These new methods show the versatility of the proposed approach and its affect on both speed and compression related to existing compression algorithms. The methods, SLM (Select Level Method), SLMN (Sort Linear Method Nivellate) and JAKE (Jacobs, Ali, Kolibal Encoding), use these concepts to accomplish the task. SLM is developed utilizing MaCE model and is designed for speed to achieve a code length L{x) with L{x) < H(x) + 2 with solely an arith metic comparison model. The concept of equilibrium and the split-tree are defined in order to further enhance the computational efficiency of SLM.

7 The second method, SLMN, is developed utilizing MaCE and the concept of midpoints between levels in the binary space. The midpoints develop an integrated sorting algorithm of 0(n) to pre-process the symbols for the data compression method as well as a fitting function built on the midpoint concept to adjust the code of length L(x) to L(x) < H(x) + 1. Chapter 5 also analyses the meaning of the +1 in H{x) + 1 and this knowledge pro duces JAKE. JAKE is an extension of SLM and SLMN into other bases more suitable to represent the data within the constraints of the linear system. The concept of choosing a base adjusts the width of the compression environment to a dimension more suitable for the data distribution. The adjustment of the width provided by JAKE augments the com pression optimization as SLMN and SLM both concentrate on minimizing the length of the code word. The width adjustment by JAKE allows the use of SLM and SLMN to obtain data compression closer to the theoretical ideal. We look at Arithmetic encoding with the understanding of the +1 with the purpose illustrating the power of changing the numeri cal base of the system to other than base 2. A comparison between Arithmetic encoding and Huffman encoding also shows how the granularity of the system affect on the overall compression. In Chapter 6 we also analyze the results in terms of CPU cycles for SLM in comparison to Shannon-Fano-Elias and we analyze SLMN in comparison to Shannon-Fano-Elias and the theoretical ideal in terms of entropy, file size and compression ratio. The results are broken down into their respective parts in order to clearly see the relationship between the algorithms and the data. The test sets included best/worst case analysis as well as random and sequential data distributions. The analysis shows the advantages of the methods over the current algorithms and reveals prospects for improving upon the results. Chapter 7 contains conclusions generated from this dissertation and some of the future work resulting from this research.