University of Ulster, Magee College. BSc. Applied Computing (1220), Year 4. Course: Image Processing (AC460). Autumn Term 1995. Lecturer: J.G. Campbell. Date: / /95 6. Aspects of Image Data Compression. ------------------------------------- 6.1 Introduction and Summary. ----------------------------- The following are the important concepts introduced in this chapter: - data compression should be considered whenever large amounts of data are stored on a limited storage resource (e.g. a magnetic disk), or transmitted over a communications channel that has limited capacity (e.g. a communications line with restricted data rate), - data compression exploits the fact that you need only store or transmit 'real' 'information', - '(real) information' is strongly related to uncertainty; if no uncertainty exists, there cannot be any flow of 'real' information; you give no 'information' if you tell someone something that he/she knows already, - entropy is a suitable mathemetical measure of uncertainty, - formally, the 'size' of a piece of 'information' (its entropy) can be measured in terms of the number of binary-digits (BITS) required to represent it unambiguously, - the number of bits is exactly the number of 'twenty-questions' type questions that need to be asked to unambiguously determine the information, - the measure is zero if there is no uncertainty, and there is no point is sending or storing anything, - the measure is small if there is small uncertainty (small variation), e.g. a hexadecimal digit can be expressed in 4 bits, - the measure is large if there is big uncertainty, a big number requires a large number of bits, e.g. a list of the populations of Irish towns and cities would range (say) 100 to 1,000,000 (a large variation) and would require about 20 bits for each number, - in sequences of pieces of information, sometimes one piece of information 'gives the game away' as to another piece of information; e.g. given the occurrence of the letter 'q', surely 'u' follows; this is variously called 'redundancy' (the 'u' is redundant), 'correlation', and 'dependence', - simple entropy based compression schemes look at the probability (& hence the uncertainty / entropy) of single symbols (a symbol can be e.g. a pixel value, or a text character), - however, the true entropy of the source must be measured using all possible strings of symbols; hence, single symbol entropy based methods (e.g. Huffman coding) do not achieve optimum efficiency, - in signals and images, there usually is dependence (correlation) between neighbouring points, - the bigger the dimensionality (signals: dimensionality = 1, images = 2, moving sequences of images = 3) the more neighbours, so the more correlation, and so the greater the scope for correlation, - various compression methods exploit correlation in markedly different ways: - run-length encoding, - differential (predictive) encoding, - transform coding. 6.2 Motivation. --------------- We mentioned in Chapter 1 that technological advances (sensors, display devices, speed of processors, size of memories) are allowing image processing into the arena of data processing (i.e. processing of text, numbers). Nevertheless, the vast amount of data required to represent a digital image is still a major obstacle: - storage: disk drives are getting large and fast, but not infinitely so. - communications channel capacity: the capacity of a communication channel is finite; fibre optics - with their vastly increased capacity - are on the way, but copper lines will prevail for most of the next 10 years. So, storage and communications of images can still be prohibitively costly, even though the remaining components of the system are readily and cheaply available; this is especially true of image sequences (movies). For example, a single digitised TV image requires (approx.) 600 x 600 x 3 colours x 256 levels = 1.08 Megabyte. A 35mm image (e.g. as used in conventional photography, and in the cinema) requires more than 10 times that. A good 35mm film - like those used in most movie cameras - will give a resolution of around 50 lines per mm (here, lines = resolvable points); thus, if we allow 2 pixels per resolved point, we have 35 x 50 x 2 => 3500 x 3500 x 3 pixels = 36.75 Megabytes. Hence, given our preference for powers of two, most digital animation studios that work for the cinema use 4096 x 4096 pixel images. Image data compression (often called just image coding) is about the reduction in the number of bytes (say) required to represent an image. Ex. 6.1 A Data Mountain! Consider a satellite which images the earth's surface at 10 metres resolution, and in 10 (colour) bands, 1 byte per band. Each point on the earth is covered every 20 days. How many bytes per year?. How many magnetic tapes, assume 100 Mb per tape. [Average radius of the earth Re = 6378 km; area of a sphere = 4PI.R^2]. Area = 511,185,932 kmsq = 0.5 x 10^9 kmsq = 0.5 x 10^15 metres sq (1 kmsq = 10^3x10^3 = 10^6 msq) Convert to pixels: = 0.5 x 10^13 pixels (1 pixel every 100 msq) = 0.5 x 10^14 bytes (1 pixel = 10 colours = 10 bytes) Coverages per year = 365/20 = (approx.) 18. Bytes per year = 0.5 x 18 x 10^14 bytes = 9 x 10^14 bytes One tape = 100 x 10^6 bytes, Therefore, bytes per year = 9 x 10^14 / 10^8 = 9 x 10^6 i.e. 9 million tapes. [I figure that on office like mine [about 6 m.x 3 m x 4 m(height) ] could hold about 20,000 tapes, i.e. we would require 500 such rooms!] 6.3 Context. ------------ Figure 6.3-1 gives the context of the problem. An image exists at the SOURCE in direct storage / raw / original form. SOURCE----->ENCODER---->CHANNEL---->DECODER---RECEIVER ^ | | NOISE Figure 6.3-1 Information Transmission Model --------------------------------------------- First, the data are ENCODED, presumably to occupy less storage than the original. Conceptually, encoding can be split into three steps, as shown in Figure 6.3-2; note, however, that any of the three steps may be missing. If there is no compression / encoding, all three are missing! TRANSFORMATION----->QUANTISATION---->CODING Figure 6.3-2 Encoder Model -------------------------- The first step is to transform the image into some form where parts of the transform space are more important (carry more information) than others - however, the transformation is not essential to the concept, so for the purposes of this introduction, assume that the transformation passes the raw data straight on to the quantiser. The next step is to quantise the data into a finite number of bits. Finally, the data can be coded (e.g. using a 'codebook' ); sometimes called 'source coding'. The coded data are then sent via a CHANNEL. Channel is a general term - it could be a transmission line, or it could be a storage device. While in the channel, the data may be subject to noise corruption, e.g. additive random noise like we have seen. Reduction of the effects of noise is the objective of ERROR CORRECTION CODES; we shall not be concerned with noise - leaving that problem to another part of the system. The channel has limited CAPACITY: for a transmission line - bits per second, for a data file - perhaps some agreed limitation in size. Channel Bandwidth: ------------------ The bandwidth (BW) of a channel is measured in Hertz (Hz). The capacity of the channel - its ability to pass information - is proportional to bandwidth. The capacity of a channel depends on (a) its bandwidth, and (b) its signal to noise ratio. Noise reduces capacity. On leaving the channel, the data is not in a directly usable form: it must be DECODED. Obviously the decoding process is strongly dependent on the encoding process - clearly, there must be cooperation between sender and receiver. Finally, the image arrives at the RECEIVER. Ideally, the receiver should receive the data as they left the source, or as close as possible. Source Coding versus Channel Coding. ----------------------------------- Most books introduce two extra stages: CHANNEL ENCODING and CHANNEL DECODING - just before, and just after, the channel. These are the equivalent of modulation and demodulation; they are associated with getting the signal into a form in which it can be carried by the so-called 'physical-layer'. For the sort of approach used here, it is much better to include these in the channel - and to associate any noise due to them with the channel. If neccessary we will make the distinction by explicitly naming the encoder and decoder as SOURCE encoder, and SOURCE decoder. Next, we introduce the theoretical notion of information. 6.4 Information Theory. ----------------------- 6.4.1 Introduction. ------------------- We need a precise definition of INFORMATION, particularly a quantitative unit and a method for measuring it. Information theory defines information as the reduction of uncertainty. Thus, the sentence "IBM make computers" conveys little or no information - because there is no uncertainty in your minds about the matter. Informal discussion of Information: ----------------------------------- Roughly speaking, the more variance, the more information. More precisely, the information (in bits) in a message is the number of 'twenty questions' type questions you would have to use to get the information. Thus, if I tell you I have written down a number between 0 and 15 (say 5), you can ask me: is it less-than-or- equal-to 7? (yes), is it less-than-or-equal-to 3? (no), less-than-or-equal-to 5? (yes), therefore it is 5 or 4, and the next question clinches it; thus, 4 questions or 4 bits; i.e. a hexadecimal digit contains 4 bits of information. In this last example, each value was equally likely, and the information can be calculated as: I = - Sum { pi. log2(1/pi) } where pi = probability of ith value = 1/16. log2(1/16) = - 4. I = - (16/16) (-4) bits = 4 bits. This is Shannon's 'entropy' or uncertainty. Information can flow only where there is uncertainty. If someone tells you that the number of days in the week is 7, no information flowed, because you knew the value with total certainty. Mathematically, you know that the probability of 7 being the number of days in the week is 1, and of all other numbers is 0. Thus, I = - Sum {pi. log2(pi) } = 0 because log2(1) = 0. What is the difference between information and data? This is a bit subtle, and can lead to problems and apparent paradoxes. For example, if you wanted to transmit Hamlet's 'To be or not to be' speech to a friend and you knew they had a collected works on their bookshelf, all you need to transmit is 'Hamlet Act 3 Scene 2, speech 4' i.e. about 30 alphabetic characters, or 240 bits - instead of the 2000 or so characters in the speech. Magic? paradox? No. The friend had received the major part of the information previously - the book. Most of data compression depends on this principle: don't waste resources on sending 'information' that is already known. 6.4.2 Information Measure - Entropy. ----------------------------------- Information theory uses a measure of information called ENTROPY; entropy is associated with 'mixed-up-ness' - uncertainty. Entropy (H) is measured in units of BITS - BInary digiTs. A bit is the fundamental quantum of information required to distinguish between two equally likely alternatives, e.g. the result of the toss of a fair coin. Mathematically, for N equally probable outcomes, (6.4-1) H = -log2 (1/N) (=log2 N) H is information in bits. Ex. Fair coin: H=-log2 (0.5)=1 bit. Ex. Dice: H=log2 (6), approx = 2.5 bits, NB log2(1/6)=-log2(6) Ex. Hexadecimal digit, see section 6.4.1 (sidebar), H = -log2(1/16) = -log2(1/2^4) = -log2(2^-4) = 4. Ex. One letter from an equally probable 26-letter alphabet: H=log2 (1/26)= 4.7 bits i.e. 5 bits would do. Ex. The winner of yesterdays 2.30 at Kempton Park: 0 Tomorrows: !! More correctly: (6.4-2) H = -log2 (p) where p=probability i.e. for a set of N independent symbols (6.4-3) H = -log2(p1) -log2(p2) -... -log2(pN) bits. Ex. Check the dice problem. p1=p2=...=p6=1/6 H = -log2(1/6) = -log(1/6) = log(6) [log(1/x)= -log(x)] Ex. ASCII code (128 allowable symbols): H= - log2(1/128)= 7 bits, which is precisely the number of bits used. 6.4.3 Average Entropy per Symbol. --------------------------------- Consider, now, n symbols (say 26) each with a different probability, pi = probability of ith symbol. The average information conveyed by a symbol is: (6.4-4) Hav = - p1.log2(p1) - p2.log2(p2)- .. -pn.log2(pn) n = - #pi*log2(pi) i=1 Note: the average of any function of i, f(i), can be determined from: n fa = - # pi.f(i) i=1 If the probabilities were equal for each letter, then the average information per letter for English would be 4.7 bits. However, the relative frequencies of occurrence - the frequencies of occurrence (approx. = probability) - of English letters are unequal ( e: 0.131, t: 0.105, ...down to z: 0.00077 i.e. 770 times every million), which leads to a reduction in average information to about 4.15 bits per letter. Clearly, we would like to use 4.15 bits instead of 4.7. In fact, there are coding methods (so-called entropy encoding) that can exploit unequal entropies in symbols; we will see an example of this later. In images, the pixel values correspond to symbols. And we have already seen, from histograms, that there are unequal probabilities, and, therefore, unequal entropies that can be exploited, usually by appropriate coding - see Figure 6.3-2. Some probabilities of letters are given in Table 6.4-1 (from Edwards(1969) ). ------------ Letter pi ------------ a .082 b .014 c .028 d .038 e .131 ... h .053 i .063 ... q .001 (rounded - off the scale) ... u .025 ... z .001 (rounded - off the scale) -------------- Table 6.4-1 Probabilities of Letters in English Text. ---------------------------------------------------- 6.4.4 Redundancy ---------------- We mentioned "independent" symbols above. In a message composed of English text, the symbols are not independent: there is varying degrees of dependence between successive symbols (and indeed between groups of them). If the letter "q" has just been transmitted, then the probability of "u" surely increases and the information contained in the next letter decreases accordingly; therefore, very little information is contained in the "u" because the receiver had a fair idea what it was before it arrived. Redundancy is the information theory term for 'dependence' (or 'lack of independence') between symbols; literally, redundancy means 'not needed'. Of course, redundancy is not always as clear- cut as in the case of the 'q' and 'u' mentioned above - given 'q', the probability of 'u' might jump from 0.025 to 0.95 (95%) (not 100%), and given 't' the probability of 'h' jumps from 0.053 to maybe 0.2. A measure of redundancy can be defined, in terms of the ratio of 'lost' entropy to entropy with no redundancy, as follows: (6.4-5) redundancy = (Hind - Hred)/Hind = (1- Hred/Hind) where Hred= average entropy for dependent symbols, taking into account redundancy, Hind= average entropy for independent symbols. Overall, English is more than 50 percent redundant. Here, we have just been discussing dependency/correlation between pairs of symbols (so called 'digrams'); there is additional redundancy if we look at triplets (so called 'trigrams'); e.g. given 'in', 'g' becomes much more likely. In images redundancy crops up as correlation between neighbouring pixels. Deep down, most data compression algorithms are based on removal of redundancy - don't send parts of the message that are already known - or on non-uniform entropy (and these are related). Transformation coding and predictive coding are examples of redundancy removal - but the redundancy removal is implicit and not too obvious. 6.4.5 Redundancy is Sometimes Useful! ------------------------------------- In most forms of communication redundancy can sometimes be useful; humans use it all the time; e.g. you can often reply before the previous speaker has finished, misspellings in text are an irritation, but the message still gets across. Of course, parity and checksums are classical forms of redundancy - used for error checking and correction. Note: You will find that some authors, e.g. Edwards (1969), include, in the definition of redundancy, the loss of entropy due to unequal probabilities; we will reserve redundancy for loss of entropy due to dependence - as above in eqn. 6.4-5. Ex. 6.4-1. Derive an estimate of the redundancy contained in the original of the following sentence (from Cover and Thomas (1991), p. 136). TH_R_ _S _NLY _N_ W_Y T_ F_LL _N TH_ V_W_LS _N TH_S S_NT_NC_. 6.5 Introduction to Image Data Compression. ------------------------------------------- Image data compression techniques are very much problem oriented. Nevertheless, there are general methods, and general categorisations that can be made. In some cases there will be a difference (error) between what leaves the source, and what arrives at the receiver. If perfect reconstruction (after the decoder) is demanded, i.e. zero error, the encoder-decoder (compression) system is called INFORMATION PRESERVING, or LOSSLESS; typically, for text communication, we require lossless compression; often, for image communication, we may tolerate some loss, after all, we are used to fuzzy pictures in newspapers, badly set-up video recorders, etc. Therefore, if the nature of the application allows some error, we can use a compression technique that merely maximises some FIDELITY CRITERION; these techniques are called LOSSY. Note: Chapter 8 (Pattern Recognition) mentions feature extraction. Feature extraction is strongly related to compression, because: - we want to minimise the number of features, - we want them to 'classify' as well as possible (the fidelity criterion). It is important to realise that, for error free compression, there is 'no free dinner'. As mentioned in the previous section, there are two major sources of savings: non-uniform entropy, and redundancy. If each pixel truly contains 8-bits of information and is truly independent, no compression is possible. But, if the grey levels are not equally likely (say, the average entropy is 6 bits) then compression is possible; however, it will not be possible to reduce below the theoretical limit of 6-bits. In this case the savings can be achieved through proper coding. If there is correlation, it will be possible to TRANSFORM the data such that there is less correlation (or none) - see Figure 6.3-2. E.g. transformation to the frequency domain, via the Fourier transform. Summary: ------- There are two major categories of data compression: - lossless; the data are reconstructed as if no compression had taken place. - lossy; here distortion / errors are tolerated to a limited extent; minimise the errors by maximising some fidelity criterion. Major compression principles: - transform compression, - linear transforms, e.g. DFT, Discrete Cosine Transform. - general transforms - usually decorrelating. - predictive compression (sometimes contained in the transform category), - source coding, e.g. entropy encoding, - quantisation coding, - ad hoc structural methods, e.g. run-length encoding. - image model coding, again, may be rather ad hoc. Many compression schemes combine more than one of these principles within an encoding scheme - e.g. JPEG, MPEG. 6.6 Run-Length Encoding. ------------------------ If we stretched the definition, run-length encoding could be considered a transformation; however, it is best to consider it as 'ad-hoc'. Consider the image in Figure 6.6-1 (a). Run length encoding exploits the highly repetitive nature of the image. It detects 'runs' of the same value, and codes the image as a sequence of: length-of-run, value;.... See Figure 6.6-1(b). 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16,0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16,0; 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 3,0;10,1;3,0; 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 3,0;10,1;3,0; 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 3,0;10,1;3,0; 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 6,0;4,1;6,0; 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 6,0;4,1;6,0; 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 6,0;4,1;6,0; 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 6,0;4,1;6,0; 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 6,0;4,1;6,0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16,0; Figure 6.6-1 (a) Direct storage (b) Run Length ------------------------------- -------------- Let us assume that the pixels can be any of 255 levels. The image is 11 x 16, therefore 176 bytes. Assuming that value and length-of-run can be quantised into 8 bits, run-length encoding reduces 176 bytes to 54 bytes. A form of run-length encoding is used in fax machines. First, the image is quantised to 1 bit (black-white), and then runs of black (or white) are determined. If we recognise the two-dimensionality of the image, we can run- length encode both horigontally and vertically. In its plain form, run-length encoding is error free - lossless. Ex. 6.6-1. Obviously, run length encoding will not provide any compresssion for 'busy' images (i.e. where the pixel values are rapidly varying, and there is very little correlation between adjacent pixels). Considering the model of a checker-board (alternate black (0), white (1), squares, and a 16 x 16 image, and starting with (1) a pure white image, next, (2) an image with 4 squares (white,black black, white) and so on, At what stage does run-length encoding start to INCREASE the data (i.e. more than 256 bytes)? Ex. 6.6-2. Assume images are first quantised to 1 bit, and that maximum run length is 128; this allows length-of-run, value to be code in ONE byte. Calculate the number of bytes for Figure 6.6-1. Ex. 6.6-3. Using the scheme of Ex. 6.6-2, recalculate the cut-off point for Ex. 6.6-1 (checker-board). 6.7 Quantisation Coding. ----------------------- For example, in the case of faxed documents, we agree that there are only two grey levels in the image (black=0, white=1). This we can quantise the source image to 1 bit. Thus, a saving of 8 times, (over 256 levels). If we have a pictorial image with 255 levels, but know that it will only ever be used to print on a newspaper with 16 grey levels (say), then we can quantise it to 4 bits. Choice of quantisation should depend on knowledge of the SOURCE, and of the RECEIVER - the expected use of the image. E.g. if the image is known to be from (source) a black/white page, then one bit is enough; likewise, if (any) image is only ever going to be printed on such a page, again one bit will suffice. From our knowledge of vision, we know that humans can, simultaneously, only cope with 160 levels at once; therefore we can restrict most images to 256 levels, or 8 bits. Nevertheless, in an X-ray image, a clinician may view parts of the image NON- simultaneously; X-ray images tend to be stored using 12 bits. In general, quantisation coding is LOSSY. 6.8 Source Coding. ----------------- (We mention only variable length codes, particularly the Huffman code.) 6.8.1 Variable Length Coding. ---------------------------- The basic principle of variable length coding is: use short codewords for frequent symbols, use longer for less frequent symbols; in the average your message will be shorter than for equal length codes. Note: this principle is not new - consider the Morse code, which used '.' for the most frequent symbol, the letter 'E'. However, ASCII uses an equal length code. Variable length codes can be usefully characterised by the average length of their codewords: (6.8-1) Lav = # pi. Li where Li is the length, in bits, of symbol i and, pi = probability of symbol i. The theoretically optimum variable length code would give Lav = Hav, the average entropy per symbol. However, in general, quantisation effects may reduce the efficiency of the variable length code (for the simple reason that we can use only integer numbers of bits; the optimum code may call for (say) 3.2 bits, we have to use 4, see Example 6.8-1). It is easy to show that an optimum code should use, for symbol z - with a probability p(z), a codeword of length as close as possible to log2(1/p(z)). Clearly, we will only Lav = Hav when all the 1/p(zi) are integer powers of 2 (see Example 6.8-1). In general, we can set bounds on the code length, L(z), for a symbol z: (6.8-2) H(z) <= L(z) < H(z) + 1 where H(z) is the entropy of the symbol. Example 6.8-1. Let the distribution of values in an image be: z 0 1 2 3 p(z) 0.5 0.25 0.125 0.125 The average entropy per symbol is given by: [Note: log2(1/2) = -1, log2(1/4) = -2, log2(1/8) = -3 ] Hav = -[ - 0.5log2(0.5) - 0.25log2(0.25) + 2x0.125log2(.125) = -[ - 0.5 - 0.5 - 0.75 ] = 1.75 ------- This compares with the 2 bits per pixel that would be required without source coding. If we allocate codewords as follows: z 0 1 2 3 p(z) 0.5 0.25 0.125 0.125 code 0 10 110 111 Length 1 2 3 3 we get an average codeword length of: Lav = 0.5x1 + 0.25x2 +0.125x3 +0.125x3 = 1.75 ------- i.e. the code is optimal. Note: it is just by chance that 0 -> 0 6.8.2 Unique Decoding. ---------------------- Obviously, we need codes to be uniquely decodable; and, we don't want the enormous overhead of 'start' and 'stop' codes ! At a first glance, the codes in Example 6.8-1 might look as if a sequence of them would be difficult to disentangle. Consider 1101110101110 well, in fact, this uniquely separates into 110 111 0 10 111 0 and, hence, can be uniquely decoded as: 2 3 0 1 3 0 Ex. 6.8-2. Using the results of Example 6.8-1, derive a variable length code for the following symbols - given their probabilities; hence compute the compression achieved. z 1 2 3 4 p(z) 0.5 0.25 0.125 0.125 Ex. 6.8-3. Using the results of Example 6.8-1, derive a variable length code for the following symbols - given their probabilities; hence compute the compression achieved. z 1 120 122 250 p(z) 0.5 0.25 0.125 0.125 Ex. 6.8-4. Using the results of Example 6.8-1, derive a variable length code for the following symbols - given their probabilities; hence compute the compression achieved. z 122 120 250 1 p(z) 0.5 0.25 0.125 0.125 6.8.3 Huffman Coding. -------------------- The Huffman coding algorithm is an efficient algorithm for optimal source encoding. The Huffman procedure is given as follows: 1. List the symbols, along with their probability. Elements of the list are nodes - we are building a BINARY tree. 2. Pick the two least probable nodes. 3. Make a new node out of these two - adding the probabilities. 4. Repeat steps 2, 3, until only one node remains - the root. (effectively, we are building a binary tree) 5. Bit allocation. Starting at the root, allocate 1 to one branch (left, say), and 0 to the other; until all branches have been allocated. 6. Read out the codes. Starting at the root, travel to the leaf that represents each level, reading off the bits that make up the code for that level. Example. 6.8-4. In an image, the 8 symbols z0, z1, ... z7 occur with respective probabilities: 0.4, 0.08, 0.08, 0.2, 0.12, 0.08, 0.03, 0.01; (a) derive the Huffman code, (b) derive the average entropy per symbol, (c) using the results of (a) derive the average length of the Huffman generated code symbols, (d) hence, compare the efficiency of the Huffman code with the optimum. ANS: Figure 6.8-1 (a) shows steps (1) to (4), and Figure 6.8-1 (b) step (5). Level z0 z1 z2 z3 z4 z5 z6 z7 Starting Probs. 0.4 0.08 0.08 0.2 0.12 0.08 0.03 0.01 ------ | 1st Iter 0.4 0.08 0.08 0.2 0.12 0.08 0.04 ---------- | 2nd Iter 0.4 0.08 0.08 0.2 0.12 0.12 --------- | 3rd Iter 0.4 0.16 0.2 0.12 0.12 ----------- | 4th Iter 0.4 0.16 0.2 0.24 --------- | 5th Iter 0.4 0.36 0.24 ---------------- | 6th Iter 0.4 0.6 -------------------- | 7th Iter 1.0 Figure 6.8-1 (a) Steps (1) to (4) of Huffman Procedure. ------------------------------------------------------- Level z0 z1 z2 z3 z4 z5 z6 z7 Code 1 0111 0110 010 001 0001 00001 00000 Probs. 0.4 0.08 0.08 0.2 0.12 0.08 0.03 0.01 | | | | | | | | | | | | | | +-+-+ | | | | | | 1 | 0 | | | | | +---+---+ | | | | | 1 | 0 | +--+----+ | | | | 1 | 0 | | | | | | +---+----+ | | | 1 | 0 | +---+---+ | | 1 | 0 | | +------+----+ | 1 | 0 +--------+---------+ 1 | 0 Root Figure 6.8-1 (b) Step (5) of Huffman Procedure. ------------------------------------------------------- Note: (1) the work shown above (Figures 6.8-1 (a) and 6.8-1 (b) would have been a lot easier, and clearer, if we had ordered the values / symbols according to probability, i.e. 0.4, 0.2, 0.12, 0.08, 0.08, 0.03, 0.01; the result would have been the same. (2) I hope it is clear that the abstract symbols z0..z7 could represent ANY actual values, e.g. 0, 1, 2, ..., 6, 7; or, equally validly: 24, 61, 93, 119, 121, 150, 200, 250. Rosenfeld and Kak, p. 187, gives an efficient computational version of this procedure; or Sedgwick p. 328. Despite first appearances, the Huffman code is UNIQUELY DECODABLE. Suppose a decoder receives, from the channel, the bit stream: 0110100001 Examine this from the left. Neither 0, 01, nor 011 correspond to any code in Figure 6.8-1(b). But, 0110 corresponds to z2. The next bit is 1, which corresponds, unambiguously to z0, etc. 0110 1 00001 z2 z0 z6 Notice that the length of the code corresponds, as closely as discrete (integer) lengths can do, to the entropy of the symbol / level. The average entropy of a message composed of the symbols above is (see eqn. 6.4-4): (6.8-3) Hav = - #pi*log2(pi) [Note: to compute log-to-the-base-2 (log2(), use the following: logB(x) = log10(B).log10(x); log10(2) = 3.322 Thus, log2(x) = 3.322 x log10(x) ] = -[ 0.4*log2(0.4) + 3*{ 0.08*log2(0.08) } + 0.2*log2(0.2) + 0.12*log2(0.12) + 0.03*log2(0.03) + 0.1*log2(0.01) ] = -[ 0.529 + 0.874 + 0.464 + 0.37 + 0.152 + 0.066 ] = 2.45 bits. ---------- This is the theoretical limit; NO source code can achieve better. An uncompressed code would use 3 bits (8 levels). For so few levels the savings ar not great, but as the number of levels gets greater, so do potential savings. But anyway, which would you prefer, to pay 300 pounds for something, or 245? The average length of the Huffman code words is: 1*0.4 + 3*4*0.08 + 3*0.2 + 3*0.12 + 5*0.03 + 5*0.01 = 2.52 bits ---------- which is not far off the optimum. Source encoding is error free. ------------------------------ Ex. 6.8-5. (a) Compute the histogram of the 10 x 10 image given below; (b) hence, compute the probability density; (c) hence, derive the average entropy per symbol / pixel; (d) derive a Huffman code for this image; (e) compute the average length of the Huffman code (averaged over all pixels in the image); (f) compare (e) with the theoretical optimum; (g) assuming the image is originally coded with 3 bit pixels, how many bits will the full image occupy? (h) neglecting the space occupied by the code table, how many bits will the Huffman code image occupy? (i) what saving will the Huffman code give? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 1 2 2 1 1 2 1 2 2 2 1 3 3 1 1 3 3 3 0 0 4 0 0 4 4 0 4 4 4 4 5 4 5 4 4 5 5 5 5 4 4 6 6 7 6 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [Hint: look carefully at Example 6.8-4] Ex. 6.8-6. (a) Compute the histogram of the 10 x 10 image given below; (b) hence, compute the probability density; (c) hence, derive the average entropy per symbol / pixel; (d) derive a Huffman code for this image; (e) compute the average length of the Huffman code (averaged over all pixels in the image); (f) compare (e) with the theoretical optimum; (g) assuming the image is originally coded with 8 bit pixels, how many bits will the full image occupy? (h) neglecting the space occupied by the code table, how many bits will the Huffman code image occupy? (i) what saving will the Huffman code give? 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 121 121 121 121 121 121 121 121 121 121 121 121 121 121 121 24 250 250 24 250 250 24 24 250 24 250 250 250 24 121 121 24 24 121 121 121 119 119 93 119 119 93 93 119 93 93 93 93 150 93 150 93 93 150 150 150 150 93 93 61 61 200 61 150 150 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 119 [Hint: look carefully at Ex. 6.8-5 and Example 6.8-4] 6.8.4 Some Problems with Single Symbol Source Coding. ---------------------------------------------------- 1. Symbols are NOT independent, redundancy exists in strings of symbols. Single symbol coding cannot exploit this potentially rich source of compression. Ideally, we would like to compute the probabilities for all pairs of characters {x1, x2}, hence requiring 65536 (why this number?) probabilities to be estimated; in this case, we would expect {'t','h'} to have a high probability value, and, as mentioned earlier, {'q','u'} to have nearly as high a probability as a single 'q'. But, we wouldn't stop at pairs, we would go on to triples {x, x2, x3} -- in which case {'i','n','g'}, {'t','h','e'} would be high in the league table; of course, the table of probabilities has now 256 x 256 x 256 entries = 16.9 x 10^6 which is getting a bit large! Ex. If, say, 'th' has a probability of 0.03125 (== 1/32 == 2^-5), we can code it with a 5-bit code. On the other hand, if we take single letters, 't' has probability 0.104, and 'h' 0.053; if we round these up to powers of two, we get p('t') = 0.125 == 1/8 == 2^-3, and p('h') = 0.0625 == 2^-4, i.e. for an optimum code, we get a total of 7 bits for the two of them. On the other hand, using p('th') == 0.03125, we arrive at an optimum code of 5 bits for the pair. 2. If the probabilities used to make the code are wrong, then compression performance can drop badly. And, since it is difficult to make such coding schemes adaptive, i.e. make them adapt to changing probabilities, then the coding can stray from optimum for large sections of the massage. For example, in this part of the notes, there is a lot of English, so the probabilities for normal English text may work O.K. However, if we switch to a C program, the probabilities will be wrong. 3. The average symbol length (Lav) of single symbol coding will achieve average entropy only if the probabilities are integer powers of 2; i.e. if we have a symbol i whose probability, pi, is 1/4, we can code it with -log2(1/4) = 2 bits; however, if we have probability, pi = 1/5, -log2(1/5) = 2.3, so we need to round up and use 3 bits. We waste 0.7 bits every time that symbol occurs. 6.8.5 Alternatives / Solutions. ------------------------------ (a) Dictionary Codes. -------------------- Here we have a dictionary agreed between encoder and decoder. Commonly used strings are held in the dictionary. When a dictionary string appears at the encoder, the encoder emits (1) an 'escape' code that says: dictionary code follows, (2) the dictionary code. Example. Assume a dictionary optimised for text about data compression, with entries as follows: Code String ---- ------ 0 the 1 encoder 2 decoder 3 code 4 dictionary 5 compression etc. Every time we encounter 'decoder' we send just two data - the 'escape' code, and the code '1'; hence, assuming we are limited to bytes, we can send 'decoder' for the cost of two bytes - as saving of (7 - 2) = 5 bytes. (b) Sliding Window Coding ------------------------- The dictionary method has the problem of lack of adaptivity; again, a dictionary set up for a textbook on data compression may not work too well for a novel, or for a C program. In sliding window compression, both encoder AND decoder keep a buffer containing the last N characters sent / received (typically N = 2048). The scheme works very similarly to the dictionary method, except now the decoder searches the buffer for the longest match between the incoming string and a string already in the buffer; if no match of length greater than (say) 2 is found, the characters are sent as single characters; however, if a match is found, an 'escape' code is sent, together with the position and length of the matched string. Hence, the buffer works as a sort-of adaptive dictionary. One of the Lempel-Ziv (LZ) compression schemes (LZ77) uses a sliding window method. Lempel and Ziv are responsible for another adaptive method - LZ78 - in which both the encoder and decoder builds a dictionary in the form of a tree. This is used in the CCITT standard V42 bis for data compression in modems. Most archiving / compression software uses some form of Lempel-Ziv code. E.g. DoubleSpace for PCs uses a sliding window compression. Originally, Huffman coding was the default choice -- for English text it gave a compaction of about 43%. However, Lempel-Ziv give 55% compaction (hence the name 'DoubleSpace'). However, dictionary and sliding window methods normally work much better for text, than for numerical data (e.g. image data) - where single symbol methods like Huffman, or arithmetic, see below, perform relatively well. The principle of Lempel-Ziv coding scheme can be seen from the following example. Essentially, in LZ, encoding is performed by parsing the source data stream into the shortest substrings that have not already encountered. The encoder and decoder keep in step and maintain codebooks that are identical. Example. (S. Haykin, Communication Systems, Wiley, 1994, pp. 629- 631). Input data stream: 000101110010100101... Assume that the symbols 0 and 1 are already in the codebook. Hence, we can proceed as follows: Sequences stored: 0, 1 Stream to be parsed: 000101110010100101... Shortest substring not yet encountered: -- Sequences stored: 0, 1, 00 Stream to be parsed: 0101110010100101... Shortest substring not yet encountered: -- Sequences stored: 0, 1, 00, 01 Stream to be parsed: 01110010100101... Shortest substring not yet encountered: --- Sequences stored: 0, 1, 00, 01, 011 Stream to be parsed: 10010100101... Shortest substring not yet encountered: -- etc. At the end of the data stream shown, we will end up with a situation as follows, where 'data sent' is simply the shortest- substring-not-yet-encountered coded as: previous-substring, followed by the 'new' symbol Numerical position / codebook index: 1 2 3 4 5 6 7 8 9 Substring/codebook entry: 0 1 00 01 011 10 010 100 101 Data Sent: 11 12 42 21 41 61 62 Binary data sent: 0010 0011 1001 0100 1000 1100 1101 ^ | ------------------------------------------------+ Here, we see that 1100 has been sent; the 110 represents '6' which tells the decoder where to look in its (identical) codebook (where it will find '10'. It should be easy for you to persuade yourself that frequently occurring very long strings can be 'learned' by the system, with consequent large savings. You should also convince yourself that Lempel-Ziv can be done 'on- the-fly', and, hence, is suitable for use in modems; again, the encoder and decoder must keep in step. (c) Arithmetic Coding. --------------------- Arithmetic coding is an alternative to Huffman coding that CAN cater for probabilities that are not integer powers of two (problem 3. of previous section). 6.9 Transform Coding -------------------- 6.9.1 General. ------------- In some of the literature, transform covers a multitude of processes; in what follows we will restrict to linear transformations - see Chapter 3. Generally speaking, linear transforms exploit the correlations between pixels; the remove the redundancy, by transforming into a domain where the values are not correlated. There is a statistical procedure called Karhunen-Loeve Transform (also called Hotelling Transform, Principal Components Analysis, Eigenvector Analysis, or Factor Analysis) which is optimal in producing decorrelated values. However, we have not the prerequisite statistics and mathematical background required to cover it. Also, no fast algorithms (O(nlogn)) exist for K-L. Currently, see section 6.14, the Discrete Cosine transform (DCT) is the most commonly used for compression; the DCT is a very close relative of the DFT (see Chapter 3). It can be shown that the DCT most closely approaches the K-L. In the past the Hadamard transform has been used. It is possible to justify transform compression mathematically, but we will avoid that. There are three intuitive ways of explaining transform compression. (1) Frequency selection. If the adjacent pixels of an image are correlated, this means that the pixel values change by only a small amount between neighbouring pixels; which, in turn, means that the lower frequencies in the image frequency spectrum are dominant. Therefore, we code the data in terms of frequencies, and we can throw away some of the high frequency elements; or, you can code the higher frequencies using less bits. Thus, the encoder takes the DFT of the input image, picks the dominant (say) 10% of frequencies and sends these. The decoder reconstructs the DFT 'image' (setting the missing 90% to zero), and takes the Inverse DFT. (There will be some overhead, since the encoder must tell the decoder which frequencies were left out). (2) Mathematical transformation. Simply by saying that we are transforming to a domain in which the values are less correlated. Since the information content must remain the same, we can represent the image, in the decorrelated domain, using fewer values; for more details, see any description of the Karhunen-Loeve transform. (3) Description in terms of basis images. The image is expressed in terms of its correlation with (similarity to) basis images; e.g. the image can be described as 0.5 parts of basis-image 1, 0.2 of basis-iamge 2, etc. You send the 0.5, 0.2, ... ; the receiver knows the basis-images and can reconstruct. If the transform used is Fourier or Cosine, these basis-images correspond to sine and cosine functions at different frequencies - see Chapter 3.8 (DFT). Generally, transform image compression is LOSSY. Although we have no time to mention them here, WAVELET transforms are now attracting great interest for image compression. 6.9.2 Subimage Coding. ---------------------- Usually transform coding is applied, not to the full image, but to subimages, e.g. 8 x 8, 16 x 16. See section 6.14. 6.9.3 Colour Image Coding. -------------------------- For naturally occurring scenes, there is usually very strong correlation between colours; do a scatter plot (function d2s) of the green and red bands (bands 1 and 2 - band 0 is infra-red) of the SPOT image in DataLab - you will see that the data lie very close to a straight line along the diagonal, thereby indicating very high correlation. Decorrelating transforms can help here, too. But, note that we are decorrelating the vector formed by the bands, NOT the vector/image formed by the spatial array of pixels. 6.10 Image Model Coding. ------------------------ Again, image model coding covers a multitude of techniques. We shall only give a flavour. A trivial example. Consider the 11 x 16 'T' image in Figure 6.6-1. This started off as 176 bytes. (a) Run-length encoding reduces to 54 bytes. (b) Simply quantising to 1 bit reduces to 176/8 = 22 bytes. (c) My guess is that the first 4 Fourier frequency components (out of 256) would produce an intelligible version of the 'T'. But, if we know we have a page of text, it can be reduced to 1 byte - the code for 'T' ! Or, if we know it is a graphics image with only rectangles, we can get away with 4 bytes per rectangle. 6.11 Differential and Predictive Coding. ---------------------------------------- Consider the following sequence of data representing (say) 12-bit speech data {si}: s1 s2 s3 s4 s11 300, 305, 312, 320, 324, 326, 327, 327, 323, 319, 310, ... This sequence is smoothly varying; compare with the rapidly varying: 300, 1092, 2935, 123, 4000, .... If we take differences di= si+1 - si, we have d1 d2 d3 d4 5 7 8 4 Assume that the maximum difference is 15, and that we can arrange to we can transmit the starting value (300). We can then transmit the signal as the {di} quantised to 4 bits, thus saving 8 bits per sample. (In practice differential coding works slightly differently - it needs to guard against one error destroying all the signal, as the above simple method is prone to). 6.12 Dimensionality and Compression. ------------------------------------ Because of the importance of correlation in image and signal data compression (roughly, correlation means that a pixel is strongly related to - correlated with - some of its neighbours, there is redundancy) multiple dimensions in data offer greater opportunities for data compression. The greater the dimensionality the greater the number of neighbours and therefore potential for correlation, and therby redundancy, along each dimension. Thus, familiar examples are: - audio signal - one dimension, f0, f1, ... fi-1, fi, fi+1, ... fi has 2 neighbours. - monochrome image - two dimensions, f(0,0) f(0,1) ..... .... f(r-1,c-1) f(r-1,c) f(r-1,c+1) .... f(r,c-1) f(r,c) f(r,c+1) .... f(r+1,c-1) f(r+1,c) f(r+1,c+1) f(r,c) has 8 neighbours. - sequence of mono. images (e.g. a movie) - three dimensions. each point has 26 neighbours. The sequence is: 2, 8, 26, i.e. (3^n - 1), where n is the dimensionality. It is well known that the compression factor increases quite non-linearly as you go down the above list; the more neighbours, the more correlation that can be exploited by the compression algorithm. Colour is a fourth dimension present in most image sequences - and this dimension offers its own correlation, and additional scope for compression. Incidentally, I would consider a single datum to be zero-dimensional, and, consequently, a collection of data whose only relationship is that they are members of a set would be zero-dimensional. 6.13 Vector Quantisation. ------------------------ Chapter 6.7 discussed quantisation for single grey levels, i.e. assuming a monochrome image. What if we have a colour image? In general a multispectral image where each pixel is represented by a vector of n bands. We could quantise each band separately. However, it would make sense to take account of the bands together, i.e. divide the 'space' formed by the n bands into k quantisation 'classes'. So, we need perform unsupervised classification (clustering) in k classes. k-means clustering (see chapter 7.5) is one method of doing this. Ex. 6.13-1. (a) Use the k-means Clustering algorithm to segment the image in Figure 6.13-1 into 2 regions (see Chapter 7.5), and hence explain how to compress it into 1 bit per pixel. Assuming 4 bit input data, what is the compression efficiency. Compute the root- mean-square error (see section 6.15). 1 2 3 1 3 2 3 1 2 3 2 3 1 3 2 3 1 2 3 1 3 1 3 2 3 8 2 3 1 2 1 2 3 7 8 9 9 8 7 1 2 3 1 8 9 9 8 7 7 2 3 1 2 9 9 8 7 7 8 3 3 1 2 9 9 8 7 7 8 3 1 2 3 1 3 2 3 1 2 3 2 3 1 3 2 3 1 2 3 1 3 1 3 2 3 1 2 3 1 2 Figure 6.13-1 ------------- (b) In addition to the one bit pixels, what data must you transmit? Answer: lookup table giving (code : mean value). (c) What changes if you have colour (multispectral) data.? (d) Given the result of (a) what other compression mechanism could you bring into play? [Hint: won't there be fairly long sequences of 0s and 1s?]. (e) What is the compression efficiency / rate for the combination of (a) and (d). Verify that the root-mean-square-error (see section 6.15) remains the same as for (a). (f) Would it make any sense to use Huffman coding on the result of (a) ? 6.14 The JPEG Still Picture Compression Standard. ------------------------------------------------- See Wallace (1991), reference given below. JPEG stands for Joint Picture Experts Group - a joint group formed by ISO and CCITT. ISO and CCITT realised that efficient compression is not enough - we need agreed standard methods. I.e. the decoder shouldn't have to write a new program for every image. Actually, there are three compression standards in the JPEG standard. One, the baseline (described below), is based on the Discrete Cosine Transform (DCT); this one is lossy. Another is also lossy but allows the user to specify the compression ratio required. The third is lossless. We will just give a flavour of the principles behind the JPEG baseline standard; it uses a concatenation of some of the methods mentioned in the previous subsections: (1) the image is split into 8 x 8 subimages and these 8 x 8 subimages are transformed using the DCT, (2) The DCT coefficients are quantised: e.g. starting with 8 bits for the DC coefficient, down to 7 for the verly low frequency components, down to 0 (i.e. the data are ignored) for the very high frequency terms, (3) The DCT components obtained from (2) are 'differenced' (see chapter 6.11) with respect to the corresponding component of the previous subimage, (4) The difference values obtained from (3) are Huffman encoded ( using a fixed preset Huffman code). Hard on the heels of JPEG, comes MPEG standard - Moving Picture Experts Group; MPEG standard would be used for such applications as our video-conferencing link. If High Definition TV (HDTV) is ever to become a reality, some form of moving picture compression will be essential. 6.15 Error Criteria for Lossy Compression. ----------------------------------------- For lossy compression schemes it is neccessary to specify how the loss / error is to be computed. Root-mean-square-error (RMSE) is a natural choice: (6.15-1) RMSE = sqrt [ # ( z - zc )^2 ] where zc is the value after compression, z is the (true) value before, and, the summation is over all pixels. 6.16 References. ---------------- G.V. Wallace "The JPEG Still Picture Compression Standard" Comm. ACM Vol. 34, No. 4 April 1991. Gives a good introduction to the current state of play in image compression. JPEG is Joint Picture Experts Group - a joint group formed by ISO and CCITT. E. Edwards Information Transmission Chapman and Hall 1969 (very easy, readable introduction to Information Theory) Gonzalez and Woods Chap. 6 M. Purser Data Communications for Programmers Addison-Wesley 1986 6.17 Additional Exercises. -------------------------- Ex. 6.17-1. Explain the use of the following methods by which the 16 x 16 image in Figure 6.17-1 may be compressed; give the compression factor (compressed bits / direct bits). Assume the source is 16 x 16 x 8 levels (3 bits) = 768 bits. (a) Run-length encoding (b) Image model coding (c) Quantisation. (d) Hadamard transform [Hint: see Gonzalez and Woods, Figure 3.26, p. 143), and shift and scale the pixel values so that the pixels are -1,+1 instead of 0,1]. 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 Figure 6.17-1. -------------- Ex. 6.17-2. (a) Explain how the image in Figure 6.17-2 may be compressed using a Huffman code. 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 6 6 6 6 6 6 6 6 Figure 6.17-2 ------------- (b) Assuming that the symbol probabilities are the same for all such images, show that the average entropy per symbol is 2.73. (c) Calculate the average entropy per symbol for your Huffman code. (d) Compare the compression achieved with the results obtained in Ex. 6.17-1. Ex. 6.17-3. Identify, with explanations, applications that (a) can tolerate lossy compression, (b) cannot tolerate lossy compression. end of chapter 6.