Compression of Embedded System Programs

Michael Kozuch and Andrew Wolfe
Department of Electrical Engineering, Princeton University

Abstract


Embedded systems are often sensitive to space, weight, and cost considerations. Reducing the size of stored programs can significantly improve these factors. This paper discusses a program compression methodology based on existing processor architectures. The authors examine practical and theoretical measures for the maximum compression rate of a suite of programs across six modern architectures. The theoretical compression rate is reported in terms of the zeroth- and first-order entropies, while the practical compression rate is reported in terms of the Huffman-encoded format of the proposed compression methodology and the GNU file compression utility, gzip. These experiments indicate that a practical increase of 15%-30% and a theoretical increase of over 100% in code density can be expected using the techniques examined. In addition, a novel, greedy, variable-length-to-variable-length encoding algorithm is presented with preliminary results.


1. Introduction

A system for encoding and decoding RISC programs has already been proposed [Wolfe92]. A preliminary study of embedded program compression in that paper showed that such a method is practical and can provide significant compression at a reasonable cost and with minimal impact on performance. This earlier work measures the effectiveness of a single class of compression algorithms on the MIPS architecture. That work is extended here in a more comprehensive exploration of the potential of program compression. Practical and theoretical compression rates are measured for 6 common 32-bit architectures. A new specialized compression method is then presented that may be more effective than existing algorithms for embedded program compression.

An embedded system may be loosely defined as a system which incorporates computer technology but is not, itself, a computer. Examples range from simple devices such as television remote controls to compute intensive systems such as automotive engine controllers and missile guidance systems. While these systems vary greatly in their cost, complexity, and function, they all share some common characteristics. The system’s primary functions are implemented by a microprocessor or microcontroller executing an essentially permanent stored program composed by the system designer rather than the user. These systems are also sensitive to many design constraints including limits on size, weight, power consumption, and cost.

2. Embedded System Program Compression

An ideal solution for the high-performance embedded systems market would be a processor that provides all of the recognized performance benefits of RISC architectures while also providing denser instruction storage than existing RISC implementations. In pursuit of this goal, we have been exploring methods for compression of programs in embedded systems. We have developed a method whereby embedded programs are stored in compressed form but executed from the instruction cache in the standard format. This provides increased storage density with only a minimal impact on processor performance.

2.1. Prior Research

Lossless data compression as a field of study is relatively mature. However, the best known lossless data compression methods such as Huffman coding [Huffman52] or Lempel-Ziv coding [Lempel76, Ziv77, Ziv78] are primarily designed for text compression. These algorithms are commonly used in off-line, file-based compression programs such as the UNIX utilities compress and gzip. While the theory behind these algorithms is valuable to embedded systems program compression, the existing file-based implementations are not suitable for run-time operation. In addition to these well-known methods, there has been considerable research in the area of lossy compression methods such as DCT-based [Petajan92] or fractal-based compression [Kocsis89]. These produce a much greater rate of compression than lossless methods; however, they are unsuitable for program compression.

The system consists of a standard processor core augmented with a special code-expanding instruction cache. A traditional compiler and linker are used to generate standard embedded object code. This object code is then compressed on the host development system using a code compression tool similar in principle to the UNIX compress utility. This compressed code is stored in the embedded system instruction memory. The reduced code size provides a cost savings for each production unit. At run time, the stored instructions are decompressed by the instruction cache refill engine. Code in the instruction cache appears to the processor as standard CPU instructions. No changes in operation or performance occur when instructions are found in the instruction cache. On a cache miss, the cache refill engine locates the compressed cache line and expands it. The cache refill time varies depending on the characteristics of the memory system, the coding method, the decompression hardware, and the compressed code.

The reduced program size means that system cost, size, weight, and power consumption can be reduced. Furthermore, fetching compressed instructions requires fewer bus cycles, further reducing power consumption. Even a slight reduction in program size will allow designers to incorporate additional features without increasing system memory. This can greatly increase product competitiveness.

Embedded system program compression has cost and speed constraints that differentiate it from typical file-based compression applications. Since embedded systems are highly cost sensitive and typically only execute a single program, it is not possible to include temporary storage for an uncompressed version of the program. Instead, the program must be decompressed on demand at run time, so that an uncompressed copy of the next instruction is always available. The system proposed in [Wolfe92] uses the existing instruction cache in high-performance processors as a decompression buffer, storing uncompressed copies of recently used fixed-sized blocks of instructions. These fixed-sized blocks are decompressed by the cache refill hardware whenever there is an instruction cache miss. Since the program must be decompressed in small fixed-sized blocks rather than the more common approach of decompressing the entire program from beginning to end, the most obvious compression methods require that each block be separately compressed. Furthermore, to retain high performance it must be possible to decompress a block with low latency, preferably no longer than a normal cache line refill.

This research has been partially supported by Motorola under Research Agreement JD979-1.




In an embedded system, it is not possible to decompress the entire program at once; therefore, a block-oriented compression scheme is required. The experiments we have performed are based on compressing 32-byte cache lines into smaller byte-aligned blocks as shown in Figure 1. A number of compression techniques are possible, but they all must allow for effective run-time decompression. Compression takes place at program development time; therefore, compression time is immaterial. Decompression time, however, directly impacts cache refill time and thus performance.
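As a concrete illustration of block-bounded compression, the sketch below (Python) compresses a text segment one 32-byte line at a time so that any single line can be expanded independently. It is a minimal model under stated assumptions, not the proposed hardware: zlib stands in for the per-line coder purely to make the sketch self-contained, and all function names are illustrative.

import zlib

LINE_SIZE = 32  # bytes per cache line

def compress_lines(text_segment: bytes):
    """Compress each cache line independently; store a line raw if coding would grow it."""
    blocks = []
    for off in range(0, len(text_segment), LINE_SIZE):
        line = text_segment[off:off + LINE_SIZE]
        coded = zlib.compress(line, 9)
        if len(coded) < len(line):
            blocks.append((True, coded))    # flag: line is stored compressed
        else:
            blocks.append((False, line))    # flag: line is stored as-is
    return blocks

def expand_line(block):
    """Model of the cache refill engine: recover the original cache line."""
    compressed, payload = block
    return zlib.decompress(payload) if compressed else payload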


In addition to the existing work on file-based compression, automated compression and decompression schemes have been implemented in general-purpose systems at slower levels in the memory hierarchy [Taunton91]. Automated file compression systems such as the DoubleSpace utility in MS-DOS 6.2 use file- and block-based compression to reduce the disk space requirement of files. A similar method is discussed in [Cate91] using compression within memory and disk to compress pages in a demand-paged virtual memory system. These disk-based systems use large data blocks, on the order of 4K-16K bytes, rather than the 16-64 byte blocks common in instruction caches. Furthermore, disk-based systems can tolerate decompression latencies on the order of 10-200 ms rather than the 50-500 ns latency that is tolerable in embedded system program compression. These differences in scale allow the effective use of Lempel-Ziv type algorithms implemented in either hardware or software for disk- or virtual-memory-based compression. Unfortunately, this class of algorithms does not appear to be practical or effective for short program blocks at cache speeds.

Figure 1. Block Bounded Compression: an 8-word, fully-aligned block (addresses ...00 through ...1C) is compressed into an n-byte unaligned block.

Maintaining full compatibility with existing code presents a problem when executing control transfer instructions such as jumps or procedure calls. The address of the jump target in the compressed code is different than it is in the uncompressed code. This problem is one reason why continuous file-based compression is impractical for direct execution. If a program branches to an instruction at a given address, how can that instruction be found in the compressed program? A specific jump target address in the original code may not even correspond to an addressable byte boundary in the compressed code. While it might be possible to place all jump targets on addressable boundaries and replace uncompressed code target addresses in the original code with the new compressed code target addresses, this introduces new problems. Jump targets that happen to be in cache would have different addresses than the same targets in main memory. Furthermore, programs often contain indirect or computed jump targets. To convert these addresses would require modifications to the address computation algorithms in the compiled code.

A program compression scheme based on dictionary compression has been proposed in [Devedas94]. This work presents some interesting ideas in software-only compression; however, the experimental results do not yet validate the methods. The experimental results are based on unoptimized assembly code without the inclusion of libraries. These examples contain far more redundancy than fully optimized applications. In fact, simply enabling the optimizer produces smaller code than the dictionary-based compression.

In-cache expansion solves most addressing problems. The address of a jump target in cache is the same as in the original uncompressed program. If a program jumps to a target that is not in cache, that target is brought into the cache before execution. This only requires that the processor locate the address of the beginning of each compressed cache line. This restricts each compressed cache line such that it must start on an addressable boundary. Some record of the new location of each cache line is required to map the program address of each block to its actual physical storage location.

2.2. Mechanisms for Program Compression

The key challenge in the development of a code compression scheme for existing microprocessor architectures is that the system must run all existing programs correctly. Furthermore, the performance of a compressed code processor should be comparable to that of a traditional processor. The use of instruction cache based decompression assures that these requirements can be met. All instructions are fetched through the instruction cache. Since they are stored uncompressed in cache, they can always be fetched from the original program address. Furthermore, since the vast majority of instruction fetches result in cache hits, the performance of the processor is unchanged for these instructions.

A new structure is incorporated into the cache refill hardware. The Line Address Table or LAT maps program instruction block addresses into compressed code instruction block addresses. The data in the LAT is generated by the compression tool and stored along with the program. Figure 2 diagrams the LAT for a 32-byte cache line.


This code, called a Preselected Huffman Code, is then built into the decompression logic rather than stored in memory. The effectiveness of this code is generally independent of block size; however, the embedded system compression mechanisms add additional overhead that limits the effectiveness of coding.

Figure 2. Line Address Table (indexed by cache line address).

Huffman codes suffer from inherent inefficiency whenever the frequency of occurrence of symbols is not exactly a negative power of two. This is a quantization effect caused by the requirement that an integral number of bits is used to code each symbol. This is further compounded in many cases by the fact that 8-bit symbols have been used rather than 32-bit symbols corresponding to the size of RISC instructions. This is necessary to reduce the complexity of the decoder. Despite these inefficiencies, our experiments show that the effect of these factors is small. The effectiveness of compression is also reduced by the fact that each compressed block must be stored on a byte-addressable boundary. This adds an average of 3.5 bits to every coded block.


Using a Line Address Table, all compressed code can be accessed normally by the processor without modifying the processor operation or the program. Line Address Table access increases cache line refill time by a small amount, at least one memory access time. This is not a major effect since it only occurs during a cache miss; however, this effect can be further reduced by using another small cache to hold the most recently used entries from the LAT. This cache is essentially identical to a TLB and in fact is called the Cache Line Address Lookaside Buffer, or CLB. In practice, the LAT is simply stored in the instruction memory. A base register value within the cache refill engine is added to the line address during CLB refill in order to index into this table. Figure 3 shows how the overall instruction memory hierarchy might be implemented in a typical system.
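A software model of that refill path is sketched below, assuming a LAT built by the compression tool as a list of (offset, length) pairs, a plain dictionary standing in for the CLB, and a caller-supplied expand() function; these names are ours, not the paper's.

LINE_SIZE = 32  # bytes per uncompressed cache line

class CompressedProgram:
    def __init__(self, lat, storage, expand):
        self.lat = lat          # lat[i] = (offset, length) of compressed line i
        self.storage = storage  # compressed program bytes, lines byte-aligned
        self.expand = expand    # decompressor for one line (e.g. Huffman decode)
        self.clb = {}           # cache line address lookaside buffer (toy model)

    def refill(self, miss_address):
        """Return the uncompressed cache line containing miss_address."""
        index = miss_address // LINE_SIZE
        entry = self.clb.get(index)
        if entry is None:            # CLB miss costs one extra LAT memory access
            entry = self.lat[index]
            self.clb[index] = entry
        offset, length = entry
        return self.expand(self.storage[offset:offset + length])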

Another factor contributing to coding overhead is the storage of the Line Address Table. Storing a full pointer to each compressed line would be prohibitively expensive; instead, we have used an ad-hoc compression technique to pack multiple pointers into each LAT entry, based on storing a base address plus the length of each compressed line. According to this design, the compressed cache lines are aligned on byte boundaries in the compressed program storage area, and the LAT provides a compact index into these compressed cache lines. Specifically, if the cache line size is l bytes, the address space is b bits, and each LAT entry provides a pointer to each of c cache lines, then each LAT entry occupies b + c⌈log₂(l)⌉ bits. Because each LAT entry locates c·l bytes, the overhead associated with this design is:

\text{overhead} = \frac{b + c \lceil \log_2(l) \rceil \text{ bits}}{c \cdot l \text{ bytes} \times 8 \text{ bits/byte}}

Figure 3. Overall Memory System Organization.

For the example of a 32-byte cache line, a 24-bit address space, and 8 cache lines per entry, the overhead is 8 bytes / 256 bytes = 3.125%. Since the number of bits available to represent the length of each line is limited, the compressed length of each line must be limited as well. One can limit the encoded length to no more than the original length by simply not encoding any line that is not reduced in size. This slightly improves the rate of compression.
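The overhead expression and the worked example can be checked with a few lines of arithmetic; the function name below is ours.

import math

def lat_overhead(b, c, l):
    """LAT bytes per byte of program located: (b + c*ceil(log2(l))) bits per c*l bytes."""
    entry_bits = b + c * math.ceil(math.log2(l))
    return (entry_bits / 8) / (c * l)

# 24-bit address space, 8 lines per 64-bit entry, 32-byte lines:
print(lat_overhead(b=24, c=8, l=32))  # 0.03125, i.e. 3.125%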

2.3. Codes for Program Compression

The specific requirements of embedded system program compression restrict the class of coding techniques that can be effectively used for compression. The critical requirement that small blocks of data must be separately compressed is the most difficult constraint. Many of the adaptive Lempel-Ziv algorithms that are effective for file compression simply do not provide good compression on small blocks. In fact, it is very difficult to get good compression on small blocks of programs.

This design has been shown to provide very good performance and moderate compression for the MIPS R3000 architecture in earlier work. However, it was still not known if this method is effective on a variety of architectures. It is also interesting to explore the possibility that there may be better coding methods for short blocks of program code. The next two sections explore these issues.

Most of the experiments we have done in embedded program compression rely on Huffman coding. Variable-length code words are selected to represent each byte in the original program. The length of each code word depends on the frequency of occurrence in the sample program. A Huffman code can be generated for each embedded system program; however, this requires that the code be stored along with the program. For embedded systems programs, we have found that it is preferable to build a fixed Huffman code for each architecture based on a large sample of typical programs.
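A minimal sketch of building such a preselected code from a training sample, using only the Python standard library; it produces a byte-to-bit-string table rather than the decoder tables a hardware implementation would need.

import heapq
from collections import Counter
from itertools import count

def build_preselected_code(sample: bytes):
    """Return a Huffman code (byte value -> bit string) fitted to a sample of programs."""
    freq = Counter(sample)
    tie = count()  # tie-breaker so the heap never compares symbol tuples
    heap = [(n, next(tie), (sym,)) for sym, n in freq.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in freq}
    while len(heap) > 1:
        n1, _, group1 = heapq.heappop(heap)
        n2, _, group2 = heapq.heappop(heap)
        for s in group1:
            codes[s] = "0" + codes[s]   # extend code words of the merged subtrees
        for s in group2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (n1 + n2, next(tie), group1 + group2))
    return codes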

3. Investigation

Given the advantages of compressing embedded system code and the feasibility of the proposed architecture, a further investigation into the compressibility of embedded system code is warranted.


The purpose of these experiments is to explore the compressibility of code on several modern architectures using traditional coding methods modified for embedded systems program compression.

On each architecture, an instruction extraction tool was created. These tools extract the actual instructions (text segment) from executable programs compiled on that architecture. After the program set was compiled on each machine, the instruction extraction tool was applied to each program to isolate the actual program. (This step eliminates the data segments, relocation information, symbol tables, etc.)
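On a present-day toolchain the extraction step might look like the sketch below, which assumes ELF executables and the third-party pyelftools package; the original tools worked on each machine's native object format.

from elftools.elf.elffile import ELFFile  # third-party: pyelftools

def extract_text(path: str) -> bytes:
    """Return only the instruction bytes (.text), discarding data, symbols, relocations."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        return elf.get_section_by_name(".text").data()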

Embedded system code is rarely portable among architectures. Therefore, the analysis of this paper is based on a set of computer benchmarks which we believe may be a suitable approximation to embedded system code. These benchmarks are taken from the SPEC benchmark suite, the UNIX operating system, and some speech encoding/decoding programs. The fifteen example programs are presented in Table I. This set was chosen for its mix of floating point and integer code, various size programs, and portability.

Table I. Benchmark Set

Program     Function
awk         pattern scanning/processing
dnasa7      floating point kernels
doduc       thermohydraulical modelization
eqntott     Boolean equation translation
espresso    Boolean function minimization
fpppp       quantum chemistry
grep        regular expression matching
gsmtx       GSM 06.10 speech CODEC
matrix300   matrix multiplication
neqn        typeset mathematics
sed         stream editor
tomcatv     mesh generation
uvselp      NADC speech coding
xlisp       lisp interpreter
yacc        yet another compiler compiler

Figure 4. Sum of Program Sizes for Each Machine (Normalized to the VAX 11/750).

For comparison, the uncompressed text sizes are presented in Figure 4, where the sum of the sizes in the test set is reported for each machine, normalized to the total size of the VAX 11/750. The native programs differ significantly in size based only on the architecture and compiler. This results both from differences in the instruction set encoding and in the speed/size tradeoffs made by the compiler, as well as differences in library code. This raises the interesting question of whether the less dense instruction sets contain more redundancy and thus are more compressible. We conducted a number of experiments to investigate this issue.


Each of the programs in Table I was compiled on six different architectures. This was intended to identify architectural differences (e.g., RISC vs. CISC) which might affect compressibility. The architectures used are shown in Table II. Five of these architectures are typical of current and future 32-bit high-performance embedded processor cores. The VAX is used as a reference point as a high-density instruction set.

The first experiment measures the entropy of each program using a byte-symbol alphabet. These entropy measures determine the maximum possible compression under a given set of assumptions. The zeroth-order entropy assumes that the probability of an occurrence of a given symbol a_i of alphabet S is given by p(a_i) and is independent of where the byte occurs in the byte stream. That is, bytes occur in random order, but with a certain distribution. The zeroth-order entropy over a source alphabet, S, is then given by:

H_2(S) = - \sum_{a_i \in S} p(a_i) \log_2 p(a_i)

Table II. Tested Architectures and OS

Architecture        Operating System
VAX 11/750          BSD UNIX 4.3
MIPS R4000 (SGI)    IRIX 4.0.5F
68020 (Sun)         SunOS 4.1.1
SPARC (Sun)         SunOS 4.1.3
RS6000 (IBM)        AIX 3.2
MPC603              System V

Each of the fifteen programs was compiled on each of the six architectures except sed, which would not compile on the RS6000 machine. The programs were compiled with the standard compiler on each machine (e.g., cc or f77) and with the highest optimization level allowed. Further, to ensure that the programs contained all the code to be run, each was compiled statically (i.e., no dynamically linked libraries). The same source code was used on each machine, so at an abstract level, the amount of computational work performed was identical on all machines.

(The subscript 2 in the entropy symbol, H_2(S), denotes that the encoded symbol stream is of radix 2, that is, the encoded stream is in bits.) The entropy of a source may be interpreted as the information content of that source, and hence may be viewed as a measure of the theoretical maximum compressibility of that source [Hamming80]. The entropy is dependent upon the model of the source assumed. Here, we have assumed a zeroth-order model.
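The zeroth-order measurement described above can be sketched as follows; the result is reported as a fraction of the original 8 bits per byte so it reads directly as a compression-ratio limit.

import math
from collections import Counter

def zeroth_order_entropy_ratio(text: bytes) -> float:
    """H_2(S)/8: theoretical compression-ratio limit under the zeroth-order byte model."""
    counts = Counter(text)
    total = len(text)
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h / 8.0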


For our program set, the aggregate entropy is greater (meaning less compression) than the average entropy, but this is not always true. In the context of a compressed program embedded system, the average entropy expresses the typical compression ratio (in the theoretical limit) for a program if the decompression engine is custom designed for that one program. The aggregate entropy represents the limit of average compression if a single code is used for all programs.

For our purposes, the alphabet, S, is the set of all possible bytes (8 bits). The probability of a byte occurring was determined by counting the number of occurrences of that byte and dividing by the number of bytes. The encoding tool first builds a histogram from the extracted program. The histogram leads directly to the probability distribution, and the entropy may then be calculated according to the above equation. If we change the model of our source, we must also determine a new form for the entropy. A more general source model is given if we assume that each byte is dependent upon the previous byte (as in a first-order Markov process). In this case, we have a set of conditional probabilities, p(a_j | a_i), which indicate the probability that symbol a_j occurs given that a_i has just occurred. This model is the first-order model and the entropy is given by the first-order entropy:

- \sum_i \sum_j p(a_i, a_j) \log_2 p(a_j | a_i)

Here, p(a_i, a_j) indicates the probability that the pattern a_i a_j occurs.


The first-order entropy of each program set was determined similarly to the zeroth-order entropy. The significant difference between the two measurements is that calculating the first-order entropy involves generating the n conditional probabilities for each of the n symbols in the alphabet, S. During processing, an n×n matrix, h, is generated, where h(i, j) is the number of a_j symbols which follow a_i symbols. The probability of a symbol occurring may then be given by:

p(a_i) = \frac{\sum_j h(i, j)}{\sum_i \sum_j h(i, j)}

Figure 5. Average Entropy for 6 Architectures (zero order vs. first order).

and the conditional probability of a symbol occurring is given by:

p(a_j | a_i) = \frac{h(i, j)}{\sum_j h(i, j)}

Further, the pattern probability may be found from:

p(a_i, a_j) = p(a_i) \cdot p(a_j | a_i)

Figure 6. Aggregate Entropy for 6 Architectures.
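The first-order measurement can be sketched the same way, building the successor matrix h and applying the probabilities defined above; this plain-Python illustration is not the tool used to produce the reported figures.

import math
from collections import defaultdict

def first_order_entropy_ratio(text: bytes) -> float:
    """First-order entropy per byte, divided by 8, using h(i, j) successor counts."""
    h = defaultdict(int)
    for a, b in zip(text, text[1:]):
        h[(a, b)] += 1
    pairs = sum(h.values())
    row = defaultdict(int)
    for (a, _), n in h.items():
        row[a] += n
    entropy = 0.0
    for (a, b), n in h.items():
        p_ab = n / pairs           # p(a_i, a_j)
        p_b_given_a = n / row[a]   # p(a_j | a_i)
        entropy -= p_ab * math.log2(p_b_given_a)
    return entropy / 8.0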

It is clear from the zeroth-order entropy numbers that simple compression methods like Huffman coding are not going to achieve very high rates of compression on this type of data. In fact, it appears that the MIPS instruction set originally studied is the most compressible using zeroth-order methods. This indicates that simple coding methods are likely to be inadequate for other architectures.

The average and aggregate calculated values of the entropy for the program set are shown in Figure 5 and Figure 6, respectively. The entropy may be interpreted as the maximum compression ratio, where the compression ratio is given by:

\text{Compression Ratio} = \frac{\text{Compressed Size}}{\text{Uncompressed Size}}

In order to measure the efficiency of Huffman coding on these architectures, we used the compression method described in Section 2 to actually compress programs from each architecture in cache-line sized blocks. Figure 7 describes the compression ratio obtained and the sources of compression overhead from 32-byte blocks using a 64-bit LAT entry for each 8 blocks. The coding overhead represents the inefficiency in the Huffman code caused by integral-length symbols.

The aggregate entropy is measured by generating occurrence statistics on the program set as a whole, and the average entropy is calculated by averaging (arithmetically) the separately measured entropy for each program.


Figure 7. Compression Efficiency.

This difference between the observed compression rate and the zeroth-order entropy limit is reduced by the ad-hoc optimization of not encoding blocks that would increase in length. Although this can result in better compression than predicted (negative overhead), it requires that the special case be flagged in the LAT. The blocking overhead is the result of byte alignment. This and the LAT overhead are constant across architectures, although sample variations occur in the data. The data shows that for all 6 architectures, the block compression method used is about 94%-95% as effective as the best zeroth-order code. This appears to be essentially constant, indicating that entropy is a good estimator for practical compression ratio for this type of code.
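The overhead terms of Figure 7 can be modeled roughly as below, assuming a byte-to-bit-string table `code` like the one sketched earlier and the 32-byte-line, 64-bit-LAT-entry configuration; this is an accounting sketch, not the measurement tool.

import math

def block_compressed_size(text: bytes, code: dict, line=32, b=24, c=8) -> int:
    """Bytes needed for Huffman-coded, byte-aligned lines plus the LAT."""
    total = 0
    for off in range(0, len(text), line):
        chunk = text[off:off + line]
        bits = sum(len(code[x]) for x in chunk)
        coded_bytes = math.ceil(bits / 8)        # blocking overhead: byte alignment
        total += min(coded_bytes, len(chunk))    # never store a line that grew
    entries = math.ceil(math.ceil(len(text) / line) / c)
    lat_bytes = entries * (b + c * math.ceil(math.log2(line))) // 8
    return total + lat_bytes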

3.1. More Aggressive Coding

The data from our experiments indicates that zeroth-order coding is only moderately effective at compression for most of the architectures we have evaluated. However, the first-order entropy data shows that much better compression is possible with more aggressive coding methods. To confirm this hypothesis, we also compressed each set of programs via the GNU file compression utility, gzip. Note that these values, shown in Figure 8, are better than those reported for the above zeroth- and first-order entropies because the gzip algorithm operates on a higher-order source model than we have considered. We have assumed that the decompression engine required by the higher-order compression techniques (Lempel-Ziv etc.) would be too expensive for implementation in embedded systems, but the potential for greater compression indicates that it is valuable to search for more aggressive compression methods that are feasible for embedded system compression.

3.2. Variable Symbol Length Encoding

These experiments have measured three indicators of compressibility: zeroth-order entropy, first-order entropy, and gzip compression. Compression methods that approach zeroth-order entropy provide fast, simple decompression and hence are more suitable for embedded system applications. However, the more aggressive compression methods are much more effective at reducing program size. Ideally, we would like to develop new coding schemes that are more effective than Huffman coding, work well on short blocks, and are easy to decode at high speed. In order to exceed the entropy limits established in Section 3, we must modify our model of the source. One key observation is that the traditional model of treating each 8-bit byte as a source symbol, while obvious for ASCII text, is rather arbitrary for programs. One possible modification is to use zeroth-order entropy coding with source symbols of other sizes. An investigation of various other symbol lengths is summarized in Figure 9. It might be argued that the natural symbol size for RISC instructions is 32 bits; however, constructing a hardware decoder for a Huffman code with 2^32 distinct symbols is impractical. Sixteen-bit symbols are also more effective than 8-bit symbols; however, they are also probably too expensive for embedded systems.
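The symbol-size experiment can be sketched by reinterpreting the program as fixed k-bit symbols and computing the zeroth-order entropy per source bit for each k; the choice of sizes in the comment is illustrative.

import math
from collections import Counter

def entropy_ratio_for_symbol_size(text: bytes, k: int) -> float:
    """Zeroth-order entropy per bit when the program is cut into k-bit symbols."""
    bits = "".join(f"{byte:08b}" for byte in text)
    symbols = [bits[i:i + k] for i in range(0, len(bits) - k + 1, k)]
    counts = Counter(symbols)
    total = len(symbols)
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h / k

# Example: evaluate k = 2, 4, 8, 12, 16 on an extracted text segment.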

Figure 9. Entropy vs. Source Symbol Size (zeroth-order entropy for 2-, 4-, 8-, 12-, and 16-bit source symbols; one curve per architecture: MPC603, RS6000, MIPS, 68020, SPARC, and Vax).

A detailed examination of instruction encoding for these architectures shows that instructions are comprised of a sequence of fields of varying lengths, ranging from 4 to 26 bits.

Figure 8. Gzip Compression Ratio.


This presents the possibility that the best method for encoding instructions is to create a set of symbols of varying lengths such that this set of symbols can be combined into strings to create any possible cache line. The encoding problem thus consists of two parts: selection of a set of variable-length symbols for encoding, and optimal partitioning of program blocks into those symbols. One possible set of symbols is the set of all 8-bit symbols; hence, the byte-symbol Huffman codes are a subset of variable-length codes.

One alternative to exhaustive search is to employ a greedy algorithm as in Figure 10. This algorithm successively searches for the next symbol to be included in the set until either the maximum set size is reached or the selection criteria are no longer met. A valid source symbol set must cover the source program. This requires that some sequence of the symbols in the set is equal to the original source program. Preferably, the source symbol set covers all legal programs.

Although variable-length codes may provide improved compression for embedded programs, the mechanics of generating these codes is computationally complex. Ideally, one would like to be able to select the optimal set of symbols such that a program encoded using those symbols is of minimum length. Unfortunately, this problem cannot be solved in reasonable time. Assuming that the maximum number of symbols to be included in the source model is n, and the maximum source symbol length is b bits, the exhaustive search yields

\sum_{i=1}^{n} \binom{2^{b+1} - 2}{i}

possible symbol sets. This is simply the enumeration of all distinct sets of the 2^{b+1} - 2 strings of length b or less, such that the set contains at least 1 and no more than n members. For the relatively small example of b=16, n=32, this results in on the order of 10^128 possibilities, while the more reasonable b=32 and n=256 is not directly computable in 64-bit double-precision arithmetic. The difficulty of this problem is compounded by the fact that the evaluation of the quality of a symbol set involves actually encoding the program sample and measuring the encoded size. The optimal encoding given a set of variable-length symbols also requires exponential time to compute. Consequently, with present computer technology, exhaustive symbol set evaluation is infeasible. The obvious alternative is to use heuristics to search and evaluate a subset of this design space in the search for a very good, if not optimal, set of coding symbols. We are experimenting with greedy algorithms to construct source symbol sets.

The Priority() function used in FindBestSymbol() uses the number of occurrences of a symbol, count, and the symbol size in bits, size, to determine a priority for inclusion of the symbol in the symbol set (the symbol with the highest priority is to be included). This priority is an estimation of the savings that inclusion of the symbol represents. Two versions of the Priority() function are under evaluation. The first version estimates the total number of bits saved from the sample program by including a given symbol:

\text{Priority}_{\text{savings}} = \text{count} \cdot \left( \text{size} - \left\lceil \log_2 \frac{\#\text{ of symb. of length } size}{\text{count}} \right\rceil \right)

The second version estimates the compression rate of the subset of the source data that will be encoded by a given source symbol. If this priority function is adopted, an inclusion threshold is required. Since only a limited number of symbols can practically be included in the source symbol set, only those symbols that provide a good compression rate and cover a significant portion of the source program should be included.

\text{Priority}_{\text{rate}} = \frac{\text{size} - \left\lceil \log_2 \frac{\#\text{ of symb. of length } size}{\text{count}} \right\rceil}{\text{size}}

After a symbol set is selected, the resulting symbols are Huffman encoded to produce a uniquely decipherable coded symbol set. Figure 11 reports results for several programs compressed under both versions of the priority function. Since the source program is encoded on a cache-line basis, each cache line may be encoded using an exhaustive search algorithm.
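Both priority estimates can be written directly from the expressions above; here count is the number of occurrences of a candidate symbol, size its length in bits, and n_size the total number of symbols of that length in the sample. The code-length estimate ceil(log2(n_size/count)) and the helper names are our reading, not text from the paper.

import math

def estimated_code_bits(count: int, n_size: int) -> int:
    """Approximate code length for a symbol seen count times out of n_size of that length."""
    return math.ceil(math.log2(n_size / count))

def priority_savings(count: int, size: int, n_size: int) -> float:
    """Estimated total bits saved by including the symbol."""
    return count * (size - estimated_code_bits(count, n_size))

def priority_rate(count: int, size: int, n_size: int) -> float:
    """Estimated fractional size reduction of the data covered by the symbol."""
    return (size - estimated_code_bits(count, n_size)) / size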

symbol FindBestSymbol() {
    symbol temp, s;
    priority temp_w, w = 0;
    for (all symbol lengths) {
        GenerateHistogram();
        temp = GetMaxSymbol();
        temp_w = Priority(temp);
        if (temp_w > w) {
            s = temp;
            w = temp_w;
        }
    }
    return s;
}

SymbolSetSelect() {
    while (bitstream not exhausted) {
        symbol s = FindBestSymbol();
        for (all occurrences of symbol s in bitstream)
            MarkSymbolUsed();
        SetInclude(s);
    }
}

Figure 10. Greedy Symbol Selection Algorithm.
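A runnable approximation of the Figure 10 loop, operating on a bit string with the savings priority; the candidate symbol lengths, the inclusion bound, and the use of simple substring removal to mark used occurrences are simplifications of our own.

import math
from collections import Counter

def find_best_symbol(bits: str, lengths=(4, 8, 12, 16)):
    """Greedy step: the symbol (of any candidate length) with the highest savings priority."""
    best, best_w = None, 0.0
    for size in lengths:
        hist = Counter(bits[i:i + size] for i in range(0, len(bits) - size + 1, size))
        if not hist:
            continue
        sym, n = hist.most_common(1)[0]
        n_size = sum(hist.values())
        w = n * (size - math.ceil(math.log2(n_size / n)))  # savings priority
        if w > best_w:
            best, best_w = sym, w
    return best

def symbol_set_select(bits: str, max_symbols=256):
    chosen = []
    while bits and len(chosen) < max_symbols:
        s = find_best_symbol(bits)
        if s is None:
            break
        chosen.append(s)
        bits = bits.replace(s, "")  # crude stand-in for MarkSymbolUsed()
    return chosen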

Figure 11. Variable Length Symbol Compression (compression rates for awk, eqntott, grep, gsmtx, neqn, sed, uvselp, xlisp, and yacc under both the Savings Priority and the Compression Rate Priority functions).


4. Conclusions and Future Work


This paper presents several experiments concerning the effectiveness of program compression for embedded systems. Simple comparisons of six 32-bit architectures show that significant variations in program size exist between architectures using native program coding. Despite the large variations in program density, compressibility varies only moderately among architectures and appears to be uncorrelated to uncompressed program size. Simple compression methods such as Huffman coding and its variants provide only moderate compression on all of these architectures; however, first-order entropy analysis and Lempel-Ziv based coding demonstrate that better compression is possible.


In order to discover an improved coding method for embedded systems programs, a class of variable source symbol length codes is considered. The complexity of determining optimal variable source symbol length codes is shown to be intractable; however, two greedy heuristics are evaluated. These heuristics provide compression that is no better than common coding methods. The most obvious area for continued research is the development of more effective heuristics for variable source symbol length codes. In addition, compiler-based methods are under investigation to improve compressibility. Instruction selection, register allocation, and instruction scheduling may all be optimized for low entropy. This may improve practical compression rates. This may be extended into earlier optimization stages of the compiler in order to increase the similarity among program structures and thus increase opportunities for entropy reduction. Concurrently we are investigating opportunities for using the compiler to reduce program size prior to compression. This primarily involves detection of opportunities to combine similar code sequences. Finally, some investigations could be made into designing instruction sets which possess good compressed program characteristics.

5. References

[Cate91] V. Cate and T. Gross, "Combining the Concepts of Compression and Caching for a Two-Level Filesystem," Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, April 1991.

[Devedas94] S. Devadas, S. Liao, and K. Keutzer, "On Code Size Minimization Using Data Compression Techniques," Research Laboratory of Electronics Technical Memorandum 94/18, Massachusetts Institute of Technology, 1994.

[Hamming80] R. W. Hamming, Coding and Information Theory, Prentice-Hall, Englewood Cliffs, NJ, 1980.

[Huffman52] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, Volume 40, pp. 1098-1101, 1952.

[Intrater92] G. Intrater and I. Spillinger, "Performance Evaluation of a Decoded Instruction Cache for Variable Instruction-Length Computers," Proc. of the 19th Symposium on Computer Architecture, IEEE Computer Society, May 1992.

[Kocsis89] A. M. Kocsis, "Fractal Based Image Compression," Twenty-Third Asilomar Conference on Signals, Systems, and Computers, Volume 1, pp. 177-181, 1989.

[Lempel76] A. Lempel and J. Ziv, "On the Complexity of Finite Sequences," IEEE Transactions on Information Theory, Volume 22, pp. 75-81, 1976.

[Petajan92] E. Petajan, "Digital Video Coding Techniques for US High-Definition TV," IEEE Micro, pp. 13-21, October 1992.

[Storer88] J. A. Storer, Data Compression: Methods and Theory, Computer Science Press, Rockville, MD, 1988.

[Taunton91] M. Taunton, "Compressed Executables: An Exercise in Thinking Small," Proceedings of the Summer 1991 USENIX Conference, pp. 385-403.

[Wolfe92] A. Wolfe and A. Chanin, "Executing Compressed Programs on an Embedded RISC Architecture," Proc. of Micro-25: The 25th Annual International Symposium on Microarchitecture, 1992.

[Ziv77] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Volume 23, pp. 337-343, 1977.

[Ziv78] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Transactions on Information Theory, Volume 24, pp. 530-536, 1978.

