Improving bitstream compression by modifying FPGA architecture

July 5, 2017 | Autor: Morteza Zamani | Categoria: Entropy, Compression

Descrição do Produto

Improving bitstream compression by modifying FPGA architecture

ABSTRACT The size of configuration bitstreams of field-programmable gate arrays (FPGA) is increasing rapidly. Compression techniques are used to decrease the size of bitstreams. In this paper, an appropriate bitstream format and variable symbol lengths are proposed to utilize the routing patterns for enhancing the compression efficiency. An order of inputs of multiplexers in switch modules is also proposed to improve the symbol statistics and hence, the compression efficiency. A framework to generate the bitstream and hardware description of FPGAs are developed as well. Experimental results over 20 MCNC benchmarks show that by applying the proposed approaches, the compression rate is improved by 46% on average compared to the methods with fixed symbol lengths.

Categories and Subject Descriptors B.7.1 [Integrated Circuits]: Types and Design Styles—Gate Arrays

General Terms Performance, Design.

Keywords Bitstream, compression, entropy.

1. INTRODUCTION Field-programmable gate arrays (FPGA) are widely used due to their reconfigurability, short time to market and ease of use. FPGAs are continuously growing to support various modern applications. As a result, the size of configuration bitstreams is increasing rapidly. Nowadays, there are some FPGA families with bitstreams longer than 400 Mega bits. A bitstream is usually saved in an expensive nonvolatile memory chip on an FPGA board. Longer bitstream means larger and more expensive memory chips, and also long time to load the bitstream from the on-board memory which is undesirable in some applications. To reduce the size of bitstream, they are compressed. Most compression techniques firstly convert the segments of bitstream into symbols, and then compress the produced stream of symbols. Several bitstream compression techniques have been presented. However the effects of routing resource structure on the bitstream compressibility have not been studied. Authors of [1] modified the placement and routing tool to shrink the area that the design bitstream spreads in. Hence, the size of the bitstream decreases because of the reduction in the number of frames used by the design. A method was proposed in [2] to reduce the configuration bits of the lookup tables by removing redundancies. In [3], various compression methods such as Huffman, arithmetic coding and LZW are applied to the Xilinx Virtex II bitstream, with the assumption of fixed symbol length of either 6 or 9 bits. In [4], the authors introduced a decoding-aware compression method that is a combination of bitmask-based compression techniques

and run-length encoding (RLE). They used a fixed symbol length of 8 bits. Authors of [5] divide the bitstream of an FPGA into some regions, i.e. logic, routing resources, and IO blocks, with different statistics and compress those regions separately. Different region in bitstream are identified by region filter. In this paper, we present two modifications in structure of the bitstream and routing resources to enhance the efficiency of the subsequent compression methods. These modifications does not necessitate any changes in either the place and route tool or the interconnect architecture. Hence, our method does not affect the performance of the design and the area of FPGA. Our contributions are as follows: 1- A proper order for the configuration bits and an appropriate bit length for the symbols are found in a way that the routing patterns can be utilized appropriately for a compression method. 2- The inputs of multiplexers in switch modules are reordered to increase the number of identical symbols in the symbol stream. It is noteworthy that more compression can be obtained for streams with a large number of identical symbols. Our input reordering method makes it possible to represent the frequently-used similar routing patterns by similar symbols. Such an approach can produce identical symbols for a large part of the stream. To verify the efficiency of our approach, we developed a framework based on VPR 5 CAD tool [6]. This framework is used to (a) generate the gate level netlist of the basic tiles of the target FPGA that is described by VPR 5 and (b) to produce bitstreams for the circuits that are placed and routed by VPR. To make the evaluation of out method independent of the compression techniques, we used the concept of Shannon entropy [7]. Shannon entropy presents the lower bound on the effectiveness of entropy-based compression methods such as Huffman and arithmetic coding. The rest of the paper is organized as follows: Section 2 provides a background for FPGA architectures and their programmer circuit, and explains the concept of entropy. The proposed approach is described in Section 3. The framework and experimental results are presented in Section 4. Finally, we conclude the paper in Section 5.

2. BACKGROUND 2.1 FPGA architecture Island-style FPGAs have a regular structure containing a twodimensional array of CLBs within a large set of routing resources. Therefore, the chip can be viewed as a set of basic tiles. Figure 1.a illustrates one of the basic tiles of the FPGA that consists of a CLB and its routing resources. The CLB contains a number of basic logic elements (BLE) which are programmed to implement logic functions. The routing resources, which include a connection block and a switch module, are unidirectional and consist of wire segments, multiplexers and configuration SRAM cells. The wire segments lie in horizontal (CHANX) or vertical (CHANY) channels. Half of the wire segments in each channel are increasing (decreasing) that is their directions are

from low (high) to high (low) coordinates in the FPGA. There are a large number of multiplexers in the FPGA that are mainly used for routing, and a dominant part of the configuration SRAM cells are dedicated to them. Multiplexers can have tree, flat or hybrid structures. In this paper, we focus on the tree-shaped multiplexer structures. Nevertheless, our approach can be extended to FPGAs with flat or hybrid multiplexer structures.

the transferred data, that is the multiplication of entropy and symbol count, is 6.48. However, if each pair of consequent bits is considered as a symbol (symbol length=2), the symbols will be '00','01' and '01' and the entropy, the number of symbols and the total size will be 1, 4 and 4, respectively.

a)

As described in the previous section, the bitstream format, the symbol length, and the frequency of symbols have a significant effect on the entropy. Therefore, in this paper, we attempt to modify the bitstream format, set an appropriate symbol length and increase the occurrence of frequent symbols to reduce the entropy. To do this, we modified the bitstream and the routing resource structure in two ways:

b)

Enable

0 1 2 3 Write_enable Core AND gate

SRAM

Figure 1. a) One of the basic tiles of the FPGA b) The FPGA programmer SRAM cells are used to configure the FPGA programmable modules. During the configuration, the SRAM cells are loaded with the bits of the bitstream using data and write lines. An on-chip programmer circuit is used to feed the FPGA with the bitstream. The structure of a typical programmer is shown in Figure 1.b. As illustrated in this figure, the programmer consists of two shift registers (i.e. Data_in and Write_enable) and a core. Data_in and the Write_enable registers are connected to the data and write lines of the SRAM cells. During the configuration, the core pushes the bitstream into the Data_in shift register, and then loads them into the SRAM cells by activating the appropriate write_enable and Enable line. After loading the data of one column into SRAM cells, the Enable line is deactivated to prepare the circuit for loading another column of the configuration bits. This procedure continues until all of the configuration SRAM cells are loaded with the bitstream.

2.2 Shannon entropy Shannon entropy is a criterion for measuring the amount of information. If the entropy of data is low, the amount of information stored in the data is small and it can be compressed better. Entropy presents a lower bound for the efficiency of entropy-based compression methods as well. Huffman and arithmetic coding are two popular entropy-based compression methods. Entropy is defined by Equation 1.

E = −∑ [ p(i ) × log 2 ( p (i ))]

(1)

i

where

p (i ) is the probability that symbol i occurs in the data. The of E shows the average number of bits needed to represent a

value symbol. Entropy-based compressors reduce the size of data because they encode symbols with higher frequencies by shorter codes. If the data has some more frequent symbols, the entropy would be lower, hence, more compression rate. For example, if each letter represents a symbol, the entropy of the string “aaaab” will be lower than the entropy of “aaabb” because of more occurrence of symbol “a” in the former string. The method of representing the data (i.e., bitstream format and symbol length) has a significant effect on the symbol statistics and entropy. For example, let the bitstream be “00010001”. If each bit is used as a symbol (symbol length=1), then the entropy and the number of symbols will be is 0.81 and 8, respectively. Therefore, the total size of

3. THE PROPOSED APPROACH

1- The configuration of each multiplexer is represented by individual symbols. To do this, the bitstream format (i.e. the order of configuration bits in the bitstream) are changed in a way that the configuration bits of each multiplexer can be placed together. Furthermore, we did not use a fixed symbol length. The symbol lengths are set to the number of configuration bits of each multiplexer. To modify the bitstream format, the Write-enable and Data_in lines of SRAM cells are rearranged. 2- The structure of switch module multiplexers are modified to have better symbol statistics in the stream, which leads to an increase in the occurrences of more frequent symbols and a decrease in the occurrences of others. Our approach is based on the fact that some routing patterns occur more frequent than others. To have better symbol statistics, the frequent routing patterns are represented with identical symbols by arranging the input of the switch module multiplexers. As a result, the occurrences of those identical symbols will increase and the other ones will decrease. Therefore, the generated bitstream will be more compressible.

3.1 The bitstream format and symbol lengths A simple method to represent the routing patterns that could be used for compression is to have them in individual symbols. To this end, the configuration bits of each multiplexer should be kept together. Therefore, the interconnections of SRAM cells are modified in a way that the configuration SRAM cells of each module have a common Write-enable line and chronological Data_in lines. If the SRAM cells of a multiplexer have different Write_enable lines, then according to the structure of the programmer in Figure 1.b, their configuration bits will place in different columns. Moreover, if they have a common Write_enable line but they are not arranged contiguously, their configuration bits are placed within a column but they may be separated. In both of these cases, the configuration bits of a multiplexer could not be represented by one symbol because they do not take place sequentially in the bitstream. The only way that an individual symbol can be used to configure a multiplexer is to rearrange its configuration SRAM cells to have a common Write_anable line and contiguous sequential Data_in lines. As described before, the symbol length has a significant effect on the compression rate. By choosing an appropriate symbol length, the configuration of each multiplexer could be represented by a symbol so the symbol statistics and the compression rate can improve. In our approach, the configuration bits of each multiplexer constitute a single symbol. Due to the different sizes of multiplexers in the FPGA, a unique symbol length cannot be used for all multiplexers. An example is shown in Table 1. Assume that a segment of the bitstream is “x0x1x2y0y1y2y3z0z1z2z3z4” where mi (m ∈ {x, y, z}) is the ith configuration bit of a multiplexer M. If the symbol length is set to a fixed value of 3 or 4 bits, the symbols cannot represent the configuration patterns of the multiplexers. On the contrary, if the symbol length is variable, and equal to the number of configuration bits, the symbols will represent the patterns of the related multiplexers. However, different symbol lengths imply that the length of all symbols

Figure 2. The block diagram of the proposed bitstream compression and decompression method must be saved in masks to make the bitstream decompression possible. Nonetheless, due to the tileability of FPGAs, the overhead of saving the symbol lengths is fairly low because a single mask could be used for multiple tiles. Our FPGA consists of 25 basic tiles for which the symbol lengths must be defined. All tiles are generated by replicating the basic tiles. Table 1. The number of switch module routing patterns of seq.net bitstream 3-bit symbol 4-bit symbol Variable length symbol

x0x1x2y0y1y2y3z0z1z2z3z4 x0x1x2, y0y1y2, y3z0z1, z2z3z4 x0x1x2y0, y1y2y3z0, z1z2z3z4 x0x1x2, y0y1y2y3, z0z1z2z3z4 Related mask: 3, 4, 5

To prepare the variable length symbols for compression, they are represented by their maximum length. For example, as illustrated in Figure 2, let the length of symbols for the configuration of switch module multiplexers be 3, 4 and 5 bits. First, the symbols are represented with 5 bits by appending extra zeros as the most significant bits. Afterward, the compressor encodes these fixed length symbols. Inside the FPGA, after decoding the compressed bitstream using the decompressor, the real sizes of symbols are restored using the appropriate symbol length masks and the extra zeros are eliminated. In the case that the compression method needs to transfer the symbol probabilities, as in Huffman and arithmetic coding, if the symbol length is long, then the size of symbol probabilities will be large (e.g. 216 for a symbol length of 16). Therefore, it is better to use multiple shorter symbols instead of a long symbol. To this end, we divided the large symbols into smaller symbols. For example if a symbol length is 16 bits, it is divided into four 4-bit symbols. Although appending the extra zeros increases the number of bits before compression, it reduces the size of the compressed bitstream due to the improved symbols statistics.

3.2 The order of inputs for switch modules’ multiplexers In FPGAs with bidirectional routing resources, some routing patterns in the design are more frequent than the others [8]. We obtained the same results for the FPGAs with unidirectional routing resources in our experiments. Assigning identical symbols to frequent patterns increases the number of identical symbols and makes the final bitstream more compressible. To choose the best order for the multiplexer inputs, we extracted frequently-used routing patterns using statistical information obtained from VPR5 place and route tool. The target FPGA in our experiments has ten 4-input LUTs in each CLB, and each wire segment spans one CLB. The switch module architecture is Wilton and the flexibility of switch modules (Fs) is 3. All wire segments are unidirectional. The routing patterns are categorized into straight (CHANX_to_CHANX and CHANY_to_CHANY segments), orthogonal (CHANX_to_ CHANY and vice versa) and starting (output port of CLB into the switch module multiplexers). Each routing pattern has a wire segment or the output port of a CLB as its source, and a wire segment as its destination. Each wire segments is either increasing or decreasing. As can be seen in Table 2, straight patterns occur much more frequent than the other patterns for the attempted benchmark (seq.net). Each multiplexer can produce only one straight pattern, so the inputs of all

the multiplexers are ordered in a way that a unique symbol (e.g. “0000”) represents all straight patterns. Next to the straight patterns, the orthogonal patterns are more frequently used than the rest of patterns. Therefore, identical symbols are used for representing such patterns to improve the compression rate further. In FPGAs with segment lengths of 1, two orthogonal patterns can be produced for each multiplexer. Thus, two symbols (e.g. “0001” and “0010”) are sufficient to represent all orthogonal patterns. The 10 different outputs of each CLB lead to different starting pattern for the multiplexer that connected to that CLB. Different symbols must be assigned to each of these starting patterns. As a result, each distinctive starting pattern occurs about 10 times less than 1436 which implies a less occurrence frequency than that of straight and orthogonal patterns. Some of switch module multiplexers do not have required connections for particular routing pattern, for example straight pattern in fringe tiles. For these multiplexers, symbols with smaller values are assigned to patterns with the higher probabilities of occurrence, for example, "0000" and "0001" are assigned to the orthogonal patterns in multiplexers that do not have straight pattern. Table 2. The number of switch module routing patterns of seq.net Destination direction Source direction Straight Orthogonal Starting

Mux in X direction Mux in Y direction increasing decreasing increasing decreasing Sum Inc. Dec. Inc. Dec. Inc. Dec. Inc. Dec. 2066 0 0 1327 1652 0 0 1951 6996 504 281 543 466 609 534 308 424 3669 352 400 340 344 1436

This approach is explained by a simple example. Consider a switch module architecture as illustrated in Fig. 3.a. The corresponding symbols are “0101”, “1010”, “0010”, and “0000” to form horizontal and vertical straight patterns, respectively. Assume that each of these patterns appears in the design by an average of α times. Therefore, each symbol occurs in the bitstream for at least α times. If the order of multiplexer inputs is modified into what is shown in Fig. 3.b, then the horizontal and vertical straight patterns can be represented by “0000”. As a result, there will be only one symbol for straight patterns which occurs in the symbol stream 4*α times and will lead to a better symbol statistics and an improved compression rate. It is clear that the symbols associated with patterns in multiplexers cannot be changed after fabrication since they depend on the orders of the multiplexer inputs that are hard wired.

Figure 3. a) A routing multiplexer (a) without and (b) with ordered inputs and the corresponding symbols for straight patterns

4. EMPERIMENTAL RESULTS We developed a framework which generates the gate level netlist of basic tiles, the whole FPGA and circuit bitstreams based on architecture file used by VPR 5. The framework supports various unidirectional architectures (i.e., different FPGA sizes, CLB sizes, channel width sizes, switch module topologies, segment lengths, Fc and Fs). As the default architecture described by VPR is not tileable, in this paper the tool was modified slightly to generate tileable architectures. The modified tool is used to place and route designs with the minimum channel width. Subsequently, the HDL code of the basic tiles, the hardware description of the FPGA in Verilog, and the bitstream is generated using the developed framework. After generating the bitstream, the entropy of the bitstream is calculated using Equation 1. We exclude the don’t care cases (unused resources) for a better illustration of our idea. The architecture attempted in the experiments was the same as the one used in Section 3.2. Table 3 shows the entropy of bitstreams for 20 MCNC benchmarks. The rate of compression is obtained by dividing the entropy of the bitstream by the length of symbol. In our experiments (the proposed approaches), the maximum length of symbols is chosen to be 4 and the shorter symbols are converted to 4-bit symbols, as described in Section 3.1. Table 3. The size of compressed bitstream in various approaches benchmark

3-bit

4-bit

5-bit

6-bit App.1

tseng.net

2.65

3.55

4.45

5.13

2.66

App.1+A pp.2 2.08

diffeq.net

2.65

3.52

4.46

5.1

2.46

1.95

s298.net

2.66

3.22

4.4

4.84

2.22

1.73

frisc.net

2.65

3.46

4.41

4.98

2.28

1.74

s38417.net

2.61

3.15

4.28

4.52

2.27

1.83

clma.net

2.61

2.97

4.3

4.84

2.05

1.6

ex5p.net

2.7

3.37

4.53

5.29

2.55

1.9

apex4.net

2.67

2.92

4.47

4.96

2.52

1.85

misex3.net

2.67

3.18

4.47

5.19

2.55

1.86

alu4.net

2.66

3.28

4.43

4.73

2.37

1.8

seq.net

2.65

3.25

4.41

5.12

2.44

1.83

apex2.net

2.67

3.36

4.48

5.01

2.49

1.88

spla.net

2.65

3.52

4.39

5.04

2.22

1.72

pdc.net

2.66

3.43

4.39

5.01

2.2

1.66

ex1010.net

2.64

3.3

4.37

5.02

2.22

1.76

dsip.net

2.55

3.14

4.11

4.57

2.03

1.48

bigkey.net

2.58

2.95

4.16

4.3

2.05

1.51

elliptic.net

2.63

3.32

4.36

4.97

2.19

1.65

s38584.1.net

2.58

3.34

4.22

4.86

2.22

1.75

des.net

2.58

3.1

4.09

4.6

2.01

1.5

average entropy

2.64

3.27

4.36

4.90

2.30

1.75

Compression rate

88%

82%

87%

82%

58%

44%

35%

30%

34%

30%

---

---

50%

46%

50%

46%

24%

---

Improvement ( App.1) Improvement (App.1+App.2)

The first proposed approach (App.1 column), that is the use of variable symbol length improves the average bitstream compression rate by 35%, 30%, 34%, and 30% compared to the fixed symbol length of 3, 4, 5, and 6 bits, respectively. By applying the second proposed approach, which is the reordering of multiplexer inputs, on top of the first approach (App.1+App.2), the compression rate is improved by 24% on average, compared to the first approach. For all the benchmarks, the proposed approaches led to better results than the the approaches which use fixed symbol length. It should be pointed out that the base architecture which the ordered inputs architecture was compared, was produced by VPR 5. This architecture is not the worst case and the order of its multiplexer inputs can improve the compression rate compared to the worst case.

5. CONCLUSION AND FUTURE WORKS In this paper, we focused on the multiplexers of switch modules. Research on other multiplexers in other parts of the FPGA, such as connection blocks and CLBs is underway. In this paper, we modified the bitstream format and chose the appropriate symbol lengths for the proper configuration of multiplexers. Then, we arranged the inputs of multiplexers in the switch modules to improve the symbol statistics. By applying both approaches, the average compression rate is improved by 46% over 20 MCNC benchmarks compared to the approached which use fixed symbol length.

6. REFERENCES [1] P. Stepien and M. Vasilko, "On Feasibility of FPGA Bitstream Compression During Placement and Routing," International Conference on Field Programmable Logic and Applications, 2006. [2] A. Abdelhadi and G. Lemieux, "Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations," International Conference on Reconfigurable Computing and FPGAs, 2011. [3] Z. Li and S. Hauck, "Configuration Compression for Virtex FPGAs," The 9th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, 2001. [4] X. Qin, C. Muthry, and P. Mishra, "Decoding-Aware Compression of FPGA Bitstreams," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.19, no.3, pp.411419, March 2011. [5] M. Martina, G. Masera, A. Molino, F. Vacca, L. Sterpone, and M. Violante, "A New Approach to Compress the Configuration Information of Programmable Devices," Design, Automation and Test in Europe, DATE '06. Proceedings, vol.2, March 2006. [6] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W. M. Fang, and J. Rose, "VPR 5.0: FPGA CAD and Architecture exploration Tools with single-Driver Routing, Heterogeneity and Process Scaling," ACM/SIGDA International Symposium on FPGAs, Feb. 2009. [7] C. E. Shannon, “Prediction and entropy of printed English,” Bell Syst. Tech. J. 30, 50–64, 1951. [8] G. Wang, S. Sivaswamy, C. Ababei, K. Bazargan, R. Kastner, and E. Bozorgzadeh, "Statistical Analysis and Design of HARP FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.25, no.10, pp.2088-2102, Oct. 2006.

Lihat lebih banyak...

Improving bitstream compression by modifying FPGA architecture

Descrição do Produto

Comentários