A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems




F. Campi, A. Cappelli, R. Guerrieri, A. Lodi, M. Toma
ARCES, Università di Bologna

A. La Rosa, L. Lavagno, C. Passerone
Dipartimento di Elettronica, Politecnico di Torino

R. Canegallo
NVM-DP CR&D, STMicroelectronics

Abstract

Flexibility, high computing power and low energy consumption are key requirements when designing new-generation embedded processors. Traditional architectures are no longer able to provide a good compromise among these conflicting implementation requirements. In this paper we present a new reconfigurable processor that tightly couples a VLIW architecture with a configurable unit implementing an additional configurable pipeline. A software development environment is also introduced, providing user-friendly tools for application development and performance simulation. Finally, we show that the proposed HW/SW reconfigurable platform achieves dramatic improvements in both speed and energy consumption on signal processing computation kernels.

1 Introduction

Today's embedded systems, especially those aimed at the wireless consumer market, must execute a variety of high-performance real-time tasks, such as audio, image and video compression and decompression. The flexibility required to reduce mask and design costs, the computing power demanded by increasingly complex applications, and the low power consumption imposed by the almost negligible growth of battery capacity are requirements that traditional processors will not be able to satisfy in the next few years. Two main approaches have been explored in order to face these challenging issues.

The first is represented by mask-configurable processors, such as Xtensa [1], where new application-specific instructions can easily be added at design time by integrating dedicated hardware within the processor pipeline. Selection of the candidate new instructions is performed manually, using a simulator and a profiler. When the Xtensa processor is synthesized including new application-specific hardware, a dedicated development tool-set is also generated that supports the newly added instructions as function intrinsics. This approach provides a user-friendly environment for application development. However, since the hardware for the new instructions is synthesized with an ASIC-like flow, the processor cannot be reconfigured after fabrication, resulting in very high non-recurring engineering costs when the application specifications change.

Following a different approach, several new configurable architectures [2, 3, 4, 5, 6, 7] have been proposed, usually coupling an FPGA with a microprocessor. Computation kernels, where most of the execution time of an application is spent, are identified and implemented in the gate array, thus achieving a boost in speed and energy performance. At the same time a high degree of flexibility is retained, since the FPGA can still be reprogrammed after fabrication. The introduction of Run-Time Reconfiguration [8, 9] further improved FPGA flexibility and efficiency, allowing different FPGA instructions to be used through run-time modification of the instruction set, based on the currently executed algorithm (e.g. audio vs. video decoding).

In this paper we present a new configurable VLIW architecture that tightly couples a processor with a configurable unit. The integration of a configurable datapath in the processor core reduces the communication overhead towards other functional units, thus increasing its usability across more computation kernels. At the same time, an integrated software environment providing user-friendly tools has been developed. Our approach to software support of the extended instruction set is similar to that of Tensilica, in that we rely on manual identification of the extracted computational kernels. However, we do not require regeneration of the complete tool chain whenever a new instruction is added.

2 System Architecture

The approach adopted in the proposed architecture is to provide a VLIW microprocessor with an additional Pipelined Configurable Gate Array (PiCoGA, pGA), capable of introducing a large number of virtual application-specific units. The reconfigurable unit is tightly integrated in the processor core, just like any other functional unit: it receives inputs from the register file and writes results back to the register file. Each configuration implements a specific datapath, with a number of stages suited to the function to be executed, which may not even be known at compile time. In fact, the number of cycles needed to complete the execution may depend on the FPGA pipeline status and on the input values, for example when for or while loops are implemented entirely inside the array.

The proposed computational model takes advantage of the synergy between different application-specific functional units tightly integrated in the same core. An FPGA behaving like a coprocessor [6] needs to implement an entire computational kernel in order to achieve high throughput, because the communication overhead towards the processor core is considerably high. As a consequence, when a kernel is composed of both functions suitable for mapping to an FPGA and operators that cannot be implemented efficiently in the configurable unit, it is often executed entirely in the processor core, leaving the array unused. In our model the communication overhead between the array and the other functional units is as small as possible, allowing the different operations of a single kernel to be distributed to the functional units that best fit them. Wide multipliers, variable shifters and MACs, which are difficult to implement efficiently in FPGAs, can be executed in dedicated hardwired functional units, while the configurable unit exploits the parallelism of even a small portion of a kernel. In this way the utilization of the gate array increases considerably, justifying its area cost for a wide range of applications.
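As an illustration of this partitioning, consider the following hypothetical filter kernel written in plain C (the function name and the specific operations are not taken from the paper, they only sketch the idea): the multiply-accumulate in the inner loop is the kind of operation best served by the hardwired multiply/MAC unit, while the bit-level saturation and masking afterwards is a natural candidate for mapping onto the pGA.

    #include <stdint.h>

    /* Hypothetical kernel: the MAC in the loop would run on the hardwired
       multiply/MAC unit, while the bit-level saturation and masking below
       would be a good candidate for a pGA instruction. */
    int16_t fir_sample(const int16_t *x, const int16_t *h, int taps)
    {
        int32_t acc = 0;
        for (int i = 0; i < taps; i++)
            acc += (int32_t)x[i] * h[i];   /* multiply-accumulate: hardwired unit */

        /* Saturate to 16 bits and clear the two LSBs: bit-level manipulation
           suited to the configurable datapath. */
        if (acc > 32767)  acc = 32767;
        if (acc < -32768) acc = -32768;
        return (int16_t)(acc & ~0x3);
    }

With a coprocessor-style FPGA the whole function would have to be mapped to the array or left entirely to the core; with the tightly coupled unit described here, each part can go to the unit that fits it best.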

2.1 XiRisc: a VLIW Processor

The XiRisc architecture [12] is strictly divided into system control logic and a data elaboration region. The design approach was to provide a simple and straightforward control architecture, serving as a structural backbone and giving the programmer a familiar execution model with full control over the elaboration. All data processing resources are added to this structure as independent, concurrent functional units, each of which is controlled through a subset of the assembly instructions (ISA extensions). The control architecture is based on the classic five-stage RISC pipeline described in [13]: XiRisc is a strictly load/store architecture (Fig. 1), where all data loaded from memory is stored in the register file before it is actually processed. This very straightforward computational model might result in a severe bottleneck for memory-intensive applications. In order to maintain high data throughput to and from the functional units, the processor is structured as a Very Long Instruction Word machine, fetching and decoding two 32-bit instructions each clock cycle. The instruction pairs are then executed concurrently on the set of available functional units, forming two symmetrical, separate execution flows called data channels. Simple, commonly used functional units such as the ALU and the shifter are duplicated over the data channels; all other functional units are more efficiently shared between the two channels. Some functional units are essential to the processor's operation, namely the program flow control unit and the memory access unit. All other functional units may be included in or excluded from the design at HDL-compilation time, thus providing a first level of design-time configurability. To simplify hazard handling, the software compilation tool chain [11] schedules instruction pairs so as to avoid simultaneous access to the same functional unit, so that pairs are never split during the execution flow. All other pipeline hazards are resolved at run time by a fully bypassed architecture and a hardware stall mechanism. A special-purpose assembly-level scheduler tool was added to the compiler to minimize stall configurations and enhance overall processor performance.

The pGA is handled by the control logic and the compilation tool chain as a shared functional unit. Operands are read from and results written to the register file, and a special instruction set extension is used to control both array execution and reconfiguration. From the architectural point of view, the main differences between the pGA and the other functional units are:

1. The PiCoGA supports up to 4 source and 2 destination registers for each issued assembly instruction. In order to avoid bottlenecks on the writeback channels, a special-purpose register file was designed, featuring four source and four destination ports, two of which are reserved for the PiCoGA.

2. PiCoGA instructions have unpredictable latency, so a special register locking mechanism had to be designed to maintain program flow consistency in case of data dependencies between PiCoGA instructions and other assembly instructions (a minimal model of this mechanism is sketched after this list).
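The following is a minimal, purely illustrative C model of such a register locking scheme, not the paper's actual implementation: when a variable-latency PiCoGA instruction issues, its destination registers are marked as locked, and any later instruction that touches a locked register stalls until the PiCoGA writeback clears the lock.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_REGS 32

    bool reg_locked[NUM_REGS];   /* one lock bit per architectural register */

    /* Issue of a PiCoGA instruction: lock its destination registers. */
    void pga_issue(const uint8_t *dest, int ndest)
    {
        for (int i = 0; i < ndest; i++)
            reg_locked[dest[i]] = true;
    }

    /* PiCoGA writeback (after a data-dependent number of cycles): unlock. */
    void pga_writeback(const uint8_t *dest, int ndest)
    {
        for (int i = 0; i < ndest; i++)
            reg_locked[dest[i]] = false;
    }

    /* Decode-stage check: stall if any register used by the candidate
       instruction is still locked by an in-flight PiCoGA instruction. */
    bool must_stall(const uint8_t *regs, int nregs)
    {
        for (int i = 0; i < nregs; i++)
            if (reg_locked[regs[i]])
                return true;
        return false;
    }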

2.2 Instruction Set Extension

In order to support the integration of the pGA in the processor core, the instruction set has to be extended, just as when new functional units are added. In the case of a configurable unit, both configuration and execution instructions are usually provided, so that the functions to be executed can be loaded in advance from the second-level cache. Figure 2 shows the execution (pGA-op) and configuration (pGA-load) instruction formats adopted for the implemented prototype.

Figure 1. System architecture: two symmetrical data channels, each with its own ALU, shifter and instruction decode logic, a shared register file, shared functional units (multiply/MAC, data memory handling, further optional units), and the FPGA gate array with its control unit and dedicated writeback channel.

Figure 2. XiRisc extended instruction formats: bit-field layouts of the 64-bit pGA-op, the 32-bit pGA-op and the pGA-load instructions, with their source, destination, operation and configuration specification fields.

Regarding pGA-op, two different instruction formats are provided:

• 64-bit pGA-op allows the architecture to take advantage of the full bandwidth between the register file and the pGA (four source and two destination registers), but no other instruction can be fetched concurrently;

• 32-bit pGA-op can be fetched concurrently with another traditional RISC instruction. This format is useful when memory access is the bottleneck, since one datapath is free to perform one read or write each cycle while the other feeds the pGA. The main drawback is that only two register file source operands are available for the pGA, since another two are needed for memory addressing.

In both cases the least significant bits are used to identify which of the configured functions is to be executed. The pGA-load instruction specifies the configuration to be loaded (each code corresponding to a PiCoGA function) and the array region and layer where to load it, given either as an immediate (statically determined) or as a register file address (dynamically determined).
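As a concrete illustration, the sketch below packs a 32-bit pGA-op from its fields. The field order and bit positions used here are assumptions made purely for illustration (the authoritative layout is the one in Figure 2); the point is simply that two 5-bit source registers, two destination registers and an operation-specification immediate must fit alongside the major opcode in 32 bits.

    #include <stdint.h>
    #include <assert.h>

    /* Illustrative 32-bit pGA-op encoder; field placement is assumed,
       not taken from the paper. */
    uint32_t encode_pga_op32(uint32_t major_opcode,            /* 6 bits  */
                             uint32_t src1, uint32_t src2,     /* 5 bits each */
                             uint32_t dest1, uint32_t dest2,   /* 5 bits each */
                             uint32_t op_spec)                 /* 6-bit function code */
    {
        assert(major_opcode < 64 && src1 < 32 && src2 < 32);
        assert(dest1 < 32 && dest2 < 32 && op_spec < 64);

        return (major_opcode << 26) | (src1 << 21) | (src2 << 16) |
               (dest1 << 11) | (dest2 << 6) | op_spec;
    }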

3 PiCoGA: Pipelined Configurable Gate Array

In the past, a few attempts have been made to design a configurable unit tightly integrated in a processor core, and their study has led to some guidelines that must be followed to achieve a significant gain in overall system performance. First of all, the configurable unit, as opposed to the one in [3], should be able to perform complex functions that require multi-cycle latency. The PiCoGA is designed to implement a custom pipeline in which each stage corresponds to a piece of the computation, so that high-throughput circuits can be mapped onto it. The array is also provided with a control unit which manages pipeline activity, just as if the array were a complete additional datapath. In this way a sequence of pGA instructions can be processed, filling the pipeline in order to exploit parallelism. Moreover, the configurable unit should preserve its state across instruction executions, so that a new pGA instruction can directly use the results of previous ones. This reduces the pressure on the register file, and functions with more operands than the instruction formats support can be split into a sequence of related pGA instructions. Since most of the control logic is executed in the standard processor pipeline, the configurable unit should have a granularity suited to multi-bit datapath implementation, while remaining flexible enough to complement the other functional units on the kinds of computation they do not handle efficiently. Finally, tight integration in the processor core gives the opportunity to use the pGA in many different computational kernels. Run-time reconfiguration is therefore necessary to support new sets of dynamically defined instructions, but it is effective only if there is no reconfiguration penalty.

Figure 3. PiCoGA structure: rows of Reconfigurable Logic Cells (each RLC containing two 16x2 LUTs, a carry chain, output registers and init logic), horizontal and vertical connection blocks, switch blocks, the pGA control unit, a 4x32-bit input data bus from the register file, a 2x32-bit output data bus to the register file and a 192-bit configuration bus from the configuration cache.

3.1 PiCoGA Structure

The pGA is an array of rows, each representing a possible stage of a customized pipeline. The width of the resulting datapath should match that of the processor, so each row is able to process 32-bit operands. As shown in Figure 3, each row is connected to the other rows through configurable interconnect channels, and to the processor register file through six 32-bit busses: in a single cycle, four words can be received from the register file and up to two can be produced for writeback. The busses span the whole array, so that any row can access them, improving routability. Pipeline activity is controlled by a dedicated configurable control unit, which generates two signals for each array row. The first one enables execution of the pipeline stage, allowing the registers in the row to sample new data. In every cycle, only the rows involved in the piece of computation to be executed in that cycle are activated, in a dataflow fashion. In this way, state stored in flip-flops inside the array is held correctly and, at the same time, unnecessary power dissipation is avoided. The second signal controls the initialization steps of the array state.
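The behavior of the row-enable signal can be pictured with the following toy C model (a sketch under assumptions such as the array depth and a strictly linear pipeline, not the actual control unit): a row is clocked only in the cycle in which its stage of the dataflow has valid inputs, so idle rows keep their state and dissipate no dynamic power.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_ROWS 16   /* assumed array depth, for illustration only */

    typedef struct {
        bool     valid;   /* does this row hold valid data this cycle? */
        uint32_t data;    /* 32-bit value latched by the row registers */
    } row_t;

    /* One simulated cycle of a linear pGA pipeline: a row samples new data
       (i.e. its enable is asserted) only when the previous stage holds valid
       data; all other rows are not clocked, preserving state and power. */
    void pga_cycle(row_t rows[NUM_ROWS], bool in_valid, uint32_t in_data)
    {
        for (int r = NUM_ROWS - 1; r > 0; r--) {
            if (rows[r - 1].valid) {                   /* enable asserted for row r */
                rows[r].data      = rows[r - 1].data;  /* per-row computation omitted */
                rows[r].valid     = true;
                rows[r - 1].valid = false;
            }
            /* else: row r is not enabled this cycle */
        }
        if (in_valid) {                                /* new operands from the register file */
            rows[0].data  = in_data;
            rows[0].valid = true;
        }
    }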

3.2 Configuration Caching

One of the reasons for tightly integrating an FPGA in a processor core is the opportunity to use it frequently, for many different computational kernels. However, reconfiguration of commonly used gate arrays can take hundreds, or more often thousands, of cycles, depending on the size of the reprogrammed region. Although execution can continue on the other processor resources, the scheduler will hardly find enough instructions to hide stalls that could cancel any benefit of dynamically configurable arrays. Furthermore, in some algorithms the function to be implemented is only known at the time it has to be executed, so that no reconfiguration can be performed in advance. In such cases many computational kernels cannot take advantage of the configurable unit, leaving it unused while they execute on the standard processor datapath. Three different approaches have been adopted to overcome these limitations. First of all, the pGA is provided with a first-level cache storing 4 configurations for each logic cell [8, 9]. A context switch takes place in one clock cycle, providing 4 immediately available pGA instructions. Furthermore, Partial Run-Time Reconfiguration (PRTR) [10] is supported, allowing reconfiguration of just a portion of the array while the rest remains untouched: while the pGA is executing one computation, reconfiguration for the next instruction can be performed, greatly reducing cache misses even when the number of configurations used is large. Finally, reconfiguration time can be shortened by exploiting a wide configuration bus to the pGA: the Reconfigurable Logic Cells (RLCs) in a row are written in parallel over 192 dedicated wires, so a complete reconfiguration takes at most 16 cycles. A dedicated on-chip second-level cache is needed to provide such a wide bus, while the whole set of available functions can be stored in an off-chip memory.
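The interaction of these three mechanisms can be summarized with a small cost model in C. This is only a sketch built on the figures stated above plus assumptions (4 contexts per cell, one row written per cycle over the 192-bit bus, 16 rows for a full reconfiguration); the hypothetical function below merely estimates how many cycles a pGA-load would cost in each case.

    #include <stdbool.h>

    #define CONTEXTS_PER_CELL  4    /* first-level configuration cache depth     */
    #define ROWS_PER_FULL_LOAD 16   /* assumed: one row per cycle over 192 wires */

    /* Hypothetical cost model of a pGA-load, in clock cycles. */
    int pga_load_cycles(bool in_first_level_cache,
                        int  rows_to_reconfigure,   /* PRTR: may be fewer than 16   */
                        bool overlapped_with_exec)  /* loaded behind a running pGA op */
    {
        if (in_first_level_cache)
            return 1;                       /* single-cycle context switch          */

        int cycles = rows_to_reconfigure;   /* one row written per cycle (assumed)  */
        if (cycles > ROWS_PER_FULL_LOAD)
            cycles = ROWS_PER_FULL_LOAD;

        /* If reconfiguration proceeds while the array executes another
           instruction, its latency is ideally hidden from the program. */
        return overlapped_with_exec ? 0 : cycles;
    }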

4 The Software Development Environment

The global optimization and configuration flow for the processor is shown in Figure 4. In this section we focus on the effort made to build a suitable development environment for the XiRisc reconfigurable processor [11].

Figure 4. The complete customization flow: the initial C code is profiled, the optimized C code is annotated with fpga-op kernels, and FPGA mapping produces both the executable code and the configuration bits.

A key design goal of the software development chain, including compiler, assembler, performance simulator and debugger, was to support compilation and simulation of software containing user-definable instructions without the need to recompile the tool chain every time a new instruction is added. We also chose gcc as the starting point for our development, because it is freely available for a large number of embedded processor architectures, including DLX and MIPS, and provides good optimization capabilities. Ideally, one would like to be able to define new instructions so that the compiler can use them directly when optimizing C source code. Although some work has been done in this direction (see [14] for a summary), this remains a very difficult problem, and gcc in particular offers limited support for it, since the md format used to describe machine instructions was designed with compilation efficiency more than user friendliness in mind. Our approach is therefore to manually identify and tag the computation kernels to be extracted as single instructions, and then to provide automated support for compilation, assembly, simulation and profiling of the resulting reconfigurable processor code. This allows the designer to quickly search the design space for the best performance of the target application code, given the architectural constraints.

4.1 Compiler and Assembler

We re-targeted the compiler by changing the machine description files found in the gcc distribution, in order to describe the extensions to the DLX architecture and ISA. The presence of the PiCoGA was modeled as a new pipelined functional unit with a set of possible latencies. The approach we used is based on a single assembler mnemonic for all reconfigurable instructions with the same latency, so that gcc can bind each of them to the appropriate PiCoGA functional unit and thus expose its latency to the scheduler (a source-level usage sketch is given after this paragraph). All such mnemonics are then translated into the opcodes shown in Figure 2. On the assembler side, we modified the file that contains both the assembler instruction mnemonics and their binary encodings, adding the PiCoGA instructions that are used when the code is loaded on the target processor. Furthermore, two instructions called tofpga and fmfpga were added; they must be used only with the modified ISA simulator, to emulate the functionality and performance of the PiCoGA instructions with a software model.
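To make the single-mnemonic-per-latency idea concrete, the fragment below shows how two different PiCoGA functions sharing the same 5-cycle latency class might both be emitted under one scheduler-visible mnemonic, with an immediate selecting the actual configured function. Everything specific here (the pga_op5 mnemonic, the operand order, the immediate 0x03) is a hypothetical illustration of the mechanism described above; only the gcc extended-asm syntax itself is standard.

    int saturate_pack(int a, int b)
    {
        int r;
        /* Hypothetical mnemonic for the 5-cycle latency class; the trailing
           immediate would select which configured pGA function to execute. */
        asm volatile ("pga_op5 %0, %1, %2, 0x03"
                      : "=r"(r)
                      : "r"(a), "r"(b));
        return r;
    }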

4.2 Simulator and Debugger

Simulation is a fundamental step in the embedded software development cycle, both to check the correctness of the algorithm and to debug the object code (including the ability to set breakpoints and to access variables and symbols). In order to have fast performance analysis and design space exploration at the algorithmic and hw/sw partitioning level, we use a C model of the behavior of each PiCoGA instruction, which is called at simulation time when the instruction is encountered. A drawback of this approach is that the designer of the PiCoGA configuration data must manually ensure that the behavior of the simulation model matches that of the PiCoGA implementation. A PiCoGA instruction is described in the source code using a pragma directive with the following format:

#pragma fpga instr_name opcode delay nout nin outs ins

where instr_name is the mnemonic name of the PiCoGA instruction to be used in the asm statements, opcode is the immediate field identifying the specific PiCoGA instruction, delay is the latency in clock cycles, nout and nin are the number of output and input operands, and outs and ins are the corresponding lists of output and input variables.


For example:

int bar (int a, int b)
{
    int c;
    ...
    #pragma fpga shift_add 0x12 5 1 2 c a b
    c = (a
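A self-contained version of this usage pattern, with the behavioral statement filled in purely for illustration (the shift_add name, its parameters and the expression are assumptions, not the paper's actual example), would look like this: the pragma declares the PiCoGA instruction with its opcode, latency and operands, while the plain C statement provides the behavioral model used by the modified instruction-set simulator.

    #include <stdio.h>

    int bar(int a, int b)
    {
        int c;

        /* Declares the hypothetical PiCoGA instruction "shift_add":
           opcode 0x12, 5-cycle latency, 1 output (c), 2 inputs (a, b). */
        #pragma fpga shift_add 0x12 5 1 2 c a b
        c = (a << 1) + b;   /* assumed behavioral model of shift_add */

        return c;
    }

    int main(void)
    {
        printf("%d\n", bar(3, 4));   /* prints 10 with the assumed model */
        return 0;
    }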

