VLSI architecture for a low-power video codec system




Microelectronics Journal 33 (2002) 417–427

www.elsevier.com/locate/mejo

VLSI architecture for a low-power video codec system A. Chimienti a, L. Fanucci b,*, R. Locatelli c, S. Saponara c a

b

IRITI, National Research Council, Strada delle Cacce 91, I-10125 Torino, Italy Centro Studi Metodi e Dispositivi per Radiotransmissioni (CSMDR), National Research Council, Via Diotisalvi 2, I-56122 Pisa, Italy c Department of Information Engineering, University of Pisa, Via Diotisalvi 2, I-56122 Pisa, Italy Received 6 November 2001; revised 18 January 2002; accepted 23 January 2002

Abstract

In this paper, the design of a very large scale integration (VLSI) architecture for a low-power H.263/MPEG-4 video codec is addressed. Starting from a high-level system modelling, a profiling analysis indicates a hardware–software (HW–SW) partitioning assuming power consumption, flexibility and circuit complexity as the main cost functions. The architecture is based on a reduced instruction set computer engine, enhanced by dedicated hardware processing, with a memory hierarchy organisation and direct memory access-based data transfers. To reduce the system power consumption two main strategies have been adopted. The first consists in the design of a low-power, high-efficiency motion estimator specifically targeted to low bit-rate applications. Exploiting the correlation of the video motion field, it attains the same high coding efficiency as the full-search approach for a computational burden roughly two orders of magnitude lower. Combining the decreased algorithm complexity with low-power VLSI design techniques, the motion estimator power consumption is scaled down to a few mW. The second consists in the implementation of a proper buffer hierarchy to reduce memory and bus power consumption in the HW–SW communication. The effectiveness of the proposed architecture has been validated through performance measurements on a prototyping platform. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Hardware–software co-design; Very large scale integration architectures; Low-power circuits; Rapid prototyping; Video coding

1. Introduction

New trends in personal computing are mainly focused on multimedia access and communication. Moreover, there is great interest in portability for applications such as CMOS cameras, 3G mobile phones (UMTS) and personal digital assistants, where low-cost and low-power constraints are mandatory. Igura et al. [1] estimated that the power consumption of a digital signal processing circuit in portable multimedia terminals must be at most 500 mW. A key issue for multimedia communication is the use of video coding techniques to reduce the enormous bit-rate for transmission and storage (tens of Mbits/s depending on video frame rate and image size). To this aim, several compression standards were developed by ISO and ITU-T targeting different requirements in terms of image quality and bandwidth [2]. Communication applications like video telephony, wireless multimedia, remote surveillance and emergency systems are efficiently covered by the H.263 and MPEG-4 simple profile recommendations [2–5]. Real time video coding requires both a high computational burden and high flexibility. It

combines processing intensive low-level tasks featuring regular computation on simple data structures, like motion estimation (ME) and the discrete cosine transform (DCT), with data dependent medium-level tasks characterised by an irregular data flow and a lower computational demand. All the outlined requirements can be met using the mixed hardware–software (HW–SW) architecture presented in this paper.

1.1. Previous works

The best implementation for obtaining the highest flexibility is a complete software (SW) solution, but usually a microprocessor platform does not provide sufficient performance and/or low enough energy consumption under real time constraints [2,6,7]. On the other hand, fully dedicated hardware (HW) solutions are limited by poor flexibility and reusability. They are targeted at well-defined applications with high production volumes and are typically not well suited to the wide and continuously changing multimedia application field. Dedicated and programmable approaches can be efficiently combined in the design of hybrid architectures reflecting the inner structure of multimedia processing, which consists of multiple tasks with different computational



burdens. Many hybrid architectures have been proposed for multimedia processing and particularly for video applications [1,2,6–11]. These architectures are based on various approaches with different degrees of programmability. A comprehensive overview is given in Ref. [2]: typically they enhance the processing capabilities by exploiting parallelisation (SIMD, MIMD, VLIW or split-ALU approaches) and/or adaptation strategies (specialised instruction set or coprocessor approaches) at different levels of computation. At present, the best-suited solution for the design of a flexible, low-complexity and low-power video processing scheme is embedding a reduced instruction set computer (RISC) engine with dedicated HW units for the computation intensive tasks. According to this approach, several architectures have recently been proposed in the literature to integrate in a single chip a complete H.263/MPEG-4 codec with a power consumption lower than 300 mW [2,6–9]. Unfortunately, they present two main drawbacks still to be overcome: the implementation of a low-power but high-efficiency ME technique, and the optimisation of the HW–SW communication memory hierarchy to reduce system power consumption by exploiting data reuse. To this end, in this paper we propose a hybrid architecture for H.263/MPEG-4 video coding which addresses both of the above issues. First, the system HW–SW partitioning has been derived as a result of codec high-level modelling and the relevant SW profiling. Special emphasis has been devoted to the design of a low-complexity and low-power very large scale integration (VLSI) macrocell for ME. Then the exploration analysis of a direct memory access (DMA)-based communication between HW and SW tasks and the relevant memory hierarchy organisation have been performed to reduce the impact on system performance of the large data transfers which characterise video processing. Finally, to assess the architecture efficiency, the whole system has been prototyped on an FPGA-based emulation platform. Measured performances have been compared with those of the well-known TMN codec [12].

After this introduction, in Section 2 we briefly describe the co-design methodology, the relevant system profiling and the partitioning. Section 3 addresses the issue of a low-power but high-efficiency ME technique. In Section 4, we detail the proposed memory hierarchy organisation and summarise the system requirements in terms of bus bandwidth, memory size and power consumption. Section 5 describes the architecture prototyping and the relevant performance measurements. Some conclusions are drawn in Section 6.

2. Codec profiling and hardware–software partitioning

Table 1
Profiling of the video codec

Task               CISC CPU (%)   RISC CPU (%)
ME (full search)   58.36          64.35
DCT/IDCT           15.09          11.56
Q/IQ                2.33           2.84
Others             10.43          10.87
Coder              86.21          89.62
IDCT                7.57           5.78
IQ                  0.15           0.13
MC                  2.09           1.57
Others              3.98           2.90
Decoder            13.79          10.38
Codec             100            100
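As an aside, a per-task breakdown of this kind is typically gathered by instrumenting the reference C model. The sketch below illustrates one way to do it; the task list, the PROFILE wrapper and the use of clock() are illustrative assumptions, not the authors' actual tooling, which is not described in the paper.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical per-task cost accounting around the C reference codec:
 * accumulate CPU time per task, then report the percentage breakdown
 * as in Table 1. Task indices and names are illustrative. */
enum { T_ME, T_DCT_IDCT, T_Q_IQ, T_OTHERS, T_COUNT };
static const char *task_name[T_COUNT] = { "ME", "DCT/IDCT", "Q/IQ", "Others" };
static double task_cost[T_COUNT];

#define PROFILE(task, call)                         \
    do {                                            \
        clock_t t0 = clock();                       \
        call;                                       \
        task_cost[task] += (double)(clock() - t0);  \
    } while (0)
/* e.g. PROFILE(T_ME, motion_estimation(&frame));  (hypothetical call) */

void report_breakdown(void)
{
    double total = 0.0;
    for (int k = 0; k < T_COUNT; k++) total += task_cost[k];
    for (int k = 0; k < T_COUNT; k++)
        printf("%-10s %6.2f %%\n", task_name[k],
               100.0 * task_cost[k] / total);
}
```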

The target of our project is the design of a low-power and low-complexity architecture for a single-chip H.263/MPEG-4 video codec suitable for real time processing of QCIF (176 × 144 pixels) video at up to 30 Hz. The maximum considered bit-rate amounts to 500 Kbits/s. In such a system, the ME task aims to reduce the temporal data correlation between video frames (inter-coding), while the DCT allows for the reduction of the spatial correlation (intra-coding) of each frame. The whole scheme also includes data dependent tasks for direct/inverse quantisation (Q/IQ), motion compensation (MC), variable length coding/decoding, bit-stream generation, system control and I/O. The context-based functionalities featured by some MPEG-4 profiles are not taken into account, so we refer to fixed-size frames which can be coded in intra, inter or bi-directional modes.

Starting from a high-level C description of the whole codec, a profiling analysis indicates a possible system HW–SW partitioning assuming power consumption, flexibility and circuit complexity as the main cost functions. Profiling data have been collected running the C code on general purpose RISC (UltraSparc II) and CISC (Pentium III) micro-architectures according to a computational cost approach. The computational burden represents a value proportional to the energy consumption and can be used for performance and low-power considerations. Table 1 indicates the percentage breakdown of the different functions. As expected from Refs. [5,13], the worst case is when the system is in the coding mode. The above results indicate that the most demanding task in terms of computational power (and so in terms of energy consumption) is the ME, and hence it represents the best candidate for a dedicated VLSI implementation. The ME is also a good choice for a HW solution because it is a function of the coder which requires low flexibility. SW implementation on a programmable engine is the best-suited solution for the remaining video codec tasks. After proper algorithm optimisations (such as a fast implementation of the DCT and IDCT based on the well-known Chen [14] algorithm with fixed-point arithmetic) and hand-crafted code refinement, the computational power requested by the SW tasks can be supported by a low-power ARM9 microprocessor. Such a microprocessor provides up to 220 MIPS at 200 MHz for a peak power consumption of 0.8 mW/MHz in a 0.18 μm CMOS technology [15]. The above power figure takes


into account the contribution of the ARM9TDMI core plus the relevant instruction and data caches and the memory management unit (MMU). Note that some of the hybrid architectures proposed in the literature [6–10] feature a dedicated HW implementation of the DCT and IDCT to overcome the poor computational capabilities of the adopted microprocessor core and/or to further reduce the energy consumption. For instance, a DCT/IDCT VLSI coprocessor with a power consumption of a few mW for QCIF video has been proposed by the authors in Ref. [16]. Despite the above issues, in the proposed system architecture we adopt a SW implementation for the DCT and IDCT tasks for the following main reasons: (i) the selected processor engine is able to support all the SW tasks under the real time constraint; (ii) the estimated power consumption of the overall system (see Section 4.1) is well below the 300 mW target; (iii) the proposed partitioning further increases system flexibility and reusability. It is also suited for the implementation of new coding algorithms [17–20] where the DCT/IDCT tasks are missing or replaced by other techniques such as the Walsh–Hadamard transform, the Karhunen–Loeve transform or wavelets.

Summarising, the exploration and profiling analysis has suggested the functional organisation of the codec indicated in Fig. 1. A RISC microprocessor and a dedicated accelerator implement the SW and HW tasks of the algorithm. The communication between the two agents, i.e. the HW–SW interfacing, is solved by proper buffer levels. In this way, a power saving memory hierarchy exploiting data reuse has been implemented (see Section 4). A DMA engine ensures the best performance for the processor data management by handling all the data transfers between the frame memories, the buffer hierarchy and the I/O interface. It is worth noting that the resulting system partitioning is aligned with the ones presented in the literature [2,6–8], based on a RISC-like engine enhanced by dedicated HW processing. Therefore, the power saving techniques proposed in the next sections (fast predictive ME and memory hierarchy design, in Sections 3 and 4, respectively) can be applied to any system design for low bit-rate and low-power video coding.

Fig. 1. HW–SW system architecture.

3. Motion estimator

3.1. Algorithm analysis

A straightforward technique for performing ME is the full-search block-matching (FS) [2,5,21–23]: the current frame of a video sequence is divided into non-overlapping N × N reference blocks and, for each of them, a block in the previous frame (candidate block), addressed by a motion vector (MV), is exhaustively searched for the best match within a proper search area according to a sum of absolute differences (SAD) cost function. If a(i, j) and b(i, j) are the pixels of the reference and candidate blocks and m and n are the coordinates of the MV, the SAD is defined as

SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} | a(i, j) - b(i + m, j + n) |,    (1)

with -p_v \le m \le p_v - 1 and -p_h \le n \le p_h - 1. Usually p_h = p_v = p = 16 and N = 16. This distortion is computed for all the 4p^2 possible positions of the candidate block within the search window. The block corresponding to the minimum distortion (SAD_min) is used for prediction and its MV is given by

MV = (m, n) |_{SAD_{min}},  where  SAD_{min} = \min_{(m, n)} [SAD(m, n)].    (2)
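For concreteness, a minimal C sketch of this exhaustive search is given below. It assumes 8-bit luminance frames stored row-major and a search range of p = 16, and it omits the clipping of candidate positions at the frame borders; all names are illustrative.

```c
#include <stdlib.h>
#include <limits.h>

#define N 16   /* block size */
#define P 16   /* search range: -P .. P-1 */

/* SAD (Eq. (1)) between the N x N reference block at (bx, by) in the current
 * frame and the candidate block displaced by (m, n) in the previous frame. */
static unsigned sad(const unsigned char *cur, const unsigned char *prev,
                    int width, int bx, int by, int m, int n)
{
    unsigned s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += abs(cur[(by + i) * width + bx + j] -
                     prev[(by + i + m) * width + bx + j + n]);
    return s;
}

/* Full-search block matching: test all 4*P*P candidate positions and return
 * the motion vector minimising the SAD (Eq. (2)). */
static void full_search(const unsigned char *cur, const unsigned char *prev,
                        int width, int bx, int by, int *mv_m, int *mv_n)
{
    unsigned best = UINT_MAX;
    for (int m = -P; m <= P - 1; m++)
        for (int n = -P; n <= P - 1; n++) {
            unsigned s = sad(cur, prev, width, bx, by, m, n);
            if (s < best) { best = s; *mv_m = m; *mv_n = n; }
        }
}
```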

This exhaustive approach achieves optimal performance in terms of peak signal-to-noise ratio (PSNR) for a given compression factor, but at the expense of a high computational burden and data bandwidth.

To reduce the FS complexity, several different block-matching algorithms have been proposed and implemented in state-of-the-art single-chip video coding schemes. They are all based on a reduction of the number of candidate blocks and/or of the number of pixels investigated for each candidate block, such as the three-step-search (TSS), four-step-search, hierarchical-search, cross-search, 2D log-search, pixel subsampling, reduced pixel resolution, densely centred uniform P-search (DCUPS) and fast search (FAST) [5,21–26]. To further reduce the power consumption, these algorithms can be combined with a proper circuit clock gating strategy: processing is stopped as soon as a partial SAD exceeds the current SAD_min, because that candidate will never be selected as the minimum distortion value [8]. Unfortunately, with these ME algorithms the lower computational complexity is paid for with an increase of the bit-rate or, in the case of constant bit-rate transmission, with a lower quality of the coded image.

To overcome the complexity of the FS while maintaining the same coding efficiency for the considered low bit-rate applications, we propose a fast predictive spatio-temporal

algorithm. It exploits the past history of the motion field to predict the current one. Indeed, in a typical video sequence, and particularly in low bit-rate coding, the motion field usually varies slowly from frame to frame (temporal correlation), and the blocks belonging to the same physical object in a scene show nearly the same motion (spatial correlation). By exploiting this correlation, the MV of a given block can be predicted from a set of four initial MV candidates, two selected from its spatial neighbours and two from its temporal ones, according to the minimisation of the SAD cost function (predictive phase). To further reduce the residual estimation error, the initial predictive phase is followed by a refinement phase on a grid centred around the position pointed to by the predictive phase winner, hereafter called V0, and made up of four points on the cross-directions and four points on the diagonal ones. To reach half-pixel resolution, the points on the cross-directions have 1/2 pixel distance from the centre, while the points on the diagonals have 1 or 3 pixel distance. The amplitude of the grid corner points is selected according to this rule: if SAD(V0) is greater than a proper threshold, V0 is likely to be a poor predictor and so the search area must be enlarged. Since this happens especially in the case of a scene change or a sudden motion change, the grid amplification allows a quicker recovery of the true motion field. Fig. 2 shows an example case for the proposed algorithm. We also consider the calculation of the cost function for the null vector, for a total complexity of just 13 SAD evaluations for each 16 × 16 macroblock (MB) processed, instead of the 1032 SAD evaluations of the FS approach with the typical value p = 16 and half-pixel resolution. It is worth mentioning that the SAD for the null vector may be reduced by a proper threshold to improve the coding efficiency (Static Priority Option, see Ref. [22]).

Fig. 2. Fast predictive algorithm.
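A compact C sketch of this two-phase search is given below. The split into a predictive phase and a single refinement step, the threshold-controlled grid enlargement and the overall budget of 13 SAD evaluations follow the description above; the particular predictor positions, the half-pel bookkeeping and the helper sad_halfpel() are illustrative assumptions, not the authors' implementation.

```c
typedef struct { int x, y; } MV;          /* motion vector in half-pel units */

/* Cost function: SAD of the current macroblock for a candidate MV given in
 * half-pel units (interpolation handled elsewhere). Assumed available. */
extern unsigned sad_halfpel(MV v);

/* Fast predictive ME for one macroblock: four spatio-temporal predictors plus
 * the null vector (predictive phase), then a single 8-point refinement grid
 * around the winner V0; 13 SAD evaluations in total.
 * The predictor set and grid offsets below are illustrative placeholders. */
MV predictive_me(MV left, MV top, MV tmp_same, MV tmp_right, unsigned threshold)
{
    MV cand[5] = { {0, 0}, left, top, tmp_same, tmp_right };
    MV v0 = cand[0];
    unsigned best = sad_halfpel(cand[0]);

    for (int k = 1; k < 5; k++) {                 /* predictive phase */
        unsigned s = sad_halfpel(cand[k]);
        if (s < best) { best = s; v0 = cand[k]; }
    }

    /* Refinement grid: 4 cross points at 1/2-pel distance, 4 diagonal points
     * at 1 pel (good predictor) or 3 pel (poor predictor, enlarged grid). */
    int d = (best > threshold) ? 6 : 2;           /* diagonal step, half-pel units */
    MV grid[8] = { {1,0}, {-1,0}, {0,1}, {0,-1},
                   {d,d}, {-d,d}, {d,-d}, {-d,-d} };

    MV winner = v0;
    for (int k = 0; k < 8; k++) {                 /* refinement phase */
        MV v = { v0.x + grid[k].x, v0.y + grid[k].y };
        unsigned s = sad_halfpel(v);
        if (s < best) { best = s; winner = v; }
    }
    return winner;
}
```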

The specific set of the initial predictors, the thresholds and the shape of the refinement grid have been selected as the best trade-off between computational saving and performance loss. The driving idea was to reduce the number of MVs to be tested as much as possible while maintaining a reasonable video quality. As will be detailed in Section 5, an exhaustive test campaign on an FPGA-based codec emulator has been carried out to validate the strength of the proposed ME algorithm. It attains the same high video compression quality as the FS technique, outperforming other fast algorithms such as the TSS adopted in Refs. [7,8,10], the FAST implemented in Ref. [26] and the DCUPS proposed in Ref. [23]. With respect to other predictive algorithms proposed in the literature [27,28,36], which feature a closer similarity to our approach, the fundamental difference lies in the motion field refinement. These algorithms exhibit a performance comparable to our algorithm in terms of PSNR [36] or computational complexity [27], but their refinement phase is based on the iterative application of a fixed grid depending on the fulfilment of some stop rules. As a consequence, the relevant computational load is neither constant nor known a priori, and hence the worst case must be considered during VLSI circuit design, thus reducing the efficiency of the HW implementation [37]. On the contrary, our algorithm features a refinement phase based on the single-step application of a flexible grid. In this way, it reaches a fixed computational workload which eases the HW implementation of the spatio-temporal predictive approach.

3.2. VLSI architecture


Fig. 3. Block diagram of the ME coprocessor.

The proposed technique has been implemented by an intellectual-property VLSI macrocell (patent filed [29]) whose block diagram is sketched in Fig. 3. The main unit of the macrocell is the ME_Engine (see Fig. 4) which, loaded with the proper reference and candidate block pixels, is able to compute the SAD cost function and to detect the MV that minimises it. The backbone of this engine is a parallel array of eight processing elements (PE), each of them implementing, at pixel level, an absolute difference operation. The engine also incorporates a parallel array of eight interpolation modules (IM) needed for the implementation of the half-pixel resolution search. Finally, the MDD unit detects the minimum value of the cost function (SAD_min) and its relevant MV (MV_min). The TH_ctrl unit compares this minimum with programmable thresholds. Note that the chosen size of the processing array is the result of a trade-off between circuit complexity and degree of parallelism to attain clock rate reduction and architecture-driven voltage scaling.

Fig. 4. ME_Engine block diagram.
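A behavioural C sketch of one datapath step is shown below; it assumes the engine consumes eight reference/candidate pixel pairs per clock cycle and reduces their absolute differences into a running SAD accumulator, which is one plausible reading of the 8-PE array described above, not a cycle-accurate model of the actual circuit.

```c
#include <stdlib.h>

#define PE_COUNT 8   /* width of the processing-element array */

/* One clock cycle of the PE array: eight |a - b| operations in parallel,
 * reduced into the running SAD accumulator of the current candidate. */
static unsigned pe_array_step(const unsigned char ref[PE_COUNT],
                              const unsigned char cand[PE_COUNT],
                              unsigned acc)
{
    for (int k = 0; k < PE_COUNT; k++)
        acc += abs((int)ref[k] - (int)cand[k]);
    return acc;
}

/* A 16 x 16 macroblock is then covered in 256 / PE_COUNT = 32 such steps
 * per candidate position. */
```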

The other units of the macrocell shown in Fig. 3 are in charge of the management of both the data flow and the memory resources, so as to implement the algorithm functionality according to a pipelined processing scheme. In particular, the MV memory stores the set of MV predictors for all the blocks of the reference frame. Since the considered predictive algorithm features a fixed computational workload and a regular data flow, the control unit is realised by a simple finite state machine. The block diagram of Fig. 3 also includes a reference buffer hierarchy and a candidate one to exploit data reuse, reducing system bandwidth and power consumption (see Section 4).

The VHDL (very high-speed integrated circuit HW description language) description of the ME cell has been


implemented by means of logic synthesis on a 0.18 μm, six metal level, standard-cell CMOS technology. The core complexity amounts to 27 Kgates of logic plus 150 bytes of single-port MV RAM. Combining the decreased algorithm complexity with the proper use of parallelism and pipelining at the architectural level, a large reduction of the required clock rate is obtained. The proposed ME coprocessor requires 687 cycles for each 16 × 16 reference MB, supporting real time processing of 30 Hz QCIF video with a clock frequency of nearly 2 MHz. The clock rate reduction, and the consequently relaxed circuit speed requirement, can be exploited for power saving. This can be achieved by scaling down the supply voltage and/or by using a low-speed, low-leakage version of the considered standard-cell library when available. For instance, the adopted 0.18 μm CMOS technology provides two different versions: a device high speed (DHS) library optimised for low circuit propagation delay and a device low leakage (DLL) library optimised for low leakage power consumption. The optimisation is mainly obtained by using two different threshold voltages, which are scaled down by about 20% going from the DLL library to the DHS one. Gate level simulations demonstrate an average dynamic power consumption of 2 mW at 1.6 V for typical H.263/MPEG-4 video sequences. By using the DLL library the leakage power contribution is negligible, since it amounts to 2.5 μW instead of the 750 μW of the DHS one. The results in terms of circuit complexity, power consumption and coding efficiency are very interesting when compared with state-of-the-art ME techniques. For instance, the low-complexity FS systolic array presented in Ref. [22] requires 29 Kgates plus 9 Kbits of dual-port RAM and about 42 mW for 30 Hz QCIF. As explained in Ref. [25], by adopting an adaptive pixel truncation scheme the power consumption of the FS algorithm can be reduced to roughly 30 mW, at the expense of a PSNR reduction of a few per cent. This power figure is anyway an order of magnitude greater than the 2 mW of our macrocell. On the contrary, some solutions adopted in recent single-chip architectures [7,8] achieve the same low power consumption as our predictive approach, but at the expense of a reduced coding quality, since they are based on the TSS algorithm (see Section 5). Moreover, their low-power performance is also due to full-custom circuit optimisations, such as clustered voltage scaling and a variable threshold voltage scheme, which reduce the portability of these solutions to different silicon technologies.
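As a quick consistency check of the clock frequency quoted above (assuming all 99 macroblocks of a QCIF frame, i.e. (176/16) × (144/16), are motion-estimated in every 1/30 s frame period):

f_{clk} \approx 687 \ \text{cycles/MB} \times 99 \ \text{MB/frame} \times 30 \ \text{frames/s} \approx 2.04 \ \text{MHz}.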

4. Memory hierarchy design

A large percentage of the system power consumption in data dominated applications, such as real time video coding, is due to data communication and storage in large background memories [30,31]. Such applications typically feature a high reuse factor, since most data are read multiple times (e.g. reference and candidate pixels during block-matching). In order to reduce this power component, the design of a proper memory hierarchy between the processor and the ME coprocessor is outlined in this section. The basic idea is that, by exploiting temporal locality and data reuse through an optimised multi-level buffer hierarchy, a great reduction of the overall power consumption can be achieved. Power savings are expected when high read/write access rates are directed to smaller buffers instead of large system memories. To better understand this idea, let us consider the following expression, which can be used for the power contribution of an embedded memory [30–33]:

P_{memory} = \tfrac{1}{2} V_{dd}^{2} C_{read} F_{read} + \tfrac{1}{2} V_{dd}^{2} C_{write} F_{write},    (3)

where F_read and F_write represent the throughput of the read and write accesses, V_dd is the power supply voltage, and C_read and C_write are technology parameters which express the read and write equivalent capacitances. Model (3) is derived under the reasonable hypothesis that a power-down mode avoids any power consumption when the memory is not accessed. The C_read and C_write values depend on the size of the considered memory: they grow with the buffer dimension, and this behaviour justifies the idea of introducing levels of buffers to reduce the access throughput from large system memories. Fig. 5 shows that power P1 is greater than P2 because the same F_read throughput (5/T in this example) is related to memories of different sizes (S1 > S2, hence C_read1 > C_read2).

Fig. 5. Power consumption for a fixed throughput access of different memory sizes.

Fig. 6 shows an example of buffer hierarchy: let us start from a memory of size S1 that is accessed with a throughput of 5/T, as indicated on the left side of Fig. 5; the relative power consumption is P1. In order to reduce the access frequency of the memory of size S1, and the consequent power dissipation, we introduce a smaller buffer (size S2): the access throughput of 5/T is now linked to a small memory, but data have to be copied from the first large memory to this smaller buffer. Thus the final power contribution (P2 + P'1 + P'2 instead of the initial value P1) takes into account the relative read and write extra cycles for updating the buffer of size S2. Thanks to data reuse, it is possible to reduce considerably the read access throughput from the original large memory: in Fig. 6 this read access frequency becomes 1/T, and it obviously equals the write access frequency of the small buffer of size S2. It should be clear now that the effective power improvement is the result of a trade-off between the read power saving related to the high throughput (P2 - P1) and the extra read and write consumption for the transfers between the two buffer levels (P'1 + P'2).

Fig. 6. Example of buffer hierarchy introduction for reducing memory power consumption.
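As a rough numerical illustration of this trade-off, the sketch below evaluates Eq. (3) for a flat access pattern and for the two-level hierarchy of Fig. 6; the capacitance values are placeholders chosen only to make the effect visible, not figures from the adopted technology library.

```c
#include <stdio.h>

/* Eq. (3): dynamic power of a memory accessed with read/write throughputs
 * f_read, f_write [Hz] and equivalent capacitances c_read, c_write [F]. */
static double p_memory(double vdd, double c_read, double f_read,
                       double c_write, double f_write)
{
    return 0.5 * vdd * vdd * (c_read * f_read + c_write * f_write);
}

int main(void)
{
    const double vdd = 1.6;           /* V */
    const double c_big = 50e-12;      /* placeholder: large frame memory */
    const double c_small = 5e-12;     /* placeholder: small buffer */
    const double f = 5.0e6;           /* "5/T" high-throughput access rate */

    /* Flat case: all reads hit the large memory (power P1). */
    double p1 = p_memory(vdd, c_big, f, 0.0, 0.0);

    /* Hierarchical case: the large memory is read once per reuse window (f/5),
     * the small buffer is written at f/5 and read at f (P'1 + P'2 + P2). */
    double p_hier = p_memory(vdd, c_big, f / 5.0, 0.0, 0.0)      /* P'1 */
                  + p_memory(vdd, 0.0, 0.0, c_small, f / 5.0)    /* P'2 */
                  + p_memory(vdd, c_small, f, 0.0, 0.0);         /* P2  */

    printf("flat: %.3f mW  hierarchical: %.3f mW\n", p1 * 1e3, p_hier * 1e3);
    return 0;
}
```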

According to the system architecture sketched in Fig. 1, we focus our data flow analysis on the communication between the frame memories and the data path registers (level 0) of the ME coprocessor. This communication can be split into three separate data flows that are independently optimised: the reference and candidate inputs to the ME engine and the output ME results (MV and SAD, represented on 4 bytes). These ME results amount to only a few bytes compared with the other input data, so the relevant memory organisation has a limited impact on the whole analysis. We therefore concentrate our exploration only on the reference and candidate memory hierarchies. Starting from the trivial solution, which does not include any buffer, we have selected all the possible configurations taking into account two main rules [30]: (i) when more levels are introduced, the buffer size has to increase from the lower (data path register side) to the higher levels (system bus side); (ii) when introducing levels, power saving is expected only if the access throughput to/from a memory is reduced (as explained earlier for Fig. 6, the small buffer reduces the read access frequency of the large memory from 5/T to 1/T).

For the exploration and evaluation of the selected configurations, we refer to the 0.18 μm CMOS technology already mentioned in Section 3.2. We only consider single-port memory configurations because dual-port ones nearly double the power consumption [33]. Our analysis also takes into consideration the system bus dissipation, which has mainly a dynamic component. In that respect, a simplified power model of the bus can be expressed as [32]

P_{bus} = \gamma C_{bus} B_{bus},    (4)

where B_bus and C_bus represent the data bandwidth and the equivalent capacitance of the main bus, respectively. The technology dependent parameter γ takes into account the data switching activity and is linked to the data correlation. We apply this model only to the system bus, considering the relevant bandwidth. Assuming that the equivalent capacitance of the main bus (C_bus) is much higher than that of the other intra-buffer connections, the power consumption due to the communication between buffer levels is negligible.

Fig. 7. Reference memory hierarchy configurations.

Figs. 7 and 8 show a selection of the most interesting configurations for the reference and candidate flows (Row is the number of MBs in a picture line; it amounts to 11 in the QCIF case). Old and New are the two frame memories which store the current frame to be processed (New) and the reconstructed previous one (Old). For both the reference and candidate cases, the trivial configuration without any buffer (denoted as A in Figs. 7 and 8) is considered in order to evaluate the final improvement of this approach. Memory power contribution and area results, indicated with P and A respectively, are reported in Figs. 7 and 8 next to each considered case; values are normalised with respect to the relevant reference and candidate trivial cases. The normalised system bus power contribution is depicted in Fig. 9 for the reference hierarchy and in Fig. 10 for the candidate one. For the reference case we choose B as the best configuration. Indeed, the 1 × Row MB buffer foreseen in the C and D cases does not provide any improvement with respect to B, because the latter already minimises the read/write access throughput and the system bus bandwidth, with negligible area penalty.


Fig. 8. Candidate memory hierarchy configurations.

Fig. 9. Reference hierarchy: bus power results.

Fig. 10. Candidate hierarchy: bus power results.

C is the best configuration for the candidate hierarchy. The introduction of the 3 × Row MB buffer determines a 5% memory power saving, which may seem small with respect to the relative area penalty. However, looking at the bus power results (Fig. 10), the C case allows a reduction of more than 90% with respect to the trivial case A. This explains our choice and underlines the importance of including the bus power consumption in the exploration cost function. It must be noted that the chosen candidate hierarchy differs from the ones presented in the literature [30,31,34]. Even when considering the same hierarchy, such as for the B reference case [30,31], the memory power savings reported in our analysis are reduced with respect to Refs. [30,31,34]. The above differences are explained by new aspects considered in our study: first, our analysis includes the system bus power evaluation as an important driving factor; secondly, it directly refers to the predictive ME algorithm described in Section 3 instead of the FS one adopted in Refs. [30,31,34], with a consequently different data flow of the ME engine and a limited reuse factor (13 vs. 1032 SADs, see Section 3). Moreover, in Refs. [30,31] the frame memories New and Old are considered off-chip, while we can refer to an on-chip implementation of those memories because our analysis is based on a more recent technology, which allows large RAM blocks (up to 1 Mbit) to be embedded efficiently. Obviously, the power consumption related to accesses to an off-chip memory is greater than that for an equivalent on-chip memory. Note that the analysis in Ref. [34] refers to video object planes (VOPs) instead of the fixed-size frames of our target application (see Section 2). VOPs contain both video sequences and shape information. The latter is coded by the so-called alpha plane, which is a bitmap indicating which pixels are inside the shape. For this reason, in Ref. [34] the analysis of the alpha data flow is added to the reference and candidate ones. To summarise, with respect to the trivial solution without any hierarchy scheme, the overall memory power reduction for both the reference and candidate data flows amounts to about 6%; considering the bus power, the reduction increases up to 90%.

4.1. System bus bandwidth, power consumption and memory size

The whole system bus bandwidth for the coding of a 30 Hz 4:2:0 QCIF sequence, considering also the video input and the communication between processor and frame memories, is 5.8 MBytes/s. In the trivial case without the proposed buffer hierarchy this value becomes 23.2 MBytes/s. With reference to Fig. 1, Table 2 details all the considered contributions. For the coded data stream-out we refer to the maximum considered bit-rate of 500 Kbits/s. The total RAM size for the frame memories and the proposed ME buffer hierarchy amounts to roughly 120 KBytes. Note that in Table 2 we refer to three frame memories (FM) called New, Old and Prefetch. As explained earlier, the New and Old memories store the current kth frame to be processed and the reconstructed (k-1)th one. Concurrently, the next, (k+1)th, frame is written into the Prefetch FM. At the end of the current elaboration, the kth reconstructed frame is stored into the New FM. Therefore, during the following (k+1)th period, the Prefetch, the New and the Old FM become the New, the Old and the Prefetch FM, respectively. The above system figures have been used to estimate the peak power consumption of the proposed system architecture. For the single-chip implementation on the mentioned 0.18 μm CMOS technology it amounts to roughly 180 mW, considering the contribution of the ARM9 macrocell, the ME and DMA coprocessors, the I/O interface, the frame memories, the buffer hierarchy and the system bus.

Table 2
Bus bandwidth contributions

Task                  From           To            Bandwidth (MBytes/s)
Video in              I/O interface  FM Prefetch   1.088
Frame reconstruction  FM New         CPU           1.088
                      FM Old         CPU           1.088
                      CPU            FM New        1.088
Motion estimation     FM New         ME            0.725
                      FM Old         ME            0.725
                      ME             CPU           0.011
Stream out            CPU            Out           0.061
Total                                              5.874
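The per-flow figures in Table 2 can be cross-checked from the frame format alone; the sketch below reproduces them assuming binary MBytes (2^20 bytes) and, for the ME flows, accesses to the luminance plane only.

```c
#include <stdio.h>

int main(void)
{
    const double MB = 1024.0 * 1024.0;           /* bytes per MByte (2^20) */
    const double fps = 30.0;                     /* frame rate */
    const double luma = 176.0 * 144.0;           /* QCIF luminance pixels */
    const double frame = luma * 1.5;             /* 4:2:0 frame in bytes */
    const double mbs = (176 / 16) * (144 / 16);  /* 99 macroblocks per frame */

    printf("full frame flow    : %.3f MBytes/s\n", frame * fps / MB);   /* ~1.088 */
    printf("luma-only ME flow  : %.3f MBytes/s\n", luma * fps / MB);    /* ~0.725 */
    printf("ME results (4 B/MB): %.3f MBytes/s\n", 4 * mbs * fps / MB); /* ~0.011 */
    return 0;
}
```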

5. System prototyping

The algorithms, the proposed HW–SW partitioning and interfacing, and the system memory hierarchy definition have to be exhaustively verified and optimised before silicon implementation. Even if a SW simulation environment can be used for the co-verification of mixed HDL–C/C++ system descriptions, a rapid prototyping approach has been preferred in our design flow. Due to the huge amount of data and operations typical of image processing, a SW simulation takes hours or days for a few seconds of video sequence. Simulating all the possible configurations of complex HW–SW architectures needs very long validation patterns, which leads to prohibitive times with respect to typical consumer time-to-market constraints. For instance, a simulation of the ME HDL model for 30 Hz QCIF, within a Synopsys framework on a SUN Ultra 10, required about 33 h, while its real time emulation on a prototyping breadboard lasts 1 s. Furthermore, the subjective analysis of human visual perception is a necessary step during the design and testing of algorithms for video applications. In that respect, HW emulation allows the quality behaviour of the video codec to be tested by directly checking the images on an output screen.

The prototype consists of a general purpose PC platform and a PCI board with reconfigurable HW based on FPGA technology (Fig. 11). A Pentium III was chosen as host processor to implement the SW part of the video coding scheme. The Celoxica RC1000-PP [35], the selected FPGA-based PCI development board, is equipped with a Xilinx Virtex FPGA XCV1000, which provides 1 million

of system gates. The ME macrocell, including the buffer levels, was implemented in FPGA technology by means of logic synthesis, occupying less than 30% of the XCV1000 resources. A DMA engine completes the set of resources used for mapping the HW–SW architecture onto the breadboard prototype. The system prototype also includes video I/O interfaces. Real time video data are grabbed by a video camera and converted to the suitable format by proper filters running on the host PC. An output interface sends the coded bit stream over an Internet protocol network to a connected PC station, which displays the decoded video data. A rapid design methodology, based on a library of pre-designed HDL–C/C++ modules, completes the framework. The HW emulator performs real time processing and, thanks to the proper acquisition/visualisation scheme and user interface, it allows the use of many real test patterns of arbitrary length, taking also into account the effects of noise due to the acquisition system.

Fig. 11. Rapid prototyping environment.

A quality analysis, based on several test conditions (standard sequences and real ones grabbed by the acquisition system), highlights the efficiency of the coding scheme implemented by the prototype with respect to the SW implementation of the well-known TMN coder [12], considering for the ME the following algorithms: FS, TSS, DCUPS and FAST. Both variable bit-rate coding and constant bit-rate coding have been considered. For example, Table 3 compares the obtained bit-rate vs. the quantisation parameter Q for the 30 Hz QCIF test sequences Akiyo and Coastguard considering FS and TSS (the quantisation steps were fixed to give near constant quality, therefore producing a variable bit-rate). Table 4 shows the same comparison for the 30 Hz QCIF test sequence Foreman considering DCUPS and FAST.

Table 3
Prototyped system comparison vs. TMN-FS and TMN-TSS (variable bit-rate)

             Bit-rate (Kbits/s) at quantisation factor
             Q = 5     Q = 10    Q = 15    Q = 20
Akiyo
Prototype     61.69     23.07     14        11.31
TMN-FS        59.32     22.56     14.6      11.39
TMN-TSS       75.09     24.91     15.35     11.46
Coastguard
Prototype    419.85    149.12     81.1      49.6
TMN-FS       401.9     144.36     79.9      49.86
TMN-TSS      440.86    160.62     87.48     55.12

Table 4
Prototyped system comparison vs. TMN-DCUPS and TMN-FAST (variable bit-rate)

             Bit-rate (Kbits/s) at quantisation factor
             Q = 10    Q = 15    Q = 20    Q = 25
Foreman
Prototype    126.43     75.32     52.61     41.84
TMN-DCUPS    137.45     82.07     57.64     46.01
TMN-FAST     134.85     80.08     55.88     44.43

The results demonstrate that the proposed coding scheme with the fast predictive ME attains the same high video compression performance as the TMN with the FS and outperforms the other fast algorithms (TSS, DCUPS, FAST). It is worth mentioning that for very low bit-rate applications (e.g. Q = 20 in Table 3) the prototype performs even better than the TMN-FS. The reason is that the predictive ME algorithm tends to find more regular motion fields, which require fewer bits to be coded (the H.263/MPEG standards adopt a differential coding scheme for the MV field). As far as constant bit-rate coding is concerned, Fig. 12 summarises the PSNR results obtained with a 256 Kbit/s channel for a mixed test pattern composed of five pieces of different QCIF sequences (Akiyo, News, Container, Coastguard, Foreman). The performances of both the prototyped codec and the TMN-FS one are presented: the two PSNR curves substantially coincide, also in the first frames after a scene change. The above results validate the effectiveness of the proposed coding scheme even in the case of sudden motion changes. Similar figures have been obtained comparing the prototyped system and the TMN for the other test cases.

Fig. 12. Prototyped system comparison vs. TMN (constant bit-rate).

6. Conclusions

The design of a VLSI architecture for an H.263/MPEG-4

low-power video codec has been proposed in this paper. A co-design approach has been followed to define a HW–SW partitioning based on a RISC-like programmable engine. The ME function has been selected to be implemented as a dedicated coprocessor, based on a low-power and high-efficiency predictive technique which exploits the spatio-temporal correlation of the video motion field. The proposed algorithm achieves the same high coding efficiency as the FS while outperforming other low-complexity ME techniques usually adopted in state-of-the-art single-chip video codecs. An exploration analysis of the communication between HW and SW tasks and of the relevant memory hierarchy organisation completes the architecture definition as a key issue for the power optimisation of the whole system. The above power saving techniques can be applied to any system design for low bit-rate and low-power video coding. Performance measurements on a prototyping platform provide a functional validation of our analysis and confirm the effectiveness of the algorithmic and architectural choices.

Acknowledgements

This work was supported by the Italian National Research Council in the framework of the "5% Microelectronics" research program.

References

[1] H. Igura, Y. Naito, K. Kazama, I. Kuroda, M. Motomura, M. Yamashina, An 800-MOPS, 110-mW, 1.5-V parallel DSP for mobile multimedia processing, IEEE J. Solid State Circuits 33 (11) (1998) 1820–1828.
[2] P. Pirsch, H.-J. Stolberg, VLSI implementations of image and video multimedia processing systems, IEEE Trans. Circuits Syst. Video Technol. 8 (7) (1998) 878–891.
[3] ISO/IEC 14496-2, Generic Coding of Audio-Visual Objects, 1998.

[4] ITU-T Recommendation H.263: Video Coding for Low Bit-rate Communication, 1998.
[5] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer Academic Publishers, Dordrecht, 1999.
[6] H. Ohira, T. Kamemaru, H. Suzuki, K. Asano, M. Yoshimoto, A low power media processor core performable CIF 30fr/s MPEG4/H26X video codec, IEICE Trans. Electronics E84-C (2) (2001) 157–165.
[7] M. Takahashi, M. Hamada, T. Nishikawa, H. Arakida, T. Fujita, F. Hatori, S. Mita, K. Suzuki, A. Chiba, T. Terazawa, F. Sano, Y. Watanabe, K. Usami, M. Igarashi, T. Ishikawa, M. Kanazawa, T. Kuroda, T. Furuyama, A 60-mW MPEG4 video codec using clustered voltage scaling with variable supply-voltage scheme, IEEE J. Solid State Circuits 33 (11) (1998) 1772–1780.
[8] M. Takahashi, T. Nishikawa, M. Hamada, T. Takayanagi, H. Arakida, N. Machida, H. Yamamoto, T. Fujiyoshi, Y. Ohashi, O. Yamagishi, T. Samata, A. Asano, T. Terazawa, K. Ohmori, Y. Watanabe, H. Nakamura, S. Minami, T. Kuroda, T. Furuyama, A 60-MHz 240-mW MPEG-4 videophone LSI with 16-Mb embedded DRAM, IEEE J. Solid State Circuits 35 (11) (2000) 1713–1721.
[9] T. Hashimoto, S. Kuromaru, M. Matsuo, Y. Kohashi, T. Mori-iwa, K. Ishida, S. Kajita, M. Ohashi, M. Toujima, T. Nakamura, M. Hamada, T. Yonezawa, T. Kondo, K. Hashimoto, Y. Sugisawa, H. Otsuki, M. Arita, H. Nakajima, H. Fanujimoto, J. Michiyama, Y. Iizuka, H. Komori, S. Nakatani, H. Toida, T. Takahashi, H. Ito, T. Yukitake, A 90 mW MPEG4 video codec LSI with the capability for core profile, Proceedings of IEEE Solid State Circuits Conference, 2001, pp. 140–141.
[10] M. Harrand, J. Sanches, A. Bellon, J. Bulone, A. Tournier, O. Deygas, J.-C. Herluison, D. Doise, E. Berrebi, A single-chip CIF 30 Hz H.261, H.263, H.263+ video encoder/decoder with embedded display controller, IEEE J. Solid State Circuits 34 (11) (1999) 1627–1633.
[11] J.-M. Kim, Y.-S. Shin, I.-G. Hwang, K.-S. Lee, S.-I. Han, S.-G. Park, S.-I. Chae, A high performance videophone chip with dual multimedia VLIW processor cores, IEICE Trans. Electron. E84-C (2) (2000) 183–192.
[12] ITU, Video codec test model, near term, version 10, TMN10, ITU-T, 1998.
[13] V. Bhaskaran, K. Kostantinides, Image and Video Compression Standards, Kluwer Academic Publishers, Dordrecht, 1997.
[14] W.H. Chen, C.H. Smith, S.C. Fralick, A fast computational algorithm for the discrete cosine transform, IEEE Trans. Commun. 1 (1977) 1004–1009.
[15] ARM9 Thumb family, http://www.ARM.com.
[16] L. Fanucci, R. Saletti, S. Saponara, Parametrized and reusable VLSI macro cells for the low-power realization of 2-D discrete-cosine-transform, Microelectronics J. 32 (12) (2001) 1035–1045.
[17] Z. Xiong, K. Ramchandran, M.T. Orchard, Y.-Q. Zhang, A comparative study of DCT- and wavelet-based image coding, IEEE Trans. Circuits Syst. Video Technol. 9 (5) (1999) 692–695.
[18] T. Hamada, S. Matsumoto, WHT-based composite motion compensated NTSC interframe direct coding, IEEE Trans. Commun. 44 (12) (1996) 1711–1719.
[19] C.E. Lee, J. Vaisey, Comparison of image transforms in the coding of the displaced frame difference for block-based motion compensation, Proceedings of the Canadian Conference on Electrical and Computer Engineering, 1993, pp. 147–150.


[20] K. Kobayashi, M. Eguchi, T. Iwahashi, T. Shibayama, X. Li, K. Takai, H. Onodera, A low-power high-performance vector-pipeline DSP for low-rate videophones, IEICE Trans. Electron. E84-C (2) (2001) 193–201.
[21] M. Ghanbari, The cross-search algorithm for motion estimation, IEEE Trans. Commun. 38 (7) (1990) 950–953.
[22] L. Fanucci, S. Saponara, L. Bertini, Programmable and low power VLSI architectures for full search motion estimation in multimedia communications, Proceedings of IEEE International Conference on Multimedia and Expo, 2000, pp. 1395–1398.
[23] B. Furth, J. Greenberg, R. Westwater, Motion Estimation Algorithms for Video Compression, Kluwer Academic Publishers, Dordrecht, 1997.
[24] R. Li, B. Zeng, M.L. Liou, A new three-step search algorithm for block motion estimation, IEEE Trans. Circuits Syst. Video Technol. 4 (4) (1994) 438–441.
[25] Z.-L. He, C.-Y. Tsui, K.-K. Chan, M.L. Liou, Low-power VLSI design for motion estimation using adaptive pixel truncation, IEEE Trans. Circuits Syst. Video Technol. 10 (5) (2000) 669–677.
[26] Image Processing Lab, University of British Columbia, TMN (H.263+) encoder/decoder, version 3.0, TMN (H.263+) codec, September 1997.
[27] J. Chalidabhongse, C.-C. Kuo, Fast motion vector estimation using multiresolution spatio-temporal correlations, IEEE Trans. Circuits Syst. Video Technol. 7 (3) (1997) 477–488.
[28] K. Lengwehasatit, A. Ortega, A. Basso, A.R. Reibman, A novel computationally scalable algorithm for motion estimation, Proceedings of Visual Communications and Image Processing (VCIP98), 1998.
[29] L. Fanucci, S. Saponara, A. Cenciotti, D. Pau, F. Rovati, D. Alfonso, A VLSI architecture, particularly for motion estimation applications, European patent 00830604.5-2218, filed on 7 September 2000.
[30] S. Wuytack, J.-P. Diguet, F. Catthoor, H. De Man, Formalized methodology for data reuse exploration for low-power hierarchical memory mappings, IEEE Trans. VLSI Syst. 6 (4) (1998) 529–537.
[31] S. Wuytack, F. Catthoor, L. Nachtergaele, H. De Man, Power exploration for data dominated video application, Proceedings of IEEE Symposium on Low Power Design, 1996, pp. 359–364.
[32] T. Givargis, F. Vahid, Interface exploration for reduced power in core-based systems, Proceedings of IEEE Symposium on System Synthesis, 1998, pp. 117–122.
[33] W.-T. Shiue, Optimizing memory bandwidth with ILP based memory exploration and assignment for low power embedded systems, Proceedings of IEEE Workshop on Memory Technology, Design and Testing, 2000, pp. 95–100.
[34] E. Brockmeyer, L. Nachtergaele, F. Catthoor, J. Bormans, H. De Man, Low power memory storage and transfer organization for the MPEG4 full pel motion estimation on a multimedia processor, IEEE Trans. Multimedia 1 (2) (1999) 202–216.
[35] Celoxica, RC1000-PP User Reference Manuals, http://www.celoxica.com.
[36] F. Kossentini, Y. Lee, M. Smith, R. Ward, Predictive RD optimized motion estimation for very low bit-rate video coding, IEEE J. Select. Areas Commun. 15 (9) (1997) 1752–1763.
[37] L. Fanucci, S. Saponara, A. Cenciotti, IP reuse VLSI architecture for low complexity fast motion estimation in multimedia applications, Proceedings of IEEE Euromicro Conference, 2000, pp. 417–424.
