Compressed domain content based retrieval using H.264 DC-pictures




Multimed Tools Appl (2012) 60:443–453 DOI 10.1007/s11042-010-0597-9

Compressed domain content based retrieval using H.264 DC-pictures Mahdi Mehrabi & Farzad Zargari & Mohammad Ghanbari

Published online: 14 September 2010 © Springer Science+Business Media, LLC 2010

Abstract A fast and simple method for content based retrieval using the DC-pictures of H.264 coded video, without full decompression, is presented. Compressed domain retrieval is very desirable for content analysis and retrieval of compressed images and video. Although DC-pictures are among the most widely used compressed domain indexing and retrieval methods for pre-H.264 coded videos, they are not generally used with H.264 coded video. This is due to two main facts: first, I-frames in the H.264 standard are spatially predictively coded, and second, the H.264 standard employs an integer discrete cosine transform (DCT). In this paper we apply a color histogram indexing method to the DC-pictures derived from H.264 coded I-frames. Since the method is based on independently coded I-frame pictures, it can be used either for video analysis of H.264 coded videos or for image retrieval of I-frame based coded images, such as advanced image coding. The retrieval performance of the proposed algorithm is compared with that of fully decoded images. Simulation results indicate that the performance of the proposed method is very close to that of fully decompressed image systems, while its computational load is much lower.

Keywords Compressed domain image indexing and retrieval · DC-picture · H.264 video coding standard · Color histogram

M. Mehrabi, Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran (e-mail: [email protected])
F. Zargari (corresponding author), Information Technology Research Institute of Iran Telecom Research Center (ITRC), Tehran, Iran (e-mail: [email protected])
M. Ghanbari, School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK (e-mail: [email protected])


1 Introduction

Visual information has been expanding rapidly in recent years, and effective retrieval of visual data according to its visual content is a challenging research issue. Since manipulating visual information requires large amounts of storage capacity and processing power, there is a need to efficiently index and retrieve visual information in multimedia applications. Content based retrieval (CBR) was introduced for managing image and video libraries. In content based image retrieval (CBIR), various image features such as color, texture and shape are used for retrieving images from image libraries. Video retrieval uses the same features as image retrieval, along with temporally related features in video sequences. Digital image and video libraries usually store visual information in compressed form. Retrieval normally needs uncompressed data, so in these libraries an unwanted decompression stage increases the search time and complexity. On the other hand, retrieval techniques that apply directly to compressed data are faster and preferable in terms of computational cost and retrieval time, particularly for real-time applications. A survey of features used in compressed domain image and video retrieval is presented in [17]. The most common video coding standards are those of the MPEG and H.26x families. These video coding standards employ hybrid coding of a DCT block based transform and motion compensation. They also employ intra and inter frame coded pictures for refreshment and easy access. The intra coded pictures (I-frames) are coded independently, without reference to other frames. The inter frame coded pictures (P and B frames) are coded with reference to other frames using motion compensated prediction.
Even though in previous versions of these video coding standards I-frames are coded similarly to the JPEG image coding standard, in the recently introduced H.264/AVC standard they are coded with spatial prediction. Since all these video coding standards, and also the JPEG image coding standard, are DCT block based, the DCT coefficients of the coded pictures are widely used to generate compressed domain feature vectors from the compressed visual data. An important technique for fast and easy access and manipulation of compressed visual data is to construct a lower quality picture, i.e. a DC-picture, instead of performing the inverse DCT transform and full decompression. The DC coefficients of the DCT transform of the blocks are used to produce the DC-pictures. Although a DC-picture can be constructed by averaging over the pixels, that requires full decoding of the pictures. Constructing DC-pictures directly from the compressed video is more desirable, because the inverse DCT transform constitutes a large portion of the decoding process of H.26x coded video. The average of a block is an approximation of its pixels; hence, by replacing the blocks with their average values, an approximated picture can be constructed for fast access to the content of the original picture. This picture is a lower quality resemblance of the original picture. DC-pictures are used in various image and video analysis and retrieval applications [1–3, 5, 6, 9, 12, 13, 16]. A DC-picture is used in [3] to extract the color histogram, which is one of the most important feature vectors in image retrieval, and it is shown that the DC value of 2×2 blocks gives the best performance in making color histograms for retrieving color images [3]. The authors in [3] and [4] proposed a method for extracting the DC coefficients of small sub-blocks from the DCT coefficients of larger blocks, to access the DC-pictures directly from DCT based coded videos or images. The performance of this method for extracting the DC values of 2×2 sub-blocks from the DCT coefficients of 8×8 blocks is given in [3]. Since the method in [3] and [4] is designed for the non-integer DCT codecs


prior to the H.264 coding standard, it is inappropriate for extracting the DC coefficients of small sub-blocks in the H.264 standard. This is because in the H.264 standard the DCT transform, and also its inverse, are carried out with an integer transform. Moreover, the H.264 standard employs spatial prediction for the coding of I-frames, one of the innovations introduced in the H.264 standard. This paper introduces a compressed domain image retrieval and indexing method based on the color histogram of the DC-pictures derived from the I-frames of H.264 coded video. The proposed method can be used for I-frames in the H.264 video coding standard. In addition to H.264 coded video, the proposed method can also be used with I-frame based image coding, such as advanced image coding (AIC) or modified advanced image coding (M-AIC) [18]. The proposed DC-picture based method for indexing and retrieval of I-frames in H.264 coded video has two important features. First, the processing of an I-frame is independent of the other frames in a group of pictures (GOP); hence the proposed descriptor can be extracted easily and rapidly from the coded video. Second, in video coding methods which use I-frames to code the key frames [14], or in compressed domain video indexing and retrieval methods which use I-frames as the best candidates for key frames [7], the proposed method can be used for the analysis of key frames in video analysis applications. The rest of the paper is organized as follows. In section 2 the proposed method is introduced. The performance and computation time of the proposed method are evaluated in section 3, and the paper ends with concluding remarks in section 4.

2 The proposed method

This section provides a short overview of the inverse integer DCT transform and dequantizer in the H.264 standard, to the extent required to follow the discussions in the rest of the paper. More detailed explanations of the inverse integer transform and dequantizer can be found in [10]. In the Baseline, Main and Extended profiles of the H.264 video coding standard the integer DCT transform is performed on 4×4 blocks. The inverse 4×4 DCT transform in H.264 is defined as:

$$Z = C_f^T\,(Y \otimes P)\,C_f \qquad (1)$$

where Y is the dequantized coefficients matrix, ⊗ indicates element by element matrix multiplication, C_f and P are integer DCT inverse transform matrices, C_f^T represents the transpose of C_f, and Z is the decoded block. The matrix C_f in the H.264 standard is defined as:

$$C_f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1/2 & -1/2 & -1 \\ 1 & -1 & -1 & 1 \\ 1/2 & -1 & 1 & -1/2 \end{bmatrix} \qquad (2)$$

Although the way the DCT transform is implemented in the H.264 standard reduces its computational cost compared with the non-integer DCT transform of pre-H.264 standards, the DCT transform in the H.264 standard is still among the most computationally expensive stages of the coding or decoding process, because it must be applied to a large number of blocks. In this paper we aim to extract the averages of 2×2 sub-blocks from the DCT


coefficients of H.264 coded 4×4 blocks and then apply the retrieval methods to the resulting approximated images. Consider Z in (3), which represents a 4×4 block decoded from an H.264 coded block given by (1). Matrix M in (4) can extract the averages of the 2×2 sub-blocks of Z through the operation given in (5).

$$Z = \begin{bmatrix} x_{00} & x_{01} & x_{02} & x_{03} \\ x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \end{bmatrix} \qquad (3)$$

$$M = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \qquad (4)$$

$$DC_2 = \tfrac{1}{4}\,(M Z M^T) = \begin{bmatrix} \tfrac{x_{00}+x_{01}+x_{10}+x_{11}}{4} & 0 & \tfrac{x_{02}+x_{03}+x_{12}+x_{13}}{4} & 0 \\ 0 & 0 & 0 & 0 \\ \tfrac{x_{20}+x_{21}+x_{30}+x_{31}}{4} & 0 & \tfrac{x_{22}+x_{23}+x_{32}+x_{33}}{4} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \qquad (5)$$

Since Z is also equal to the left side of the equality in (1), we can use the right hand side of (1) instead of Z in (5):

$$DC_2 = \tfrac{1}{4}\left(M C_f^T (Y \otimes P)\, C_f M^T\right) = \tfrac{1}{4}\left(N' (Y \otimes P)\, N'^T\right) \qquad (6)$$

where N′ = M C_f^T, and the matrix N′ is calculated as:

$$N' = M C_f^T = \begin{bmatrix} 2 & 3/2 & 0 & -1/2 \\ 0 & 0 & 0 & 0 \\ 2 & -3/2 & 0 & 1/2 \\ 0 & 0 & 0 & 0 \end{bmatrix} \qquad (7)$$

Since in (6) the elements of DC_2 are the averages of 2×2 sub-blocks, given the coefficients Y of a 4×4 H.264 coded block, the averages of its 2×2 decoded sub-blocks can be calculated using (6). The operation in (6) can be further simplified through matrix factorization. The matrix N′ can be decomposed into N and K as:

$$N' = \begin{bmatrix} 2 & 3/2 & 0 & -1/2 \\ 0 & 0 & 0 & 0 \\ 2 & -3/2 & 0 & 1/2 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 4 & 1 & 0 & -1 \\ 0 & 0 & 0 & 0 \\ 4 & -1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1/2 & 0 & 0 & 0 \\ 0 & 3/2 & 0 & 0 \\ 0 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1/2 \end{bmatrix} = NK \qquad (8)$$

Now, replacing N′ in (6) with its equivalent product NK, and using the associative property of matrix multiplication, the operations in (6) can be rewritten as:

$$DC_2 = \tfrac{1}{4}\left(N K (Y \otimes P)\, K^T N^T\right) = N\left(\tfrac{1}{4}\,K (Y \otimes P)\, K^T\right) N^T = N\left(\left(\tfrac{1}{4}\,K Y K^T\right) \otimes P\right) N^T \qquad (9)$$


Since K is a diagonal matrix, multiplication of Y by K from the left and right can be written as an element by element multiplication of Y by K′:

$$DC_2 = N\left(\tfrac{1}{4}\,(Y \otimes K') \otimes P\right) N^T = N\left(Y \otimes \left(\tfrac{1}{4}\,K' \otimes P\right)\right) N^T \qquad (10)$$

where K′ is:

$$K' = \tfrac{1}{4} \begin{bmatrix} 1 & 3 & 1 & 1 \\ 3 & 9 & 3 & 3 \\ 1 & 3 & 1 & 1 \\ 1 & 3 & 1 & 1 \end{bmatrix} \qquad (11)$$

If we select P′ as:

$$P' = \tfrac{1}{4}\,K' \otimes P \qquad (12)$$

then:

$$DC_2 = N\,(Y \otimes P')\,N^T \qquad (13)$$
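The factorized extraction above can be checked numerically. The sketch below (a minimal verification assuming NumPy; the dequantizer scaling matrix P depends on the quantization parameter, so an arbitrary positive matrix stands in for it, since the identity holds element by element for any P) compares the shortcut against the full inverse transform followed by pixel-domain averaging:

```python
import numpy as np

# H.264 4x4 inverse integer transform matrix (Eq. 2): Z = Cf^T (Y*P) Cf
Cf = np.array([[1,    1,    1,    1],
               [1,  0.5, -0.5,   -1],
               [1,   -1,   -1,    1],
               [0.5,  -1,    1, -0.5]])

# averaging matrix M (Eq. 4): sums the four 2x2 sub-blocks
M = np.array([[1, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0]], dtype=float)

# sparse factor N and element-wise factor K' from Eqs. (8) and (11)
N = np.array([[4,  1, 0, -1],
              [0,  0, 0,  0],
              [4, -1, 0,  1],
              [0,  0, 0,  0]], dtype=float)
Kp = 0.25 * np.array([[1, 3, 1, 1],
                      [3, 9, 3, 3],
                      [1, 3, 1, 1],
                      [1, 3, 1, 1]])

rng = np.random.default_rng(0)
Y = rng.integers(-64, 64, size=(4, 4)).astype(float)  # hypothetical coefficients
P = rng.uniform(0.5, 2.0, size=(4, 4))                # stand-in dequant scaling

# reference route: full inverse transform, then pixel-domain 2x2 averages
Z = Cf.T @ (Y * P) @ Cf
ref = 0.25 * (M @ Z @ M.T)

# shortcut route (Eq. 13): fold 1/4 * K' into the dequantizer as P'
Pp = 0.25 * Kp * P
dc2 = N @ (Y * Pp) @ N.T

assert np.allclose(dc2, ref)
```

Both routes yield the same four sub-block averages, at positions (0,0), (0,2), (2,0) and (2,2) of DC_2, while the shortcut never touches the dense matrix C_f.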

The right hand side of (13) represents the final method of obtaining the averages of 2×2 sub-blocks from the coefficients of a 4×4 integer DCT transformed block. Since (13) represents the inverse DCT transform in the same form as (1), the matrix P′ can be combined into the dequantizer, resulting in a new dequantization table similar to the dequantization table for P in (1). This means that the proposed method is compatible with the H.264 quantizer. The difference between (13) and (1) lies in the matrices N and C_f. Since most of the elements of N are zero, the proposed method has a much lower computational load than the inverse transform in (1). In the proposed compressed domain retrieval method for H.264 coded I-frames, the DCT coefficients are extracted from the compressed video file. Then, using (13), the DC value of each 2×2 sub-block of a 4×4 block is calculated. These DC values are up-sampled (Fig. 1) to produce an approximation of the 4×4 blocks, which are the residues of the intra predicted blocks in the original image. Hence, we add the resulting approximation of the up-sampled 4×4 residue blocks to the spatially predicted values to obtain an approximated resemblance of each coded block in an I-frame. In this way we generate a lower quality DC-picture without the inverse DCT and full decompression of the coded video. We use the color components of the resulting DC-picture to extract the color histogram feature vector of the coded I-frame without full decompression.

Fig. 1 Up sampling DC values for 4×4 blocks
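The up-sampling step of Fig. 1 simply replicates each 2×2 sub-block average over its sub-block. A minimal sketch, assuming NumPy and hypothetical average values:

```python
import numpy as np

# DC_2 from Eq. (13) leaves the four sub-block averages at positions
# (0,0), (0,2), (2,0) and (2,2); gather them into a compact 2x2 array
dc2 = np.array([[10., 0., 20., 0.],
                [ 0., 0.,  0., 0.],
                [30., 0., 40., 0.],
                [ 0., 0.,  0., 0.]])
dc = dc2[::2, ::2]                      # [[10., 20.], [30., 40.]]

# nearest-neighbour up-sampling back to a 4x4 residue approximation
up = np.kron(dc, np.ones((2, 2)))
```

The resulting 4×4 array stands in for the residue block, to which the spatially predicted values are then added.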


In this retrieval method we use the color histogram feature vector introduced in [15]. The color histogram feature vector for a color image comprising three eight-bit color components A, B and C for each pixel P_ij is calculated as:

$$H(a_k, b_m, c_n) = \sum_{i=1}^{h} \sum_{j=1}^{w} f(P_{ij}) \qquad (14)$$

where P_ij is the pixel at the i-th row and j-th column of the image, and h and w are the picture height and width, respectively. The function f(P_ij) is defined as follows:

$$f(P_{ij}) = \begin{cases} 1 & \text{if } a_k \le A(P_{ij}) < a_{k+1} \ \text{and}\ b_m \le B(P_{ij}) < b_{m+1} \ \text{and}\ c_n \le C(P_{ij}) < c_{n+1} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

where a_k, b_m and c_n are the decision boundaries of the A, B and C color components, and A(P_ij), B(P_ij) and C(P_ij) are the Y, Cb and Cr color components of pixel P_ij, respectively. The number of decision boundaries should differ among the three color components, because uniform quantization of perceptually non-uniform color spaces may be problematic [8]. Since H.264 uses the YCbCr color space, and the chromatic components, which vary slowly in a picture, are used along with the achromatic component for retrieval, we choose five bins for A (the Y component) and 16 bins each for B (the Cb component) and C (the Cr component):

$$a_k = k \cdot (255/5), \quad k = 0, 1, \ldots, 4 \qquad (16)$$

$$b_m = m \cdot (255/16), \quad c_n = n \cdot (255/16), \quad m, n = 0, 1, \ldots, 15 \qquad (17)$$

The intersection of two histograms H1 and H2 is calculated as:

$$S = \sum_{k=0}^{4} \sum_{m=0}^{15} \sum_{n=0}^{15} \min\left(H_1(a_k, b_m, c_n),\ H_2(a_k, b_m, c_n)\right) / (w \cdot h) \qquad (18)$$

S is a number in the range [0, 1] and is the measure used to represent the similarity between the color histograms of two images. Since there are five bins for the Y component and 16 bins each for the Cb and Cr components, the proposed color histogram is a three dimensional histogram comprising 5 × 16 × 16 = 1280 bins. In the next section we present the performance evaluation of the proposed DC-picture based color histogram retrieval method.
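The 1280-bin histogram of (14)–(17) and the intersection measure of (18) can be sketched as follows (assuming NumPy; the function names and the toy 8×8 component arrays are hypothetical stand-ins for a DC-picture):

```python
import numpy as np

def color_histogram(y, cb, cr):
    """Joint 5x16x16 YCbCr histogram of Eqs. (14)-(17)."""
    # bin indices from the uniform decision boundaries a_k, b_m, c_n
    a = np.minimum(y.astype(np.int64) // 51, 4)          # 5 Y bins, a_k = k*(255/5)
    b = np.minimum(cb.astype(np.int64) * 16 // 255, 15)  # 16 Cb bins
    c = np.minimum(cr.astype(np.int64) * 16 // 255, 15)  # 16 Cr bins
    hist = np.zeros((5, 16, 16), dtype=np.int64)
    np.add.at(hist, (a.ravel(), b.ravel(), c.ravel()), 1)
    return hist

def similarity(h1, h2, w, h):
    """Histogram intersection of Eq. (18); a value in [0, 1]."""
    return np.minimum(h1, h2).sum() / (w * h)

# toy 8x8 "DC-picture" components
rng = np.random.default_rng(1)
y, cb, cr = (rng.integers(0, 256, (8, 8), dtype=np.uint8) for _ in range(3))
h1 = color_histogram(y, cb, cr)
assert h1.sum() == 64 and similarity(h1, h1, 8, 8) == 1.0
```

An image compared against itself yields S = 1, since every pixel falls into the same bin in both histograms.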

3 Performance evaluation

The proposed indexing method is used in a query by example scheme for retrieving images from the Washington University image database, which includes 1330 color images. Each image in the database was coded as an I-frame using the joint video team (JVT) H.264 encoder, with conventional encoder settings: a quantization parameter of 26, dispersed macroblock order, and the prediction mode of each block optimized to minimize the residual prediction error. We extracted the color histograms of the coded images by two different methods. In the first method, called method 1 hereafter, we used the standard H.264 decoder to decode the coded images and computed the color histogram of each image according to (14). In


Fig. 2 The left column contains the query pictures; (a) shows the ranked retrieved pictures for the left query picture using method 1, and (b) shows those of the proposed method

Table 1 Percentage of relevant pictures in the ranked retrieved lists of the two methods

          Method 1    The proposed method
Rank 1    100%        100%
Rank 2    90%         82%
Rank 3    73%         73%
Rank 4    73%         73%
Rank 5    45%         36%
Rank 6    18%         27%


Table 2 MAP for the retrieval experiments of the two methods

        Method 1    The proposed method
MAP     0.84        0.81

the second method, called the proposed method hereafter, we extracted DC-pictures without full decompression of the coded images and used the DC-pictures to generate the color histograms using (14). In this way, there are two histograms for each coded image in the data set. To evaluate the performance of the proposed retrieval method, a sample query set of 30 images selected from the Washington data set was used. We used the color histograms produced in the previous stage and the histogram similarity metric (18) to find the first ten retrieved images for each query image in the sample query set. We made two ranked retrieval lists for each query image, one using method 1 (full decompression) and the other using the proposed method (DC-picture). Figure 2 shows the first 4 retrieved images using the two retrieval methods. Note that the first retrieved image in each retrieval list is the query image itself. To compare the performance of the proposed method with method 1, we calculated, for each rank, the percentage of relevant retrieved images over the ranked retrieval lists of all query images (Table 1). The results in Table 1 indicate that the retrieval performance of the proposed method is very close to that of method 1 at each rank. Furthermore, to evaluate the overall performance, we used the well known Mean Average Precision (MAP) metric, which is used by TRECVID. MAP provides a single-figure measure of retrieval quality over a query set [11]. MAP is calculated as:

$$MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk}) \qquad (19)$$

where Q is the query set, m_j is the number of relevant images in the database for the j-th query, R_jk is the rank of the k-th retrieved relevant image in the ranked retrieval list of the j-th query image, and Precision(x) is the precision at the x-th retrieved image, as defined in (20). It is worth noting that MAP ranges from 0 to 1, and higher MAP values indicate a better retrieval performance.

$$\mathrm{Precision}(x) = \frac{\#\ \text{relevant retrieved images up to rank } x}{\#\ \text{retrieved images up to rank } x} \qquad (20)$$
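As a concrete illustration of (19) and (20) (a hypothetical toy sketch, not the paper's actual query set), the average precision of a single query with relevant images at ranks 1 and 3, out of m_j = 2 relevant images, is (1/1 + 2/3)/2:

```python
def precision_at(ranked_relevance, x):
    # Eq. (20): fraction of relevant images among the top-x retrieved
    return sum(ranked_relevance[:x]) / x

def mean_average_precision(queries):
    # Eq. (19): queries is a list of (ranked_relevance, m_j) pairs, where
    # ranked_relevance[i] is 1 if the (i+1)-th retrieved image is relevant
    total = 0.0
    for ranked, m in queries:
        hits = [i + 1 for i, r in enumerate(ranked) if r]  # ranks R_jk
        total += sum(precision_at(ranked, r) for r in hits) / m
    return total / len(queries)

# one toy query, relevant images retrieved at ranks 1 and 3
ap = mean_average_precision([([1, 0, 1, 0], 2)])  # (1/1 + 2/3)/2, about 0.833
```

With a single query, MAP reduces to that query's average precision; over a query set, (19) averages these values.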

The MAP values for the tested retrieval methods on the two decoded databases are listed in Table 2. As Table 2 indicates, the overall performance of the proposed method is very close to the results derived from full decompression in method 1. Moreover, we measured the processing time for extracting the feature vectors from the DCT coefficients of the coded images on a 2.26 GHz PC with 1 GB of RAM. The average processing times of the tested methods are tabulated in Table 3, which indicates that, on average, the proposed method requires less than 39% of the processing time of method 1 to calculate the feature vector.

Table 3 Average processing time required for extracting the color histogram from the DCT coefficients

                                         Method 1    The proposed method
Average processing time (millisecond)    562         219


4 Conclusion

In this paper we introduced a novel compressed domain image retrieval method for I-frame coded images of the H.264 standard. This method can be applied either to I-frames of H.264 coded videos or to images coded by techniques such as advanced image coding (AIC) and modified advanced image coding (M-AIC) [18], which use intra frame block prediction. The proposed method uses the color histogram of DC-pictures for visual information retrieval in compressed domain image retrieval and video analysis applications. Simulation results indicate that the proposed method reduces the computation time to less than 39% of that of the full decompression method, while its retrieval performance is very close to the method that uses color histograms derived from fully decoded images. The low complexity and computation time of the proposed method, together with the small reduction in retrieval performance compared to full decompression, indicate that it can serve as a fast and simple way to extract content-preserving DC-pictures of H.264 coded videos without full decompression. Hence, the proposed compressed domain indexing method is an effective image retrieval and indexing method and can be used for the retrieval of AIC coded images and the analysis of H.264/AVC coded videos in various applications.

References

1. Divakaran A, Vetro A, Asai K, Nishikawa H (2000) Video browsing system based on compressed domain feature extraction. IEEE Trans Consum Electron 46(3):637–644
2. Feng Y, Fang H, Jiang J (2005) Region growing with automatic seeding for semantic video object segmentation. Lect Notes Comput Sci 3687:542–549
3. Jiang J, Armstrong A, Feng GC (2002) Direct content access and extraction from JPEG compressed images. Pattern Recognit 35:2511–2519
4. Jiang J, Feng G (2002) The spatial relationship of DCT coefficients between a block and its sub-blocks. IEEE Trans Signal Process 50(5):1160–1169
5. Jiang J, Weng Y, Jie P, Li C (2006) Dominant colour extraction in DCT domain. Image Vis Comput 24:1269–1277
6. Joyce RA, Liu B (2000) Temporal segmentation of video using frame and histogram space. In: Proc. 2000 International Conference on Image Processing, vol. 3, pp. 941–944
7. Kobla V, Doermann D, Lin K-I (1996) Archiving, indexing, and retrieval of video in the compressed domain. In: Proc. SPIE Conf. on Multimedia Storage and Archiving Systems, SPIE 2916:78–89
8. Lee SM, Xin JH, Westland S (2005) Evaluation of image similarity by histogram intersection. Color Res Appl 30(4):265–274
9. Li X, Zhang M, Zhu Y, Xin J (2009) A novel RS-based key frame representation for video mining in compressed-domain. In: Second International Workshop on Knowledge Discovery and Data Mining, IEEE Computer Society, pp. 199–201
10. Malvar HS, Hallapuro A, Karczewicz M, Kerofsky L (2003) Low-complexity transform and quantization in H.264/AVC. IEEE Trans Circuits Syst Video Technol 13(7):598–603
11. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
12. Qian X, Liu L, Su R (2006) Effective fades and flashlight detection based on accumulating histogram difference. IEEE Trans Circuits Syst Video Technol 16(10):1245–1258
13. Seo K-D, Park S, Jung S-H (2009) Wipe scene-change detector based on visual rhythm spectrum. IEEE Trans Consum Electron 55(2):831–838
14. Shu-long Z, Zhi-sheng Y, Shi-yong L, Xin Z (2007) An improved video compression algorithm for lane surveillance. In: Fourth International Conference on Image and Graphics (ICIG), pp. 224–229
15. Swain M, Ballard D (1991) Color indexing. Int J Comput Vis 7:11–32
16. Tavanapong W, Zhou J (2004) Shot clustering techniques for story browsing. IEEE Trans Multimedia 6(4):517–527
17. Wang H, Divakaran A, Vetro A, Chang S-F, Sun H (2003) Survey of compressed-domain features used in audio-visual indexing and analysis. J Vis Commun Image Represent 14:150–183


18. Zhang Z, Veerla R, Rao KR (2008) Modified advanced image coding. In: Proc. international conference on complexity and intelligence of the artificial and natural complex systems, medical applications of the complex systems, biomedical computing, pp. 110–116

Mahdi Mehrabi received his B.Sc. degree in Electronics & Telecommunication Engineering from Tehran University, Tehran, Iran and his M.Sc. degree in Telecommunication Engineering (with distinction) from Shiraz University, Shiraz, Iran. He is currently a Ph.D. candidate in the Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran. He is also a lecturer at the Azad University of Shiraz, Shiraz, Iran. His research interests include computer vision and image processing, multimedia information retrieval, and multimedia over IP.

Farzad Zargari received his B.Sc. degree in Electrical Engineering from Sharif University of Technology and his M.Sc. and Ph.D. degrees in Electrical Engineering from University of Tehran, all in Tehran, Iran. He is currently a research associate in the information technology research institute of Iran Telecom Research Center (ITRC), Ministry of Telecommunications and Information Technology of Iran. His research interests include multimedia systems, image and video signal processing algorithms, and hardware implementation of image and video codecs.


Mohammad Ghanbari is Professor of video networking at the University of Essex, United Kingdom, and is best known for his pioneering work on two-layer video coding for ATM networks (which earned him an IEEE Fellowship in 2001), now known as SNR scalability in the standard video codecs. He has registered eleven international patents on various aspects of video networking and was the co-recipient of the A.H. Reeves prize for the best paper published in the 1995 Proceedings of the IET on the theme of digital coding. He is the author of five books, and his book An Introduction to Standard Codecs received the year 2000 best book award from the IET. Prof. Ghanbari has authored or co-authored about 500 journal and conference papers, many of which have had a fundamental influence on the field of video networking. He served as an Associate Editor of the IEEE Transactions on Multimedia (IEEE T-MM) from 1998 to 2004.
