Discriminative optical flow tensor for video semantic analysis


Computer Vision and Image Understanding 113 (2009) 372–383. doi:10.1016/j.cviu.2008.08.007


Xinbo Gao (a), Yimin Yang (a), Dacheng Tao (b,*), Xuelong Li (c)

(a) School of Electronic Engineering, Xidian University, Xi’an 710071, China
(b) School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Blk N4, Singapore 639798
(c) School of Computer Science and Information Systems, Birkbeck College, University of London, London WC1E 7HX, UK

Article history: Received 29 September 2007; accepted 18 August 2008; available online 29 August 2008.

Keywords: Optical flow tensor; Video semantic analysis; General tensor discriminant analysis; Linear discriminant analysis; Hidden Markov models

Abstract

This paper presents a novel framework for effective video semantic analysis. The framework has two major components, namely, the optical flow tensor (OFT) and hidden Markov models (HMMs). OFT and HMMs are employed because: (1) motion is one of the fundamental characteristics reflecting the semantic information in video, so an OFT-based feature extraction method is developed to make full use of the motion information; thereafter, to preserve the structural and discriminative information presented by the OFT, general tensor discriminant analysis (GTDA) is used for dimensionality reduction, and linear discriminant analysis (LDA) is utilized to further reduce the feature dimension for discriminative motion information representation; and (2) video is an information-intensive sequential medium characterized by its context-sensitive nature, so video sequences can be more effectively analyzed by temporal modeling tools. In this framework, we use HMMs to model different levels of semantic units (SU), e.g., shot and event. Experimental results are reported to demonstrate the advantages of the proposed framework in the semantic analysis of basketball video sequences, and cross validations illustrate its feasibility and effectiveness.

© 2008 Elsevier Inc. All rights reserved.

1. Introduction

Characterized by its extraordinary expressive power, video is increasingly becoming the favorite medium for many applications. With the fast development of the internet and high-capacity storage devices, large-scale digital video databases are emerging. However, it is time-consuming and tedious to index and retrieve video clips of interest by browsing a huge database. Therefore, video content analysis and summarization, which aims to provide concise and informative video summaries that help people browse and retrieve desired video clips efficiently, has attracted more and more attention. Video semantic analysis, as an integral step of video summarization, retrieval and management, has become a fundamental research topic in video processing. The existing work on video semantic analysis generally falls into three categories, namely, video genre classification, shot classification, and event detection (where shot and event are considered as two kinds of semantic units (SU) in our application).

[Footnote] This work was partially supported by the Program for New Century Excellent Talents in University (NCET-04-0948), the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT0645), the National Natural Science Foundation of China (Nos. 60771068, 60702061), the Key Lab of ATR, Shenzhen University, China, and the Competitive Research Grants at the Hong Kong Polytechnic University (under Projects No. A-PH42 and A-PC0A).
* Corresponding author. Fax: +852 2764 2528. E-mail addresses: [email protected], [email protected] (D. Tao).


In terms of methodology, these algorithms can be classified into three types. The first is based on visual features, including color, texture, shape, motion, etc. Ekin et al. [1] proposed a soccer analysis scheme including novel low-level video processing algorithms, such as dominant color region detection, robust shot boundary detection, and shot classification, as well as higher-level algorithms for goal detection, referee detection, and penalty-box detection. The second performs semantic analysis by jointly using visual and audio cues. For example, Li et al. [2] proposed a movie parsing and indexing approach that analyzes both audio and visual sources and accounts for their interrelations to extract high-level semantic cues. The third is based on existing pattern recognition tools, for example, the Gaussian mixture model (GMM) [3], support vector machine (SVM) [4], Bayesian model (BM) [5,6], and hidden Markov model (HMM) [7–9].

The HMM, a statistical model for dealing with hidden states and observations, has been successfully applied to speech recognition [10], because a speech signal can be precisely modeled by a Markov process. Similar to speech signals, a video semantic unit, e.g., a shot or event, naturally forms a Markov process, so the HMM is adopted as a powerful tool for video content analysis. Niu et al. [11] proposed an HMM-based approach that uses thresholds and voting to automatically segment and recognize complex activities. Layered HMMs were used by Barnard and Odobez to recognize events at different time scales in sports videos [12]. In [7], Xu


et al. proposed an HMM-based framework as a general solution to video semantic analysis. In that work, a motion representation scheme was given using a motion vector-based energy distribution function and three motion filters, and a multi-HMM framework was constructed to detect the shots and events in videos. However, that motion information fails to reflect real event contents because it is very sensitive to global motion and therefore cannot reliably detect local motion changes. The typical application of HMMs for semantic analysis is to use motion information and other features as observations.

Compared with the color, shape, and texture features frequently used in video content analysis, motion is an essential characteristic that captures the rich dynamic content of a video sequence. Motion analysis is an important topic in computer vision research due to its promising applications in many areas, e.g., visual surveillance, perceptual user interfaces, and content-based image [36] and video retrieval [13–15]. Among the various motion estimation approaches, optical flow provides a 2D estimate of apparent velocities based on pixel intensity values across a group of adjacent frames [16]. Conventional optical flow methods can be classified into four categories [17]: differential techniques [18], region-based matching [19], energy-based approaches [20], and phase-based approaches [21]. Among these four categories, the differential (gradient-based) techniques, which compute optical flow from the spatio-temporal derivatives of image intensity, are the best known and most widely used.

The original optical flow fields often have extraordinarily high dimensionality, which varies with the size of the video frame. To reduce the feature space to a size suitable for training while preserving the original discriminative information, a suitable feature space representation scheme is of great importance, especially when confronting the small sample size problem. Tensor representation helps to reduce the small sample size problem in discriminative subspace selection, mainly because the structural information of objects in computer vision research is a reasonable constraint that reduces the complexity of a learning model. Therefore, our previous work [22] applied this information to vector-based learning and developed the supervised tensor learning (STL) framework, which accepts tensors directly as input. Moreover, to preserve the discriminative information of training tensors from different classes, the general tensor discriminant analysis (GTDA) technique [23] was proposed as a preprocessing step for linear discriminant analysis (LDA), a conventional classification method. Apart from GTDA and STL, discriminative tensor rank-one projections [30,31], concurrent subspace analysis [32], and discriminative tensor subspace analysis [33] can also be considered as alternatives. Moreover, correlation tensor analysis [34] and discriminant simplex analysis [35] are both useful subspace learning algorithms.

In this paper, we utilize tensors to provide a new representation of motion features based on optical flow fields. To obtain a more concise and statistical form of the optical flow fields, optical flow histograms are formed and then constructed into tensors. The optical flow tensors (OFTs) are then analyzed by the GTDA and LDA methods to generate the final feature vectors, which serve as the input of the HMMs.
Based on the OFT and HMMs, we present a new video semantic analysis system, called the OFT–HMMs framework. It is composed of two major procedures: one is the OFT-based feature extraction, and the other is the HMMs-based video semantic analysis. In our application, the HMMs are used to model the semantic units (SU), e.g., shot and event, and are put to use in real-world applications such as shot classification and event recognition. The rest of this paper is organized as follows. In Section 2, the OFT–HMMs framework is presented, including the construction of

OFT, the discriminative subspace selection procedure and the working principles of HMMs. In Section 3, the video content analysis method is developed based on the proposed OFT–HMMs framework for shot classification and event recognition. The experimental results with application to basketball video analysis are given in Section 4. Section 5 concludes.

2. An OFT–HMMs-based framework for video semantic analysis

In this section, the OFT–HMMs framework, shown in Fig. 1, is discussed in detail. In this framework, the semantic units are pre-defined according to the domain-specific knowledge of the practical application, and the training samples are selected in advance for HMM modeling. The feature extraction procedure, including OFT construction and the subsequent subspace selection, is described in Sections 2.1 and 2.2, respectively. Finally, the HMMs training procedure is given in Section 2.3.

2.1. Optical flow tensor

2.1.1. Optical flow calculation

Motion is a prime source of information for presenting video contents in our environment. Therefore, accurate techniques for estimating the velocity field (optical flow field) are indispensable components of many vision applications. Horn and Schunck [16] were the first to develop a technique based on computing spatio-temporal differences from image sequences, which has spurred the development of a wide range of techniques and approaches for optical flow field estimation. In a review paper, Barron et al. [17] evaluated nine different techniques representative of the various approaches, namely the differential, matching, energy-based, and phase-based ones. They tested these algorithms on several standard video sequences, and one of their conclusions was that the phase-based technique of Fleet and Jepson [21] and the differential technique of Lucas and Kanade [18] produced the most accurate results overall.

The well-known Lucas–Kanade method is particularly attractive for its high accuracy and efficient implementation. However, as a gradient-based approach, it often fails when facing large motions due to its weak initial estimate. To solve this problem, a hierarchical Lucas–Kanade (HLK) method is used, based on a set of images derived from the original video frames at several resolutions in a pyramid structure. The approximate optical flow is first calculated between the images at the lowest resolution, and a more precise estimate is obtained by calculating between the images at a higher resolution. The processing is successively repeated until the optical flow is calculated between the images at the highest resolution. Owing to its good performance, the HLK method is utilized for the calculation of optical flow in our approach.

A concise representation of the HLK optical flow estimation method is shown in Fig. 2. Image pyramids, multi-resolution representations that provide a range of coarse-to-fine views of an image, are first produced. Then, each frame is pre-smoothed by a spatio-temporal filter to diminish noise and aliasing effects in the gradient estimates. The intensity derivatives, $\partial I/\partial x$, $\partial I/\partial y$ and $\partial I/\partial t$, are typically computed using a 5-tap filter. The velocity vector $(v_x, v_y)$ is then computed for each pixel by solving the $2 \times 2$ linear system:

$$\begin{bmatrix} \sum w \left(\dfrac{\partial I}{\partial x}\right)^{2} & \sum w \dfrac{\partial I}{\partial x}\dfrac{\partial I}{\partial y} \\ \sum w \dfrac{\partial I}{\partial x}\dfrac{\partial I}{\partial y} & \sum w \left(\dfrac{\partial I}{\partial y}\right)^{2} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} = - \begin{bmatrix} \sum w \dfrac{\partial I}{\partial x}\dfrac{\partial I}{\partial t} \\ \sum w \dfrac{\partial I}{\partial y}\dfrac{\partial I}{\partial t} \end{bmatrix} \qquad (1)$$

where $w(x, y)$ is a window function that assigns higher weight to the center of the neighborhood, and the sums are typically taken over $5 \times 5$ pixel windows.
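To make Eq. (1) concrete, the following is a minimal NumPy sketch of the per-pixel Lucas–Kanade solve over a 5 × 5 window; it is an illustration under stated assumptions (simple finite-difference derivatives and binomial window weights), not the authors' implementation, and the function name `lk_velocity` is ours.

```python
import numpy as np

def lk_velocity(I1, I2, x, y):
    """Estimate (vx, vy) at pixel (x, y) from a 5x5 window of two frames."""
    # Spatial and temporal derivatives; the paper uses 5-tap derivative
    # filters, while simple finite differences are used here for brevity.
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1

    win = np.s_[y - 2:y + 3, x - 2:x + 3]
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()

    # Window function w(x, y): binomial weights, heavier at the center.
    g = np.outer([1., 4., 6., 4., 1.], [1., 4., 6., 4., 1.])
    w = (g / g.sum()).ravel()

    # Build and solve the 2x2 linear system of Eq. (1).
    A = np.array([[np.sum(w * ix * ix), np.sum(w * ix * iy)],
                  [np.sum(w * ix * iy), np.sum(w * iy * iy)]])
    b = -np.array([np.sum(w * ix * it), np.sum(w * iy * it)])
    return np.linalg.solve(A, b)  # ill-conditioned in textureless regions

# In the HLK scheme, this solve would be repeated from the coarsest to the
# finest pyramid level, with the upsampled flow warping the next level.
```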


Fig. 1. The OFT–HMMs-based framework for video semantic analysis. Feature extraction: pre-classified training data → optical flow tensor construction → discriminant subspace selection → semantic unit modeling (HMMs). Semantic extraction: video to be analyzed → feature extraction → semantic unit detection.

Fig. 2. The block diagram of the HLK optical flow estimation method: image pyramid production → pre-smoothing → derivative estimation → velocity vector calculation.

Fig. 3. A sample of optical flow fields (optical flow calculation based on the Lucas–Kanade algorithm applied to a video sequence).

Fig. 3 shows a sample optical flow field calculated by the HLK algorithm.

2.1.2. Tensor representation

In computer vision, many objects are naturally represented by multidimensional arrays, i.e., tensors [24], such as the gray face image in face recognition, the color image in scene image classification, and the video shot in motion categorization. However, in current research, the original tensors (images and videos) are often scanned into vectors, discarding a great deal of useful structural information, which is helpful for reducing the small sample-size (SSS) problem in subspace selection methods, e.g., linear discriminant analysis (LDA). To utilize this structural information, many dimension reduction algorithms based on the multilinear subspace method (MLSM) have been developed for data representation, pattern classification, and network anomaly detection. This structural information of objects in computer vision is a reasonable constraint that reduces the number of unknown parameters used to represent a learning model.

Tensors are arrays of numbers that transform in certain ways under coordinate transformations. The order of a tensor $\mathcal{X} \in \mathbb{R}^{L_1 \times L_2 \times \cdots \times L_M}$, represented by a multidimensional array of real numbers, is $M$. An element of $\mathcal{X}$ is denoted $\mathcal{X}_{l_1, l_2, \ldots, l_M}$, where $1 \le l_i \le L_i$ and $1 \le i \le M$. The $i$th dimension (or mode) of $\mathcal{X}$ is of size $L_i$. A scalar is a zeroth-order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor; a third-order tensor is shown as an example in Fig. 4.

Fig. 4. A third-order tensor $\mathcal{X} \in \mathbb{R}^{L_1 \times L_2 \times L_3}$.


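Since the mode-l matricizing operator mat_l(·) and the mode-l product ×_l are used repeatedly below, a small NumPy sketch may help fix the notation; the helper names `unfold`, `fold`, and `mode_product` are our own, not the paper's.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding mat_mode(X): rows indexed by that mode."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold` for a tensor of the given full shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def mode_product(X, U, mode):
    """Mode-`mode` product X x_mode U, with U of shape (J, L_mode)."""
    shape = list(X.shape)
    shape[mode] = U.shape[0]
    return fold(U @ unfold(X, mode), mode, shape)

# Example: project an 8 x 8 x 8 OFT down to a 4 x 4 x 4 core tensor.
X = np.random.rand(8, 8, 8)
U1, U2, U3 = (np.random.rand(4, 8) for _ in range(3))
Y = mode_product(mode_product(mode_product(X, U1, 0), U2, 1), U3, 2)
assert Y.shape == (4, 4, 4)
```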

2.1.3. Optical flow tensor construction

Based on the analysis above, the procedure of optical flow tensor construction is presented in Fig. 5 and summarized in Table 1.

Fig. 6. Direction bins: the D directional bins φ1, φ2, ..., φD used for the optical flow histogram.

2.2. Discriminative subspace selection

2.2.1. Core tensor generation using GTDA

Since the OFTs often have high dimensionality, an appropriate feature space analysis technique that properly preserves discriminative information is greatly needed for dimension reduction. Conventional classification methods, such as linear discriminant analysis (LDA), often fail when the dimensionality of the feature space is much higher than the number of training samples; this is known as the under sample problem (USP). To reduce the USP and preserve discriminative information, we proposed general tensor discriminant analysis (GTDA) in our previous work [23] as a pre-processing step for conventional classifiers such as LDA (see Fig. 7). As a tensor extension of the differential scatter discriminant criterion (DSDC), GTDA is defined as

$$U_l = \arg\max_{U_l}\ \mathrm{tr}\!\left(U_l^{T}\left(B_l - \zeta W_l\right)U_l\right), \quad 1 \le l \le M, \qquad (2)$$

where

$$W_l = \sum_{i=1}^{c}\sum_{j=1}^{n_i} \mathrm{mat}_l\!\left((\mathcal{X}_{i,j} - \mathcal{M}_i)\prod_{k \ne l} \times_k U_k^{T}\right) \mathrm{mat}_l^{T}\!\left((\mathcal{X}_{i,j} - \mathcal{M}_i)\prod_{k \ne l} \times_k U_k^{T}\right) \qquad (3)$$

and

$$B_l = \sum_{i=1}^{c} n_i\, \mathrm{mat}_l\!\left((\mathcal{M}_i - \mathcal{M})\prod_{k \ne l} \times_k U_k^{T}\right) \mathrm{mat}_l^{T}\!\left((\mathcal{M}_i - \mathcal{M})\prod_{k \ne l} \times_k U_k^{T}\right) \qquad (4)$$

where $\mathcal{X}_{i,j}$ is the jth training sample (tensor) in the ith class $c_i$, $\mathcal{M}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\mathcal{X}_{i,j}$ is the mean tensor of the ith class, $\mathcal{M} = \frac{1}{n}\sum_{i=1}^{c} n_i\,\mathcal{M}_i$ is the total mean tensor of all training tensors, $n_i$ is the number of samples in the ith class, and $U_l$ denotes the lth projection matrix obtained during the training procedure. Moreover, $\mathcal{X}_{i,j}$ ($1 \le j \le n_i$, $1 \le i \le c$), $\mathcal{M}_i$ ($1 \le i \le c$) and $\mathcal{M}$ are all Mth-order tensors lying in $\mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$, and $\prod_{k \ne l} \times_k U_k^{T} = \times_1 U_1^{T} \cdots \times_{l-1} U_{l-1}^{T} \times_{l+1} U_{l+1}^{T} \cdots \times_M U_M^{T}$.

Finally, the core tensor is obtained by

$$\mathcal{Y} = \mathcal{X} \times_1 U_1 \times_2 U_2 \cdots \times_M U_M = \mathcal{X} \prod_{l=1}^{M} \times_l U_l \qquad (5)$$

2.2.2. Linear discriminant analysis

The aim of LDA is to find a projection of the produced core tensors $\mathcal{Y}_{i,j}$ ($1 \le j \le n_i$, $1 \le i \le c$) that is optimal for separating the different classes in a low dimensional space. For this purpose, we first vectorize the core tensors $\mathcal{Y}_{i,j}$ to $y_{i,j}$, as shown in Fig. 7, and then calculate the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ over the training set by

$$S_b = \frac{1}{n}\sum_{i=1}^{c} n_i\,(m_i - m)(m_i - m)^{T}, \qquad S_w = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n_i} (y_{i,j} - m_i)(y_{i,j} - m_i)^{T}, \qquad (6)$$

where $m_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{i,j}$ is the mean vector of class $c_i$ and $m = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n_i} y_{i,j}$ is the total mean vector. The projection, defined by a set of vectors $U = [u_1, \ldots, u_{c-1}]$, is chosen to maximize the ratio between the trace of $S_b$ and the trace of $S_w$:

$$U^{*} = \arg\max_{U}\left\{\frac{\mathrm{tr}(U^{T} S_b U)}{\mathrm{tr}(U^{T} S_w U)}\right\} \qquad (7)$$

The projection matrix $U^{*}$ can be computed from the leading eigenvectors of $S_w^{-1} S_b$. If $y$ is a new feature vector, it is projected to $y^{*} = U^{*T}(y - m)$. The vector $y^{*}$ is then used in place of $y$ for representation and classification.
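As an illustration of Eqs. (6) and (7), a minimal NumPy sketch of the LDA step follows; the function name, the pseudo-inverse used to guard against a singular S_w, and the stacked-matrix interface are assumptions of this example.

```python
import numpy as np

def lda_projection(Y, labels, dim):
    """Y: (n, d) vectorized core tensors; labels: (n,) class indices."""
    n, d = Y.shape
    m = Y.mean(axis=0)                       # total mean vector
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)                 # class mean vector m_i
        diff = (mc - m)[:, None]
        Sb += len(Yc) * (diff @ diff.T) / n  # between-class scatter, Eq. (6)
        Sw += (Yc - mc).T @ (Yc - mc) / n    # within-class scatter, Eq. (6)
    # Leading eigenvectors of Sw^{-1} Sb give U*, Eq. (7); dim <= c - 1.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)[:dim]
    return evecs[:, order].real, m

# A new vector y is then represented as y_star = U.T @ (y - m).
```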

Fig. 5. An example of optical flow tensor construction.


Fig. 7. The algorithm and diagram of the subspace selection scheme using GTDA and LDA.

2.3. HMMs and their training

2.3.1. Hidden Markov models

Initially introduced and studied in the late 1960s and early 1970s, statistical methods of Markov source or hidden Markov modeling have become increasingly popular in recent decades. There are two strong reasons for this. First, the models are very rich in mathematical structure and hence can form the theoretical basis for a wide range of applications. Second, when applied properly, the models work very well in practice for several important applications, such as speech recognition and face recognition. With the development of video content analysis, HMMs have also found wide application in video semantic modeling. Taking a left-right model as an example, as in Fig. 8, an HMM is characterized by the following elements:

(1) N, the number of states in the model. Although the states are hidden, for many practical applications there is often some physical significance attached to the states or to sets of states of the model. For video semantic analysis, each state may correspond to a certain motion. The individual states are denoted as $S = \{S_1, S_2, \ldots, S_N\}$ and the state at time $t$ as $q_t$.

(2) M, the number of distinct observation symbols per state. The observation symbols correspond to the physical output of the system being modeled. In many real-world applications, the observation has a continuous density. The individual symbols are denoted as $V = \{v_1, v_2, \ldots, v_M\}$.

(3) The state transition probability distribution $A = \{a_{ij}\}$, where

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N. \qquad (10)$$

In our application, it may reflect the temporal relationship between consecutive motions.

(4) The observation symbol probability distribution in state $j$, $B = \{b_j(k)\}$, where

$$b_j(k) = P(v_k \text{ at } t \mid q_t = S_j), \quad 1 \le j \le N,\ 1 \le k \le M. \qquad (11)$$

The observation symbols represent specific feature values for every motion.

(5) The initial state distribution $\pi = \{\pi_i\}$, where

$$\pi_i = P(q_1 = S_i), \quad 1 \le i \le N. \qquad (12)$$

It can be seen from the above discussion that a complete specification of an HMM requires the two model parameters (N and M), specification of the observation symbols, and specification of the three probability measures A, B and $\pi$. For convenience, we use the compact notation

$$\lambda = (A, B, \pi) \qquad (13)$$

to indicate the complete parameter set of the model.

Fig. 8. An N-state left-right hidden Markov model.


Fig. 9. The block diagram of HMM training: model initialization → state sequence segmentation (training data, model λ) → parameter estimation via segmental k-means → model re-estimation (λ′) → convergence check → model set.

Given the form of HMM described above, there are three basic problems of interest that must be solved for the model to be useful in real-world applications.

Problem 1. Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O|\lambda)$, the probability of the observation sequence? The corresponding solution is the Forward–Backward algorithm [25].

Problem 2. Given the observation sequence $O = O_1 O_2 \cdots O_T$ and the model $\lambda$, how do we choose a corresponding state sequence $q = q_1 q_2 \cdots q_T$ that is optimal in some meaningful sense? The solution to this problem is the Viterbi algorithm [26].

Problem 3. How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O|\lambda)$? The conventional solution is the Baum–Welch algorithm [27].
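As a concrete illustration of Problem 2, a minimal log-domain Viterbi sketch follows; the interface (start distribution pi, transition matrix A, and a callable returning log b_j(O_t)) is an assumption of this example, not the paper's implementation.

```python
import numpy as np

def viterbi(pi, A, log_b, T):
    """Return the optimal state sequence q and its log probability."""
    N = len(pi)
    delta = np.zeros((T, N))             # best log prob ending in state j
    psi = np.zeros((T, N), dtype=int)    # backpointers
    logA = np.log(A)                     # zero transitions become -inf
    delta[0] = np.log(pi) + [log_b(0, j) for j in range(N)]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + logA[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_b(t, j)
    q = np.zeros(T, dtype=int)           # backtrack along the best path
    q[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()
```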

2.3.2. HMMs training

The HMMs in our approach have continuous observation densities, and the output probability density function $b_j(O_t)$ is

$$b_j(O_t) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}\!\left(O_t;\ \mu_{jk},\ \Sigma_{jk}\right), \quad 1 \le j \le N, \qquad (14)$$

where $\mathcal{N}$ denotes a multidimensional Gaussian density function with mean vector $\mu$ and covariance matrix $\Sigma$, and $c_{jk}$ is a weighting coefficient. A procedure for providing good initial parameter estimates based on segmental k-means is adopted in our approach, as shown in Fig. 9.

Assume there is a training set of observations (the same as is required for parameter re-estimation) and an initial estimate of all model parameters. Unlike the estimate required for re-estimation, however, the initial model estimate can be chosen randomly, or on the basis of any available model appropriate to the data.

Following model initialization, the set of training observation sequences is segmented into states based on the current model $\lambda$. This segmentation is achieved by finding the optimum state sequence via the Viterbi algorithm and then backtracking along the optimal path. Based on the state segmentation, updated estimates of the $a_{ij}$ coefficients can be obtained, and the segmental k-means procedure is used to cluster the observation vectors within each state $S_j$ into a set of $K$ clusters, where each cluster represents one of the $K$ mixtures of the $b_j(O_t)$ density. From the clustering, an updated set of model parameters, $c_{jk}$, $\mu_{jk}$, and $\Sigma_{jk}$, is derived.

An updated model $\lambda'$ is obtained from the new model parameters, and the formal re-estimation procedure is used to re-estimate all model parameters. The resulting model is then compared to the previous model (by computing a distance score that reflects the statistical similarity of the HMMs). If the model distance score exceeds a threshold, the old model $\lambda$ is replaced by the new (re-estimated) model and the overall training loop is repeated. If the model distance score falls below the threshold, model convergence is assumed and the final model parameters are saved. In our approach, the HMMs are trained from multiple training samples (feature vector sequences); the training procedure is summarized in Table 3.

3. Video content analysis based on the OFT–HMMs framework

A video shot is a series of consecutive video frames captured by a camera. As a basic semantic component of video, a shot often carries a specific meaning expressed by a sequence of motions, and shot classification is one of the most important tasks in video content analysis. Event detection is also widely studied for video semantic extraction, especially in sports videos. Since video shots usually contain one or more semantic events, just as sentences are composed of words, recognizing the event strings contained in shots is of great help. In this section, the shot classification and event recognition procedures based on the proposed OFT–HMMs framework are presented.

3.1. Shot classification

We model each pre-defined shot by an HMM, denoted $\lambda_S^m$, $1 \le m \le V$, and the shot classification is based on the well-known Viterbi algorithm, as shown in Fig. 10 and summarized in Table 4.

3.2. Event recognition

3.2.1. Connected event recognition procedure

A block diagram of the overall level-building connected event recognition procedure is given in Fig. 11. There are essentially three steps in the recognition process:

Table 1
Optical flow tensor construction procedure

Input: A video sequence with T frames.
Output: An optical flow tensor sequence.
Step 1. Calculate the corresponding optical flow fields, denoted as F = {f_1(x, y), f_2(x, y), ..., f_t(x, y), ..., f_{T-1}(x, y)}, where f_t(x_i, y_i) = v(u, v) is the motion vector for every pixel in a particular optical flow field.
Step 2. Apply a sliding window of size W to the optical flow fields at a frequency of K frames, as shown in Fig. 5. For every window:
(1) Decompose each optical flow field in the window into M × N blocks of identical size, generating M × N sub-sequences of the original optical flow fields, each of which is considered an optical flow cube (OFC).
(2) Classify each vector v(u, v) in every OFC into one of the D directional bins shown in Fig. 6, constructing an optical flow histogram.
(3) Reflect all of the M × N groups of optical flow histograms into a third-order optical flow tensor (OFT), denoted as X ∈ R^{M×N×D}.
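A minimal sketch of the binning in Table 1 follows, assuming each optical flow field in a window is available as a dense (H, W, 2) array (e.g., from OpenCV's cv2.calcOpticalFlowFarneback as a stand-in for the HLK flow); function and variable names are illustrative.

```python
import numpy as np

def build_oft(flow_fields, M=8, N=8, D=8):
    """Bin a window of dense flow fields into an M x N x D tensor."""
    oft = np.zeros((M, N, D))
    for flow in flow_fields:               # the W fields inside the window
        H, Wd, _ = flow.shape
        u, v = flow[..., 0], flow[..., 1]
        # Direction of every flow vector, quantized to D bins (Fig. 6).
        ang = np.arctan2(v, u) % (2 * np.pi)
        bins = np.minimum((ang * D / (2 * np.pi)).astype(int), D - 1)
        # Spatial block (OFC) of every pixel, on the M x N grid.
        yy, xx = np.indices((H, Wd))
        rows, cols = yy * M // H, xx * N // Wd
        np.add.at(oft, (rows, cols, bins), 1)  # accumulate the histograms
    return oft
```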


Table 2
Subspace selection for OFT using GTDA and LDA

Input: OFTs $\mathcal{X}_{i,j} \in \mathbb{R}^{M \times N \times D}$, $1 \le j \le n_i$, $1 \le i \le c$.
Output: Final feature vectors $y^{*}_{i,j}$.
Step 1. Train the GTDA projection matrices $U_l \in \mathbb{R}^{M \times M'}$, $\mathbb{R}^{N \times N'}$, or $\mathbb{R}^{D \times D'}$, $1 \le l \le 3$.
Step 2. Produce a new sequence of core tensors $\mathcal{Y}_{i,j} \in \mathbb{R}^{M' \times N' \times D'}$ by

$$\mathcal{Y}_{i,j} = \mathcal{X}_{i,j} \prod_{l=1}^{3} \times_l U_l \qquad (8)$$

Step 3. Vectorize the core tensors into long vectors $y_{i,j}$.
Step 4. Construct the LDA projection matrix $U^{T}$ and mean vector $m$. This step can be replaced by [37] to further improve the performance.
Step 5. The final vectors $y^{*}_{i,j}$ are obtained by

$$y^{*}_{i,j} = U^{T}(y_{i,j} - m) \qquad (9)$$
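For illustration, a sketch of the alternating optimization behind Step 1 of Table 2 follows, reusing the `unfold` and `mode_product` helpers from Section 2.1.2; the identity initialization, the fixed number of sweeps, and the DSDC tuning parameter `zeta` are assumptions of this example rather than the authors' settings.

```python
import numpy as np

def gtda(tensors, labels, out_dims, zeta=1.0, n_sweeps=5):
    """tensors: list of equally shaped M-order arrays; labels: 1D array."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    M = tensors[0].ndim
    # Start from truncated identity projections U_l of shape (L_l, L'_l).
    Us = [np.eye(tensors[0].shape[l])[:, :out_dims[l]] for l in range(M)]
    mean_all = sum(tensors) / len(tensors)
    means = {c: sum(t for t, y in zip(tensors, labels) if y == c)
                / np.sum(labels == c) for c in classes}
    for _ in range(n_sweeps):
        for l in range(M):
            def project(T):
                # Apply all current projections except in mode l.
                for k in range(M):
                    if k != l:
                        T = mode_product(T, Us[k].T, k)
                return T
            B = np.zeros((tensors[0].shape[l],) * 2)   # B_l, Eq. (4)
            W = np.zeros_like(B)                       # W_l, Eq. (3)
            for c in classes:
                A = unfold(project(means[c] - mean_all), l)
                B += np.sum(labels == c) * (A @ A.T)
            for T, y in zip(tensors, labels):
                A = unfold(project(T - means[y]), l)
                W += A @ A.T
            # U_l <- leading eigenvectors of B_l - zeta * W_l, Eq. (2).
            evals, evecs = np.linalg.eigh(B - zeta * W)
            Us[l] = evecs[:, np.argsort(-evals)[:out_dims[l]]]
    return Us
```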

Table 3
HMMs training procedure

Input: Feature vector sequences {O_1, O_2, ..., O_i, ..., O_Q}, where Q is the number of samples.
Output: Model λ.
Step 1. Randomly initialize the model parameters as λ^(0) and set t = 0.
Step 2. Divide each training sequence O_i, 1 ≤ i ≤ Q, into N predefined states by the Viterbi decoding algorithm according to λ^(t).
Step 3. Estimate the state transition coefficients a_ij by

$$a_{ij} = \frac{N(i,j)}{N(i)}, \qquad (15)$$

where N(i, j) is the number of transitions from state i to state j and N(i) is the number of transitions from state i to any state.
Step 4. Segment the feature vectors in each state $S_j$ into K clusters, denoted as $V_{j,k} \in \mathbb{R}^{Y}$, $1 \le j \le N$, $1 \le k \le K$.
Step 5. Estimate the new set of model parameters of λ^(t+1) by

$$c_{jk} = \frac{|V_{j,k}|}{|V_j|}, \qquad (16)$$

where |·| denotes cardinality,

$$\mu_{jk} = \mathrm{mean}(V_{j,k}), \quad \text{and} \qquad (17)$$

$$\Sigma_{jk} = \mathrm{cov}(V_{j,k}). \qquad (18)$$

Step 6. Form the updated model λ^(t+1) and re-estimate it using the Baum–Welch algorithm.
Step 7. Check convergence: $\left| P(O|\lambda^{(t+1)}) - P(O|\lambda^{(t)}) \right| / P(O|\lambda^{(t+1)}) \le \varepsilon$; if yes, stop; otherwise set t = t + 1 and go to Step 2.
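As a rough modern stand-in for Table 3, each semantic-unit model can be fitted with the hmmlearn package, which runs EM (Baum–Welch) directly instead of the segmental k-means initialization described above; the function name and hyper-parameters are assumptions of this sketch.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_unit_model(sequences, n_states=4, n_mix=3):
    """sequences: list of (T_i, d) feature-vector arrays for one unit."""
    X = np.concatenate(sequences)            # stacked observations
    lengths = [len(s) for s in sequences]    # per-sequence lengths
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=30)
    model.fit(X, lengths)                    # EM re-estimation
    return model
```

A left-right topology, as in Fig. 8, can additionally be imposed by presetting the model's `startprob_` and an upper-banded `transmat_` and removing 's' and 't' from `init_params`.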

(1) Feature extraction: the video shot is converted to a set of feature vectors based on OFTs. This defines the observation sequence O of the unknown connected event string.
(2) Event segmentation based on level building: the sequence of feature vectors (the observations) of the unknown connected event string is matched against the single-event HMMs using a Viterbi scoring algorithm. The output of this process is a set of candidate event strings, generally of different lengths (i.e., different numbers of events per string), ordered by log probability scores.
(3) Post-processing: the candidate event strings are subjected to further validity tests, e.g., on duration, to eliminate unreasonable candidates. The post-processor chooses the most likely event string from the remaining candidates.

3.2.2. Level building on event HMMs

The way in which level building is used on event HMMs is illustrated in Fig. 12. If we denote the set of Z event HMMs as $\lambda_E^z$, $1 \le z \le Z$, then to find the optimal sequence of HMMs that matches O, a sequence of Viterbi matches is performed. For each event HMM $\lambda_E^z$, and at each level l, we perform a Viterbi match against O, starting at observation interval 1 on level 1, and retain for each possible frame t the following:

(1) $P_l^z(t)$, $1 \le t \le T$, the accumulated log probability up to frame t, at level l, for reference model $\lambda_E^z$, along the best path.
(2) $F_l^z(t)$, $1 \le t \le T$, a backpointer indicating where the path started at the beginning of the level.

To compute $P_l^z(t)$, we need a local measure for the probability that observation $O_t$ occurred in state j of model $\lambda_E^z$. At the end of each level l (where the level corresponds to the event position within the string), a maximization over z is performed to obtain the best model at each frame t as follows:

$$P_l^B(t) = \max_{1 \le z \le Z} P_l^z(t), \quad 1 \le t \le T \qquad (19)$$

$$W_l^B(t) = \arg\max_{1 \le z \le Z} P_l^z(t), \quad 1 \le t \le T \qquad (20)$$

$$F_l^B(t) = F_l^{W_l^B(t)}(t), \quad 1 \le t \le T \qquad (21)$$

where $W_l^B(t)$ records the index of the event model that gave the best score at frame t, level l, and $F_l^B(t)$ records the backpointer of the best event model.
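Given per-model Viterbi scores, the level maximization of Eqs. (19)–(21) is a straightforward argmax; a NumPy sketch with assumed array names follows.

```python
import numpy as np

def best_per_level(P, F):
    """P, F: (Z, L, T) scores/backpointers -> P_B, W_B, F_B of shape (L, T)."""
    W_B = np.argmax(P, axis=0)                     # Eq. (20): best model z
    P_B = np.take_along_axis(P, W_B[None], 0)[0]   # Eq. (19): its score
    F_B = np.take_along_axis(F, W_B[None], 0)[0]   # Eq. (21): its backpointer
    return P_B, W_B, F_B
```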


Fig. 10. The block diagram of shot classification.

Fig. 11. The block diagram of connected event recognition: video shot → feature extraction → event segmentation based on level building (with event models) → post-processing → recognized event string.

Fig. 12. The level building on event HMMs.

Each new level begins with the initial best probability at the preceding frame on the preceding level and increments the Viterbi score by matching the event models beginning at the new initial frame. This process is repeated through a number of levels equal to the maximum expected number of events in any string. At the end of each level, a best string of size l ($1 \le l \le L$) with probability $P_l^B(T)$ is obtained by backtracking, using the backpointer array $F_l^B(t)$, to recover the events in the string. The overall best string is the maximum of $P_l^B(T)$ over all possible levels l.

4. Evaluation of the OFT–HMMs framework with application to basketball video analysis

Sports video has been widely studied owing to its rich spatial-temporal patterns and tremendous commercial potential. Because most sports follow well-defined rules, the

semantics are more regular and easier to analyze. Among the various sports genres, basketball, soccer and volleyball are the most popular. Here, we take basketball video as a sample application to evaluate the performance of the proposed OFT–HMMs framework.

As shown in Fig. 13, the shot and event HMMs are trained in advance. Given a video sequence, we first decompose it into individual shots based on the visual information alone, using the segmentation method proposed in [28]. Then shot classification and event recognition are conducted based on the corresponding model sets. To evaluate the effectiveness of the OFT-based feature extraction method and the HMM-based modeling, two sets of K-fold cross validation [29] are performed, for shots and events, respectively. The total duration of the test data is more than 4 h, comprising over 1800 shots. In our experiments, the width of the sliding window is 5 frames and the sample interval is one frame. The OFT is of size 8 × 8 × 8 (Section 2.1). We provide about 20 standard clips for each shot and event type.


Fig. 13. The diagram of semantic analysis based on the OFT–HMMs framework.

Fig. 14. The semantic structure of basketball video.

4.1. Basketball content analysis

Basketball is a fast-paced sport characterized by frequently changing motions, and even a single camera shot often contains rich semantics [7]. Generally speaking, basketball shots are classified into the following types (as shown in Fig. 14):

(A) Court shot, which comprises most of the court events, including (1) left offence; (2) right offence; (3) left break; (4) right break; (5) left lay-up; (6) right lay-up; (7) left shot; and (8) right shot;
(B) Close-up shot;
(C) Left track;
(D) Right track;
(E) Left foul shot;
(F) Right foul shot; and
(G) Wipe.

After a statistical analysis of a large number of basketball shots, we gained some domain-specific knowledge; for example, different shot categories generally have different durations. In particular, court shots usually have the longest duration, while wipe shots are the shortest, so they can easily be recognized from duration information alone.

In our experiments, each kind of shot and event, namely B to F and A(1) to A(8), is characterized by a continuous-mixture Gaussian HMM whose topology is a left-right 4-state structure with 3 mixtures per state (see Table 4).


Table 4
Shot classification procedure

Input: Video sequence with T frames; shot HMMs $\lambda_S^m$, $1 \le m \le V$.
Output: Recognized shot.
Step 1. Convert the video sequence to a series of feature vectors based on OFTs, denoted as O.
Step 2. Compute $P(O|\lambda_S^m)$, $1 \le m \le V$, via the Viterbi algorithm.
Step 3. Select the maximum $P(O|\lambda_S^m)$, corresponding to one of the pre-defined shots.

4.2. Shot classification performance evaluation

The confusion matrices of 4-fold, 5-fold and 10-fold cross validation for shot classification are given in Tables 5, 6 and 7, respectively; an average recognition rate of 99.4% demonstrates the good performance of the shot HMMs.
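Assuming the shot HMMs are, for instance, hmmlearn models, Steps 2 and 3 of Table 4 reduce to Viterbi scoring and an argmax; `decode` is hmmlearn's Viterbi scorer, and the remaining names are illustrative.

```python
import numpy as np

def classify_shot(O, shot_models):
    """O: (T, d) OFT-derived feature sequence; one model per shot class."""
    # Viterbi log score of O under each shot HMM lambda_S^m (Step 2).
    scores = [model.decode(O)[0] for model in shot_models]
    return int(np.argmax(scores))  # index of the recognized shot (Step 3)
```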

4.3. Event recognition performance evaluation

The confusion matrices of 4-fold, 5-fold and 10-fold cross validation for event recognition are given in Tables 8, 9 and 10, respectively. The 10-fold cross validation reaches the highest overall recognition rate, since it provides the most samples for each training set. It is noticeable that the offence events (A(1) and A(2)) are more likely to be misrecognized because of unapparent motion, and A(5) is easily recognized as A(7), because our system cannot always distinguish a "lay-up" from a "shot".

5. Conclusion

Video semantic analysis is crucial for applications such as video skimming, indexing and retrieval. In this paper, we proposed an optical flow tensor-based video content analysis scheme using HMMs, called the OFT–HMMs framework. Since video is a context-sensitive medium carrying rich motion information, the optical flow field is used to give a good estimate of the motion. However, it is unreasonable to use the optical flow vectors directly as feature vectors, because every pixel in the video frame is assigned an optical flow vector, which results in a feature space of very high dimensionality. Hence, the OFT is proposed to turn the optical flow vector-based feature space into a tensor-based feature space, which carries most of the structural information. To preserve the discriminative information of the training tensors, the OFT is analyzed by the GTDA and LDA methods, producing the final feature vectors. Since the HMM is an effective sequential analysis tool and video is characterized by its context-sensitive nature, HMMs are adopted to model the semantic units (SU), such as shot and event, in our OFT–HMMs framework. Shot classification and event recognition, two of the most actively studied problems in video semantic analysis, are discussed in detail based on the proposed OFT–HMMs framework. Finally, a practical application to basketball video is carried out to validate the proposed semantic analysis framework. The promising results demonstrate the effectiveness of the OFT-based feature extraction method and of the whole scheme for video semantic analysis, which can be extended to many other applications.

In the future, much research remains to be done. First, new ways of video content representation beyond motion analysis will be explored. Second, we can extend our method to cover many more events, such as turnover, interception and fumble, as well as to handle meaningful actions and players. Finally, besides the HMM, we are considering other powerful modeling tools such as dynamic Bayesian networks.

Table 5
The confusion matrix of 4-fold cross validation for shot classification

          B     C     D     E     F     Recognition rate (%)
B        20     —     —     —     —     100
C         —    20     —     —     —     100
D         —     —    20     —     —     100
E         —     —     1    19     —     95.0
F         —     —     —     —    20     100
Average                                 99.4

Table 6
The confusion matrix of 5-fold cross validation for shot classification

          B     C     D     E     F     Recognition rate (%)
B        20     —     —     —     —     100
C         —    19     1     —     —     95.0
D         —     —    20     —     —     100
E         —     —     —    20     —     100
F         —     —     —     —    20     100
Average                                 99.4

Table 7
The confusion matrix of 10-fold cross validation for shot classification

          B     C     D     E     F     Recognition rate (%)
B        20     —     —     —     —     100
C         —    20     —     —     —     100
D         —     —    20     —     —     100
E         —     —     1    19     —     95.0
F         —     —     —     —    20     100
Average                                 99.4

Table 8
The confusion matrix of 4-fold cross validation for event recognition

         A(1)  A(2)  A(3)  A(4)  A(5)  A(6)  A(7)  A(8)  Recognition rate (%)
A(1)      20     —     —     —     —     —     —     —   100
A(2)       —    19     —     —     —     —     1     —   95.0
A(3)       —     —    20     —     —     —     —     —   100
A(4)       —     —     —    20     —     —     —     —   100
A(5)       1     —     —     —    18     —     1     —   90.0
A(6)       —     —     —     —     —    19     1     —   95.0
A(7)       —     —     —     —     —     —    20     —   100
A(8)       —     —     —     —     —     —     —    20   100
Average                                                   97.5


Table 9
The confusion matrix of 5-fold cross validation for event recognition

         A(1)  A(2)  A(3)  A(4)  A(5)  A(6)  A(7)  A(8)  Recognition rate (%)
A(1)      20     —     —     —     —     —     —     —   100
A(2)       1    19     —     —     —     —     —     —   95.0
A(3)       —     —    20     —     —     —     —     —   100
A(4)       —     —     —    20     —     —     —     —   100
A(5)       1     —     —     —    19     —     —     —   95.0
A(6)       —     —     —     —     —    19     1     —   95.0
A(7)       —     —     —     —     —     —    20     —   100
A(8)       —     —     —     —     —     —     —    20   100
Average                                                   98.1

Table 10
The confusion matrix of 10-fold cross validation for event recognition

         A(1)  A(2)  A(3)  A(4)  A(5)  A(6)  A(7)  A(8)  Recognition rate (%)
A(1)      19     —     —     —     1     —     —     —   95.0
A(2)       —    20     —     —     —     —     —     —   100
A(3)       —     —    20     —     —     —     —     —   100
A(4)       —     —     —    20     —     —     —     —   100
A(5)       —     —     —     —    20     —     —     —   100
A(6)       —     —     —     —     —    19     1     —   95.0
A(7)       —     —     —     —     —     —    20     —   100
A(8)       —     —     —     —     —     —     —    20   100
Average                                                   98.8

References

[1] A. Ekin, A.M. Tekalp, R. Mehrotra, Automatic soccer video analysis and summarization, IEEE Trans. Image Process. 12 (7) (2003) 796–807.
[2] Y. Li, S. Narayanan, C.-C.J. Kuo, Content-based movie analysis and indexing based on audiovisual cues, IEEE Trans. Circuits Syst. Video Technol. 14 (8) (2004) 1073–1085.
[3] L.Q. Xu, Y. Li, Video classification using spatial-temporal features and PCA, in: Proc. Int. Conf. Multimedia and Expo, Baltimore, 2003, pp. 485–488.
[4] D.A. Sadlier, N.E. O’Connor, Event detection in field sports video using audio-visual features and a support vector machine, IEEE Trans. Circuits Syst. Video Technol. 15 (10) (2005) 1225–1233.
[5] F. Wang, Y.F. Ma, H.J. Zhang, J.T. Li, A generic framework for semantic sports video analysis using dynamic Bayesian networks, in: Proc. IEEE Int. Conf. Multimedia Modelling, 2005, pp. 115–122.
[6] C.L. Huang, H.C. Shih, C.Y. Chao, Semantic analysis of soccer video using dynamic Bayesian network, IEEE Trans. Multimedia 8 (4) (2006) 749–760.
[7] G. Xu, Y.F. Ma, H.J. Zhang, S.Q. Yang, An HMM-based framework for video semantic analysis, IEEE Trans. Circuits Syst. Video Technol. 15 (11) (2005) 1422–1433.
[8] L. Xie, S.F. Chang, A. Divakaran, H. Sun, Structure analysis of soccer video with domain knowledge and hidden Markov models, Pattern Recognit. Lett. 25 (7) (2004) 767–775.
[9] J.C. Huang, Z. Liu, Y. Wang, Joint scene classification and segmentation based on hidden Markov model, IEEE Trans. Multimedia 7 (3) (2005) 538–550.
[10] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 256–286.
[11] F. Niu, M. Abdel-Mottaleb, HMM-based segmentation and recognition of human activities from video sequences, in: Proc. IEEE Int. Conf. Multimedia and Expo, Amsterdam, 2005, pp. 804–807.
[12] M. Barnard, J.-M. Odobez, Sports event recognition using layered HMMs, in: Proc. IEEE Int. Conf. Multimedia and Expo, Amsterdam, 2005, pp. 1150–1153.
[13] M.M. Trivedi, K.S. Huang, I. Mikic, Dynamic context capture and distributed video arrays for intelligent spaces, IEEE Trans. Syst. Man Cybernet. 35 (1) (2005) 145–163.
[14] O. Alatas, O. Javed, M. Shah, Compressed spatio-temporal descriptors for video matching and retrieval, in: Proc. IEEE Int. Conf. Pattern Recognition, 2004, pp. 882–885.
[15] J.W. Hsieh, S.L. Yu, Y.S. Chen, Motion-based video retrieval by trajectory matching, IEEE Trans. Circuits Syst. Video Technol. 16 (3) (2006) 396–409.
[16] B.K.P. Horn, B.G. Schunck, Determining optical flow, Artificial Intelligence 17 (1981) 185–203.
[17] J.L. Barron, D.J. Fleet, S. Beauchemin, Performance of optical flow techniques, Int. J. Comput. Vis. 12 (1) (1994) 43–77.


[18] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proc. 7th Int. Joint Conf. Artificial Intelligence, 1981, pp. 674–679.
[19] P. Anandan, A computational framework and an algorithm for the measurement of visual motion, Int. J. Comput. Vis. 2 (1989) 283–310.
[20] D.J. Heeger, Optical flow from spatiotemporal filters, Int. J. Comput. Vis. 1 (1988) 279–302.
[21] D.J. Fleet, A.D. Jepson, Computation of component image velocity from local phase information, Int. J. Comput. Vis. 5 (1) (1990) 77–104.
[22] D. Tao, X. Li, W. Hu, S. Maybank, X. Wu, Supervised tensor learning, Knowledge Inform. Syst. 13 (1) (2007) 1–42.
[23] D. Tao, X. Li, X. Wu, S. Maybank, General tensor discriminant analysis and Gabor features for gait recognition, IEEE Trans. Pattern Anal. Mach. Intell. 29 (10) (2007) 1700–1715.
[24] L.D. Lathauwer, Signal processing based on multilinear algebra, Ph.D. Thesis, Katholieke Universiteit Leuven, 1997.
[25] L.E. Baum, J.A. Eagon, An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology, Bull. Amer. Math. Soc. 73 (1967) 360–363.
[26] G.D. Forney, The Viterbi algorithm, Proc. IEEE (1973) 268–278.
[27] L.E. Baum, An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes, Inequalities 3 (1972) 1–8.
[28] X. Gao, X. Tang, Unsupervised video-shot segmentation and model-free anchorperson detection for news video story parsing, IEEE Trans. Circuits Syst. Video Technol. 12 (9) (2002) 765–776.
[29] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, 2000.
[30] D. Xu, S. Lin, S. Yan, X. Tang, Rank-one projections with adaptive margins for face recognition, in: IEEE Int. Conf. Comput. Vis. Pattern Recogn., 2006, pp. 1226–1236.
[31] D. Tao, X. Li, X. Wu, S.J. Maybank, Tensor rank one discriminant analysis, Neurocomputing, in press.
[32] D. Xu, S. Yan, L. Zhang, H. Zhang, Z. Liu, H. Shum, Concurrent subspace analysis, in: IEEE Int. Conf. Comput. Vis. Pattern Recogn. 2 (2005) 203–208.
[33] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, H. Zhang, Discriminant analysis with tensor representation, in: IEEE Int. Conf. Comput. Vis. Pattern Recogn. 1 (2005) 526–532.
[34] Y. Fu, T.S. Huang, Image classification using correlation tensor analysis, IEEE Trans. Image Process. 17 (2) (2008) 226–234.
[35] Y. Fu, S. Yan, T.S. Huang, Classification and feature extraction by simplexization, IEEE Trans. Inform. Forensics Security 3 (1) (2008) 91–100.
[36] D. Tao, X. Tang, X. Li, X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 28 (7) (2006) 1088–1099.
[37] D. Tao, X. Li, X. Wu, S.J. Maybank, Geometric mean for subspace selection in multiclass classification, IEEE Trans. Pattern Anal. Mach. Intell., in press.

Xinbo Gao received the B.S. degree in Electronic Engineering and the M.S. and Ph.D. degrees in Signal and Information Processing from Xidian University, Xi’an, China, in 1994, 1997, and 1999, respectively. From 1997 to 1998, he was a research assistant in the Department of Computer Science at Shizuoka University, Japan. From 2000 to 2001, he was a research associate in the Department of Information Engineering at the Chinese University of Hong Kong. Since 2003, he has been a Professor in the School of Electronic Engineering at Xidian University, Xi’an, China, and the Director of the Video/Image Processing System Laboratory (VIPSL). His research interests include signal/image/video processing, biomedical imaging, pattern recognition, and machine learning. Dr. Gao was selected as a member of the Program for New Century Excellent Talents in University of China by the Ministry of Education (MOE) in 2004. He was awarded the title of Pacemaker of the Ten Excellent Young Teachers of Shaanxi Province in 2005. In 2006, he received the Young Teacher Award of High School from the Fok Ying Tung Education Foundation. Dr. Gao is a member of the IEEE and a senior member of the Chinese Institute of Electronics (CIE) and the China Computer Federation (CCF).

Yimin Yang received the B.S. degree in Electronic Engineering from Xidian University, Xi’an, China, in 2005. She is currently working toward the M.S. degree in Signal and Information Processing. Her research interests include video processing, pattern recognition, and computer vision.

Dacheng Tao received the B.Eng. degree from the University of Science and Technology of China (USTC), the M.Phil. degree from the Chinese University of Hong Kong (CUHK), and the Ph.D. degree from the University of London. Currently, he is a Nanyang Assistant Professor with the School of Computer Engineering at Nanyang Technological University, a visiting professor at Xidian University, a guest professor at Wuhan University, and a visiting research fellow at the University of London. His research mainly applies statistics and mathematics to data analysis problems in data mining, machine learning, multimedia, computer vision, and visual surveillance. He has published more than 80 scientific papers in IEEE T-PAMI, T-KDE, T-IP, T-MM, T-CSVT, T-SMC-B, T-SMC-C, ICDM, CVPR, ECCV; ACM T-


KDD, Multimedia, KDD, etc., with best paper awards and nominations, e.g., the IEEE International Conference on Data Mining best theory/algorithm paper runner-up award. Previously, he gained several Meritorious Awards from the International Interdisciplinary Contest in Modeling, the highest-level mathematical modeling contest in the world, organized by COMAP. He serves as an associate editor of the official journal of the International Association for Statistical Computing, Computational Statistics & Data Analysis (Elsevier), and of Neurocomputing (Elsevier). He has authored/edited six books and eight special issues in CVIU, PR, PRL, SP, Neurocomputing, and other reputable journals. He has (co-)chaired special sessions, invited sessions, workshops and conferences. He has served for more than 50 major international conferences, including ICDM, KDD, CVPR, ICCV, ECCV, and ACM Multimedia, and more than 15 top international journals, including T-PAMI, T-KDE, TOIS, T-IP, T-CSVT, T-MM, T-IFS, T-SMC-B, CVIU, and Information Sciences.

Xuelong Li holds a permanent academic post at Birkbeck College, University of London, and a visiting professorship at Tianjin University. His research interests include digital image/video processing, analysis, retrieval, and indexing, pattern recognition, biometrics, and visual surveillance. His research activities are partly sponsored by EPSRC, the British Council, the Royal Society, etc. He has around a hundred scientific publications. Dr. Li is an associate editor of IEEE T-IP, T-CSVT, T-SMC-B, and T-SMC-C, an editor of four books, an editorial board member of ten other journals, and a guest co-editor of eight special issues. He is the recipient of several best paper awards and nominations. He has served as a chair/co-chair of a dozen conferences and as a program committee member for more than eighty conferences. He has been a reviewer for over a hundred journals and conferences, including eleven IEEE transactions. He is a member of the academic committee of the China Society of Image and Graphics, a senior member of the IEEE, the chair of the IEEE-SMC Technical Committee on Cognitive Computing, and a member of several other IEEE technical committees. He is an IEEE-SMC chapter coordinator.
