
Cluster-Temporal Browsing of Large News Video Databases

Mika Rautiainen, Timo Ojala, Tapio Seppänen
MediaTeam Oulu, University of Oulu, P.O. Box 4500, FIN-90014 University of Oulu
{[email protected]}

Abstract

This paper describes cluster-temporal browsing of a news video database. Cluster-temporal browsing combines content similarities and temporal adjacency into a single representation. Visual, conceptual and lexical features are used to organize and view similar shot content. Interactive experiments with eight test users were carried out on a database of roughly 60 hours of news video. The results indicate improvements in browsing efficiency when automatic speech recognition transcripts are incorporated into browsing by visual similarity. The cluster-temporal browsing application received positive comments from the test users and performed well in the overall comparison of interactive video retrieval systems in the TRECVID 2003 evaluation.

1. Introduction

Content-based indexing methods for video and image retrieval have been studied for more than a decade. Prototype systems have been developed in research institutions and universities around the world but rarely commercialized. The best-known examples are IBM's QBIC [1], Carnegie Mellon University's Informedia [2] and the Físchlár project from the Centre for Digital Video Processing at Dublin City University [3]. Lexical search over automatic speech recognition transcripts has been the most successful strategy for content-based video retrieval, since speech is inherently semantic. The problem with the transcripts is their dependency on the quality of the available audio. Moreover, the spoken words do not fully cover the rich semantic content of a film. The question is how to utilize other cues derived from visual or auditory data to support the higher-level cognitive process of searching and browsing.

In [4], Rautiainen et al. introduced cluster-temporal browsing as an interactive method for navigating large video databases. The novelty lay in combining the traditional time-line presentation of videos and unsupervised content-based clustering into a single dynamic representation of video items. In that approach, the selection of the clustering parameters affects the overall performance of the method. In this paper we substitute the clustering with a nearest neighbour search to eliminate the effect of parameter selection and to focus on testing what benefit lexical features bring to traditional key frame based browsing. We report our interactive experiment on news video searching in TRECVID 2003. The paper is organized as follows: Section 2 describes the principal elements of cluster-temporal browsing. Section 3 describes the overall architecture of the test system and the various features employed in content-based analysis. Section 4 presents the experimental setup and results. Section 5 concludes the paper with discussion and directions for future work.

2. Cluster-Temporal Browsing

Traditionally, video browsing is understood as traveling along the temporal axis of continuous data. The browsing interface is often based on advanced playback or summarization of video clips, e.g. in the form of key frames or mosaics, which reduces the amount of video data the user has to wade through. This approach does not support weighing inter-video relationships and exhausts the user with a large amount of data. Content-based similarity search is another possibility. Unfortunately, the 'semantic gap' between the user's information need and the provided presentation is often insurmountable by computational techniques, especially when the content features are derived from low-level processing of the data. Recently, Heesch et al. [14] introduced a promising method for browsing video key frames, which they call 'lateral browsing'. It constructs navigational views of nearest neighbours in a feature weight space. A recent study [7] by Rodden and Wood suggests that the features users appreciate most when browsing digital photo archives are chronological ordering and the ability to see large numbers of images at once. Both are characteristics of cluster-temporal video browsing.

In cluster-temporal browsing, the low-level similarities of content features are integrated with less ambiguous information obtained from the temporal video structure. In videos where causal relations are used in storytelling, temporally close video segments are often semantically related. Previously, temporal adjacency has been used extensively in video segmentation. Yeung et al. [5] proposed the use of Scene Transition Graphs in video browsing. Zhang et al. [6] presented an integrated solution for sequential and hierarchical access to automatically parsed video. These methods emphasize the representation of a single video clip, which is inadequate for wading through hundreds of clips at once.

The arrows in Figure 1 illustrate the way cluster-temporal browsing combines inter-video similarities and temporal adjacency to maximize the information in a single view. The top horizontal arrow indicates the current video of interest, whose key frames are displayed in chronological order. The vertical arrows in each column depict shots from the video database. The results are ordered by decreasing similarity to the shot of the video of interest at the top of the column. The vertical columns form a plane of similarity computed from the entire database, in which the user can navigate along both the temporal and the content-similarity axis. The user can change the video of interest by choosing any of the shots visible in the representation. When a new video is selected, the representation is instantly updated. The user can also navigate laterally along the timeline of the video; the similar shots are updated whenever the user focuses on a specific portion of a clip. Computing the most similar shots requires processing parallel queries, which calls for powerful indexing and caching to update the view every time there is a navigational change or the user changes the features employed in the similarity measurement.
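To make the organization of the view concrete, the sketch below builds such a grid from parallel nearest-neighbour queries, one per shot of the video of interest. It is only an illustration of the data layout, not VIRE's implementation: the precomputed feature vectors, the L1 distance and the query_nearest helper are assumptions standing in for the system's actual indexes and caches.

```python
import numpy as np

def query_nearest(example_vec, db_vectors, k):
    """Hypothetical index lookup: rank all database shots by L1 distance
    to the example shot's feature vector and return the k closest ids."""
    dists = np.abs(db_vectors - example_vec).sum(axis=1)
    return np.argsort(dists)[:k]

def cluster_temporal_view(video_shot_ids, db_vectors, k=10):
    """Build the browsing grid: one column per shot of the current video
    of interest (temporal axis, left to right), each column listing the
    k most similar database shots (similarity decreasing downwards)."""
    columns = [query_nearest(db_vectors[s], db_vectors, k)
               for s in video_shot_ids]          # chronological order
    return np.array(columns).T                   # rows: similarity rank
```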

3. VIRE - Video Browsing and Retrieval Test System

Our prototype retrieval system, VIRE, consists of a query server, a content-based search tool and a cluster-temporal browser. The browser utilizes three different levels of semantic feature indexes (visual v, conceptual s and lexical l). The query server delivers ranked results based on these indexes and late feature fusion. The browser application creates the cluster-temporal organization of the result sets.

3.1 Feature Indexes

The index of visual features is constructed from the color and structural properties of a video shot. In [8][9], two low-level shot features have been introduced that extend traditional video features by shifting from the static key frame context to the temporal properties of video color and structure. The Temporal Color Correlogram (TCC) and Temporal Gradient Correlogram (TGC) features describe the statistical co-occurrences of colors and edges over a video sequence. The dissimilarities of the individual feature descriptors are based on the L1 distance, and the overall visual similarities are obtained with a modified Borda count: the TCC and TGC generate rank-order lists of similarity that are summed to produce the overall ranking [10].

The semantic concept feature index is constructed from defined concept terms that are detected from a shot with a certain confidence value between 0 and 1. Unsupervised training of semantic concept detectors creates a series of confidences for each video shot. The following detectors were implemented in VIRE: outdoors, news subject face, people, building, road, vegetation, animal, female speech, car/truck/bus, aircraft, news subject monologue, non-studio setting, sporting event, weather news and physical violence. These concepts were defined as part of the TRECVID 2003 Semantic Feature Detection task [11]. The conceptual similarity of two shots is computed by first determining the most descriptive concepts, those with the highest and lowest confidences, from the example shot. For example, a shot of an airplane flying in the sky has the top three concepts airplane, outdoors and no face. This ideal concept representation is then used in determining the similarity to other shots; see [10] for details.

The lexical index is constructed from the automatic speech recognition (ASR) data. The words of the ASR transcripts are first pre-processed with stop word removal and stemming and then indexed into the database. Grouping the words into speaker segments improves the contextual organization of the index and speeds up the lexical search. The lexical similarity of two shots is computed using Term Frequency Inverse Document Frequency (TFIDF) [12]. Before the TFIDF computation, stemming is used to remove any suffixes from the query words.
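As a concrete illustration of the visual ranking described above, the following sketch assumes that the TCC and TGC descriptors have already been extracted as fixed-length vectors for every shot; the descriptor extraction itself and the variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def l1_ranks(query_desc, db_descs):
    """Rank database shots by L1 distance to the query descriptor
    (rank 0 = most similar)."""
    dists = np.abs(db_descs - query_desc).sum(axis=1)
    order = np.argsort(dists)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(order))
    return ranks

def visual_ranking(query_id, tcc, tgc):
    """Modified Borda count over the TCC and TGC rank-order lists:
    the two rank lists are summed and the shots re-sorted."""
    summed = l1_ranks(tcc[query_id], tcc) + l1_ranks(tgc[query_id], tgc)
    return np.argsort(summed)    # shot ids, most visually similar first
```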

3.2 Late Feature Fusion

To produce the final result list, VIRE issues sub-queries to each feature index defined in the query. The rank-ordered sub-results can be considered as votes from the feature 'experts' v, s and l. To fuse these feature lists, we use a variant of Borda count voting:

f^t(n) = v^t(n)/V^t_max + s^t(n)/S^t_max + l^t(n)/L^t_max    (1)

F^t = [ sort{ f^t(1), ..., f^t(N) } ]_X    (2)

where
f^t(n) = overall rank of a result shot n for the example t,
v^t(n), s^t(n), l^t(n) = ranks of a result n given by the independent 'experts', e.g. the visual search v,
V^t_max, S^t_max, L^t_max = sizes of the independent result lists,
F^t = final ranked set of results for the example t,
[ . ]_X = the X top-ranked items of the sorted list.
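A minimal sketch of Equations (1) and (2), assuming the three experts have already returned rank-ordered lists of shot ids; how shots missing from an expert's list are scored is not specified in the paper, so the penalty used below is an assumption. Dividing each rank by the list size makes the experts' votes comparable even when their result lists differ in length.

```python
def late_fusion(expert_rankings, top_x=1000):
    """Normalised Borda-count fusion, cf. Eqs. (1) and (2).

    expert_rankings: list of rank-ordered shot-id lists, one per feature
                     'expert' (e.g. visual, conceptual, lexical).
    Returns F^t, the top_x shot ids of the fused ranking."""
    scores, appearances = {}, {}
    for ranking in expert_rankings:
        max_rank = float(len(ranking))              # V_max, S_max or L_max
        for rank, shot in enumerate(ranking, start=1):
            scores[shot] = scores.get(shot, 0.0) + rank / max_rank
            appearances[shot] = appearances.get(shot, 0) + 1
    # Assumption (not stated in the paper): a shot missing from an expert's
    # list is charged that expert's worst normalised rank, 1.0.
    for shot, seen in appearances.items():
        scores[shot] += len(expert_rankings) - seen
    fused = sorted(scores, key=scores.get)          # smaller sum = better rank
    return fused[:top_x]
```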

3.3 Cluster-Temporal Browser

Figure 1 shows a close-up segment of the browsing interface. The panel showing the top row of key frame images displays the sequential shots of the video of interest and a scrollable time-line. The user can sweep through the entire video chronologically to get a fast overview of the video content. The panel below gives the user the similarity view that is computed from the database.

[Figure 1. Cluster-temporal browsing interface: temporally ordered shots of a video (top panel) and content-based results (lower panel).]

The columns of the similarity panel show the most similar matches to the shots at the top, with similarity decreasing downwards. The user can select one or more of the 'visual', 'conceptual' and 'lexical' search features. The user can also organize the results by videos and display or hide the ASR content together with the key frames.

4. Experimental Results

Cluster-temporal browsing was evaluated in the TRECVID 2003 workshop, which provides a common framework for research groups around the world to test their content-based video retrieval systems [11]. NIST provided 24 search topics that were used in the interactive user experiments, and the search results were sent to NIST for evaluation. A search topic contained one or more example video clips or images and a textual topic description to aid the search process. The TRECVID 2003 data consisted mainly of ABC/CNN news from 1998 and about 13 hours of CSPAN programming, totalling about 133 hours of video data halved into system development and test sets. Speech recognition transcripts were provided by LIMSI [13]. NIST created the shot segmentation for every video, resulting in over 32000 video shots in the test collection. IBM organized a collaborative annotation effort among the TRECVID 2003 participants, which annotated the feature search development collection, creating over 60 hours of annotated video [11].

The interactive search experiment was carried out by a group of eight users with no prior experience of using the system. The test users, two of them female, were mainly information engineering undergraduate students with good computer skills but little experience in searching video databases. They were all used to web searching. The test subjects were organized into the following configuration:

Table 1. Interactive test configuration
Run ID    Searcher ID [topic IDs]
I1V (A)   S1[T1-T6]  S7[T7-T12]  S2[T13-T18]  S8[T19-T24]
I2VT (B)  S2[T1-T6]  S8[T7-T12]  S1[T13-T18]  S7[T19-T24]
I3V (A)   S3[T1-T6]  S5[T7-T12]  S6[T13-T18]  S4[T19-T24]
I4VT (B)  S4[T1-T6]  S6[T7-T12]  S5[T13-T18]  S3[T19-T24]

Two different system variants were tested in order to find out whether the combination of textual and visual browsing cues improves the search efficiency over key frame based browsing. System variant A excluded the 'lexical' browsing feature and the ASR text visualization, so that searching was based entirely on visual key frames. System variant B included lexical searching and the speech transcript visualization of shots, as shown in Figure 1. The Latin square configuration of the experiment (see Table 1) minimized the effect of 'random' proficiency in certain search topics and system configurations by alternating the sets of six topics and the two system variants A and B between the eight searchers. When the system configuration was changed halfway through the experiment, the users were given refreshments and a break to reduce the effect of fatigue. The effect of learning within the topic sets was not controlled, and most of the users processed the topics in numerical order. All users were given a half-hour introduction to the system, with emphasis on the search and browsing interface functions. The users were told to use approximately 12 minutes for each search, during which they browsed the shot database and selected shots that seemed to fit the topic description. After a search was finished, the computer created a final list of 1000 ranked shots using the found results as an example. The total time for the experiment was about three hours. The users were also asked to fill in a questionnaire about their experiences.

The average precisions for the four different search configurations are shown in Table 2. MAP is the mean of the average precisions over the 24 topics. The average search time was 10.68 minutes for system variant A and 11.63 minutes for variant B. The average number of 'hits at depth 30' was 8.19 for A and 10.88 for B; the corresponding 'hits at depth 10' figures were 4.81 and 6.33. In other words, the users spent about 9% more time browsing with both visual and textual cues, which resulted in about a 30% increase in found matches within the first 10-30 results.

Table 2. MAP and average time for the search runs
Search Run ID           MAP     Elapsed Time Avg
I1V (A)                 0.172   11.2 min
I2VT (B)                0.241   12.1 min
I3V (A)                 0.156   10.2 min
I4VT (B)                0.207   11.2 min
Median of all systems   0.184   9.5 min
Max of all systems      0.476   15 min
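For readers unfamiliar with these measures, the sketch below shows one standard way to compute hits at a given depth and non-interpolated average precision from a single ranked result list with binary relevance judgments; it is a generic illustration of the metrics, not the evaluation code used by NIST.

```python
def hits_at_depth(ranked_ids, relevant_ids, depth):
    """Number of relevant shots among the first `depth` results."""
    return sum(1 for shot in ranked_ids[:depth] if shot in relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one topic; the mean of this
    value over all topics gives MAP."""
    hits, precision_sum = 0, 0.0
    for i, shot in enumerate(ranked_ids, start=1):
        if shot in relevant_ids:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0
```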

According to the questionnaire, the VIRE system was easy to learn and somewhat easy to use. This was a qualitative improvement over the previous version of the browser [4] and was partially due to the use of the ASR textual transcripts. The users preferred browsing with the visual and lexical features enabled and the ASR transcripts visible. Similar results were preferably viewed in shot columns rather than by video rows.

5. Conclusions

The cluster-temporal browser provides multiple levels of search features for the user. The system offers two parallel search paths in a single interface, which differentiates it from traditional approaches. Cluster-temporality gives the user a better understanding of the various relations between video shots during browsing, which supports navigational decision-making. According to the users, a browsing history was a desired feature, so it will be part of future work. The experiment showed that the spoken content provides additional cues for browsing and improves the browsing results noticeably. Overall, the cluster-temporal browser obtained above-median performance in the TRECVID 2003 evaluation.

6. References

[1] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele & P. Yanker, "Query by Image and Video Content: The QBIC System", IEEE Computer Magazine 28(9), 1995, pp. 23-32.
[2] Informedia, http://www.informedia.cs.cmu.edu/
[3] A. Smeaton, "Browsing Digital Video in the Físchlár System", Keynote presentation at Infotech Oulu International Workshop on IR, Oulu, Finland, 2001.
[4] M. Rautiainen, T. Ojala & T. Seppänen, "Cluster-temporal video browsing with semantic filtering", Proc. of Advanced Concepts for Intelligent Vision Systems, Ghent, Belgium, 2003, pp. 116-123.
[5] M. Yeung, B.L. Yeo, W. Wolf & B. Liu, "Video browsing using clustering and scene transitions on compressed sequences", Proc. of Multimedia Computing and Networking, February 1995, pp. 399-413.
[6] H.J. Zhang, J. Wu, D. Zhong & S.W. Smoliar, "An integrated system for content-based video retrieval and browsing", Pattern Recognition, Vol. 30(4), Apr 1997, pp. 643-658.
[7] K. Rodden & K.R. Wood, "Searching and organizing: How do people manage their digital photographs?", Proc. of the Conference on Human Factors in Computing Systems, April 2003, pp. 409-416.
[8] M. Rautiainen & D. Doermann, "Temporal color correlograms for video retrieval", Proc. of 16th International Conference on Pattern Recognition, Quebec, Canada, 2002, pp. 267-270.
[9] M. Rautiainen, T. Seppänen, J. Penttilä & J. Peltola, "Detecting semantic concepts from video using temporal gradients and audio classification", Int. Conference on Image and Video Retrieval, Urbana, IL, 2003, pp. 260-270.
[10] M. Rautiainen, J. Penttilä, P. Pietarila, K. Noponen, M. Hosio, T. Koskela, S.M. Mäkelä, J. Peltola, J. Liu, T. Ojala & T. Seppänen, "TRECVID 2003 experiments at MediaTeam Oulu and VTT", Proceedings of TRECVID, November 2003.
[11] TREC Video Retrieval Evaluation Home Page, http://nlpir.nist.gov/projects/trecvid/
[12] G. Salton & C. Yang, "On the specification of term values in automatic indexing", Journal of Documentation, Vol. 29, 1973, pp. 351-372.
[13] J.L. Gauvain, L. Lamel & G. Adda, "The LIMSI Broadcast News Transcription System", Speech Communication, 37(1-2), 2002, pp. 89-108.
[14] D. Heesch, M.J. Pickering, S. Rüger & A. Yavlinsky, "Video Retrieval within a Browsing Framework Using Key Frames", Proceedings of TRECVID, November 2003.
