Audio Data Analysis using Parametric Representation of Temporal Relations


Zein Al Abidin IBRAHIM (1), Isabelle FERRANE (1), Philippe JOLY (1)
(1) IRIT, Institute of Research in Computer Science, Paul Sabatier University, Toulouse Cedex 3, France
+33 (0)5 61 55 60 55
{ibrahim, ferrane, joly}@irit.fr

Abstract

The aim of our work is the automatic analysis of audiovisual documents to retrieve their structure by studying the temporal relations between the events occurring in them. Different elementary segmentations of a same document are necessary. Then, starting from a parametric representation of temporal relations, a Temporal Relation Matrix (TRM) is built. In order to analyze its content, a classification step is carried out, either to identify relevant relation classes or to observe predefined relations such as Allen's relations. The use of segmentation tools raises the problem of the reliability of the segmentation results. Through a first experiment, we analyze the effect of segmentation errors on the study of temporal relations. Then, as a second experiment, we apply our parametric method to a TV game document in order to analyze audio events and to see whether the observations made can reveal information about the document content or its structure.

1. Introduction

The automatic analysis of audiovisual documents, to retrieve their structure or to determine which kind of documents they belong to, is the main motivation of our work. Many tools have been developed for automatic audiovisual content indexing. They generally produce results in the form of segment sequences, where segments are related to the presence of a specific basic feature, as in dominant color segments or speech segments. To detect relevant events or to generate summaries or indexes for further retrieval or processing steps, low-level feature extractions are generally performed in a preliminary step (gradual transitions, applause on the soundtrack, text or character presence on the screen, quantity of movement, ...). Indexes produced this way are usually associated with a single modality and suffer from their own lack of semantics. Combining primary indexes that characterize an audiovisual content may be a way to produce more meaningful indexes. Existing tools developed for an indexing task can be classified on the basis of the features involved in the analysis process. Colour and shape are used in [1] to classify TV news into semantic classes. Audio features are used in [2] to classify events from football game audio into semantic classes. Multimodal features are also used in [3] to index news videos by detecting six semantic classes. These approaches all have in common the fact that they are built on a priori knowledge about how events are temporally related to each other according to the type of audiovisual document considered. This observation was the starting point of our reflection about the need to define a more generic approach for analysing document content. The idea was to propose a method for observing temporal relations, whatever they may be, between events that occur in a same document. Events are segments based on audio or video features and belong to different segmentation results. We have presented in [4] a parametric representation of temporal relations from which the document content can be analysed. In doing so, we have a double objective: to look for the temporal relations that are the most representative of the document content and structure and then, to use these relations to detect more semantic events occurring in this document.

This paper is organized as follows: in the second section, we will describe our parametric representation of temporal relations and explain how it is used for building a specific matrix called the TRM. Before analyzing this matrix, quantization and classification steps are necessary. A short example based on Allen's relations will be given as an illustration of our representation. We will see how temporal relations can be combined by means of a conjunction operation. In the third section, we will address the problem of the reliability of the segmentation results that may be used as input to our process, and their effect on temporal relation analysis. Then, in the fourth section, we will present the results of an experiment based on audio segmentations of a TV game. The aim of this experiment was to apply our method and see whether the observations can reveal information about the document content or its structure. Finally, in section five, we will conclude and present future work on this topic.

2. Parametric representation of temporal relations

To study the temporal relations in video or audio contents and extend our study to subtle or unpredictable relations between any kind of events, we needed tools to represent and reason about time. Time can be considered under different forms: time-line, time intervals, time points, durations, or time positions. Different approaches to temporal representation and reasoning are presented in [5], [6]. Existing models to express temporal relationships can be divided into two classes: point-based [7] and interval-based models [8]. In point-based models, points are elementary units along the time dimension. Each event is associated with a time point and the basic relations between two points are before (<), after (>) or simultaneous to (=). Interval-based models consider elementary media entities as time intervals which can be ordered according to different relations. The existing models are mainly based on the relations defined by Allen in [8] to express knowledge about time. An interval can be seen as a point and a duration along the time dimension. In our case, this last representation is the best adapted to our work because we use as input different elementary segmentations of a same document.

2.1. Elementary segmentation of a document

An elementary segmentation is a set of intervals where only one type of event occurs. Specific segmentation tools, for example a tool for person detection in a video or a tool for speech or applause detection on the soundtrack, can automatically produce such segmentations. If no specific tool is available, segmentations are made manually. Each segmentation result is based on a specific low- or mid-level feature (dominant colour, person appearance, speech or music) and each segment indicates the presence of this particular characteristic, which is considered as an event. So, in previous work [4], we have proposed a representation where a segmentation result S is a set of N temporally disjoint segments si and can be defined by S = {si} where i ∈ [1,N]. A temporal interval si is characterised by its two endpoints, its beginning (sib) and its end (sie), so that si can also be written [sib, sie].
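To make this representation concrete, here is a minimal Python sketch (not part of the original paper); the Segment class and the example values are hypothetical illustrations of an elementary segmentation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    """A temporal interval [b, e], with b <= e (hypothetical helper)."""
    b: float  # beginning time, in seconds
    e: float  # end time, in seconds

# An elementary segmentation S = {s_i}: temporally disjoint segments,
# e.g. speech segments detected on the soundtrack (values are made up).
speech: List[Segment] = [Segment(0.0, 4.2), Segment(6.0, 9.5), Segment(12.1, 15.0)]
```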

2.2. Temporal relation between two segments

Let us now consider two elementary segmentations S1 and S2 as two sets of temporally disjoint segments {s1i} and {s2j}, in which each segment is considered as a temporal interval: s1i = [s1ib, s1ie] and s2j = [s2jb, s2je] (where i ∈ [1,N1] and j ∈ [1,N2]). We represent the temporal relation between a couple of segments (s1i, s2j) by means of three variables [9]:

DE = s2je − s1ie
DB = s1ib − s2jb
Lap = s2jb − s1ie

This can also be written in the form s1i R (a, b, c) s2j, where R is the temporal relation observed between s1i and s2j, and a, b, c its three parameters (a = DE, b = DB, c = Lap). A graphical representation of temporal relations then follows from this three-parameter representation, where each relation is a point in a three-dimensional space.
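The three parameters follow directly from the endpoint definitions. The sketch below (assuming the hypothetical Segment class introduced above) computes them for one couple of segments:

```python
def relation_parameters(s1: Segment, s2: Segment):
    """Compute (DE, DB, Lap) for the relation s1 R(a, b, c) s2."""
    de = s2.e - s1.e   # DE: difference between the two ends
    db = s1.b - s2.b   # DB: difference between the two beginnings
    lap = s2.b - s1.e  # Lap: gap between the end of s1 and the beginning of s2
    return de, db, lap
```

For instance, relation_parameters(Segment(0.0, 4.2), Segment(6.0, 9.5)) returns approximately (5.3, -6.0, 1.8), one point in the three-dimensional relation space.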

2.3. TRM definition

A three-dimensional matrix called the Temporal Relation Matrix (TRM) can also be created from this three-parameter representation by associating each parameter to one of the matrix dimensions. Each cell of the matrix will be an accumulator counting the number of times a temporal relation is observed. The study of two elementary segmentations S1 and S2 of a same document will produce a TRM that is filled as each interval s1i of S1 is compared to each interval s2j of S2, so that the three parameters that characterise each couple (s1i, s2j) are computed. Then a weight, set to 1, is added to the TRM cell indexed by DE, DB and Lap. The TRM can be used directly to determine the frequencies of potential relations, but it can also be used as a vote matrix to observe remarkable distributions of votes and to identify general rules about the temporal behaviour of the events occurring in the document. Before that, we have to solve some representation problems. The first one concerns time units, which may differ from one segmentation to the other: because, for example, visual features are computed for each image while audio features are time-based, it is necessary to come down to the same time unit, that is to say, in our case, the smaller one. The next problem is related to the type of index values. The three parameters used as indexes in the TRM have real values, so the TRM might be very large due to the document duration. The index values need to be transformed into integer values to reduce the TRM size.
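As an illustration of this accumulation and quantization step, the following sketch (our own, reusing the hypothetical helpers above) builds a TRM, with a sparse dictionary standing in for the 3D matrix and an arbitrary 40 ms common time unit:

```python
from collections import Counter

def build_trm(seg1, seg2, unit=0.04):
    """Accumulate one vote per observed (DE, DB, Lap) triple.

    The real-valued parameters are quantized to integers by expressing
    them in a common time unit (40 ms here is an arbitrary stand-in for
    the smaller of the two segmentations' time units)."""
    trm = Counter()
    for s1 in seg1:
        for s2 in seg2:
            de, db, lap = relation_parameters(s1, s2)
            key = (round(de / unit), round(db / unit), round(lap / unit))
            trm[key] += 1  # the cell acts as an accumulator of votes
    return trm
```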

2.4. TRM classification

Once the TRM between two segmentations of the same document has been computed, an analysis step, based on the frequencies of the observed relations, whatever they may be, has to be made. Unlike other vote techniques, a significant relation cannot be reduced to a single maximum value in a cell. By considering the subparts of the TRM where votes are distributed, more semantic information can be brought to the fore. So, our purpose is to localise clouds of points in the graphical representation, or votes in the TRM, that correspond to significant information which will help us characterise the content of the document or its structure. To do so, a first approach is to classify the data contained in the TRM, using for example the k-means or fuzzy c-means classifier. Another approach for classifying these data consists in dividing the TRM into subparts according to prior knowledge related to some predefined semantic relations, such as Allen's relations [8]. The constraints of each of Allen's relations are then adapted to the space of our parameters. This is illustrated in Table 1, where constraints on the parameter values are defined according to the meaning of the two Allen's relations 'before' and 'after'.


Table 1: three-parameter representation of the 'before' and 'after' relations

Relation | Lap               | DB        | DE
Before   | 0 < Lap ≤ α       | DB < −Lap | DE > Lap
After    | DE − DB < Lap < 0 | DB > 0    | DE < 0

If s1 is 'before' s2, that means that Lap must be greater than zero. As Lap represents the gap between the first segment and the second one, in some cases it becomes necessary to define an upper limit (here called α) so that the relation between these two events remains relevant. The constraints expressed in this way limit the corresponding zone in the graphical representation. A description of all the constraints determined for Allen's relations can be found in [4]. A graphical representation of the 'before' relation is given in Fig. 1, where x = DE, y = DB, and z = Lap. This can be generalised to each of Allen's relations, giving in this way a representation in a 3D space.

Fig. 1: The 'before' relation represented in a three-dimensional space

After this classification step, the occurrence of each relation class is computed as the sum of the votes that fall in the corresponding subpart. The size of the TRM is thus reduced to a cubic 3D matrix with dimensions equal to the number of classes or of predefined relations.
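A possible implementation of this rule-based classification, restricted to the two relations of Table 1, is sketched below (assuming the quantized TRM dictionary from section 2.3); the DB constraint for 'after' is our reconstruction from the parameter definitions, and the default α = 10 echoes the value used in section 4:

```python
from collections import Counter

def classify_allen(de, db, lap, alpha=10.0):
    """Assign a (DE, DB, Lap) triple to 'before', 'after' or 'other',
    following Table 1 (alpha bounds the gap beyond which a relation is
    no longer considered relevant)."""
    if 0 < lap <= alpha and db < -lap and de > lap:
        return "before"
    if de - db < lap < 0 and db > 0 and de < 0:  # DB > 0 is our reconstruction
        return "after"
    return "other"

def classify_trm(trm, alpha=10.0):
    """Sum the votes falling in each relation subpart of the TRM."""
    counts = Counter()
    for (de, db, lap), votes in trm.items():
        counts[classify_allen(de, db, lap, alpha)] += votes
    return counts
```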

2.5. Conjunction of temporal relations

Our aim is to find a way to bring to the fore events of a higher informative level than the basic events used as input. Temporal relations belonging to the same class can be representative of a type of event. Combining temporal relations belonging to different classes can be a way to obtain more meaningful events. In previous works on Allen's temporal relations, the conjunction of relations has already been studied from a qualitative point of view; the result was a set of possible relations. In our case, we can consider conjunction from a quantitative point of view, because the result of the conjunction of two relations can be expressed with the same three-parameter representation. Let us consider three temporal intervals s1i, s2j, and s3k respectively belonging to three segmentations S1, S2 and S3, and two temporal relations R1 and R2, each one characterized by its three parameters as seen in section 2.2 (a = DE, b = DB, c = Lap), such that:

s1i R1 (a1, b1, c1) s2j

s2j R2 (a2, b2, c2) s3k

The relation R3 corresponding to the conjunction of R1 and R2 can be defined by three parameters as well. Their values are calculated from those of the two relations R1 and R2 as shown below: s1i R3 (a3, b3, c3) s3k with

a3 = a1 + a2
b3 = b1 + b2
c3 = c1 − b2

If R1 belongs to a relation class C1, and R2 to another class C2, a third class C3 can be brought to the fore from the conjunctions of the temporal relations of the first class with those of the second class. An example of relevant information stemming from the conjunction of temporal relations will be given in section 4.
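The conjunction reduces to simple arithmetic on the parameter triples. A minimal sketch, assuming relations are stored as (a, b, c) tuples:

```python
def conjunction(r1, r2):
    """Compose R1 (between s1 and s2) with R2 (between s2 and s3)
    into R3 (between s1 and s3)."""
    a1, b1, c1 = r1
    a2, b2, c2 = r2
    return (a1 + a2,   # a3: the end differences accumulate
            b1 + b2,   # b3: the beginning differences accumulate
            c1 - b2)   # c3: gap between the end of s1 and the beginning of s3
```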

3. Effect of segmentation errors on data analysis

The use of segmentation tools raises the problem of the reliability of the segmentation results. Compared to a reference segmentation, a segment produced by a segmentation tool cannot be exactly identical to its reference. One of its endpoints can be shifted forward, for example when the beginning or the end of an event has been anticipated. It can also be shifted backwards, for example when there is latency in the event detection, because borders can be difficult to find, as in sound detection. Both endpoints can also be shifted. To take this lack of reliability into account, a segment produced by a segmentation tool will be represented by the differences between its own endpoints and those of its reference. If [t1, t2] represents the reference segment, then [t1+α, t2+β] represents the segment produced, where the differences α and β are two integers (positive, negative or equal to zero). Our idea is to model the error in order to observe its effects on the temporal relation analysis. If we consider, for example, the Allen temporal relations 'starts', 'finishes', 'equal' or 'meets' that can be observed between two reference segments (if s1 'starts' at the same time as s2, for example), an endpoint shift in the automatic segmentation results can be critical for these relations. On the other hand, other temporal relations like 'before', 'overlaps' and 'during' in Allen's categorization can be insensitive to slight shifts. To study error effects on temporal relations, without limiting ourselves to Allen's relations, we have carried out a first experiment. Automatic segmentation results were available, produced respectively by applying a speech segmentation tool [10] and a person detection tool [11] to the same audiovisual program (a TV game). These segmentations were used as reference segmentations. Then, to simulate segmentation errors, a Weibull distribution was applied to shift several intervals in each segmentation.

All the kinds of shift that may occur in the feature extraction domain have been studied. A TRM has been built in each case: one with the reference segmentations (TRMref), the other one with the shifted segmentations (TRMshift). Then each one has been classified by carrying out different classifications: two-class, three-class and four-class. We noticed that the error effects were negligible, because the occurrences of the relations classified on both sides were approximately equal, though precision decreased as the number of classes increased.
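To give an idea of how such an error simulation can be set up, here is a sketch using NumPy's Weibull generator and the hypothetical Segment class from section 2.1; the scale, shape and sign conventions are illustrative assumptions, not the values of the original experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_segments(segments, scale=0.5, shape=1.5):
    """Shift both endpoints of each segment by a Weibull-distributed
    amount with a random sign, to mimic anticipated or delayed borders
    (scale and shape values are illustrative only)."""
    shifted = []
    for s in segments:
        alpha = rng.weibull(shape) * scale * rng.choice([-1, 1])
        beta = rng.weibull(shape) * scale * rng.choice([-1, 1])
        end = max(s.b + alpha, s.e + beta)  # keep the interval well-formed
        shifted.append(Segment(s.b + alpha, end))
    return shifted
```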

4. Audio data analysis using TRM

4.1. Experiment context

As said in the introduction, our main motivation is to use our parametric representation, which has been designed to be generic, that is to say, independent of the audiovisual document type and of the kind of temporal relations to observe. As an applicative framework, we have focused on the study of audio events, using several different audio segmentation results as input to our process. Our purpose was to see whether relevant information about the document content or its structure could be brought to the fore by analyzing audio data using TRMs. In this context, a second experiment was made on the same TV game program used in section 3, the length of which is about 31 minutes. Eight speakers were identified. No tool was available to produce an automatic speaker segmentation, so an elementary segmentation, as defined in section 2.1, was made manually for each speaker. These eight segmentation results represent the only knowledge at our disposal.

4.2. TRM analysis

Using our parametric representation of temporal relations, a TRM has been computed, as explained in section 2.3, for each couple of different speaker segmentations, which corresponds to 28 TRMs. This was done in order to analyze the distribution and the frequencies of the temporal relations that can be observed. As shown in the example of Allen's relations given in section 2.4, constraints sometimes need to be specified to limit the scope of the observations. The Lap parameter gives a measure of the gap between the two segments put in relation (cf. §2.2). A high value means that the segments are very far from each other (from a temporal point of view) and the temporal relation which characterizes them may not be relevant enough. That is why a limit called α has been introduced previously. Direct conversations between speakers can be one of the audio events that characterize the document content or its structure. To allow such events to come to light automatically, a value must be given to α. At first, it was set to 1, assuming that a one-second duration was in general a maximum gap for detecting a relevant relationship between two segments in a conversational context. Nevertheless, the results obtained were quite poor. The main reason was that, while this value is well adapted when two speakers are talking together at a normal rhythm, so to speak, it is not adapted to the audiovisual document we were analyzing. In this TV game, players need time to think about the right answer, so the value of α must be increased to detect such interactions between players. Then, to obtain more informative TRMs, α was set to 10. To illustrate the kind of results obtained at the end of this first analysis step, let us consider four of the eight segmentation results: the segmentations related to speakers #2, #3, #4 and #5 and the associated TRMs. The graphical representations of the temporal relations observed between speakers #2 and #3 (TRM 2,3) and between speakers #4 and #5 (TRM 4,5) are given respectively in Fig 2.a and Fig 2.b.

Fig 2.a: Graphical representation of the TRM computed between speaker #2 and speaker #3

Fig 2.b: Graphical representation of the TRM computed between speaker #4 and speaker #5

The four other TRMs (TRM 2,4, TRM 2,5, TRM 3,4 and TRM 3,5) are practically empty: comparatively, the global number of votes for the temporal relations observed in each case is very low, while it reaches 247 for TRM 2,3 and 450 for TRM 4,5 (see Table 2 for details). If we look afterwards at the structure of the TV game, we can notice that there are two teams involved, each one with two players. Each round of the game consists of exchanges between members of the same team, the first one asking a question that the other one has to answer. Direct exchanges are more frequent between members of the same team than between members of different teams. The direct conversations brought to the fore between speakers #2 and #3, as well as between speakers #4 and #5, reflect the fact that each of these speaker couples corresponds in the game to one of the two teams.

Three other observations can also be made after this step of TRM analysis. The first one is that speaker #1 is the only speaker who has exchanges with every other one. Looking afterwards at the TV game, this speaker has in fact a specific role, which is to present and lead the game; so, he needs to interact with all the others. On the contrary, the second observation we can make is that speaker #6 seems to speak only with speaker #1 and with speaker #4. The first exchanges result from the specific role of speaker #1, as explained just above. The others, with speaker #4, are in fact due to a special round involving a player (in this case speaker #4) and one person from the audience. The third observation concerns the number of votes in TRM 2,3 and in TRM 4,5. We can notice that it is higher in the second case. In fact, each round of this TV game involves one team at a time, where a player has to make the other one guess as many words or expressions as possible within the time allowed. The second team, with a higher score, is in fact the better one, finding answers faster than the other team.

4.3. TRM classification

As illustrated above, this first analysis step may be used to see whether a relevant link can be established between two events (two speaker segments in the case of our example). From this first observation, the idea is to go further in our TRM analysis and see whether more relevant information may be conveyed by the TRM. In our experimental context, other audio events like conversations may be highlighted. When two people (A and B) are talking together, we can consider two cases: A is speaking to B, or B is speaking to A. So, we can consider two classes of relations: the first class C1 (speaker A / speaker B) and the second one C2 (speaker B / speaker A). In this step, the k-means classifier has been used to divide the votes of each TRM into the two classes C1 and C2. The number of votes in each class is reported in Table 2, where A and B represent the numbers of the two speakers involved in TRM A,B. In case of more or less brief conversations between two speakers, the two classes almost counterbalance each other.

Table 2: Classification of each TRM

A B | C1  | C2      A B | C1  | C2      A B | C1 | C2
1 2 | 65  | 60      1 3 | 49  | 49      1 4 | 84 | 71
1 5 | 106 | 97      1 6 | 6   | 5       1 7 | 89 | 79
1 8 | 3   | 5       2 3 | 123 | 124     2 4 | 4  | 7
2 5 | 6   | 6       2 6 | 0   | 0       2 7 | 6  | 7
2 8 | 0   | 0       3 4 | 6   | 5       3 5 | 10 | 5
3 6 | 0   | 0       3 7 | 7   | 4       3 8 | 0  | 0
4 5 | 245 | 205     4 6 | 4   | 8       4 7 | 15 | 19
4 8 | 0   | 0       5 6 | 0   | 0       5 7 | 39 | 26
5 8 | 0   | 0       6 7 | 1   | 0       6 8 | 0  | 3
7 8 | 4   | 3

The classification results in Table 2 and the number of votes in each class only give a global idea of the number of exchanges between each couple of speakers over the whole TV program. If we consider again the couple #4 and #5, the class C1 (A is speaking to B) counts 245 votes, while the other class C2 (B is speaking to A) gathers 205 votes. It could be interesting to know whether these exchanges belong to the same conversation or whether they are separated in time.
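As a sketch of this two-class split (assuming the sparse TRM dictionary from section 2.3 and scikit-learn's k-means), each vote becomes a point in the (DE, DB, Lap) space; deciding which cluster corresponds to 'A speaks to B' could, for instance, rely on the sign of Lap, an assumption of ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_into_two_classes(trm):
    """Divide the votes of a TRM into two clusters with k-means.

    The returned counts play the role of the C1 and C2 columns
    of Table 2."""
    points = np.array(
        [key for key, votes in trm.items() for _ in range(votes)], dtype=float)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    return int(np.sum(labels == 0)), int(np.sum(labels == 1))
```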

4.4. Conjunction of temporal relations

A conversation, as consecutive exchanges between two people, can be considered as a sequence of speech turns where A is talking to B and then B is talking to A. Temporal relations belonging to the class C1 can be combined with the temporal relations belonging to the second class in order to highlight a conversational scheme: (speaker A / speaker B / speaker A). This can be done by applying a conjunction operation between temporal relations of each class, as defined in section 2.5. The same thing can be done between the temporal relations belonging to the class C2 and those of the class C1 to model exchanges such as (speaker B / speaker A / speaker B). So, more complex schemes, in our case conversational schemes, can be brought to the fore by conjunction of temporal relations. Using conjunction operations, we have retrieved the sequences of exchanges between speakers that model a conversation. We have observed that long conversations (a high number of consecutive exchanges) appear only between speakers #2 and #3, as well as between speakers #4 and #5. In the same way, other schemes can also be highlighted. In addition to the eight speaker segmentations, we have considered a new basic audio segmentation result. Also manually made, it consists of applause segments. To represent temporal relations between each speaker and the applause events, eight new TRMs have been computed and a two-class classification step carried out. Each TRM has been divided into two relation classes: the class (speaker #X / applause) and the class (applause / speaker #X). Now, if we combine temporal relations from the appropriate classes, other schemes will appear, such as (speaker A / speaker B / applause) or (speaker A / speaker B / speaker A / speaker B / applause), that is to say a sequence of exchanges closed by applause, which can reveal a successful phase of the game. After the detection of the consecutive exchanges, which we call conversation events, we have used this last model to retrieve the game phases. Most of them were correctly detected, but others were mixed with the next one because, in case of an unsuccessful phase of the game, no applause follows and the players directly start a new game phase.

In our TV game, a round is a sequence of game phases. So, if we generalize the conjunction of temporal relations and apply it to temporal relations directly observed between basic events, as well as to temporal relations resulting themselves from a conjunction, like game phases, events like rounds can be modeled by:

(speaker A / speaker B (/ speaker A / speaker B)* / applause / (speaker A / speaker B (/ speaker A / speaker B)* / applause)*)

where * means repeated zero, one or several times. The duration of applause segments might also be an indication of the degree of importance or granularity of the event detected, but we have not taken this information into account.
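This round model is essentially a regular expression over the sequence of detected events. A compact sketch, with a hypothetical single-character encoding of the event stream:

```python
import re

# Hypothetical one-character encoding of the detected event stream:
# 'a' = speaker A turn, 'b' = speaker B turn, 'p' = applause.
ROUND = re.compile(r"(?:ab(?:ab)*p)+")  # (A B (A B)* applause)+

def find_rounds(events: str):
    """Return the spans of maximal subsequences matching the round model."""
    return [m.span() for m in ROUND.finditer(events)]

# Two successful game phases, each closed by applause, form one round:
print(find_rounds("ababpabp"))  # [(0, 8)]
```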

5. Conclusion and future work

In this paper, we have presented the parametric representation of temporal relations that we have defined and on which our work is based. Starting from two elementary segmentations bringing low- or mid-level information about the same audiovisual document, we build a Temporal Relation Matrix (TRM). Classification of this matrix content brings temporal relation classes to the fore. Temporal relations belonging to these classes can be combined using a conjunction operation, bringing out information or events of a higher level. Although manual segmentations were mainly used in our experiments, we also addressed the problem of the effects of segmentation errors on our process, when the segmentation results are not fully reliable. To give an illustration of how our method can be used, we have chosen a TV game, focusing on audio event analysis as an applicative framework. Our purpose in this first experiment was to present the kind of results produced by our method and to show some interesting paths to follow, and to go deeper into, towards an automatic analysis of document content. The automatic detection of events carrying high-level information is necessary in indexing tasks, just as in the automatic characterization of audiovisual document content or structure. Bringing out the main events, which are representative of a document structure, makes it possible to calculate the similarity between audiovisual documents belonging to a same collection, which could be of interest for data mining purposes. Although we applied our method only to audio events, it has been designed to be independent of the type of basic events taken into account, of the type of temporal relations to observe, as well as of the type of audiovisual document. Concerning the first point, we are carrying out other experiments using audio events (speaker segments) and video ones (presence of one or two people on the screen) to see which kinds of events can be brought out by automatic analysis. Although Allen's temporal relations are important in terms of temporal analysis, we do not limit ourselves to looking for predefined relations: our parametric representation allows us to observe relations with less semantic meaning than Allen's ones. On the third point, however, we have to be cautious. The value we gave to α in our experiment was chosen by us to be the most appropriate for the document content. In future work, this choice should be made according to values defined after a training step carried out on several videos, in order to take the decision automatically. Along the same lines, the number of classes to find might be determined during the classification step as the optimal number of classes. Finally, our method must be applied to a set of documents to validate the automatic extraction of high-level events characterizing their semantic content. These are the directions of our future work.

6. References

[1] Y. Avrithis, N. Tsapatsoulis and S. Kollias, "Broadcast News Parsing Using Visual Cues: A Robust Face Detection Approach," in Proc. IEEE International Conference on Multimedia and Expo, New York City, NY, July 2000.

[2] S. Lefèvre, B. Maillard and N. Vincent, "3 Classes Segmentation for Analysis of Football Audio Sequences," in Proc. ICDSP 2002, Santorini, Greece, July 2002.

[3] S. Eickeler and S. Müller, "Content-Based Video Indexing of TV Broadcast News Using Hidden Markov Models," in Proc. IEEE ICASSP, Phoenix, AZ, 1999.

[4] Z. Al Abidin Ibrahim, I. Ferrané and P. Joly, "Temporal Relation Analysis in Audiovisual Documents for Complementary Descriptive Information," … of AMR 2005, Glasgow, UK, July 28-29, 2005.

[5] L. Chittaro and A. Montanari, "Temporal Representation and Reasoning in Artificial Intelligence: Issues and Approaches," Annals of Mathematics and Artificial Intelligence, vol. 28, pp. 47-106, 2000.

[6] A. K. Pani, "Temporal Representation and Reasoning in Artificial Intelligence: A Review," Mathematical and Computer Modelling, 34:55-80, 2001.

[7] M. Vilain and H. A. Kautz, "Constraint Propagation Algorithms for Temporal Reasoning," in AAAI-86, pp. 132-144, 1986.

[8] J. F. Allen, "Maintaining Knowledge about Temporal Intervals," Communications of the ACM, 26(11):832-843, 1983.

[9] B. Moulin, "Conceptual Graph Approach for the Representation of Temporal Information in Discourse," Knowledge-Based Systems, vol. 5, no. 3, pp. 183-192, 1992.

[10] J. Pinquier, J.-L. Rouas and R. André-Obrecht, "Robust Speech/Music Classification in Audio Documents," in Proc. ICSLP 2002, Denver, Colorado, September 2002.

[11] G. Jaffré and P. Joly, "Improvement of a Person Labelling Method Using Extracted Knowledge on Costume," in Proc. of the 11th International Conference on Computer Analysis of Images and Patterns, Rocquencourt, France, September 2005.
