Text Segments as Constrained Formal Concepts

Doina Tatar
University "Babeş-Bolyai", Cluj-Napoca, Romania

Epaminondas Kapetanios
University of Westminster, London, UK

Christian Sacarea
University "Babeş-Bolyai", Cluj-Napoca, Romania

Diana Tanase
University of Westminster, London, UK

Abstract: In this paper we address the problem of segmenting (and summarizing) a text by Formal Concept Analysis (FCA), using Lexical Chains established for the text. The first proposed method relies on the conceptual hierarchy (concept lattice) derived for a formal context expressing the mapping of Lexical Chains onto the text. The second offers a conceptual view of a segmentation, using a conceptual clustering of sentences. To the best of our knowledge, there have been no previous attempts to solve the segmentation (and summarization) of a text by FCA.

1. Introduction

Formal Concept Analysis was introduced in the early 1980s as an attempt to restructure lattice theory. It offers a variety of methods for discovering relationships between objects and their attributes in a given context. FCA has great potential to be applied to a variety of linguistic domains as a method of knowledge representation, and FCA applications provide a very suitable alternative to statistical methods. It is somewhat surprising that FCA is not used more frequently in linguistics [17]. The reason could be that the notion of "concept" in FCA does not correspond exactly to the notion of "concept" as developed in Computational Linguistics. However, the notion of "synset" in WordNet [6] could be assimilated to the former, if the semantic relations in WordNet are considered as subconcept-superconcept relations in FCA. In [18] it is stated that "it is entirely safe to use FCA for linguistic applications as long as it is emphasized that the mathematical vocabulary should be separated from the linguistic vocabulary". In this paper we address the problem of segmenting and summarizing a text by an FCA method. To the best of our knowledge, there have been no previous attempts to solve the segmentation and summarization of a text by FCA. The knowledge needed for Algorithm A is only the set of Lexical Chains, and for Algorithm B it is the taxonomy derived from a text [4], [5]. The paper is structured as follows: Section 2 introduces the FCA basics. Section 3 surveys related work on segmenting texts into hierarchical and linear segments. Section 4 presents the new FCA method of segmentation, based on Lexical Chains, and shows how a concept-oriented segmentation can be generated by using the taxonomy extracted from a text ([4]). We finish the paper with conclusions and future work directions in Section 5.


2. A short survey of Formal Concept Analysis

An attempt was undertaken in 20th-century mathematics to weave together the threads of hierarchies, vector spaces and Boolean logic, conceived as different variants of one underlying structure called a lattice [2]. The distinctive property of a lattice is that any two elements can be disjoined (allowed to spread out and diversify) and conjoined (woven together). FCA was introduced by B. Ganter and R. Wille in 1982 (for a textbook see [7]) as a descendant of lattice theory for ordering and organising concepts. Since then, it has been applied in many different domains such as artificial intelligence, linguistics, software engineering, medicine, etc. R. Wille called a concept in [24] "the most simple basic form of human thinking". From a philosophical point of view, a concept is a unit of thought consisting of two parts: the extension (the set of objects belonging to this concept) and the intension (the set of common attributes valid for all these objects). The frame for defining a set of concepts is the so-called formal context, defined by a set of objects G, a set of attributes M and an incidence relation I. So, a formal context is given by a triple K := (G, M, I), defined as follows:

Definition 1: A formal context K := (G, M, I) consists of two sets G and M with I being a binary relation between G and M, I ⊆ G × M. The elements of G are called objects and the elements of M are called attributes of the context.

The pair (g, m) ∈ I is read as "the object g has the attribute m". Usually a formal context is given by an incidence matrix, where a mark "X" in the row of g and the column of m means that the object g has the attribute m. For a set A ⊆ G, we define the "derivative" of A, denoted by A′, as:

A′ = {m ∈ M | ∀g ∈ A, (g, m) ∈ I}

In other words, A′ is the set of all attributes shared by all the objects in A. Dually, for a set B ⊆ M, the "derivative" of B, denoted by B′, is defined as:

B′ = {g ∈ G | ∀m ∈ B, (g, m) ∈ I}

Again, B′ is the set of all objects which share all the attributes in B. These derivation operators form a Galois connection between the power sets of G and M.
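As an illustration, the two derivation operators can be sketched in a few lines of Python; the representation of the context as a set of (object, attribute) pairs is an assumption of this sketch, and the data used is the incidence matrix of Table 1 from Section 4.2.

```python
# A minimal sketch of the derivation operators A' and B', assuming the
# incidence relation I is represented as a set of (object, attribute) pairs.

def derive_objects(A, I, M):
    """A': the attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def derive_attributes(B, I, G):
    """B': the objects having every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

# Data: the incidence matrix of Table 1 (Example 1, Section 4.2).
G = {"S1", "S2", "S3", "S4", "S5", "S6"}
M = {"L1", "L2", "L3"}
I = {("S1", "L1"), ("S2", "L1"), ("S3", "L1"), ("S4", "L1"),
     ("S5", "L2"), ("S6", "L2"),
     ("S2", "L3"), ("S3", "L3"), ("S4", "L3"), ("S5", "L3")}

B = derive_objects({"S2", "S3"}, I, M)   # {'L1', 'L3'}
A = derive_attributes(B, I, G)           # {'S2', 'S3', 'S4'}
# The closed pair (A, B) is the formal concept C4 of Example 1
# (cf. Definition 2 below).
print(A, B)
```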

Now, the notion of formal concept is defined as:

Definition 2: A formal concept of the formal context K = (G, M, I) is a pair (A, B), with A ⊆ G and B ⊆ M, such that the following relations hold:

A′ = B and B′ = A

The set A is called the extent of the formal concept (A, B) and the set B is called its intent. Between the formal concepts, the subconcept-superconcept relation ≤ is defined as follows:

Definition 3: If (A1, B1) and (A2, B2) are concepts of a context K, then (A1, B1) is called a subconcept of (A2, B2) provided that A1 ⊆ A2 (or, equivalently, B2 ⊆ B1). In this case, (A2, B2) is a superconcept of (A1, B1) and we write (A1, B1) ≤ (A2, B2). The relation ≤ is called the hierarchical order, or simply the order, of the concepts.

For a given object g ∈ G, we define the object concept to be γg := (g″, g′) and the attribute concept to be µm := (m′, m″). The set of all formal concepts of a formal context, together with the subconcept-superconcept order relation ≤, forms a complete lattice called the concept lattice or conceptual hierarchy, and is denoted by B(K).

Theorem 1 (Basic Theorem of Formal Concept Analysis): The concept lattice of any formal context (G, M, I) is a complete lattice. For an arbitrary set {(Ai, Bi) | i ∈ I} ⊆ B(K) of formal concepts, the supremum is given by

⋁_{i∈I} (Ai, Bi) = ( (⋃_{i∈I} Ai)″ , ⋂_{i∈I} Bi )

and the infimum is given by

⋀_{i∈I} (Ai, Bi) = ( ⋂_{i∈I} Ai , (⋃_{i∈I} Bi)″ ).

Remark 1: By the Basic Theorem of FCA, we have that (A, B) = ⋀_{m∈B} µm, i.e., every concept is the infimum of all the attribute concepts of its intent.

A common method for visualizing conceptual hierarchies is the Hasse diagram, where every element is represented by a circle, the ordering relation being represented by a path connecting the circles. Since labeling every node by its extent and intent would involve a lot of redundant information, a reduced labeling has been introduced. The rule for reading the concept lattice is the following: an object g has an attribute m iff there exists a path from the top (the greatest element of the concept lattice, 1_B(K)) to g which contains the label m. Dually, an attribute m is shared by all the objects situated on the paths which begin at m and end at the bottom (the least element of the concept lattice, 0_B(K)).

3. Related work in linear segmentation

3.1. Earlier methods in linear segmentation

As a first method of linear (flat) segmentation we summarize here the reference chains method. Consider that

all the chains of antecedents-anaphors [13] of a text are CHR1, ..., CHRn. A chain CHRi contains the occurrences of entities identified as antecedents for a given anaphor, together with the occurrences of this anaphor. The principle for detecting a boundary between two segments is the following: the most frequent antecedent-anaphor pair (P) changes at a boundary and stays unchanged in the interior of a segment. So, if the most frequent antecedent-anaphor pair for the sentences S1, ..., Si, denoted by P(S1, ..., Si), is different from P(S1, ..., Si+1), then there is a boundary between the sentences Si and Si+1. Otherwise, Si and Si+1 are in the same segment.

Another method for linear segmentation is provided by Centering Theory [8]. Suppose that the forward-looking centers (the syntactically ranked entities introduced by a sentence) are calculated, and the highest-ranked center (the preferred center) is established for each sentence Si. The principle for detecting a boundary between two segments is the following: the preferred center changes at a boundary and stays unchanged in the interior of a segment. So, if the preferred center of Si, denoted by CP(Si), is different from CP(Si+1), then there is a boundary between the sentences Si and Si+1. Otherwise, Si and Si+1 are in the same segment.

The best known and most widely implemented method of topic segmentation is TextTiling [10], [19]. The article [10] describes a model of discourse structure based on the notion of topic shift, and an algorithm for subdividing expository texts into contiguous, non-overlapping subtopic segments (from which the name TextTiling derives). As the author explains, "instead of undertaking the difficult task of attempting to define 'what a topic is', we should concentrate on describing what we recognize as topic shift". TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well ([11]). The central idea of TextTiling is to compare adjacent blocks of text of fixed size. The more words the blocks have in common, the higher the lexical score at the gap (the boundary candidate) between them. If a low lexical score is preceded and followed by high lexical scores, this is assumed to indicate a shift in vocabulary corresponding to a subtopic change (a boundary) [11]. The lexical score of a gap is calculated as the cosine of the vectors associated with the adjacent text blocks, where a vector contains the number of times each lexical item occurs in the corresponding text.

Another method, proposed in [21], is called logical segmentation because the score of a sentence is the number of sentences of the text which are entailed by it. The scores form a structure which indicates how the most important sentences alternate with less important ones, and organizes the text according to its logical content. Due to some similarities with the TextTiling algorithm, this method is called Logical TextTiling (LTT). Similarly to topic segmentation ([10]), in logical segmentation the focus is on describing what constitutes a shift in the relevant information of the discourse. Simply put, a valley (a local minimum) in the obtained graph (the logical structure) is a boundary between two (logical) segments. This is in

accordance with the definition of a boundary as a perceptible discontinuity in the text structure ([3]), in this case a perceptible discontinuity in the connectedness between sentences. Moreover, this perspective on segments closely corresponds to the intentional structure of a text [9]. This idea also lies at the foundation of the method proposed in this paper (Section 4).
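As an illustration of the block-comparison score at the heart of TextTiling described above, here is a minimal Python sketch; the block size, tokenization and gap indexing are simplifying assumptions of this sketch, not details of the original algorithm.

```python
# Lexical score of a gap: cosine similarity between the term-frequency
# vectors of the two blocks of sentences adjacent to the gap.
import math
from collections import Counter

def lexical_score(block_a, block_b):
    """Cosine of the term-frequency vectors of two token lists."""
    va, vb = Counter(block_a), Counter(block_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) \
         * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def gap_scores(sentences, k=2):
    """Score each gap between sentences i and i+1 using k-sentence blocks;
    local minima in the returned list are boundary candidates."""
    scores = []
    for i in range(1, len(sentences)):
        left = [w for s in sentences[max(0, i - k):i] for w in s.lower().split()]
        right = [w for s in sentences[i:i + k] for w in s.lower().split()]
        scores.append(lexical_score(left, right))
    return scores
```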

3.2. Linear segmentation by Lexical Chains

LCs are sequences of words which are in a lexical cohesion relation with each other, and they tend to indicate portions of a text that form semantic units [15]; they can further serve as a basis for a segmentation. The first paper which used LCs to indicate the structure of a text was [14]; it relies on the hierarchical structure of Roget's thesaurus to find semantic relations between words. Usually LCs are constructed in a bottom-up manner ([1]) by taking each candidate word of a text and finding an appropriate semantic relation offered by a thesaurus (such as Roget's or WordNet). The paper [22] proposes a top-down method of linear text segmentation based on the lexical cohesion of a text. Namely, first a single chain of disambiguated words of the text is established, then the rips of this chain are considered. These rips are boundaries of the segments in the cohesion structure of the text. Thus, a segment is considered as a piece of text where the disambiguation (the meaning) of the contained words is "chained". Due to some similarities with the TextTiling algorithm, the method is called Cohesion TextTiling (CTT). The chain of disambiguated words of a text is obtained by a Lesk-type algorithm based on WordNet.

Let us remark further that segmentation and (extractive) summarization are two interdependent goals. Moreover, a good segmentation of a text can improve the summarization ([3], [1], [19]). On the other hand, the method of extracting sentences from the segments is decisive for the quality of the summary. Some widely applied strategies (rules) are ([21]):

1. The first sentence of each segment is selected.
2. For each segment, the sentence with a maximal score is considered the most important for this segment, and hence it is selected (for example, for the LTT method, the sentence with a maximal score for a segment is a sentence which entails a maximal number of other sentences, and is thus the most important).
3. From each segment, the most informative sentence (the least similar) relative to the previously selected sentences is picked.

The paper [22] compares segmentations by LTT and CTT by comparing the summaries obtained by applying the above strategies. The conclusion drawn is that the quality of the CTT summaries is better than the quality of the LTT summaries from the point of view of informativeness (the similarity, calculated as the cosine between the summarized text and the summaries obtained with one of the above strategies). The last method summarized here is the Lexical Chains Distribution (LCD) method ([23]), which relies on the number

(weight) of chains which end in a sentence, which begin in the following sentence, and which traverse these two successive sentences. The method improves a scoring method of [15], [20]. Namely, let us denote by input(Si) the number of LCs which end in Si, by output(Si) the number of LCs which begin in Si, and by during(Si, Si+1) the number of LCs which traverse Si (and do not begin there) and Si+1 (and do not end there). The score of Si as the last sentence of a segment is:

score(Si) = (input(Si) + output(Si+1)) / during(Si, Si+1).

The bigger score(Si) is, the bigger the chance of having a boundary between Si and Si+1. In this way, the points of local maxima in the graph of score(Si) indicate the boundaries of the text S1, ..., Sn. The performance of the LCD method is experimentally shown to be better than that of CTT in [23]. Let us remark that a recent paper [12] introduces a method very similar to our LCD method. Namely, a score(Si) = input(Si) + output(Si) is calculated for each sentence Si. However, as we showed in [23], the influence of the denominator during(Si, Si+1) in the process of determining the segments is important: close to 4%. On the other hand, the role of Si as the first or the last sentence of a segment, which is crucial for summarization with the first-sentence rule, is not clear there.
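A minimal sketch of the LCD boundary score follows, assuming each lexical chain is reduced to the indices of its first and last sentence; the handling of an empty denominator is an assumption of this sketch.

```python
# LCD score for each candidate boundary between S_i and S_{i+1}:
# score(S_i) = (input(S_i) + output(S_{i+1})) / during(S_i, S_{i+1}).
def lcd_scores(chains, n):
    """chains: list of (begin, end) sentence indices (1-based); n: number
    of sentences. Returns {i: score(S_i)}; local maxima mark boundaries."""
    scores = {}
    for i in range(1, n):
        inp = sum(1 for (b, e) in chains if e == i)       # chains ending in S_i
        out = sum(1 for (b, e) in chains if b == i + 1)   # chains starting in S_{i+1}
        dur = sum(1 for (b, e) in chains if b < i and e > i + 1)  # spanning chains
        # assumption: with no spanning chain, the raw numerator is used
        scores[i] = (inp + out) / dur if dur else float(inp + out)
    return scores
```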

4. Segmenting texts towards a conceptual representation

4.1. Lexical Chains as formal attributes

In this paper we apply FCA theory to obtain the segments of a text. Let us mention the intentional structure principle [9], which is based on the idea that every discourse has an overall purpose, and that every discourse segment has a purpose specifying how it contributes to the overall purpose. It is reasonable to consider that the purpose of a segment is maximally coherent (all the sentences contribute to this purpose) and that the purposes of two different segments are minimally similar. The corresponding FCA idea is to determine a set of concepts (where the objects are the sentences) such that:

• the set of concept extents covers the text;
• the concept extents are pairwise disjoint;
• the set of concepts presents a maximal internal cohesion (semantic and lexical);
• the set of concepts displays a minimal "in-between" cohesion (cohesion between a concept and the other concepts).

These four constraints on the set of concepts led us to identify the sentences of a text as objects and the Lexical Chains as attributes, with the formal context K = (G, M, I) as below:

• G = {S1, S2, ..., Sn};
• M = {L1, L2, ..., Lm}; for each Lexical Chain Li, the first sentence (where the chain begins) and the last sentence (where the chain ends) are given;

• (Si, Lj) ∈ I if the Lexical Chain Lj begins in a sentence Sb with b ≤ i and ends in a sentence Se with i ≤ e.

The formal concepts of the context K are pairs formed by a set of sequential sentences and a set of Lexical Chains shared by these sentences. The needed set of formal concepts (which determines a segmentation, as above) consists of a specially constrained set of concepts {c1, ..., cl} such that:

• sup{c1, ..., cl} = 1_B(K);
• inf{c1, ..., cl} = 0_B(K) (as the lattice B(K) is complete, the sup and inf exist for each set of concepts);
• if the set of concepts with the above property is {c1, ..., cl} = {(A1, B1), ..., (Al, Bl)}, then A1 ∪ ... ∪ Al = {S1, S2, ..., Sn} and A1 ∩ ... ∩ Al = ∅.

In other words, the set of concepts {c1, ..., cl} represents a kind of "complementing" concepts of the concept lattice B(K). Theorem 2 (below) assures that the extents of the concepts contained in this set are formed by adjoining sentences and form, indeed, a segmentation:

Theorem 2: If K := (G, M, I) is a finite formal context, G := {gi | i = 1, ..., n}, M := {mj | j = 1, ..., m}, having the property that for each attribute mj the incidences are of the form (gi, mj) ∈ I, (gi+1, mj) ∈ I, ..., (gi+k, mj) ∈ I (i.e., the objects are adjoining), then all formal concepts (A, B) have their extents formed by adjoining objects, too.

Proof: Let (A, B) be a formal concept of K. Its intent is B = A′ := {m ∈ M | ∀g ∈ A, (g, m) ∈ I}. By the Basic Theorem of FCA, we have that (A, B) = ⋀_{m∈B} µm, i.e., it is the infimum of all the attribute concepts of its intent. Hence, by the same Basic Theorem, A = ⋂_{m∈B} m′, since every attribute concept is of the form µm = (m′, m″). By the assumptions made, for every attribute mj ∈ M its attribute extent is formed by adjoining objects, m′j = {gi, gi+1, ..., gi+k}. Since every intersection of sets consisting of adjoining objects has the same property, it follows that the extent A is formed by adjoining objects too.

The introduced method can be summarized as Algorithm A:

Input: A text T formed by the sentences {S1, S2, ..., Sn} and a set of Lexical Chains {L1, L2, ..., Lm}.
Output: One or more linear segmentations of the text T. (For a given rule of summarization, from each segmentation a summary is obtained.)
• Build the formal context K = (G, M, I) from the Lexical Chains as above;
• Compute the concept lattice B(K);
• Determine the "complementing" sets of concepts {c1, ..., cl} satisfying the three constraints above;
• The extents of each such set give a linear segmentation of T.
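The following Python sketch puts Algorithm A together under simplifying assumptions: each lexical chain is reduced to its (begin, end) sentence indices, concepts are enumerated naively through closures of attribute subsets (exponential, but adequate for contexts of this size), and the "complementing" sets are found by brute-force search for partitions with at least two blocks. On the data of Example 1 below it returns the single segmentation [S1, S4][S5, S6].

```python
# A rough sketch of Algorithm A (an illustration, not the authors' code).
from itertools import combinations

def build_context(chains, n):
    """(S_i, L_j) in I  iff  chain j begins at b <= i and ends at e >= i."""
    return {(i, j) for j, (b, e) in enumerate(chains)
                   for i in range(b, e + 1)}

def extent(attrs, I, n):
    return frozenset(i for i in range(1, n + 1)
                     if all((i, j) in I for j in attrs))

def intent(objs, I, m):
    return frozenset(j for j in range(m)
                     if all((i, j) in I for i in objs))

def formal_concepts(I, n, m):
    """Naive enumeration: the closed extent of every attribute subset."""
    return {(A, intent(A, I, m))
            for r in range(m + 1)
            for attrs in combinations(range(m), r)
            for A in [extent(set(attrs), I, n)]}

def segmentations(chains, n):
    """Sets of concepts whose (nonempty) extents partition {S_1,...,S_n};
    at least two blocks, so that their infimum is the bottom concept.
    Theorem 2 guarantees each extent is a block of adjacent sentences."""
    I = build_context(chains, n)
    cs = [c for c in formal_concepts(I, n, len(chains)) if c[0]]
    found = []
    for r in range(2, len(cs) + 1):
        for combo in combinations(cs, r):
            exts = [c[0] for c in combo]
            if (sum(len(e) for e in exts) == n
                    and len(frozenset().union(*exts)) == n):
                found.append(sorted(tuple(sorted(e)) for e in exts))
    return found

# Example 1: L1 spans S1-S4, L2 spans S5-S6, L3 spans S2-S5.
print(segmentations([(1, 4), (5, 6), (2, 5)], 6))  # [[(1, 2, 3, 4), (5, 6)]]
```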

4.2. Examples

Example 1: Let us consider a first example, where the incidence matrix is given as in Table 1. The concept lattice is given in Figure 1. In this concept lattice the formal concepts are:

C1 = ({S1, S2, S3, S4}, {L1})
C2 = ({S5, S6}, {L2})
C3 = ({S2, S3, S4, S5}, {L3})
C4 = ({S2, S3, S4}, {L1, L3})
C5 = ({S5}, {L2, L3})
C6 = ({S1, S2, S3, S4, S5, S6}, ∅)
C7 = (∅, {L1, L2, L3})

     L1   L2   L3
S1   X
S2   X         X
S3   X         X
S4   X         X
S5        X    X
S6        X

TABLE 1. The incidence matrix for Example 1.

Fig. 1. Concept lattice for Example 1.


Let us remark that the concepts fulfill the above property. The single set of concepts such that inf = C7, sup = C6 and A_c1 ∪ ... ∪ A_cl = {S1, S2, ..., S6} is the set {C1, C2}. The determined segmentation is: [S1, S4][S5, S6]. The summary obtained with the first-sentence rule is {S1, S5}.

Example 2: For the second example we display in Figure 2 a short nine-sentence text from DUC2002 (AP880314). The Lexical Chains are presented in Figure 3.

Text:
S1. The communist world gets its first McDonald's next week, and some people here are wondering whether its American hamburgers will be as popular as the local fast-food treat, Pljeskavica.
S2. The long-awaited opening of the restaurant on one of Belgrade's main downtown squares will take place March 24, the Yugoslav news agency Tanjug reported, and it will offer Big Macs, fries and the other specialities familiar to McDonald's customers in the West.
S3. The Belgrade media have suggested that the success of the American restaurant depends on its acceptance by Yugoslavians who are long accustomed to the hamburger-like Pljeskavica.
S4. Pljeskavica is made of ground pork and onions, and it is served on bread and eaten with the hands.
S5. It is sold at fast-food restaurants across the country and costs about a dollar.
S6. "In fact, this is a clash between the Big Mac and Pljeskavica," said an official of Genex, Yugoslavia's largest state-run enterprise that will operate the McDonald's.
S7. John Onoda, a spokesman at McDonald's Oak Brook, Ill., headquarters, said it was the first of the chain's outlets in a communist country.
S8. The next East European McDonald's is scheduled to be opened in Budapest, Hungary, by the end of this year, said Vesna Milosevic, another Genex official.
S9. Negotiations have been going on for years for expanding the fast-food chain to the Soviet Union, but no agreement has been announced.

Fig. 2. Text DUC2002 (AP880314), Example 2.

Manual Lexical Chains for Example 2:

LC1: world (S1), country (S5) (part/whole), country (S7) (repetition)
LC2: McDonald's (S1), McDonald's (S2), McDonald's (S6), McDonald's (S7), McDonald's (S8) - repetition in all cases
LC3: people (S1), customers (S2) (specification)
LC4: hamburgers (S1) (part/whole), fast-food (S1) (part/whole), fries (S2) (part/whole), specialties (S2) (generalization), hamburger (S3) (repetition), pork (S4) (part/whole), onions (S4) (part/whole), bread (S4) (part/whole), fast-food (S5) (repetition), fast-food (S9) (repetition)
LC5: Pljeskavica (S1), Pljeskavica (S3), Pljeskavica (S4), Pljeskavica (S6) - repetition in all cases
LC6: week (S1) (part/whole), March (S2) (part/whole), year (S8), years (S9) (repetition)
....................................................................................
LC10: news (S2), agency (S2) (specification), media (S3) (generalization)
LC11: success (S3), acceptance (S3) (part/whole), clash (S6) (opposition)
LC12: headquarters (S7) (part/whole), chain (S7) (specification), outlets (S7) (part/whole), chain (S9) (repetition)
LC13: official (S6), spokesman (S7) (specification), official (S8) (repetition)
LC14: Genex (S6), Genex (S8) (repetition)
LC15: American (S1), American (S3) (repetition)


Fig. 3. Manual Lexical Chains extracted from the text in Figure 2 (Example 2).

The incidence matrix is established from the Lexical Chains (Figure 3); for example, L1 is an attribute for the objects S1, ..., S7, L2 is an attribute for the objects S1, ..., S8, etc. We used the FCA tool Toscana to obtain the concept lattice (see Figure 4), and segmented the text of Example 2 (Figure 2) with Algorithm A. Toscana is a software tool aimed at handling the tasks involved in the study of lattice theory, mainly formal concepts. There are several FCA tools available; a list may be found on the Formal Concept Analysis homepage [25].

Fig. 4. Concept lattice for Example 2.

The extents of the formal concepts which satisfy the conditions in Theorem 2 are: {S1, S2}, {S3, S4, S5, S6}, {S7, S8, S9}. The corresponding segmentation is [1,2][3,6][7,9]. The binary vector which signals the beginning and the end of each segment is 111001101. Let us compare this result (called FCA) with that obtained with the Lexical Chains Distribution (LCD) method ([23]), namely [1,6][7,9] (the corresponding binary vector is 100001101), and with the manual segmentation (Man), [1,3][4,6][7,9] (the corresponding binary vector is 101101101). The method of evaluation is WindowDiff ([16]):

WindowDiff(Hyp, Ref) = ( Σ_{i=1}^{N−k} |r(i, k) − h(i, k)| ) / (N − k)

Here r(i, k) represents the number of boundaries of the Ref(erence) segmentation contained between sentences i and i + k, and h(i, k) represents the number of boundaries of the Hyp(othesis) segmentation contained between sentences i and i + k (N is the number of sentences). When k = 0, WindowDiff(Hyp, Ref) = (number of differing elements in the binary vectors of the segmentations Hyp and Ref)/N. The values obtained for the text of Example 2 (Figure 2) with k = 0 are: WindowDiff(FCA, Man) = 2/9; WindowDiff(LCD, Man) = 2/9. Thus, for this small text, the performances of our FCA algorithm and of the LCD algorithm (based on Lexical Chains Distribution) are equal.
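A minimal sketch of this computation follows, assuming each segmentation is given by its binary begin/end vector as above; the windowing convention for k > 0 is an indexing assumption of this sketch.

```python
# WindowDiff(Hyp, Ref) over binary boundary vectors (one entry per sentence).
def window_diff(hyp, ref, k):
    """Average over all positions of |r(i,k) - h(i,k)|, where r and h count
    the boundary marks of Ref and Hyp inside a window of k+1 entries."""
    n = len(ref)
    total = sum(abs(sum(ref[i:i + k + 1]) - sum(hyp[i:i + k + 1]))
                for i in range(n - k))
    return total / (n - k)

# Example 2, k = 0: the FCA and manual segmentations differ in 2 of 9 entries.
fca = [1, 1, 1, 0, 0, 1, 1, 0, 1]
man = [1, 0, 1, 1, 0, 1, 1, 0, 1]
print(window_diff(fca, man, 0))  # 0.2222... (= 2/9)
```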

4.3. Concept-oriented segmentation by clustering

The process of segmentation is usually seen as an objective method, which provides one clearly defined result (for a given number of segments). However, different users have quite different needs with regard to a segmentation (a summarization), because they view the same text from completely different, subjective perspectives. Segmenting (and summarizing) a text must therefore be associated with an explanation of why a given set of segments (a summary) is produced. We view the process of segmentation as connected to the clustering process of the sentences of a text. When the cluster Cl = {Si1, ..., Sim} is one of the obtained clusters, where i1 ≤ i2 ≤ ... ≤ im, then the linear segmentation breaks before each sentence of the cluster: [S1, Si1−1], [Si1, Si2−1], ..., [Sim, Sn]. The concept terms which are "specific" to this cluster Cl explain the reason for the segmentation (the summarization). Let us remark that clustering texts usually means selecting the most important (by frequency) words as clustering features ([5]). In our method we choose as words the transitive verbs and complement nouns which form the concepts in the FCA approach ([4]). In what follows we refer to these words as concept terms, namely concept attribute terms, M, and concept object terms, G. A sentence is represented as a vector of concept terms: an entry of the vector specifies the frequency with which a concept term occurs in the text, including the frequency of its subconcept terms.

Algorithm B
Input: A text T = {S1, ..., Sn}, the set of concept terms G ∪ M (M is formed by transitive verbs and G by their complement nouns), the concept lattice B(K) and the taxonomy H obtained from B(K) (hence from T ([4])).
Output: Different segmentations of the text T, according to different sets of concepts.

• Calculate the frequency of the concept term t ∈ G ∪ M in the sentence Si, denoted by f(i, t);
• Calculate the total frequency of the concept term t in the sentence Si as TotalS(i, t) = f(i, t) + Σ f(i, t′), where the sum is over all t′ which are direct descendants of t in the taxonomy H;
• Calculate the total frequency of t over all sentences as Total(t) = Σ_{i=1}^{n} TotalS(i, t);
• Choose the m best supported concept terms (those which maximize Total(t)), t1, ..., tm;
• Represent each sentence Si by an m-dimensional concept term vector: V(i) = (TotalS(i, t1), ..., TotalS(i, tm));
• Cluster the set of sentences T = {S1, ..., Sn}, where sim(Si, Sj) = cosine(V(i), V(j));
• Each cluster corresponds to a segmentation as above. The concept terms specific to a cluster explain the "view" of the segmentation and help the user to understand the differences between the clustering (segmentation) results.
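A sketch of the representation step of Algorithm B follows; the taxonomy as a dict of direct descendants and the plain token counting are simplifying assumptions of this sketch, and any clustering method over the cosine similarity can be plugged in afterwards.

```python
# Sentence vectors over the m best supported concept terms (Algorithm B).
import math

def total_s(tokens, t, taxonomy):
    """TotalS(i, t) = f(i, t) + sum of f(i, t') over direct descendants t'."""
    return tokens.count(t) + sum(tokens.count(c) for c in taxonomy.get(t, []))

def sentence_vectors(sentences, terms, taxonomy, m):
    """Represent each sentence by (TotalS(i, t1), ..., TotalS(i, tm)) for the
    m terms with the largest Total(t) over the whole text."""
    tokens = [s.lower().split() for s in sentences]
    best = sorted(terms,
                  key=lambda t: -sum(total_s(tk, t, taxonomy) for tk in tokens))[:m]
    return [[total_s(tk, t, taxonomy) for t in best] for tk in tokens], best

def cosine(u, v):
    """sim(S_i, S_j) used for clustering the sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```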

5. Conclusions and further work

In this paper we applied FCA theory to obtain the segments of a text. The idea is to determine a set of concepts such that: the concept extents cover the text; the concept extents are disjoint; the set of concepts presents a maximal internal cohesion; and the set of concepts displays a minimal "in-between" cohesion (cohesion between a concept and the other concepts). All these constraints on the set of concepts led us to consider the set of "complementing" concepts of the concept lattice (see Section 4, Algorithm A). These concepts are determined with the FCA tool Toscana [25]. The second algorithm introduced in this paper (Algorithm B) approaches the process of segmentation as a clustering process of the sentences of a text, using a taxonomy learned from the text. Each cluster provides a segmentation, explained by the concept terms specific to this cluster.

References

[1] R. Barzilay, M. Elhadad: "Using Lexical Chains for Text Summarization", in I. Mani and M. Maybury (eds.), "Advances in Automatic Text Summarization", MIT Press, 1999.
[2] G. Birkhoff: "Lattice Theory", 3rd edition, American Mathematical Society, 1967.
[3] B. Boguraev, M. Neff: "Lexical Cohesion, Discourse Segmentation and Document Summarization", Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000.
[4] P. Cimiano, A. Hotho, S. Staab: "Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis", JAIR, 24:305-339, 2005.
[5] P. Cimiano, A. Hotho, S. Staab: "Clustering Ontologies from Text", LREC, pp. 1721-1724, 2004.
[6] C. Fellbaum (ed.): "WordNet: An Electronic Lexical Database", MIT Press, 1998.
[7] B. Ganter, R. Wille: "Formal Concept Analysis. Mathematical Foundations", Springer, 1999.
[8] B. Grosz, A. Joshi, S. Weinstein: "Centering: A Framework for Modelling the Local Coherence of Discourse", Computational Linguistics, 21(2), 1995, pp. 203-225.
[9] B. Grosz, C. Sidner: "Attention, Intentions and the Structure of Discourse", Computational Linguistics, 12(3), 1986, pp. 175-204.
[10] M. Hearst: "TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages", Computational Linguistics, 23(1), 1997, pp. 33-76.
[11] C. Manning, H. Schutze: "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
[12] M. Marathe, G. Hirst: "Lexical Chains Using Distributional Measures of Concept Distance", Proceedings of the 11th International Conference CICLing 2010, Romania, March 2010.
[13] R. Mitkov: "Anaphora Resolution", Pearson Education, Longman, 2002.
[14] J. Morris, G. Hirst: "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text", Computational Linguistics, 17(1), 1991, pp. 21-48.
[15] M. Okumura, T. Honda: "WSD and Text Segmentation Based on Lexical Cohesion", Proceedings of COLING-94, pp. 755-761.
[16] L. Pevzner, M. Hearst: "A Critique and Improvement of an Evaluation Metric for Text Segmentation", Computational Linguistics, 28(1), 2002, pp. 19-36.
[17] U. Priss: "Linguistic Applications of Formal Concept Analysis", in B. Ganter, G. Stumme, R. Wille (eds.), "Formal Concept Analysis, Foundations and Applications", LNAI 3626, 2005, pp. 149-160.
[18] U. Priss: "Linguistic Data Exploration", in P. Hitzler, H. Scharfe (eds.), "Conceptual Structures in Practice", CRC Press, 2009.
[19] J. Reynar: "Topic Segmentation: Algorithms and Applications", PhD Thesis, University of Pennsylvania, 1998.
[20] N. Stokes: "Applications of Lexical Cohesion Analysis in the Topic Detection and Tracking Domain", PhD Thesis, Faculty of Science, National University of Ireland, Dublin, 2004.
[21] D. Tatar, A. Mihis, D. Lupsa: "Text Entailment for Logical Segmentation and Summarization", 13th International Conference on Applications of Natural Language to Information Systems, London, UK, June 2008, LNCS 5039 (E. Kapetanios, V. Sugumaran, M. Spiliopoulou, eds.), pp. 233-244.
[22] D. Tatar, A. Mihis, G. Serban: "Lexical Chains Segmentation in Summarization", Proceedings of SYNASC 2008, IEEE Computer Society, pp. 95-101.
[23] D. Tatar, E. Tamaianu-Morita, G. Serban-Czibula: "Segmenting Text by Lexical Chains Distribution", Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques (KEPT 2009), Cluj-Napoca, Romania, July 2009, pp. 41-49.
[24] R. Wille: "Communicative Rationality, Logic, and Mathematics", LNAI 4933, ICFCA 2008.
[25] http://www.upriss.org.uk/fca/fca.html
