A global process to access documents’ contents from a geographical point of view


Mauro Gaio b,∗, Christian Sallaberry b, Patrick Etcheverry c, Christophe Marquesuzaa c, Julien Lesbegueries b

LIUPPA – EA 3000
b Université de Pau et des Pays de l’Adour, 64013 Pau Université Cedex, France
c IUT de Bayonne, Château-Neuf, 64100 Bayonne, France

Abstract

Local cultural heritage document repositories are characterized by contents strongly attached to a territory (i.e. geographical references). The user must be able to consider such repositories according to a focus that takes into account his/her geographical interests and gives access to the relevant documents’ contents from a geographical point of view. This paper presents the Virtual Itineraries in the Pyrenees (PIV) project. Spatial and temporal core models are proposed to give a formal representation of geographical information. The models take into account the characteristics of heterogeneous human modes of expression: written language and captures of drawings, maps, pictures, etc. Semantic processes have been built to automatically manage the spatial and temporal information from non-structured data. A “back office” prototype is proposed, which adds these processes to classic information extraction (IE) approaches and associates them with a geographical information retrieval (GIR) service. This service searches for links between formal representations of geographic information in document collections and similar representations in a user’s information query. Finally, the paper presents the design work, detailing the principles of result visualization and navigation, and proposes a first “front office” implementation of the system.

Key words: core geographic feature model, non-structured data, semantic processing, content-based information access, cartographic information visualization, heterogeneous expression modes

∗ Corresponding author Email address: [email protected] (Mauro Gaio).

Preprint submitted to Elsevier

10 April 2007

1 Introduction

The large-scale trend towards digitizing document collections is a challenge for regional media libraries, which have to manage a growing number of repositories of cultural and heritage documents (books, newspapers, postcards, lithographs). Moreover, local media library staff would like non-expert users (scholars, teachers, learners or tourists) to have easy access to these repositories. A large amount of the information in these repositories is characterized by contents that strongly refer to specific geographical locations and their surrounding area [7]. But these geographic references are embedded within the content of the documents, and appear in different forms depending on the way they are expressed (expression modes such as texts, maps, tables).

Generally, such semi-structured and non-structured data are supported by Electronic Document Management Systems (EDMS) and Library Management Systems (LMS). All these systems aim at providing fast and effective content-based access to a large amount of information. However, unlike Geographic Information Systems (GIS) or some Relational DataBase Management Systems (RDBMS), EDMS and LMS software does not offer high-level spatial operators and relies on classical information extraction (IE) and information retrieval (IR) approaches, whose indexing methods and query languages are generally statistical. Accordingly, these systems are insufficient to manage information requiring semantics that depend on spatial and temporal concepts linked to a given phenomenon. Therefore the results of queries about geographical information are generally disappointing.

When dealing with textual documents, a semantic approach is a good way to manage such specific information more accurately. Processing text at the semantic level means going beyond the particular words and morpho-syntactic constructions. It means identifying the particular concepts they refer to and also their relationships.
The formal representation provided by this identification allows a system to be more precise, for example by distinguishing between the use of the named entity “Pau” as a direct reference to a location and its use as a well-known location serving as a reference for another location. Semantic representations also lend themselves to inference, in particular to refine or extend the interpretation of a text when used with background knowledge. For example, background knowledge about spatial relationships could be used to match a text about a specific location to a query about a whole region, even if the text doesn’t explicitly mention the region. Most operational systems mainly address the morpho-syntactic level and propose formal grammars to describe the grammatical structures of a text. Some other systems like LinguaStream [5] or GATE [14] propose an integrated environment for creating complex Natural Language Processing (NLP) applications. For geographic information, only very few systems, such as SPIRIT [22], support the whole process. The purpose of the “Virtual Itineraries in the Pyrenees (PIV)” project is to build a web-based platform with an in-depth geo-semantic identification to overcome the limits of EDMS or LMS software.

1.1 Contribution

The PIV system aims at managing the repository of electronic documents of a regional media library¹ by indexing the geographic information, so that the results can be visualized and accessed by navigation. In other words, a whole operational “back office” process makes the content and/or query retrieval process more efficient each time the information includes geographical location references. In addition, a first “front office” implementation allows a user to express a documentary requirement, to visualize the results and to navigate through them according to geographical criteria. So, the PIV system enhances the classical services of existing LMS and/or EDMS with new services which mark and retrieve the spatial and temporal aspects of the information. It uses a specific architecture based on web services (spatial and temporal semantic core models) to represent Geographic Features (GFs) and XML indexes to better manage geo-semantic marks (see Figure 1). Dedicated GIS services are used for two reasons:

(1) as an ontology data management system (i.e. storing and reasoning on the spatial data needed as background spatial knowledge);
(2) for the geo-referencing indexing process (i.e. computing complex spatial intersections).

The PIV approach relies on spatial and temporal core models, which distinguishes it from other systems like GIPSY [37] or SPIRIT [22]. Thanks to the recursive strategy of GF representation, all geographical information expression modes (i.e. texts, maps, images) can be formalized. Another difference with SPIRIT is its back-office spatial reasoning. The SPIRIT system only tags direct references to a location in documents, although it does handle, within textual queries, locations that are referenced via well-known named entities. The PIV system implements the rules involving spatial relations and geographic literary expression modes, as well as any accurate local specific geographic resource (any fountain, wood or mountain might be validated). Therefore, the PIV system tags direct and indirect locations, with the recursive aspects of the information, in both corpora and queries. Some previous papers describe different aspects of the semantic approach used in the PIV system to analyze and describe geographic information contained in documents [17, 27, 19]. Finally, another specificity concerns the main focus of the PIV project: its approach is to assist the user from the formulation of his/her request through to the visualization of the results.

Fig. 1. PIV system: synoptic view of the “back office” process.

¹ This project is led in partnership with the Pau City Council and the MIDR media library. The repository consists of books, newspapers, postcards and lithographs of the XIXth and XXth centuries.

1.2 Organization

The main contribution of this paper is to present the global process of a new system. We first focus on the semantic processing of spatial and temporal information in our repository of documents. Then, we discuss how the PIV system improves information retrieval accuracy each time a query contains spatial and/or temporal criteria. Finally, the design work is presented, which supports specific user interactions driven by his/her geographical interests.

2 Semantic processing

Geographical information in a document repository like the PIV one is distributed across various expression modes such as text, maps and tables. Each mode has specificities regarding the kind of geographical information it has to express. A text is more effective in explaining facts in relation to a geographic place (like a named entity) than it is in describing the complex spatial organization of a phenomenon. In this case, a map is much more efficient. However, the notion of time and evolution, which is difficult to render on an image or a static map, is naturally better conveyed by text or graphics (such as curves, which are better suited for showing the evolution of a phenomenon). In such corpora, a GF is composed of a Spatial Feature (SF), a Temporal Feature (TF) and a phenomenon. The sentence “churches of the XVth Century, 8 miles South of Pau” is a good example of a complete GF (Figure 2). Let us assume that, to build a geographical retrieval process for such corpora, we need to:

• always have an explicit SF;
• consider that the TF may be implicit, may not be locally expressed, or may range over more than one SF;
• approach the phenomenon (if it is neither spatial nor temporal) with the keywords of the LMS, helped by the statistical algorithms of classical IR methods.

Consequently, to process geographical information, an in-depth analysis of spatial information is mandatory. We rely here on a semantic processing approach that has been developed for several years and has given significant results ([26, 9, 36]).

2.1 The “target/site” concept

As reported in linguistic and psycho-linguistic works, humans have a specific way of representing spatial information in written language. According to [6], we can link a place to a category and associate it with a natural or artificial boundary. Four categories relevant to a local cultural heritage document collection can be specified: named boundaries (countries, counties, parishes...), hydrographic features (rivers, estuaries, lakes...), man-made features (cities, towns, villages...), and physiographic features (mountains, plains, coast lines...). Referring to such places involves several elements. C. Vandeloise [34] studied this assumption in written language and proposed the target/site concept.
In written language, the target and the site corresponding to a given spatial reference each have a designated position in a sentence. When the target corresponds to the subject, the site corresponds to the object. More generally, in Vandeloise’s hypothesis the target corresponds to the subject of our description, and the site corresponds to its spatial and temporal references. Our assumption is to extend this hypothesis to any other expression mode. In this framework, we focus on a discursive or graphical structure combining several geographic named entities, called, from now on, a Geographic Feature (GF). In other words, we focus here on explicit, possibly recursive, absolute or relative spatial and/or temporal references. These are limited but unavoidable types of information carried by documents’ contents. We claim that the automatic analysis of such structures in a repository of cultural and heritage document collections provides an interesting content retrieval approach for querying and interacting with such documents.

Fig. 2. The composition of a GF.

Dans la première moitié du XIXe siècle, lors d’importantes chutes de neige dans les montagnes du sud-ouest de la France, à proximité de quelques villages Basques en Pyrénées Atlantiques...
(1) dans les montagnes du sud-ouest de la France
(2) à proximité de quelques villages Basques en Pyrénées Atlantiques
(3) Dans la première moitié du XIXe siècle

Fig. 3. Example of text’s expression mode.

2.2 Core Models

In Figure 3, (1) expresses an exhaustive determination: it selects all spatial entities of the given mountain type (“montagnes”) located in a certain zone, which matches the south-west part of a well-known named geographic entity (“France”). In (2) the determination (introduced by “quelques” (some)) is relative, i.e. only a part of the elements given by the type has to be considered. Here, the type specifies that we only keep “basques” villages from a given administrative boundary (“Pyrénées-Atlantiques”). Finally, the relationship expressed with “à proximité” (close by) evokes the “real” spatial reference. In (3) only a part of the XIXth element of the temporal type ’Century’ (“siècle”) has to be considered. The part of the element is given by ’first part’ (“première moitié”).

The core models have been designed to be compliant with the geographical information contained in our repository. This information may be expressed inside a more or less complex discursive form. It may be polysemic and sometimes context-dependent. Thereby, the proposed semantic process for analyzing GFs depends on an adaptive core model for describing them. The model is based on a quite naive formal representation of spatial features in comparison with those present in the world of GIS ([11], [16], [21] or GML²), but it lets us go on to the next stage of the system (the information retrieval process) and remains compliant with the temporal component of the information. In such a model, according to the linguistic hypothesis, the SF and the TF components of a GF may be recursively defined from one or several other SFs and/or TFs, and their relationships are part of the definition (Figure 4). Vandeloise’s idea [34] can easily be defined in a recursive way.

Fig. 4. Spatial & Temporal Core Models’ Simplified Schema. [Diagram: an SF and a TF each have a Representation (A(s), A(t)), specialize into absolute and relative features (B(s): A_SF, R_SF; B(t): A_TF, R_TF), and carry a Relationship (C(s), C(t)) to other features.]

Modelling Spatial Feature

An SF (Figure 4) has (A(s)) at least one geometric representation. (B(s)) An SF is an Absolute Spatial Feature (A_SF) if it consists of a single named entity allowing a geo-localization, or a Relative Spatial Feature (R_SF) if it is defined using a spatial relationship with at least one other SF. (C(s)) Spatial relationships can be topological (adjacency, inclusion...) or Euclidean (distance, geometric, orientation...) [12], [13]. For instance, an adjacency relationship appears when we evoke an SF’s spatial proximity to another SF. This relationship is evoked in written language with terms like ’near’, ’close by’, ’step by step’..., as in “near the Laruns village”, where the whole expression is an R_SF whereas “Laruns village” is an A_SF. Similarly, a geometric relationship appears when we need to evoke several SFs to define a spatial feature through a geometrical figure, e.g. “The triangle Bordeaux-Biarritz-Pau”.

Every relationship has attributes that characterize it. For example, a relationship of distance has a numerical parameter, and a relationship of adjacency has a qualifier. All the resulting SFs conform to the PIV core spatial model. Thus, we can manage A_SFs like “in Laruns” and relative ones like “near Laruns”, “at about 10 km in the south of Pau city”, “between Pau and Laruns”, etc. If we take the example “The triangle Bordeaux-Biarritz-Pau” again, it is a GF composed of an R_SF, which is defined by the geometric relationship “triangle” and by three A_GFs, “Bordeaux”, “Biarritz” and “Pau”. So SFs extracted from various expression modes can be formally represented thanks to the core spatial model. Although this model has been built from linguistic ideas on spatial reasoning, a similar schema can be modelled for time reasoning [4].

² Geography Markup Language, http://opengis.net/gml. However, our model uses a GML-based language to describe the geo-location of SFs.
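The recursive definition of SFs described above can be illustrated with a small data structure. The following Python sketch is only one possible reading of the core spatial model: the class and field names, the bounding-box stand-in for a GML geometry, and the decomposition of “at about 10 km in the south of Pau city” into a distance applied to an orientation are assumptions made for illustration, not the PIV implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical stand-in for a GML geometry: (min_x, min_y, max_x, max_y).
BBox = Tuple[float, float, float, float]


@dataclass
class Relationship:
    """C(s): a spatial relationship and the attributes that characterize it."""
    kind: str                        # "adjacency", "distance", "orientation", "geometric"
    qualifier: Optional[str] = None  # e.g. "near", "south", "triangle"
    value: Optional[float] = None    # numerical parameter, e.g. a distance in km


@dataclass
class SpatialFeature:
    """An SF; `representation` plays the role of A(s) and may be filled in later."""
    text: str
    representation: Optional[BBox] = None


@dataclass
class AbsoluteSF(SpatialFeature):
    """A_SF: a single named entity that can be geo-located."""


@dataclass
class RelativeSF(SpatialFeature):
    """R_SF: defined by a relationship with at least one other SF."""
    relationship: Optional[Relationship] = None
    operands: List[SpatialFeature] = field(default_factory=list)


def absolute_names(sf: SpatialFeature) -> List[str]:
    """Collect, depth first, the named entities an SF ultimately relies on."""
    if isinstance(sf, RelativeSF):
        names: List[str] = []
        for operand in sf.operands:
            names.extend(absolute_names(operand))
        return names
    return [sf.text]


triangle = RelativeSF(
    text="The triangle Bordeaux-Biarritz-Pau",
    relationship=Relationship(kind="geometric", qualifier="triangle"),
    operands=[AbsoluteSF("Bordeaux"), AbsoluteSF("Biarritz"), AbsoluteSF("Pau")],
)

# Assumed decomposition: a distance relationship applied to an orientation.
south_of_pau = RelativeSF(
    text="in the south of Pau city",
    relationship=Relationship(kind="orientation", qualifier="south"),
    operands=[AbsoluteSF("Pau")],
)
ten_km_south = RelativeSF(
    text="at about 10 km in the south of Pau city",
    relationship=Relationship(kind="distance", qualifier="about", value=10.0),
    operands=[south_of_pau],
)
```

Because an R_SF may take other R_SFs as operands, nested expressions are represented without any additional machinery, which is the point of the recursive model.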

Modelling Temporal Feature

As said in [10], TFs may be implicit, may not be locally expressed, or may range over more than one GF. In our corpus, when a TF is expressed, it is an expression relating to a historical time. The expressions studied are either temporal periods (such as “in the first part of the XIXth Century”) or durations (like “for 80 years”). Note that our work in progress has not yet approached the question of the temporality of events. Because of their importance in queries, only expressions evoking an anchor in a chronology have been considered. For these reasons, a very traditional method in formal semantics [8] seems to give sufficient results for our purposes. The first step translates the temporal expression into a formal structure representing three types:

• algebraic intervals with a hierarchical relationship, like Allen’s [1] intervals; such a structure clearly appears in the expression “the first part of the XIXth Century”;
• metrics, as in the expression “for 80 years”;
• sequencing operations, as in “in the next period”.

The second step implies a contextual and referential interpretation to produce a “core” interval between two dates. This approach is similar to the spatial one, so a similar recursive model can be built. The temporal model is composed of a TF that can be an A_TF (Figure 4, B(t)), i.e. a date or a named entity such as “le siècle des Lumières”, or an R_TF. An R_TF is composed of at least one A_TF and of some intermixed relationships (C(t)). For instance, the R_TF “between 1914 and 1918”:

• is first defined by two A_TFs, “1914” and “1918”;
• then an interval relationship (“between”) links these A_TFs to focus on the closed interval.

Temporal relationships can be closed or open interval relationships (“between”, “during”, etc.) or distance relationships (“before”, “after”, “10 years ago”). It could be useful, but is not mandatory, for a TF to have a geometric representation (A(t)).
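The second interpretation step, which turns a temporal expression into a “core” interval between two dates, can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: intervals are expressed in whole calendar years, century n is assumed to run from year 100(n-1)+1 to year 100n, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed interval of calendar years, both bounds inclusive."""
    start: int
    end: int


def year(y: int) -> Interval:
    """A_TF given as a single date."""
    return Interval(y, y)


def century(n: int) -> Interval:
    """A_TF given as a century (assumed convention: 100*(n-1)+1 to 100*n)."""
    return Interval(100 * (n - 1) + 1, 100 * n)


def first_half(iv: Interval) -> Interval:
    """R_TF: hierarchical relationship keeping the first half of an interval.

    Only meaningful for intervals covering at least two years.
    """
    length = iv.end - iv.start + 1
    return Interval(iv.start, iv.start + length // 2 - 1)


def between(a: Interval, b: Interval) -> Interval:
    """R_TF: closed interval running from the start of a to the end of b."""
    return Interval(a.start, b.end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one year."""
    return a.start <= b.end and b.start <= a.end


great_war = between(year(1914), year(1918))   # "between 1914 and 1918"
early_19th = first_half(century(19))          # "first part of the XIXth Century"
```

Reducing every expression to a closed interval keeps the retrieval side simple: matching a query against an index entry becomes an overlap test such as `overlaps(early_19th, year(1820))`.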

2.3 The analysers

As with words or expressions in written language, the important semantic information needed to interpret images, maps or graphics is not carried by single pixels but by meaningful image objects and their mutual relationships. These meaningful objects are called “sems” from now on. [33] and [36] propose a semantic definition to represent spatial data. Finally, the eCognition system provides a powerful toolkit for image analysis³. An interesting characteristic of maps is that they follow a rather strict structural pattern, since their construction by humans is more or less guided by formal rules, or at least by identified habits. Therefore, the semiotic approach of information representation, as studied in [3], allows one to derive a model of the map that will serve as a basis for the analyser tasks. Generally, a document content processing sequence is composed of four main steps:

(1) “tokenisation” divides the document into its smallest sems;
(2) lexical and morphological analysis performs sem recognition;
(3) syntactic analysis, based on grammar rules, finds the bonds between sems;
(4) finally, the “semantic” step carries out a more specific analysis that allows meaningful groupings of sems to be interpreted.

In our data processing the sequence is a little bit different. Let us consider the PIV geo-semantic textual data processing⁴. After a classical textual preprocessing and tokenisation sequence (sub-process A in Figure 5), and according to [2], we adopt an active reading behaviour, that is to say the sought-after information is known a priori (sub-process B in Figure 5). Once this pattern analysis has fetched the “kernel” sems, both morpho-syntactic and semantic analyses are performed thanks to a definite clause grammar (DCG) (sub-process C in Figure 5). Multi-grained expressions (words, noun phrases, sentences...) containing AGFs (i.e. well-known locations) are thus extracted first. This sub-process ends with an AGF validation (sub-process D in Figure 5), relying on internal and/or external gazetteers. Then multi-grained expressions containing RGFs are built from the identified AGFs.

³ http://www.pcigeomatics.com/products/definiens.html
⁴ All the processes described here are fully implemented.

Fig. 5. PIV system: Geo-semantic textual data processing. [Diagram labels: input, a non-structured document; sub-processes A to D, involving a Splitter, Tokenizer, Lemmatizer (lemma rules), POS Tagger, Token Marker and Semantic Tagger; resources: logical sub-structure patterns, date structures, named entities, named spatial feature patterns, temporal and spatial feature initiator lexicons, a DCG grammar with spatial and/or temporal relationship rules, and internal and external resources for the AGFs’ validation process; intermediate result, a new textual flow with “candidate” GF expressions; output, an XML structured document with geo-semantic tags.]
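The order of operations described above (tokenise, mark initiators, validate absolute features against a gazetteer, then build relative features from the validated ones) can be sketched in a few lines. This Python fragment is a deliberately simplified illustration: it replaces the DCG with two hand-written patterns, and the gazetteer, the initiator lexicon and every function name are assumptions rather than the project’s actual resources.

```python
import re
from typing import Dict, List

# Toy resources; both are illustrative assumptions, not the PIV lexicons.
GAZETTEER = {"Pau", "Laruns", "Arudy", "Bescat"}
INITIATORS = {"near": "adjacency", "between": "geometric"}


def tokenize(text: str) -> List[str]:
    """Split a text into word tokens, keeping hyphenated names together."""
    return re.findall(r"[\w-]+", text)


def is_valid_asf(token: str) -> bool:
    """Validation step: a candidate is kept only if the gazetteer knows it."""
    return token in GAZETTEER


def extract_spatial_features(text: str) -> List[Dict]:
    """Return the A_SFs and R_SFs found in a text, in reading order."""
    tokens = tokenize(text)
    features: List[Dict] = []
    i = 0
    while i < len(tokens):
        word = tokens[i].lower()
        kind = INITIATORS.get(word)
        # Pattern "between X and Y": a geometric relationship over two A_SFs.
        if (kind == "geometric" and i + 3 < len(tokens)
                and tokens[i + 2].lower() == "and"
                and is_valid_asf(tokens[i + 1]) and is_valid_asf(tokens[i + 3])):
            features.append({"type": "R_SF", "relation": word,
                             "operands": [tokens[i + 1], tokens[i + 3]]})
            i += 4
        # Pattern "near X": an adjacency relationship over one A_SF.
        elif kind == "adjacency" and i + 1 < len(tokens) and is_valid_asf(tokens[i + 1]):
            features.append({"type": "R_SF", "relation": word,
                             "operands": [tokens[i + 1]]})
            i += 2
        # A validated named entity on its own is an A_SF.
        elif is_valid_asf(tokens[i]):
            features.append({"type": "A_SF", "name": tokens[i]})
            i += 1
        else:
            i += 1
    return features


found = extract_spatial_features(
    "The road between Arudy and Bescat passes near Laruns, south of Pau"
)
```

A grammar-based analyser would handle nesting and far more expression patterns; the sketch only shows why validated absolute features have to be available before relative ones can be assembled.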

2.4 Multi-indexing

As our system architecture is open (loosely composed) and based on web services, we can easily integrate specific tools according to needs. Indeed, an indexing layer can be built for each semantic axis. So we can define several models and processes and develop a dynamic multi-indexing system: a model definition and a specific grammar are enough to automatically build a new indexing layer. More precisely, when documents are added, they are marked and an identifier is assigned to each object, namely a paragraph, a part of an image or a specific layer in a map. Then each extracted feature (corresponding to a semantic aspect) is recorded with its description and an object identifier. We can thus retrieve it with a pattern matching algorithm based on the description and applied to the object pointed to by the identifier.
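The multi-indexing principle above, one layer per semantic axis with each entry pairing a description with an object identifier, could look like the following Python sketch. The class, its methods, the regular-expression extractors and the identifier format are all hypothetical; in particular, the extractors merely stand in for the “model definition plus grammar” pair mentioned in the text.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# An extractor maps the content of an object to feature descriptions.
Extractor = Callable[[str], List[str]]


class MultiIndex:
    """One indexing layer per semantic axis; entries are (object id, description)."""

    def __init__(self) -> None:
        self.extractors: Dict[str, Extractor] = {}
        self.layers: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_layer(self, axis: str, extractor: Extractor) -> None:
        """Registering an extractor is all that is needed to open a new layer."""
        self.extractors[axis] = extractor

    def add_object(self, object_id: str, content: str) -> None:
        """Mark an object (paragraph, image part, map layer) on every axis."""
        for axis, extractor in self.extractors.items():
            for description in extractor(content):
                self.layers[axis].append((object_id, description))

    def search(self, axis: str, pattern: str) -> List[str]:
        """Return the ids of objects whose description matches the pattern."""
        regex = re.compile(pattern, re.IGNORECASE)
        hits: List[str] = []
        for object_id, description in self.layers.get(axis, []):
            if regex.search(description) and object_id not in hits:
                hits.append(object_id)
        return hits


index = MultiIndex()
index.add_layer("spatial", lambda text: re.findall(r"(?:near|in) [A-Z][\w-]+", text))
index.add_layer("temporal", lambda text: re.findall(r"\b1\d{3}\b", text))
index.add_object("doc1#p3", "A chapel near Laruns, rebuilt in 1850")
index.add_object("doc2#p1", "Market day in Pau")
```

With this layout, `index.search("spatial", "laruns")` and `index.search("temporal", r"18\d\d")` query independent layers that point back to the same marked object.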

3 The PIV System Information Extraction and Retrieving Processes

The PIV system implements complementary IE and IR approaches to better manage a cultural and heritage textual corpus. We need to search through a collection of documents (non-structured data as far as spatial computation is concerned) to find GFs that are semantically related to other GFs that have been detected in a free text query. Then, it is necessary to extract fragments of these documents, to classify them and, finally, to present them to the user. As previously mentioned, our “loosely composed” system based on web services easily integrates specific tools according to one’s needs. Thus, our system manages geographic data for spatial semantics with GIS, via dedicated web services. GIS web service tools are first used to validate GF candidates in the semantic processing stage. Then we use them to compute the GFs’ geometric forms and geo-localization during the index creation stage. In the next section, only the processing of a textually expressed SF will be detailed.

3.1 The PIV System Information Extraction for Textual Expressions

Figure 6 shows the semantic feature form resulting from an SF extraction. It is obtained by using the PIV geo-semantic textual data process (shown in Figure 5). This process has been fully implemented thanks to the LinguaStream platform⁵ [35]. The extracted SF “entre Arudy et Bescat” (“between Arudy and Bescat”) is interpreted as an R_SF, defined by a geometric relationship and two A_SFs, “Arudy” and “Bescat” (which are French villages). There may be a gap between the database data structures and an SF expressed in semantic feature form, more precisely between the GIS data structures and our spatial semantic feature extraction system. Our system architecture tolerates missing information in an SF’s definition, so it can manage incomplete ones, and it can call on additional services to fill these gaps.

Fig. 6. Extracted SF and its line semantic structure.

From a technical point of view, we build index files from this marking tool (see next section) using XML technology. GIS tools provide a solution for validating and indexing candidates. Indeed, we have deployed a GIS database using PostGIS⁶ and a layer of French villages. The validation then consists in checking that the candidate exists in the database.

⁵ http://www.linguastream.org
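The validation step amounts to an existence test against the layer of villages. The Python sketch below illustrates that test with an in-memory SQLite table standing in for the PostGIS database; the table name, its single column and the case-insensitive comparison are assumptions made for illustration, and no spatial function of PostGIS is used here.

```python
import sqlite3
from typing import Iterable, List


def build_gazetteer(names: Iterable[str]) -> sqlite3.Connection:
    """Create an in-memory table standing in for the layer of villages."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE villages (name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO villages (name) VALUES (?)",
                     [(name,) for name in names])
    return conn


def is_known_village(conn: sqlite3.Connection, candidate: str) -> bool:
    """A candidate is validated when a row with that name exists."""
    row = conn.execute(
        "SELECT 1 FROM villages WHERE name = ? COLLATE NOCASE LIMIT 1",
        (candidate,),
    ).fetchone()
    return row is not None


def validate(conn: sqlite3.Connection, candidates: Iterable[str]) -> List[str]:
    """Keep only the candidates that the database can confirm."""
    return [c for c in candidates if is_known_village(conn, c)]


gazetteer = build_gazetteer(["Arudy", "Bescat", "Laruns"])
confirmed = validate(gazetteer, ["Arudy", "Atlantis", "bescat"])
```

Against a real spatial database the same query would typically also return the geometry of the matched village, which is what the indexing stage needs next.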

3.2 Geographic Criterion-based Information Retrieval

For every extracted SF, an instance of the core geographic model is created and stored in index files. This instance consists of the name of the feature, its interpretation (A_SF or R_SF with their relationships) and a corresponding geometric shape (for instance, one representing the area concerned). During the last stage of the semantic processing, this geometric object is computed using GIS services. The next section first describes the notion of shape used for an SF’s geo-localization in index files and in queries. Then the spatial semantic-based information retrieval stage is explained.

3.2.1 The spatial location of geographic features

The core geographic model can represent an SF regardless of its expression mode (text, image, etc.): the common denominator is the geo-localization and an associated geometric representation (a geospatial shape).

Giving a representation to SFs

Every feature extracted from documents is geographic data. That is why, if we want to think in terms of a “shape”, we have to recover a geo-located geometric representation of each SF. This is done with the help of GIS tools. Depending on the level of granularity and the level of accuracy considered, the geometrical data corresponding to the shape of an SF can change. GISs provide several geometric objects: points (a church, for example), polylines (a road), polygons, multi-polygons (a city), etc. Moreover, efficient topological functions are available to manage these objects. However, to carry out some experiments and evaluations, we have implemented all the processes in a prototype. These first experiments and evaluations were done to validate our hypothesis that area-based information retrieval is an efficient approach for corpora like those of the PIV project. To reduce the software development effort, we have chosen a rough granularity of information. According to this choice, the possible geometrical shapes of each spatial feature have been simplified into Minimum Bounding Rectangles (MBR) (Figure 7(A)).

⁶ http://postgis.refractions.net
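The MBR simplification described above reduces spatial matching to rectangle arithmetic. The following Python sketch shows the three operations involved: computing the MBR of a shape, enlarging it to approximate a “near” query, and testing two MBRs for intersection. The coordinates are arbitrary planar values, and the margin used for “near” is an assumption, since the source does not state how proximity is quantified.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
MBR = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbr_of(points: Iterable[Point]) -> MBR:
    """Smallest axis-aligned rectangle containing all the given points."""
    pts = list(points)
    if not pts:
        raise ValueError("cannot compute the MBR of an empty shape")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def expand(box: MBR, margin: float) -> MBR:
    """Grow a rectangle by a margin on every side (a crude model of 'near')."""
    return (box[0] - margin, box[1] - margin, box[2] + margin, box[3] + margin)


def intersects(a: MBR, b: MBR) -> bool:
    """True when two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Arbitrary illustrative shapes, not real village outlines.
village_a = mbr_of([(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)])
village_b = mbr_of([(4.0, 1.0), (5.0, 2.0), (4.5, 3.0)])
far_away = mbr_of([(10.0, 10.0), (11.0, 11.0)])

# A "near village_a" query, with an assumed margin of 2.5 units.
near_a = expand(village_a, 2.5)
```

The price of this simplification is false positives: two MBRs can intersect even when the underlying polygons do not, which is the accuracy traded for a smaller development effort.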

Fig. 7. (A) Eaux-Bonnes: its polygon and its MBR. (B) A query example (“near Laruns”) and its matching MBRs. [Figure labels: MBR of Eaux-Bonnes village; Laruns; A_SF “Beost”; R_SFs “in the center of Beost” and “in the east of Eaux-Bonnes”.]

ComputeMBR(GF) {
    if (GF is an AGF) {
        return CallGeoRefWebService(GF);
    } else if (GF is an RGF) {
        relation ...
