Fluid geopoetics DATA project Natalia Boyarskaya and Mikhail Maiatsky UNIL EPFL
[email protected] and
[email protected]
Abstract
Fluid geopoetics DATA is a collaborative interdisciplinary research project that unites experts in literary studies, philosophy, geography, and computer science. Its goal is to examine the underexplored category of "negative geopoetics" in Russian literature (XIX-XXI centuries). The project resulted in a knowledge database that makes it possible to identify, examine, and classify "no-places" and negatively described places in Russian texts of this period. A platform for collaborative exploration, based on modern methods of automated text processing, has been developed and made available to the community. This paper describes the technical development of the project.
KEYWORDS: data in literary studies, user-curated and dynamic ontology, negative geopoetics, semantic clouds, conceptualization and categorization, ontology-organized knowledge base, key phrase.
1 Introduction
Fluid geopoetics DATA is a collaborative interdisciplinary research project that unites experts in literary studies, philosophy, and geography from the University of Lausanne (UNIL team, led by Prof. Anastasia de la Fortelle) and computer scientists from the Swiss Federal Institute of Technology in Lausanne (EPFL team, led by Prof. Karl Aberer). The project was launched in February 2015 in the framework of the CROSS programme, which supports collaborations between researchers in the humanities at UNIL and specialists in sciences and engineering at EPFL.
This project extends the classical approach of geopoetics to the notion of negative geopoetics¹. The latter brings to light an underexplored category of objects in Russian culture and literature of the XIX-XXI centuries: local spaces that were previously marginalized (as unsuitable for the perspective of the centrally placed observer), places that refuse any cultural markers and resist any (mis)use by a global mythology. Negative geopoetics aims to examine, describe intertextually, and classify landscapes while focusing on splits and rifts, such as catastrophes and revelations of the Other and of the irreducible: wastelands; landscapes of catastrophes, whether natural (falling meteors) or sociogenic (Chernobyl); abandoned GULAG camps; various types of wetlands; junkyards, waste deposits, squats, abandoned houses (occupied by migrants, tramps or the homeless); short-stay places (flophouses, camping sites, refugee camps); and “no-places” (“a space in which neither identity, nor relation, nor history is symbolized”²).
The current project builds on the previous study «Obscure territories and “negative geopoetics”» («Territoires obscurs et “géopoétique négative”», UNIL, 2014-2015). It also stems from the ongoing SNSF Sinergia project at EPFL, «Crowdsourced conceptualization of complex scientific knowledge and discovery of discoveries». The Sinergia project is a collaboration between physicists, complexity scientists and computer scientists. Its main focus is the development and advancement of automated methods of information management in the natural sciences. Its results are implemented in the ScienceWISE platform (http://ScienceWISE.info), which allows importing, storing and searching scientific data and provides a semantic recommender system.
The ScienceWISE platform allows a community of scientists working in a specific domain to generate dynamically, as part of their everyday work, a web-based interactive semantic environment for science, consisting of highly structured metadata directly connected to the body of research papers. The ScienceWISE Ontology underpins the whole system and results from a combination of automated tools and a large crowdsourcing effort. The system automatically splits large collections of texts into a hierarchy of research topics.
For the humanities scholars, it was important to adapt the automated analysis tools to their own large collection of texts relevant to negative geopoetics; for the computer scientists, the Fluid geopoetics DATA project provided a novel use case for applying semantic technologies and methods of complexity science to the field of humanities. The main common objective of the project was a conceptualization of the corpus of documents prepared as part of the interdisciplinary UNIL project «Obscure territories and “negative geopoetics”». For this purpose we needed to build:
• a representative and comprehensive corpus of negative geopoetics data and
• a high-quality ontology of geopoetical concepts that represents the field and is justified by usage.

1 See: Forquenot de La Fortelle, A., “Sur quelques étrangers exotiques dans la prose contemporaine russe”, Exotismes dans la culture russe, Études de Lettres, n° 283, UNIL, 2009, p. 253-262; Vinogradova, A., “Les espaces de la marginalité dans la littérature russe actuelle”, 2010; Coldefy-Faucard, A., “La tentation de l’Arctique chez Boris Pilniak”, Exotismes dans la culture russe, Études de Lettres, n° 283, UNIL, 2009, p. 217-226; Coldefy-Faucard, A., “Géographie du mythe”, Revue des Deux Mondes, Paris, octobre-novembre 2010; Nadtochiy, E., “Χωρα, Snuff, Obscure Territories” (in Russian), Sinij divan, n° 20, 2016, p. 43-60.
2 Augé, M. Non-Places: Introduction to an Anthropology of Supermodernity. London: Verso, 1995.
2 Technical development
2.1 ScienceWISE platform
ScienceWISE.info allows scientists to reorder each day's new articles according to their personal interests, so that the most interesting articles appear first; to bookmark and annotate these articles using a scientific ontology; to create and organize personal literature collections; and to perform semantic search of the scientific literature. The ScienceWISE platform (Fig. 1) includes a number of elements: (1) an expanding collection of field-specific, expert-community-ranked encyclopedia articles (mostly on physics); (2) an ontological structure (concepts and logical relations between them) encompassing this encyclopedia; (3) established connections between ontology entries and a vast collection of research papers; (4) an operational platform allowing scientists to annotate and conceptually index (bookmark) the research papers, link them to the ontology, validate and dynamically update the ontology through annotation, etc. ScienceWISE is more than a simple platform for entering or organizing information. It performs some reasoning on top of the existing ontology, carries out simple disambiguation of concepts, and provides tools to describe semantic relations (Fig. 2). The system itself consolidates all local inputs into the current ontology and creates a comprehensive, global and dynamic knowledge system.
Fig. 1. High-level architecture of ScienceWISE
Fig. 2. Visual representation of the concept in the ontology, together with its patterns category (top), semantic relations to the other concepts (left and right arrows) and alternative definitions (bottom).
2.2 Fluid geopoetics DATA as a new project of the humanities branch of the ScienceWISE platform
The first application of the ScienceWISE platform in the field of humanities was an attempt to build the Digital Humanities ontology, using the archives of Digital Humanities journals and the papers of participants of the international conference that took place in July 2014 in Lausanne (https://dh2014.org). That project inherited the principles and the model of organisation of scientific information from the natural sciences. The Fluid geopoetics DATA project takes a new step in Digital Humanities research. The major challenge of the project concerns the database of texts itself, which contains not papers and articles but fiction and other literary texts. Successful adaptation of Fluid geopoetics DATA to the ScienceWISE infrastructure demands additional tools of semantic and linguistic analysis that correspond better to the richness and diversity of literary language (in red, Fig. 3).
Fig. 3. Negative Geopoetics DATA / ScienceWise integration schema.
2.2.1 Creation of the corpus of texts
The text collection of Fluid geopoetics DATA is based on the Russian e-library (Lib.ru) and remains open to new additions. It also contains a corpus of texts in English from Project Gutenberg (gutenberg.org). This compilation allows us to demonstrate that the same semantic tools are applicable to collections in various European languages. The database contains more than 2000 authors and 27400 texts in Russian, and 6700 authors and 14500 texts in English. Authors can be ordered by how frequently the concepts of our ontology occur in their documents, a measure that takes into account the total number of words in the author's work. The collection contains texts dating back to the year 1562. The period filter allows any period to be selected. One can also concentrate on any genre of interest for research (genre filter, Fig. 4).
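The ordering and filtering described above can be illustrated with a minimal sketch. The paper does not give the exact formula used by the platform, so the normalisation below (concept occurrences divided by the total number of words of the author) is an assumption, and the `Text` record, `filter_texts` and `rank_authors` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Text:
    author: str
    year: int
    genre: str
    tokens: list  # lower-cased word tokens of the text


def filter_texts(texts, start_year=None, end_year=None, genre=None):
    """Keep texts inside [start_year, end_year] and, optionally, of one genre."""
    selected = []
    for text in texts:
        if start_year is not None and text.year < start_year:
            continue
        if end_year is not None and text.year > end_year:
            continue
        if genre is not None and text.genre != genre:
            continue
        selected.append(text)
    return selected


def rank_authors(texts, concepts):
    """Order authors by concept occurrences per word (assumed normalisation)."""
    hits, words = {}, {}
    for text in texts:
        words[text.author] = words.get(text.author, 0) + len(text.tokens)
        hits[text.author] = hits.get(text.author, 0) + sum(
            1 for token in text.tokens if token in concepts
        )
    rates = {a: hits[a] / words[a] for a in words if words[a] > 0}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Applying `filter_texts` before `rank_authors` would restrict the ranking to one period or genre, which mirrors how the two filters and the author ordering are combined in the interface.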
Fig. 4. Time and genre filters at work
For literary studies, and especially for literary history, it is very important to be able to trace the development of phenomena, trends, patterns, etc. The timeline feature makes these evolutionary aspects of negative geopoetics visible (Fig. 5).
Fig. 5. Timeline view of the evolution of negative geopoetics.
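A timeline of this kind can be approximated by aggregating concept occurrences per period. The sketch below is only an illustration under stated assumptions: the input format (pairs of year and token list), the ten-year bins and the per-1000-words normalisation are choices made here, not details reported for the platform, and `concept_timeline` is a hypothetical name.

```python
from collections import defaultdict


def concept_timeline(texts, concepts, bin_size=10):
    """Return {period_start: concept occurrences per 1000 words}.

    texts is an iterable of (year, tokens) pairs; bin_size is in years.
    """
    hits = defaultdict(int)
    words = defaultdict(int)
    for year, tokens in texts:
        start = (year // bin_size) * bin_size  # first year of the bin
        words[start] += len(tokens)
        hits[start] += sum(1 for token in tokens if token in concepts)
    return {
        start: 1000 * hits[start] / words[start]
        for start in sorted(words)
        if words[start] > 0
    }
```

Normalising by the number of words in each bin keeps periods with many digitised texts from dominating the curve simply because of their size.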
2.2.2 Ontology of negative geopoetical concepts
We began building the ontology of negative geopoetics from an initial list of manually selected concepts. Using the Dictionary of associations (wordassociations.ru) and the Dictionary of synonyms (sinonimus.ru), we semi-automatically produced semantic clouds for each of the primary concepts and extended the initial list to 700 concepts (in green, Fig. 3). We then complemented the lexical (token-level) analysis with a semantic one and sorted the list into ten semantic centres. We designated them with artificial -ity words to emphasize their abstract, ideal, or «constructed» character (SWAMP-ity, EMPT-ity, CHAOS-ity and the like). Among them is the DRIV-ity category, which helps us to consider the aspect of movement and circulation within negative geopoetics. These concepts are geopoetically and negatively «marked». They form the basis of all further statistical calculations.
Fig. 6. Concepts categorization
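The automatic half of the semantic cloud construction can be sketched as follows. This is an illustration only: the two dictionaries are assumed to be available as plain mappings from a word to a set of related words, the expansion depth is a free parameter, and `build_semantic_cloud` and `expand_seed_list` are hypothetical names. The manual half of the procedure (validating candidates and sorting them into semantic centres) is not modelled.

```python
def build_semantic_cloud(seed, synonyms, associations, max_depth=1):
    """Collect words reachable from seed through either dictionary."""
    cloud = {seed}
    frontier = {seed}
    for _ in range(max_depth):
        next_frontier = set()
        for word in frontier:
            neighbours = synonyms.get(word, set()) | associations.get(word, set())
            next_frontier |= neighbours - cloud
        cloud |= next_frontier
        frontier = next_frontier
    return cloud


def expand_seed_list(seeds, synonyms, associations, max_depth=1):
    """Return {candidate: sorted seeds whose cloud contains it} for manual review."""
    origin = {}
    for seed in seeds:
        for word in build_semantic_cloud(seed, synonyms, associations, max_depth):
            origin.setdefault(word, set()).add(seed)
    return {word: sorted(found_in) for word, found_in in origin.items()}
```

Keeping track of the seeds that produced each candidate gives the reviewer a hint about which semantic centre a new word may belong to; a candidate reached from several seeds is a natural case for manual arbitration.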
There are other concepts that have no obvious negative semantics but are necessary for the description of obscure territories and negative places. We grouped them in the «non-classifiable» subcategory, called «AUXILIARIES». These concepts are excluded from the statistics but remain in the ontology, where they accompany marked concepts and enter bigrams, trigrams and longer n-grams.
Fig. 7. N-grams
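The interplay between marked concepts, auxiliaries and n-grams can be illustrated with a short sketch. It assumes that the ontology is a mapping from token tuples to category names and uses a greedy longest-match scan; both the data layout and the matching strategy are assumptions made here, and the assignment of the example phrases to categories is purely hypothetical.

```python
from collections import Counter

AUXILIARY = "AUXILIARIES"


def find_concepts(tokens, ontology, max_n=3):
    """Greedy longest-match scan; returns (phrase, category) pairs in text order."""
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in ontology:
                found.append((" ".join(phrase), ontology[phrase]))
                i += n  # skip the tokens consumed by the match
                break
        else:
            i += 1  # no concept starts at this token
    return found


def category_statistics(found):
    """Count marked categories only; auxiliaries stay out of the statistics."""
    return Counter(category for _, category in found if category != AUXILIARY)
```

With an ontology in which "house" is an auxiliary and "abandoned house" is a marked bigram, the bigram is counted under its category, while a stand-alone "house" is still recognised but contributes nothing to the statistics.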
Using the method of user-curated and dynamic ontology elaborated by the ScienceWISE experts, we gradually built the negative geopoetics ontology. Based on this ontology, the system extracts from any given text all the ontological concepts («FOUND CONCEPTS»). In addition, the system automatically identifies concepts that were not previously known and offers them to the user for validation. All these concepts are ordered by their relevance to the current text. Some of the most relevant concepts from all these lists are suggested as «CHOSEN CONCEPTS». This automatic suggestion can be manually validated and improved simply by moving concepts between the columns. If a concept is missing, it can easily be added to the ontology by any researcher working with the text. In this way the ontology is developed collaboratively (Fig. 9).
Fig. 9. Mechanism of the user-curated and dynamic ontology at work: conceptualisation and categorization of a new word.
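The three lists can be sketched as follows. The paper does not specify how relevance is computed, so a smoothed TF-IDF score is used here purely as an assumed stand-in; `relevance`, `suggest_concepts`, the stopword set and the `top_k` cut-off are all hypothetical. For brevity the ontology is reduced to a set of single-word concepts.

```python
import math
from collections import Counter


def relevance(tokens, doc_freq, n_docs):
    """Smoothed TF-IDF per token; an assumed stand-in for the platform's score."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {
        word: (count / len(tokens))
        * math.log((1 + n_docs) / (1 + doc_freq.get(word, 0)))
        for word, count in counts.items()
    }


def suggest_concepts(tokens, ontology, doc_freq, n_docs,
                     stopwords=frozenset(), top_k=3):
    """Return (found, candidates, chosen), each ordered by decreasing relevance."""
    scores = relevance(tokens, doc_freq, n_docs)
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    found = [w for w in ranked if w in ontology]
    candidates = [w for w in ranked if w not in ontology and w not in stopwords]
    chosen = [w for w in ranked if w in ontology or w not in stopwords][:top_k]
    return found, candidates, chosen
```

In this sketch, validating a candidate amounts to adding it to `ontology`; on the next run it appears among the found concepts instead of the candidates, which is the feedback loop that lets the ontology grow with use.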
We applied to the negative geopoetics database the state-of-the-art concept discovery algorithms developed by the ScienceWISE researchers. As a result, we obtained a number of concepts used to describe negatively marked places in the corpus: not just a list of free keywords, but a list of key phrases. The algorithm makes it possible to detect various word-level representations of the same concept within a literary text. Following the ScienceWISE principle of the ontology-organized knowledge base, we consider literary writing (like scientific papers on physics) as «bags of concepts», not just «bags of words», which would be significantly reductive. The use of a modularity-based community detection technique makes it possible to determine automatically the number of communities and their hierarchy.
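To make the last step concrete, the sketch below builds a concept co-occurrence graph from bags of concepts and partitions it by greedy agglomerative modularity maximisation. The specific algorithm used by ScienceWISE is not named in this paper, so the greedy procedure is an assumed stand-in chosen for its simplicity; the graph construction (edge weight equal to the number of texts containing both concepts) is likewise an assumption. The quality function is the standard Newman modularity, Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the total edge weight, L_c the weight inside community c and d_c the summed degree of its nodes.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(bags):
    """Edge weight = number of texts whose bag of concepts contains both concepts."""
    weights = defaultdict(int)
    for bag in bags:
        for a, b in combinations(sorted(set(bag)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def modularity(weights, communities):
    """Newman modularity Q of a partition of a weighted undirected graph."""
    total = sum(weights.values())
    label = {node: i for i, com in enumerate(communities) for node in com}
    internal = defaultdict(float)
    degree = defaultdict(float)
    for (a, b), w in weights.items():
        degree[label[a]] += w
        degree[label[b]] += w
        if label[a] == label[b]:
            internal[label[a]] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


def greedy_communities(weights):
    """Merge the pair of communities that most increases Q until no merge helps."""
    nodes = sorted({node for edge in weights for node in edge})
    communities = [frozenset([node]) for node in nodes]
    best_q = modularity(weights, communities)
    while len(communities) > 1:
        best_merge = None
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            q = modularity(weights, merged)
            if q > best_q + 1e-12:
                best_q, best_merge = q, merged
        if best_merge is None:
            break  # no merge improves Q: the number of communities is settled
        communities = best_merge
    return communities, best_q
```

The loop stops on its own when no merge improves Q, which is how a modularity criterion fixes the number of communities without a preset parameter. Recording the order of merges would yield a hierarchy; this naive version recomputes Q for every candidate merge and is only suitable for small graphs.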
Fig. 10. Representation of the concept of negative geopoetics MARE (БОЛОТО, in Russian) together with different kinds of semantic relations.
2.3 As a result, the Negative Geopoetics DATA database allows us:
• To determine automatically what kind of negative place each literary text describes;
• To assign the text to one or several categories of negative geopoetics (…-ity);
• To show the ranks of negative concepts used in a text, by an author, or during a certain period of literary history;
• To find negatively depicted toponyms and to discover among them those used metaphorically (Sahara, Siberia…);
• To provide a list of different morphological groups of negative geopoetics, such as negative place action (to rot, to decay, to perish…), negative place epithet (burnt-out, fetid, deserted, disused…), etc.;
• To determine the relation between a negative place and a character, or a negative place and a narrator;
• To identify the general topic (city, village, industry, transport, war…).
We are also working on the possibility:
• To recognize a stylistic context (description of emotions, judgement, subjective perception of a landscape…);
• To trace the correspondence with genres (war stories, local legends, travellers' notes…).
3 Conclusion
The researchers from the University of Lausanne obtained a powerful and rewarding tool that can be further developed and improved in the process of collaborative exploration. All scholars in the humanities are free to use the database in comparative literary studies as well as for the exploration of any other subject. A clear advantage and novelty of the negative geopoetics database is the possibility of analysing not just one text but a large corpus or a number of smaller sub-corpora. Using the database, we were able to obtain a number of results specific to literary data as such: an abundance of synonyms and a strengthening of horizontal relations. These results will be reported elsewhere. The ScienceWISE researchers had the opportunity to test their methods in the field of literary studies and to draw conclusions about the compatibility of these methods with a corpus of literary texts in a given language. The project opens many interesting possibilities. For example, it would be interesting to evaluate the co-occurrence of negative geopoetical elements: are they independent or concomitant? We leave this for future work.
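As an indication of how the co-occurrence question might be approached, the sketch below computes the lift of two categories over a set of texts, that is, the ratio of their observed joint frequency to the frequency expected if they were independent. This is one possible measure among several and not a result or method of the project; the input format (one set of category names per text) and the function name `cooccurrence_lift` are assumptions.

```python
def cooccurrence_lift(text_categories, cat_a, cat_b):
    """Lift = P(a and b) / (P(a) * P(b)), estimated over texts.

    Values near 1 suggest independence, values above 1 suggest that the two
    categories tend to appear together, values below 1 that they avoid each
    other. Returns None when either category never occurs.
    """
    n = len(text_categories)
    with_a = sum(1 for cats in text_categories if cat_a in cats)
    with_b = sum(1 for cats in text_categories if cat_b in cats)
    with_both = sum(
        1 for cats in text_categories if cat_a in cats and cat_b in cats
    )
    if n == 0 or with_a == 0 or with_b == 0:
        return None
    return (with_both / n) / ((with_a / n) * (with_b / n))
```

On a real corpus the lift would need to be accompanied by a significance test, since for rare categories a ratio far from 1 can arise by chance.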
4 Acknowledgement
This work was supported by the CROSS programme (EPFL-UNIL collaborative grant) and by the Swiss National Science Foundation.