Representing spatiotemporal processes to support knowledge discovery in GIS databases


Published in Proceedings: 8th International Symposium on Spatial Data Handling, edited by T. K. Poiker and N. Chrisman, pp. 431-440.

May Yuan
Department of Geography, University of Oklahoma
[email protected], Telephone: 405-325-4293, Facsimile: 405-325-6090

Abstract

This article has two objectives: (1) to outline a GIS framework that can represent and compute information about spatiotemporal behaviors of processes; and (2) to use this representation in support of automatic GIS query processing to allow integration of GIS and Knowledge Discovery in Databases (KDD) technologies. KDD technology has emerged as an empowering tool in the development of next-generation database and information systems through its ability to extract new, insightful information hidden within a large heterogeneous database. Furthermore, it transforms the information into knowledge through hypothesis testing and theory formation. There are many research issues involved in the integration of GIS and KDD. Specifically, this article emphasizes the importance of enhancing GIS abilities to represent dynamic processes and support automatic query processing as a step towards the integration. The conceptual framework presented here outlines the architecture of such a dynamic representation, along with algorithms for automatic query processing on higher-level information about processes (i.e., frequency, duration, movement, and rate) from the preliminary data recorded in a GIS.

Keywords: Spatiotemporal representations, GIS query, Knowledge Discovery in Databases.

1. Introduction

Over the last decade, we have seen explosive growth in our capabilities to collect geographic data. As a result, advanced remote sensing and survey technologies have flooded us with multiple terabytes of data. Yet, despite the wealth of data, the utility of GIS technology in scientific research is considerably limited because information implicit in GIS data is not easy to discern.
This generates an urgent need for new methods and tools that can intelligently and automatically assist us in transforming geographic data into information and, furthermore, knowledge. Information scientists face the same problems since the digital revolution expedites the production of mountains of data from credit card transactions, medical examinations, telephone calls, stock values, and numerous other human activities. Collaborative efforts in the artificial intelligence, statistics, and database communities have been developing technologies of knowledge discovery in databases (KDD) to extract useful information from massive amounts of data in support of decision making (cf. Gardner 1996, Bhandari et al. 1997, Hedberg 1996).

The development of KDD sets new challenges for database technology: new concepts and methods are needed for basic operations, query languages, and query processing strategies (Imielinski and Mannila 1996). A KDD process includes "data warehousing, target data selection, cleaning, preprocessing, transformation and reduction, data mining, model selection (or combination), evaluation and interpretation, and finally consolidation and use of the extracted knowledge" (Fayyad 1997, p. 5). Specifically, data mining aims to discover something new from the facts recorded in a database through automatic querying processes. It prescribes the steps toward efficient development of knowledge discovery applications. Hitherto, data mining tools have mostly adopted techniques from statistics (Glymour et al. 1996), neural networks (Lu et al. 1996), and visualization (Lee and Ong 1996) to classify data and extract patterns.

Ultimately, KDD aims to enable an information system to provide support for complex queries such as "give me all areas where the likelihood of having droughts exceeds 0.5." Support for queries of this kind is difficult when a database contains only records of precipitation, temperature, and soil moisture but no data about droughts. KDD first provides users with a set of tools to select a sample data set and identify relationships between the target attribute and those included in the database. Second, it automates query processes to compute the needed information. Hence, database query processing plays an important role in KDD development to enable efficient data retrieval and flexible query formulation. However, query processing and optimization for GIS databases are relatively underdeveloped (Samet and Aref 1995, Egenhofer 1992). Enhancing geospatial query support is essential to the success of enabling KDD in GIS databases. At the heart of the problem of enhancing GIS query support is geospatial information representation.
Most GIS data models are static, map-based models that are unable to provide adequate support to represent spatiotemporal behaviors of geographic processes (cf. Armstrong 1988, Langran and Chrisman 1988, Worboys 1992). Some data models are designed to represent processes but are limited to specific data structures (cf. rasters in Peuquet and Duan 1995, points in Raper and Livingstone 1995). Computation of information in response to a user query is impossible in a data model incapable of representing the requested information. Yet processes are often the primary interest in scientific modeling, and we need to represent processes in a way that allows their dynamic properties to be unveiled. Static, map-based representations have circumscribed the usefulness of KDD technologies to GIS databases in two ways. First, GIS support for queries about temporal properties (such as frequency and duration) of a geographic phenomenon is difficult unless these temporal properties are explicitly coded as data fields. This is because map-based data models lack the ability to provide temporal associations among data layers, which makes it difficult to compute temporal properties that require accessing data objects across numerous layers simultaneously. The second drawback of static, map-based data models is inadequate support for queries about spatiotemporal behaviors (such as rate and movement) of a process. To overcome these obstacles, we need a dynamic GIS data model that can represent processes and compute information about spatiotemporal behaviors of these processes. Peuquet (1994) proposed and elaborated a conceptual framework to cope with temporal dynamics of geospatial data. Here, further emphasis is placed on the need for such a data framework to enable GIS support for queries on higher-level information (such as frequency or rate) based on preliminary data records

(such as time and location). Fulfilling this need is a key prerequisite for a GIS to automate query processing and achieve support for knowledge discovery. The next section outlines a representation designed to capture the dynamics of geographic processes to empower GIS support for automatic query processing. Discussions proceed on an example of storm processes and algorithms for computing information about spatiotemporal behaviors of these processes in a GIS context. A conclusion section summarizes the arguments and suggests future research.

2. Representing geographic processes for KDD

Unlike a geographic theme, which denotes attributes at locations, a geographic process progresses in space and time. Hence, a GIS representation for geographic processes needs to be able to portray "something moving across space through time." This sets two requirements. First, the representation needs to provide a mechanism to define a process according to a process defined in biology, meteorology, hydrology, soil science, etc., so that the "something" can be identified. Taxonomies developed in these disciplines can form process classes in the representation. The definition of a process further determines the attribute set of the corresponding process class. Any process object in a process class inherits attributes from the process class, in addition to supplementary properties particular to the object. Data mining can proceed by examining attribute values of process objects and reclassifying them according to correlations unveiled within or among the process classes to which these process objects belong. Because the definitions of processes can vary across disciplines, additional process classes can be introduced to cope with the meanings of these processes from different perspectives. Hence, these process classes and their relationships denote the semantics of a GIS database.
Thus far, this portion of the representation does not involve spatial properties, and it can adopt a relational or object-oriented data model. The second requirement for representing geographic processes arises from the need to describe movement across space through time. Mappings of process objects to space and time create instances of these processes in a geographic context to represent motion. The mappings require the construction of temporal associations among spatial objects (i.e., points, lines, polygons, or cells) to denote change of locations or spatial extents in a time sequence. We can time-stamp every spatial object to enable temporal sequencing. This approach represents time as an attribute of spatial objects, and thus it hampers representation of properties of time (such as units, dimension - instants vs. intervals, classes - world time vs. database time, etc.) and computation of temporal relationships. An alternative approach is to separate time from space and define temporal objects to facilitate classifying temporal objects and denoting temporal relationships. The separation of the meanings of geographic features, time, and space is comparable to the conceptual framework for temporal dynamics proposed by Peuquet (1994). The separation can also facilitate spatial and temporal reasoning with schemes proposed by Egenhofer and Al-Taha (1992), Allen (1983), Freksa (1992) and others, since the spatial or temporal reasoning tasks involved in these frameworks manipulate only spatial or only temporal objects, but not both, to derive corresponding relationships. Nevertheless, associations of spatial and temporal objects by common keys or pointers can support the need for spatiotemporal reasoning and analysis. With these associations, temporal objects can relate spatial objects in a sequence, and, when they refer to a process object, the representation portrays the dynamics of the process in space and time.

A. a graphic representation:

[Diagram not reproduced. It shows three layers. The Semantics layer holds the process classes Extratropical Cyclone, Squall line, Supercell, Hail, and Tornado. The Time layer holds time intervals T1 to T6 bounded by time instants t1 to t7. The Space layer holds spatial objects S1 to S10. Legend: "is a super process of"; "instantiation in time or space"; T = time interval; t = time instant.]

B. a frame representation:

Process class: Extratropical Cyclone
    Sub process: Squall Line, Supercell
    Space-time relation: Aggregates (Sub process)

Process class: Squall Line
    Super process: Extratropical Cyclone
    Subclass: null
    Space-time relation: (T1; S1, S2), (T2; S2, S3)

Process class: Supercell
    Super process: Extratropical Cyclone
    Sub process: Hail, Tornadoes
    Space-time relation: (T3; S4, S5), Aggregates (Sub process)

Process class: Hail
    Super process: Supercell
    Sub process: null
    Space-time relation: (T4; S6), (T5; S7)

Process class: Tornado
    Super process: Supercell
    Sub process: null
    Space-time relation: (T6; S8, S9, S10)

Figure 1: A spatiotemporal representation of storms in an extratropical cyclone system
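The frame representation in Figure 1 can be sketched as a small object model. The following is only an illustration under stated assumptions, not the implementation proposed in the paper: the class names `Interval` and `ProcessObject`, the method names, and the numeric values given to the time instants (t1 to t7 set to 1 to 7, with each interval spanning two successive instants) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interval:
    """Temporal object: an interval bounded by two time instants."""
    name: str
    start: float
    end: float


@dataclass
class ProcessObject:
    """Process object with its own space-time relations and its sub-processes."""
    process_class: str
    relations: list = field(default_factory=list)      # [(Interval, [spatial ids])]
    sub_processes: list = field(default_factory=list)  # [ProcessObject]

    def all_relations(self):
        """Own relations plus those aggregated from every sub-process."""
        found = list(self.relations)
        for sub in self.sub_processes:
            found.extend(sub.all_relations())
        return found

    def spatial_extent(self):
        """Spatial objects touched by the process or any of its sub-processes."""
        return {area for _, areas in self.all_relations() for area in areas}


# Assumption: instants t1..t7 are 1..7 and each interval Ti runs from ti to ti+1.
T1, T2, T3 = Interval("T1", 1, 2), Interval("T2", 2, 3), Interval("T3", 3, 4)
T4, T5, T6 = Interval("T4", 4, 5), Interval("T5", 5, 6), Interval("T6", 6, 7)

# Space-time relations copied from the frames in Figure 1.
hail = ProcessObject("Hail", [(T4, ["S6"]), (T5, ["S7"])])
tornado = ProcessObject("Tornado", [(T6, ["S8", "S9", "S10"])])
supercell = ProcessObject("Supercell", [(T3, ["S4", "S5"])], [hail, tornado])
squall_line = ProcessObject("Squall Line", [(T1, ["S1", "S2"]), (T2, ["S2", "S3"])])
cyclone = ProcessObject("Extratropical Cyclone", [], [squall_line, supercell])

print(sorted(cyclone.spatial_extent(), key=lambda s: int(s[1:])))
```

In this sketch the cyclone holds no space-time relation of its own; its extent is obtained purely by aggregating its sub-processes, which mirrors the "Aggregates (Sub process)" entry in the frames.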

Figure 1 illustrates a simplified example of using the aforementioned representation to describe storms in an extratropical cyclone. The figure includes two main types of thunderstorms: supercell storms and squall-line storms. Supercells are isolated rotating storms capable of producing destructive hail and tornadoes (Weisman and Klemp 1986), while a squall line is generally considered a continuous line of storms (Hane 1986). Meteorologists have definitions for these processes according to their dynamics and other physical properties. These definitions and the relationships among these processes form the semantic components of the information. Instantiating a process object to record a supercell, for example, requires creating a supercell object with the attributes specified in the supercell process class and initiating corresponding temporal and spatial objects to denote the time and location of the occurrence. Specifically, Figure 1 depicts an example of an extratropical cyclone with both squall-line and supercell storms. The squall line initiated at t1 in areas S1 and S2, moved to area S3 at t2, and stayed there until t3 when a supercell developed. Subsequently, the supercell produced hail in area S6 during T4 and in S7 during T5. In T6, the supercell produced a tornado in areas S8, S9, and S10. The representation of these process objects is a simple semantic network holding the relationships among process classes. Each process class has a specific set of attributes pertinent to the definition of the process type. Temporal objects indicate times of significance. The figure presents an interval-based representation of time, in which an interval (T) consists of two time instants (t) that denote the starting and ending times of the interval, with which we can compute the duration and rate of movement of a process.
The frequency of a process type (such as supercell) is also computable if the process type contains multiple process objects (representing instances) associated with temporal objects over a time range of interest. The combination of semantics and time explicitly expresses that a process lasts as long as its sub-processes continue. Similarly, the spatial extent of a process varies with the spatial extents of its sub-processes. For example, the extratropical cyclone started as a squall line and lasted through the period of a supercell storm with hail until the end of its tornado. Hence, instantiation of a process object maps semantics to time and space, while aggregation of the temporal and spatial objects of sub-processes determines the temporal and spatial extents of their super-processes. In summary, the representation can portray spatiotemporal behaviors of processes in terms of spatial variations in frequency, duration, movement, and rate. Although current GIS data models are unable to represent the dynamics of a process, they provide an excellent foundation for handling spatial objects. For example, Langran and Chrisman’s (1988) Space-Time Composite model is applicable to modeling the spatial objects in Figure 1. However, the Space-Time Composite model needs mechanisms to enable aggregations of spatial objects for super-processes. Technologies to aggregate spatial objects are available, including Arc/INFO’s dynamic segmentation for linear objects (also useful for handling time intervals) and regions for areal objects (van Roessel and Pullar 1993). Therefore, implementation of the aforementioned representation is plausible with the combination of three components: (1) semantic networks for process classes, (2) interval-based representations for temporal object classes, and (3) the Space-Time Composite model with algorithms of spatial aggregation for spatial object classes. This architecture enables automatic query processing to extract

information not only about properties of a process object but also about the development of the process in space and time and potentially the history of a process class.

3. Automatic GIS query processing

Automatic query processing is fundamental to knowledge discovery support in that it automates a GIS to enhance the value of raw data by extracting higher-level information, which can prompt the user to better understand the phenomena generating the data. More often than not, researchers obtain large amounts of data but have limited tools to explore the data to unveil patterns and relationships. Furthermore, few researchers without practical experience in database and GIS technologies know how to specify the desired query to begin with. Automatic query processing plays an important role in interfacing scientists with GIS databases to explore and understand patterns embedded in raw data. With higher-level information extracted from the raw data, scientists can formulate hypotheses, filter what is useful from the background, and search for hypotheses that require a large amount of highly specialized domain knowledge. The representation framework in Figure 1 greatly eases the incorporation of automatic GIS query processing to extract information about spatiotemporal behaviors of processes, including frequency, duration, movement, and rate of movement. These kinds of information are not easily handled by the representation and computation capabilities of current GIS. Frequency analysis attempts to compute the occurrences of a type of process with common characteristics during a certain period of time or within a certain area. Because the representation organizes processes of the same type into a process class, the properties and the number of occurrences of process instances are readily available from the process class in a GIS database. In geographic applications, frequency distributions in different periods of time and across space are of great interest.
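Counting the instances of one process class per period and per area can be sketched as follows. This is a minimal illustration, assuming process instances are available as simple records; the sample records, the function names, and the use of fixed-width time bins in place of a real cluster analysis are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical process instances: (process class, start time in hours, area id).
instances = [
    ("Supercell", 2.0, "S4"),
    ("Supercell", 3.5, "S4"),
    ("Supercell", 14.0, "S5"),
    ("Squall Line", 1.0, "S1"),
    ("Supercell", 15.5, "S5"),
]


def frequency(instances, process_class, bin_hours=6.0):
    """Count instances of one process class per time bin and per area."""
    # Retrieve the instances that belong to the requested process class.
    selected = [rec for rec in instances if rec[0] == process_class]
    # Fixed-width bins stand in for temporal clusters (an assumption).
    by_time = Counter(int(start // bin_hours) for _, start, _ in selected)
    # Area identifiers stand in for spatial clusters (an assumption).
    by_area = Counter(area for _, _, area in selected)
    return by_time, by_area


def text_histogram(counts):
    """Render counts as one line of '#' marks per key, in sorted key order."""
    return [f"{key}: {'#' * counts[key]}" for key in sorted(counts)]


by_time, by_area = frequency(instances, "Supercell")
for line in text_histogram(by_time) + text_histogram(by_area):
    print(line)
```

Replacing the fixed bins with a statistical or neural-network clustering step would change only how the keys of the two counters are produced; the counting and the histogram stay the same.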
GIS query processing for frequency analysis needs to account for attribute-based, time-based, and space-based clusters of processes. Algorithms for query processing on frequency analysis, therefore, include procedures to (1) identify a process class and retrieve its process instances, (2) search for temporal and spatial associations with these process instances, (3) analyze the clustering of these process instances in space and time, (4) count the number of process objects in each cluster, and (5) present the result in a histogram. These procedures involve basic searching and arithmetic calculation and thus can be automated easily. Among them, the most challenging procedure is cluster analysis. In addition to traditional cluster analysis methods in statistics and quantitative geography, neural network techniques have been shown to be effective in discerning clusters. GIS query processing can incorporate these developed methods (mainly for attribute analysis) with the necessary extensions to the space and time dimensions for temporal and spatial cluster analysis. With these cluster analysis methods, users are able to specify a process type to initiate an automated GIS query about frequency and acquire clusters of processes according to characteristics and occurrences in time or space. Information about spatial distributions of the duration of a process can provide insights into the potential influence of environmental settings on the life span of a process type. The representation in Figure 1 relates process classes (and hence process objects) to temporal objects, with which the system can calculate the duration of a process based on starting and ending times. Process objects in super-classes without direct links to temporal objects have a duration equal to the

summation of the durations of all corresponding process objects in their sub-classes. For example, the life span of an extratropical cyclone is equal to the combined duration of squall-line storms and supercells, and the duration of a supercell includes the life spans of the associated hail and tornadoes. An algorithm to automate a GIS query for spatial distributions of process duration consists of routines to (1) specify process classes of interest, (2) retrieve process instances from these classes, (3) compute durations for these processes, and (4) map these duration values to the associated spatial objects. The key procedure here is to compute duration for process objects in super-classes without direct associations to temporal objects. The duration of a super-class object (such as extratropical cyclone) is the difference between the minimal starting time (t1) and the maximal ending time (t7) of the associated process objects in its sub-classes (i.e., squall line and supercell). The duration distribution can be mapped according to the spatial extents (represented by spatial objects) of these processes. Representing information about movement has been a major challenge to GIS researchers. The dynamics of a process require a means to associate spatial objects of the same process through time. The example in Figure 1 demonstrates an alternative GIS representation built on processes and the temporality of spatial objects, while these spatial objects, in fact, remain in a frame of conventional map-based models. This representation separates processes and time from space, instead of modeling them as attributes of spatial objects as in a map-based approach. Therefore, a process object can have multiple temporal objects and spatial objects in a GIS database to convey the information that a process moves to different locations through time. For example, the movement of a squall line is a progression from S1 to S3 between t1 and t3 (Figure 1).
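The duration rule for super-class objects and the ordering of spatial objects through time can be sketched as below. This is an illustration only: the numeric instants (t1 to t7 set to 1 to 7), the tuple layout, and the function names are assumptions, and the sub-process hierarchy of Figure 1 is flattened by hand rather than traversed through a semantic network.

```python
# Assumption: instants t1..t7 are 1..7; each record is (start, end, spatial objects).
squall_line = [(1, 2, ["S1", "S2"]), (2, 3, ["S2", "S3"])]
supercell = [(3, 4, ["S4", "S5"])]
hail = [(4, 5, ["S6"]), (5, 6, ["S7"])]
tornado = [(6, 7, ["S8", "S9", "S10"])]

# Relations that each super-process aggregates from its sub-processes,
# flattened by hand for this sketch.
aggregated = {
    "Supercell": [supercell, hail, tornado],
    "Extratropical Cyclone": [squall_line, supercell, hail, tornado],
}


def duration(relation_sets):
    """Maximal ending time minus minimal starting time over all relations."""
    relations = [rel for rels in relation_sets for rel in rels]
    latest_end = max(end for _, end, _ in relations)
    earliest_start = min(start for start, _, _ in relations)
    return latest_end - earliest_start


def movement(relations):
    """Spatial objects of a process, ordered by the time of each interval."""
    ordered = sorted(relations, key=lambda rel: rel[:2])
    return [areas for _, _, areas in ordered]


print(duration(aggregated["Extratropical Cyclone"]))
print(movement(squall_line))
```

Under these assumed instants the cyclone lasts from t1 to t7, and the squall line is reported first over S1 and S2 and then over S2 and S3, whatever order the records were stored in.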
This approach simplifies automating GIS queries about movement since the query processing involves only search routines to find temporal and spatial objects for a specific process class. Next, sorting routines are used to order the search results based on the retrieved temporal objects. Although the search has to trace the semantic network of process classes to reveal process objects in subclasses, this procedure is common to all kinds of queries using the proposed framework and should be programmed as a common routine. Once a GIS can handle information about movement, the system can also support queries about rate of movement. Rate of movement is the ratio between travel distance and travel time of a given process. While computing travel time (Tt) is straightforward by finding the difference between the starting time and ending time of the process, there are numerous ways to compute travel distance based on different assumptions. One approach is to use the distance (Dt) between weighted or unweighted geometrical centers of two spatial object sets associated with a process at two time points:

$$ D_t = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2} $$

where X1 and X2 are the X coordinates of the centers of the two spatial object sets, and Y1 and Y2 are the Y coordinates of the two centers. For example, a squall line is associated with S1 and S2 at t2 but is associated with S2 and S3 at t3 (Figure 1). Hence, the squall line moves from S1 and S2 to S2 and S3 from t2 to t3. Using the center approach, we calculate two centers: one for S1 and S2, and the other for S2 and S3. The distance between the two centers will be calculated as Dt. Another approach to compute travel distance is based on change in the spatial extent of a process

during a period. The travel distance is then equal to the square root of the absolute change in spatial extent:

$$ D_t = \sqrt{\left| \sum \mathrm{Area}_1 - \sum \mathrm{Area}_2 \right|} $$

where ∑Area1 and ∑Area2 are the total areas of spatial objects at the beginning (Area1) and the end (Area2) of the study period. This approach can also examine the expansion or contraction of a process in space. Most physical processes involve expansion or contraction. Therefore, the rate of movement is also a variable. The measurement techniques used to sample a process determine the spatial and temporal resolutions of the process and hence the amount of detail that can be calculated in the rate of movement. For a process object consisting of sub-class processes, its rate of movement varies in time and space according to the rates of its sub-processes. For example, the representation in Figure 1 shows that an extratropical cyclone moved at a variable rate in different areas (i.e., S1 – S10) during different periods of time (i.e., T1, T2… T6). The proposed representation supports computation of area change for expansion and contraction by explicit mappings of a process to spatial objects through time. Once the travel distance has been determined, the rate of movement can be derived as the ratio between travel distance and travel time:

$$ R_t = D_t / T_t $$

The computation and automation of the query process are straightforward. With this automatic query processing ability, a GIS can provide users with information to detect areas of unusual movement of a process and to probe the environmental factors that control the progression of the process.

4. Conclusion

In order to provide a proactive analysis and decision support environment, a GIS needs KDD concepts and techniques to cope with ever-growing geospatial databases. Scientists need tools to explore GIS databases and extract higher-level information from voluminous raw data records. Despite continuing improvements in GIS design, few inexperienced GIS users know what to query and how to formulate a query when they deal with large databases.
KDD concepts and techniques have shown promising signs of yielding significant enhancements to resolve manageable, informative, and comprehensive patterns from massive databases in many socioeconomic and scientific applications (cf. Fayyad et al. 1996a, Fayyad et al. 1996b, Matheus et al. 1996, Apte and Hong 1996, Knorr and Ng 1996). Likewise, KDD capabilities can empower GIS support to explore scientific data, extract informative patterns for building and testing hypotheses, and formulate insightful knowledge for decision making. In short, information production is the key emphasis in KDD. Because information representation is a critical factor in information production, developing a dynamic GIS representation to allow searching and generating higher-level spatiotemporal information, which otherwise may be indiscernible, is an important step towards GIS-KDD integration. This paper demonstrates a representation that enables GIS modeling of spatiotemporal behaviors of processes. The representation is an integrated model of semantic networks, interval-based time representations, and space-time composites, each of which handles objects of

processes, time, and space to convey the dynamic information about frequency, duration, movement, and rate. With this representation, a GIS is able to provide the means for higher-level information production to support complex queries about spatiotemporal behaviors. Algorithms to compute information on frequency, duration, movement, and rate have been outlined and are readily transformable to the design of automatic GIS query processing. Research is in progress to implement the representation and automatic GIS query processing in an attempt to integrate KDD and GIS. As KDD challenges database technology for new concepts and methods for query languages, basic operations, and query processing strategies, the integration of KDD and GIS provokes studies to enhance spatiotemporal query support and information production.

References

Allen, J. F. 1983. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832-843.
Apte, C. and Hong, S. J. 1996. Predicting equity returns from securities data with minimal rule generation. In Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), AAAI Press/MIT Press, Boston, MA. Chapter 22.
Armstrong, M. P. 1988. Temporality in spatial databases. Proceedings: GIS/LIS’88, 2:880-889.
Bhandari, I., Colet, E., Parker, J., Pines, Z., Pratap, R., and Ramanujam, K. 1997. Advanced scout: data mining and knowledge discovery in NBA data. Data Mining and Knowledge Discovery, 1:121-125.
Egenhofer, M. J. 1992. Why not SQL! International Journal of Geographical Information Systems, 6(2):71-85.
Egenhofer, M. J. and Al-Taha, K. K. 1992. Reasoning about gradual changes of topological relationships. International Journal of Geographical Information Systems, 6(4):196-219.
Fayyad, U. 1997. Editorial. Data Mining and Knowledge Discovery, 1:5-10.
Fayyad, U., Haussler, D., and Stolorz, P. 1996a. Mining scientific data. Communications of the ACM, 39(11):51-57.
Fayyad, U., Djorgovski, S. G., and Weir, N. 1996b. Automating the analysis and cataloging of sky surveys. In Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), AAAI Press/MIT Press, Boston, MA. Chapter 19.
Freksa, C. 1992. Temporal reasoning based on semi-intervals. Artificial Intelligence, 54:199-227.
Gardner, C. 1996. IBM Data Mining Technology. IBM Corporation, Stamford, Connecticut.
Glymour, C., Madigan, D., Pregibon, D., and Smyth, P. 1996. Statistical inference and data mining. Communications of the ACM, 39(11):35-41.
Hane, C. E. 1986. Extratropical squall lines and rainbands. In Mesoscale Meteorology and Forecasting, Ray, P. S. (Ed.), American Meteorological Society, Boston, Massachusetts. Pp. 359-389.
Hedberg, S. R. 1996. Search for the mother lode: tales of the first data miners. IEEE Expert, 11(5):4-7.
Imielinski, T. and Mannila, H. 1996. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58-64.
Knorr, E. M. and Ng, R. T. 1996. Finding aggregate proximity relationships and commonalities in spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 8(6):884-897.
Langran, G. and Chrisman, N. R. 1988. A framework for temporal geographic information. Cartographica, 25(3):1-14.
Lee, H. and Ong, H. 1996. Visualization support for data mining. IEEE Expert, 11(5):69-75.
Lu, H., Setiono, R., and Liu, H. 1996. Effective data mining using neural networks. IEEE Transactions on Knowledge and Data Engineering, 8(6):957-961.
Matheus, C., Piatetsky-Shapiro, G., and McNeill, D. 1996. Selecting and reporting what is interesting: the KEFIR application to healthcare data. In Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), AAAI Press/MIT Press, Boston, MA. Chapter 20.
Peuquet, D. J. 1994. It’s about time: a conceptual framework for the representation of temporal dynamics in geographic information systems. Annals of the Association of American Geographers, 84(3):441-462.
Peuquet, D. J. and Duan, N. 1995. An event-based spatiotemporal data model (ESTDM) for temporal analysis of geographical data. International Journal of Geographical Information Systems, 9(1):7-24.
Raper, J. and Livingstone, D. 1995. Development of a geomorphologic spatial model using object-oriented design. International Journal of Geographical Information Systems, 9(4):359-384.
Samet, H. and Aref, W. G. 1995. Spatial data models and query processing. In Modern Database Systems: The Object Model, Interoperability, and Beyond, Kim, W. (Ed.), ACM Press, New York, New York. Chapter 17, pp. 338-360.
van Roessel, J. and Pullar, D. 1993. Geographic region: a new composite GIS feature type. Proceedings: AutoCarto 11:145-156.
Weisman, M. L. and Klemp, J. B. 1986. Characteristics of isolated convective storms. In Mesoscale Meteorology and Forecasting, Ray, P. S. (Ed.), American Meteorological Society, Boston, Massachusetts. Pp. 331-358.
Worboys, M. F. 1992. A model for spatio-temporal information. Proceedings: the 5th International Symposium on Spatial Data Handling, 2:602-611.
