A probabilistic approach to model adaptive hypermedia systems


1st International Workshop on Web Dynamics, held in conjunction with the 8th International Conference on Database Theory, London, UK, January 3, 2001

Session 1
Beyond Hubs and Authorities: Spreading Out and Zooming In (Invited Talk)
  Soumen Chakrabarti .......................................................... 3
A Web Site Navigation Engine (demo)
  Mark Levene, Richard Wheeldon, and Jon Bitmead .............................. 4
Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages
  Mathias Gery and Jean-Pierre Chevallet ...................................... 6
Learning of Ontologies from the Web: the Analysis of Existent Approaches
  Borys Omelayenko ............................................................ 16

Session 2
Fast Search & Transfer ASA (Invited Talk)
  Knut Magne Risvik ........................................................... 26
An Active Web-based Distributed Database System for E-Commerce
  Hiroshi Ishikawa and Manabu Ohta ............................................ 27
Dynamics of Web Rings
  Sebastian M. Maurer, Bernardo A. Huberman, and Eytan Adar ................... 37

Session 3
Logics for Mobility (Invited Talk)
  Luca Cardelli ............................................................... 49
A Probabilistic Approach to Model Adaptive Hypermedia Systems
  Mario Cannataro, Alfredo Cuzzocrea, and Andrea Pugliese ..................... 50
Run-time Management Policies for Data Intensive Web Sites
  Christos Bouras and Agisilaos Konidaris ..................................... 61


Beyond Hubs and Authorities: Spreading Out and Zooming In
Soumen Chakrabarti, Indian Institute of Technology, Bombay

Abstract: After crawling and keyword indexing, the next wave that has made a significant impact on Web search is topic distillation: analyzing properties of the hyperlink graph for enhanced ranking of Web pages in response to a query. Hyperlink induced topic search (HITS) and PageRank (used in Google) are two examples. The linear algebra involved in HITS and PageRank is standard, but selecting the relevant subgraph of the Web to which these algorithms should be applied is considerably less clear. PageRank was intended for the entire Web graph (or as much as a crawler can collect), whereas HITS used keyword match followed by a distance-one graph expansion to determine the relevant subgraph. The clean graph model used in HITS and PageRank, where pages are nodes with no finer characteristics other than a few scalar popularity scores, is also in question. Pages have valuable markup structure and accompanying text. Moreover, the 'hubs' or resource lists that make HITS so successful are often 'mixed', meaning only specific regions in those pages are relevant to the query. In this talk we will discuss two enhancements to the graph selection process. First we will describe a learning system called a "focused crawler" which discovers and collects large relevant graphs useful for enhanced topic distillation, starting with a few relevant examples and without crawling the Web at large. Second we will discuss a fine-grained model for 'micro-hubs' and new algorithms based on the Minimum Description Length principle which let us identify regions in mixed hubs that are relevant to a query, which enhances both topic distillation and information extraction. We will justify, using analyses and anecdotes, that as the Web evolves from static files to dynamically generated semi-structured content, these more complex models and algorithms will become crucial to the continued success of automatic resource discovery, extraction, and annotation.
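As a purely illustrative aside (not part of the talk), the kind of "standard linear algebra" the abstract refers to can be sketched as a PageRank-style power iteration over a toy link graph; the graph, damping factor and iteration count below are made-up examples.

    # Illustrative PageRank-style power iteration over a toy link graph.
    # The graph and parameters are hypothetical, not data from the talk.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                       # dangling page: spread its rank uniformly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy_graph))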


A Web Site Navigation Engine
Mark Levene, Richard Wheeldon, Jon Bitmead
NavigationZone, Department of Computer Science, University College London, U.K.

1 Introduction

Users navigating (or "surfing") through a web site often "get lost in hyperspace": they lose the context in which they are browsing and are unsure how to proceed in terms of satisfying their original goal. This unresolved problem in web site usability, of assisting users in finding their way, is termed the navigation problem. (See [LL00] for a survey and critique of the navigation problem.) The problem is becoming even more acute with the continuing growth of web sites, whose structure is becoming more complex, and with the vast amount of information they intend to deliver. In contrast, users are not willing to invest time in learning this structure and expect the relevant content to be delivered without delay. To tackle this problem we are developing a navigation system for semi-automating user navigation which builds trails of information, i.e. sequences of linked pages, which are relevant to the user query. The preferred trails are presented to the user in a tree-like structure with which they can interact. This is in sharp contrast to a search engine, which merely outputs a list of pages relevant to the user query without addressing the problem of which trail the user should follow. We discuss the architecture of the navigation system and give a brief description of the navigation engine and user interface.

2 Architecture of the Navigation Engine

The architecture of the navigation system is shown in Figure 1. The user interface executes on top of a conventional browser such as Microsoft Internet Explorer or Netscape Navigator. It obtains the preferred trails for navigation, given a user query, from the navigation engine and requests pages for display from the Web site via a proxy. The navigation engine consists of two main modules: (i) the information retrieval module, and (ii) the best trail module. The information retrieval module performs conventional information retrieval over Web pages combined with a page ranking algorithm that takes into account the potential of a Web page as a starting point for navigation. The best trail module computes the preferred trails for navigation given the input query. The algorithm is adaptive in the sense that it dynamically searches for the preferred trails by mimicking a user navigation session and scoring the trails as they are expanded according to the topology of the Web site. (See [LL00] for some detail on the algorithmic process used by the navigation engine to compute the preferred trails given a user query.) The navigation engine interacts with a Web case, which is a database storing the details of the pages of the Web site in preprocessed form. The software robot is an offline process which is responsible for the creation of the Web case.
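The paper does not give the best trail algorithm itself; the following is only a rough sketch of the general idea described above, expanding candidate trails from a starting page and scoring them against the query. The graph representation, scoring function and parameters are assumptions made for illustration, not the engine's actual method.

    # Rough sketch of a "best trail" style search: expand trails from a starting page,
    # score each trail against the query, keep the best ones. Illustrative only.
    def best_trails(graph, scores, start, max_len=4, beam=5):
        """graph: url -> list of linked urls; scores: url -> relevance to the query."""
        trails = [[start]]
        for _ in range(max_len - 1):
            candidates = []
            for trail in trails:
                for nxt in graph.get(trail[-1], []):
                    if nxt not in trail:               # avoid cycles
                        candidates.append(trail + [nxt])
            if not candidates:
                break
            # keep the `beam` trails with the highest cumulative relevance
            candidates.sort(key=lambda t: sum(scores.get(u, 0.0) for u in t), reverse=True)
            trails = candidates[:beam]
        return trails

    graph = {"/home": ["/products", "/faq"], "/products": ["/specs"], "/faq": [], "/specs": []}
    scores = {"/home": 0.1, "/products": 0.6, "/specs": 0.8, "/faq": 0.2}
    print(best_trails(graph, scores, "/home"))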


3 User Interface

The user interface includes two main mechanisms: (i) the navigation tool bar and (ii) the navigation tree window. The navigation tool bar comprises a sequence of URLs. The first two URLs displayed on the navigation tool bar are the last two URLs of the pages that the user browsed, thus providing a history mechanism. The next URL on the navigation tool bar is the current URL, identifying the page the user is currently browsing. The URLs that follow the current URL on the navigation tool bar are the consecutive URLs of the best trail found from the current URL. All URLs are clickable and cause the navigation tool bar to be updated accordingly. The navigation tree window displays the preferred trails given the user query, organized in the form of a tree structure with the trails ranked from the most preferred according to their score. The user interacting with the navigation tree window can select any URL on one of the preferred trails, causing it to become the current URL. The navigation tool bar, the navigation tree window and the browser window are all synchronized according to the current URL. The mechanisms of the user interface provide the user with guidance and context throughout a navigation session, given the input query. The user interface can be embedded in a Web site as a navigation mechanism complementing or replacing a Web site search engine.
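As a hedged illustration of the tool bar logic described above (not code from the system), the bar can be assembled from the last two visited URLs, the current URL, and the remainder of the best trail.

    # Illustrative only: assemble the navigation tool bar from browsing history,
    # the current page and the best trail computed for the query.
    def toolbar(history, current, best_trail):
        bar = history[-2:] + [current]                 # two history URLs, then the current one
        if current in best_trail:                      # URLs of the best trail after the current page
            bar += best_trail[best_trail.index(current) + 1:]
        return bar

    print(toolbar(["/home", "/products"], "/faq", ["/faq", "/support", "/contact"]))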

References

[LL00] M. Levene and G. Loizou. Web interaction and the navigation problem in hypertext. In A. Kent, J.G. Williams, and C.M. Hall, editors, Encyclopedia of Microcomputers. Marcel Dekker, New York, NY, 2000. To appear.


Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages
Mathias Géry, Jean-Pierre Chevallet
Equipe MRIM (Modélisation et Recherche d'Information Multimédia), Laboratoire CLIPS-IMAG, B.P. 53, 38041 Grenoble Cedex 9, France
E-mail: Mathias.Gery, Jean-Pierre.Chevallet@imag.fr

Abstract: The World Wide Web is a distributed, heterogeneous and semi-structured information space. With the growth of available data, retrieving interesting information is becoming quite difficult and classical search engines often give very poor results. The Web is changing very quickly, and search engines mainly use old and well-known IR techniques. One of the main problems is the lack of explicit HTML page structure, and more generally the lack of explicit Web site structure. We show in this paper that it is possible to extract such a structure, which can be explicit or implicit: hypertext links between pages, the implicit relations between pages, the HTML tags describing structure, etc. We present some preliminary results of a Web sample analysis extracting several levels of structure (a hierarchical tree structure, a graph-like structure).
Keywords: Web Information Retrieval, Web Pages Analysis, Structure Extraction, Statistics

1 Introduction

The task of an Information Retrieval System (IRS) is to process a whole set of electronic documents (a corpus), with the aim of making it possible to retrieve those documents matching the user's information need. Contrary to Database Management Systems (DBMS), the user expresses with a query the semantic content of the documents that he seeks. We distinguish two principal tasks:
Indexing: the extraction and storage of the documents' semantic content. This phase requires a representation model of these contents, called the document model.
Querying: the representation of the user's information need, generally in the form of a query. It is followed by the retrieval task and the presentation of the results. This phase requires a representation model called the query model, and a matching function to evaluate document relevance.
IRSs were classically used for textual document databases, or multimedia databases like medical corpora. The growth of the Web constitutes a new field of applicability for IR. The number of users on the Web has been estimated at 119 million in 1998 (NUA Ltd Internet survey, July 1998) and 333 million in 2000 (NUA Ltd Internet survey, June 2000). The number of accessible pages has been estimated at 320 million in December 1997 [20], at 800 million in February 1999 [21] and at more than 2 billion in July 2000 [26]. The Web is a huge and sometimes chaotic information space without central authority. In this context, and in spite of standardization efforts, these documents are very heterogeneous in their content as in their presentation: the HTML standard is respected in less than 7% of HTML pages [4]. We can expect to find almost everything there, but retrieving relevant information seems to be like finding the needle in the haystack... Nowadays, the great challenge for research in IR is to help people profit from the huge amount of resources existing on the Web. But there exists as yet no approach that satisfies this information need in both an effective(1) and efficient(2) way. For assisting the user in his task, some search engines (like AltaVista, AllTheWeb, or Google(3)) are available on the Web. They are able to process huge document volumes with several tens of millions of indexed pages.

(1) Measure of IR-tool quality, classically evaluated using precision and recall measures.
(2) Measure of system resource use: memory usage, network load, etc.
(3) http://www.altavista.com, http://www.alltheweb.com, http://www.google.com


They are nevertheless very fast and are able to answer several thousand queries per second. In spite of all their efforts, the answers provided by these systems are generally not very satisfactory. Preliminary results obtained with a test collection of the TREC conference Web Track have shown the poor quality of the results of 5 well-known Web engines, compared to those of 6 systems taking part in TREC [17]. In fact, most existing search engines use well-known techniques like those described by Salton 30 years ago [30]. Most of them prefer a wide coverage of the Web with low indexing quality to better indexing of a smaller part of the Web. In particular, they generally consider HTML pages as atomic and independent documents, without taking into account the relations existing between them. The notion of a document for a search engine is reduced to its physical appearance, an HTML page. The Web's structure is used in only a few Web search engines such as Google [6]. With an aim of Structured IR, we wanted to determine which structure exists on the Web, and which structure it is possible to extract. This paper is organized as follows: after a presentation of related works in section 2 (i.e. IR with structured documents, hypertexts and the Web), we present our hypotheses about what would be an ideal structure for the Web in section 3.1. Then we propose our approach to validate our hypotheses and check whether this kind of structure exists on the Web in section 3.2. Finally we introduce the Web sample that we have analysed in section 4.1 and some preliminary results of our experiments in sections 4.2, 4.3 and 4.4, while section 6 gives a conclusion about this work-in-progress and some future directions of our work.

2 IR and structure on the Web

The Web is not only a simple set of atomic documents. The HTML standard allows the description of structured multimedia documents and is widely used to publish on the Web. Furthermore, the Web is a hypertext, with URLs (Uniform Resource Locators) used for the description of links. This structure has been used for IR, both in the context of structured documents and in the context of classical hypertexts. We distinguish 3 main approaches proposing techniques for information access using structure: the navigation, DBMS and IR approaches.

2.1 Navigation approach

Navigation is based on links, used for finding and consulting interesting information. In the case of navigation within a hypertext composed of several hundred nodes, this solution can be useful. The task is more difficult to achieve on larger hypertexts, mainly because of disorientation and cognitive overload problems. Furthermore, it is necessary to have the right links at the right place. A solution is proposed by "Web directories" such as Yahoo or the Open Directory Project(4), which propose an organized hierarchy of several million sites. These hierarchies are built and verified manually, and thus it is expensive and difficult to keep them up to date. Furthermore, exhaustiveness is impossible to reach.

2.2 DBMS approach

Documents are represented using a data schema encapsulated in a relational or object-oriented [10] data schema. This allows interrogation using a declarative query language based on exact matching and constrained by the data schema structure. The integration of the hypertext structure in the database schema has been much studied, for example by [25], [3] (ARANEUS project), [16] (TSIMMIS project), etc. Integration attempts at the query language level can be found in hyperpaths [1] or POQL [10]. In fact, these approaches are extensions of the solutions proposed for document structure integration.

2.3 IR approach

The IR approach deals with structured documents, promoting hierarchical indexing: during the indexing process, information is propagated from document sections to the top of the document, along composition relations.

(4) http://www.yahoo.com, http://www.dmoz.org


This method is refined by Lee [22], who distinguishes several strategies of ascent. Paradis [28] distinguishes several structure types, with data ascent depending on the different link types. Hypertext structure has also been taken into account at the indexing step. For example, the hypertext graph can be incorporated into a global indexing schema using a conceptual graph model [9] or using inference networks [11]. The World Wide Web Worm [24] enables the indexing of multimedia documents by using the text surrounding anchors. Amitay [2] also promotes the use of a document's context. Marchiori [23] adds the notion of "navigation cost", which expresses the navigation effort needed to reach a given page. SmartWeb [13] considers the information accessible from a Web page at the indexing step, so page relevance is evaluated considering not only the page content but also the content of the page's neighbors. Kleinberg (HITS [18]) promotes the use of both link directions: he introduces the concepts of hub page(5) and authority page(6). For automatic resource compilation, the CLEVER system [8], based on the same idea, obtains good results against manually generated compilations (Yahoo!(7)). Gurrin [15] has tried to improve Kleinberg's approach: he distinguishes 2 link types (structural and functional) and uses only the structural ones. The well-known Google search engine [6] uses the text of anchors to describe the pages referenced by links from these anchors.
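As a reference point for the hub/authority idea mentioned above, here is a minimal, illustrative HITS-style iteration; the toy graph and iteration count are assumptions for the example, not material from [18].

    # Minimal HITS-style iteration: authorities are pointed to by good hubs,
    # hubs point to good authorities. Toy data; illustrative only.
    def hits(links, iterations=20):
        pages = set(links) | {t for ts in links.values() for t in ts}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            for p in pages:                            # authority = sum of hub scores of pages linking to p
                auth[p] = sum(hub[q] for q, ts in links.items() if p in ts)
            for p in pages:                            # hub = sum of authority scores of pages p links to
                hub[p] = sum(auth[t] for t in links.get(p, []))
            a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return hub, auth

    print(hits({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))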

2.4 Related Works: Discussion

We think that the navigation approach is well adapted to manually manageable collections, but the Web is too big to be accessed only through navigation. Navigation can be an interesting help to other techniques, for example for consulting search results. Concerning DBMS approaches, we think that a declarative query language is not adapted to the heterogeneity of the Web. Moreover, these approaches rely on the underlying database schema, and Web pages have to be expressed following this schema or following predefined templates. Following Nestorov [27], we think that even if some Web pages are strongly structured, this structure is too irregular to be modeled efficiently with structured models like the relational or object models. The IR approach enables natural language querying and considers relevance in a hypertext context. At present, most IR approaches are based on the use of page connectivity, with the notion of relevance propagation along links. The drawback is the poor use of this information, due to the fact that relations (links) and nodes (documents) are not typed on the Web. We think that these approaches are interesting and useful. The lack of explicit Web structure to improve them encourages us to work on Web structure extraction. Several works have focused on statistical studies [5], [4], [31], dealing with the use of HTML tags or with the distribution of links, which leads for example to the notion of hub and authority pages. Pirolli [29] has categorized Web pages into 5 predefined categories related to site structure, based on usage, site connectivity and content data. Broder [7] has studied Web connectivity and has extracted a macroscopic Web structure. But none of these works deals with the extraction of Web structure (structured documents and hypertexts) related to IR objectives.

3 Is the Web well structured?

The main objectives of our Web sample analysis are to identify the explicit structure of the Web and to extract its implicit structure. Obviously, the Web is clearly not structured in the database sense of the term. But HTML allows people to publish structured sites. Thus we will talk about a hierarchically structured Web as well as a Web structured in the hypertext sense. The question is: "Is the Web sufficiently structured (especially hierarchically) to index it following a structured IR model?". This main objective leads us to some other interesting questions like: "What is a document on the Web?" or "How can I classify a Web link?". We present our approach to answering these questions. Firstly, in section 3.1 we present the kind of structure that we wanted to identify/extract from the Web. We hypothesize that this ideal structure for the Web exists. The underlying problem concerns a structured IR model: our final goal is to develop an IR model adapted to the Web.

(5) A page that references a lot of authority pages.
(6) A page that is referenced by a lot of hub pages.
(7) www.yahoo.com


Secondly, we present in section 3.2 our interpretation of HTML use in relation to our hypotheses. The Web's structure depends mainly on the use of HTML by site authors; thus, finally, we present in section 4 some preliminary results of a Web sample analysis.

3.1 Hypotheses

We will try to check the following assumptions, directly related to the concept of information in a semi-structured and heterogeneous context:
Hypothesis 1: information granularity. We think that information units on the Web can have different granularities. By assembling these various constituents, one can build entities of larger size. We distinguish at least 6 main granularities, and hypothesis 1 is detailed following these 6 granularities:
H1.1: elementary constituent. We think that on the Web there is a notion of elementary constituents, which can be composed using a morpheme, a word or a sentence. In our approach, the elementary constituent is at the sentence level.
H1.2: paragraph. By assembling sentences, one can build paragraph-sized entities. This is our first level of structure. This structure is a list, reflecting the logical sequence needed to constitute an understandable unit.
H1.3: document section. This second level includes all the elements that compose a "classical" document, like sub-sections, sections, chapters, etc. All of them are built from paragraphs. They also include some other attributes like title, author, etc.
H1.4: document. This third level is the first one that introduces a tree-like structure, based on document sections. Moreover, the reader must follow a reading direction for a better understanding. For example, people generally read the "introduction" before the "conclusion".
H1.5: hyper-document. This level loses the reading organization when gluing documents together. It can be associated with parts of a hypertext, where a reading direction is no longer obligatory.
H1.6: cluster of hyper-documents. This last level is useful to glue together hyper-documents that have some characteristics in common, like the theme or the authors. This can be seen as the library shelf metaphor.
Hypothesis 2: relations. There are various relations between documents, whatever their granularity. We distinguish at least 3 main relation types, and hypothesis 2 is detailed following these 3 types:
H2.1: composition. This relation expresses the hierarchical (tree-like) construction of a higher granularity entity. This relation is used in the first five levels of the previous granularity description (i.e. paragraphs are composed of sentences). Composition deals with attributes shared along composition relations, for example the author name. It also deals with the compound element lifetime: a paragraph does not exist any more without its sentences. Composition can be split into weak and strong composition according to the sharing status. The composition is weak if an element can be shared; in this case the relation draws a lattice, otherwise we obtain a tree.
H2.2: sequence. Certain document parts can be organized by the author in an orderly way: part B precedes part C and follows part A. This order suggests a reading direction to the reader, for a better understanding. This relation only concerns H1.1 to H1.4; it can be modeled using the probability that a part B is best understood after the reading of a part A. This conditional probability value can be the fuzzy value of the sequence from A to B.
H2.3: reference. This relation is weak, in the sense that it can link elements at any granularity level because they have something in common. For example, an author can refer to another document for complementary information, or two documents can refer to each other because of their similarity.
The next generation of Web search engines will have to consider all these granularities and relations.
In the next section, we interpret the HTML usage on the Web in relation to these hypotheses.
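To make the hypotheses concrete, here is one possible, purely illustrative way to encode the six granularity levels and the three relation types in code, with the sequence relation carrying the conditional-probability weight mentioned in H2.2; the class and field names are our own invention, not the authors' model.

    # Illustrative data model for the hypotheses above (our own sketch, not the authors' code):
    # entities carry one of the six granularity levels; relations are typed, and a
    # "sequence" relation from A to B carries the probability that B is best understood after A.
    from dataclasses import dataclass, field

    LEVELS = ["H1.1 constituent", "H1.2 paragraph", "H1.3 section",
              "H1.4 document", "H1.5 hyper-document", "H1.6 cluster"]

    @dataclass
    class Entity:
        ident: str
        level: str
        relations: list = field(default_factory=list)   # (type, target, weight)

        def relate(self, rel_type, target, weight=None):
            assert rel_type in ("composition", "sequence", "reference")
            self.relations.append((rel_type, target, weight))

    intro = Entity("intro", "H1.3 section")
    conclusion = Entity("conclusion", "H1.3 section")
    intro.relate("sequence", conclusion, weight=0.9)     # read the introduction before the conclusion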


3.2 Web analysis to validate assumptions

Our objectives are to study different Web characteristics, with the aim of validating our hypotheses. Without considering sub-sentence granularities, we have made the hypothesis that there exist 6 main granularities on the Web (cf. section 3.1), from the sentence level up to the cluster-of-hyper-documents level. To validate hypotheses 1.1, 1.2 and 1.3, we have chosen HTML tags as describing the inside-page granularities.
H1.1 It is possible with HTML to describe elementary constituents. Several tags are at the presentation level, others at the semantic level. We place our analysis at the sentence level, and we have not found many tags that explicitly isolate sentences; all other tags are internal sentence elements.
H1.2 We propose to place at this level simple paragraphs and "bloc elements". There exist sub-bloc elements that we also place at this level. Of course we propose to use paragraph separators as well.
H1.3 To express document sections, one can use HTML separators. In fact, we could use the whole Web page as a section.
H1.4 We propose to consider the physical HTML page as a document. But we could also take a set of interconnected Web pages as a document, assuming that the links between them represent composition.
H1.5 The first proposition we can make is to consider a hyper-document to be an Internet site, defined as the set of pages on the same host.
H1.6 To represent our cluster of hyper-documents, we propose the notion of Web domain (i.e. ".imag.fr").
To validate hypothesis 2, we have tried to identify composition and sequence links. All unidentified links are categorized as reference links. Implicit similarity and reference relations are not extracted.
H2.1 Composition can be identified by inside-page H1.3 tags, representing strong composition. Also, inside-site links can be identified as hierarchical, representing strong or weak composition.
H2.2 The sequence can be found by looking at the implicit position of a fragment relative to the following text segment (inside pages). Also, some inside-site links from a page to one of its sisters can be considered.
H2.3 All the remaining links are classified in this category. This type of relation can be represented on the Web using hypertext links. But it can also be implicit, like quotations for example.
It is possible to describe such a structure, but is it a reality on the Web? We have to verify whether these sub-page granularity tags are used by authors (H1.1 to H1.3), and we have to check whether the concepts of page, site and domain are relevant on the Web (H1.4 to H1.6). For each page, we have to rebuild a hierarchical tree structure, and to identify a structured document hierarchy between HTML pages.
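The exact HTML tag sets used for each level are not listed above, so the following sketch simply assumes plausible ones (heading tags as H1.3 separators, <p>/<br>/<hr> as H1.2 separators) and only illustrates how such structure markers could be counted in a page; it is not the paper's tool.

    # Illustrative counter of structure markers in an HTML page. The tag sets below
    # are assumptions chosen for the example, not the exact sets used in the paper.
    from html.parser import HTMLParser

    SECTION_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}    # assumed H1.3 separators
    PARAGRAPH_TAGS = {"p", "br", "hr"}                     # assumed H1.2 separators

    class MarkerCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = {"sections": 0, "paragraphs": 0, "links": 0}

        def handle_starttag(self, tag, attrs):
            if tag in SECTION_TAGS:
                self.counts["sections"] += 1
            elif tag in PARAGRAPH_TAGS:
                self.counts["paragraphs"] += 1
            elif tag == "a" and any(k == "href" for k, _ in attrs):
                self.counts["links"] += 1

    counter = MarkerCounter()
    counter.feed("<h1>Title</h1><p>One sentence.</p><a href='x.html'>next</a>")
    print(counter.counts)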

4 Experimental results

We present in this section some preliminary results of a Web sample analysis, and particularly the statistics used to validate our assumptions.

4.1 Web pages sample: IMAG collection

We have collected an "October 5, 2000 snapshot" Web sample using our Web crawler CLIPS-Index (cf. section 5). We have chosen to restrict our experiment to the Web pages of the IMAG domain(8), which are browsable starting from the URL "http://www.imag.fr". These pages deal with several topics, but most of them are scientific documents, particularly in the computer science field.

(8) Institut d'Informatique et de Mathématiques Appliquées de Grenoble: hosts whose name ends with .imag.fr


The main characteristics of this collection are summarized in figure 1.

                        # in coll      # per page
    Hosts               39
    Pages               38,994
    French language     5,649
    English language    23,819
    Other languages     9,068
    Distinct terms      241,000
    Size (HTML)         443 Mb         11.6 Kb
    Size (text)         141 Mb         3.7 Kb
    Lines               8,079,676      207
    Links               1,475,096      37.8

Figure 1: IMAG sample: main characteristics

    Extension       #pages    %
    .html           25,665    65.82
    .htm            2,530     6.49
    .java           1,021     2.62
    .cgi            219       0.56
    .txt            82        0.21
    .php3           71        0.18
    No extension    8,134     20.86
    Directory       933       2.39
    Others          339       0.87
    Total           38,994    100

Figure 2: Pages format

In less than 2 hours, our spider collected almost 39,000 pages identified by their URL, from 39 hosts, for a total size of 443 Mb. It is not surprising that most of the pages are in HTML format (72% .html and .htm, cf. figure 2). After analysis and textual extraction, about 140 Mb of textual data remain, containing more than 241,000 distinct terms.

4.2 Granularity analysis

We have extracted statistics related to the entity granularities described in section 3.1. It appears that the HTML ability to represent different inside-page granularities, as described in section 3.2, is widely used: each page contains on average 17 level 1 objects, 17 bloc elements, 29 paragraph separators and 3.3 sections (cf. figure 3). Hypotheses 1.1, 1.2 and 1.3 seem to be correct, but manual experiments are needed to validate them.

    Level   Object                  #objects
    H1.1    Level 1 objects         663,000
    H1.2    Bloc elements           659,000
    H1.2    Paragraph separators    1,142,000
    H1.3    HTML separators         130,000
    H1.4    Pages                   38,994
    H1.5    Sites                   39
    H1.6    Domains                 1

Figure 3: Inside and outside-page granularity

The average page size is 3.3 sections or 11.65 Kb (cf. figure 1). This is greater than the results of other studies (almost 7 Kb [31], [5]). The textual page size (pages without HTML tags) is on average 3.69 Kb. But these statistics are related to the physical aspects of documents. We have to consider how entities are linked to conclude anything about their logical aspects. There are on average 37 links per page in our collection (cf. figure 4); if we do not consider redundant links (same source and same destination), only 550,000 distinct links remain: on average 14.11 per page, which is not far from other studies (13.9/page [31], 16.1/page [4]).

#links 118’248 1’318’490 2’093 36’265 1’475’096

% 8 89,38 0,14 2,46 100

Per page 2,97 33,81 0,05 0,93 37,12

Per site 3’128 33’807 57,67 930 39’118

Distinct 13’897 500’472 1’708 34’130 550’207

Figure 4: Links analysis: all/distincts links

11

% 2,53 90,96 0,31 6,20 100

Per page 0,36 12,83 0,04 0,87 14,11

Per site 356 12’832 43,79 875 14’108

Web pages are heavily linked together, but without link categorization it is difficult to distinguish which pages are hyper-documents, which are structured documents and which are sections (Hypothesis 1.4). In particular, we cannot confirm that a Web document is represented by an HTML page.

    #links/page   #pages    %
    0             38,275    97.63
    1             396       1.52
    2             106       0.36
    3             85        0.18
    4             39        0.1
    5+            93        0.21

Figure 5: Outside-sites links per page

There are few outside-site links: they account for only 2.6% of all links and are contained in only 2.4% of the pages. We therefore think that this site compactness validates Hypothesis 1.5: hyper-documents are represented by sites. Only 5.4% of these outside-site links are inside-domain links: most sites are connected to outside-domain sites, so we conclude that a cluster of hyper-documents is not represented by a Web domain (Hypothesis 1.6).

4.3 Internal HTML page structure extraction

We have identified several internal page levels (cf. section 4.2): section, paragraph, sentence and even internal sentence elements. These levels are defined by the HTML page writers. With these structure elements (cf. figure 3), we are able to rebuild hierarchical tree structures which are relatively large (cf. figure 6). In particular, we have to extract most of the composition and sequence relations. Composition relations are implicit, from a page to all its sections, but also from sections to all their paragraphs, etc. Sequence relations are also implicit, from each page element (except the last one) to its physical successor. Hypertext links which do not correspond to an extracted composition or sequence relation are assumed to represent reference relations.

    Depth   #pages   %
    0       2,995    7.68
    1       1,963    5.03
    2       2,227    5.71
    3       5,354    13.73
    4       12,159   31.18
    5       4,101    10.52
    6       8,345    21.4
    7       1,083    2.78
    8+      767      1.97

Figure 6: Internal structure extraction
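What follows is only a rough sketch, under assumed input conventions (a flat list of (depth, element) pairs in document order), of the reconstruction just described: building the composition tree and emitting the implicit sequence relations from each element to its physical successor. It is our illustration, not the authors' tool.

    # Illustrative reconstruction of composition and sequence relations from a flat
    # list of (depth, element) pairs, e.g. produced by parsing a page. Our own sketch.
    def extract_relations(elements):
        """elements: list of (depth, name) in document order; depth 0 = page root."""
        composition, sequence = [], []
        stack = []                                       # current ancestors, one per depth
        for i, (depth, name) in enumerate(elements):
            stack = stack[:depth]                        # drop ancestors deeper than this element
            if stack:
                composition.append((stack[-1], name))    # parent contains this element
            stack.append(name)
            if i + 1 < len(elements):
                sequence.append((name, elements[i + 1][1]))   # physical successor
        return composition, sequence

    page = [(0, "page"), (1, "section 1"), (2, "para 1.1"), (2, "para 1.2"), (1, "section 2")]
    print(extract_relations(page))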

4.4 External HTML page structure extraction

We make the assumption that the Web site directory structure includes some semantics that can be automatically extracted. This semantics is proposed "a priori", because we suppose that the manner in which pages are placed in the directory hierarchy follows the "principle of least effort". We assume that the directory hierarchy reflects the composition relation. This must of course be validated by experimentation using manual validation. We examine in this part all the ways links join pages across the directory hierarchy, and we propose the following link categories: internal (inside-page), hierarchical and transversal (inside-site), cross (outside-site) and out (outside-domain) (cf. figure 7). We are interested in categorizing the relations represented by these links. We have interpreted each link type to type the relation that it expresses, in the following way:
Internal: 8% of links stay in the same page. We have no proposition for their category.
Hierarchical: We call hierarchical links those whose source and target are in the same directory path. These links are the most common in our sample, with 60%. If these links reflect the composition structure, we can deduce that this sample is strongly structured.
Transversal: The target of the link is neither in the ascendant directories nor in the descendant directories, but is in the same site. There are 30% of links in this category. We can probably classify them as weak composition or as reference links.
Cross site: The target is on another site: only 0.1% of links are concerned. They are candidates to be references.
Outside IMAG: The target is outside the IMAG domain: only 2.5%; they are also candidates to be references.

    Types          #links     %
    Internal       118,248    8.02
    Hierarchical   880,421    59.69
    Transversal    438,069    29.70
    Cross          2,093      0.14
    Out            36,265     2.46

Figure 7: Links types

We detail the hierarchical links in 3 categories: horizontal, up and down.

#links 432’359 223’717 51’330 48’232 37’185 11’573 450 2 804’848

% 29,31 15,17 3,48 3,27 2,52 0,78 0,03 0 54,56

#down-levels -8 -7 -6 -5 -4 -3 -2 -1 Total

Figure 8: Hierarchical links: up/horizontal

#links 8 548 137 764 4’455 8’583 10’118 50’960 75’573

% 0 0,04 0,01 0,05 0,30 0,58 0,7 3,45 5,12

Figure 9: Hierarchical links: down

Horizontal: The target is in the same directory. These links are candidates to express a sequence relation. In our experiment 29% of links are in this category (cf. figure 8).
Up: The source is deeper in the directory path. These links go up in the site hierarchy. This is the second-ranking value, with 25% (cf. figure 8). There are more links going up than going down; we think that this is caused by a lot of "Back to the Top" links.
Down: The target is deeper in the directory path. These links are less frequent (5%, cf. figure 9) than the other hierarchical links. We think that they could belong to the composition hierarchy.
We conclude that the directory path is not built in a random manner: the majority of the links follow it. The site also seems to have a strong consistency: 98% of the links are inside-site links.
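A minimal sketch (ours, not the authors' analysis tools) of the link typing described in this section, comparing the directory paths of the source and target URLs; the category names follow the text above and the domain parameter is set to the IMAG example.

    # Illustrative link classifier following the categories of section 4.4.
    # Purely a sketch; the paper's 23,000 lines of Perl are obviously richer than this.
    from urllib.parse import urlparse
    import posixpath

    def classify_link(source, target, domain=".imag.fr"):
        s, t = urlparse(source), urlparse(target)
        if (s.netloc, s.path) == (t.netloc, t.path):
            return "internal"                             # same page (e.g. an anchor)
        if not t.netloc.endswith(domain):
            return "out"                                  # outside the studied domain
        if s.netloc != t.netloc:
            return "cross"                                # another site of the domain
        s_dir, t_dir = posixpath.dirname(s.path), posixpath.dirname(t.path)
        if s_dir == t_dir:
            return "hierarchical (horizontal)"
        if t_dir.startswith(s_dir.rstrip("/") + "/"):
            return "hierarchical (down)"
        if s_dir.startswith(t_dir.rstrip("/") + "/"):
            return "hierarchical (up)"
        return "transversal"

    print(classify_link("http://www.imag.fr/a/b/page.html", "http://www.imag.fr/a/index.html"))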

5 Technical details: CLIPS-Index and Web pages analysis

We have developed a robot called CLIPS-Index(9) in collaboration with Dominique Vaufreydaz (from the GEOD team), with the aim of creating Web corpora. This spider crawls the Web, collecting and storing pages. CLIPS-Index tries to collect the largest possible amount of information in this heterogeneous context, which is not respectful of the existing standards. Collecting the Web correctly is an interesting problem in itself. In spite of this, our spider is quite efficient: for example, we have collected (October 5, 2000) 38,994 pages on the .imag.fr domain, compared to AltaVista, which indexes 24,859 pages (October 24, 2000) on the same domain, and AllTheWeb, which indexes 21,208 pages (October 24, 2000). 3.5 million pages from French-speaking Web domains were collected during 4 days, using a 600 MHz PC with 1 Gb RAM. CLIPS-Index crawls this huge hypertext without considering non-textual pages, and respects the robot exclusion protocol [19]. It does not overload distant Web servers, despite launching several hundred HTTP queries simultaneously. CLIPS-Index, running on an ordinary 333 MHz PC with 128 Mb RAM costing less than 1,000 dollars, is able to find, load, analyze and store something like ½ to ¾ million pages per day.

(9) http://CLIPS-Index.imag.fr
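As an aside, here is a minimal illustration (not CLIPS-Index itself) of the robot exclusion check mentioned above, using Python's standard robots.txt parser; the user agent name is made up.

    # Minimal illustration of respecting the robot exclusion protocol before fetching
    # a page. This is a generic sketch, not code from CLIPS-Index.
    from urllib import robotparser
    from urllib.parse import urljoin, urlparse

    def allowed(url, user_agent="ExampleCrawler"):        # hypothetical user agent name
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(root, "/robots.txt"))
        rp.read()                                          # fetches and parses robots.txt
        return rp.can_fetch(user_agent, url)

    # if allowed("http://www.imag.fr/some/page.html"):
    #     ...fetch the page...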


We have also developed several analysis tools (23,000 lines) using Perl (Practical Extraction and Report Language) for HTML extraction, link analysis and typing, topology analysis, statistics extraction, text indexing, language identification, etc.

6 Conclusion and future works

We think that it is interesting and useful to use Web structure for IR. Because of the lack of explicit Web structure, we have to identify the explicit structure and extract the implicit one. We have proposed a framework composed of 6 entity granularities and 3 main relation types. We have proposed some rules to extract these granularities and relations, based mainly on the possibilities HTML offers to describe structured elements, and on a study of the relations that exist between hypertext links and the Web server directory hierarchy. Our first experiments show, on the one hand, that hypotheses H1.1, H1.2, H1.3 (internal structure levels) and H1.5 (site granularity) seem to be correct, and on the other hand that hypotheses H1.4 (page granularity) and H1.6 (cluster of sites granularity as Internet sub-domains) seem to be false. It is possible to identify and extract structure from the Web: several granularities and several types of relations. But we have to continue these experiments. Firstly, we have to improve our relation categorization and our hierarchical structure extraction. Secondly, we need to check the extracted information manually, to validate our hypotheses. Thirdly, we have to analyze bigger collections: several domains, more heterogeneous pages. The IMAG collection is undoubtedly not very representative of the Web, because of its small size compared to the French Web; moreover, it represents only a single Web domain. Finally, our main objective is to propose a structured IR model based on these 6 granularity levels and 3 relation types. An Information Retrieval System based on this model will use IR methods developed in the context of structured documents and hypertexts [12]. It will actually use Web structure for IR, and thus will be able to help facing the IR problem on the Web. We are also working on the use of Data Mining techniques for extracting useful knowledge to improve IR results [14].

References

[1] B. Amann. Interrogation d'Hypertextes. PhD thesis, Conservatoire National des Arts et Métiers de Paris, 1994.
[2] E. Amitay. Using common hypertext links to identify the best phrasal description of target Web document. In Conference on Research and Development in IR (SIGIR'98), Melbourne, Australia, 1998.
[3] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the Web: Going back and forth. In Workshop on Management of Semistructured Data, Tucson, 1997.
[4] D. Beckett. 30% accessible - a survey of the UK Wide Web. In World Wide Web Conference (WWW'97), Santa Clara, California, 1997.
[5] T. Bray. Measuring the Web. In World Wide Web Conference (WWW'96), Paris, France, May 1996.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In World Wide Web Conference (WWW'98), Brisbane, Australia, 1998.
[7] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. In World Wide Web Conference (WWW'00), Amsterdam, Netherlands, 2000.
[8] S. Chakrabarti, B. E. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In Conference on Research and Development in IR (SIGIR'98), Melbourne, Australia, 1998.
[9] Y. Chiaramella and A. Kheirbek. An integrated model for hypermedia and information retrieval. In M. Agosti and A. F. Smeaton, editors, Information Retrieval and Hypertext. Kluwer Academic Publisher, 1996.
[10] V. Christophidès and A. Rizk. Querying structured documents with hypertext links using OODBMS. In European Conference on Hypertext Technology (ECHT'94), Edinburgh, Scotland, 1994.
[11] W. B. Croft and H. Turtle. A retrieval model for incorporating hypertext links. In ACM Conference on Hypertext (HT'89), Pittsburg, USA, 1989.
[12] F. Fourel, P. Mulhem, and M.-F. Bruandet. A generic framework for structured document access. In Database and Expert Systems Applications (DEXA'98), LNCS 1460, Vienna, Austria, 1998.
[13] M. Géry. Smartweb: Recherche de zones de pertinence sur le world wide web. In Congrès INFORSID'99, La Garde, France, 1999.
[14] M. Géry and M. H. Haddad. Knowledge discovery for automatic query expansion on the World Wide Web. In Workshop on the World-Wide Web and Conceptual Modeling (WWWCM'99), LNCS 1727, Paris, France, 1999.
[15] C. Gurrin and A. F. Smeaton. A connectivity analysis approach to increasing precision in retrieval from hyperlinked documents. In Text REtrieval Conference (TREC'99), Gaithersburg, Maryland, 1999.
[16] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting semistructured information from the Web. In Workshop on Management of Semistructured Data, Tucson, 1997.
[17] D. Hawking, N. Craswell, P. Thistlewaite, and D. Harman. Results and challenges in Web search evaluation. In World Wide Web Conference (WWW'99), Toronto, Canada, May 1999.
[18] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In ACM-SIAM Symposium on Discrete Algorithms (SODA'98), San Francisco, California, 1998.
[19] M. Koster. A method for Web robots control. Technical report, Internet Engineering Task Force (IETF), 1996.
[20] S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280, April 1998.
[21] S. Lawrence and C. L. Giles. Accessibility of information on the Web. Nature, July 1999.
[22] Y. K. Lee, S.-J. Yoo, and K. Yoon. Index structures for structured documents. In ACM Conference on Digital Libraries (DL'96), Bethesda, Maryland, 1996.
[23] M. Marchiori. The quest for correct information on the Web: Hyper search engines. In World Wide Web Conference (WWW'97), Santa Clara, California, 1997.
[24] O. A. McBryan. GENVL and WWWW: Tools for taming the Web. In World Wide Web Conference (WWW'94), Geneva, Switzerland, 1994.
[25] A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. In Conference on Parallel and Distributed Information Systems (PDIS'96), 1996.
[26] A. Moore and B. H. Murray. Sizing the internet. Technical report, Cyveillance, 2000.
[27] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. In Workshop on Management of Semistructured Data, Tucson, 1997.
[28] F. Paradis. Using linguistic and discourse structures to derive topics. In Conference on Information and Knowledge Management (CIKM'95), Baltimore, Maryland, 1995.
[29] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: extracting usable structures from the Web. In Conference on Human Factors in Computing Systems (CHI'96), Vancouver, Canada, 1996.
[30] G. Salton. The SMART retrieval system: experiments in automatic document processing. Prentice Hall, 1971.
[31] A. Woodruff, P. M. Aoki, E. Brewer, P. Gauthier, and L. A. Rowe. An investigation of documents from the World Wide Web. Computer Networks and ISDN Systems, 28, 1996.


Learning of Ontologies for the Web: the Analysis of Existent Approaches
Borys Omelayenko
Vrije Universiteit Amsterdam, Division of Mathematics and Computer Science, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
Email: [email protected], URL: www.cs.vu.nl/~borys

Abstract

The next generation of the Web, called the Semantic Web, has to improve the Web with semantic (ontological) page annotations to enable knowledge-level querying and searches. Manual construction of these ontologies will require tremendous efforts, which forces future integration of machine learning with knowledge acquisition to enable highly automated ontology learning. In this paper we present the state of the art in the field of ontology learning from the Web and examine how it can contribute to the task of semantic Web querying. We consider three components of the query processing system: natural language ontologies, domain ontologies and ontology instances. We discuss the requirements for machine learning algorithms to be applied for the learning of ontologies of each type from Web documents, and survey the existent ontology learning and other closely related approaches.

In: Proceedings of the International Workshop on Web Dynamics, held in conjunction with the 8th International Conference on Database Theory (ICDT'01), London, UK, 3 January 2001

Introduction

Nowadays the Internet contains a huge collection of data stored in billions of pages and is used for the worldwide exchange of information. The pages represent mainly textual data and have no semantic annotation. Thus, query processing based mostly on inefficient keyword-matching techniques becomes a bottleneck of the Web. Tim Berners-Lee coined the vision of the next version of the Web, called the Semantic Web [Berners-Lee&Fischetti, 1999], which would provide much more automated services based on machine-processable semantics of data and heuristics that make use of these metadata. The explicit representation of the semantics of data, accompanied by domain theories (i.e. ontologies), will enable a Knowledge Web that provides a qualitatively new level of service. It will weave together a net linking incredibly large segments of human knowledge and complement it with machine processability. This will require enriching the entire Web with a large number of ontologies that capture the domain theories. Their manual construction would require enormous human effort; thus ontology acquisition becomes a bottleneck of the Semantic Web. Recently ontologies have become a hot topic in the areas of knowledge engineering, intelligent information integration, knowledge management, and electronic commerce [Fensel, 2000]. Ontologies are knowledge bodies that provide a formal representation of a shared conceptualization of a particular domain. The modern research focus lies in Web-based ontology representation languages based on the XML and RDF standards and the further application of ontologies on the Web (see [Decker et al., 2000]). Ontology learning (OL) is an emerging field aimed at assisting a knowledge engineer in ontology construction and semantic page annotation with the help of machine learning (ML) techniques. In the next section of the paper we discuss the general scheme for semantic querying of the Web with the three ontological components required; the subsequent sections discuss OL tasks and available ML techniques. The survey section describes the applications of ML techniques for the learning of different ontology types, and we conclude with a comparison of the approaches.

Semantic Querying of the Web

In this section we discuss the general scheme for semantic querying of the Web, the types of ontologies involved in the query process, and the basic ML algorithms available for learning these ontologies.


The General Scheme

The general scheme of the querying process is presented in Figure 1. First, the user formulates the query in natural language. Then the query is transformed into a formal query with the help of the natural language ontology and the domain ontology. The Web pages are (possibly incomplete) instances of some domain ontologies, and they will contain pieces of data semantically marked up according to the underlying domain ontology. The query processor has to find the mapping between the concepts of the initial query, the domain model used to expand the query, and the ontology instances on the Web. This mapping will be non-trivial and will require inference over domain ontologies.

Figure 1. Semantic querying of the Web: a natural language query is interpreted via the natural language ontology and domain ontologies into a formal semantic query, which is matched against ontology instances (Web pages) connected by instance-of links.

Ontological Components

There are a number of domains where ontologies have been successfully applied. The three ontologies that are important for querying the Web (see Figure 1) are:
Natural Language Ontologies (NLO), which contain lexical relations between language concepts; they are large in size and do not require frequent updates. Usually they represent the background knowledge of the system and are used to expand users' queries. These ontologies belong to the so-called 'horizontal' ontologies, which try to capture all possible concepts but do not provide a detailed description of each concept.
Domain ontologies capture knowledge of one particular domain, e.g. a pharmacological ontology or a printer ontology. These ontologies provide a detailed description of the domain concepts of a restricted domain (so-called 'vertical' ontologies). Usually they are constructed manually, but different learning techniques can assist the (especially inexperienced) knowledge engineer.
Ontology instances represent the main body of knowledge in the future Semantic Web. As today's Web is full of HTML documents of different layouts, the future Web will be full of instances of different domain ontologies. The ontology instances will serve as the Web pages and will contain links to other instances (similar to the links to other Web pages). They can be generated automatically and frequently updated (e.g. a company profile from the Yellow Pages catalogue will be updated frequently while the ontology of the catalogue will remain the same).
The Semantic Web will require the creation and maintenance of a huge number of ontologies of all three types, and the following ontology learning tasks will become important.

Ontology Learning Tasks

Previous research in the area of ontology acquisition has proposed many guidelines for manual ontology development (see [Lopez, 1999] for an overview) that organize the work of the knowledge engineer, but they pay no attention to the process of acquiring the ontology by humans. The human experts have to evolve the best knowledge acquisition process themselves from past experience acquired by passing through numerous case studies. Thus, we have to separate several tasks in OL on our own:
Ontology creation from scratch by the knowledge engineer. In this task ML assists the knowledge engineer by suggesting the most important relations in the field or by checking and verifying the constructed knowledge bases.
Ontology schema extraction from Web documents. In this task ML systems take data and meta-knowledge (like a meta-ontology) as input and generate a ready-to-use ontology as output, with the possible help of the knowledge engineer.
Extraction of ontology instances populates given ontology schemas and extracts the instances of the ontology presented in the Web documents. This task is similar to information extraction and page annotation and can apply the techniques developed in these areas.
Ontology integration and navigation deals with reconstructing and navigating large and possibly machine-learned knowledge bases. For example, the task can be to change the propositional-level knowledge base of the machine learner into a first-order knowledge base.
Ontology update updates the parts of the ontology that are designed to be updated (like formatting tags that have to track the changes made in the page layout).
Ontology enrichment (or ontology tuning) includes automated modification of minor relations in an existing ontology. This does not change major concepts and structures but makes the ontology more precise. Unlike ontology update, this task deals with the relations that are not specially designed to be updated.


The first three tasks relate to ontology acquisition tasks in knowledge engineering, and the next three to ontology maintenance tasks. In this paper we do not consider the ontology integration and ontology update tasks.

Machine Learning Techniques

The main requirement for ontology representation is that ontologies must be symbolic, human-readable and understandable. This forces us to deal only with symbolic learning algorithms that make generalizations, and to skip other methods, like neural networks, genetic algorithms and the family of 'lazy learners' (see [Mitchell, 1997] for an introduction to ML and the algorithms mentioned below). The foreseen potentially applicable ML algorithms include:
Propositional rule learning algorithms that learn association rules or other attribute-value rules. The algorithms are generally based on a greedy search for the attribute-value tests that can be added to a rule while preserving its consistency with the set of training instances.
Decision tree learning algorithms, mostly represented by the C4.5 algorithm and its modifications, are used quite often to produce high-quality propositional-level rules. The algorithm uses statistical heuristics over the training instances, like entropy, to guide a hill-climbing search of the decision trees. Learned decision trees are equivalent to sets of propositional-level classification rules that are conjunctions of attribute-value tests.
Bayesian learning is mostly represented by the Naive Bayes classifier. It is based on Bayes' theorem and generates probabilistic attribute-value rules based on the assumption of conditional independence between the attributes of the training instances.
First-order logic rule learning induces rules that contain variables, called first-order Horn clauses. The algorithms usually belong to the FOIL family and perform a general-to-specific hill-climbing search for rules that cover all available positive training instances. With each iteration the learner adds one more literal to specialize the rule until it avoids all negative instances.
Clustering algorithms group instances together based on similarity or distance measures between pairs of instances, defined in terms of their attribute values. Various search strategies can guide the clustering process. Iterative application of the algorithm produces hierarchical structures of concepts.
The knowledge bases built by ML techniques differ substantially from the knowledge bases that we call ontologies. The differences stem from the fact that ontologies are constructed to be used by humans, while ML knowledge bases are only used automatically. This leads to several differences, listed in Table 1. To enable automatic OL we must adapt ML techniques so that they can automatically construct ontologies with the properties of manually constructed ontologies. Thus, OL techniques have to possess the following properties, which we trace in the survey:
- ability to interact with a human to acquire his knowledge and to assist him; this requires readability of the internal and external results of the learner;
- ability to use complex modelling primitives;
- ability to deal with a complex solution space, including composed solutions.
Each ontology type has special requirements for the ML algorithms applied for learning that type of ontology.
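To make the entropy heuristic mentioned above for decision tree learners concrete, here is a small, purely illustrative computation of information gain for an attribute; the toy data set is invented for the example and is not from the paper.

    # Illustrative entropy and information-gain computation, the statistical heuristic
    # that guides decision tree learners such as C4.5. Toy data, not from the paper.
    from collections import Counter
    from math import log2

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, attribute, label="class"):
        base = entropy([e[label] for e in examples])
        remainder = 0.0
        for value in {e[attribute] for e in examples}:
            subset = [e[label] for e in examples if e[attribute] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder

    toy = [{"tag": "h1", "class": "section"}, {"tag": "p", "class": "paragraph"},
           {"tag": "p", "class": "paragraph"}, {"tag": "h2", "class": "section"}]
    print(information_gain(toy, "tag"))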

the actions. Concept features are usually represented by adjectives or adjective nouns (like ‘strongstrength’). Thus the ontology can be represented by frames with a limited structure. NLOs define the first and basic interpretation of user’s query, and they must link the query to specific terminology and specific domain ontology. General language knowledge contained in a general-purpose NLO like WordNet [Fellbaum, 1998] is not sufficient for such a purpose. In order to achieve this, lots of research efforts have been focused on NLO enrichment. NLO enrichment from domain texts is a suitable task for ML algorithms, because it provides a good set of training data for the learner (the corpus). NLOs do not require either frequent or automatic updates. They are updated from time to time with intensive cooperation from a human, thus ML algorithms for NLO learning are not required to be fast. Domain ontologies use the whole set of modelling primitives, like (multiple) inheritance, numerous slots and relations, etc. They are complex in structure and are usually constructed manually. Domain ontology learning concentrates on discovering statistically valid patterns in the data in order to suggest them to the knowledge engineer who guides the ontology acquisition process. In future we would like to see an ML system that guides this process and asks the human to validate pieces of the constructed ontology. ML will be used to predict the changes made by the human to reduce the number of interactions. The input of this learner will consist of the ontology being constructed, human suggestions and domain knowledge. Domain ontologies require more frequent updates than NLOs (just as new technical objects appear before the community has agreed about the surrounding terminology), their updates are done manually and ML algorithms that assist this process are also not required to be fast. Ontology instances are contained in the Web pages marked up with the concepts of the underlying domain ontology with information extraction or annotation rules. The instances will require more frequent updates than domain ontologies or NLOs (i.e. a company profile in a catalogue will be updated faster than the ontology of a company catalogue).

Table 1. Manual and machine representations

Modelling primitives
  Machine-learned knowledge bases: Simple and limited. For example, decision tree learning algorithms generate the rules in the form of conjunctions over attribute-value tests.
  Manually constructed ontologies: Rich set of modelling primitives (frames, subclass relation, rules with rich set of operations, functions, etc.).

Knowledge base structure
  Machine-learned knowledge bases: Flat and homogeneous.
  Manually constructed ontologies: Hierarchical, consists of various components with subclass-of, part-of and other relations.

Tasks
  Machine-learned knowledge bases: Classification and clusterization that map the objects described by the attribute-value pairs into a limited and unstructured set of class or cluster labels.
  Manually constructed ontologies: Classification task requires mapping of objects into a tree of structured classes. It can require construction of class descriptions instead of selection.

Problem-solving methods
  Machine-learned knowledge bases: Very primitive, based on simple search strategies, like hill-climbing in decision tree learning.
  Manually constructed ontologies: Complicated, require inference over a knowledge base with a rich structure, often domain-specific and application-specific.

Solution space
  Machine-learned knowledge bases: The non-extensible, fixed set of class labels.
  Manually constructed ontologies: Extensible set of primitive and compound solutions.

Readability of the knowledge bases to a human
  Machine-learned knowledge bases: Not required. They can be used only automatically and only in special domains.
  Manually constructed ontologies: Required. They may be (at least potentially) used by humans.

NLOs contain a hierarchical clustering of the language concepts (words and their senses). The set of relations (slots) used in the representation is limited. The main relations between the concepts are 'synonyms', 'antonyms', 'is-a' and 'part-of'. The verbs can contain several additional relations to describe


word 'waiter' has two senses: the waiter in the restaurant (related words: waiter - restaurant, menu, dinner); and a person who waits (related words: waiter - station, airport, hospital). The system queries the Web for documents related to each concept from WordNet and then builds a list of words associated with the topic. The lists are called topic signatures and contain the weight (called strength) of each word. The documents are retrieved by querying the Web with the AltaVista search engine, asking for the documents that contain the words related to a particular sense and do not contain the words related to the other senses of the word. A typical query may look something like 'waiter AND (restaurant OR menu) AND NOT (station OR airport)' to get the documents that correspond to the 'waiter, server' concept.

NLOs, like EuroWordNet or WordNet, help in the understanding of natural language queries and in bringing semantics to the Web. But in specific domains general language knowledge becomes insufficient, and that requires the creation of domain-specific NLOs. Early attempts to create such domain ontologies to perform information extraction from texts failed because the experts used to create the ontologies with lots of a priori information that was not reflected in the texts. The paper [Faure&Poibeau, 2000] suggests improving NLOs by unsupervised domain-specific clustering of texts from corpora. The system Asium described in the paper cooperatively learns semantic knowledge from texts which are syntactically parsed, without previous manual processing. It uses the syntactic parser Sylex to generate the syntactical structure of the texts. Asium uses only head nouns of complements and links to verbs, and performs bottom-up breadth-first conceptual clustering of the corpora to form the concepts of the ontology level. On each level it allows the expert to validate and/or label the concepts. The system generalizes the concepts that occur in the same role in the texts and uses generalized concepts to represent the verbs.

Thus, the state of the art in NLO learning looks quite optimistic: not only does a stable general-purpose NLO exist, but so do techniques for automatically or semi-automatically constructing and enriching domain-specific NLOs.
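The boolean query quoted above can be assembled mechanically from the topic-signature word lists; the following sketch (with hypothetical word lists, not the ones actually used in [Agirre et al., 2000]) shows the construction:

def sense_query(word, own_words, other_sense_words):
    """Build a query for documents about one sense of a word: require the word and at
    least one related word, and exclude words related to the other senses."""
    include = " OR ".join(own_words)
    exclude = " OR ".join(other_sense_words)
    return f"{word} AND ({include}) AND NOT ({exclude})"

# Hypothetical topic signatures for the two senses of 'waiter'.
restaurant_sense = ["restaurant", "menu"]
waiting_sense = ["station", "airport"]

print(sense_query("waiter", restaurant_sense, waiting_sense))
# -> waiter AND (restaurant OR menu) AND NOT (station OR airport)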

The Survey

This section presents a survey of existing techniques related to the learning and enriching of NLOs from the Web, Web-based support for domain ontology construction, and extraction of ontology instances. These approaches cover various issues in the field and show different applications of ML techniques.

Learning of NLO

Lots of conceptual clustering methods can be used for ontology construction, but no methodology or tool has been developed to support the elaboration of conceptual clustering methods that build task-specific ontologies. The Mo'K tool [Bisson et al., 2000] supports development of conceptual clustering methods for ontology building. The paper focuses on elaboration of the clustering methods to perform human-assisted learning of conceptual hierarchies from corpora. The input for the clustering methods is represented by the classes (nouns) and their attributes (grammatical relations) received after syntactical analysis of the corpora, which are in turn characterized by the frequency with which they occur in the corpora. The algorithm uses bottom-up clustering to group 'similar' objects to create the classes and to subsequently group 'similar' classes to form the hierarchy. The user may adjust several parameters of the process to improve performance: select input examples and their attributes, the level of pruning, and the distance evaluation functions. The paper presents an experimental study that illustrates how learning quality depends on the different combinations of parameters. While the system allows the user to tune its parameters, it performs no interactions during clustering. It builds the hierarchy of frames that contain lexical knowledge about the concepts.

The input corpora can be naturally found on the Web, and the next paper presents a way of integrating NLO enrichment with the Web search of the relevant texts. The system [Agirre et al., 2000] exploits text from the Web to enrich the concepts in the WordNet [Fellbaum, 1998] ontology. The proposed method constructs lists of topically related words for each concept in WordNet, where each word sense has one associated list of related words. For example, the

Learning of Domain Ontologies

Domain-specific NLO significantly improves semantic Web querying but in specific domains general language knowledge becomes insufficient


in terms of the common understanding of the domain, i.e. in the terms of the domain ontology. The system for ontology-based induction of high-level classification rules [Taylor et al., 1997] goes further and uses ontologies not only to explain the discovered rules to a user, but also to guide learning algorithms. The algorithm consequently generates queries for an external learner, ParkaDB, which uses the domain ontology and the input data to check the consistency of each query, and consistent queries become classification rules. The query generation process continues until the set of queries covers the whole data set. Currently the domain ontologies used there are restricted to simple concept hierarchies where each attribute has its own hierarchy of concepts. On the bottom level the hierarchy contains the attribute values present in the data, and the next level contains a generalization of these attribute values. This forms one-dimensional concepts, and a domain ontology of a very specialized type. The approach uses a knowledge-base system and its inference engine to validate classification rules. It generates the rules in terms of the underlying ontology, where the ontology still has a very restricted type.

The paper [Webb, Wells, Zheng, 1999] experimentally demonstrates how the integration of machine learning techniques with knowledge acquisition from experts can both improve the accuracy of the developed domain ontology and reduce development time. The paper analyses three types of knowledge acquisition system: systems for manual knowledge acquisition from experts, ML systems, and the integrated systems built for two domains. The knowledge bases were developed by experienced computer users who were novices in knowledge engineering. The knowledge representation scheme was restricted to flat attribute-value classification rules and the knowledge base was restricted to a set of production rules. The rationale behind this restriction was based on the difficulties that novice users experience when working with first-order representations. The ML system used the C4.5 decision tree learning algorithm to support the knowledge engineer and to construct the knowledge bases automatically. The use of machine learning with knowledge acquisition by experts led to the production of more accurate rules in significantly less time than knowledge acquisition alone (up to eight times less). The complexity of the constructed knowledge bases

and query processing requires special domain ontologies. The paper [Maedche&Staab, 2000] presents an algorithm for semiautomatic ontology learning from texts. The learner uses a kind of algorithm for discovering generalized association rules. The input data for the learner is a set of transactions, each of which consists of a set of items that appear together in the transaction. The algorithm extracts association rules represented by sets of items that occur together sufficiently often and presents the rules to the knowledge engineer. For example, shopping transactions may include the items purchased together. The association rule may say that 'snacks are purchased together with drinks' rather than 'crisps are purchased with beer'. The algorithm uses two parameters for a rule: support and confidence. Support is the percentage of transactions that contain all the items mentioned in the rule, and confidence for the rule X => Y is the conditional percentage of transactions where Y is seen, given that X also appeared in the transaction. The ontology learner [Maedche&Staab, 2000] applies this method straightforwardly for ontology learning from texts to support the knowledge engineer in the ontology acquisition environment.

The main problem in applying ML algorithms for OL is that the knowledge bases constructed by the ML algorithms have a flat homogeneous structure, and very often have propositional-level representation (see Table 1). Thus several efforts focus on improving ML algorithms in terms of their ability to work with complicated structures. A first step in applying ML techniques to discover hierarchical relations between textually described classes is taken with the help of Ripple-Down Rules [Suryanto&Compton, 2000]. The authors start with the discovery of the class relations between classification rules. Three basic relations are considered: intersection (called subsumption in marginal cases) of classes, mutual exclusivity, and similarity. For each possible relation they define a measure to evaluate the degree of subsumption, mutual exclusivity, and similarity between the classes. For input, the measures use the attributes of the rules that lead to the classes. After the measures between all classes have been discovered, simple techniques can be used to create the hierarchical (taxonomic) relations between the classes.

Knowledge extraction from the Web (data mining from the Web) uses domain ontologies to represent the extracted knowledge to the user of the knowledge
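For reference, support and confidence as used above can be computed directly over a set of transactions; a minimal sketch (toy transactions, not the corpus data of [Maedche&Staab, 2000]):

def support(transactions, items):
    """Fraction of transactions containing all of the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

baskets = [{"snacks", "drinks"}, {"snacks", "drinks", "bread"}, {"snacks"}, {"bread"}]
print(support(baskets, {"snacks", "drinks"}))       # 0.5
print(confidence(baskets, {"snacks"}, {"drinks"}))  # ~0.67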


In a classical setting the C4.5 algorithm takes instances described by attribute-value pairs and produces a tree whose nodes are attribute-value tests. The authors propose replacing the attribute-value dictionary with a more expressive one that consists of simple data types, tuples, sets, and graphs. The method [Bowers et al., 2000] uses a modified C4.5 learner to generate a classification tree that consists of tests on these structures, as opposed to attribute-value tests in a classical setting. Experiments showed that on data sets with structured instances the performance of this algorithm is comparable to standard C4.5, but task-oriented modifications of C4.5 perform much better.

The system CRYSTAL [Soderland et al., 1995] extends the ideas of the previous system AutoSlog, which showed a great performance increase (about 200 times better than the manual system) on the creation of concept node definitions for a terrorism domain. It uses an even richer set of modelling primitives and creates the text extraction and markup rules, with a given domain model as input, by generalizing the semantic mark-up of the manually marked-up training corpora. Manually created markup is automatically converted into a set of case frames called 'concept nodes' using a dictionary of rules that can be present in the concept node. The concept nodes represent the ontology instances and the domain-specific dictionary of rules defines the list of allowable slots in the ontology instance.
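A concept node can be pictured as a case frame, i.e. a set of slot fillers. The sketch below (hypothetical slots and values, not CRYSTAL's actual representation) illustrates the kind of pairwise generalization into a parent node holding the attributes the two nodes share, which is the generalization step described for CRYSTAL later in this section:

def generalize(node_a, node_b):
    """Parent concept node keeping only the slot values the two nodes have in common."""
    return {slot: node_a[slot]
            for slot in node_a.keys() & node_b.keys()
            if node_a[slot] == node_b[slot]}

# Two hypothetical concept nodes extracted from marked-up sentences.
node1 = {"event": "diagnosis", "symptom": "fever", "patient-age": "child"}
node2 = {"event": "diagnosis", "symptom": "fever", "patient-age": "adult"}

print(generalize(node1, node2))   # {'event': 'diagnosis', 'symptom': 'fever'}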

was mostly the same for all systems. The questionnaire presented in the paper showed that the users found the ML facilities useful and thought that they made the knowledge acquisition process easier. Future prospects for research listed in [Webb, Wells, Zheng, 1999] were to lead to 'a more ambitious extension of this type of study that would examine larger scale tasks that included the formulation of appropriate ontologies'.

Learning of the domain ontologies is far less developed than NLO improvement. The acquisition of the domain ontologies is still guided by a human knowledge engineer, and automated learning techniques play a minor role in knowledge acquisition. They have to find statistically valid dependencies in the domain texts and suggest them to the knowledge engineer.

Learning of Ontology Instances

In this subsection we survey several methods for learning of the ontology instances. The traditional propositional-level ML approach represents knowledge about the individuals as a list of attributes, with each individual being represented by a set of attribute-value pairs. The structure of ontology instances is too rich to be adequately captured by such a representation. The paper [Bowers et al., 2000] uses a typed, higher-order logic to represent the knowledge about the individuals.

Table 2. Comparison of the ontology learning approaches

(Columns: ontology type - NLO, Domain Ontologies, Ontology Instances; OL task - Creation, Enrichment, Extraction, Instance Extraction; ML technique - Propositional learning, Bayesian learning, First-Order Rule learning, Clustering; and, under "Modifications of ML techniques", the degree of human interaction, the use of complex modelling primitives, and the complexity of the solution space.)

Approach                     | Human interaction | Complex modelling primitives | Complex solution space
[Bisson et al., 2000]        | Partial           | No                           | Concept hierarchy
[Faure&Poibeau, 2000]        | Yes               | Simplified frames            | Simplified frames
[Agirre et al., 2000]        | No                | No                           | No
[Junker et al., 1999]        | No                | Several predicates           | No
[Craven et al., 2000]        | No                | No                           | No
[Bowers et al., 2000]        | No                | Yes, rich structure          | Yes, rich structure
[Taylor et al., 1997]        | No                | Yes, but restricted          | No
[Webb, Wells, Zheng, 1999]   | Yes               | No                           | No
[Soderland et al., 1995]     | No                | Yes                          | Yes
[Maedche&Staab, 2000]        | No                | No                           | No

position) and three predicates governing these types for treating text categorization rules as logical programs and applying first-order rule learning algorithms. The rules learned are derived from five basic constructs of a logical pattern language used in the framework to define the ontologies. The learned rules are directly exploited in automated annotation of the documents to become the ontology instances. The task of learning of the ontology instances fits nicely into an ML framework, and there are several successful applications of ML algorithms for this. But these applications are either strictly dependent on the domain ontology or populate the mark-up without relating to any domain theory. A general-purpose technique for extracting ontology instances from texts given the domain ontology as input has still not been developed.

After formalizing the instance level of the hierarchy, CRYSTAL performs a search-based generalization of the concept nodes. A pair of nodes is generalized by creating a parent class with the attributes that both classes have in common. The knowledge representation language for the concept nodes is very expressive, which leads to an enormous branching factor for the search performed during the generalization. The system stores the concept nodes in a way that best suits the distance measure function, and therefore performs reasonably efficiently. Experiments on a medical domain showed that the number of positive training instances required for a good recall was limited; after between one and two thousand, the recall measure no longer grows significantly. The system performs two stages necessary for OL: it formalizes ontology instances from the text and generates a concept hierarchy from these instances.

A systematic study of the extraction of ontology instances from Web documents was carried out in the project Web-KB [Craven et al., 2000]. In their paper the authors used the ontology of an academic web site to populate it with actual instances and relations from CS departments' web sites. The paper targets three learning tasks: (1) recognizing class instances from the hypertext documents guided by the ontology; (2) recognizing relation instances from the chains of hyperlinks; (3) recognizing class and relation instances from the pieces of hypertext. The tasks are dealt with using two supervised learning approaches: the Naive Bayes algorithm and a first-order rule learner (modified FOIL). The system automatically creates a mapping between the manually constructed domain ontology and the Web pages by generalizing from the training instances. The system performance was surprisingly good for the restricted domain of a CS website where it was tested.

Major ML techniques applied for text categorization performed to some degree of effectiveness [Junker et al., 1999], but beyond that, effectiveness appeared difficult to attain and was only possible in a small number of isolated cases with substantial heuristic modification of the learners. This shows the need for combining these modifications in a single framework based on first-order rule learning. The paper [Junker et al., 1999] defines three basic types (one for text, one for word, and one for text

Conclusions

The above case study is summarized in Table 2. The first column specifies the approach; the next columns represent the ontological component of the Web query system, the OL tasks, and the relevant ML technique, respectively. The last three columns describe the degree to which the system interacts with the user and the properties of the knowledge representation scheme.

From the table we see that a number of systems related to the natural-language domain deal with domain-specific tuning and enrichment of NLOs with various clustering techniques. Learning of the domain ontologies is so far done only on a propositional level, and first-order representations are used only in the extraction of ontology instances (see Table 2). There are several approaches in the field of domain ontology extraction, but the systems used there are variants of propositional-level ML algorithms. Each OL paper modifies the applied ML algorithm to handle human interaction, complex modelling primitives or a complex solution space; only one paper [Faure&Poibeau, 2000] makes all three modifications of the ML algorithm for NLO learning, as also shown in the table.

Research in OL mostly proceeds by straightforward application of ML algorithms. This was a successful strategy to begin with, but substantial modifications of the ML algorithms will be needed for OL tasks.

23

Acknowledgements

The author would like to thank Dieter Fensel for helpful discussions and comments, and Heiner Stuckenschmidt and four anonymous reviewers for their comments.

References

[Agirre et al., 2000] E. Agirre, O. Ansa, E. Hovy, D. Martinez: Enriching very large ontologies using the WWW. In: S. Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds.), Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence ECAI'00, Berlin, Germany, August 20-25, 2000.

[Berners-Lee&Fischetti, 1999] T. Berners-Lee, M. Fischetti. Weaving the Web. Harper, San Francisco, 1999.

[Bisson et al., 2000] G. Bisson, C. Nedellec, D. Canamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. In: S. Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds.), Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence ECAI'00, Berlin, Germany, August 20-25, 2000.

[Bowers et al., 2000] A. Bowers, C. Giraud-Carrier, J. Lloyd: Classification of Individuals with Complex Structure. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML'2000), Stanford, US, June 29-July 2, 2000, pp. 81-88.

[Craven et al., 2000] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery: Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118: 69-113, 2000.

[Decker et al., 2000] S. Decker, D. Fensel, F. van Harmelen, I. Horrocks, S. Melnik, M. Klein, J. Broekstra: Knowledge Representation on the Web. In: Proceedings of the 2000 International Workshop on Description Logics (DL2000), Aachen, Germany, August 2000.

[Evett, 1994] M. Evett. PARKA: A System for Massively Parallel Knowledge Representation. Computer Science Department, University of Maryland at College Park, 1994.

[Faure&Poibeau, 2000] D. Faure, T. Poibeau: First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX. In: S. Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds.), Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence ECAI'00, Berlin, Germany, August 20-25, 2000.

[Fellbaum, 1998] C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998.

[Fensel, 2000] D. Fensel. Ontologies: Silver Bullet for Knowledge Management and Electronic Commerce. Springer-Verlag, Berlin, 2000.

[Junker et al., 1999] M. Junker, M. Sintek, M. Rinck: Learning for Text Categorization and Information Extraction with ILP. In: J. Cussens (ed.), Proceedings of the 1st Workshop on Learning Language in Logic, Bled, Slovenia, June 1999, pp. 84-93.

[Lopez, 1999] F. Lopez: Overview of Methodologies for Building Ontologies. In: Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods, Stockholm, Sweden, August 2, 1999.

[Maedche&Staab, 2000] A. Maedche, S. Staab: Semi-automatic Engineering of Ontologies from Text. In: Proceedings of the Twelfth International Conference on Software Engineering and Knowledge Engineering, Chicago, 2000.

[Mitchell, 1997] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[Soderland et al., 1995] S. Soderland, D. Fisher, J. Aseltine, W. Lehnert: Issues in Inductive Learning of Domain-Specific Text Extraction Rules. In: Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing at IJCAI'95, Montreal, Quebec, Canada, 1995.

[Studer et al., 1998] R. Studer, R. Benjamins, D. Fensel: Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering, 25: 161-197, 1998.

[Suryanto&Compton, 2000] H. Suryanto, P. Compton: Learning Classification Taxonomies from a Classification Knowledge Based System. In: S. Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds.), Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence ECAI'00, Berlin, Germany, August 20-25, 2000.

[Taylor et al., 1997] M. Taylor, K. Stoffel, J. Hendler: Ontology-based Induction of High Level Classification Rules. In: Proceedings of the SIGMOD Data Mining and Knowledge Discovery Workshop, Tucson, Arizona, 1997.

[Webb, Wells, Zheng, 1999] G. Webb, J. Wells, Z. Zheng: An Experimental Evaluation of Integrating Machine Learning with Knowledge Acquisition. Machine Learning, 31(1): 5-23, 1999.


Fast Search & Transfer ASA
Knut Magne Risvik
R&D Director, Search Technology

Abstract

Search engines experience the effects of Web dynamics all the time. Crawling and link management for a search engine are a struggle as the Web becomes more dynamic. We will present some studies of Web dynamics based upon our crawling and link management, and we will also present some ideas for making search engines more useful on a dynamic Web.


An Active Web-based Distributed Database System for E-Commerce Hiroshi Ishikawa

Manabu Ohta

Tokyo Metropolitan University, Dept. of Electronics & Information Eng. Abstract ECbusiness models like e-brokers on the Web use WWW-based distributed XML databases. To flexibly model such applications, we need a modeling language for EC businesses, specifically, its dynamic aspects or business processes. To this end, we have adopted a query language approach to modeling, extended by integrating active database functionality with it, and have designed an active query language for WWW-based XML databases, called XBML. In this paper, we explain and validate the functionality of XBML by specifying e-broker and auction business models and describe the implementation of the XBML server, focusing on the distributed query processing in the WWW context.

SQL to XML is inadequate because RDB and XML data have different data models. So we take a nonprocedural query language approach to modeling XML-based businesses. Further, we extend the query language approach by integrating ECA rules [8] with it for modeling control flow of the business processes. Thus, we call XBML an active query language approach to modeling EC businesses. At the same time, we make XBML efficiently executable on the server-side to agilely implement the business models. This paper will not propose a new query language for XML submitted to W3C, although XBML contains the query functionality as a basic part for specifying business processes. XBML integrates the query facility with ECA rules for controlling business processes. We will describe the functionality of XBML and validate the usability of XBML by specifying the example business model with XBML in Section 2. We will describe the current implementation of an XBML server, focusing on the distributed query processing in Section 3.

1 Introduction XML data are widely used in Web information systems and EC applications. In particular, e-broker [12] business models on the Internet like Amazon.com, use a large number of XML data such as product, customer, and order data. In order both to flexibly model and agilely realize such applications, we need a modeling language for EC businesses, in particular, business processes. To this end, we will adopt a query language approach to modeling EC businesses, extended by integrating active database [8] functionality with it, and will provide an active query language for XML data centric in business models, tentatively called XBML (Xml-based Business Modeling Language) by extending the earlier version [10]. As an active query language approach, we need to consider its continuity with nonprocedural database standards such as SQL. We rationalize the necessity of a modeling language for EC businesses. First, the modeling language must be able to integrate the components by reducing their complexity and to make the integrated system understandable. Second, the modeling language must be able to do more than model EC businesses. Indeed, we can use XML as interfaces of each component to make integration easy. However, this just models only the static aspects of the components. We must be able to model the dynamic aspects of business models, that is, business processes. For example, the author [4] discusses the necessity of modeling Web-based applications although he takes an HTML/JavaScript approach in the context of extending UML. But this approach would increase the complexity of modeling the business logic and the overhead of the client-server interaction on the contrary. Instead, we need a nonprocedural language approach to modeling the bushiness processes such as SQL as an analogical solution although just applying

2 Approach 2.1 Database Schemas and Business Model We use the following database schemas or DTD fragments for illustrating the functionality of XBML:


The following is a part of XML data with conformity to the above DTD:

discussed in the previous section.

Hiroshi Ishikawa L2 S210 Object-Oriented Database System Springer Verlag 69.00

(1) Searching Products

XBML provides the following functions for describing product search processes:
- XBML allows product search by selecting products based on their attributes, such as titles and authors, and constructing search results based on them.
- XBML allows ambiguous search by allowing partially-specified strings and path expressions.
- XBML supports data join used in related search to promote cross-sell and up-sell.
- XBML configures search results by sorting and grouping them based on product attributes.
- XBML supports "comparison model" of similar products by allowing search multiply bound across shopping sites.
- XBML provides localized views (e.g., prices) of global products by introducing namespaces (i.e., contexts).
Note that we cannot describe sorting, grouping, and namespaces due to the limit of space.

We take an ordered directed graph as the logical model for the active query language XBML as a modeling language of EC businesses. That is, the data model of XBML can be represented as data structures consisting of nodes (i.e., elements) and directed edges (i.e., contain, or parent-child, relationships), which are ordered. We also use e-broker business models based on XML data for describing the XBML functionality.

Here we provide a working definition of EC business models in general. EC business models consist of business processes and revenue sources based on IT such as the Web and XML. We assume that e-broker business models acting on behalf of customers consist of at least the following business processes:
(1) The customer searches products by issuing either precisely- or approximately-conditioned queries against one or more suppliers and/or navigating through the related links.
(2) The customer takes recommendations from suppliers into account, if any.
(3) The customer compares and selects products and puts them into the shopping cart.
(4) The customer checks out by placing a purchase order with registration.
(5) The customer tracks the order to check the status for shipping.

Data selection and construction The basic function of XBML is to select arbitrary elements from XML data by specifying search conditions for accommodating flexible product searches in e-broker business models. XBML allows any combination of retrieved elements to produce new element constructs for further services. The following query produces new elements consisting of titles and authors of books published by Prentice-Hall and firstly authored by Ullman: (Query1) select result {$book.title, $book.author } from dlib URI “www.a.b.c/dlib.xml”, book $dlib.book where $book.publisher.name = “Prentice-Hall” and $book.author[0].lastname =“Ullman” and $book.@year gt “1995”

The basic unit of XBML is a path expression, that is, an element variable followed by a series of tag names such as “$dlib.book”. The user must declare at least one element variable in a from-clause. In particular, the user can bind XML data as input specified by URI to element variables such as dlib. Note that URI in our context must have the form “www.x.y.z/d.xml” but not “www.x.y.z”. This declares a context where an XBML query is evaluated. References of element variables are done

The revenue source in e-broker models is sales.

2.2 Business Model Specification

We have adapted the design of XBML to the requirements for supporting EC business models


by prefixing "$" to them. The user checks a condition for selection in a where-clause. Two values of elements are compared in alphabetical order. Compare operators include "=", "!=", and "lt", "le", "gt", "ge" for "<", "<=", ">", ">=", respectively. XBML allows indexed access to ordered elements by specifying an index [i]. Attributes are referenced by prefixing "@" to them. "{}" in a select-clause, enclosing elements delimited by ",", creates new XML elements of a specified construct, such as author and title tags. The result of an XBML query is XML data, which can be retrieved in the same way as existing data. In our current design, the resultant XML data have no DTD, that is, they are well-formed XML data. For example, the result of the above query has the following structure, automatically wrapped by a tag "XBML:result":

retrieving products represented as slightly heterogeneous elements (i.e., semi-structured XML data), which depend on data and suppliers in e-broker business models. Here we define semi-structured XML data as follows: (1) Elements with the same tag are repeated at more than or equal to zero times, depending on parent elements, such as authors of books. (2) Elements with the same tag have variant sub-structures, depending on parent elements, such as offices of authors. As these characteristics cannot be determined in advance, we allow partially-specified path expressions.The following query retrieves authors of any material such as book and article named Ishikawa. (Query2) select result {$anyauthor} from dlib URI “www.a.b.c/dlib.xml”, anyauthor $dlib.%.author where $anyauthor.lastname =“Ishikawa”

A First Course in Database Systems Jeff Ullman Gates Building … … …

Here “%” denotes “wild card” in path expressions, which also allows approximate searches in e-broker business models. “$dlib.%.author” matches both of “book.author” and “article.author”. Data join XBML joins different elements by comparing their values in a where-clause. The following query joins books and articles by authors as a join key within the same XML data:

Here, we define the basic syntax of XBML as follows:

(Query3) select result {$article, $book} from dlib URI “www.a.b.c/dlib.xml”, article $dlib.article, book $dlib.book where $book.author.firstname = $article.author.firstname and $book.author.lastname = $article.author.lastname and $book.title = “%Electronic Commerce%”

query = select target from context-list [where-clause] [orderby-clause] [groupby-clause] target=expression | tag ‘{’expression-list ‘}’ expression-list = expression ‘,’expression-list | expression expression = [tag] ‘$’ variable | [tag] ‘$’ variable ‘.’ path path = ‘%’ | tag | ‘@’attribute| path ‘.’ path | ‘(’ path ‘|’ path ‘)’ | text context-list = context ‘,’ context-list | context context = variable URI uri-list | variable expression uri-list = uri uri-list | uri where-clause = where condition condition = term | condition or term term = factor | term and factor factor = predicate | not predicate predicate = expression compare expression orderby-clause = orderby expression-list groupby-clause = groupby expression-list
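To make the path grammar concrete, the sketch below shows one possible way (our illustration under simplified assumptions, not the XBML implementation) of evaluating a dotted path expression with the "%" wild card against a nested dictionary standing in for an XML document:

def match_path(node, path):
    """Yield all values reached from node by the dotted path; '%' matches any single tag."""
    if not path:
        yield node
        return
    head, rest = path[0], path[1:]
    if isinstance(node, dict):
        for tag, child in node.items():
            if head == "%" or head == tag:
                yield from match_path(child, rest)

# Hypothetical document: both books and articles carry an 'author' element.
dlib = {"book": {"author": "Ullman", "title": "..."},
        "article": {"author": "Ishikawa"}}

print(list(match_path(dlib, ["%", "author"])))   # ['Ullman', 'Ishikawa'], like $dlib.%.author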

In e-broker business models, this helps increase cross-sell and up-sell. Here the customers can do approximate searches over XML data by using wild card “%” in strings, that is, partially-specified strings, as is often the case with search in e-broker business models. The query result has the following structure: …

Partially-specified path expression

XBML allows regular path expressions for flexibly


those customers who purchased the products selected by the customer. The facility for function definition and the query transformation technique have an important role in recommendation as follows.

… … … … … … … …

Function definition

Functions correspond to "parameterized views". Functions modularize recurring queries in EC business models to increase their reuse. The user defines a function by specifying an XBML query in its body. The syntax has the following form:
function-definition = function name ‘(’ parameter-list ‘)’ as ‘(’ query ‘)’
parameter-list = parameter ‘,’ parameter-list | parameter

Multiple binding

The user can have universal access to multiple data sources by binding a single element variable to multiple URIs (i.e., a URI list) in a from-clause. The following example retrieves books authored by the same author from two online bookstores (bound to dlib) by only a single query at the same time:

As personalized recommendation, the following function recommends products based on the keywords which the customer (specified by its identifier, customerid) have registered in advance as his psycho-graphic data:

(Query4) select result {$book.title, $book.author} from dlib URI “www.a.b.c/dlib.xml” “www.x.y.z/dlib.xml”, book $dlib.book where $book.author.lastname =“Ishikawa”

function personalized-Recommendation (customerid) as (select result {$book.title, $book.price} from dlib URI “www.a.b.c/dlib.xml” , book $dlib.book, r URI “www.a.b.c/registration.xml”, customer $r.register.customer where $book.keyword = $customer.keyword and $customer.id = customerid)

The user needs to declare the partially-specified path expression to accommodate the heterogeneity of data sources. This function is necessary for comparing similar products or searching the lowest price in multiple stores.

The next example in the collaboratively-filtered recommendation category recommends products based on similarity that there are other customers who purchased the product selected by the customer (i.e., indicated by selected). function collaboratively-filtered-Recommendation (selected) as (select result {$book.title, $book.price} from dlib URI “www.a.b.c/dlib.xml” , book $dlib.book, r URI “www.a.b.c/registration.xml”, customer $r.register.customer where $book = $customer.purchased and $customer.purchased = selected)
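Outside XBML, the same collaborative-filtering join can be sketched over in-memory purchase records (hypothetical data and function names, shown only to make the recommendation logic explicit):

def collaborative_recommendations(purchases, selected):
    """Recommend items bought by customers who also bought the selected item."""
    recommended = set()
    for customer, items in purchases.items():
        if selected in items:
            recommended |= items - {selected}
    return recommended

# customer -> set of purchased book titles (toy data)
purchases = {"c1": {"XML Basics", "EC on the Web"},
             "c2": {"XML Basics", "Databases"},
             "c3": {"Databases"}}

print(collaborative_recommendations(purchases, "XML Basics"))
# -> {'EC on the Web', 'Databases'}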

(2) Recommendation

Related search as a recommendation process is indeed crucial in promoting cross-sell and up-sell. It is classified into three categories according to the extent to which the customer in session is involved.
(1) Non-personalized recommendation: The customer is not involved. The e-broker recommends some products as general trends, independently of the customer. Or, the e-broker shows the customer products highly rated by the other customers.
(2) Personalized recommendation: Only the customer is involved. The e-broker recommends some products based on the customer's psycho-graphic data, such as interests, or historical data, such as purchase records.
(3) Collaboratively filtered recommendation [17]: Both the customer and the others are involved. The e-broker recommends products purchased by

Query transformation Until now, we have treated recommendation and search as separate processes. However, when the customer specifies search keywords, the search result can be expanded to include recommended products by transforming the original search query. Query transformation is classified into two rules as follows: (1) Keyword addition rule This rule has the general form: keyword1 ==> keyword1 | keyword2

For example, the originally specified keyword “Electronic Commerce” adds a new keyword


result $XBML:result.result where $result.checked =“yes”

“Internet Business” and the disjunctive condition is added to the end of the query as follows: (Query5) select result {$book} from dlib URI “www.a.b.c/dlib.xml, book $dlib.book where $book.keyword = “Electronic Commerce” or $book.keyword = “Internet Business”
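The keyword addition rule can be applied as a simple rewriting of the query condition; a minimal sketch of the rule 'keyword1 ==> keyword1 | keyword2' (our illustration, not the actual transformation engine):

# Rewrite rules of the form keyword1 ==> keyword1 | keyword2
rules = {"Electronic Commerce": ["Internet Business"]}

def expand_keyword_condition(keyword):
    """Turn a single keyword test into a disjunction over the added keywords."""
    terms = [keyword] + rules.get(keyword, [])
    return " or ".join(f'$book.keyword = "{t}"' for t in terms)

print(expand_keyword_condition("Electronic Commerce"))
# -> $book.keyword = "Electronic Commerce" or $book.keyword = "Internet Business"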

(4) Placing Orders

Selected items in the shopping cart remain to be added to the ordering databases. Thus, addition of new elements is a mandatory function for constructing practical e-broker models. Addition of new elements often requires making them unique by invoking a dedicated function defined in a programming language such as Java. To this end, XBML also allows function invocation in a query.

This technique is similar to query expansion [3] used in information retrieval. Note that this type of transformation keeps data sources unchanged. (2) Data source addition rule This rule uses set operations on queries to modify the original one. The rule has the following general form:

Insertion and function invocation We provide the syntax for insertion by using an XBML query as follows:

query1 ==> query1 set-operator query2

insertion = insert into target query

Here set-operator includes union, intersection, and difference. For example, when the customer searches books on EC, he will search articles on EC at the same time by modifying the original query with a disjunctive query as follows:

The following query places a purchase order in e-broker business models by consulting the current shopping cart and customer data and invoking a function: (Query8) insert into $order select order {@id = OrderID($customer.id, date()), item $cart.item} from r URI “www.a.b.c/registration.xml”, customer $r.register.customer, XBML:result URI “www.a.b.c/XBML:result.xml” , cart $XBML:result.cart, o URI “www.a.b.c/ordering.xml”, order $o.order where $customer.lastname =“Kanemasa”

(Query6) select result {$book} from dlib URI “www.a.b.c/dlib.xml, book $dlib.book where $book.keyword = “Electronic Commerce” union select result {$article} from dlib URI “www.a.b.c/dlib.xml, article $dlib.article where $article.keyword = “Electronic Commerce”

We analyze the application-based Web access patterns [5] to create the transformation rules, not discussed here due to space limitation.

Here, in a select-clause, function calls “OrderID($customer.id, date())” generate unique order numbers. Ordering initiates internal processes, such as payment and shipment, hidden from the customers. Please note that “$order” in the into-clause is permanent in “www.a.b.c/ordering.xml” while “order” in the select-clause is temporarily constructed in this query.

(3) Moving to Carts

In general, EC business models involve temporary data, such as search results and shopping carts, valid only within sessions, as well as permanent data such as books and customers. XBML handles such temporary data as first-class citizens.

(5) Tracking Orders

Ordering and shipping constitute a supply chain in the EC business models. Further, shipping is often outsourced. Thus, the involved data are managed at separate sites, whether on the intranet or on the extranet. To this end, XBML allows data join across different sites in addition to that within one site.

Use of query results XBML allows a query against the intermediate query results as well. The customer checks the result of searching products or recommendation to place an order. The following XBML query moves only the customer-checked items in the search result to the shopping cart: (Query7) select cart {item $result.book} from XBML:result URI “www.a.b.c/XBML:result.xml” ,

Join of data from multiple data sources The user can join heterogeneous XML data from


different data sources indicated by different URIs. In e-broker business models, the following query produces a set of ordered items and shipping status by joining order identifiers of order entry data and order shipping data at different sites indicated by separate variables bound to multiple URIs, such as o and s:

by sellers corresponds to registry of products by suppliers, just implicit in the e-broker model. Searching and recommendation of auction items are very close to those of products in the e-broker model. Indeed, bidding is a new process, but it can be viewed as a series of tentative ordering until the buying customer wins the auction. In other words, the event that the customer wins the auction moves auction items to the shopping cart. The winner’s placing a purchase order is very close to that in the e-broker model. Order tracking in the auction model is analogous to that in the e-broker model although it may require a new business model, such as e-escrow, to guarantee the bargain contract. The revenue source is a part of the contract price as fees in the auction model. Thus, we would say that our XBML can apply to the auction model as well. However, it is also true that controlling business processes, or modeling events by some ways is necessary. Thus, the auction model requires triggering business processes at a specified time or on some database events such as insert. Active databases or ECA (Event-Condition-Action) rules [8] will be able to specify such business processes on events more elegantly than procedural programming languages plus the current version of XBML. Therefore, we extend current XBML by introducing the following construct for ECA rules:

(Query9) select result {$order.item, $ship.status} from o URI “www.a.b.c/ordering.xml”, s URI “www.d.e.f/shipping.xml”, order $o.order, ship $s.ship where $order.id=$ship.id and $order.id=“cidymd”

In general, there are two approaches to resolving heterogeneity in schemas of different databases: schema translation based on ontologies and schema relaxation based on query facilities. XBML takes the latter approach, that is, XBML uses regular path expressions and element variables to enable the user to retrieve multiple databases with heterogeneous schemas by a single query at one time, because the regular path expressions can match more than one path and the element variables can be bound to more than one path. Further, we allow well-formed XML data containing a set of heterogeneous elements as a query result. Of course, we admit that a simple solution to schema translation between heterogeneous DTDs is based on XSL (i.e., XSL Transformations).

2.3 Applicability to Other Models and Extension

In the previous subsection, we have discussed the applicability of XBML to the e-broker models. Now we ascertain its applicability to business models other than the e-broker model. Indeed, there are rather novel EC business models, such as the reverse auction model. However, new business models are often created by mutation of business processes of existing models. We take the auction model [12] as an example. The auction model consists of the following processes: The selling customer registers auction items.
(1) The buying customer searches auction items.
(2) The buying customer takes recommendations into account, if any.
(3) The buying customer bids.
(4) The winner customer checks out by placing a purchase order with registration.
(5) The winner customer tracks the order to check the status for shipping.

on event if condition then action

Events include operations of XBML (e.g., select and insert) and a specified time. Conditions are specified as conditions of XBML. Actions are also specified by XBML. For example, we think of the situation that when the highest bidding price of the auction specified by id1 is updated, if the current time is before the closing time of the auction, then the auctioneer specified by id2 increases his bidding by a specified value value3. The corresponding ECA rules can be specified as follows: on insert into $auction.price if now() lt $auction.closing-time then insert into $auction.auctioneer.price select increase ($autioneer.price, “value3”) from actn uri “www.a.b.c/actn”, auction $actn.auction where $auction.id = “id1” and $auction.auctioneer.id = “id2”

We can observe similarity between the auction model and e-broker model. Registry of auction items

Here, now() returns the current time and increase(var, val) increments the variable var by a value val. The


ECA rules are defined in advance and invoked on events. The ECA rules can elegantly implement the recommendation (e.g., Query5):

searching node identifiers in Attribute_Node. We cluster data in node and edge tables on a breadth-first tree search basis. We have found this way of clustering contributing very much to reducing I/O cost. Further, we have known from our preliminary experiments that the DTD-dependent mapping approach is mostly two times more efficient than the universal one. However, we have focused on more of our implementation efforts on the universal mapping approach for the following reasons: (1) The approach can free the burden of defining idiosyncratic mappings from the users. (2) The approach can store XML data whose DTD are unknown in advance. (3) The approach can store heterogeneous XML data, in particular, semi-structured XML data in the same database.

on select result {$book} if $book.keyword = "Electronic Commerce" then select result {$book} from dlib URI "www.a.b.c/dlib.xml", book $dlib.book where $book.keyword = "Electronic Commerce" or $book.keyword = "Internet Business"

Note that the result of the "event query" (i.e., the first "select") is replaced by that of the "action query" (i.e., the second "select") in this case.

3 Implementation

XBML is intended for use not only in modeling EC business models, but also in realizing them agilely. XBML must be efficiently implemented, too. XBML, containing URIs, intrinsically requires distributed query processing. So we construct the XBML server as follows:
(1) We construct local XBML servers as a basis.
(2) We construct global XBML servers by extending the local servers with server-side scripting techniques.

Next, we describe the system architecture for a local XBML server or an XBML processing system. We make appropriate indices on tag values, element-subelement relationships, and tag paths in advance. We describe how the XBML processing system works. The XBML language processor parses an XBML query and the XBML query processor generates and optimizes a sequence of access methods for efficient execution. The primitive access methods are basic operations on node sets, implemented by using RDBMS or ODBMS. They include get_NodeId_by_Path&Val, get_ParentId_by_Child, get_ChildId_by_Parent, get_Value_by_Id, get_NodeId_by_Path, and get_LabelId_by_LabelText in addition to node set operators, such as union, intersection, and difference. We illustrate the translation by using the query:

3.1 Local Server We describe the basic architecture and implementation of a local XBML server. First, we describe storage schema for XML data. We have explored approaches to mapping DTD to databases (RDBMS, i.e., Oracle and ODBMS, i.e., Jasmine [9]) and to implement an XBML processing system [11]. If any DTD or schema information is available, we basically map elements to tables and tags to fields, respectively. We call this approach DTD-dependent mapping, where the user must specify mapping rules individually. Otherwise, we take a DTD-independent mapping or universal mapping approach, which divides XML data into nodes and edges of an ordered directed graph and stores them into separate tables for nodes and edges with neighboring data physically clustered. We provide separate tables for nonleaf and leaf nodes. The order fields of Leaf_Node and Edge tables are necessary for providing access to ordered elements by index numbers. Identifiers, such as ID and IDREF, realizing internal links between elements are declared as attributes and are stored as Value of the separate Attribute_Node table. So references through identifiers are efficiently resolved by

select $book.title from dlib URI “www.a.b.c/dlib.xml”, book $dlib.book where $book.publisher.name = “Prentice-Hall”

This is parsed into an internal form, which denotes a logical query plan represented as an ordered-graph: (Proj (Sel $book (Op_EQ “Prentice-Hall”)) $book.title)

$book.publisher.name

Here, Sel, Proj, and Join (not in the above example) denote selection, projection, and join of XML data, respectively. Op_EQ denotes “=”. This internal form is reorganized in a pattern-directed manner, such as placing Sel before Join, and is transformed into the following primitive operations:


(1) get_NodeId_by_Path&Val (Op_EQ “$book.publisher.name” “Prentice-Hall” ) returns a node set set1 (i.e., $book.publisher). (2) get_ParentId_by_Child (set1 “$book”) returns a node set set2. (3) get_ChildId_by_Parent (set2 “$book.title”) returns a node set set3. (4) get_Value_by_Id (set3) returns a value set as a result.
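Under the universal (node/edge) mapping, these primitive operations reduce to look-ups over node and edge tables. The sketch below mirrors the four steps above over a simplified in-memory table (hypothetical ids and flattened paths, not the actual Oracle/Jasmine implementation):

# Simplified stand-in for the node/edge tables: each node carries its tag path,
# its parent node id, and (for leaves) its value. Intermediate 'publisher' nodes
# are flattened into the leaf path for brevity.
nodes = {
    1: {"path": "book",                "parent": None, "value": None},
    2: {"path": "book.publisher.name", "parent": 1,    "value": "Prentice-Hall"},
    3: {"path": "book.title",          "parent": 1,    "value": "A First Course in Database Systems"},
    4: {"path": "book",                "parent": None, "value": None},
    5: {"path": "book.publisher.name", "parent": 4,    "value": "Springer Verlag"},
    6: {"path": "book.title",          "parent": 4,    "value": "Object-Oriented Database System"},
}

def get_node_id_by_path_and_val(path, value):
    return {i for i, n in nodes.items() if n["path"] == path and n["value"] == value}

def get_parent_id_by_child(ids, parent_path):
    return {nodes[i]["parent"] for i in ids if nodes[nodes[i]["parent"]]["path"] == parent_path}

def get_child_id_by_parent(ids, child_path):
    return {i for i, n in nodes.items() if n["parent"] in ids and n["path"] == child_path}

def get_value_by_id(ids):
    return {nodes[i]["value"] for i in ids}

set1 = get_node_id_by_path_and_val("book.publisher.name", "Prentice-Hall")
set2 = get_parent_id_by_child(set1, "book")
set3 = get_child_id_by_parent(set2, "book.title")
print(get_value_by_id(set3))   # -> {'A First Course in Database Systems'}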

the event and action queries as a single transaction. However, the generated queries tend to be long, in particular, for cascading events. For the moment, we adopt the query rewriting approach in favor of the ease of the implementation. 3.2 Global Server Now we construct the global XBML server by extending the above local XBML servers with server-side scripting techniques. We provide preliminary definitions to queries. First, we categorize queries as follows: (A) Single-URI query This type of query contains only one XML data source specified by a single URL in the query, such as Query1 (selection) and Query3 (join). (B) Multiple-URI query This type of query contains multiple XML data sources specified by multiple URIs in the query. This type is further categorized into two as follows: (B1) Decomposable query This type of query can be decomposed into a combination of single-URI queries with set operators, such as Query4 (multiple binding) and Query6 (set operators). (B2) Non-decomposable query This type of query cannot be decomposed into a combination of single queries alone. This type of query contains join queries over multiple URIs, such as Query9 (join of multiple data sources).

Both RDBMS and ODBMS can be used as the database system of the XBML processing system with the upper layers unchanged by virtue of the above primitive operators. We describe the implementation of ECA rules. First, we define an event query by using the event and the condition in the rules and define an action query by using the action in the rules. Further, we store a dedicated ECA rule database whose entry consists of a pair of such an event and an action query. Now we consider the following approaches to ECA rule: (1) Monitor-based approach The monitor checks each usual query against the event query patterns in the ECA rule database and issues the corresponding action query of the matched event query if the condition is satisfied. The monitor usually keeps a queue of events generated by the matched event query and invokes the action query by looking up in the queue. (2) Query rewriting approach We modify the query processing. The parser checks each query against the event query patterns in the ECA rule database and recursively processes the corresponding action query of the matched event query by adding a check on the condition. That is, the “event query” and “action query” in the rules are translated into a sequence of queries (i.e., primitive access methods) with a condition check.
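A monitor of the first kind can be pictured as a table of (event pattern, condition, action) entries consulted on every incoming query, with matched events kept in a queue; this is only a sketch of the control flow under hypothetical predicates, not the XBML server code:

import queue

# Each ECA entry: a predicate matching incoming queries, a condition check, and an action.
eca_rules = [
    {
        "event":     lambda q: q["op"] == "insert" and q["target"] == "auction.price",
        "condition": lambda q: q.get("now", 0) < q.get("closing_time", float("inf")),
        "action":    lambda q: print("raise counter-bid for auction", q["auction_id"]),
    },
]

events = queue.Queue()   # the monitor keeps a queue of matched events

def monitor(query):
    """Check a usual query against the event patterns and enqueue the matching rules."""
    for rule in eca_rules:
        if rule["event"](query) and rule["condition"](query):
            events.put((rule, query))

def dispatch():
    """Invoke the action queries of the queued events."""
    while not events.empty():
        rule, query = events.get()
        rule["action"](query)

monitor({"op": "insert", "target": "auction.price", "auction_id": "id1",
         "now": 10, "closing_time": 20})
dispatch()   # -> raise counter-bid for auction id1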

Second, we categorize queries in another way: (a) Local query XML data sources specified by URI are inside the relevant XBML server. (b)Global query XML data sources specified by URI are outside the relevant XBML server. Now we show that non-decomposable (i.e., intrinsically global) query can be transformed into a series of single URL local or global queries and local queries (join). We assume that the original query contains n URIs. We translate a non-decomposable query by two steps: (1) create a single-URI (local or global) query for each of n URIs with the insertion of the query result into the local server. (2) create single-URI queries performing join of

Next we consider the merits and demerits of the above approaches as follows:
(1) Monitor-based approach: The monitor can control the whole process in a centralized manner. However, we need the monitor itself as an extra mechanism. It is not trivial to provide the facility for executing the event and action queries as a single transaction.
(2) Query rewriting approach: We need no extra mechanism for controlling processes. It is rather straightforward to execute


}

the results stored in the local server, which are local queries, by reducing all URIs to a single-URI.
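The two-step translation can be pictured as follows (a schematic sketch with placeholder fetch and join helpers, not the server-side scripts themselves): each URI is queried separately, the results are inserted into the local server, and the join is then evaluated locally over a single URI.

def fetch_single_uri(uri, query):
    """Step 1: run a single-URI (local or global) query; stub returning canned rows."""
    canned = {
        "www.a.b.c/ordering.xml": [{"id": "cidymd", "item": "book-42"}],
        "www.d.e.f/shipping.xml": [{"id": "cidymd", "status": "shipped"}],
    }
    return canned[uri]

def local_join(left, right, key):
    """Step 2: join the locally stored results on the join key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

orders = fetch_single_uri("www.a.b.c/ordering.xml", "select ...")   # inserted into local server
ships  = fetch_single_uri("www.d.e.f/shipping.xml", "select ...")
print(local_join(orders, ships, "id"))
# -> [{'id': 'cidymd', 'item': 'book-42', 'status': 'shipped'}]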

The above query processing has some room for improvement in performance. Thus, if the non-decomposable query has no selection conditions, the whole remote data sources specified by the generated single-URI queries must be copied to the local server. For example, consider Query9 when the global server is resident at the "ordering site" or at a third site. Of course, if there is any selection condition on the join key, the condition is propagated to all the single-URI queries. We call this technique simple selection condition propagation. It is a kind of static query rewriting. However, we want more improvement. So we refine the process-or-dispatch scheme to sort the result of the query and return the value range with respect to the join key (i.e., MIN and MAX values) by adding "order-by" to the query. Then, the conditions "join-key ge min-value and join-key le max-value" are dynamically added to the subsequent generated single-URI query. In turn, the query is evaluated to produce a new value range of the join key (i.e., min-value' and max-value'). The following characteristic holds: min-value' >= min-value and max-value' <= max-value.

H profiles. In the rest of the Section we will define the graph structure of the EACs, whereas we omit the description of the overall adaptive hypermedia, which can be obtained directly by applying the given definitions. An EAC with M different profiles (reading keys) is a set N of XML documents where the generic document i ∈ N contains, for each profile k = 1, ..., M, a set Lik of annotated outgoing links (i, j, k), where j is the destination node. It can be mapped to a multigraph G where each node corresponds to an XML document and each directed arc to an outgoing link:


$$G = (N, E), \qquad E = \bigcup_{i \in N,\; k = 1, \dots, M} L_{ik}$$

For the sake of simplicity, the multigraph G can be viewed as the set of directed weighted graphs Gk, k = 1, …, M, obtained by extracting from G the nodes and arcs corresponding to each profile. Each Gk is named a Logical Navigation Graph:

Gk = ( Nk , Ek ) ,  Nk = { i | (i, j, k) ∈ E ∨ (j, i, k) ∈ E } ,  Ek = { (i, j) | (i, j, k) ∈ E }

The proposed probabilistic approach assumes that the weight Wk(i, j) of the arc (i, j) in Ek is the conditional probability P(j|k, i), namely the probability that a user belonging to profile k follows the link to node j having already reached node i:

Wk(i, j) : Ek → [ 0, 1 ],  Wk(i, j) = P(j|k, i),  (i, j) ∈ Ek,  k = 1, ..., M

P(i|k, i) is always zero, since a link from a node to itself is not allowed. For each node i and each profile, the sum of the weights of the outgoing arcs is always one:

$$\forall i \in N_k: \quad \sum_{j \in N_k} W_k(i, j) = 1, \qquad k = 1, \dots, M$$

A path S in Gk is defined as an ordered set of nodes: S = { S0, S1, ..., Sl | (Sj, Sj+1) ∈ Ek, j = 0, ..., l−1 }. We do not use the standard arc-based definition of a path because relaxing the condition (Sj, Sj+1) ∈ Ek allows us to consider paths involving different Logical Navigation Graphs. This can happen if a user in profile k chooses a link from a node Sj to a node Sj+1 and is moved to a new profile h; in that case we refer to G and consider the alternative condition (Sj+1, Sj+2, h) ∈ E. The probability that a user belonging to profile k follows the path S is:

$$P_S^{\,k} = \prod_{j=0}^{l-1} W_k(S_j, S_{j+1})$$

P_S^k is the product of the probabilities associated with the arcs belonging to the path S. The “shortest” path S̃_ij^k between two nodes i and j for a given profile k is the path with the maximum joint probability:

$$\tilde{P}_{ij}^{\,k} = \max_{S_{ij}^{k}} P_{S_{ij}^{k}}^{\,k}$$

where S_ij^k denotes a generic path between the nodes i and j through arcs belonging to profile k. The “shortest” path for each profile can be computed once (a computational sketch follows the list below). It is appropriate to always show a reasonably rich set of links, to let the user easily reach the resources he/she is interested in. Moreover, it is important to avoid the horizon narrowing phenomenon [Maes97], so the graphs should not contain high-probability cycles. Similarly to the modelling of an EAC, an adaptive hypermedia AH with H different stereotype profiles (reading keys) is a set of Q EACs, where the generic node i (i = 1, …, Q) contains, for each profile k = 1, …, H, a set L_ik of annotated outgoing links. It can be mapped to a multigraph where each node corresponds to an EAC and each directed arc to an outgoing link. Some intrinsic properties of the hypermedia structure can be expressed, for each profile k, by the following values:
• The mean probability of the minimum (“shortest”) paths in Gk; high values of this term indicate the existence of highly natural paths in the hypermedia.
• The mean length of the minimum paths in Gk; high values of this term mean longer natural paths in the hypermedia, which can be an advantage in the overall personalization process.
• The number of nodes belonging to profile k.
These values can change over time: the hypermedia structure can be updated dynamically (adding or removing nodes, arcs or their weights) on the basis of semi-automatic observation of the behaviour of many users or on the basis of an increased knowledge of the Application Domain by the author.
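One standard way to compute the maximum-probability (“shortest”) path is to run a shortest-path algorithm over the arc weights −log Wk(i, j), since maximising a product of probabilities is equivalent to minimising the sum of their negative logarithms. The following Python sketch uses Dijkstra's algorithm on a hypothetical three-node Logical Navigation Graph; it illustrates the definition above rather than the authors' actual implementation.

```python
import heapq
import math

def max_probability_path(Gk, source, target):
    """Dijkstra on -log(weight): returns (joint probability, node list) of the most
    probable path from source to target in a Logical Navigation Graph Gk, or
    (0.0, []) if target is unreachable."""
    dist, prev, visited = {source: 0.0}, {}, set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in Gk.get(u, {}).items():
            if w <= 0.0:
                continue
            nd = d - math.log(w)
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return 0.0, []
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return math.exp(-dist[target]), list(reversed(path))

# Hypothetical Gk: outgoing weights of each node sum to 1, no self-loops.
Gk = {"a": {"b": 0.7, "c": 0.3}, "b": {"c": 1.0}, "c": {}}
print(max_probability_path(Gk, "a", "c"))   # approx. (0.7, ['a', 'b', 'c']), since 0.7*1.0 > 0.3
```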


In the proposed system, these properties are taken into account by constructing three PDFs whose values are proportional to them:

$$\mu(k) = \frac{\sum_{i=1}^{M} \left[ \left( \sum_{(a,b) \in E'_i} \tilde{P}^{\,i}_{ab} \big/ |E'_i| \right) \delta(k-i) \right]}{\sum_{i=1}^{M} \left( \sum_{(a,b) \in E'_i} \tilde{P}^{\,i}_{ab} \big/ |E'_i| \right)}, \qquad n(k) = \frac{\sum_{i=1}^{M} \left[ \left( \sum_{(a,b) \in E'_i} |\tilde{S}^{\,i}_{ab}| \big/ |E'_i| \right) \delta(k-i) \right]}{\sum_{i=1}^{M} \left( \sum_{(a,b) \in E'_i} |\tilde{S}^{\,i}_{ab}| \big/ |E'_i| \right)}, \qquad p(k) = \frac{\sum_{i=1}^{M} N_i\, \delta(k-i)}{\sum_{i=1}^{M} N_i}$$

where E'_k is the set of arcs in the transitive closure of Gk, |S̃_ab^i| is the length of the “shortest” path between a and b, and N_i is the number of nodes belonging to profile i. Then a weighted mean expressing the “intrinsic relevance” of the profiles is computed:

$$s(k) = \frac{\beta_0\, \mu(k) + \beta_1\, n(k) + \beta_2\, p(k)}{\beta_0 + \beta_1 + \beta_2}$$

where the values of µ(k) and n(k) should be traded off, since a profile with few nodes may have few paths with higher probabilities. A high value of each of the terms in s(k) expresses a high relevance with respect to the profile k, so βi > 0.
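A sketch of how the three structural PDFs and s(k) could be computed, assuming that for each profile the shortest-path probabilities and lengths over the transitive closure E'_k and the node count have already been collected (for example with the path sketch above); the input data and the β weights below are hypothetical.

```python
def structural_relevance(profiles, beta=(1.0, 1.0, 1.0)):
    """profiles: dict k -> {"path_probs": [...], "path_lens": [...], "nodes": int}.
    Returns mu(k), n(k), p(k) and the weighted mean s(k); each PDF is normalised so
    that its values over all profiles sum to one (assumes at least one path per profile)."""
    def normalise(raw):
        total = sum(raw.values())
        return {k: (v / total if total else 0.0) for k, v in raw.items()}

    mean_prob = {k: sum(d["path_probs"]) / len(d["path_probs"]) for k, d in profiles.items()}
    mean_len = {k: sum(d["path_lens"]) / len(d["path_lens"]) for k, d in profiles.items()}
    node_count = {k: d["nodes"] for k, d in profiles.items()}

    mu, n, p = normalise(mean_prob), normalise(mean_len), normalise(node_count)
    b0, b1, b2 = beta
    s = {k: (b0 * mu[k] + b1 * n[k] + b2 * p[k]) / (b0 + b1 + b2) for k in profiles}
    return mu, n, p, s

# Hypothetical data for two profiles.
profiles = {1: {"path_probs": [0.7, 0.4], "path_lens": [2, 3], "nodes": 5},
            2: {"path_probs": [0.2, 0.3], "path_lens": [4, 4], "nodes": 3}}
print(structural_relevance(profiles)[3])   # the intrinsic relevance s(k)
```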

4. A probabilistic approach to adapt the hypermedia to the user's behaviour
The probabilistic User Model collects information about the user's actions to build a discrete probability density function (PDF in the following) A(k), k = 1, …, M, measuring the “belonging probability” of the user to each profile (i.e., how well each profile fits him/her). During the user's browsing activity the system updates A(k) and the user's profile is changed accordingly. So, on the basis of the user's behaviour, the system dynamically attempts to assign the user to the “best” profile. Moreover, the user can also “force” a change of his/her profile. Browsing starts from the presentation unit associated with a starting node. If the user is registered, the last A(k) is set as the current one. Otherwise, he/she is assigned to a generic profile, or to one calculated on the basis of a questionnaire; the initial value of A(k) is called A0(k). When the user visiting the node Rr−1 requests to follow a link, the system computes the new PDF A'(k) on the basis of the User Behaviour Variables and of s(k), and then decides the (new) profile to be assigned to the user. The page corresponding to the node Rr is then generated in the computed profile. To avoid continuous profile changes it is possible to keep a profile for a given duration (i.e., a number of traversed links), evaluating the A'(k) distribution at fixed intervals.

4.1 The probabilistic User Model
The user's behaviour is stored as a set of User Behaviour Variables. The main variables are:
• The current profile, kc;
• The current discrete PDF A(k), k = 1, …, M, measuring the user's “belonging probability” to each profile;
• The recently followed path R = {R1, …, Rr−1, Rr}, which contains the last visited nodes, where Rr−1 is the current node and Rr is the next node. The last arc (Rr−1, Rr, kc) is the outgoing link chosen by the user;
• The time spent on recent nodes, t(R1), ..., t(Rr−1).
On this basis, the system evaluates, for each profile k:
• P_R^k, the probability of having followed the path R through arcs belonging to profile k;
• P̃_{R1,Rr}^k, the reachability of the next node Rr starting from the first node R1 through arcs belonging to profile k;
• Dt[k], the distribution, with respect to profile k, of the nodes visited from R1 to Rr−1, weighted by the time spent on each of them. For example, let {n1, n2, n3} be the recently visited nodes and {t1, t2, t3} the time units spent on each of them: if node n1 belongs to profiles k1 and k2, node n2 belongs to k2 and k3, and node n3 belongs to k1 and k4, the distribution is evaluated as Dt[k] = [ (k1, t1+t3), (k2, t1+t2), (k3, t2), (k4, t3) ] (this computation is sketched below). The visiting times should be accurate; an interesting approach for their accurate computation is proposed in [MT00].
A high value of P_R^k indicates that the nodes visited in R are relevant for profile k, as the actual path is “natural” for that profile. The reachability P̃_{R1,Rr}^k of the next node starting from the first node in R takes into account the way the user could have reached that node: a high reachability of Rr in profile k means the user would have reached the next node in a more “natural” way by following the links of profile k.
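The distribution Dt[k] of the worked example above can be computed directly from the visited nodes, their profiles and the visiting times; a minimal Python sketch follows, where the node-to-profiles mapping is the hypothetical one used in the example.

```python
from collections import defaultdict

def time_distribution(visited, times, node_profiles):
    """Dt[k]: for each profile k, the total time spent on the recently visited
    nodes that belong to k."""
    Dt = defaultdict(float)
    for node, t in zip(visited, times):
        for k in node_profiles[node]:
            Dt[k] += t
    return dict(Dt)

# Worked example from the text: n1 in {k1, k2}, n2 in {k2, k3}, n3 in {k1, k4}.
node_profiles = {"n1": ["k1", "k2"], "n2": ["k2", "k3"], "n3": ["k1", "k4"]}
print(time_distribution(["n1", "n2", "n3"], [10.0, 5.0, 2.0], node_profiles))
# {'k1': 12.0, 'k2': 15.0, 'k3': 5.0, 'k4': 2.0}, i.e. [(k1, t1+t3), (k2, t1+t2), (k3, t2), (k4, t3)]
```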


Temporary deviations that do not reflect a shift in the user's interests can be taken into account by trading off the effects of P_R^k and P̃_{R1,Rr}^k on A(k). The former considers the actual path, so it aims to move towards the profile corresponding to recent preferences, whereas the latter tends to disregard recent (local) choices, since the “shortest” paths do not necessarily pass through the nodes visited between R1 and Rr. The distribution Dt[k] shows how the time spent on the visited nodes is distributed with respect to each profile; it is obviously an indicator of the interest the user has shown in the various profiles. To avoid an “infinite memory” effect, only the most recently followed r−1 links (r nodes) are considered. As an example, if R were the path followed since the initial node, the probability P_R^k of having followed R in profile k would be zero as soon as the user visited a single node not belonging to profile k. Note that we set Wk(i, j) = 0 if (i, j) ∉ Ek, k = 1, …, M. The evaluated variables are used to construct three PDFs whose values are proportional to them:

$$c(k) = \frac{\sum_{i=1}^{M} \left[ P_R^{\,i}\, \delta(k-i) \right]}{\sum_{i=1}^{M} P_R^{\,i}}, \qquad r(k) = \frac{\sum_{i=1}^{M} \left[ \tilde{P}^{\,i}_{R_1, R_r}\, \delta(k-i) \right]}{\sum_{i=1}^{M} \tilde{P}^{\,i}_{R_1, R_r}}, \qquad t(k) = \frac{\sum_{i=1}^{M} \left[ D_t[i]\, \delta(k-i) \right]}{\sum_{i=1}^{M} D_t[i]}$$

Finally, a weighted mean expressing the “dynamic relevance” of the profiles is computed:

$$d(k) = \frac{\alpha_0\, c(k) + \alpha_1\, r(k) + \alpha_2\, t(k)}{\alpha_0 + \alpha_1 + \alpha_2}$$

A high value of each of the terms in d(k) expresses a high relevance with respect to the profile k, so αi > 0.

4.2 Algorithm for the evaluation of the belonging probabilities
The algorithm that computes the new PDF A'(k) on the basis of the user's actions has the following structure:
Input:
• The discrete PDFs A(k), A0(k) and s(k) (see Section 3);
• The recently followed path R = {R1, …, Rr−1, Rr}, composed of the last r nodes visited by the user;
• The time spent on the recently visited nodes, t(R1), ..., t(Rr−1).
Output: a new probability density function A'(k).
Steps:
1. Compute the new discrete PDF d(k);
2. Compute the new discrete PDF A'(k) as follows:

$$A'(k) = \frac{\gamma_0\, A_0(k) + \gamma_1\, A(k) + \gamma_2\, d(k) + \Delta\, \gamma_3\, s(k)}{\gamma_0 + \gamma_1 + \gamma_2 + \Delta\, \gamma_3}$$

where Δ = 1 if s(k) has changed, and Δ = 0 otherwise.

We compute the new A'(k) as a weighted mean of four terms; in particular, the first term expresses the initial user choices, the second term considers the history of the interaction, the third term captures the dynamics of the single user, whereas the last term expresses “structural” properties of the hypermedia. A high value of each of the terms in A'(k) expresses a high relevance with respect to the profile k, so γi > 0. The new profile can be chosen either by drawing at random from the A'(k) distribution or by selecting the profile with the highest A'(k) value.
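The complete update step, combining the three behavioural PDFs into d(k) and the four terms into A'(k), can be sketched as follows; all inputs are assumed to be dictionaries indexed by profile, and the α and γ weights and example values are hypothetical.

```python
def normalise(raw):
    """Turn raw per-profile scores into a discrete PDF (values summing to one)."""
    total = sum(raw.values())
    return {k: (v / total if total else 1.0 / len(raw)) for k, v in raw.items()}

def update_belonging(A0, A, P_R, P_reach, Dt, s, s_changed,
                     alpha=(1.0, 1.0, 1.0), gamma=(1.0, 1.0, 2.0, 1.0)):
    """One step of the profile-update algorithm: build c(k), r(k), t(k), the
    dynamic relevance d(k), and the new belonging distribution A'(k)."""
    c, r, t = normalise(P_R), normalise(P_reach), normalise(Dt)
    a0, a1, a2 = alpha
    d = {k: (a0 * c[k] + a1 * r[k] + a2 * t[k]) / (a0 + a1 + a2) for k in A}
    g0, g1, g2, g3 = gamma
    delta = 1.0 if s_changed else 0.0
    denom = g0 + g1 + g2 + delta * g3
    return {k: (g0 * A0[k] + g1 * A[k] + g2 * d[k] + delta * g3 * s[k]) / denom for k in A}

# Hypothetical two-profile example.
A0 = {1: 0.5, 2: 0.5}; A = {1: 0.6, 2: 0.4}
P_R = {1: 0.12, 2: 0.02}; P_reach = {1: 0.30, 2: 0.10}
Dt = {1: 14.0, 2: 3.0}; s = {1: 0.55, 2: 0.45}
A_new = update_belonging(A0, A, P_R, P_reach, Dt, s, s_changed=False)
new_profile = max(A_new, key=A_new.get)   # or draw at random from A_new
print(A_new, new_profile)
```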

5. System architecture
A system supporting the probabilistic Application Domain and User Models has been designed and partially implemented. The system has a three-tier architecture comprising the User, Application and Data Layers (Fig. 2). The User Layer corresponds to the browser. It carries out “light” computation: it receives HTML pages and executes only an invisible applet that signals to the server that the user is not currently interested in that page (e.g. he/she has minimized its window). The basic web technologies used to implement the system are both those concerning on-line access to databases, data composition and data delivery (e.g. JDBC [MagLa99b], Java Servlets [MagLa99], Enterprise Java Beans, XML), and those allowing client-side elaboration (e.g. Java Applets).


Fig. 2. System architecture (Application and Data Layers) and Author Module

5.1 Application layer
At the Application Layer (Fig. 2) there are three main modules: the Session Manager (SM), the User Modelling Component (UMC) and the Personalization Component (PC) [Ard&al99]; they run together with a Web server. The Session Manager maintains a separate session for each user. It responds to the user's actions and communicates them to the User Modelling Component. The UMC, which maintains the most recent actions of the user, executes the algorithm for the evaluation of the belonging probabilities, evaluates the new profile, and gives the result back to the SM. With the new profile, the SM asks the Personalization Component for the presentation unit, instantiated with respect to that profile, to the external variables and to the technological conditions. The PC dynamically generates the HTML code of the page; after having received the definitive page, the SM sends it to the user. In the final page the most likely links are generally made more visible (e.g. by showing a “next page” button or by using different colours). Furthermore, it can be useful to maintain the most recent pages visited by the user in a cache, thus reducing the overhead of reconstructing recent HTML pages for users with the same profile, technological conditions and external variables. The Personalization Component executes the following steps:
1. It extracts from the Data Repository the XML Presentation Description to be instantiated;
2. It browses the “content” section and constructs the outgoing HTML page, referring to the aliases declared in the “fragments” section and selecting them on the basis of the User, Technological and External Variables;
3. It returns the definitive HTML page.
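A sketch of the fragment-selection step performed by the Personalization Component is given below; the structure of the Presentation Description, the variable names and the "default" fallback are hypothetical, since the paper does not give the XML schema here, so this only illustrates the general idea of choosing a fragment per alias from the current variables.

```python
def build_page(presentation, user_vars, tech_vars, ext_vars):
    """Walk the "content" section of a (hypothetical) Presentation Description and,
    for every alias, pick the fragment whose declared conditions match the current
    User, Technological and External Variables; fall back to a default fragment."""
    context = {**user_vars, **tech_vars, **ext_vars}
    html = []
    for alias in presentation["content"]:
        candidates = presentation["fragments"][alias]
        chosen = next((f for f in candidates
                       if all(context.get(k) == v for k, v in f["when"].items())),
                      next(f for f in candidates if f.get("default")))  # assumes a default exists
        html.append(chosen["html"])
    return "\n".join(html)

# Hypothetical Presentation Description with one alias and two variants.
presentation = {
    "content": ["intro"],
    "fragments": {"intro": [
        {"when": {"profile": "expert", "bandwidth": "high"}, "html": "<p>Full intro...</p>"},
        {"when": {}, "default": True, "html": "<p>Short intro...</p>"},
    ]},
}
print(build_page(presentation, {"profile": "novice"}, {"bandwidth": "low"}, {}))
```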


5.2 Data Layer
The main goal of the Data Layer (Fig. 2) is to store persistent data and to offer efficient access primitives. It provides data to the Application Layer, giving the UMC access to the logical navigation graphs and the PC access to the XML documents and data fragments. The Data Layer comprises the Data Sources Level, the Repository Level and a Data Access Module. The Data Sources Level is an abstraction of the different kinds of data sources used to implement the hypermedia. Each data source Si is accessed by a Wrapper that also generates, in a semi-automatic way, the XML Metadata describing the data fragments stored in Si. The Repository Level is a common repository storing data provided by the Data Sources Level or produced by the author. It stores:
• XML documents, in an XML Repository. We chose to adopt the XDM (XML Data Model [FGZ00]) format to store the XML information as a set of relational records in a relational database (XMLDB);
• Persistent objects, in an Object Repository; the objects represent the logical navigation graphs and the data about registered users;
• The DTDs used to validate the XML documents.
Finally, the Data Access Module implements an abstract interface that separates the modules accessing data from the particular implementations of the Data Sources and Repository levels.

5.3 Author Module
The design of the adaptive hypermedia is carried out using an Author Module. It allows the author to design and validate (with respect to syntactic and semantic correctness) the XML documents representing the Presentation Descriptions and the persistent objects representing the Elementary Abstract Concepts and the overall hypermedia. The main components of the Author Module are (Fig. 2):
• The Hypermedia Modeller, which allows the author to design the adaptive hypermedia as a multigraph of EACs, as well as the multigraphs representing each EAC. In particular, it allows the definition of the probabilities of the arcs and offers a set of utilities such as the search for the shortest (maximum weight) path, the minimum spanning tree, etc.
• The Graph Object Validator, which receives graph descriptions from the Hypermedia Modeller and, after validating them (e.g. coherence of the probabilities, congruence with the links contained in the Presentation Descriptions, etc.), generates persistent objects (e.g. Java Entity Beans) containing the weighted graphs and stores them in the Data Layer (Object Repository). The persistent objects contain all the data useful to the system that can be computed at generation time, e.g. the shortest path for each pair of nodes and the PDF s(k) (see par. 3.2.1); the utility highlights the existence of any high-probability cycles and asks for confirmation. The use of a persistent representation allows (part of) the hypermedia, e.g. some EACs, to be reused in different web systems.
• The Fragments Browser/Composer, which allows the browsing of the information fragments and XML Metadata, provided by the heterogeneous data sources or stored in the repository, and their aggregation into more complex data (which are also represented by XML documents).
• The XML Editor, which allows the editing of XML documents (Presentation Descriptions and fragment metadata) as plain text, graphically as trees, or in a “visual” way. It is possible to create new documents, to edit pre-existing ones and to browse the available information fragments and metadata by means of the Fragments Browser.
• The XML Validator, which performs validating parsing of the XML documents against the DTDs and stores them in the repository.

6. Conclusions and future work
In this work we presented a probabilistic approach to the development of Adaptive Hypermedia Systems based on stereotype profiles. The architecture uses a layered multigraph structure to describe the application domain and a probabilistic model to adapt the web-site contents and links to the user's behaviour. The user's behaviour is described by a probabilistic User Model, whose main contribution is to combine information regarding the followed path, the “shortest” path (i.e. the path not necessarily followed but with the highest probability) and a classification of the visited nodes, in order to calculate the distribution of the belonging probabilities. Moreover, the system takes into account the structural properties of the hypermedia, also represented through a PDF. The most promising profile, that is, a “view” over the application domain, is thus dynamically assigned to the user using that probability distribution. The views over the domain model are currently static (with a fixed set of profiles), but it is possible to dynamically change the weights of the graphs' arcs on the basis of context changes and of the user's behaviour.


We are developing a prototype of the system using Java [HC99] to enhance portability. Current and future work will concern the validation and tuning of the probabilistic model, the development of tools to improve and simplify the authoring phase, and the full support of the personalization of the hypermedia structure on the basis of technological and external variables.

References
[Abi&al99] S. Abiteboul, B. Amann, S. Cluet, A. Eyal, L. Mignet, T. Milo, Active views for electronic commerce, in Proceedings of the 25th VLDB Conference, 1999.
[ACT98] L. Ardissono, L. Console, I. Torre, Exploiting User Models for personalizing news presentations, in Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, 1998.
[Ard&al99] L. Ardissono, A. Goy, R. Meo, G. Petrone, L. Console, L. Lesmo, C. Simone, P. Torasso, A configurable system for the construction of adaptive virtual stores, World Wide Web Journal, Baltzer Science Publishers, 1999.
[AMM97] P. Atzeni, G. Mecca, P. Merialdo, To weave the Web, in Proceedings of the 23rd VLDB Conference, 1997.
[BE94] C. Boyle, A.O. Encarnacion, Metadoc: An Adaptive Hypertext Reading System, User Modeling and User-Adapted Interaction 4(1): 1-19, 1994.
[BE98] P. Brusilovsky, J. Eklund, The value of adaptivity in hypermedia learning environments: a short review of empirical evidence, in Proceedings of the Second Workshop on Adaptive Hypertext and Hypermedia, 1998.
[BP97] D. Billsus, M. Pazzani, Learning probabilistic User Models, in Proceedings of the Sixth International Conference on User Modeling, 1997.
[Brus96] P. Brusilovsky, Methods and techniques of adaptive hypermedia, User Modeling and User-Adapted Interaction, v. 6, n. 2-3, 1996.
[Brus97] P. Brusilovsky, Efficient techniques for adaptive hypermedia, in C. Nicholas and J. Mayfield (eds.), Intelligent Hypertext: Advanced Techniques for the World Wide Web, Lecture Notes in Computer Science, Vol. 1326, Berlin: Springer-Verlag, pp. 12-30, 1997.
[CFP99] S. Ceri, P. Fraternali, S. Paraboschi, Data-driven one-to-one web-site generation for data-intensive applications, in Proceedings of the 25th VLDB Conference, 1999.
[CP00] M. Cannataro, A. Pugliese, An XML-based architecture for adaptive web hypermedia systems using a probabilistic User Model, IEEE IDEAS 2000 Conference, 18-20 Sept. 2000, Yokohama, Japan, IEEE Computer Society Press, 2000.
[DeBra99] P. De Bra, Design issues in adaptive web-site development, in Proceedings of the Second Workshop on Adaptive Systems and User Modeling on the WWW, 1999.
[DC98] P. De Bra, L. Calvi, AHA: A generic adaptive hypermedia system, in Proceedings of the 2nd Workshop on Adaptive Hypertext and Hypermedia, Pittsburgh, 1998.
[DBH99] P. De Bra, P. Brusilovsky, G.J. Houben, Adaptive Hypermedia: from systems to framework, ACM Computing Surveys, Symposium Issue on Hypertext and Hypermedia, 1999.
[Fern&al] M.F. Fernandez, D. Florescu, A.Y. Levy, D. Suciu, Catching the boat with Strudel: experiences with a web-site management system, in Proceedings of SIGMOD '98, 1998.
[FGZ00] S. Flesca, S. Greco, E. Zumpano, Modelling and querying XML data, IEEE IDEAS 2000 Conference, 18-20 Sept. 2000, Yokohama, Japan, IEEE Computer Society Press, 2000.
[FKN96] J. Fink, A. Kobsa, A. Nill, User-oriented adaptivity and adaptability in the AVANTI project, in Designing for the Web: Empirical Studies, Microsoft Usability Group, Redmond, 1996.
[HC99] C.S. Horstmann, G. Cornell, Core Java 1.2, vols. 1 and 2, Sun Microsystems Press, 1999.
[HH98] J. Hothi, W. Hall, An evaluation of adapted hypermedia techniques using static User Modelling, in Proceedings of the 2nd Workshop on Adaptive Hypertext and Hypermedia, Pittsburgh, 1998.
[HL99] G.J. Houben, P. Lemmens, A software architecture for generating hypermedia applications for ad-hoc database output, Technical Report, Eindhoven University of Technology, Department of Computing Science, http://wwwis.win.tue.nl/~houben/pub/csr99-16.ps.
[Joerding99] T. Joerding, A temporary User Modeling approach for adaptive shopping on the web, in Proceedings of the Second Workshop on Adaptive Systems and User Modeling on the WWW, 1999.
[Kay95] J. Kay, Vive la difference! Individualized interactions with users, in Proceedings of the 14th IJCAI, Montreal, 1995.
[Maes97] P. Maes, Software agents: humanizing the global computer, IEEE Internet Computing, v. 1, n. 4, July/August 1997.
[MagLa99] MageLang Institute, Fundamentals of Java Servlets, Short course, 1999.
[MagLa99b] MageLang Institute, JDBC - Java database programming, Short course, 1999.
[MT00] F. Murtagh, F. Tao, Towards knowledge discovery from WWW log data, in Proceedings of the International Conference on Information Technology: Coding and Computing, 2000.
[Nijholt99] A. Nijholt, J. Hulstijn, Multimodal Interactions with Agents in Virtual Worlds, in N. Kasabov (ed.), Future Directions for Intelligent Systems and Information Science, Physica-Verlag: Studies in Fuzziness and Soft Computing, 1999.
[Walsh98] N. Walsh, What do XML documents look like?, XML.com Tutorial, 1998.
[W3C00] World Wide Web Consortium, Extensible Markup Language, Recommendation, 2000.
[WHD99] H. Wu, G.J. Houben, P. De Bra, Authoring support for adaptive hypermedia applications, ED-MEDIA 99 Conference, 1999.


Run-time Management Policies for Data Intensive Web sites

Christos Bouras 1,2, Agisilaos Konidaris 1,2
1 Computer Technology Institute-CTI, Kolokotroni 3, 26221 Patras, Greece
2 Computer Engineering and Informatics Department, University of Patras, 26500 Rion, Patras, Greece
e-mails: {bouras, konidari}@cti.gr

Abstract
Web developers have been concerned with the issues of Web latency and Web data consistency for many years. These issues have become even more important nowadays, since the accurate and timely dissemination of information is vital to the businesses and individuals that rely on the Web. In this paper, we evaluate different run-time management policies against real Web site data. We first define the meaning of data intensive Web sites and categorize them according to their hit patterns. Our research relies on real-world Web data collected from various popular Web sites and proxy log files. We propose a Web site run-time management policy that can be applied to various real Web site hit patterns and Web data update frequencies.

1. Introduction
Nowadays the WWW is the most popular application on the Internet, because it is easy to use and keeps all the network functions that are executed throughout a browsing session transparent to the user. The user has the notion that he/she is requesting information and that this information is somehow brought to his/her computer; he/she does not have to know how this happens. Even though most users don't know how the WWW works, almost all of them experience a phenomenon that is formally known as Web latency [5]. Web latency is nothing more than the delay between the time that a user requests (by clicking or typing a URL) a Web page and the time that the Web page actually reaches his/her computer. This intuitive and informal definition of Web latency has urged Web site developers and Web equipment developers to try to reduce the waiting time of users, in order to make their browsing sessions as productive as possible. A solution that has been proposed for reducing Web latency, while at the same time keeping Web data consistent with database changes, is run-time management policies [12, 13]. The issue of implementing run-time management policies for data intensive Web sites is as old as data intensive Web sites themselves. In order to keep up with user demand and show efficient response times, Web sites have resorted to run-time management policies. A run-time management policy may be viewed as a Web server component that is executed in parallel with the Web server's functions and has total (or, in some cases, partial) control over those functions. A run-time management policy receives inputs such as user request rates, database update frequencies and page popularity values. The outputs produced may be the pages that should be pre-computed, the pages that should be kept in the server's cache and the pages that should be invalidated from the cache. A run-time management policy for Web servers is a dynamic element that can be handled in different ways; not all policies are suitable for all Web sites. A lot of work has been carried out in the fields of specifying, executing and optimizing run-time management policies for data intensive Web sites [11, 14].

2. A definition of Data intensive Web sites
In order to propose run-time management policies for data intensive Web sites we must first determine the meaning of data intensive Web sites. The term is used for quite a lot of different "types" of web sites that share a common characteristic: a lot of data is demanded of these web sites at a certain point in time. The term "data intensive web site" differs from site to site. We argue that data intensive Web sites must be further analyzed in order to determine suitable run-time management policies for each one. In order to break down data intensive web sites into categories, we consider two criteria. The first criterion is user demand (meaning the number of requests) in time, and the second is the number and change rate of the dynamic elements of a Web page in these web sites [6]. At this point we must clarify that, for simplicity reasons, we will refer from now on to the default pages of sites (the default page is the page that is transferred to the user when he/she requests a base URL such as http://www.cnn.com; it is also referred to as the initial web site page). Our research is focused on the default page because it is the page that most of a site's requests are directed to (according to the Zipfian distribution of requests). With this fact in mind, we analyze data intensive web site default pages according to the following criteria:
1. The number of requests in time. This criterion is closely related to the geographical topology of every web site and its target group. An obvious example is that of a Greek portal. Many Greek portals can be considered data intensive web sites, but only for a limited (and in general well defined) time slot every day. The peak web access time in Greece is considered to be 09:00 to 14:00 Central European Time (CET). The Greek portals' peak request times are the same. The requests for pages decline after 14:00 CET and come to a minimum overnight. As opposed to Greek portals, which show a peak request time of about 5 hours and then a substantial reduction in requests, major US portals such as cnn.com show significant demand all through the day. The request threshold of major US portals is much higher than that of Greek portals. Even though they experience request curves similar to those shown in Figure 1, due to the fact that US users outnumber users from all other countries their curves do not decline as much as those of Figure 1. In simple terms this means that US portals experience more intense prime-time periods and show much higher request thresholds while not at prime time. In the coming years, with the expected expansion of the Internet in other countries (in Europe, Asia and Africa), the curves of Figure 1 (especially for portals of global interest) will tend to become straight lines. The difference between Greek and US portals has to do with language (Greek versus English) and information importance. It is obvious at this point that a run-time management policy for a Greek portal should be very different from a policy for a major US portal. The problem becomes even more complex when we consider web sites such as the site of the Olympics or web sites set up for election days, which show peak request demand all through the day, but only for a limited number of days (or hours). These sites should be treated as an entirely different category with respect to their run-time management policies.
2. The number and rate of change of dynamic elements in the default page. This criterion is directly related to the database involvement in the construction of the Web page. More dynamic web page elements (also referred to as dynamic components in this paper) demand more interaction between the web page and the database. The rate of change of the dynamic elements in a default web page is mainly related to the nature of the information that they contain. For example, a dynamic element that contains stock market information must be updated frequently. The in.gr portal has a maximum of 4 dynamic elements in its default page, of which only 2 are updated frequently (as of October 2000). The cnn.com site contains 6 dynamic elements in its default page, of which 3 are updated frequently (as of October 2000).
Considering the first criterion mentioned above, we may break down data intensive web sites into the following general categories:
• Web sites that show limited peak demand (e.g. Greek portals)
• Web sites that experience peak and heavy-tailed demand (e.g. US portals)
• Web sites that experience continuous peak demand for a limited number of days (e.g. election sites and sites of sporting events)
All three categories of web sites may contain various numbers of dynamic elements in their default pages, and these dynamic elements may be updated at various time intervals. By combining the two criteria we end up with virtually unlimited categories of web sites. All of these web site categories must be efficiently serviced by run-time management policies. It is obvious that each web site needs a self-tailored run-time management policy. In Figure 1 we have included four different request distributions in time. These distributions come from diverse Web sites. It is obvious that the peak request period is much more intense and demanding in the case of the Greek portal in.gr. The curves in Figure 1 reveal that almost all Web servers show peak response times (limited or extended, less or more demanding). In this paper we propose a general cost estimation for a hybrid run-time management policy that can be tailored to different web site needs, and we then present its performance evaluation. According to the categorization made in this paragraph, we determine that the main aim of a run-time management policy is to efficiently handle the differing request demand of web sites, by servicing different numbers of dynamic elements that have different change rates.

Figure 1. Request distributions (average number of requests per hour of the day) for two popular Greek sites (the Greek School Network, http://www.sch.gr/, and the University of Patras, http://www.upatras.gr/), the proxy server of the University of Patras (proxy.upatras.gr), and the most popular Greek portal, in.gr (http://www.in.gr/, May 2000).

3. Run-time management policies
There are two extreme approaches to handling dynamic data on the web. The first is the on-the-fly creation of dynamic web pages on every user request, and the second is the materialization and caching of all dynamic web pages before any request (or at frequent time intervals). The first approach has the obvious drawback of large response times and the second has the drawback of possibly serving "stale" data to users. In this paper we will not refer to these approaches since, in our opinion, they do not qualify as run-time management policies. We will only refer to run-time management policies that follow a hybrid approach. These policies implement two basic functions at run-time:
• Adapt to varying request demand. This means that they can adapt to match the changes of user requests in time.
• Satisfy varying numbers of default web page dynamic elements with varying update times. This means that they can perform well when handling pages that contain different numbers of dynamic elements, which must be updated at different time intervals.
The issue of materialization and caching of dynamic web pages is one that has been thoroughly discussed in the bibliography [4, 14]. The basic idea behind materializing and caching dynamic web pages is that a web server response is much quicker when serving static, rather than dynamic, pages. This is easily understood, since in the case of static pages the web server does not have to query the database for the data contained in the web pages. A run-time management policy is basically about dynamic web page materialization and caching. There are several issues related to these functions of the policies. These are:




• Cache consistency and invalidation policy. The basic problem behind caching dynamic data is the consistency between the data stored in the database and the data contained in the cached web pages. The problem of keeping caches consistent is a very "popular" problem that has been addressed not only in the context of Web server caches, but mostly in the context of proxy servers. Some of these policies can be applied to Web server caches [10, 8] (a minimal sketch of such a cache follows this list).
• Object caching. Another interesting issue (directly related to the previous one) in Web server caches is the selection of the objects that must be cached. Throughout this paper we have considered that the objects that can be kept in a Web server's cache are only "whole" Web pages. This is not necessarily true: the issue of caching fragments of Web pages has also been presented in the bibliography [1, 2, 3, 7, 9] and is very interesting in relation to our approach, which is presented in the following section.
• Cache size and type. The size of the cache is one of the most important parameters in the caching of dynamic web pages on the server. In the extreme case that an infinite (or a very large) cache were available, many of the problems related to caching would be eliminated. Since, nowadays, large disks are available and fairly inexpensive, the issue of cache size should not be considered very important when referring to disk caches. In the case of memory caches the issue of capacity is still considered very important.
• Caches and web server dependency. A caching scheme for dynamic web pages should be applicable to many different web server implementations. To make this possible, caching schemes should be implemented as standalone applications that are able to exchange information with various web servers.
• Methods of materialization triggering. A materialization policy may be implemented with the use of various techniques. Many frequently updated portals update their data through web forms. A news site, for example, must rely on Web forms because it relies on journalists to update the site and thus requires a user-friendly update environment. The use of web forms causes the triggering of the materialization policy and the consequent update of the Web site pages.
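A minimal sketch of a server-side cache of materialized pages with invalidation driven by database changes is given below; the mapping from database tables to dependent pages and the rendering function are hypothetical, and this is only one simple way of realising the cache-consistency issue discussed above.

```python
class MaterializedPageCache:
    """Cache of materialized (pre-computed) dynamic pages, invalidated whenever
    one of the database tables a page depends on is updated."""
    def __init__(self, render_page, dependencies):
        self.render_page = render_page      # function url -> HTML (queries the database)
        self.dependencies = dependencies    # dict: table name -> set of dependent URLs
        self.cache = {}

    def get(self, url):
        if url not in self.cache:           # miss: materialize and store
            self.cache[url] = self.render_page(url)
        return self.cache[url]

    def on_table_update(self, table):
        for url in self.dependencies.get(table, set()):
            self.cache.pop(url, None)       # invalidate stale pages

# Hypothetical usage.
cache = MaterializedPageCache(render_page=lambda url: f"<html>{url}</html>",
                              dependencies={"quotes": {"/default.html"}})
cache.get("/default.html")        # materialized and cached
cache.on_table_update("quotes")   # default page invalidated
cache.get("/default.html")        # re-materialized on the next request
```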

3.1 Data intensive Web site study results
In order to record the "behavior" of web sites with respect to their default web page update policies, we performed a study that included two well-known Greek portals (www.in.gr and www.naftemporiki.gr) and one "global" portal (europe.cnn.com). The first Greek portal is mainly a news and search engine portal and the second is a stock market web site that includes a stock market quote on its default page. In order to determine their default web page update policy in relation to their request demand, we performed an analysis by issuing default web page requests every 3 minutes. From these requests we were able to measure their response times and whether their default pages had changed since our previous request. The results show a high rate of change in all three sites. The www.in.gr site shows a slight decline in update frequency during the early morning hours, but no substantial changes occur during peak times. The www.naftemporiki.gr site shows a decline in update frequency during peak times, but on average the update frequency remains the same. The europe.cnn.com site shows a standard update frequency at all times. The update frequency results may be interpreted as run-time management policies for the default web pages of the sites. The advertisement policies of the sites may have played a substantial role in these results. A more frequent request policy on our part might have shown different results (since the update frequency of a site may be higher than one change per three minutes), but we did not want to load the sites with our requests. The first column of Figure 2 shows the response times of the three sites after issuing requests to the sites every 3 minutes through a web client that we implemented in Java. The second column contains the corresponding number of changes that occurred to the default page for every ten requests to a web site. A change of the web page was detected through its checksum.
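The measurement procedure can be reproduced with a small client that fetches the default page at a fixed interval, records the response time, and compares checksums to detect changes; the URLs and the 3-minute interval are those used in the study, while the rest is a hypothetical re-implementation in Python (the original client was written in Java).

```python
import hashlib
import time
import urllib.request

def poll(url, interval_seconds=180, rounds=10):
    """Fetch the default page every `interval_seconds`, recording the response time
    and whether the page changed (detected through a checksum of its content)."""
    last_checksum, changes = None, 0
    for _ in range(rounds):
        start = time.time()
        body = urllib.request.urlopen(url, timeout=30).read()
        elapsed_ms = (time.time() - start) * 1000.0
        checksum = hashlib.md5(body).hexdigest()
        changed = last_checksum is not None and checksum != last_checksum
        changes += int(changed)
        print(f"{url}: {elapsed_ms:.0f} ms, changed={changed}")
        last_checksum = checksum
        time.sleep(interval_seconds)
    return changes

# poll("http://www.in.gr/")   # the study also polled www.naftemporiki.gr and europe.cnn.com
```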


Figure 2. Response times of the three sites (whole-day response times for www.in.gr, peak response times for www.naftemporiki.gr, Greek-peak response times for europe.cnn.com) and the corresponding default web page update frequencies (page changes per ten requests).

The main result of the statistics in Figure 2 is that web sites follow update policies that rely on standard time intervals: a web site is updated every n minutes at all times (peak or not). This gave us another motive to construct an efficient run-time management policy.

4. The proposed Run-time management policy
In this section we propose a run-time management policy based on characteristics that we have observed in real-world data intensive web sites. We propose a general model for constructing and validating a run-time management policy. Our run-time management policy rests on the assumption that a policy must be based on the current web server conditions and the current web site request demand.

4.1 Proposed policies and cost estimation
The cost estimation of a run-time management policy is the basis of its performance evaluation. In order to have a theoretical view of the efficiency of a policy, one must perform a cost estimation corresponding to the policy. In this paragraph we perform a cost estimation for the default web page of a data intensive web site. First we refer to a policy, which we call the on-data-change-materialize policy, that materializes every dynamic component of a web page every time it changes. This policy has the advantage that web pages are always consistent with database changes, but it imposes a substantial load on servers in the case of frequently changing web pages. In data intensive web sites, and especially during the period of intensity, the policy described above might not be a very good proposal. Consider an example of a web page that contains three dynamic components.

The data in the database that concern the first component change very frequently (e.g. as in the case of stock quotes). The data of the other two dynamic components change periodically, but with a much smaller frequency. Under the on-data-change-materialize policy, the default page would be re-computed and materialized very frequently because of the frequent changes in the data of the first component, even though the other two components do not need refreshing. The cost of materialization in this case equals the cost of materializing the web page every time any dynamic component it includes needs refreshing. As one can conclude, this is not a very good proposal, especially at peak request times. The other proposal consists of a hybrid model. Our basic concern is peak request demand periods: those are the periods when a web site should perform best. Our hybrid model combines two run-time management policies: the on-data-change-materialize policy, which has already been presented, and an on-data-change-batch-compromise policy that we explain in the following paragraphs. The basic idea is that the on-data-change-materialize policy is executed until the Web server load reaches a certain threshold, after which the on-data-change-batch-compromise policy is executed. The on-data-change-batch-compromise policy attempts to estimate the optimal wait time between default web page materializations. We will attempt to clarify this by presenting the example of three web page dynamic elements, the first of which is a stock quote. The stock quote element has a change frequency f1 = 6 changes per minute, the second element has a change frequency f2 = 2 changes per minute and the third has a change frequency f3 = 0.2 changes per minute. The on-data-change-materialize policy considers these three frequencies independent variables; in simple terms this means that the default page would be materialized 8.2 times per minute, on average. This is very costly and cannot be considered good practice. In our on-data-change-batch-compromise policy, we relate these three change frequencies by implementing a batch materialization model that relies on accepted compromises in data "freshness". In our example the stock quote dynamic element should be refreshed every 10 seconds, the second element every 30 seconds and the third every 5 minutes (300 seconds). The problem here is to combine the web page materializations caused by one dynamic element's change frequency with those caused by another. The obvious solution is to divide all dynamic element change periods by the smallest period. In our case, since the smallest change period is 10 seconds, we would compute 10 seconds / 10 seconds = 1, 30 seconds / 10 seconds = 3 and 300 seconds / 10 seconds = 30. We name the smallest value (e.g. 1) the Minimum Compromise Factor (CFMIN), the second result (e.g. 3) the First Compromise Factor (FCF) and the third (e.g. 30) the Maximum Compromise Factor (CFMAX). The on-data-change-batch-compromise policy may be implemented by choosing any of the compromise factors that were derived. The data on the web page are "more" consistent with database changes as the compromise factor becomes smaller. The compromise factor is a measure of the web page developer's or administrator's willingness to compromise on data "freshness" in order to achieve better Web server performance. After a compromise factor has been chosen, the materialization of the web page is executed in batches.

Let us consider the case where the FCF has been chosen. This means that a web page materialization takes place every 3 changes of the stock quote dynamic element value in the database, which will also include a change in the second element value, and so on. The concept of the CF (Compromise Factor) can also be very beneficial to a run-time management policy. Its value does not have to be the result of the division mentioned above; it may differ significantly from it. This may happen in cases where the administrator believes that the result of the division is unacceptable because the services demanded by users (web site Quality of Service) require a different approach. Our proposal aims at relating the value of the CF to the Maximum Request Handling value (MRH) of the server, which is the maximum number of requests that the web server can fulfill at any point in time, and the Current Request Rate (CRR), which is the current number of requests to the web server. It is obvious that the value of the CF should be greater on a server with an MRH of 100000 and a CRR of 90000 than on a server with an MRH of 100000 and a CRR of 20000. In our proposal the CF value is computed at run-time, at frequent time intervals.
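The derivation of the compromise factors and a run-time choice of CF can be sketched as follows; the exact function relating CF to MRH and CRR is not specified in the paper, so the linear load-based rule used here is purely illustrative.

```python
def compromise_factors(change_periods_seconds):
    """Divide every dynamic-element change period by the smallest one.
    For periods (10, 30, 300) this yields CFMIN=1, FCF=3, CFMAX=30."""
    smallest = min(change_periods_seconds)
    return sorted(p / smallest for p in change_periods_seconds)

def runtime_cf(factors, crr, mrh):
    """Pick a compromise factor according to the current load: the closer the
    Current Request Rate (CRR) gets to the Maximum Request Handling value (MRH),
    the larger the chosen CF (illustrative rule, not the paper's formula)."""
    load = min(crr / mrh, 1.0)
    index = min(int(load * len(factors)), len(factors) - 1)
    return factors[index]

factors = compromise_factors([10, 30, 300])          # [1.0, 3.0, 30.0]
print(runtime_cf(factors, crr=20000, mrh=100000))    # light load  -> 1.0 (CFMIN)
print(runtime_cf(factors, crr=90000, mrh=100000))    # heavy load  -> 30.0 (CFMAX)
```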


5. Performance Evaluation

5.1 Experimental set-up and methodology

The basic requirement that we wanted to ensure in the topology used to evaluate the hybrid run-time management policy was the non-intervention of conditions not related to the policies. In order to ensure this as much as possible we used the topology shown in Figure 3.

Figure 3. The experimental set-up: 20 clients (Client 1, ..., Client 20) on SubNet 1 connected through a switch to the server on SubNet 2.

We tested our policies by implementing the client-server model on campus at the University of Patras. We used a Pentium III computer with 256MB of RAM and a 30GB hard disk running Windows NT as the server, and an Ultra Sparc 30 with 128MB of RAM and a 16GB hard disk running Solaris 2.6 to simulate our clients. Our client program was executed on the Ultra Sparc and was able to simulate simultaneous client requests to the server, running them as independent UNIX processes.

5.2 Evaluation of different types of dynamic web components
In our performance evaluation we tested six different categories of dynamic web components. The categories are shown in Table 1. Only three of the categories are dynamic components; the other three are the corresponding static components. We included them in the evaluation in order to show the big difference between static and dynamic components of web pages.

Table 1. The categories of Web components used in our performance evaluation
1. Simple Query Static component (SQS): a static component that is the result of a simple query.
2. Simple Query Dynamic component (SQD): a dynamic component that includes the same simple query as 1.
3. Multiple Queries Static component (MQS): a static component that is the result of multiple queries to the database.
4. Multiple Queries Dynamic component (MQD): a dynamic component that contains the same multiple queries as 3.
5. Simple Query Static component with large result size (LDS): a static component containing the data of a simple query that returned a substantial amount of data.
6. Simple Query Dynamic component with large result size (LDD): a dynamic component that contains the same simple query as 5, returning a substantial amount of data.


Figure 4. The mean response times of all categories of components versus the number of users (the MQS and LDS static components are not shown clearly because their response times are very similar to those of SQS).

It is obvious from Figure 4 that the dynamic component that contains a lot of data has the largest response times. The second worst response times are shown by the dynamic component that contains multiple queries, and the third worst by the dynamic component that contains a simple query. The static components follow, with smaller response times.

Figure 5. Mean response times for corresponding static and dynamic pages (SQS vs. SQD, LDS vs. LDD, MQS vs. MQD) versus the number of users.

Figure 5 illustrates the large differences between dynamic web page components and static web page components. The difference becomes more obvious as the number of clients grows and, consequently, response times rise. The basic conclusion that may be extracted from this first step of the evaluation is that web developers should implement dynamic web components that do not contain a lot of data and are not the result of complex database queries. Thus the use of views and indexes in web database design can be very useful in relation to dynamic web page elements.

5.3 Evaluation of the hybrid run-time management policy
In this section we evaluate our hybrid run-time management policy. We constructed a default web page consisting of 3 dynamic elements that needed refreshing every 5, 10 and 20 seconds. We issued requests for that web page at a rate of 11 requests per second, through client processes. Twenty clients were monitored and their responses were recorded. We also issued a number of requests to the web server through other client processes in order to achieve a background load on the web server. The evaluation figures contain equalities such as CF=10sec, which must be interpreted as the CF corresponding to a 10-second default web page update frequency. The rationale behind the experiments was to show that different CF values may improve performance under specific server loads.

Figure 6. Mean response times of the default page for 11 requests per second, at 20 clients, with different values of CF (5, 10 and 20 seconds) and different server loads (~30% and ~60%).

Figure 6 shows the mean response times of the default page at the 20 clients, calculated for different values of CF and different server loads. In the case of CF=10, meaning that the default web page was materialized every 10 seconds, we see a clear improvement in the default web page response times when the server load is close to 30% of the maximum. This improvement is even clearer in the case of CF=20. We would expect such a result, since it means that a web page is transferred faster to a client when it is materialized every 20 seconds than when it is materialized every 10 seconds, under specific server loads. The situation becomes more complex when CF=5, since the response times are not clearly better for the lower server load. This is logical, because the rate of default page materialization itself imposes a load on the server. The overall conclusion of Figure 6 is that the CF is directly related to the response times of the server under specific server loads. Thus, the CF should be calculated in a way that results in better server response times. It is obvious that the value of CF should increase as the load of the Web server (which is the result of greater request demand) rises.

6. Future Work and conclusions
Our future work will aim at improving the hybrid run-time management policy described in this paper. We believe that the policy can be improved much further by using what we call Web page component caching. We would like to evaluate our policy together with the schemes presented in [1, 2, 3, 7, 9], which introduce the idea of fragment caching instead of whole-web-page caching. In this way we would like to evaluate schemes that store fragments or materialized web page elements in a cache and integrate them on user request. We would also like to further enhance the computation of the CF with other parameters. In this paper we have only related the CF to the Current Request Rate, the Maximum Request Handling value and the update frequencies of components. We would like to be able to relate it to other parameters, such as the general network state and web page popularity, in the future.


The results are interesting, since it was shown that by adopting a compromise policy, the performance of the web server may improve even in peak conditions. The basic problem is to define the compromise factor that we named CF. We believe that our hybrid policy may be enhanced much further in order to reduce web latency that is related to web servers.

7. References
[1] J. Challenger, P. Dantzig, D. Dias, and N. Mills, "Engineering Highly Accessed Web Sites for Performance", in Web Engineering, Y. Deshpande and S. Murugesan (eds.), Springer-Verlag.
[2] J. Challenger, A. Iyengar, K. Witting, "A Publishing System for Efficiently Creating Dynamic Web Content", in Proceedings of IEEE INFOCOM 2000, Tel Aviv, Israel, March 26-30, 2000.
[3] J. Challenger, A. Iyengar and P. Dantzig, "A Scalable System for Consistently Caching Dynamic Web Data", in Proceedings of IEEE INFOCOM '99, New York, New York, March 1999.
[4] G. Mecca, P. Atzeni, A. Masci, P. Merialdo, G. Sindoni, "The Araneus Web-Based Management System", in Exhibits Program of ACM SIGMOD '98, 1998.
[5] M.A. Habib, M. Abrams, "Analysis of Sources of Latency in Downloading Web Pages", WebNet 2000, San Antonio, Texas, USA, October 30 - November 4, 2000.
[6] B.E. Brewington, G. Cybenko, "Keeping Up with the Changing Web", IEEE Computer Magazine, May 2000.
[7] C.E. Wills, M. Mikhailov, "Studying the impact of more complete server information on web caching", 5th International Web Caching and Content Delivery Workshop, Lisbon, Portugal, 22-24 May 2000.
[8] K. Yagoub, D. Florescu, V. Issarny, P. Valduriez, "Caching Strategies for Data-Intensive Web Sites", in Proceedings of the International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, 10-14 September 2000.
[9] A. Iyengar, J. Challenger, "Improving Web Server Performance by Caching Dynamic Data", in Proceedings of the USENIX Symposium on Internet Technologies and Systems, Monterey, California, December 1997.
[10] M.F. Arlitt, R.J. Friedrich, T.Y. Jin, "Performance Evaluation of Web Proxy Cache Replacement Policies", HP Technical Report HPL-98-97, available at http://www.hpl.hp.com/techreports/98/HPL-98-97.html.
[11] A. Labrinidis and N. Roussopoulos, "On the Materialization of WebViews", CSHCN Technical Report 99-14 (ISR T.R. 99-26), in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '99), Philadelphia, Pennsylvania, USA, June 1999.
[12] Y. Li and K. Lu, "Performance Issues of a Web Database", in Proceedings of the Eleventh International Workshop on Database and Expert Systems Applications, Greenwich, London, UK, 4-8 September 2000.
[13] D. Florescu, A. Levy, A. Mendelzon, "Database Techniques for the World-Wide Web: A Survey", SIGMOD Record, 27(3):59-74, 1998.
[14] B. Proll, H. Starck, W. Retschitzegger, H. Sighart, "Ready for Prime Time - Pre-Generation of Web Pages in TIScover", WebDB '99, Philadelphia, Pennsylvania, 3-4 June 1999.

