DVQ: Towards Visual Query Processing of XML Database Systems

Share Embed


Descrição do Produto

World Wide Web: Internet and Web Information Systems, 6, 233–253, 2003  2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

DVQ: Towards Visual Query Processing of XML Database Systems SHIHUI ZHENG, AOYING ZHOU and LONG ZHANG [email protected] Department of Computer Science & Engineering, Fudan University, Shanghai 200433, China HONGJUN LU [email protected] Computer Science Department, Hong Kong University of Science & Technology, Hong Kong, China

Abstract XML has been recognized as a promising language for data exchange over the Internet. A number of query languages have been proposed for querying XML data. Most of those languages are path-expression based. One difficulty in forming path-expression based queries is that users have to know the structure of XML data against which the queries were issued. In this paper, we describe a DTD-driven visual query interface for XML database systems. With such an interface, a user can easily form path-expression based queries by clicking elements in the DTD tree displayed on the screen and supplying conditions if necessary. The interface and the query generation process are described in detail. Keywords:

XML, DTD, visual query, path expressions

1. Introduction The Extensible Markup Language (XML) is an emerging standard for data representation and exchange on the Internet. There is a consensus among the researchers that XML is as an easy-to-write, easy-to-parse language to exchange data in a variety of applications on the Internet. Currently, most business data is stored in relational or object-relational systems. It will continue to be so since the matured relational technology provides excellent queriability, scalability and availability. It becomes a natural choice to store XML documents in relational systems to make use of such advantages and hopefully ease the difficulties in integrating XML documents into other applications. A large amount of research work has been conducted to study issues related to storing XML documents using relational database management systems [9,15,22,23]. Storing an XML document into a relational database system requires to map the document into relational tables. In other words, a relational schema must be defined. Yoshikawa and Amagasa categorize such XML-Relation mapping into two categories: structure-mapping-based approach and model-mapping-based approach [27]. The structure-mapping-based approach generates a relational schema from an XML document based on its logical structure described by its DTD (Document Type Descriptor). The model-mapping-based approach generates a relational schema for an XML document based on its representation model (e.g., trees) without understanding its logical structure.

234

ZHENG ET AL.

To query XML data, several XML query languages were proposed including Lorel [1], XQL [21], XML-QL [8], XPath [6], and XQuery [4]. Although those query languages differ from each other in ways of expressing queries, one of the common features of XML queries is that they are mainly path expression based, which is rather different from setoriented relational query languages, such as SQL. Compared to querying relational databases, querying XML data is more difficult since users need to know the structure of the data. When XML data is stored in relational systems, this becomes much more difficult, especially when structure-mapping based approach is used for schema mapping as the original structure may not be well reflected by the relational schema. While storage of XML and related query processing techniques received well attention, the efforts in providing users with facilities to ease the difficulty of writing path-expression based XML queries are very limited. In this paper we present DVQ [28], a DTD-driven visual query interface for an XML-Relational database system VXMLR [29]. VXMLR adopts a structure-mapping based approach to map XML data into relational tables managed by a commercial RDBMS. Its query interface, DVQ, displays the structure of the stored XML data on the screen. By simple clicking the nodes in the DTD structure and entering related conditions, a user can easily form path expressions. DVQ then generates the path expression based queries automatically. The query results are represented by XSLT and delivered to users through DVQ. With DVQ, VXMLR users navigate the nested structure of the stored XML data, form queries and browse the result documents through DVQ seamlessly. There are the following advantages of DVQ: • It provides users with a graphical interface so that complex queries can be formed by users’ simple GUI actions. Not only XML experts, but also ordinary users can specify queries without the knowledge of the XML query languages. • It is powerful enough to specify regular path expressions with wildcards and any complicate conditions with conjunction, disjunction and negation. At the same time, the WYSWYG (What You See is What You Get) feature of DVQ makes query formulation rather straightforward. • In addition to enter queries, DVQ also provides facilities for users to navigate through the XML structure and to view the intermediate results of the major steps in query processing. • It is a separate module running at the client side and driven by DTD of original XML data. That is, it is independent of the underlying XML-relational mapping schema. So, it is a portable module that can be used in any XML database systems. The remainder of this paper is organized as follows. In Section 2, we discuss the related work. In Section 3, we introduce some background information. In Section 4, we describe the architecture of DVQ. Section 5 describes how DVQ forms path expressions with set constraint conditions using examples. In Section 6, we present DVQ’s function of monitoring XML query execution and displaying the query results. Finally, conclusions are presented in Section 7.

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

235

2. Related work Although a number of XML query languages, such as Lorel [1], XML-GL [3], XQL [21], XML-QL [8], Quilt [5], XPath [6], and XQuery [4] have been proposed, there are still very few visual query interfaces for XML. In particular, there is no visual query interface fully supporting regular path expression formation still. PESTO [2] first proposed the notion of fusing query formulation and result browsing seamlessly. But PESTO is an interface for object databases. It cannot browse and query XML data. WebOQL [17] is a SQL-style query language proposed to query the WorldWide Web based on graph models. The interface of WebOQL can specify the select parts and conditions on Web content. However, it does not support path expressions either. The DataGuide [12,13] interface in Lore system [16] can generate simple Lorel [1] queries by exploring the DataGuide of semistructured data. However, DataGuide does not present data structure as accurate as DTD. For example, conjunction, disjunction, and repetition (“*” and “+” etc.) are not shown in DataGuide. Moreover, the DataGuide interface can only specify simple path expressions, and set conjunctive conditions on only a unique path. EquiX [7] provides a simple, form-based GUI interface based on DTD. Users can specify queries easily using EquiX. However, the expressiveness of EquiX is too limited to form regular path expressions. The BBQ [19] interface of MIX [20] also displays the DTD structure of XML data to facilitate the user to browse and query XML data from multiple data sources. Still, it cannot construct arbitrary regular path expressions and complex constraint conditions yet. Another related work is XRS [24,25], an XML indexing and retrieval engine. The interface of XRS is similar to DVQ in that DTD tree structures are provided for ender users and a Java component is used to render the XML output into the HTML. However, XRS is a structured search engine, rather than an interface for XML database systems. XRS allows a user to form text-retrieval style queries that retrieve document fragments containing a certain keyword and returns elements sorted by the similarity of them to the queries. It supports neither regular path expressions, nor constraint conditions like value comparison, path join, type coercion, and etc. There are also a few efforts on visual query language for XML documents. XML-GL [3] is a language proposal that represents XML documents and queries as trees/graphs. However, XML-GL provides only a visual syntax for XML documents, but no visual interface for end users to form queries. In [10,11], Erwig proposed a rules-based visual language, Xing, for querying and transforming XML data. In Xing, XML documents are displayed in a form-like way as nested rectangles and queries are expressed by several document patterns, which support union and wildcard (∗) on tags and binding on elements arbitrarily deeply nested in a document. However, no visual interface for Xing has been reported.

3. Background information In this section, we introduce some background information about XML data and XML queries.

236

ZHENG ET AL.

member(name,email?,URL?,publication*,project*)> project(projtitle,introduction?,member*)> member ID ID #IMPLIED> publication (title,publisher,year)> labname (#PCDATA)> projtitle (#PCDATA)> name(lastname,firstname)> lastname (#PCDATA)> firstname (#PCDATA)> email (#PCDATA)> URL(#PCDATA)> introduction (#PCDATA)> title(#PCDATA)> publisher (#PCDATA)> year(#PCDATA)>

Figure 1. A sample DTD.

3.1. A DTD sample Unlike HTML documents where their structures are usually unknown, the data type definition (DTD) of an XML document describes the nested relationships among data elements in the document. A DTD example is shown in Figure 1, and we will use it as a running example in the following discussion. The DTD in Figure 1 describes XML documents about the information of laboratories. There are two basic constructs in XML documents, element and attribute, indicated by !ELEMENT and !ATTLIST, respectively. Elements in XML documents can be nested. Each document has a unique root element, which is labinformation in our example. It contains three subelements, labname, project, and member. Character * after project indicates that a laboratory element instance can have zero or more projects. + indicates that a laboratory element instance may have one or more members. Each member element has a unique ID attribute, and sub-elements name, email, URL, publication and project. The sign ? after URL indicates that this element is optional, that is, a member instance may not have a URL. The contents of an element is denoted by #PCDATA. The other portion of the DTD can be interpreted in a similar way. Note that member and project are defined recursively. That is, a member element can have project as its subelement, and vice versa. 3.2. Path expressions We can see from the DTD example that data elements in an XML document are nested to form a tree structure (a graph with IDREF). Figure 2 depicts an example document conforming to our DTD example. Its graphical representation in a data graph is shown in Figure 3. With such a structure, we can represent an element by the label path consisting

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

XML Aoying Zhou [email protected] An Algorithm ... 1999 Journal of Computer Science ... Zengping Tian Data Mining Figure 2. A sample document fragment.

Figure 3. The data graph of sample document.

237

238

ZHENG ET AL.

Figure 4. Reference architecture of VXMLR.

of the label sequence of nodes from the root of the tree to the elements. For example, if we adopt the dot notation to denote the parent–child relationship, the path for a member’s name can be expressed as project.member.name. In general, the path expressions can be more complex. If r, r1 , and r2 are elements or attributes, a path expression has the form of r = (r) ∗ |(r) + |(r)?|r1 .r2 |(r1 |r2 )|#|name, where regular operators ∗, +, ? mean 0 or more, 1 or more, and 0 or 1 occurrences, respectively. Concatenation r1 .r2 is used to form a path from r1 to r2 . Alternation “|” stands for disjunction. Wildcard “#” denotes arbitrary occurrences of any regular expressions. We distinguish two types of path expressions: simple path expression (SPE) and regular path expression (RPE). Simple path expressions consist of only element or attribute names like labinformation.project.member.name. Regular path expressions contain regular operators. For example, path expression #.(project.member)*.name is a RPE.

3.3. VXMLR: A visual XML-Relational database system The detailed description of VXMLR is out of the scope of this paper. To illustrate the context where DVQ works, we depict the architecture of VXMLR as shown in Figure 4. In VXMLR, XML documents are stored in a relational database system, managed by a relational DBMS. An input XML document is first parsed into a DOM tree. At the same time, the DTD for the document is extracted. The document tree is then mapped to relational tables and stored in the database. Users issue queries through DVQ, the visual interface.

239

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

(a) Project pathID

projectID

P1 P4

1 64

projtitle

introduction

parentID

null null

null 23

XML Data Mining (b) Member

pathID

memberID

P2 P2

3 23

name

firstname

lastname

null null

Aoying Zengping

Zhou Tian

ID

email

URL

parentID

member1 member12

[email protected] null

null null

1 1

(c) Publication pathID

publicationID

P3

28

title

year

publisher

An Algorithm . . .

1999

Journal of . . .

parentID 3

(d) PathDirectory pathIDpath P1 P2 P3 P4

project project.member project.member.publication project.member.project ...

Figure 5. The stored relations and path directory.

Those queries are then rewritten into SQL statements to be submitted to the underlying relational DBMS. To generate efficient SQL statements from path expression based queries, VXMLR maintains some statistics of data and a path directory. Both the statistics and the path directory are used in the query rewriting process to reduce the number of SQL statements and simplify join conditions. The query results returned from the DBMS are constructed and expressed using XSL, which are then delivered to the users through DVQ. In VXMLR, XML documents are mapped to relational tables based on the Shared Inlining method proposed in [23]. The main idea of Shared Inlining is to use the simplified DTD graph of a document to generate a relational schema, and then transform the document into relations. Using this approach, the elements without parents and the child elements with 1 : n relationships to their parents are stored in separate relations. Conversely, child elements appearing exactly once for each occurrence of their parents are inlined into its parent elements’ relations as attributes. If a DTD graph contains a cycle, a separate relation is also made, which inlines all of the mutually recursive nodes on a cycle. Each tuple in the generated relations is stored with the identifier of the node corresponding to this tuple. Another column, parentID, is also added to the relation, to link the relation to the relation corresponding to its parent in the DTD graph. For our example document in Figure 2, Shared Inlining will create 3 relations Project, Member, and Publication, as shown in Figure 5.

240

ZHENG ET AL.

Path directory is a generalization of 1-Index in semi-structured data [18]. 1-Index is based on the notion of bisimulation among nodes in a data graph, which is extensively studied in [14]. A binary relation ∼ =b on a data graph is called a bisimulation if, for any two nodes u and v, u ∼ =b v, we have (1) u and v have the same label; (2) if u is the root, then v is the root, and vice versa; (3) if u and v are not roots, then there is a bisimulation between the parents of u and v. Two nodes u and v are bisimilar, if there is a bisimulation between them. For example, node 3 and 13, 26 and 63 in Figure 3 are bisimilar. Based on the bisimilarity, the structure of an XML document can be reduced to a concise graph, that is, the 1-Index. A node in a 1-Index represents an equivalent class of bisimilar nodes in a data graph. Each node has an extent that keeps the identifiers of the corresponding nodes in the data graph. An edge in 1-Index represents the parent–child relationship between the corresponding equivalent classes. A path directory can be viewed as a 1-Index built in relational databases. It keeps all of the individual label paths in a 1-Index and the identifer of equivalent class reached by the label path. In brief, a path directory is a relation PathDirectory(path, pathID), where each tuple stores a path (label sequence starting from root) reaching those elements (in the data graph) mapped as tuples, and a pathID, which is the unique identifer for the target node of the path in 1-Index. Further, we add an attribute PathID in each relation to store the identifier of equivalence class of the elements corresponding to the tuples in this relation. Note that a node in a 1-Index stands for an equivalence class of elements in a data graph. The pathID makes the path directory and the database associated. The target elements reached by a path (in the data graph) can be located through a join between path directory and the target relation directly. The detailed processing will be addressed in the later sections. As an example, Figure 5 also shows the path directory for the data graph shown in Figure 3.

4. DVQ: The visual query interface VXMLR system was implemented on top of Microsoft SQL Server. The server side program is implemented in C++. The client side program, i.e., the visual query interface, which runs on a browser, is written in Java. Figure 6 shows the reference architecture of DVQ [28]. The portion above the dashed line is the client side of VXMLR, while the portion below is the server side. The system works in the following way: at first, once a DTD descriptor is selected by a user, the DTD Sender transfers the encoded DTD information to DVQ by way of ISAPI. Next, the DTD Receiver at the client side receives the decoded DTD and the DTD Tree Generator parses it into a hierarchical structure, which then is displayed in the Query Interface. Once a user has constructed the query and clicked the Submit button, the Query Generator module generates path expression queries automatically. The Query Sender module then sends the generated query to the server. At the server side, the Query Receiver decodes the query and submits it to the Query Engine of VXMLR. The Query Engine translates the query into efficient SQL statements, which is then executed by the underlying RDBMS. VXMLR Query Engine constructs the result

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

241

Figure 6. The reference architecture of DVQ.

document using the result relational tables returned by the RDBMS. Finally, the result document is represented by the XSL Processor and transferred to the browser for displaying. Figure 7 is the screen dump (main window) of DVQ interface. It consists of two panels and some buttons. The left panel displays the DTD structure of XML data to be queried. It is the sample DTD in Figure 1 as a hierarchical directory, where elements and attributes are represented by their names, preceded with diamonds or dots, respectively. Sub-elements or attributes are expressed as the children of their parent elements. User can click the icon before an element to explore or collapse its subtree. Recall that project and member are defined recursively. The recursion is represented by “project-member” and “member-project” in the DTD panel. The DTD panel provides an easy way for user to navigate the nested structure of the stored XML data conforming to the DTD displayed. To the right of the DTD panel is two tabbed panels, Selection and Condition. The Selection panel is used to form the target items (i.e., a list of path expression) to be retrieved. The Condition panel is for user to enter the conditions that the retrieved items must satisfy. The basic components of both panels are path expression. At the bottom of the main window are four buttons and a check box. The leftmost is the Reset button, which lets users to clear up the constructed items in the Selection and Condition panels. The Preview button brings up a dialog that displays the generated query for users to preview. To the right of the Preview button is a XSLT check box, which indicates that the query result will be returned as a plain XML file or an XML page expressed by XSLT. To the right of the XSLT check box is the Submit button, which triggers the query to be submitted to the VXMLR server for executing. When the rightmost Monitor button is pushed, an XML page showing the elapsed time and query statements in each step of query processing, will be displayed.

242

ZHENG ET AL.

Figure 7. The DVQ interface.

5. Forming visual queries A user query session in VXMLR consists of four steps: first, a user browses the DTD structure and forms the output items by clicking nodes in the DTD structure and set constraints through the Condition panel. Second, after the user completes the specifications of the output items and constraints, a query in the form of path expressions is generated. Third, the generated query is submitted to the VXMLR server for executing. Finally, the result is returned from VXMLR server to the interface for user to browse. 5.1. Forming path expressions We first introduce the Selection panel and then show how to formulate path expressions in the select part of a query. Figure 8 shows the DTD panel and Selection panel. The Selection panel consists of three text boxes, Candidate Item, Output Items, and Ready Path Expression, and some buttons. The first text box on the top of Selection panel, Candidate Item, is used for showing the path expression being constructed. There are four buttons under the Candidate Item text box. The first three buttons are regular operator buttons

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

243

Figure 8. Specifying simple path expressions by SELECT panel.

labelled with “?”, “*”, and “+”, respectively, which trigger those regular operators being added to the current path expression shown in the select text box. To the right of the regular operator buttons is the Add Item button, which brings up the path expression in the Candidate Item text box added into the Output Items text box under those buttons. Now, we show by example how to form simple path expressions. A simple path expression is constructed in a natural way by clicking the nodes in the DTD panel following top-down links. Clicking nodes on a path starting from the root triggers the names of the clicked nodes to be appended into the Candidate Item text box consecutively. Figure 8 shows that the nodes labinformation, project, member and publication are clicked consecutively. The generated path expression is SPE, labinformation.project.member.publication which is displayed in the Condition Item text box. The formation of regular path expression is a little complicate. In DVQ, regular operator “#” and “|” are inserted into the current path expression automatically, while operator “*”, “+”, and “?” need to be inserted by users manually. At first, if the element first clicked is neither the root element, nor the child of the element in the tail of the current path expression in the Condition Item text box, then there is at least one element between the tail of the current path and the clicked element. A “#” operator is inserted before it automatically. For example, as shown in Figure 9(a), user clicks the project element

244

ZHENG ET AL.

(a)

(b)

Figure 9. Example for specifying regular path expressions by DTD panel (a) and SELECTION panel (b).

first followed by member and name elements. The first clicked element, project, is not the root. Those actions form a regular path expression #.project.member.name Second, the different branch paths starting from the same node are treated as disjunctive. When two branch paths are specified by a user, “|” operator will be added between them automatically. Assume that a user first forms path expression #.project.member.name, and then clicks element publication. The formed path expression will be disjunctive expression #.project.member.(name|publication) Finally, a user can insert the regular operators “*”, “+”, and “?” manually by first highlighting a segment in the path expression, followed by clicking the corresponding operator buttons. The operators are then added to the path segment automatically. In Figure 9(b), highlighting path segment project.member in expression #.project.member.name followed by clicking the “+” button forms a regular path expression #.(project.member)+.name Now the user clicks the Add Item button. It triggers the current path expression in the Candidate Item text box to be appended to the Output text box. Users can specify target items and add them into the output items continuously. Alternatively, users can also use the Remove Item button under the Output Item text box to remove one or more path expressions from the Output Item text box. At last, the path expressions in the Output Item text box will be used to construct the select part of the query.

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

245

Figure 10. Forming constraints with CONDITION panel.

5.2. Setting constraints on output items Constraints can be placed on the paths in output items by Condition panel. Figure 10 shows the Condition panel. It consists of three text boxes, three pop-menus and several buttons. As in the Selection panel, the Path Expression text box on the top of the Condition panel and the three regular operator buttons under it are for users to form path expressions, on which constraints are then imposed. Below the Condition panel are two text boxes labelled with “Conditions” and “Condition List,” which let the user combine multiple constraints to form complex conditions. To the left of the “?” operator button is the comparator popup menu, which consists of a set of basic comparators (“=”, “>”, “>=”, “=2000 is formed. At this point, users can click the “ADD” button under the Expression text box to add the above condition into the Condition List text box or withdraw the formed condition by clicking the Clear button under the expression text box. In the Condition List text box in Figure 10, the condition has been added into the Condition text box. DVQ is able to construct arbitrary complex conditions with “AND”, “OR” and “NOT” predicates. Figure 10 also shows how to form complex conditions using the Predicate popup menu at the bottom of the Condition panel. A user first highlights that the two conditions have been added to the Condition text box, and then selects the “AND” predicate from the Predicate popup menu. Those actions bring up the two conditions in the Condition text box to be connected by “AND” predicate and the following conjunctive condition #.(project.member)+.publication.title LIKE ’%XML%’ AND #.(project.member)+.publication.year>=2000 is formed. Continuing to add expressions and predicates, arbitrary nested complex conditions composed of conjunction, disjunction and negation can be composed also. The generated conditions are simultaneously displayed in the Conditions text box above the Condition List text box. To the right of the predicate popup menu of the two buttons are labelled with “Clear” and “Add”, which bring up the conditions in the Condition text box to be removed from or added into the WHERE clause of the formed query.

5.3. Executing queries and browsing results After a user specifies the path expressions and the related conditions as shown in Figure 9(b) and Figure 10, DVQ will form the query automatically. User can preview the generated query by clicking the Preview button in the right part of the main window. Figure 11 shows the dialog triggered by the click action, which displays the generated query. Note that variables are added into the queries. DVQ converts the query into the forms of Figure 11 automatically. Before describing the query generation algorithm, we introduce a notation, “frequent common prefix.” Given a set of path expressions S, we say

DVQ: TOWARDS VISUAL QUERY PROCESSING OF XML DATABASE SYSTEMS

247

Figure 11. The preview dialog box displaying the generated query.

a path expression p is the frequent common prefix of S, if (1) p is the common prefix of more than one path expressions in S, and (2) there does not exist another common prefix q, q is the prefix of p. The generation algorithm picks out all of the frequent common prefixes appearing in the path expressions formed in the output items and constraint conditions, and replace them with variables. The algorithm is described as follows. Step 1. Find a frequent common prefixes p in all of the path expressions formed in the output items and constraint conditions. Step 2. Replace prefix p with a variable xi in all of the path expressions in the output item and constraint conditions. Step 3. Add p to the FROM clause. Step 4. Set i = i + 1, and repeat Step 1 until there is no common prefix of two or more path expressions. Step 5. Add all of the path expressions in the output items into the SELECT clause and those in the constraint conditions into the WHERE clause. For our example, #.(project.member)+ and publication are two common prefixes. The generation algorithm first finds the frequent common prefix #.(project. member)+ first, and replaces it with variable $x1. After that, publication is picked out and replaced with variable $x2. The final SELECT clause consists of the target item $x1.member.name and WHERE clause consists of $x2.title LIKE ’%XML%’and $x2.year>=2000 The generated query can be submitted to the VXMLR sever by clicking the Submit button. Then the query session enters into the third step: VXMLR server translates the query into SQL statements and submits it to the underlying RDBMS, and then transforms the query results into XML documents. If the XSLT check box is selected, then the XSLT

248

ZHENG ET AL.

Figure 12. The result document represented by XSL.

sheet of the document is also returned. Finally, the result document is displayed in the browser and delivered to users. For example, the following query Select $x1 From sigmod.issue.articale.article $x1 Where $x1.initialpage>=100 and $x1.initialpage
Lihat lebih banyak...

Comentários

Copyright © 2017 DADOSPDF Inc.