Using regular tree automata as XML schemas

June 12, 2017 | Autor: Boris Chidlovskii | Categoria: Digital Libraries, Query Optimization, XML Schema, Tree Automata

Descrição do Produto

Using Regular Tree Automata as XML schemas Boris Chidlovskii Xerox Research Centre Europe, Grenoble Laboratory 6, Chemin de Maupertuis, F–38240 Meylan, France E-mail: [email protected]

Abstract We address the problem of tight XML schemas and propose regular tree automata to model XML data. We show that the tree automata model is more powerful that the XML DTDs and is closed under main algebraic operations. We introduce the XML query algebra based the tree automata model, and discuss the query optimization and query pruning techniques. Finally, we show the conversion of tree automata schema into XML DTDs.

1 Introduction The Extensible Markup Language (XML) is the universal format for structured documents. Based on SGML (ISO 8879), XML is designed to meet the challenges of largescale electronic publishing and data exchange on the Web. XML documents are composed of nested elements and the logical structure of elements is defined by Document Type Definitions (DTDs). Defined as grammars, DTDs are highly flexible; the structure they impose in documents is often less restrictive than the rigid structure of relational schemas but more restrictive than allowing any-toany relationships between object types. The knowledge of DTDs is highly useful for the query formulation through the user graphic interface and for the XML query optimization [6, 17]. However, despite all these features, DTDs appear to be somewhat unsatisfactory as a schema formalism [3, 21]. Some obvious missing things in DTDs, such as elementary data types, structural and validation constraints, have been recently addressed in the W3C XML Schema proposal [28]. There remains however another conceptual difficulty with DTDs, not addressed by the W3C proposal. The DTDs appear to be surprisingly inflexible when one attempts to capture modifications imposed by various XML retrieval and manipulation mechanisms. Even for simple XML se0 To

appear in Proc. IEEE Advances in Digital Libraries Conf., 2000

lection queries, DTDs face the tightness problem, when no DTD can precisely represent the structure of query results [21]. To understand the cause of the tightness problem, we have to look at the way SGML/XML define a document structure. The content models of individual elements in XML documents are regular expressions. However, the language defined by an entire DTD is not regular; it is formalized as the extended context-free language (ECFL) [26]. The class of context-free grammars is not closed under certain elementary operations, such as intersection and negation [14] and extending context-free grammars with regular expressions brings in no additional power. Consequently, if an XML query is more complex than a regular expression, we cannot necessarily make a context-free grammar (the result DTD) for the set of documents fitting both the original DTD and the query formulation. There is another important requirement imposed by the SGML standard and XML/SGML compatibility, called grammar unambiguity. The grammar unambiguity is local; it is aimed at parsing SGML/XML files without lookahead and generating understandable DTDs [26, 27]. It turns out that the verification of unambiguity is technically a complex issue and the class of unambiguous grammars is not closed under operations of union and negation. As a consequence, the unambiguity requirement is hardly compatible with the need of automatic manipulation of DTDs; it makes the use of DTDs as XML schemas even more problematic. Fortunately, the situation is not so despairing. Indeed, all elements in XML files have beginning and ending tags; this enriches the XML parsing trees with element tags and allows one to apply existing language theories not to documents, but to the parsing result, that is, the document trees.

1.1

Our contribution

In this paper, we address the problem of tight XML schemas and introduce a novel mechanism for modeling XML documents, based on tree automata. This model owns a number of beneficial properties similar to the string

regular expressions; in particular, tree automata form the boolean algebra. With the tree automata model, we consider a powerful set of algebraic operations, defined in the way similar to the relational algebra. For any query expressed by these operations, a precise schema represented by a tree automaton can be obtained in polynomial time. The algebraic operations allow using regular path expressions in query formulation and we show how XML schema given by tree automata can be used for the pruning of path expressions. The tree automata mechanism is more powerful than DTDs. Thus, any DTD can be correctly captured by a tree automaton. On the other hand, we show that translation of tree automata into DTD may require some generalization. We describe the translation procedure and prove that the generalization performed in this way is minimal.

1.2

view provided that sources satisfy given DTDs. There are two drawbacks with the DTD inference proposed in [15, 21]. First, extension of DTDs with the subtyping and context-free languages works for view queries only. Second, they do not address the problem of DTD unambiguity which makes the problem of DTD inference even more complicated. Regular tree languages were studied in the theoretical computer science [22, 24, 23]. Initially studied by Rabin in the context of automata theory, they found different applications to the language theory, decision problems in logic, etc. Murata first proposed to use regular tree languages as a formal model for schemas of structured documents. Recently, Murata proposed to apply tree regular languages for the XML schema transformation [18, 19]. We extend the Murata’s proposal in order to introduce (non)deterministic tree automata and design the XML query algebra. On the other hand, we cope with document trees and avoid constructing document forests permitted by Murata. Our approach is inspired by the relational algebra with its closure of algebraic operations on relations. The remainder of this paper is organized as follows. In Section 2, we present the tree automata model. We introduce regular tree expressions and tree automata and show how they can be used for modeling XML files and capturing DTDs. In Section 3 we prove that the regular tree languages form a boolean algebra. In Section 4, we introduce a set of algebraic operations for querying and modifying XML files. We show that for any composition of the algebraic operations, the tree automata can precisely describe the structure of resulting XML files. In Section 5, we present the translation of tree automata into XML DTDs and discuss the unambiguity for generated DTDs. Section 6 concludes the paper.

Related Work

Research on XML data has been preceded by intensive research on semi-structured data [1, 3, 8, 11]. Semi-structured systems model Web data as edge-labeled graphs [1, 9]. Also, different mechanisms have been introduced as schema for semi-structured data: sets of patterns in the semi-structured data graph [20], data guides [12], graph schemas [10], schema definition language (ScmDL) [5], etc. A number of languages has been proposed to query semistructured and XML data. According to the position papers presented at the W3C Workshop on XML query language, the list of desired features includes basic operations such as selection, reduction and combination, etc. [16]. For any view query, it should be possible to derive a precise description (schema) of the view. Such a description should be derived from the source DTD and the query definition. So far, most query languages for XML, including QLXML, remain view definition languages [2, 9]; they exploit XML DTDs for specific tasks, such as formulating and constraining view queries [17], and the query optimization [6]. However, DTDs fall short of becoming XML schemas. Beyond missing elementary data types and structural constraints, DTDs impose order in XML data, while both relational and semistructured models adopted by the database community, cope with unordered tuples/objects. The DTD inference problem has been addressed in [15, 21] that introduced the XMAS language and mediator architecture for XML files. They considered the DTD inference for XML view queries and found that, even for the selection queries which extract a list of elements from the input files, the DTD mechanism exposes serious limitations. As a solution, they extended the basic DTD mechanism with the sub-typing and context-free languages. With these extensions, [15] can infer tight DTDs for the view queries. The tight DTD precisely characterizes the documents in the

2 Tree automata model 2.1

XML and DTDs

An SGML/XML DTD describes the structure of valid documents. It specifies the elements allowed in the document, and for each element, it defines its content model, i.e., the structure of its content in terms of the other elements or elementary data types that can occur in the content. Element content models in SGML/XML are (string) regular expressions. String regular expression r over an alphabet generates a regular language denoted L(r), L(r) . For regular expressions defined over , we use the following notations:

r1r2 denotes for the concatenation of r1 and r2. r1jr2 denotes for the union of r1 and r2. 2

r denotes 0 or more occurrences of r (the Kleene clo-

a

r+ denotes 1 or more occurrences of r. r? stands for rj, where is empty string.

w

sure).

a @ b c S w b d w

Without loss of generality, we make several simplifying assumptions about XML files:

a)

c)

#PCDATA #PCDATA 2.3

Regular tree grammars

A regular tree grammar (RTG) is a mechanism for generating document trees. Any XML schema describes a set of valid documents, and we consider RTG as a formal representation of an XML schema. A regular tree grammar (RTG) is a 5-tuple < ; D; N; P; ns >, where:

Document trees

Let be a finite set of XML element types and D be a finite set of XML elementary data types. Empty XML element is denoted as an assigned empty data type , 2 D. A document tree (or simply tree) over and D is defined inductively as follows:

3.

b)

= fbody, paper, title, Example 1 Let author, journalg and D = f,#PCDATAg. Then, body(paper(title(#PCDATA) author(#PCDATA) journal)) is a document tree (see Figure 1.c). In the XML syntax, this tree can be represented as follows:

Importantly, the SGML standard requires that the content models be unambiguous, when an element in the document can satisfy more than one primitive token in the content model without lookahead. For example, the element definition is ambiguous. When parsing the SGML fragment #7, a parser will lookahead to decide whether string ’#7’ matches the first label element in this definition or the second one. The intent of DTD unambiguity is to avoid generating parsers that have an exponential size. The XML data should provide the upper compatibility to SGML and therefore DTDs for XML files, both sources and views, should be unambiguous as well.

2.

Figure 1. Examples of document trees.

2. XML files do not have mixed content elements, i.e., no element declaration mixes data types with elements.

1.

paper !! Q ! Q ! Q ! author journal title #PCDATA #PCDATA

1. We do not include attribute type declarations and entities in the DTDs.

2.2

body

is a finite set of element types, 2. D is a finite set of data types, 3. N is a finite set of non-terminals, 4. P is a finite set of production rules of the form n ! a(r), where n is a non-terminal in N , a is a symbol in and r is either w 2 D or a regular expression over alphabet N , L(r) N , 5. ns is the starting non-terminal, ns 2 N . If r in the production n ! a(r) is a data type w 2 D, the production replaces the non-terminal n with the tree a(w). If, instead, r is a regular expression, the production replaces n with a tree a(t1 t2 : : :tk ), where t1 t2 : : :tk 2 L(r). A regular tree grammar G allows a document tree t, if it is derived from the starting non-terminal s by applying productions 1.

(the null tree) is a tree, a(w), where a 2 and w 2 D, is a tree, a(t1 t2 : : :tk ), k 1, where a 2 and ti is a tree, i = 1; : : :; k, is a tree,

4. nothing else is a tree. It follows from the definition that all internal nodes of a document tree are ordered. Figures 1.a and 1.b give two tree examples: a(w) and a(b(w)c(b(w)d())). In the trees, all elements of (namely, a and b) are used to label non-leaf nodes, while elements of D (namely, w and ) are used to label leaf nodes. We abbreviate a() as a. Thus, the second example can be rewritten as a(b(w)c(b(w)d)). 3

P = f n1 ! n2 ! np !

of P and does not contain non-terminals. The regular tree language L(G) is the set of document trees allowed by the grammar G. It is straightforward to construct an RTG that capture a given DTD. As an example, consider the following DTD for bibliographic data:

paragraph (#PCDATA)>

Clearly, this DTD also allows document trees not allowed by grammar G2 . In Section 5, we provide the full procedure for translating RTGs into DTDs.

Figure 2 shows the derivation steps for the document tree in Figure 1.c. The derivation starts with the non-terminal nb (Figure 2.a), the application of productions for nb and np are shown in Figures 2.b and 2.c. Finally, the application of productions for nt, na and nj will give the tree in Figure 1.c

n

g

Productions for non-terminals n1 and n2 have section in the right-hand side. However, the former has content model np n2, and the latter has the content model np . This implies that top-level sections can have subordinate sections, but these subordinate sections cannot have subordinate sections. No DTD can exactly capture this RTG, since every occurrence of segments is forced to have the same content model. The smallest DTD that allows the same document trees as the RTG is obtained by generalization as follows. The content of a section element is either paragraph* section* or paragraph*. Therefore, the minimal DTD covering the above RTG is the following:

body (papers*)> paper (title,author*,journal?> title (#PCDATA)> author (#PCDATA)> journal (#PCDATA | EMPTY)>

; D; N; P; nb >, where fbody; paper; title; author; journalg; D f#PCDATA; g; N = fnb; np ; na; ntg and

section (np n2 ), section (np ), paragraph (#PCDATA)

na

2.4

Regular tree automata

To recognize whether a document tree belongs to a regular tree language, we build tree automata. Analogously to finite-state automata, tree automata can be deterministic and non-deterministic. The difference however is that tree automata can work in an ascending and descending manner. The automata we build below are ascending ones; they start at leaves of document trees and terminate in document roots. We remark that the descending automata have the same power as the ascending ones [23, 24].

nj

Definition 1 A tree automaton (TA) A is 5-tuple < ; D; Q; ; F >, where ; D; Q are finite sets of element types, datatypes and states, respectively; is a function : E ! Q, where E is either an element of D, E = w 2 D, or a regular expression over Q, L(E) Q; F is a set of final states, F Q.

c)

Figure 2. Document tree derivation: a) starting non-terminal nb ; b) application of rule for nb; c) application of rule for np .

The RTG G2 in Section 2.3 can be captured by the tree automaton M2 =< ; D; Q; ; F >, where = fsection, paragraphg, D = f#PCDATAg, Q = fq1; q2; qp g, F = fq1g and

Any DTD can be expressed as an RTG, however not every RTG has the corresponding DTD. Consider another example. Let RTG G2 be < ; D; N; P; n1 >, where = fsection, paragraphg, D =f#PCDATAg, N = fn1; n2; np g and

(section; q2 qp ) (section; qp) (paragraph; #PCDATA) 4

= = =

q1 , q2 , qp .

(1) (2) (3)

Figure 3 shows an example of document tree belonging to the regular tree language L(G2 ) and the “traces” of the tree recognition with TA M2 . Each internal node in the tree is marked with the number of rule applied (1,2 or 3; see the definition of M2 above). Rules are applied to leaves first and continue in the ascending way. Although the automaton has not the starting state and the sequence of rule applications can vary, a more important issue is the choice of rules to apply. While all paragraph nodes in Figure 3 can be processed uniquely with rule 3, two section nodes marked with rule 2 have the non-deterministic choice between rules 1 and 2. The following section discusses the determinism and non-determinism of tree automata in detail. 1 2 3

section

2

To build a deterministic tree automaton (DTA), we first get rid of the mixed non-deterministic rules. If two rules (a; E1) = q1 and (a; E2) = q2 are non-deterministic, we rewrite each as two rules: (a; L1 \ L2 ) = q1, (a; L1 ? L2 ) = q1 and (a; L1 \ L2) = q2 and (a; L2 ? L1 ) = q2, where \ and ? are operations of intersection and difference of two regular expressions. By such a rewriting, we obtain the set of clean non-deterministic rules. Then, we can apply the subset construction used for the string automata. As an example, consider the translation of the above nondeterministic tree automaton (NTA) M2 (see Section 2.4) into a deterministic one. As the first and second rules in M2 are mixed non-deterministic, we analyze their regular expressions. We have (q2 qp ) \ (qp ) = qp and (q2 qp ) ? (qp ) = q2+ qp, where q2+ means 1 or more occurrences of state q2 . The rule rewriting yields the following set with two clean non-deterministic rules:

section

section

3 3 paragraph paragraph paragraph

3 paragraph

(section; q2+ qp ) (section; qp) (section; qp) (paragraph; #PCDATA)

#PCDATA 4

#PCDATA #PCDATA #PCDATA 3 1 2

q1 , q1 , q2 , qp .

Then, the standard subset construction will find the subset f0q1; q2g for (section ; qp) . After renaming fq1; q2g as q2 and fq1g as q10 , the resulting DTA will have q10 as the final state and the following rules:

Figure 3. Document tree recognition with the TA M2 .

(section; q2+ qp ) (section; qp) (paragraph; #PCDATA) 0 0

2.5

= = = =

Determinism and non-determinism

= = =

q1 , q2 , qp . 0 0

The reduction of rules in the above DTA is rather an exception. Like constructing deterministic finite-state automata for string regular expressions can result in the exponential automata size in the worst case, the DTA can have as many as O(2n ) states, where n is the number of states in the original NTA. Additionally, the removal of mixed rules can also result in the exponential growth of the rule number. Fortunately, like with finite-state automata, the worst cases happen rarely. The tree automata constructed for about 70 DTDs 1 confirms this observation. The recognition of a document tree by a tree automaton takes polynomial time. The time the automaton takes to make a move is given by time needed to recognize the current strings from D [ Q by one or more regular expressions of production rules. A string can be recognized by a string automaton in almost linear time, therefore any automaton move takes polynomial time. In the following, we use NTA for recognizing document trees.

String regular expressions used in transition rules of tree automata represent sets of allowed strings over Q and therefore we can manipulate them using set-oriented operations, as the regular languages are closed under union, intersection and negation. Thus, two rules (a; E1) = q and (a; E2) = q of a tree automaton are equivalent to one rule given by (a; E1jE2) = q. Two rules (a; E1) = q1 and (a; E2) = q2 are called deterministic if L(E1 ) \ L(E2) = ;. The tree automaton A is deterministic if all rules of A are deterministic. If a tree automaton is non-deterministic, it can be converted into a deterministic one by the subset construction [14]. However, the presence of regular expressions in transition rules requires the corresponding extension of the standard procedure. In the following, we distinguish between two types of the rule non-determinism. Two rules (a; E1) = q1 and (a; E1) = q2; where q1 6= q2, are called clean non-deterministic, while two rules (a; E1) = q1 and (a; E2) = q2, where L(E1 ) \ L(E2 ) 6= ; are called mixed non-deterministic.

1 DTDs are collected and available at http://www.oasisopen.org/cover/topics.html#dtds.

5

2.6

Properties of regular tree languages

automaton states. In other words, a subtree is addressed by expression P iff the automaton is in state q 2 QP . This is not true for an NTA, where the correspondence between a regular path expression and the subset of automaton state can be not unique. Nevertheless, provided that the tree automata are built to capture DTDs and modified under boolean operations, the set of subtrees addressed by a regular tree expression are uniquely captured by a subset of automaton states. The detailed construction is beyond the scope of this paper. When defining the query algebra for XML files, we assume the document tree closure, when the result of any operation on document tree(s) is again a document tree. First we define the algebraic operations on XML files, then we will show how to update the XML schema given by an tree automaton. Boolean operations. First, we adopt the boolean operations to the complex query patterns. XML data is modeled as ordered tree, and the order can be relevant or irrelevant for the query formulation. This results in two different sets of operations, set-oriented and list-oriented operations. The construction of the set-oriented operations is immediate. The operation union (D1 ; D2; e) takes sets of subtrees in D1 and D2 addressed by path expression e and merge them in the result document. For example, let D1 be the document tree given in Figure 3 and D2 be the subtree of D1 , rooted at the right section node marked with rule 2. Then, union(D1 ; D2 ; section:paragraph) will produce the tree corresponding to the following XML fragment:

Properties of tree regular grammars are very close to those of string regular grammars. In this section we prove that the tree regular languages form a boolean algebra. Theorem 1 Let L1 and L2 be two document tree languages accepted by NTAs M1 and M2 , respectively. Then, there exist NTAs that accept the following languages: 1. the intersection of L1 and L2 , 2. the union of L1 and L2 , 3. the complement of L1 (the set of all trees not contained in L1 ). Proof. Let two NTA be M1 =< 1 ; D1; Q1; 1 ; F1 > and M2 =< 2; D2 ; Q2; 2; F2 >. The union tree automaton M = M1 [ M2 is obtained in a straightforward way, by simple merges of corresponding sets in M1 and M2 . Consider now the automata intersection. In the result automaton M = M1 \ M2 , the sets and D are merges of corresponding sets of M1 and M2 . Each state of M is a pair (q1; q2), where q1 2 Q1 and q2 2 Q2. The result automaton M contains the rule (a; L1 \ L2) = (q1; q2), iff M1 contains 1 (a; L1) = q1, M2 contains 2 (a; L2 ) = q2 and L1 \ L2 6= ;. State q = (q1; q2) of M is final iff q1 is the final state of M1 and q2 is the final state of M2 . By the construction, the intersection automaton allows only document trees allowed by both automata M1 and M2. Finally, the complement tree automaton M 0 is given by

#PCDATA4 #PCDATA3 Similarly, operations intersect(D1 ; D2; e) difference (D1 ; D2; e) consider intersection

< ; D; Q; ; Q ? F > 2 3 Query algebra

and and difference of sets of subtrees addressed by path expressions e in D1 and D2 . The construction of list-oriented algebraic operations is similar to extending the object-oriented model with lists and bags [13, 25]. However, the design of an exhaustive operation set is beyond the scope of this paper. Our intend instead is to show that any possible algebraic operation can be uniquely and tightly represented by a corresponding tree automaton change. For example, if one introduces the concatenation operation for lists of sub-trees in two documents, the corresponding schema transformation is straightforward.2 Retrieval operations. Now we consider the schema transformation for queries with complex retrieval patterns.

In this section we extend the basic boolean operations to obtain a powerful language for querying XML documents. We show that the tree automata (TA) schemas are closed under all operations. Therefore, for any composition of basic operations, we can build the precise schema for the result document. Regular path expressions. First, we introduce regular path expressions that are a common way to query semistructured data. They allow the user to specify arbitrary traversals in the data. Paths are words over the alphabet , starting with the root element. A regular path expression defines a regular set of paths over . Syntactically, regular path expressions are often written in the form :a: where a 2 and is a shorthand for any sequence from . In a document tree, a path expression addresses none, one or multiple subtrees. In a DTA, a path corresponds to one state and a regular path expression P uniquely corresponds to a subset QP of

2 The concatenation can however complicate the further translation in DTDs, since the concatenation of two unambiguous grammars does not preserve, in the general case, the unambiguity property.

6

We represent some additional operations often used for formulating complex query patterns, namely, projection, selection, element collapsing and product. Below we describe each of the operations and the modifications they concede to TA schemas.

consider them as a system of equations with regular coefficients and use the minimal fix-point in order to collapse the element a [14]. Product is defined as product(D1 ; D2; pathExp), where D1 and D2 are documents; the operation replaces all subtrees in D1 addressed by pathExp with the document D2 .

Selection is defined as select(D; pathExp; cond), where D is a document, pathExp is a generalized path expression over , cond is a selection condition of the form path =< value >; the operation considers subtrees of D addressed by path as a set and selects those of them that satisfy the condition of cond for the content of the elements defined by path. As an example, for a bibliographic file D that conforms to DTD given in Section 2.3, O1 = select(D; body:paper; author =0 Wood0) will select all papers published by Wood.

The TA schema of a product tree is obtained from two source tree automata M1 and M2 , by merging their 0 00 0 productions and adding rules (a; q ) = q , where q 00 is a finale state of M2 and q is addressed by pathExp in M1 . Note the removal of all unreachable state is needed to get the final schema form. Theorem 2 The set of regular tree expressions is closed under operations of projection, selection, element collapsing and product.

Under a selection operation, a TA schema M is transformed as follows. Any production (a; E) = n in M , such that a 2 L(path), is replaced with the production (a; E \ r(cond)) = n, where r(cond) constrains the content model of a in the form of a regular expression. For the selection O1 above, r(cond)= :na:.

The set of boolean operations (union, intersection, difference) extended with operations of projection, selection, product and element collapsing, forms a query language Q. Any combination of operations from Q generate a document tree. For any operation, the corresponding modification of TA schema takes polynomial time.

Projection is defined as project(D; pathExp), where D is a document and pathExp is a regular path expression over ; it results in discarding those paths in D that are not addressed by pathExp. For the bibliographic file D, O2 = project(D; :(titlejauthor):) will remove all journal subtrees from D.

Theorem 3 For any query expressed as a combination of operations of Q, the precise tree automaton schema can be derived in polynomial time. Example 2 Consider again a bibliographic file D with the DTD given in Section 2.4 and query Q= ”Find all authors who published papers in the Computer journal”. The result document is obtained by the following sequence of operations:

Operation project(D; pathExp) transforms a tree automaton M as follows. Any production (a; E) = n such that a appears in at least path P 2 L(pathExp), is replaced with the production (a; E \ r(a)) = n, where r(a) is a regular expression for a in L(pathExp). For the projection O2 above, r(paper)= (ntjna ).

d1 = select (D, body.paper,journal=’Computer’) d2 = project (d1, body.paper.author) Res = collapse(d2, body.paper)

Element collapsing is defined as collapse(D; a), where D is a document and a is an element type in . It removes all tree nodes a from D. Once a node of type a is removed from the tree, all immediate descendants of the node get linked to the immediate ascendant of the node.

The bibliographic DTD schema is captured by the regular tree grammar G1 (see Section 2.4). The corresponding tree automaton M1 has 5 states, Q = fqn; qb; qt; qa,qj g, where qb is the final state, and the following transition rules:

Under the operation collapse(D; a), the tree automaton M is transformed as follows. For any pair of productions (a; E1) = n1 and (b; E2(a; :::)) = n2 , a 2 Dom(E2 )3 , we replace each entry of a in E2 with E1 and the result automaton contains the rule (b; E2(E1; :::)) = n1 . If one or more regular expressions Ei address each other by (linear) recursions, we

3 Dom(E ) denotes Dom(E ) Q.

the set of states used in the definition of

(body; qp) (paper; qt qa qj ?) (title; #PCDATA) (author; #PCDATA) (journal; #PCDATA) (journal; )

E,

7

= = = = = =

qb , qp , qt , qa , qj , qj .

where QP

To obtain the precise schema of the Res document, we analyze, one by one, the three algebraic operations. The first, select operation constrains the content of paper elements with the *.journal.* condition. The TA schema of d1 therefore differs from M1 by one rule only: (paper; qt qa qj ) = qp . Similarly, the second, project operation constrain body’s content with the paper* condition and, then, constrain paper’s content with the author* condition. Finally, collapsing the paper element gives the following production rules for the Res document:

(body; qa) (author; #PCDATA)

= =

= fn1; n2; n3g, FP = fn3g and

(n1; journal) = n2, (n1; :journal) = n1, (n2; body) = n3 , (n1; :body) = n2 . The pruning of P ?1 with the tree automaton M1 gives ?1 = < ; Q; 0; F >, where the string automaton Ppr Q = f((n1; #PCDATA); (n2; nj ); (n2; np ); (n3; nb)g, F = f(n3; nb)g and 0 ((n1; #PCDATA); journal) = (n2 ; nj ), 0 ((n2; nj ); paper) = (n2 ; np), 0 ((n2; np ); body) = (n3 ; nb).

qb , qa .

Figure 4 illustrates the pruning of a regular path expression. After the pruning, the automaton Ppr allows the path body.paper.journal.

4 Path pruning with NTA The query algebra Q introduced in the previous section, uses regular path expressions. A naive evaluation of such expressions can be expensive, because it may require searching all or a large part of document tree. The cost of evaluating regular path expressions can be reduced significantly by using TA schema. We describe an optimization technique for path pruning and query rewriting, similar to that proposed for the schema graph method [8, 10]. Query pruning restricts search to a fragment of the XML file; moreover, path rewriting can entirely eliminate or substantially reduce the tree traversal. Query pruning is executed as intersection of the tree automaton and a string automaton, the result is a string automaton again, with all unreachable branches pruned off. Regular path expressions refer to paths in the document tree seen as strings over . Since we use the ascending tree automata, we use the inversion of the path expressions. That is, we inverse query expressions in the way that any such inversion has the root of the document tree as the terminal element. Let NTA M be < ; D; QM ; ; FM > and the path expression be given by the string automaton P be < ; QP ; ; FP >. The resulting path is given by the string automaton Ppr =< ; QM QP ; 0; F >, where each state is a pair (qm ; qp), qm 2 QM , qp 2 QP , FM QM QP 0 0 and 0 contains the rule 0((qm ; qp); a) = (qm ; qp), iff 0 0 (qp; a) = qp and (a; E) = qm and qm 2 L(E). The intersection of TA schema and string automaton required by the path pruning takes polynomial time.

not journal

a)

n1

not body

n2 journal

b)

n3 body

n 2’

n 1’ journal

n 3’ paper

n 1’ = (n 1 ,#PCDATA) n 2’ = (n 2 , n j )

n 4’ body

n 3’ = (n 2 ,n p) n 4’ = (n 3 ,n b)

Figure 4. Path pruning: a) string automaton for P ?1 = :journal: :body; b) pruned path ?1. automaton Ppr

5 Translating tree automata into DTDs As we have seen, the tree automata are more powerful that DTDs and not every tree automaton has a corresponding DTD. However, for a given tree automaton there exists at least one DTD that allows the same document trees as the tree automaton. Below we describe the translation of a tree automaton into a DTD. For a given tree automaton A = < ; D; N; ; F >, the algorithm produces the content model for each tag a in the document tree, a 2 . The translation algorithm is given in Figure 5.

Example 3 Consider the TA M1 for the bibliographic database given in Section 2.4 and let P =body.*.journal be the generalized path expression. The automaton that accepts the inversion P ?1 = :journal: :body is given by < ; QP ; ; FP >, 8

Algorithm 1. 1. for each state q 2 Q do T(q) := ; for any a, such that 3 q ! a(r)g do T(q) := T(q) j a od od 2. for each tag a 2 do CM(a) := ;

5.1

Algorithm 1 generates the minimal element contents for XML tags, however it does not cope with the DTD unambiguity. Some ambiguous DTDs can be converted into equivalent unambiguous ones. As an example, the ambiguous element definition can be disambiguated as follows: Unfortunately, in most cases, the conversion requires generalization of grammars. For DTD unambiguity, we rely on work done by Wood and Br¨uggeman-Klein [7, 26], who studied the conditions when regular expressions used for definition of XML element contents are unambiguous. They introduced the notion of 1-determinism and proved that only those regular expressions satisfying the 1-deterministic condition can be used in DTD. Our method also exploits the 1-deterministic property established for element contents of an unambiguous DTD. Orbit properties of 1-deterministic grammars give the sufficient condition that a given finite-state automaton recognizes words that form 1-deterministic regular language. If the automaton has no orbit property, it should be generalized by merging nodes and adding transitions. For a detailed description of the method transforming ambiguous regular expression into a unambiguous, we refer to [4].

R(a) := frj 9s 2 N : 3 s ! a(r)g

for each r in R(a) do replace each entry of q in r with T(q):

r0 := r(q ! T(q)) CM(a) := CM(a)jr0

od od Simplify and add EC(a) to the DTD Figure 5. Translation algorithm.

Any transition rule in a tree automaton associates a state

q 2 Q with an element type a 2 and defines its content by regular expression r. Since every occurrence of a is forced to have the same content model in DTD, we collect all a’s content defined by different rules and merge them in one regular expression. On Step 1 of Algorithm 1, all element types ai associated with a state q form a union T(q) := a1ja2 j : : :. Step 2 collects in R(a) all regular expressions that define contents for element a. Since each regular expression r in R(a) is defined over Q, each state q is replaced with the associated union of elements T(q). The union of regular expression r0 after the replacement form the content model of a for the resulting DTD. As we have seen, the tree automata which have no corresponding DTDs require generalization and the produced DTDs allow more document trees that the tree automata do. However, Algorithm 1 generates the minimal DTDs. In other words, if a DTD B is generated by Algorithm 1 for a given NTA M , there is no other DTD that allows the same set of documents as M but disallows some documents allowed by B . Although the generalization of tree automata is minimal, it may be still essential. The generalization can be avoided if one is permitted to rename XML elements. For the automaton M2 , the introduction of an additional element subsection for subordinate sections will result is the following DTD that allows the same set of trees as M2 :

subsection (paragraph*)> [3] Serge Abiteboul, Peter Buneman, and Dan Suciu. paragraph (#PCDATA)> Data on the Web. Morgan Kaufmann Publishers, 1999. 9

[4] H. Ahonen. Disambiguation of SGML Content Models. In Charles K. Nicholas and Derick Wood, editors, Principles of document processing, PODP’96, Palo Alto, CA, 1996, volume 1293 of LNCS, pages 27–38, 1997.

Semistructured Data and Non-Standard Data Format, in conjunction with ICDT’99, Jerusalem, Israel, 1999. [16] David Maier. Database Desiderata for an XML Query Language. In Position paper in W3C’s Query Language Workshop, 1998.

[5] Catriel Beeri and Tova Milo. Schemas for integration and translation of structured and semi-structured data. In Proc. Intern. Conf. Database Theory, Jerusalem, Israel, 1999.

[17] J. McHugh and J. Widom. Query Optimization for XML. In Proc. IEEE Very Large DataBases Conf., 1999. [18] Makoto Murata. Data model for document transformation and assembly. Lecture Notes in Computer Science, 1481:140–152, 1998.

¨ [6] K. B¨ohm, K. Aberer, M. T. Ozsu, and K. Gayer. Query Optimization for Structured Documents Based on Knowledge on the Document Type Definition. In In IEEE Proc. Advances in Digital Libraries, Santa Barbara, CA, April 1998, pages 196–205, 1998.

[19] Makoto Murata. Hedge automata: a formal model for xml schemata. URL: http://www.xml.gr.jp/relax/hedge nice.html, 1999.

[7] Anne Br˝uggemann-Klein. Unambiguity of extended regular expressions in SGML document grammars. Technical Report Aug16-1, Technical University of Munich, August 16, 1994.

[20] Svetlozar Nestorov, Serge Abiteboul, and Rajeev Motwani. Extracting schema from semistructured data. SIGMOD Record (ACM Special Interest Group on Management of Data), 27(2):295, 1998.

[8] Peter Buneman, Susan Davidson, Mary Fernandez, and Dan Suciu. Adding structure to unstructured data. In Proc. of the Intern. Conf. on Database Theory, volume 1186 of Lecture Notes in Computer Science, pages 336–350. Springer, 1997.

[21] Yannis Papakonstantinou and Pavel Velikhov. Enhancing Semistructured Data Mediators with Document Type Definitions. In Proceedings of the 15th International Conference on Data Engineering, pages 136–145, 1999.

[9] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. A Query Language for XML. In Proc. 8th Int. World Wide Web Conference, Toronto, Canada, pages 77–91, 1999.

[22] Masako Takahashi. Generalizations of regular sets and their application to a study of context-free languages. Information and Control, 27(1):1–36, January 1975. [23] J. W. Thatcher. Characterizing derivation trees of context free grammars through a generalization of finite automata theory. Journal of Computer and System Sciences, 1:317–322, 1967.

[10] M. Fernandez and D. Suciu. Optimizing Regular Path Expressions Using Graph Schemas. In Proc. Intern. Conference Data Engineering, Orlando, FL, February 1998, pages 14–23, 1998.

[24] J. W. Thatcher. Tree automata: An informal survey. In Alfred V. Aho, Currents in the Theory of Computing, pages 143–172. Prentice-Hall, 1973.

[11] D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the World-Wide Web: A survey. SIGMOD Record, 27(3):59–74, 1998.

[25] S. Vandenberg. Algebras for Object-Oriented Query Languages. PhD thesis, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706, June 1993.

[12] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. 23rd Intern. Conf. on Very Large Data Bases, pages 436–445, 1997.

[26] D. Wood. Standard Generalized Markup Language: Mathematical and philosophical issues. Lecture Notes in Computer Science, 1000:344–365, 1995.

[13] S. Grumbach and T. Milo. An Algebra for POMSETS. In Proc. Intern. Conf. Database Theory, 1995.

[27] XML. : Extensible Markup Language 1.0. W3C Recommendation, http://www.w3.org/TR/1998/RECxml-19980210.html.

[14] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. AddisonWesley, N. Reading, MA, 1980.

[28] XML. Schema Requirements, W3C Note, 15 February 1999, http://www.w3.org/TR/NOTE-xml-schemareq.

[15] B. Ludascher, Y. Papakonstantinou, P. Velikhov, and V. Vianu. View Definition and DTD Inference for Xml. In Proc. Workshop on Query Processing for 10

Lihat lebih banyak...

Using regular tree automata as XML schemas

Descrição do Produto

Comentários