Adding semi-structured data & querying to a SQL engine: Use case Spark SQL

June 7, 2017 | Author: Diego Cedillo | Category: Database Systems, Unstructured Data Management, Apache Spark


1. ABSTRACT

Current database engines that provide support for JSON data usually offer non-standardized and semantically incomplete query languages. On the other hand, relational databases follow a common language (SQL) but fail to represent heterogeneous and nested data. In this context, SQL++ is presented as a complete language that is a superset of both SQL and JSON. In this paper, we present a methodology for transforming a SQL relational engine into an engine that supports the SQL++ query language. We explain the new data encodings, expressions, data types and operators needed to achieve this goal. To validate our methodology, we present our work in modifying Spark SQL into Spark SQL++. Moreover, we provide experiments showing that our changes introduce no performance loss for existing queries, and that queries involving semi-structured data can be answered with a performance loss proportional to the number of attributes that are unknown.

2. INTRODUCTION

Nowadays, numerous databases claim support for unstructured data, such as SQL-on-Hadoop systems, NewSQL systems, Spark SQL and MongoDB, but they only provide this support with certain limitations, and in many cases with query languages that are not expressive enough and are specific to their database. On the other side of the spectrum, we find relational databases that do not support nesting and heterogeneous data but provide a more extensive, complete and generic query language: SQL. In this setting, SQL++ [7] is presented as a semi-structured query language that is backwards compatible with SQL, so that it can be easily understood and adopted by SQL programmers. The semi-structured SQL++ data model is a superset of JSON and the SQL data model. The SQL++ model expands JSON with bags (as opposed to having JSON arrays only) and enriched values, i.e., atomic values that are not only numbers and strings. Vice versa, one may think of SQL++ as expanding SQL with JSON features: arrays, heterogeneity, and the possibility that any value may be an arbitrary composition of the array, bag and tuple constructors, hence enabling arbitrarily nested structures, such as arrays of arrays.

Our contribution in this paper is a methodology for transforming an existing SQL relational engine into a SQL++ engine that can handle heterogeneous data, array values and array indexing, and unstructured and semi-structured data, and that opens the path for the creation of the new operators (correlate, array scan, etc.) proposed for the SQL++ query language. Our methodology includes a new encoding for data that does not have a schema, a type for representing heterogeneity, a new schema option for semi-structured data, and a new operator for accessing unstructured data that abstracts the other operators from handling it, leaving the rest of the pipeline unchanged. We test the proposed methodology by modifying the Spark SQL module of Apache Spark [1]. We show the particular steps taken in the modification of Spark SQL into Spark SQL++ following the methodology, and how this can be achieved in an incremental fashion. Moreover, we show how the caching functionality of Spark SQL can still be used in Spark SQL++.

The experiments section compares the performance of Spark SQL++ with the original version of Spark SQL when we use a full schema supported by Spark SQL; this experiment shows a minimal difference in performance. Moreover, we compare the execution time of the same queries when the data has a complete schema, a partial schema and no schema, considering the cases where the data is on disk and where it is cached using the in-memory caching of Spark SQL. The results show a relation between the schema of the data and the performance loss when this schema is not present: performance degrades incrementally as more of the queried elements correspond to the unstructured part of the data. The presence of a schema is also important because it enables further optimizations in the query engine, such as columnar storage, indexes, and statistics of the stored data. These optimizations are harder to support and maintain when they are not specifically detailed in the schema of the data.

The structure of this paper is as follows. We present a short background on SQL++ in Section 3, and we introduce the assumptions of a relational engine and the proposed methodology in Section 4. Section 5 presents the current state of Spark SQL and the changes made to transform it into Spark SQL++. In Section 6, we show experiments comparing the performance of Spark SQL and Spark SQL++ with different schemas. We introduce related work in Section 7 and we describe our conclusions and future work in Section 8.


named value      → name :: value
value            → null | missing | scalar value | complex value
complex value    → tuple value | collection value
scalar value     → primitive value | enriched value
primitive value  → ' string ' | number | true | false
enriched value   → type ( (primitive value ,)+ )
tuple value      → { (name : value ,)+ }
collection value → array value | bag value
array value      → [ (value ,)* ]
bag value        → {{ (value ,)* }}

Figure 1: BNF Grammar for SQL++ Values


3. BACKGROUND

3.1 SQL++

3.1.1 Data model

type            → plain type [ allow null ] [ allow missing ]
allow null      → null | not null
allow missing   → missing | not missing
plain type      → any type | union type | scalar type | complex type
complex type    → tuple type | collection type
any type        → any
union type      → type | ...
scalar type     → primitive type | enriched type
primitive type  → string | number | boolean
enriched type   → date | ...
tuple type      → { open type } | { closed type }
open type       → attr , ... , *
closed type     → attr , ...
attr            → required attr | optional attr
required attr   → name : type
optional attr   → name ? : type
collection type → array type | bag type
array type      → [ type ]
bag type        → {{ type }}

Figure 2: BNF Grammar for SQL++ Types


The SQL++ [7] data model is a superset of both SQL's relational tables and JSON, based on the fact that they present similar concepts: a JSON scalar corresponds to a SQL string/number/boolean, a JSON object literal to a SQL tuple, and a JSON array is similar to a SQL table with order. Figure 1 presents the BNF grammar for SQL++ values. A name is a string and it is unique. A value is a scalar, complex, missing or null. A complex value is a tuple or a collection. A tuple is a set of attribute/value pairs, where each attribute is a unique string inside the tuple. A collection is either an array or a bag; both may contain duplicates. While an array is ordered and its elements can be accessed by ordinal position, a bag is unordered, like a SQL table. A scalar value is either primitive (a value of the JSON specification, i.e., a string, number or boolean) or enriched; enriched values, such as dates and timestamps, are specified using a type constructor over primitives. The elements inside an array or a bag can be of any kind and can be heterogeneous, which means that there are no restrictions between the elements inside a collection. Moreover, in contrast to SQL, SQL++ allows arbitrary composition of complex values.

In order to handle these values and the structure of the data, a SQL++ engine needs to support the data types and schemas presented in Figure 2. The type any is the universal supertype of the type system: it allows all values. In other words, it expresses that data with this type can be heterogeneous across instances, or that the type of the data is unknown beforehand. The scalar types have a one-to-one correspondence with the scalar values, as expected. Similarly, collection types have their correspondence with collection values. The fact that an array/bag type requires a type for the values inside it does not mean that the array/bag has to be homogeneous, since the type any can be used as the element type.
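For instance, the following bag value, written in the notation of Figure 1 (the example is ours, not taken from [7]), combines the tuple, array and enriched-value constructors and is heterogeneous, since rating holds a number in one tuple and a string in the other:

  {{
    { userName : 'Ann', rating : 5, categories : [ 'bar', 'cafe' ] },
    { userName : 'Bo', rating : 'five', visited : date('2005-02-15') }
  }}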


A tuple type represents the type of a tuple value; it can be either open or closed. Consider a tuple value tv with attributes a1, a2, ..., an and a closed tuple type t with attribute names b1, b2, ..., bm. The tuple value tv is compatible with the closed tuple type t if, for every attribute ai of tv with value vi, there is an attribute bj of t with type btj such that ai = bj and vi is of type btj, and every attribute of t has its corresponding match among the attributes of tv. An open tuple type, however, is more flexible: a tuple value tv is compatible with an open tuple type o with attributes c1, c2, ..., cl, * if, for every attribute ck of o with type tk, there is an attribute ai of tv with value vi of type tk such that ck = ai, and every attribute of o has its corresponding match among the attributes of tv. The relaxation of the schema comes from the fact that tv can have more attributes than the ones present in o; these attribute/value pairs are considered the open part, or unstructured part, of the tuple. Inside a tuple type we can find attributes that are required (mandatory to appear in the tuple value) or optional (they may or may not appear in the tuple value).
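As a small illustration in the notation of Figure 2 (again our own example), the tuple value { userName : 'Ann', rating : 5, utime : 1108425600 } is compatible with the open tuple type { userName : string, rating ? : integer, * }: userName and the optional attribute rating match declared attributes, while utime falls into the open (unstructured) part denoted by *. The same value is not compatible with the closed tuple type { userName : string, rating : integer }, because utime has no corresponding attribute in the type.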

4. METHODOLOGY

4.1 Generic Relational Engine

4.1.1 Data model

Consider a generic Relational Database Management System (RDBMS) that supports the primitive value types string (varchar), boolean and number. The data is stored in tables that represent bags of tuples, where a bag is an unordered collection of elements and a tuple is a collection of attribute name/value pairs. Each table has a schema that describes the n attribute names of the tuples and their data types. A relational schema consists of a non-nested closed tuple type similar to the one in Figure 2. In the relational table, we have collections of n elements called rows that contain the data described by the schema. These rows represent the values of the tuples.

Figure 3: Architecture of a relational database engine

4.1.2 Architecture

A generic architecture of a database engine is presented in Figure 3. The SQL string is transformed into an AST by the parser. This AST is converted into a logical plan by the logical plan builder. A logical plan is a tree of relational operators that describes, step by step, the actions needed to execute the query. The most common relational operators are selection σ, projection π, cartesian product ×, join ⋈ and scan (iteration over the elements of a table). Each operator of a logical plan takes as input a collection of tuples, applies some transformation to these tuples, such as filtering (σ), picking elements (π) or combining tuples (× and ⋈), and outputs another collection of tuples. The collection of tuples between two operators in a logical plan tree is called an intermediate result.

The optimizer takes the logical plan and improves it by applying a set of known rewrite rules, such as predicate pushdown, early selection, and combining cartesian products and filters into joins. The resulting logical plan is considered optimized and can be transformed into different physical plans; depending on cost estimations based on statistics, the physical planner selects the most efficient one. The execution engine uses the physical plan to produce the result as a collection of tuples. During execution, the leaf operators of the physical plan retrieve the information from the relational tables, and tuples are passed to the next operator using a pipeline. The pipeline consists of passing the tuples from one operator to the next one at a time. This technique avoids materializing the intermediate results before the next operator uses them.
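To make the pipelined execution concrete, the following self-contained Scala sketch (ours, not code from any particular engine) models operators as iterators over tuples, so each tuple flows through scan, filter and project one at a time:

  object PipelineSketch {
    // A tuple is modeled as a map from attribute name to value.
    type Tuple = Map[String, Any]

    // Each operator consumes an iterator of tuples and produces another iterator,
    // so tuples are pulled through the plan one at a time (pipelining).
    def scan(table: Seq[Tuple]): Iterator[Tuple] = table.iterator
    def filter(in: Iterator[Tuple])(pred: Tuple => Boolean): Iterator[Tuple] = in.filter(pred)
    def project(in: Iterator[Tuple])(attrs: Seq[String]): Iterator[Tuple] =
      in.map(t => attrs.flatMap(a => t.get(a).map(a -> _)).toMap)

    def main(args: Array[String]): Unit = {
      val reviews: Seq[Tuple] = Seq(
        Map("userName" -> "Ann", "rating" -> 5, "texttime" -> "Feb 15, 2005"),
        Map("userName" -> "Bo",  "rating" -> 3, "texttime" -> "Mar 1, 2006"))

      // π[userName, rating]( σ[texttime = 'Feb 15, 2005']( scan(reviews) ) )
      val result = project(filter(scan(reviews))(_("texttime") == "Feb 15, 2005"))(Seq("userName", "rating"))
      result.foreach(println)   // Map(userName -> Ann, rating -> 5)
    }
  }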

4.2 Adding the ++ to a Relational Engine

In order to illustrate the modifications that can be made to extend a SQL relational database into a heterogeneous engine, we present three schemas in Figure 4. The first schema is a schema supported by a relational engine. The second and third schemas are valid schemas only for the new SQL++ engine. The schemas are presented following the type declaration syntax of SQL++.

4.2.1 Changes in the data model

In a homogeneous relational database we only allow a single data type per column. The first step to allow heterogeneous data is to add the type any as a new data type. This type should allow any value inside the columns of the relational table, including nested data, which is supported by the tuple type and array type presented in Figure 2. In order to support the open tuple type of Figure 4, we need to add an encoding that keeps track of the attribute name/value pairs for each attribute that is not specified in the schema. To represent this, we use a hash map, called the open tuple, with attribute names as keys. The part of the data specified in the schema continues to be stored in its own columns, while the open tuple is stored in an extra column with type any and attribute name "*". Conversely, a column with type any can also contain an open tuple, to represent a tuple value without schema. Likewise, we need to add support for the array type and the array value; these are already supported by some relational databases, such as PostgreSQL.

4.2.2 Architectural changes

4.2.2.1 New operators.

Consider a query over a relation with the open tuple type schema of Figure 4. If the query specifies an attribute in the known part of the schema, that attribute is accessed using the regular procedure followed by a relational engine. If the query specifies an attribute that is not present in the schema, we need to provide a method that allows us to access the open part of the data and retrieve the requested value. For this, we need a new expression, called a path expression, that represents such a request. This expression contains the steps along the path of the open schema needed to retrieve the desired value. The path is represented using the attribute names separated by ".": the expression t.a.b represents the path to access the attribute b inside the nested tuple under attribute a of the relation t. The evaluation of a path expression follows the described path using the map of the open tuple.

Additionally, we need a Navigate operator that resolves these path expressions and passes the results to the next operator, in the same way that a scan operator passes the attributes when we access the structured part of the data. This new operator abstracts the other relational operators from handling unstructured data, by resolving the path expressions and returning the requested values as output. For each tuple that the Navigate operator receives as input, it outputs a tuple that is the result of concatenating the input tuple with the resolution of the path expressions of the Navigate operator. The Navigate operator can be located as a child of the operator that needs the value of the path expression, corresponding to just-in-time navigation. Another option is early navigation, where the Navigate operators are placed right above the operators that retrieve the information from the data sources (scan operators); in that case, each Navigate operator contains all the path expressions used in the query that can be resolved through its child scan operator. Depending on the architecture of the engine, one location can be more convenient than the other. Early navigation avoids the creation of new patterns in the logical plan tree, except near the leaves (the scan operators), which can avoid changes in the rules of a rule-based optimizer.
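As an illustration of the encoding and of path-expression resolution, the following simplified Scala sketch (ours, not the code of any particular engine; all names are hypothetical) models a tuple with an open part as its schema columns plus a hash map for the "*" column, and resolves a dotted path by walking that map:

  object NavigateSketch {
    // Known columns are stored under their names; the open part is a map under "*".
    case class OpenTuple(columns: Map[String, Any], open: Map[String, Any])

    // Resolve a dotted path such as "a.b" against the open part of the tuple.
    // A missing step yields None, which an engine would surface as null/missing.
    def resolvePath(t: OpenTuple, path: String): Option[Any] = {
      def walk(value: Any, rest: List[String]): Option[Any] = rest match {
        case Nil => Some(value)
        case step :: tail => value match {
          case m: Map[_, _] => m.asInstanceOf[Map[String, Any]].get(step).flatMap(walk(_, tail))
          case _            => None
        }
      }
      walk(t.open, path.split("\\.").toList)
    }

    // Navigate: concatenate the input tuple with the resolved path expressions.
    def navigate(t: OpenTuple, paths: Seq[String]): OpenTuple =
      t.copy(columns = t.columns ++ paths.map(p => p -> resolvePath(t, p).orNull))

    def main(args: Array[String]): Unit = {
      val review = OpenTuple(
        columns = Map("userName" -> "Ann", "rating" -> 5),
        open    = Map("utime" -> 1108425600L, "place" -> Map("city" -> "San Diego")))
      println(navigate(review, Seq("utime", "place.city")).columns)
      // Map(userName -> Ann, rating -> 5, utime -> 1108425600, place.city -> San Diego)
    }
  }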

Schema-full:
{{ { userName : string, rating : integer, reviewText : string, review : string,
     categories : [string], texttime : string, gPlusPlaceId : string,
     gPlusUserId : string, utime : long } }}

Open schema:
{{ { userName : string, rating : integer, reviewText : string, review : string,
     categories : [string], texttime : string, gPlusPlaceId : string, * } }}

Schemaless:
any

Figure 4: Schema examples of the Google reviews dataset

4.2.2.2 Functions.

In order to support functions in the relational engine, we need to allow the possibility of function resolution at runtime. Function resolution consists of, given a function call, finding the correct function that satisfies the name of the call and the parameter types. This resolution needs to be done at runtime only when one of the parameters of the function call has type any and, therefore, the actual value type is only known per instance. When no function matches the requirements at runtime, we can choose to represent the failure by a null in the result or by throwing an error.
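A minimal Scala sketch of such runtime resolution follows (ours; an actual engine would dispatch on its own type descriptors rather than on JVM classes, and the length overloads are only illustrative):

  object FunctionResolutionSketch {
    // One registered implementation per (name, parameter classes) signature.
    case class Overload(name: String, paramTypes: List[Class[_]], impl: List[Any] => Any)

    val overloads: List[Overload] = List(
      Overload("length", List(classOf[String]), args => args.head.asInstanceOf[String].length),
      Overload("length", List(classOf[Seq[_]]), args => args.head.asInstanceOf[Seq[_]].size))

    // Resolution happens at call time, using the runtime classes of the arguments;
    // this is what is needed when a parameter is declared with type any.
    // Returning None models the "result is null" choice; an engine could throw instead.
    def call(name: String, args: List[Any]): Option[Any] =
      overloads.find { o =>
        o.name == name &&
        o.paramTypes.length == args.length &&
        o.paramTypes.zip(args).forall { case (c, a) => c.isInstance(a) }
      }.map(_.impl(args))

    def main(args: Array[String]): Unit = {
      println(call("length", List("SQL++")))        // Some(5)
      println(call("length", List(List(1, 2, 3))))  // Some(3)
      println(call("length", List(42)))             // None, surfaced as null
    }
  }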

4.2.2.3 Changes to operators.

The inclusion of the Navigate operator abstracts the other operators from handling unstructured data and, therefore, no additional changes should be necessary. However, operators usually contain checks and restrictions that validate the types of their expressions. These restrictions should be updated considering that every type is a subtype of the type any; this means, for example, that an expression of type any must also satisfy restrictions meant for boolean expressions, such as predicates. In SQL++, the union operator is more general than the union operator of SQL, since it allows the union of two intermediate results, using the number of elements of the first query as the number of elements to output from the union. The output types of the union are given by the most general type of each of the columns. The intersect and except operators are also extended with heterogeneous behavior that allows comparing intermediate results with different types.

4.2.2.4 Total Ordering.

The inclusion of heterogeneous data raises situations that need to be considered. For example, if two elements of different types under the same column are compared, which one should be considered greater than the other? Should the operation be allowed at all? Similarly, if an ORDER BY is requested for a column with type any, which values should be presented first? In SQL++ [7], these problems are solved by defining a total order that determines which data types come before others. This total order can be used for comparisons between different types and for ordering heterogeneous columns.
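The following Scala sketch shows one way such a comparator could be structured; the concrete type ranking below is our own illustrative choice, not the total order actually defined in [7]:

  object TotalOrderSketch {
    // Illustrative rank per type: values of different types are ordered by rank
    // first; values of the same type use their natural order.
    private def rank(v: Any): Int = v match {
      case null       => 0
      case _: Boolean => 1
      case _: Number  => 2
      case _: String  => 3
      case _: Seq[_]  => 4
      case _          => 5
    }

    val heterogeneousOrdering: Ordering[Any] = new Ordering[Any] {
      def compare(a: Any, b: Any): Int = {
        val byRank = rank(a).compareTo(rank(b))
        if (byRank != 0) byRank
        else (a, b) match {
          case (x: Number, y: Number)   => x.doubleValue.compareTo(y.doubleValue)
          case (x: String, y: String)   => x.compareTo(y)
          case (x: Boolean, y: Boolean) => x.compareTo(y)
          case _                        => 0   // same rank, no finer order defined here
        }
      }
    }

    def main(args: Array[String]): Unit = {
      val column: Seq[Any] = Seq("b", 3, true, 1.5, "a", null)
      println(column.sorted(heterogeneousOrdering))   // List(null, true, 1.5, 3, a, b)
    }
  }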

5. EXTENDING SPARK SQL

5.1 Spark SQL

5.1.1 Data Model

Spark SQL is a module of Spark [1] that uses a nested data model based on Hive [2]. The Spark SQL data model is a superset of the SQL relational tables, but it does not implement the complete flexibility of JSON, since:

• It does not support heterogeneous values, which means that it is not allowed to have values of different types under the same attribute name. Since all tuples share the same schema, each attribute of a tuple type is associated with only one type, and the type any is not supported.

• It does not allow arbitrary attributes to appear in the tuple values. Instead, the only attributes permitted are the ones given by the tuple type of the schema.

The values supported by Spark SQL are similar to the SQL++ values shown in Figure 1, except that Spark SQL uses bags only for representing a collection of tuples (i.e., a relational table); a bag cannot be used as a regular stored value. Spark SQL also has a type system similar to that of SQL++, shown in Figure 5. The differences are:

• It does not allow missing values.

• There is no any type and, hence, no heterogeneous data.

• There is no union type (more than one type option).

• There is no bag type.

• The tuple type is called struct type and only supports closed tuple types.

• Spark SQL supports maps, which are essentially tuples where the attributes are not restricted to strings but can be values of any type.

• Additionally, Spark SQL supports user-defined types (UDTs).

5.1.2 Language

As a query language, Spark SQL uses a subset of the SQL extensions of Hive. This includes tuple navigation and array/map navigation. A tuple navigation has the form t.a and returns the value of attribute a inside the tuple t. An array navigation is an expression of the form x[i], whose evaluation returns the value at position i of the array x.

Similarly, a map navigation is an expression of the form m[k], whose result is the value v stored under key k in the map m. In contrast to SQL, Spark SQL supports subqueries only in the FROM clause.

type            → plain type [ allow null ]
allow null      → null | not null
plain type      → scalar type | complex type
complex type    → struct type | collection type | map type
scalar type     → primitive type | enriched type
primitive type  → StringType | numeric type | BooleanType
numeric type    → ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType
enriched type   → DateType | UDT
struct type     → { struct field * }
struct field    → name : type
collection type → array type
array type      → [ type ]
map type        → map( type : type )

Figure 5: Spark SQL Types

5.1.3 Architecture

5.1.3.1 RDDs and DataFrames.

"Spark offers a functional programming API similar to other recent systems, where users manipulate distributed collections called Resilient Distributed Datasets (RDDs) [9]. Each RDD is a collection of Scala, Java or Python objects partitioned across a cluster. RDDs can be manipulated through operations like map, filter, and reduce, which take functions in the programming language and ship them to nodes on the cluster." [3] "Spark SQL runs as a library on top of Spark [...]. It exposes SQL interfaces, which can be accessed through JDBC/ODBC or through a command-line console, as well as the DataFrame API integrated into Spark's supported programming languages." [3]

A DataFrame is a distributed collection of rows with a homogeneous schema represented by a struct type, shown in Figure 5. The schema is homogeneous since it is declared once and all the values present in the rows correspond to the struct fields (Figure 5) of the schema; more precisely, the i-th element of every row corresponds to the i-th struct field and must be of the type given by that struct field. The DataFrame is to Spark SQL what a table is to a relational database. A DataFrame extends the concept of an RDD of objects of type Row: it keeps track of the schema given to these rows and supports relational operations in order to express queries and optimize their execution. In fact, a DataFrame can be created from an RDD of Rows and a corresponding struct type that defines the schema of the data inside the rows. Therefore, a row is the encoding of a tuple value given the schema as a struct type.

DataFrames can be instantiated from tables of an external data source or from existing RDDs of objects. The developer can manipulate DataFrames using functions that correspond to relational operators, such as projection (select) and join. A DataFrame can also be created from an RDD [3] of elements of some case class A (a case class in Scala is a regular class that exports its constructor parameters and provides a recursive decomposition mechanism via pattern matching [5]). In that case, Spark SQL infers the schema of the data using Scala reflection, a mechanism used in Spark SQL to obtain the Spark SQL types of the data according to the Scala values used. Scala reflection is also able to obtain the names of the attributes of class A; these attribute names, together with the inferred types, feed the struct fields that compose the schema of the data.

Spark SQL uses a DataFrame as the input and output of every algebra operator that solves the query. In Spark SQL, a DataFrame is lazy in the sense that it represents a logical plan that will compute a dataset, but the execution occurs only when the user calls an "output operation" such as save. This lazy evaluation allows the engine to reorganize operators in order to optimize the query and to execute the query with a pipeline.

Spark SQL also allows the creation of a DataFrame from a JSON data source in two ways. The first allows the developer to specify the data source and the schema that will be used for the data. The schema can be a subset of the attributes contained in the JSON file; in that case the queries can only access the data represented by the provided schema. The second form is schema inference. This schema inference is very simple compared to the one proposed in [8]. Spark SQL allows the developer to specify the portion of the data that will be randomly selected as a sample for the schema inference. The procedure takes the sampled tuples and creates a struct field for each attribute name found. The first time an attribute name is added, the type of its value is inferred and taken as the reference type. When the attribute name is found again in another tuple, the type of the new value is compared with the type stored in the struct field. If the types are different, Spark SQL checks whether both types are numeric; in that case it selects the more general type based on numeric widening. If at least one of them is not numeric, Spark SQL selects string as the general type.
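To make this concrete, here is a minimal sketch of the two creation paths described above, written against the Spark 1.x-era Scala API that was current when [3] was published; the Review class and the file name reviews.json are our own placeholders:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Case class whose fields give Spark SQL the attribute names and types
  // via Scala reflection.
  case class Review(userName: String, rating: Int, texttime: String)

  object DataFrameSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // 1) DataFrame from an RDD of case-class objects: schema inferred by reflection.
      val reviews = sc.parallelize(Seq(Review("Ann", 5, "Feb 15, 2005"))).toDF()
      reviews.printSchema()

      // 2) DataFrame from a JSON source, sampling 10% of the records for schema inference.
      val json = sqlContext.read.option("samplingRatio", "0.1").json("reviews.json")
      json.select("userName").show()
    }
  }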

5.1.3.2 Catalyst Optimizer.

Spark SQL uses an optimizer called Catalyst. This optimizer is extensible and was designed around the functional programming capabilities of Scala. Its goals are to make it easy to add new optimization techniques and features to Spark SQL, and to enable external developers to extend the optimizer. Catalyst was designed to support transformations on the logical plan tree using rules, as well as cost-based optimizations. Catalyst rests on two main concepts in Spark SQL. First, Spark SQL uses trees of objects for most purposes: they represent the logical plan and the expressions within an operator of the logical plan. The second main concept is rules. Catalyst uses functions called rules to manipulate these trees and transform them according to certain criteria. A rule can be essentially any code written in Scala, but the most common way of expressing a rule is by pattern matching. The Catalyst optimizer is used in the following phases, shown in Figure 6: analyzing a logical plan to resolve relations, optimizing the logical plan, transforming the logical plan into a physical plan, and generating code to compile parts of the query directly to Java bytecode [3].
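The following self-contained Scala sketch illustrates the idea of a rule expressed as pattern matching over an expression tree; it mimics the style of Catalyst rules but deliberately uses its own toy classes rather than Spark's internal API:

  object RuleSketch {
    // A toy expression tree, standing in for Catalyst's expression nodes.
    sealed trait Expr
    case class Literal(value: Int)          extends Expr
    case class Attribute(name: String)      extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // Apply a partial function bottom-up wherever it matches (a simplified
    // analogue of transforming a tree with a rule).
    def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
      val withNewChildren = e match {
        case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
        case other     => other
      }
      rule.applyOrElse(withNewChildren, (x: Expr) => x)
    }

    // A constant-folding rule written as pattern matching.
    val constantFolding: PartialFunction[Expr, Expr] = {
      case Add(Literal(a), Literal(b)) => Literal(a + b)
    }

    def main(args: Array[String]): Unit = {
      val expr = Add(Attribute("rating"), Add(Literal(1), Literal(2)))
      println(transform(expr)(constantFolding))   // Add(Attribute(rating),Literal(3))
    }
  }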

5.1.3.3 Query Planning.

Figure 6 shows the complete query planning of Spark SQL. When a query is given in SQL syntax, Spark SQL uses the parser to create the AST, called the unresolved logical plan. The unresolved logical plan can also be created using the relational functions over a DataFrame. This plan contains unresolved relations, which represent relation names that have not yet been checked against the catalog of relations. Operators can also contain unresolved attributes, which correspond to attributes that have not been checked for existence in the queried relations, and unresolved functions, which correspond to function names that have not yet been checked for a corresponding function signature (function resolution). The analyzer uses rules and the catalog of tables to resolve the relations and references, transforming the unresolved logical plan into a logical plan. The logical plan now contains logical relations, which correspond to scan operators that will access the requested relations. The resolved logical plan also contains attribute references that point to the correct position of the data in the row according to the schema of the relation; these references are considered resolved since they carry the information of the source relation and the position inside that relation.

Once the logical plan is resolved, the logical optimization phase follows. It is also based on the rule system of Catalyst and executes common rules for query optimization, such as projection pruning, predicate pushdown, join optimizations and constant folding. The physical planning phase takes the optimized logical plan and generates multiple physical plans that match the Spark execution engine. After these plans are generated, a cost model is applied to select the best physical plan which, when executed, produces a DataFrame as the result. As described in [3], for now Spark SQL only uses cost-based optimization to select join algorithms (for relations that are known to be small).

5.1.3.4 In-memory caching.

Spark SQL can materialize frequently used data in memory using columnar storage, under the explicit request of the programmer. The columnar layout differs from the row encoding of the data previously discussed, since it clusters the data according to the column it belongs to rather than by rows. This layout enables query optimizations that access only the columns used in the query. The columnar cache can also reduce the memory footprint because it applies columnar compression schemes. Each column has its own encoding and statistics of the data it contains. When the cache function is called, Spark SQL considers different encodings for each column, including dictionary encoding, integer and long-integer delta encoding, run-length encoding and boolean bit sets. Once all encodings are evaluated, it chooses the one with the smallest size, but only if the encoded size is less than 80% of the original size. Since Spark partitions the data within the cluster, the encoding is done per column and per partition.

When querying data that was previously cached, Spark SQL uses a special operator called in-memory relation. This operator is a scan operator (it retrieves the data from a source) that accesses only the columns that will be used by the query later in the pipeline, which means that it only decodes the information that is needed. Moreover, this operator handles partitions with early filters based on the statistics of the data in each partition: if the statistics of a partition indicate that the filtered value is not in the partition, that partition is discarded; otherwise it is decoded and the data is passed to the next operator, which is usually a filter operator with the same predicate that will filter the data within the selected partitions. This process is called partition pruning in Spark SQL. Because of the partition and column pruning, the in-memory relation operator can be seen as a mix of the scan, project and filter operators. Additionally, the in-memory relation operator passes a DataFrame to the next operator in the pipeline, meaning that there is no need to consider columnar storage after this operator.

Figure 6: Architecture of Spark SQL, from [3]

5.2 Adding the ++ to Spark SQL

In order to illustrate the modifications made to extend Spark SQL into Spark SQL++, we use again the schemas of Figure 4. The first schema is a schema currently supported by Spark SQL using the struct type and its struct fields. The second and third schemas are valid schemas only for Spark SQL++.

5.2.1 Changes in the Data model

5.2.1.1 Open Tuple Type and the type any.

Consider the case where we have a JSON data source with a column that contains integers and arrays of integers under the same attribute, and we ask Spark SQL to infer the schema of the data. Spark SQL will determine the data type as string, since there is no unique way of representing both types, as described in Section 5.1.3.1. However, this representation is inaccurate, considering that it does not allow access to the data of the array, because the column is not of array type. For cases like this, where we want to represent heterogeneous data, we add the data type any. This data type can also be used to represent that the type of the data in a column is unknown. For the case where the entire schema of the data is unknown, as in the third schema of Figure 4, we add the option to use any as a valid schema that can represent any primitive value or nested data.

Now consider the second schema of Figure 4, in which we know the schema of the data only partially. For this case, we extend the struct type described in Section 5.1.1 to support a new open struct type. This new kind of structure has an extra struct field named "*" of type any that contains the unstructured data. The encoding in Spark SQL is based on rows that contain the data and a schema that gives the attribute names of the elements of the row. When we consider the open tuple type, the structured part follows the same encoding as in Spark SQL, and the extra column "*" contains the unstructured elements. However, these unstructured elements need to keep the attribute names of the data besides the values stored. For this, we create a new kind of encoding, called the open tuple, that follows the syntax of a tuple value in Figure 1. The open tuple contains a collection of key-value pairs representing the attribute name/value pairs. This open tuple is used as the open part of an open struct type, and it works like a more specific kind of Row [3] that specifies the attribute name of a column by keeping a map from attribute names to the values assigned to them. This kind of representation allows us to solve queries over unstructured data such as JSON with small changes to a relational query processor. The schemaless option, represented by any, is essentially an open struct type with no structured part; therefore its encoding is a row with only one element inside, and this element is the open tuple.
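As a rough sketch of this encoding using only public Spark 1.x API types (our own illustration; Spark SQL++'s actual open struct type and open tuple are internal extensions, so here the open part is simply modeled as a map-typed last column named "*"):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  object OpenTupleEncodingSketch {
    def main(args: Array[String]): Unit = {
      // Structured part of the open schema of Figure 4, plus a "*" column standing in
      // for the open tuple. A string-to-string map is used here only to stay within
      // standard public Spark types; the real open tuple can hold any value.
      val openSchema = StructType(Seq(
        StructField("userName", StringType),
        StructField("rating", IntegerType),
        StructField("*", MapType(StringType, StringType))))

      // One row: known columns by position, unknown attribute/value pairs in the "*" map.
      val row = Row("Ann", 5, Map("utime" -> "1108425600", "texttime" -> "Feb 15, 2005"))

      // Resolving the path expression *.utime amounts to a lookup in the open part.
      val openPart = row.getAs[Map[String, String]](2)
      println(openPart.get("utime"))                 // Some(1108425600)
      println(openSchema.fieldNames.mkString(", "))  // userName, rating, *
    }
  }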

5.2.2 Changes on the Relational Operators

5.2.2.1 The Navigate Operator.

SELECT t.userName, t.rating, t.gPlusPlaceId, t.utime
FROM GoogleReviews AS t
WHERE t.texttime = 'Feb 15, 2005'

Figure 7: Query 1

π[userName#5, rating#6, gPlusPlaceId#11, utime#13L]
  σ(texttime#10 = 'Feb 15, 2005')
    Relation[userName#5, rating#6, reviewText#7, review#8, categories#9, texttime#10, gPlusPlaceId#11, gPlusUserId#12, utime#13L]

Figure 8: Logical Plan for schema full

Consider querying data over a relation with the open schema of Figure 4. When the query specifies a known attribute, Spark SQL connects it to the corresponding column index where the value can be found, using an AttributeReference. For the case where the query specifies an unknown attribute (one not given in the schema), Spark SQL++ adds a new expression, called a path expression, which has as a child the reference to the column with the open data (named "*") and returns the value found by following that path inside the open tuple, if any, or null otherwise. The path expressions created need to be resolved into attribute references, so that the rest of the operators can access them like any other known attribute. For this, we included the Navigate operator: given as input a row of elements and the path expressions into which it needs to navigate, it outputs the original row of elements concatenated with the values returned by the evaluation of each of the path expressions.

In Spark SQL++, the Navigate operator is strategically located as the parent of every Logical Relation operator (scan operator) in the logical plan tree (early navigation). This position allows us to avoid changing the optimization phase of Figure 6, since that phase only analyzes patterns of operators that do not include the scan operators. On the other hand, if we had used just-in-time navigation, the complete structure of the logical plan tree would have been affected, as well as the patterns found in it, and we would therefore have needed to rewrite the optimization rules to cover the cases with and without a Navigate operator.

To determine the path expressions that each Navigate operator needs, Spark SQL++ uses a new rule in the analysis phase of Figure 6. This new rule looks for the unresolved expressions that could not be resolved with the structured part of the data and that access a relation whose schema is an open struct or any. This rule guarantees that only the path expressions needed to solve the query are included in the Navigate operators. Moreover, we include a rule that checks whether no path expression was added to a Navigate (the query does not access the unstructured part of the data); in that case the Navigate operator is eliminated, resulting in a logical plan tree equal to the one produced by Spark SQL.

Consider the query in Figure 7. If this query is applied to a relation with the full schema of Figure 4, it has a logical plan like the one presented in Figure 8. If we apply the same query to a relation with the open schema of Figure 4, the result is a logical plan similar to the one presented in Figure 9; notice that this plan contains the Navigate operator, which navigates the open part of the relation to resolve the path expression t.utime. Finally, consider the same query over a relation with the schema any. The logical plan, in this case, has to include a Navigate operator with path expressions that access all the attributes of the SELECT clause plus the attributes used in the WHERE predicate.
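A simplified, self-contained Scala sketch of these two analysis rules follows (our own toy plan classes, not Spark SQL++'s actual implementation):

  object NavigateRuleSketch {
    // A toy logical plan, standing in for Spark SQL's operators.
    sealed trait Plan
    case class Scan(relation: String, knownAttrs: Set[String], open: Boolean) extends Plan
    case class Navigate(paths: Seq[String], child: Plan) extends Plan
    case class Filter(usedAttrs: Seq[String], child: Plan) extends Plan
    case class Project(usedAttrs: Seq[String], child: Plan) extends Plan

    // Early navigation: gather the attributes used above an open scan and place a
    // Navigate right above it, keeping only the attributes the schema cannot resolve.
    def addNavigate(plan: Plan): Plan = {
      def used(p: Plan): Seq[String] = p match {
        case Project(a, c) => a ++ used(c)
        case Filter(a, c)  => a ++ used(c)
        case _             => Seq.empty
      }
      def rewrite(p: Plan, attrs: Seq[String]): Plan = p match {
        case s @ Scan(_, known, true) => Navigate(attrs.distinct.filterNot(known.contains), s)
        case s: Scan                  => s
        case Project(a, c)            => Project(a, rewrite(c, attrs))
        case Filter(a, c)             => Filter(a, rewrite(c, attrs))
        case Navigate(ps, c)          => Navigate(ps, rewrite(c, attrs))
      }
      rewrite(plan, used(plan))
    }

    // Second rule: drop a Navigate that ended up with no path expressions.
    def pruneEmptyNavigate(plan: Plan): Plan = plan match {
      case Navigate(ps, c) if ps.isEmpty => pruneEmptyNavigate(c)
      case Navigate(ps, c)               => Navigate(ps, pruneEmptyNavigate(c))
      case Project(a, c)                 => Project(a, pruneEmptyNavigate(c))
      case Filter(a, c)                  => Filter(a, pruneEmptyNavigate(c))
      case leaf                          => leaf
    }

    def main(args: Array[String]): Unit = {
      val known = Set("userName", "rating", "reviewText", "review", "categories", "texttime", "gPlusPlaceId")
      val scan = Scan("GoogleReviews", known, open = true)

      // Query 1 over the open schema: only utime is missing from the schema.
      val q1 = Project(Seq("userName", "rating", "gPlusPlaceId", "utime"), Filter(Seq("texttime"), scan))
      println(pruneEmptyNavigate(addNavigate(q1)))   // Navigate carries only "utime", as in Figure 9

      // A query that touches only known attributes: the Navigate is removed again.
      val q2 = Project(Seq("userName"), Filter(Seq("texttime"), scan))
      println(pruneEmptyNavigate(addNavigate(q2)))
    }
  }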

π[userName#14, rating#15, gPlusPlaceId#19, utime#23]
  σ(texttime#20 = 'Feb 15, 2005')
    Navigate[*#21.utime AS utime#23]
      Relation[userName#14, rating#15, reviewText#16, review#17, categories#18, gPlusPlaceId#19, texttime#20, *#21]

Figure 9: Logical Plan for open schema

For the schemaless (any) relation, the corresponding plan uses a Navigate that resolves every queried attribute from the open column:

π[userName#24, rating#25, gPlusPlaceId#26, utime#27]
  σ(texttime#23 = 'Feb 15, 2005')
    Navigate[*#22.texttime AS texttime#23, *#22.userName AS userName#24, *#22.rating AS rating#25, *#22.gPlusPlaceId AS gPlusPlaceId#26, *#22.utime AS utime#27]

For this purpose, Spark SQL++ first infers at runtime the specific type of the value, using the Scala reflection provided by Spark SQL, and then proceeds as a regular cast from the inferred type to the desired type. The possibility of casting a value of type any makes it possible to solve the problem of arithmetic operations when at least one of the operands is of type any, by converting all the operands with type any to Decimal. If a value cannot be converted to a number, the cast returns null and, therefore, the final result of the arithmetic operation is also null, as desired. Even though the casting operation allows this alternative, we decided to adopt a slightly more complete resolution that takes the most general type among the operands to resolve the arithmetic operation. This guarantees that, for example, if we are only adding integers, the answer will be an integer rather than a decimal. Moreover, it guarantees that we do not do extra work by converting every value to decimal, converting only the values that need to be converted to the most general type of the operation. Similarly, it guarantees that we do not use floating point or decimal operations, which are more costly, unless they are absolutely needed. For every arithmetic operation, we first find the most general type of the operands, considering that:

Byte
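A minimal Scala sketch of this most-general-type resolution for addition over values of type any (ours; the real implementation works on Spark SQL's internal types, and the widening order shown is only an assumption based on the usual numeric widening):

  object ArithmeticOnAnySketch {
    // Assumed widening order: Int widens to Long, Long to Double; anything
    // that is not numeric at all makes the result null.
    private def rank(v: Any): Option[Int] = v match {
      case _: Int    => Some(0)
      case _: Long   => Some(1)
      case _: Double => Some(2)
      case _         => None   // not a number: the operation yields null
    }

    private def toLong(v: Any): Long     = v match { case i: Int => i.toLong; case l: Long => l; case d: Double => d.toLong }
    private def toDouble(v: Any): Double = v match { case i: Int => i.toDouble; case l: Long => l.toDouble; case d: Double => d }

    def add(a: Any, b: Any): Any =
      (rank(a), rank(b)) match {
        case (Some(ra), Some(rb)) => math.max(ra, rb) match {
          case 0 => a.asInstanceOf[Int] + b.asInstanceOf[Int]   // integers stay integers
          case 1 => toLong(a) + toLong(b)
          case _ => toDouble(a) + toDouble(b)
        }
        case _ => null
      }

    def main(args: Array[String]): Unit = {
      println(add(2, 3))       // 5   (integer + integer stays integer)
      println(add(2, 3.5))     // 5.5 (widened to double)
      println(add(2, "oops"))  // null (non-numeric operand)
    }
  }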