
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2007; 19:1785–1809
Published online 28 June 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1208

KDDML-G: a grid-enabled knowledge discovery system

Andrea Romei, Matteo Sciolla, Franco Turini and Marlis Valentini
Dipartimento di Informatica, Università di Pisa, Largo Bruno Pontecorvo 3, Pisa 56127, Italy

SUMMARY

KDDML-G is a middleware language and system for knowledge discovery on the grid. The challenge that motivated the development of a grid-enabled version of the 'standalone' KDDML (Knowledge Discovery in Databases Markup Language) environment was on one side to exploit the parallelism offered by the grid environment, and on the other side to overcome the problem of data immovability, a quite frequent restriction on real-world data collections that principally has a privacy-preserving purpose. The latter problem is addressed by moving the code and mining the data 'in place', that is, by adapting the computation to the availability and localization of the data. Copyright © 2007 John Wiley & Sons, Ltd.

Received 31 May 2006; Revised 10 February 2007; Accepted 10 March 2007

KEY WORDS: distributed data mining; grid middleware; classification

Correspondence to: Andrea Romei, Dipartimento di Informatica, Università di Pisa, Largo Bruno Pontecorvo 3, Pisa 56127, Italy. E-mail: [email protected]
Contract/grant sponsor: Italian MIUR FIRB; contract/grant number: RBNE01KNFP

1. INTRODUCTION

Grid computing is an emerging parallel and distributed computing technology that focuses on large-scale resource sharing: the grid is an abstraction that allows transparent and pervasive access to distributed computing resources. Other desirable features of the grid are that the access provided should be secure, reliable, efficient, and inexpensive, and that it enables a high degree of portability for computing applications. In [1], the 'grid problem' has been defined as 'flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources'. The latter are referred to as virtual organizations. Over the past years, research and development efforts within the grid community have produced protocols, services, and tools that address the challenges concerning scalable virtual organizations.

The aim of the current work of the KDD Laboratory group is to enable knowledge discovery in a grid environment. Data mining (DM), also known as Knowledge Discovery in Databases (KDD) [2], refers to the use of a large number of statistical techniques to tease patterns out of large volumes of data. DM is now used extensively in industries like banking and finance, communications, insurance, retail sales, and health care. The scenario of a knowledge discovery process may therefore be a distributed one, where heterogeneous and dynamic ensembles of resources and services, namely virtual organizations, are involved. In order to extract knowledge from large amounts of distributed data, sharing of resources is required, which involves direct access to computers, software, and data. As such, the grid represents a suitable and natural environment for KDD applications.

In the following we present how the KDDML (KDD Markup Language) environment [3–5] was redesigned in order to support KDD on computational grids. The KDDML system is a Java-implemented environment that supports the specification and execution of complex knowledge discovery processes. It acts as an XML-based middleware language and system in support of the KDD process, as meta-data, mining models, and queries are all represented as XML documents. To move the KDDML system onto the grid, we used ASSIST [6–8], a programming environment for the development of parallel and distributed high-performance applications. Recent efforts have led to the evolution of ASSIST for large-scale platforms and grids [8,9]. In this way, in order to port KDDML to the new environment, we could focus on application issues rather than on the details of the protocols and services required by such a particular environment.

The main advantage of enabling the system for a grid environment was to overcome the problem of data privacy and immovability. Indeed, a key problem that arises in any en masse collection of data is that of confidentiality. The need for privacy is sometimes due to law (e.g. for medical databases) or can be motivated by business interests. However, there are situations where the sharing of data can lead to mutual gain. Consider, for example, how DM can advantageously be applied in the medical field for pattern recognition and predictive analysis on pooled data. Despite the potential gain, this is often not possible due to the immovability of the confidential data. This problem can be successfully addressed in KDD processes by moving the code and mining the data 'in place'. In support of the needed coordinated use of resources at multiple sites for computation, the newest grid technologies provide protocols, services, and APIs for secure resource access, resource management, fault detection, and communication.

The organization of the paper is as follows. Section 2 introduces KDDML as a middleware language for knowledge discovery. Section 3 outlines the parallel architecture of the system; in particular, Sections 3.1–3.3 provide details of the main components. Section 4 reports a complete example of use of the system. Section 5 presents some experimental results. Section 6 discusses related work. Finally, Section 7 draws some conclusions.

2. BACKGROUND: KDDML AS MIDDLEWARE LANGUAGE

The KDD process, i.e. the process of finding 'meaningful information' within massive data sets, is a complex task that includes several steps: data preparation, model extraction, evaluation, and deployment.


In this respect, the KDDML language [4] has been designed by considering the KDD process as a query process, where the operations within a query can be nested. In the following, we present KDDML as a middleware, machine-processable language in support of the entire KDD process.

As the name suggests, the KDDML language and system adopt the XML standard as a glue for query definition and data/model representation. The language tries to be as independent as possible from lower-level implementations of DM algorithms and operators, with the aim of confining the technicalities to the implementation level of the KDDML system. KDDML queries are XML documents, where XML tags correspond to operators, XML attributes correspond to parameters of those operations, and XML sub-elements define arguments passed to the operators. The XML syntax of a generic operator is shown below:

  <OPERATOR_NAME xml_dest="results.xml" attribute_1="value_1" ... attribute_k="value_k">
      <ARG1_NAME> .... </ARG1_NAME>
      ...
      <ARGn_NAME> .... </ARGn_NAME>
  </OPERATOR_NAME>

The attribute xml_dest="results.xml" states that the result of the operator is stored in the system repository for further processing or analysis. The other attributes correspond to parameters of the operator (e.g. the minimum support for an association rule mining algorithm). The arguments ARG1_NAME, ..., ARGn_NAME of the operator must be of an appropriate type and sequence, i.e. an operator signature must be specified. Intuitively, there is one type for data sources, one type for each mining model (classification trees, association rules, clustering, sequential patterns) and one type for hierarchies of items. Other proprietary types are defined to represent specific KDDML objects, such as conditions, expressions, or algorithm specifications. We denote the signature of an operator F_OPERATOR_NAME: t1 × ... × tn → t returning type t by defining a DTD for KDDML queries that constrains sub-elements to be of types t1, ..., tn. Thus, KDDML queries correspond to terms in the algebra of operators, though syntactically represented as XML documents. The KDDML language is typed and compositional, and satisfies the closure principle required by a DM language [10].

Under this interpretation, the semantics of a KDDML query amounts to a strict functional execution of the corresponding term: the evaluation of the XML fragment above consists first of a recursive evaluation of the fragments from ARG1_NAME to ARGn_NAME, and then of a call to F_OPERATOR_NAME accepting as input n objects and yielding the final result of the fragment. Type checking is mainly static (by means of XML DTDs); only in some cases is dynamic type checking performed. Moreover, a copy of the result (which may be an intermediate result of a possibly larger query) is stored in the system repository if the attribute xml_dest is specified. Notice that repositories are persistent, in order to favor the reuse of extracted knowledge and preprocessed data.

As an example, consider the query reported in Figure 1. TREE_CLASSIFY is the operator that applies a decision tree to predict the class of instances in a test set. The test set is provided by the second element (with tag TABLE_LOADER), which specifies a table (testSet.xml) in the data repository of the system. A table is physically represented as an XML file, containing a schema and a reference to the actual data, which are stored in CSV (Comma-Separated Values) format. The data repository is populated by KDDML queries that yield tables as output. As one would expect, loaders for SQL SELECT queries and text files (ARFF, C4.5 data format) are also available.


Figure 1. A sample KDDML query.
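A plausible reconstruction of the query of Figure 1, assembled from the description in the surrounding text, is sketched below. The tag names (TREE_CLASSIFY, TREE_MINER, ARFF_LOADER, TABLE_LOADER) and the attribute xml_dest are those mentioned in the prose; the remaining attribute names, values, and file names are illustrative assumptions only.

  <TREE_CLASSIFY xml_dest="results.xml">
      <!-- first argument: the classification model, here mined on the fly -->
      <TREE_MINER class_attribute="class" pruning_confidence="0.25">
          <ARFF_LOADER file_name="trainingSet.arff"/>
      </TREE_MINER>
      <!-- second argument: the test set, taken from the data repository -->
      <TABLE_LOADER table_name="testSet.xml"/>
  </TREE_CLASSIFY>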

The construction of a decision tree (tag TREE_MINER) takes place on a training set given as an ARFF file (tag ARFF_LOADER) by applying a decision tree induction algorithm (here YaDT from [11]) with parameters concerning the pruning strategy of the algorithm. The name of the class attribute is provided as an attribute of the TREE_MINER element.

KDDML represents models as an extension of PMML (Predictive Model Markup Language) [12] documents. PMML is the industry standard for representing actual models via XML: it covers the exchange of the extracted knowledge and consists of DTDs for a wide spectrum of models, including association rules, decision trees, clustering, regression, and neural networks. As for data, the KDDML language assumes a model repository containing the extracted DM models. Operators for direct access to the models are available in the language definition.

As one would expect, the KDDML system embeds a library of algorithms and operators and basic mechanisms for adding new ones. The set of operators can be classified either according to the type of model they return or to the KDD process phase that they support. In more detail, the KDDML language includes operators for:

• data/model access from heterogeneous sources;
• data preprocessing, including operators for removing or adding attributes, filtering rows according to a specified condition, rewriting values of attributes, sampling, discretization, normalization, sorting of an attribute according to its values or frequencies, and more;
• model extraction, including operators to extract a mining model from a data source using a DM algorithm;
• model application, including operators to apply an extracted model to new data in order to predict features or to select data;
• model (meta-)reasoning, to combine two or more models for further processing;
• model filtering, including operators for cleaning models according to specified conditions;
• control flow and calls to external programs.

The design of the KDDML system architecture had to take into special account the requirement of extensibility, which can be distinguished into data source extensibility, algorithm/operator extensibility, and model extensibility. The system has been implemented in Java in order to be portable. In more detail, KDDML has a layered architecture (see Figure 2): each layer implements a specific functionality and supplies an interface to the layer above.


Figure 2. KDDML system architecture.

The bottom repository layer manages the read/write access to the data and model repositories and provides programmatic read/write access to the model contents. The operators layer above it is composed of the implementations of the language operators. The interpreter layer accepts a validated KDDML query and traverses the DOM tree representation of the query, applying a strict functional interpretation at each tag. It evaluates the query and returns its result back to the top GUI layer.

Summarizing, a KDDML query can be represented as a tree structure expressing the nested application and combination of DM steps. Each internal node represents a KDDML operator, and as such the tree describes the model of a corresponding parallel program. We introduce the discussion of the grid-enabled KDDML system by means of Figure 3, which shows a typical scenario where four parties, having private databases located on four distinct sites (iron, caronte, knowledge, paki), aim to cooperate in computing a KDD query.


Figure 3. The structure of a KDD query.

Since the databases are confidential, no party wishes to reveal any of the contents to the others. The only information the parties are willing to make public are the patterns and trends, i.e. the knowledge models, extracted from their data. Starting from this observation, we notice that each KDDML query can be decomposed into a set of autonomous tasks. The tasks can then be mapped and executed on several sites, involving distributed confidential data without moving them from their location.
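To suggest what such a cooperatively computed query might look like, the sketch below, under the same illustrative naming assumptions as the reconstruction of Figure 1, mines a decision tree over each confidential data set and combines the results with a meta-reasoning operator (TREE_META_CLASSIFIER is an operator name cited later, in Section 3.2.4). Each TREE_MINER subtree can be isolated as an autonomous task and executed on the site holding its data, so that only the extracted models travel:

  <TREE_META_CLASSIFIER xml_dest="global_model.xml">
      <TREE_MINER class_attribute="class">
          <TABLE_LOADER table_name="customers_iron.xml"/>      <!-- stays on site iron -->
      </TREE_MINER>
      <TREE_MINER class_attribute="class">
          <TABLE_LOADER table_name="customers_caronte.xml"/>   <!-- stays on site caronte -->
      </TREE_MINER>
      <!-- ... analogous sub-queries for sites knowledge and paki ... -->
  </TREE_META_CLASSIFIER>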

3. THE OVERALL SYSTEM ARCHITECTURE

In order to support knowledge discovery on the grid, the original KDDML architecture has been extended: besides the components for query specification, execution, and result visualization, a new component has been introduced, called KPQ (KDDMLProcessingQuery; see Figure 4). It acts as an interface between the GUI layer and the underlying layers (see Figure 2), which constitute the core KDDML Interpreter. Notice furthermore that, due to the necessity of moving the algorithms to the data location, a lighter version of the original KDDML system had to be defined. This movable version consists of the byte-code of the KDDML language interpreter, plus some further resources (DTD files, DM algorithms, and other utility libraries). This code, referred to as kddmlLIGHT, is moved and executed in place every time a query operation attempts to mine some remote and confidential data. The functions of KPQ are:

• to provide information about the data available on the grid;
• to preprocess each KDDML query, mainly by decomposing it into a set of autonomous sub-queries;
• to provide the guidelines for mapping the sub-queries onto the grid sites.


Figure 4. The overall system architecture.

Besides introducing a grid interface, some other services are needed to support the parallel and distributed KDDML execution on the grid. In particular, three further components have been introduced that are responsible for (i) retrieving grid status information; (ii) mapping each sub-query onto a grid site, with respect to some constraints; and (iii) managing and coordinating the execution of the sub-queries. These functions are performed, respectively, by a Resource Discoverer, a Constraint Compiler, and an Execution Coordinator (EC) (see Figure 4). The typical scenario of a query execution, depicted in the sequence diagram of Figure 5, is the following:

1. the Resource Discoverer provides an XML document (InfoGrid.xml), which describes information (i.e. location, size, mobility) about the data sets;
2. the user composes a query assisted by the KDDML GUI; the latter interacts with KPQ in order to discover and involve the available data resources;
3. KPQ preprocesses the query by decomposing it into autonomous sub-queries, and it generates the constraints concerning the problem of task mapping onto the grid sites;
4. the mapping constraints generated by KPQ are specified within an XML document (Constraints.xml); the Constraint Compiler then transforms those guidelines so that they can subsequently be used by the ASSIST-based EC to manage the mapping and the distribution of the sub-queries over the grid sites (Map.xml);
5. once the KDDML query has been preprocessed by KPQ, i.e. it has been decomposed into autonomous sub-queries, and the necessary information about task dependencies and intermediate results has been retrieved, the corresponding knowledge discovery process can be performed using the EC as an interpreter of the parallel and distributed computation.
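The paper does not show the contents of InfoGrid.xml; given that step 1 states it records the location, size, and mobility of the data sets, a plausible shape, with element and attribute names invented purely for illustration, is:

  <INFO_GRID>
      <DATASET name="customers_iron.xml"    site="iron"    size_mb="120" movable="false"/>
      <DATASET name="customers_caronte.xml" site="caronte" size_mb="85"  movable="false"/>
  </INFO_GRID>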

3.1. KPQ: system structure

KPQ is made up of three different modules (see Figure 6): the Coordinator, the Query Decomposer, and the Constraint Generator.


Figure 5. Sequence diagram describing a query execution.

Figure 6. Collaboration diagram of KPQ’s components.


3.1.1. The coordinator

The Coordinator is responsible for controlling the computation stream by calling both the internal components and the external modules, passing them the correct inputs. Once the query has been preprocessed for execution on the grid, the module first calls the Constraint Compiler and then the ASSIST-based EC.

3.1.2. The query decomposer

Its function is to decompose the input query into a set of sub-queries. Their execution, provided it respects the dependencies, leads to the same result as the sequential query. The reasons for splitting up a knowledge discovery query are mainly the immovability of data sources and, secondarily, the need for high performance. The decomposition algorithm is based on the tree structure of the query, an intrinsic characteristic of the KDDML language. The splitting algorithm performs a depth-first visit of the DOM tree representing the XML query (the Document Object Model (DOM) provides a set of APIs for navigating XML documents through a main-memory, hierarchical (tree-like) object-oriented model). Whenever the algorithm finds a node corresponding to a KDDML operator that can be isolated as a separate task, the subtree rooted at this node is cut off and replaced by an operator whose input is the result of the sub-query linked to the subtree (a concrete sketch is given at the end of this section). While visiting the DOM tree, the Query Decomposer generates further data structures that are used by other modules to derive information such as the execution order and the relations among sub-queries and intermediate results.

3.1.3. The constraint generator

After decomposition, the next step of KPQ is to analyze the properties of the generated sub-queries in order to fix the constraints for the grid execution. The Constraint Generator produces an XML document containing constraint lists sorted by type. Each different kind of constraint is delegated to a specialized module; finally, all constraints are collected and put into a file called 'Constraints.xml'. As mentioned above, KPQ passes the generated constraints to the Constraint Compiler, which generates a plan for distributing the sub-queries (tasks) over the grid nodes while maintaining compliance with the constraints. The next task of KPQ is initiating the ASSIST application, whose aim is to coordinate the sub-query execution. KPQ waits for a response from the ASSIST application. The exit status and the result of the query are then sent to the KDDML GUI, which shows them to the user.
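To make the decomposition concrete (as anticipated in Section 3.1.2), consider again the query of Figure 1. The TREE_MINER subtree can be cut off as a separate task that stores its model in the repository, and the parent query is rewritten so that its first argument loads that intermediate result. Section 2 states that operators for direct access to stored models exist; the TREE_LOADER tag used below is an illustrative guess for such an operator, and the file names are equally hypothetical.

  Sub-query 1, executed on the site where the training data reside:

  <TREE_MINER xml_dest="task1_model.xml" class_attribute="class">
      <ARFF_LOADER file_name="trainingSet.arff"/>
  </TREE_MINER>

  Rewritten parent sub-query, which consumes the intermediate result:

  <TREE_CLASSIFY xml_dest="results.xml">
      <TREE_LOADER model_name="task1_model.xml"/>
      <TABLE_LOADER table_name="testSet.xml"/>
  </TREE_CLASSIFY>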

3.2. The constraint compiler

As just noted, KPQ has the responsibility of planning the execution of the overall KDDML query in a parallel and distributed environment by generating a set of constraints. There are five types of constraints:



• Layering Constraints;
• Data set Immovability Constraints;
• Non-Anonymous Result Immovability Constraints;
• Complexity Constraints;
• Resource Constraints.

3.2.1. Layering constraints

Once the decomposition of a KDDML query has been performed, the Layering Constraints realize the partitioning of the sub-queries into layers, or sets. The sub-queries belonging to the same set are independent tasks and are assumed to have the same start time. This implies the possibility of simultaneous execution for all the sub-queries settled on the same layer. Moreover, there exists an ordering between the layers, which reflects the dependencies among sub-queries.

3.2.2. Data set immovability constraints

The data stored on the grid nodes allow local access, but they generally cannot be moved across the grid, first of all for privacy reasons. The only way to involve them in a computation is to execute the knowledge discovery process on the node where the data are stored. Notice that, even though some data are immovable, they could be replicated. For each instance of an immovable data set, the constraints specify:

• the name of the data set;
• the names of the sub-queries accessing the data set;
• the sites where the data set is stored.

3.2.3. Non-anonymous result immovability constraints

Like the input data, the intermediate computation results may be confidential too. In that case, they can be moved among remote sites only if they are anonymous. Thus, sub-queries that operate on non-anonymous intermediate results are forced to be performed on the same machine where the previous sub-query, yielding the needed input, has been executed. Both Data set Immovability Constraints and Non-Anonymous Result Immovability Constraints are inviolable.

3.2.4. Complexity constraints

Complexity Constraints lead to a classification of all KDDML operators into two classes, according to their computational characteristics. The A-class includes all the operators performing a low number of data scans, while the B-class includes the ones having a high computational complexity, rated in terms of data scans. Model application and pre- and post-processing operators, like REWRITING, RDA_FILTER, TABLE_MISCLASSIFIED, TREE_META_CLASSIFIER, and CLUSTER_NUMBER, are typically A-class computations, while model extraction operators, such as TREE_MINER, RDA_MINER, etc., belong to the B-class.


The mapping of A-class sub-queries should therefore aim at warranting input data proximity, as the major gain, in terms of performance, is given by reducing the communication cost between parent and child sub-queries. On the contrary, parallelization is taken into account for B-class operators, as overlapping the computation of high-cost operators may be advantageous, even if input transfer is required. We therefore distinguished two situations: one in which investing in data locality seems to be more profitable, and another in which overlapping the computation can improve the performance. According to the classification just described, the XML constraints file specifies two lists of sub-queries: the first one is associated with the 'locality' constraints and the second one with the 'overlapping' constraints. Complexity Constraints can be 'violated'.

3.2.5. Resource constraints

All KDDML operators, which result in tasks of a parallel and distributed computation, may require particular hardware and software resources. We have considered only a restricted number of requested features, i.e. the RAM availability and the presence of the software needed for the computation, as they represent the most relevant resources for establishing the sub-query mapping. In addition to the necessary resource requirements, sub-queries may also be associated with a priority expression, in order to specify a 'preference' for a certain node rather than for another one, according to the 'goodness' of the resources they are supplied with. Based on the previous considerations about the classification of the KDDML operators into two classes, the constraints are assigned as reported in Table I. The number in the second column denotes the minimum RAM capacity required by an operator, in megabytes. The entry 'JVM' in the third column indicates that a local Java Virtual Machine is required in order to execute the KDDML operators. The function ram(n) returns the quantity of RAM available on grid node n. Finally, the function KDDML_is_in(n) returns 1 if KDDML is already installed on the remote grid node n, and 0 otherwise. Notice that the weight associated with the presence of the KDDML code in the priority column is very low for B-class operators: for complex DM operators (B-class), this gives higher priority to the grid machines that minimize the number of I/O operations, even if it requires the KDDML code to be moved to the execution site. The presence of KDDML is given the weight value 1 just to provide a tie-break amongst the sites having the same maximum RAM capacity. All these pieces of information, which are contained in the file 'Constraints.xml', are loaded by the Constraint Compiler in order to produce a task mapping plan in compliance with the constraints.

Table I. Resources required for A-class and B-class operators.

  Class     RAM (MB)   Software   Priority
  A-class   —          JVM        ram(n) + 10000 * KDDML_is_in(n)
  B-class   512        JVM        ram(n) + 1 * KDDML_is_in(n)
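Pulling the five constraint types together, a Constraints.xml document for a small decomposition might look like the sketch below. The element and attribute names are our guesses, but the content mirrors Sections 3.2.1–3.2.5: layers of independent tasks, immovable data sets with the sub-queries and sites they bind, the 'overlapping' and 'locality' lists, and per-task resource requirements with the priority expressions of Table I.

  <CONSTRAINTS>
      <LAYERING>
          <LAYER id="1" subqueries="task1 task2"/>  <!-- independent: may run simultaneously -->
          <LAYER id="2" subqueries="task3"/>        <!-- depends on the results of layer 1 -->
      </LAYERING>
      <DATASET_IMMOVABILITY dataset="customers_iron.xml" subqueries="task1" sites="iron"/>
      <COMPLEXITY>
          <OVERLAPPING subqueries="task1 task2"/>   <!-- B-class: favor parallel execution -->
          <LOCALITY subqueries="task3"/>            <!-- A-class: favor input data proximity -->
      </COMPLEXITY>
      <RESOURCES subquery="task1" ram_mb="512" software="JVM"
                 priority="ram(n) + 1 * KDDML_is_in(n)"/>
  </CONSTRAINTS>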


In short, the mapping algorithm progressively reduces the set of candidate machines (grid nodes) for each sub-query, as follows: first, the inviolable constraints, i.e. the Immovability Constraints and the Resource Requirement Constraints, are applied; afterwards the Complexity Constraints, and finally the Resource Priority Constraints, are applied. The order in which the sub-queries are taken into account is given by the Layering Constraints, which also define a partition of potentially parallel sub-queries. Eventually, the Constraint Compiler generates as output a file called 'Map.xml', upon which the mapping of the sub-queries is based.
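Map.xml itself is not shown in the paper; under the same naming assumptions as above, the resulting plan might simply pair each sub-query with the grid node selected for it:

  <MAP>
      <TASK id="task1" site="iron"/>       <!-- forced there by the immovability constraints -->
      <TASK id="task2" site="caronte"/>
      <TASK id="task3" site="knowledge"/>  <!-- A-class: placed close to its input results -->
  </MAP>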

3.3. An ASSIST-based execution coordinator

The EC has been developed using an environment called ASSIST (A Software development System based upon Integrated Skeleton Technology) [6–9,13,14]. We describe in the following how the ASSIST-based KDDML query processing consists of a task farm computation, where each farm worker performs a step of the KDD process by solving one of the isolated sub-queries. With the ASSIST programming environment we could advantageously reuse the KDDML DM environment that was originally designed for standalone computers, without paying much attention to the low-level aspects related to the novel underlying platform.

Referring to Figure 4, the EC acts as an interpreter of the tree structure of the query. It coordinates the parallel and distributed execution of its nodes, further referred to as tasks or sub-queries. Each sub-query is then solved by invoking the KDDML interpreter, which is moved, if necessary, onto the remote site as a jar library. As shown in Figure 7, the EC, corresponding to the ASSIST program, consists of two sequential modules and a parallel one.

Figure 7. The schema of the ASSIST-based Execution Coordinator.


One of the sequential modules is the Emitter: it forwards the requests of task execution as soon as all the required input data are available. The parallel module, called Manager, describes a task farm computation, where each worker, VP_i, executes a KDDML interpreter, which is invoked to solve a single task. The results are sent back to the Emitter, which enforces the dependencies between the tasks. Besides, a further sequential module, called InitModule, is needed in order to perform several initialization operations.

Both sub-queries and intermediate results are sent among the computing modules over the communication channels represented by arrows in the schema of Figure 7. As the sub-queries consist of small-size XML documents, they can be passed 'by value' as strings, whereas the output and input data of the sub-queries may have a large and/or unpredictable size. This problem can be faced using a shared memory mechanism that allows passing those data 'by reference'. Supported by the smReference library, which is provided together with the ASSIST environment [8], the query arguments and results can be written in shared memory areas, and just the references to those areas are passed within the streams.

Each sub-query that is submitted to the Manager parallel module must be distributed to an appropriate worker, VP_i, maintaining the mapping constraints. As a suitable mapping plan (see Section 3.2) is produced by the Constraint Compiler, the task distribution is done on this basis. In order to combine kddmlLIGHT, the Java-written KDD Markup Language interpreter, and the C++ code within the virtual processor (VP) section of the ASSIST program, the JNI (Java Native Interface) technology has been used. The latter supports an invocation interface that allows embedding a Java Virtual Machine implementation, i.e. through the native libjvm.so library, into the native EC application. This way the kddmlLIGHT.jar library can be moved and invoked on every grid machine where a JRE is installed. Using the JNI functions, each VP creates its own Java Virtual Machine, invokes the query interpreter several times, and finally destroys the JVM.

Notice that the execution of each query is coordinated by its own instance of the EC program, which is configured on the basis of the number and the location of the data sets: a VP instance will run on each machine that is involved in the computation, i.e. that appears within the mapping plan. Consider that the ASSIST compiler generates five executable files for the EC program depicted above: ND000_InitModule, ND001_Emitter, ND002_Manager_ism, ND002_Manager_osm, and ND002_Manager_vpm. The EC is launched from within KPQ, on the user site. On each machine that will be involved in the query computation a VP-corresponding executable program, namely ND002_Manager_vpm, is launched, while the remaining four executable programs are all launched on the user machine. Thus, the information concerning the preprocessed query can be passed from KPQ to the EC, on the user site, within some text files. In the same way, the overall result and the exit status information are returned from the EC to KPQ through the local file system.

4. EXAMPLE

In this section, we show an example of how the grid-enabled KDDML system can be used advantageously for knowledge extraction from data within a large-scale grid environment. Consider a bank that wishes to perform credit risk analysis in an attempt to identify non-profitable customers before granting a loan. The attributes of a customer may include Age, Work-class, Education, Marital-status, Relationship, Capital-gain, Capital-loss, and other relevant, but strictly confidential and immovable, information.§ A further attribute may be Profitable-customer, which represents the trustworthiness of a customer based on their credit history; its values may be 'yes', 'no', or 'unknown'. The bank is interested in learning rules such as:

• If Capital-gain > 6849.0 and Education ∈ {'Assoc-voc', 'Masters', 'Doctorate'} then Profitable-customer = 'yes'.
• If Capital-gain