A workflow modeling system for capturing data provenance




Computers and Chemical Engineering 67 (2014) 148–158


Girish S. Joglekar*, Arun Giridhar, Gintaras Reklaitis
School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA

Article history: Received 6 January 2014; Received in revised form 1 April 2014; Accepted 9 April 2014; Available online 18 April 2014

Keywords: Workflow modeling; Knowledge management; Recipe management; Data provenance

Abstract

A workflow is an abstraction of the steps associated with the underlying work process and is typically modeled as a directed graph. The workflow concept, under its various manifestations, has been used to model applications in diverse areas, including project planning, manufacturing, scientific experiments, execution of computer software, and publishing. While the Open Provenance Model Core Specification has laid the foundation for defining the key concepts in a workflow, a simplified, high level graphical representation of a workflow that is widely applicable is not available. In this paper we describe a novel general framework for building workflows and implementing the associated actions, which will facilitate understanding of work processes across multiple disciplines. Most work processes are organized hierarchically with well defined control and management responsibilities, and this framework will facilitate integration and coordination of activities across the associated domains. Additionally, it will act as a template for referring to the associated metadata as well as a reference for accessing the instance data from archives of completed workflow cases. When a specific case is in progress, a finite state machine guides the user through the steps and provides up to date information about the current state. We describe the main building blocks in the framework and their functionalities, and illustrate the integration of an experimental workflow with a scientific workflow. © 2014 Elsevier Ltd. All rights reserved.

1. Introduction

In the past decade there has been an unprecedented growth in the amount of information being generated and managed, both within the technical domain and in the broader business setting. Due to the ever accelerating advances in the applications of information technologies to all aspects of running an enterprise, the problems of managing data, creating knowledge, making better decisions and doing so in real time will continue to become more complex and acute. In order to derive maximum benefit, it is imperative that the information be captured and stored in a structured way, and that it be machine accessible. Only then can such vast amounts of information be processed by computer assisted methods to provide effective and timely decision support. The proverb 'prevention is better than cure' applies fully to the state of information management. If the information is not stored in a structured, semantically rich fashion to begin with, then it becomes very expensive, and sometimes impossible, to retrieve the desired items of information later. This is evident, for example, from the plant data historians or the electronic lab notebooks that are in use today.

Even with today's very efficient search engines, brute force searches are highly inefficient and time consuming. The current solution for such situations is to write custom computer software for every special requirement, linking multiple information repositories with disparate data identifiers and creating specific search protocols to mine for data related to the issue at hand. Similarly, there are software companies that specialize in annotating, using natural language processing techniques, information and reports that were created using word processors or spreadsheets. Therefore, moving forward, in order to avoid such case-by-case solutions, it is important to ensure that all information is captured in a semantically rich format. Aside from defining the meaning of each data item that is stored in a repository, its metadata, it is also important to define the context of the information. The context principally defines the various steps executed in creating the information and the conditions associated with each step. The steps in essence constitute the workflow, alternatively called the provenance of the information. The interest in collecting provenance is growing because it is necessary for a variety of functions, such as checking the validity and quality of information, facilitating reproducibility, and supporting analysis and the creation of new knowledge.

2. Workflows


The concept of workflows has been in use in a wide variety of domains such as business processes, manufacturing, scientific research, computing and medicine (Schwartz, 2006). The most common technique for modeling a workflow is to create a high level graphical representation consisting of a directed graph with associated nodes and edges. The graph defines the sequence of, and interactions between, the various steps associated with a workflow. Simple directed graphs provide adequate expressive power for some domains, such as business processes and scientific computing, where the action(s) associated with a node influence only the node itself and its immediate neighbors. However, in most manufacturing situations, particularly in chemical manufacturing, an action pertaining to a node may require relationships with a set of nodes beyond the immediate neighbors. Moreover, unlike simple directed graphs where edges typically represent information exchange or signals, in manufacturing some edges may represent continuous transfer of material or exchange of a discrete quantity of material or some entity. Whenever material is exchanged, the control of execution of the associated steps is not necessarily the same as that represented by a directed edge, namely that the 'from node' controls the execution. As a result, additional descriptors are necessary to specifically identify the node that controls the execution of a series of steps. The Open Provenance Model (OPM) core specifications (Moreau et al., 2011), which are general and applicable to workflows, do not adequately address the special situations arising in manufacturing systems. The S88 standards (International Society for Measurement and Control, 1995) for batch processes define the concepts and terminology for manufacturing recipes and use a network description for batch recipes at several levels of specificity (general, site, master). The S88 representation of a recipe is thus the equivalent of a workflow. Under S88 at the process level, execution of the batch recipe is managed by means of a procedural control system. The workflow system described in this paper provides a similar mechanism for step by step execution of a workflow using an engine which is also applicable to non-manufacturing applications. For developing a knowledge framework for bioprocesses, a workflow based approach has been strongly recommended (Junker et al., 2011).

Fig. 2. An example of a scientific workflow.

2.1. Workflow types

There are four main categories of work procedures, or workflow types: business workflows, scientific workflows, experimental procedures and manufacturing recipes. Individually they represent different domains. A business workflow is mainly concerned with the modeling of business rules, policies, and project management, and therefore is often control- and activity-oriented. Typically a business workflow has one or more specific deliverables associated with it. Those deliverables could be concrete decisions, information that will support a decision, or publishable information that becomes part of a knowledge base. An example of a business workflow is shown in Fig. 1.

Fig. 1. Example of a business workflow for processing a material requirement request.

A comparison of the main approaches to modeling business processes is given in Borger (2012). A scientific workflow models the execution of computational or data manipulation steps in a scientific application (see Fig. 2). Typically the nodes alternate between data nodes and program execution nodes. The data nodes represent either the data input to or the data generated by a computational node. An experimental workflow models the steps executed while conducting an experiment at the laboratory, pilot plant or test-bed level. The main use of an experimental workflow is to record the conditions defining a given experimental run and the values of the observed variables. This information may include specific protocols and calibration procedures as well as on- and off-line analyses. Typically a set of experiments is performed based on a design of experiments defined to meet certain objectives. The selection of variables and their ranges is typically a result of a scientific workflow. A manufacturing workflow, or recipe, models the steps executed during the manufacture of a product in a manufacturing facility. The recipe typically defines the preferred values of operating variables and the preferred operating sequence in order to make a given product. Manufacturing recipes instruct the operators and/or provide the information for a plant-wide procedural control system. The recipe may lead to the production of discrete entities or bulk product in a batch or continuous mode. For certain classes of experiments, the manufacturing recipe and the experimental workflow may be identical at the conceptual level, the only difference arising in the scale of manufacture or the identity of the equipment used.


An example of a manufacturing recipe will be given in the following section.

2.2. Existing workflow modeling approaches

Several workflow management systems have been developed for scientific workflows; these include Taverna (Oinn et al., 2006), Kepler (Kepler Web Site, 2013), and Pegasus (Deelman et al., 2005). Typically, a scientific workflow is modeled as a directed acyclic graph, with nodes representing executable programs connected to nodes representing data. The data node(s) upstream of a program node represent data input to the program, while the data node(s) downstream of a program node represent data created by the program. An example of a scientific workflow for estimating an individualized dosage regimen for gabapentin (Laínez et al., 2011) is shown in Fig. 2. The user enters the data required by the two tools (the Bayesian parameter estimation and Dosage regimen individualization nodes), which are represented by the 'GUI Input' node. The entered data is written out in an xml file. Each tool interprets the xml file and uses the data relevant to itself. First, the 'Bayesian parameter estimation' tool is executed, which estimates the posterior distributions of the parameters of a first order one compartment model. This tool employs a Markov chain Monte Carlo (MCMC) approach for the Bayesian estimation of the posterior distribution. The 'Dosage regimen individualization' tool, which is executed next, uses the posterior distribution generated by the first tool to determine the optimal dosage regimen for an individual. The inputs are the plasma concentration–time data for an individual, the size of the parameter sample, the tuning parameter for the MCMC, an initial guess of the parameters, the confidence level for the estimation of the dose, and the preferred interval of administration. The outputs include the marginal probability distribution of the parameters, the dose range that achieves the selected confidence level, and the confidence region for the concentration given the individualized dose. These types of systems for managing scientific workflows are generally not well suited for other types of workflows, which are resource and procedure centric. The Smart Manufacturing Leadership Coalition (Davis et al., 2012) offers a platform for intelligent manufacturing that has the ability to orchestrate workflows that integrate information and decision making processes. The workflows are modeled as scientific workflows using the Kepler system (SMLC Workshop, 2013). A survey of workflow management tools used for massive data analysis performed in Grid computing environments is presented by Senthil and Santhosh (2012).

Graph-based representations of activity networks are also used in Petri net models and in discrete event simulators, including specific batch process simulation systems such as Batch Process Developer (BatchProcess Developer, 2013) and Batches (Batches Users Manual, 2003). Petri nets are principally used to model logical networks representing discrete decisions and do not have a direct link to mechanisms for data generation, retrieval and storage. Discrete event simulators (e.g., ExtendSim) use graphical representations of activity networks, which typically can be represented by a series of steps whose termination is controlled by state and time events and whose initiation and/or alternative routing is controlled by dispatching rules. Additionally, the typical semicontinuous steps in a chemical process, which span more than two stages, are difficult to model with Petri nets and activity networks. Building Petri nets is non-trivial, and the resulting models can become too large to generate all states of the system and difficult to analyze (FAA Human Factor Tools, 2013). Finally, workflows can also be used to drive a process simulation model consisting of sets of differential/algebraic equations. For example, one way to use the gPROMS™ simulator to model the recipe shown in Fig. 3 would be to first create a flowsheet consisting of all the unit operations equipment in the process.

Fig. 3. Workflow for manufacture of product P1.

Then one would select a model from the gPROMS™ library to represent each subtask and create a 'workflow' to sequence the subtasks of the recipe in the required order. The workflow can then be used in multiple ways to drive the execution of the gPROMS model. The simplest way would be to create the sequencing information from the workflow shown in Fig. 3. The sequencing is clearly defined by the workflow, and a simple interpreter can be developed to generate that information in the format required by gPROMS. A more sophisticated interface could be created for generating a complete gPROMS model from the workflow by allowing the selection of a model for each subtask and using the process parameters already defined in the workflow. This would eliminate duplication of information and reduce the time required to build a model. A detailed description of such an approach is beyond the scope of this paper.

In this paper we describe a new, general purpose workflow model that provides a generalization covering the variety of workflows and can be used effectively to capture the provenance of data generated from the various domains and forms described above. This workflow based system consists of four components: a graphical editor to build a workflow for the given application, the core building blocks in the workflow model, a library of functions that perform certain actions, and an engine that manages the execution of a workflow. A graphically oriented system has several benefits. First, during the early stages of workflow development, when the underlying steps are not yet finalized, the graphical builder allows models to be created and modified very easily, thereby allowing free exchange and convergence of ideas among the collaborators. This facilitates the creation of a workflow that is robust and meets the approval of all. Typically a workflow template is used repeatedly to create various instances of its use in specific cases. In that role, a workflow serves as a framework for storing the parameter values specific to each instance and the results or data output associated with that instance. Additionally, the associated workflow execution engine provides up to date status information about the instance in progress and assists the end user in managing the details of the current step as well as in progressing through the sequence of steps comprising the workflow. Finally, in its role as a reference to all associated information, a workflow is useful in guiding the user to the required information or in allowing computers to systematically access the required information by traversing the underlying structure. Thus, all the information stored in a repository becomes fully machine accessible, a crucial requirement for data analysis and data mining for new knowledge creation.

3. Workflow model

The general purpose workflow modeling framework advanced in this paper consists of the following building blocks:

a) workflow
b) task
c) subtask
d) material flow descriptors: source and sink nodes, subtask input/output ports, and transfer lines
e) information flow descriptors: input/output ports, information flows and data nodes.

The functions of the building blocks are described in the next few subsections.

3.1. Workflow and task

A workflow is the sum total of all of the building blocks that describe the details of a procedure. It is in essence a container that defines the scope of the activities, which in turn are decided by the user. By default, a workflow consists of at least one task. A task represents a series of steps performed on or by the assigned resource. In the case of experimental or manufacturing workflows, a task is performed on a piece of equipment or an instrument assigned to it. In the case of a business workflow, a designated person or team is responsible for implementing the assigned task. In the case of scientific workflows, the resource is typically the hardware and/or software by means of which a computational task is performed. Whenever a resource is shared by multiple tasks, the allocation of that resource to perform a given task is dictated by its availability, its suitability and the priority of the task to be performed. The onset and execution of a task are delayed until the required resource is assigned to it. A task is similar to the concept of unit process defined in the S88 standards. There is no equivalent concept in OPM. A subtask is a basic elementary step that is used in constructing a task. A series of subtasks constitutes a task and defines its scope. When a task is initiated, its execution begins with its first subtask and continues until the last subtask in its series is completed.

3.2. Material and information flows

Depending on the nature of the workflow network of tasks, its execution will entail transfer of actual physical entities or materials, of information, or both as these are used and generated. The material flow (in a generic sense) requires definition of material source and sink nodes, the locations of ports in specific subtasks which serve as inputs/outputs for that material, and transfer lines indicating the movement of material between specific subtasks. The transfer lines define the logistical network for the advancement of material through the workflow. For purposes of graphical depiction in the following discussion, material transfers are represented by solid lines, originating either from a material source node (represented by a yellow pentagon) or a subtask output shown as a triangle on the right vertical edge of a subtask rectangle, and terminating either in a sink node (a yellow triangle) or a subtask input shown as a triangle on the left vertical edge of a subtask rectangle. The information flow within a workflow likewise requires the definition of data sources and sinks, the locations of ports in specific subtasks which serve as inputs/outputs for that information, and arcs, or transfer lines, indicating the movement of information between specific subtasks. As in the case of material flows, information flows track the development of information and data as the workflow is executed. For graphical depiction purposes, information transfers are designated by dotted lines, with a solid circle designating an output port of a subtask and a hollow circle an input port on a designated subtask.
The concepts of material and information flows are certainly not new: they have been used in the representation of computational strategies for the execution of steady state process flowsheet simulation models (Westerberg et al., 1979). Corresponding to the notion of material source and sink nodes, the proposed representation also uses the construct of a data node.
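To make the roles of these building blocks concrete, the sketch below shows one way they could be represented in code. This is a minimal illustration in Python; all class and field names are our own choices and are not part of the system described in the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Polarity(Enum):
    """Solid triangle/circle = active (pulling input, pushing output);
    hollow = passive (accepts a push or allows a pull)."""
    ACTIVE = "active"
    PASSIVE = "passive"


@dataclass
class Port:
    name: str
    kind: str               # "material" or "information"
    direction: str          # "input" or "output"
    polarity: Polarity


@dataclass
class Subtask:
    """Basic elementary step; a series of subtasks constitutes a task."""
    name: str
    ports: List[Port] = field(default_factory=list)
    parameters: dict = field(default_factory=dict)      # keyword-value pairs


@dataclass
class Task:
    """Series of steps performed on or by an assigned resource."""
    name: str
    subtasks: List[Subtask] = field(default_factory=list)
    resource: Optional[str] = None      # equipment, person/team, or hardware/software


@dataclass
class Flow:
    """Material transfer line or information arc connecting a source, sink or
    data node, or a subtask port, to another port or node."""
    kind: str            # "material" or "information"
    source: str          # e.g. "A" (source node) or "PrepA.Empty.out"
    destination: str     # e.g. "React.FillA.in" or "TXData" (a data node)


@dataclass
class Workflow:
    """Container that defines the scope of the activities."""
    name: str
    tasks: List[Task] = field(default_factory=list)
    flows: List[Flow] = field(default_factory=list)
```

A fragment of the recipe of Fig. 3 would then be expressed as, for example, Task("PrepA", subtasks=[Subtask("FillA"), Subtask("Mix"), Subtask("Empty")]).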


3.3. Data nodes

A data node serves two main purposes: first, it contains the metadata for the information which is required as input to the workflow or which is generated as output from it and, secondly, it contains pointers to the location where the data itself is stored. A workflow typically contains at least one data node, which is a child of that workflow. A data node can be connected to a subtask only by an information flow. For purposes of graphical representation, a data node is shown as a red bordered rectangle with a suitable name. A data node can either be a data creation node or a data specification node.

3.3.1. Data creation node

A data creation node completely defines the structure, or metadata, of the data created during the execution of the associated subtask. A data structure may comprise a set of values, a table, or a combination thereof. The metadata of a set of values is simply a set of terms from a predefined vocabulary. The vocabulary provides the interpretation of each term when necessary. A vocabulary in turn may be defined as an ontology or simply via a table. The metadata of a table of data consists of a fixed set of columns, each with a name, an index and a measuring unit. Alternatively, the name and measuring unit may be replaced by a term from a vocabulary. Each instance of a data creation node also contains the path or a pointer to the physical location where the data is stored.

3.3.2. Data specification node

A data specification node identifies the specific data item(s) to be retrieved from a data repository. A data item in a data specification node is a reference to a specific metadata item from any of the data creation nodes defined in the library of workflows in a repository. Thus, an item in a data specification node may refer to an item in a data creation node, to a column in a table in a data creation node, or to a parameter associated with a task or subtask.

3.4. Example

As an example, the batch manufacturing workflow for making product P1 by reacting two chemicals A and B is shown in Fig. 3. (The complete set of workflow symbols used in the graphical representation in Fig. 3 and subsequent figures is summarized in Appendix A.) The production recipe consists of four tasks shown as green rectangles, PrepA, React, Filter and Store, each performed in the specific piece of equipment required by it. Each task, in turn, is modeled as a series of subtasks that correspond to the individual steps performed when that task is carried out. The PrepA task starts with filling the required amount of raw material A (yellow pentagon) during subtask FillA (white rectangle), mixing it for the specified amount of time (subtask Mix) and then emptying the contents (subtask Empty) into the downstream reactor when that unit is ready to receive this material during operation of subtask FillA of task React. The React task starts with filling raw material B during subtask FillB and then transferring material A from the upstream equipment (subtask FillA). The content is heated (subtask Heat) until a certain temperature is reached and allowed to react (subtask React) until certain yields are achieved. During the reaction the temperature and composition are continuously monitored and recorded (data node TXData). After the reaction, the content is continuously filtered (task Filter, subtask Filter), resulting in a product P1 stream (sink P1, a yellow triangle) and a mixture of unreacted A and B that is stored in a storage tank during subtask FillAB.


In this example, the information flow between the subtask React and the data node TXData is shown in Fig. 3 as a dotted line. In general there may be additional data nodes associated with the on- or off-line analysis of the product or of the reactants. Although this application is batch manufacturing oriented, the set of building blocks shown in Fig. 3 provides the necessary functionality to model all the workflow types mentioned above, at the level of detail required by a given application. Moreover, the graphical representation of a workflow provides a very concise view of the logistics and controls employed in implementing that workflow. In principle, since any organized activity can be cast as a workflow, the proposed workflow model can facilitate the development of a common framework for managing related activities in various domains. Such a framework will facilitate common understanding and communication of concepts horizontally and vertically within an organization.
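As an illustration of the data node concepts of Section 3.3, the following sketch (Python, with hypothetical field names) shows how the metadata of the TXData data creation node of Fig. 3 might be declared, together with a data specification node that refers back to one of its columns. The column names and units are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    """One column of a tabular data structure in a data creation node."""
    name: str          # or a term from a predefined vocabulary
    index: int
    unit: str


@dataclass
class DataCreationNode:
    """Defines the metadata of the data created by the associated subtask and
    a pointer to where each instance of that data is stored."""
    name: str
    values: List[str] = field(default_factory=list)     # terms from a vocabulary
    table: List[Column] = field(default_factory=list)
    storage_path: Optional[str] = None                  # filled in per workflow instance


@dataclass
class DataSpecificationNode:
    """Identifies a specific data item to be retrieved from the repository,
    by reference to a metadata item of some data creation node."""
    name: str
    source_node: str     # e.g. "WorkflowP1.TXData"
    item: str            # e.g. a column name, or a task/subtask parameter


# Metadata for the TXData node of Fig. 3: temperature and composition
# recorded against time during the React subtask (units are illustrative).
tx_data = DataCreationNode(
    name="TXData",
    table=[
        Column("Time", 0, "min"),
        Column("Temperature", 1, "degC"),
        Column("Composition", 2, "mol fraction"),
    ],
)

# A downstream workflow could then refer to the recorded temperature profile:
temp_profile = DataSpecificationNode(
    name="ReactorTemperature",
    source_node="WorkflowP1.TXData",
    item="Temperature",
)
```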

4. Workflow control

The logistics and control of the execution of a workflow require the concepts of independent/dependent tasks and active/passive subtasks, which are described in this section.

4.1. Task control

In general, a task defines all the actions performed using some set of assigned resources. A task can be independent or dependent. An independent task is initiated by the application or the user at a specific point in time, subject to the availability of a set of suitable resources. A dependent task is initiated by an upstream subtask that interacts with the first subtask of that task via either material transfer or information transfer, subject to the availability of the required resources. In Fig. 3, PrepA and React are examples of independent tasks, and Filter and Store are examples of dependent tasks. Since the first subtasks of PrepA and React fill raw materials, they are independent, whereas task Filter is triggered when the React task is ready to empty its contents, and similarly the Store task is triggered in order to receive material from the filter. If an application is concerned with just-in-time operation, the initiation of React would be tied to the initiation of PrepA via the durations of subtasks FillA and Mix of task PrepA, and FillB of task React. The task type, independent or dependent, can be deduced from the workflow diagram created by the user. The task type plays a key role in influencing the assignment of resources and the logistics of task initiation during the execution of a workflow. By definition, a task contains at least one subtask.

The systems available for modeling scientific and business workflows, such as Pegasus or Taverna, do not employ the concept of tasks, and thus a node in those workflows would be equivalent to a subtask. Of course, in those systems the arcs represent only information flows. The concept of task provides an extra dimension to the workflow representation, which is very useful in modeling complexities introduced by the hardware. For example, when multiple pieces of equipment or instruments are suitable for a task, such as parallel pieces of equipment or parallel CPUs, then those simply become parameters of the task without affecting the subtask level representation. Similarly, if the same workflow is to be implemented using a different physical set up, then again only the task information needs to change while the rest of the workflow stays intact. In manufacturing situations, such as the workflow in Fig. 3, where several subtasks are performed in the same piece of equipment, the concept of task is central to creating a workflow model. Without the concept of task, with just nodes and edges, it would be very difficult to show accurately the three important characteristics of workflows: multiple suitable pieces of hardware, implementation of the sequence of subtasks on the assigned piece of hardware, and creation of workflows that can be easily adapted to different sets of hardware.

The additional complexities due to the physical layout of hardware can also be managed easily by the concept of task. In the example discussed above, suppose there are two vessels available for task PrepA, P1 and P2, and two reactors for task React, R1 and R2, and P1 can be connected only to R1 and P2 only to R2. Such hardware related constraints should not affect a workflow model. The concept of task allows such constraints to be accommodated without affecting the subtask level representation of a workflow. The key step in executing a task is to assign a resource to it. Once the resource is assigned, the subtasks are executed in the order they are defined in the workflow model, beginning with the first. When the last subtask is executed, the task is completed and the assigned resource is released.

4.2. Subtask control

Given that the communication between tasks occurs at the subtask level, the control of information and material flows between subtasks has to be performed at that level. To capture that level of control, it is convenient to introduce four subtask types: Master, Slave, Chaining and Decoupling. For simple depiction of the subtask type in the workflow diagram, we introduce a suitable graphical coding. A material input on a subtask is represented by a triangle, hollow or solid, on the left vertical edge of the subtask, thus pointing toward the subtask. A material input that pulls material from upstream is depicted as a solid triangle, a pulling input. If a material input accepts material pushed by an upstream output, it is depicted as a hollow triangle, a passive input. A material output on a subtask is represented by a triangle, hollow or solid, on the right vertical edge of the subtask, thus pointing away from the subtask. A material output that pushes material downstream is depicted as a solid triangle, a pushing output. If a material output allows a downstream input to pull material, it is depicted as a hollow triangle, a passive output. Thus, the output and input on a material transfer line must be of opposite 'polarity': either the output pushes material (solid triangle) and the input receives material (hollow triangle), or the input pulls material (solid triangle) and the output allows material withdrawal (hollow triangle). The type of a subtask can be determined by its depiction in the associated task. The rules for determining the subtask type are given below.

4.2.1. Master subtask

A subtask is a master subtask if it has i) no material inputs or outputs and no information inputs or outputs, or ii) only pulling input(s) and no material output(s), or iii) only pushing output(s) and no material input(s), or iv) pulling input(s) and pushing output(s), or v) only information output(s). The various depictions that make a subtask a master subtask are shown in Fig. 4(a).

4.2.2. Slave subtask

A subtask is a slave subtask if it has a passive input and no material output, or a passive output and no material input, or an information input and no material input or output.


Fig. 4. Possible subtask depictions and the associated subtask types.

The various depictions that make a subtask a slave subtask are shown in Fig. 4(b).

4.2.3. Chaining subtask

A subtask is a chaining subtask if it has a passive input and a pushing output, or a passive output and a pulling input. In the first case, the upstream subtask pushes material into the chaining subtask, and in turn the chaining subtask pushes material downstream. In the second case, the downstream subtask pulls material from the chaining subtask, and in turn the chaining subtask pulls material from upstream. The various depictions that make a subtask a chaining subtask are shown in Fig. 4(c).

4.2.4. Decoupling subtask

A subtask is a decoupling subtask if it has a passive input and a passive output. A decoupling subtask allows the upstream subtask to push material into it and allows the downstream subtask to pull material from it asynchronously. The depictions that make a subtask a decoupling subtask are shown in Fig. 4(d).

A subtask's type influences the actions taken while executing that subtask. Whenever there is a transfer of material between subtasks, that transfer can be implemented only when the tasks to which the subtasks belong are in the right state. For the workflow shown in Fig. 3, if task PrepA advances to the Empty subtask, it must wait for the React task to advance to the FillA subtask. Similarly, if the React task advances to the FillA subtask, it must wait for the PrepA task to advance to the Empty subtask. In this particular transfer of material, FillA is the master subtask, as can be inferred from the diagram. The same is true if the material transfer spans more than two subtasks. Again for the workflow shown in Fig. 3, the Empty subtask of the React task, the Filter subtask of the Filter task and the FillAB subtask of the Store task form a semicontinuous chain of material flow. The chain of subtasks can operate only when the corresponding tasks are in the right subtasks. In this particular semicontinuous chain, the Empty subtask is a master subtask, Filter is a chaining subtask and FillAB is a slave subtask.

When a task advances into a subtask, the main steps in executing that subtask are: wait until all the necessary conditions to start the subtask are satisfied, start all the subtasks controlled by the master subtask, implement the actions associated with the subtask as specified by the user, detect when the conditions for ending the subtask are satisfied, end the subtask, inform all other interacting subtasks and advance the task to the next subtask. When the last subtask of a task is completed, the task is ended and the resource assigned to it is released. A specific set of actions is associated with each subtask. The actions are carried out when the subtask is 'active' and depend on the application for which the associated workflow is being used. For example, if the workflow in Fig. 3 is used in manufacturing, the subtasks may provide information to an operator such as rpm, temperature, duration and so on. On the other hand, if the workflow is used to drive a simulator, the dynamic model specified with a subtask is implemented. When used as a structure for referencing information, the building blocks in the workflow diagram provide links to access the associated information.
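The typing rules of Sections 4.2.1–4.2.4 can be stated compactly in code. The sketch below (Python; the function name and port-count arguments are our own, not part of the paper) classifies a subtask from the polarity of its material ports and the presence of information ports, mirroring the rules above.

```python
from enum import Enum


class SubtaskType(Enum):
    MASTER = "master"
    SLAVE = "slave"
    CHAINING = "chaining"
    DECOUPLING = "decoupling"


def classify_subtask(pulling_inputs: int, passive_inputs: int,
                     pushing_outputs: int, passive_outputs: int,
                     info_inputs: int = 0, info_outputs: int = 0) -> SubtaskType:
    """Apply the rules of Sections 4.2.1-4.2.4 to the counts of each port kind."""
    has_mat_in = pulling_inputs + passive_inputs > 0
    has_mat_out = pushing_outputs + passive_outputs > 0

    # 4.2.4 Decoupling: passive input and passive output (asynchronous buffer).
    if passive_inputs and passive_outputs:
        return SubtaskType.DECOUPLING

    # 4.2.3 Chaining: passive input with pushing output, or
    #                 passive output with pulling input.
    if (passive_inputs and pushing_outputs) or (passive_outputs and pulling_inputs):
        return SubtaskType.CHAINING

    # 4.2.2 Slave: passive input and no material output, passive output and no
    #              material input, or an information input with no material ports.
    if (passive_inputs and not has_mat_out) or \
       (passive_outputs and not has_mat_in) or \
       (info_inputs and not has_mat_in and not has_mat_out):
        return SubtaskType.SLAVE

    # 4.2.1 Master: everything else (no ports at all, only pulling inputs,
    #               only pushing outputs, both, or only information outputs).
    return SubtaskType.MASTER


# The FillA subtask of task React in Fig. 3 pulls material from PrepA.Empty:
assert classify_subtask(pulling_inputs=1, passive_inputs=0,
                        pushing_outputs=0, passive_outputs=0) is SubtaskType.MASTER
```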

5. Workflow as information aggregator

The graphical model of a workflow provides a well-defined structure and reference for the associated information. The information is typically specified as pairs of keywords and values. While the graphical model highlights the relationships between the various building blocks and the control of their execution as described above, the parameter values are meaningful to a specific application. This feature can be used in a variety of ways to store information efficiently. For example, the parameters that are common to all applications could be stored as one set, while application dependent parameters can be stored as separate sets, all sets using the same workflow model. The same applies to storing the results or outputs of an application. For example, suppose the workflow in Fig. 3 is to be used for comparing the performance of two different manufacturing sites, 1 and 2, using a simulation tool. The main difference between the two sites may just be the pieces of equipment suitable for each task, all other information being identical for both sites. Then the information can be structured such that one set of data contains all the information that is identical for both sites, one set describes the equipment at site 1 and one describes the equipment at site 2. To run a simulation model for site 1, the application will use the common information and the set of data for site 1, and so on. The workflow may be used to store information used for manufacturing as a different set altogether.

In the case of an experimental workflow application, the data nodes of the workflow provide the structure for recording and accessing experimental parameters and results. By way of example, consider an experimental workflow for the execution of a series of continuous blending experiments with a set of different APIs and excipients. The apparatus consists of several feeders, a continuous blender and a NIR instrument to measure API content. The study seeks to determine the impact of component material properties, component feed ratios and blender design and operating parameters (e.g., type of blender, impeller angle set, blender rpm) on the root mean squared deviation of the API composition of the blend that is produced. The workflow for the experiment is shown in Fig. 5. (The legend of workflow symbols used in this figure is given in Appendix A.) Note that there are 6 data nodes and up to 4 input material streams, each with potentially different material properties. To record an experiment, the Enter Data link shown in Fig. 5(a) is opened. This link provides access to forms for entering the parameter values specific to that experiment. The following five groups of parameters are associated with each experiment: General Information, Input Materials, Feeder, NIR and Mixer. In addition, the time series data associated with each feeder and the raw NIR data extracted at the end of each experiment are uploaded into the database. The extracted data can be stored in various forms, such as Excel files. Thus, each experiment instance provides access to the input parameters as well as the results through a relational database. The form to access data associated with experiments is shown in Fig. 5(b). There are links to specific tables in the database or a link to a general information table, which in turn has links to the other tables. The General Information data table is shown in Fig. 6. There is one row for each experiment in this table. The first five columns show the general information entered by the user.
The next five columns are links to the associated tables. A Yes in the column indicates that data exists in the associated table for that category for that experiment. The Load Cell table stores links to the Excel files, which contain the time series data for each feeder used during the experiment. Each time series data file contains two columns, Time and Current Flowrate in kg/h. When a link is opened, the user is given the option to browse or download the Excel file.


Fig. 5. Data entry and viewing for the blending experiments.

Once the campaign of experiments is completed, the data tables become a knowledge source, which can be consulted, for instance, to select operating conditions for blends similar to those for which data is already available, by a simple search on the key constituents. Likewise, the data set may serve as input to a scientific workflow involving the fitting of a correlation for predicting RSD from key blend constituent properties and RPM. In the case of a workflow developed for purposes of representing a manufacturing process, such as that shown in Fig. 3, the workflow provides the common structure for retaining all of the operating variables, supporting materials data and preparation steps as well as the interface to a DCS system. Specific data nodes in the workflow can serve to provide the links to specific time windows in the data historian to facilitate access to such information. Although the DCS system historian, along with the chemometric tools linked to the DCS system, does serve as an archive for the raw data, which may satisfy regulatory requirements, these resources do not provide the integrated context for capturing all of the manufacturing information describing the run.

From the perspective of data management, the workflow thus defines not only the procedure followed but also all of the associated input, process variable and output information generated, in a manner that both assures that all required data is recorded and minimizes the duplication of information. A workflow acts as a common lens through which the data generated can be viewed and understood in the context in which it was generated. It is this context that is essential for building understanding and developing deeper knowledge about the activities that the workflows represent.
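To indicate how such workflow-structured records might be queried once a campaign is complete, the sketch below uses Python's built-in sqlite3 module with an illustrative schema; the table and column names are assumptions made for this example (the actual KMS described in Section 8 uses MySQL under HUBzero).

```python
import sqlite3

# Illustrative schema: one row per experiment instance of the blending workflow,
# plus a child table pointing to the per-feeder time series files.
conn = sqlite3.connect("blending_experiments.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS general_information (
    expt_id     TEXT PRIMARY KEY,
    operator    TEXT,
    expt_date   TEXT,
    api_name    TEXT,
    blender_rpm REAL,
    rsd         REAL            -- root mean squared deviation of API composition
);
CREATE TABLE IF NOT EXISTS load_cell (
    expt_id   TEXT REFERENCES general_information(expt_id),
    feeder    TEXT,
    file_path TEXT              -- Excel file with time vs. flow rate (kg/h)
);
""")

# 'Simple search on the key constituents': find prior runs with the same API
# in a given RPM window, to suggest operating conditions for a similar blend.
rows = conn.execute(
    """
    SELECT expt_id, blender_rpm, rsd
    FROM general_information
    WHERE api_name = ? AND blender_rpm BETWEEN ? AND ?
    ORDER BY rsd ASC
    """,
    ("acetaminophen", 150, 250),
).fetchall()

for expt_id, rpm, rsd in rows:
    print(f"{expt_id}: RPM={rpm}, RSD={rsd}")

conn.close()
```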

Fig. 6. Example of the main table having a row for each experiment.

6. Workflow execution

The end uses of a workflow based system may cover a wide range of applications, such as experimental data management, information archival and retrieval for data mining, creating inputs for commercial simulators, and operator interfaces in plant operation. The software functionalities necessary for executing a workflow can be broadly divided into two sets. One set of functionalities, the core execution engine, implements the steps in a workflow according to the graphical model constructed by the user. The second set of functionalities implements the actions associated with the tasks and subtasks. The core execution engine is independent of the application for which a workflow is used, while the second set of functionalities is tied to the application for which a workflow is being used.

6.1. Core execution engine

The execution of a workflow typically involves one pass, or one cycle, through all the activities modeled in that workflow. During a cycle, a workflow starts from an initial state and transitions through all or some of the finite states possible for each of its building blocks. The trajectory of the finite states may differ from cycle to cycle and is governed by several factors, such as the values of various parameters, the end application, the decision logic in the execution engine, and user input at specific points in the workflow. The core execution engine (CEE) implements the logic built into or implied by the workflow diagram and executes one cycle of the workflow. The engine expects the user to initiate a task when prompted and waits until the user manually triggers the Start Task event. It also prompts the user to manually trigger events at the subtask level; the main subtask events are Start Subtask and End Subtask on each master subtask. Thus, the workflow execution progresses based on the events triggered by the user.

The execution logic of the CEE is briefly summarized as follows. When the workflow execution starts, the status of all independent tasks is changed to 'Active'. When a resource is assigned to a task, determined by a user-triggered action, the task is advanced to its first subtask. At the beginning of workflow execution all subtasks are in the default state. During its execution, a subtask can be in any of five states: default, waiting, ready, active, completed. The specific subtask level actions which are undertaken depend on the nature of the subtask (master, slave, chaining, decoupling) and the state of its up- or downstream subtasks. The logic of proceeding from the initial default state to the ready state consists of five cases:

1. If a subtask is a master subtask and has a material input, a material output and/or an output signal, it waits for all the interacting subtasks to advance into 'ready' status, setting its own status to waiting. If a master subtask has no interaction with any other subtask, it sets its own status to 'ready'.
2. If a subtask is a slave subtask, it immediately advances into the ready state.
3. If a subtask is a chaining subtask and pushes material down through its output, then if its downstream subtask is not ready, it sets its status to waiting. If its downstream subtask is ready, it sets its status to ready and informs the upstream subtask that it is ready.
4. If a chaining subtask pulls material through its input and its upstream subtask is not ready, it sets its status to waiting. If its upstream subtask is ready, it sets its status to ready and informs the downstream subtask that it is ready.
5. If a subtask is a decoupling subtask, it sets its status to ready and informs the subtasks it interacts with.

The logic associated with moving subtasks from the ready to the active or completed state is covered by the following four cases, depending on the nature of the subtask:

1. For a master subtask, when all the subtasks upstream and downstream of it have advanced to ready status, the master subtask advances to ready status. A user-triggered event is then required to advance the master subtask from ready to active status. A subsequent event ends a master subtask, advancing it to the completed status and moving the task to the next subtask. In addition, when a master subtask is completed, it informs all its immediate subtask neighbors.
2. A slave subtask changes its status to complete when the subtask it interacts with is completed, and advances to the next subtask in its task.
3. A chaining subtask changes its status to complete, informs the other subtasks it interacts with, and advances to the next subtask in its task.
4. In the case of a decoupling subtask, when an upstream or downstream subtask of the decoupling subtask is completed and none of its other neighbors are waiting or active, it sets its state to ready. The user may change the status of a decoupling subtask to complete when all the upstream interactions are completed.

The CEE displays the current status of all tasks and subtasks when a workflow is in progress. For example, a snapshot of the workflow for product P1 is shown in Fig. 7. As indicated by the legend, all subtasks of task PrepA have been completed, subtask Heat of task React is currently being executed, and the subtasks and tasks in their default states have not yet been executed.

Fig. 7. Snapshot of task and subtask statuses for a workflow in progress.

Schematics of the workflow status diagram at two other times are shown in Fig. 8. In Fig. 8(a), the React subtask is in progress and data are being collected by the TXData node. In Fig. 8(b) the Empty subtask is in progress. Because they form a semicontinuous chain with the Empty subtask, subtasks Filter and FillAB are also in progress at the same time. The Filter subtask is the first subtask of the Filter task. Since the Empty subtask pushes material downstream (it is a master subtask), when the React task advances to the Empty subtask, the CEE creates a request to initiate the Filter task and changes its status to 'Active'. After the task is started by a user event, it advances to its first subtask, namely Filter. Since Filter is a chaining subtask, the CEE checks if its outputs are satisfied. One of its outputs goes to sink P1, and therefore it is satisfied. However, one output is to subtask FillAB, which is the first subtask of task Store. Again, the core execution engine creates a request to initiate the Store task and changes its status to 'Active'. After that task is started by a user event, it advances to the FillAB subtask. That is when the semicontinuous chain is initiated. When the Empty subtask is ended by a user-triggered event, the Filter and FillAB subtasks are ended. Since they all are the last subtasks of their corresponding tasks, the associated tasks are completed when these subtasks end. That marks the completion of one cycle of the entire workflow.

Fig. 8. Snapshots when the React and Empty subtasks are in progress.
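A minimal sketch of the state progression that the CEE enforces for a push-type semicontinuous chain is given below (Python). The class, method and event names are our own, the user-triggered Start Subtask and End Subtask events are reduced to direct method calls, and entering the chain downstream-first stands in for the engine's re-checking of waiting subtasks; the full engine also handles resource assignment, pull chains and decoupling buffers.

```python
from enum import Enum


class State(Enum):
    DEFAULT = "default"
    WAITING = "waiting"
    READY = "ready"
    ACTIVE = "active"
    COMPLETED = "completed"


class SubtaskFSM:
    """One subtask's slice of the CEE logic for a chain in which material is
    pushed downstream (master -> chaining -> ... -> slave)."""

    def __init__(self, name, subtask_type, downstream=None):
        self.name = name
        self.type = subtask_type      # "master", "slave", "chaining", "decoupling"
        self.downstream = downstream  # next subtask in the material chain, if any
        self.state = State.DEFAULT

    def enter(self):
        """Default -> waiting/ready, following cases 1-5 of the CEE logic."""
        if self.type in ("slave", "decoupling"):
            self.state = State.READY                 # cases 2 and 5
        elif self.downstream is None:
            self.state = State.READY                 # master with no interactions
        elif self.downstream.state is State.READY:
            self.state = State.READY                 # cases 1 and 3
        else:
            self.state = State.WAITING               # wait and re-check later

    def start(self):
        """User-triggered 'Start Subtask' on the master activates the chain."""
        node = self
        while node is not None and node.state is State.READY:
            node.state = State.ACTIVE
            node = node.downstream

    def end(self):
        """User-triggered 'End Subtask' on the master completes the chain."""
        node = self
        while node is not None and node.state is State.ACTIVE:
            node.state = State.COMPLETED
            node = node.downstream


# Empty (master) -> Filter (chaining) -> FillAB (slave): the semicontinuous
# chain of Fig. 3. Entering downstream-first mirrors the readiness checks.
fill_ab = SubtaskFSM("FillAB", "slave")
filt = SubtaskFSM("Filter", "chaining", downstream=fill_ab)
empty = SubtaskFSM("Empty", "master", downstream=filt)

for s in (fill_ab, filt, empty):
    s.enter()
empty.start()    # the whole chain becomes active together
empty.end()      # ending the master ends the chain
print([s.state.value for s in (empty, filt, fill_ab)])
# -> ['completed', 'completed', 'completed']
```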

Fig. 9. Example of core execution and application engine interactions.

6.2. Application engine

While the overall execution of a workflow is controlled by the core execution engine, an application engine implements the actions related to the end use. As an example, suppose the workflow shown in Fig. 3 is being used for running a process, and the associated application provides values of key operating parameters and supplementary information to the operator for all processing steps. The interplay between the CEE and the application engine that performs this function is shown in Fig. 9. When a subtask advances to the 'Active' status, the actions defined by a certain group of user specified parameters of that subtask are implemented. The CEE invokes the application engine for that subtask. The application engine in turn may access any information with reference to the subtask passed by the CEE. Suppose the process advances into the Heat subtask of the React task in the workflow shown in Fig. 3. The CEE updates the status and displays it using a color code on the operator console. The current status of all subtasks is stored in the data repository and can be accessed by any application through structured queries. When the operator clicks on the Heat subtask to get the instructions on how to operate it, the application engine retrieves the appropriate information from the repository and creates a window as shown on the right hand side of Fig. 9. This window contains the key process parameters, any special instructions such as a checklist or precautions, and a pair of buttons to mark the start and end of the subtask. Of course, these buttons are triggered by the operator. In this particular case, the application simply provides the information to the operator and expects the operator to interpret it. Alternatively, a more sophisticated application may interpret these parameters as set points to be passed to controllers on the associated equipment. The buttons invoke the CEE and execute the corresponding actions. As described earlier, the CEE manages the current state of the associated workflow instance. Until the operator clicks the 'End Subtask' button, the React task stays active in that subtask. It should be noted that the application engines do not directly interact with the CEE, but instead use the current status of the various elements in the workflow to influence the actions. The workflow provides the references to all data in the repository.
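The division of labor just described can be summarized in a short sketch (Python; the function name, parameter store and parameter values are all hypothetical): the CEE owns the state transitions and merely hands the identity of the active subtask to the application engine, which looks up that subtask's parameter set and presents it to the operator.

```python
# Hypothetical parameter store keyed by (workflow, task, subtask); in the KMS
# this information lives in the relational database behind the workflow template.
SUBTASK_PARAMETERS = {
    ("WorkflowP1", "React", "Heat"): {
        "set point temperature": "80 degC (illustrative value)",
        "heating medium": "steam, 2 bar (illustrative value)",
        "instructions": "Verify agitator is running before opening steam valve.",
    },
}


def application_engine(workflow: str, task: str, subtask: str) -> None:
    """Invoked by the CEE when a subtask becomes 'Active'; presents the key
    operating parameters and special instructions for that subtask."""
    params = SUBTASK_PARAMETERS.get((workflow, task, subtask), {})
    print(f"--- {workflow} / {task} / {subtask} ---")
    for key, value in params.items():
        print(f"{key}: {value}")
    # The 'Start Subtask' / 'End Subtask' buttons shown alongside this window
    # call back into the CEE, which records the corresponding events.


application_engine("WorkflowP1", "React", "Heat")
```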

Fig. 10. Workflow for a project to determine optimum T and RPM.

7. Workflow hierarchy

A workflow allows the capture of all the details, that is, the entire procedure, associated with creating any specific data entry. The knowledge of these details is often as important as the resulting data. Data creation is almost always done with a purpose, and each data creation activity is part of an overarching project. Typically, a project is driven by well-defined objectives and carried out according to a plan. At the very least, a typical project consists of three sets of activities: project roadmap development, generation of data, and data analysis. Each activity can be modeled as a workflow, and the relationship between the activities can be shown as a hierarchy of workflows. For example, suppose the workflow shown in Fig. 3 models the operation of a pilot plant, belonging to a process development organization, used to make product P1. Furthermore, suppose that a process development group would like to use the pilot plant to identify empirically the best temperature and stirrer RPM combination to maximize the yield of P1. Broadly speaking, such a project would start with identifying the main tasks to be accomplished, as shown by the workflow in Fig. 10. The first task is to set up a design of experiments for the independent variables, two in this case, temperature and stirrer RPM. The next task is to perform the runs on the pilot plant. The last task is to analyze the results from the pilot plant runs.

Suppose that the DOE activity is performed by a specific support group within the company. In that case, the DOE activity can be modeled as a separate workflow executed by that group. The workflow shown in Fig. 11 is an example of a scientific workflow for full-factorial design of experiments, where the DOE uses the independent variables/factors and the levels per factor as input, implements the program during subtask DOE and creates the output file RunConds. The output is a table, where each row has the values of the independent variables for that particular production run.

Fig. 11. Workflow for full factorial design of experiments.

Fig. 12. Workflow for finding the optimum temperature and RPM.

The FullFactorialDesign workflow is invoked by the DoDOE subtask in Fig. 10. Upon completion of the FullFactorialDesign workflow, the DoDOE task in Fig. 10 advances to the StartExpts subtask, which in turn triggers the DoRuns subtask. The run conditions generated by the experimental design constitute another input to the DoRuns subtask. The DoRuns subtask implements certain types of actions. For each row in the RunConds file, it invokes the workflow shown in Fig. 3 for the manufacture of P1. Only after all the runs are completed will the DoRuns subtask be completed and the task progression advanced to the StartAnalysis subtask, which in turn will trigger the Optimize task.

Suppose the optimization activity is also performed by a specific support group within the company. In that case the optimization activity is modeled as a separate workflow executed by that group. The workflow shown in Fig. 12 is an example of the steps performed for finding the optimum. For example, suppose the MATLAB package is used for determining the optimum. As a first step, an M-file must be created to perform the calculations. The M-file will include code to extract and preprocess data, invoke the suitable optimization function of MATLAB, and write the optimum values in a predefined format. Once the M-file is created, it is executed in the RunMatlab subtask. In the ExtractData data node, the user identifies the specific data that is to be extracted from the pilot plant runs.

Thus the entire project can be fully scoped out prior to its initiation, with all the individual steps, their relationships and precedence order, the resource requirements, the metadata of all information created and consumed, and so on completely outlined. The complete ensemble of workflows is shown in Fig. 13. Fully defining a project in this fashion has several benefits during project implementation. The CEE and the application engines provide the up-to-date status of the tasks and subtasks associated with all workflows. Therefore, the person in charge of the overall project as well as the persons responsible for individual tasks can monitor the overall progress very readily. Additionally, since the path and structure of all data created are fully defined, data can be easily shared, retrieved, analyzed and validated at any point in time.
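The DOE subtask of the FullFactorialDesign workflow amounts to enumerating all combinations of the factor levels, and the DoRuns subtask then invokes the P1 workflow once per row of RunConds. A compact sketch of these two steps is given below (Python); the factor levels and the run_workflow_p1 stub are illustrative assumptions.

```python
from itertools import product


def full_factorial(factors: dict) -> list:
    """DOE subtask of Fig. 11: one run per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def run_workflow_p1(conditions: dict) -> None:
    """Stand-in for one cycle of the WorkflowP1 recipe of Fig. 3 executed by
    the DoRuns subtask; in the actual system the CEE drives the pilot plant run."""
    print(f"Running pilot plant with T={conditions['T_degC']} degC, "
          f"RPM={conditions['RPM']}")


# Illustrative levels for the two independent variables of the project in Fig. 10.
run_conds = full_factorial({"T_degC": [60, 70, 80], "RPM": [100, 200, 300]})

for row in run_conds:           # DoRuns: one P1 manufacturing cycle per row
    run_workflow_p1(row)
```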

Fig. 13. Ensemble of workflows associated with the project.

8. Workflow implementation

A knowledge management system (KMS), which uses workflows to model all information generating processes, has been developed at Purdue University. The three key components of the KMS are shown in Fig. 14. The KMS is implemented on the HUBzero® (McLennan and Kennell, 2010) middleware developed by Information Technology at Purdue (ITaP). HUBzero® is an open source software platform for building Web sites that support scientific discovery, learning, and collaboration, and is based on the Joomla (Joomla, 2013) content management system. It also provides the MySQL server for the relational database functions and the Apache web server. One of the attractive features of the HUBzero implementation is that the KMS applications are web-accessible, with suitable security controls to enable all members of the team involved in the execution of any specific workflow to browse, search, use and modify the workflow itself or any of its elements based on a variety of levels of authorization. In implementing the KMS, the server side scripting was done using PHP and the graphics are rendered using SVG. The workflow builder is an active graphics based tool that allows a user to build the graphical model template of a workflow and to define the parameters associated with each workflow component. The web interface allows a user to create instances of templates for creating new information. It has functionality to access existing information and view it through the associated template, or to extract information as structured data for use in further processing. It can also create x-y plots for pairwise data extracted from the data repository. Thus, the information recorded in the KMS is fully machine accessible, with all relationships defined by the graphical model of the associated workflow. The information can be utilized for a wide variety of end uses, such as creating reports, accessing data for analysis, developing workflows for new applications and evolving existing workflows to broaden their scope.

Fig. 14. Components of the workflow based knowledge management system.

9. Conclusions

A typical knowledge repository holds a wide range of interrelated information. Associated with every information item is a process that provides an explicit model of the various activities and relationships associated with its creation, namely, its provenance. The provenance is as important as the information itself because it facilitates the full understanding, sharing and use of the information. A workflow based framework has been developed to model the information creation processes. The graphical model of a workflow proposed here is very intuitive and easy to understand and share. Additionally, the set of constructs proposed in this paper for modeling the material and information flows represents graphically the execution controls required for implementing the associated process. The data nodes provide direct links to information stored in the repository, as well as to the metadata for the associated information, while the workflow fully defines the context for the information.

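The pattern of structured access described above can be sketched with a small, self-contained example. The actual KMS stores instance data in MySQL and serves it through PHP with SVG graphics; the table layout, column names and the use of SQLite and matplotlib below are hypothetical stand-ins intended only to show how pairwise instance data tied to workflow cases might be extracted and plotted.

```python
# Hypothetical sketch of extracting pairwise instance data recorded against
# workflow data nodes and plotting it. The schema and names are illustrative;
# they are not the KMS repository layout.
import sqlite3

import matplotlib.pyplot as plt

# Stand-in repository: one row per value produced by a data node in a
# completed workflow case.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE instance_data (
           workflow_case TEXT, data_node TEXT, value REAL)"""
)
conn.executemany(
    "INSERT INTO instance_data VALUES (?, ?, ?)",
    [("run1", "Temperature", 60.0), ("run1", "Yield", 0.71),
     ("run2", "Temperature", 70.0), ("run2", "Yield", 0.78),
     ("run3", "Temperature", 80.0), ("run3", "Yield", 0.74)],
)

# Extract an x-y pair of variables across all completed cases by joining the
# repository on the workflow case identifier.
rows = conn.execute(
    """SELECT x.value, y.value
         FROM instance_data AS x JOIN instance_data AS y
           ON x.workflow_case = y.workflow_case
        WHERE x.data_node = 'Temperature' AND y.data_node = 'Yield'
        ORDER BY x.value"""
).fetchall()

xs, ys = zip(*rows)
plt.plot(xs, ys, marker="o")
plt.xlabel("Temperature")
plt.ylabel("Yield")
plt.title("Pairwise data extracted from the repository")
plt.show()
```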

9. Conclusions

A typical knowledge repository holds a wide range of interrelated information. Associated with every information item is a process that provides an explicit model of the various activities and relationships associated with its creation, namely, its provenance. The provenance is as important as the information itself because it facilitates the full understanding, sharing and use of the information. A workflow based framework has been developed to model these information creation processes. The graphical model of a workflow proposed here is very intuitive and easy to understand and share. Additionally, the set of constructs proposed in this paper for modeling the material and information flows represents graphically the execution controls required for implementing the associated process. The data nodes provide direct links to the information stored in the repository, as well as to the metadata for the associated information, while the workflow fully defines the context for the information. The explicit relationships defined by the workflow are machine interpretable, and as such can be used for structured access to every information item, thereby facilitating querying, extraction and analysis of the data in the associated data repository. By organizing any decision making process as a collection of interrelated subprocesses, a hierarchy of workflows can be developed to model these interconnected subprocesses. A workflow based knowledge management system has tremendous potential for providing a uniform structure for linking disparate data, as well as for encompassing multiple domains of discipline and authority.

Acknowledgments

The authors gratefully acknowledge the support of the US National Science Foundation through the Engineering Research Center for Structured Organic Particulate Systems under grant EEC-0540855. Support for the HUB implementation of the workflow builder was provided under NSF grant CBET-0941302. The contributions of ITaP staff members Michael McLennan, Michael Zentner and George Howlett were important to the implementation and are very much appreciated.

Appendix A. Workflow symbols

Graphical symbols are defined for the following workflow elements and connections:

Workflow
Task
Subtask
Material source
Material sink
Data node
Material flow
Information flow
Pulling material input on a subtask
Passive material input on a subtask
Pushing material output on a subtask
Passive material output on a subtask
Information input on a subtask or a data node
Information output on a subtask or a data node

References

Batches Users Manual. West Lafayette, IN, USA: Batch Process Technologies; 2003.
Batch Process Developer. Aspen Technology, Inc.; 2013. http://www.aspentech.com/products/aspen-batch-plus.aspx [accessed 01.03.14].
Börger E. Approaches to modeling business processes: a critical analysis of BPMN, workflow patterns and YAWL. Softw Syst Model 2012;11:305–18.
Davis J, Edgar TF, Porter J, Bernaden J, Sarli M. Smart manufacturing, manufacturing intelligence and demand-dynamic performance. Comput Chem Eng 2012;47:145–56.
Deelman E, Singh G, Su M, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A, Jacob JC, Katz DS. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci Program 2005;13:219–37.
ExtendSim 7 User Guide. San Jose, CA: Imagine That Inc.
FAA Human Factor Tools, Mathematical Modeling. http://www.hf.faa.gov/workbenchtools/default.aspx?rPage=Tooldetails&subCatId=31&toolID=200; 2013 [accessed 01.04.14].
gPROMS. www.psenterprise.com/gproms.htm [accessed 01.04.14].
International Society for Measurement and Control. Batch control. Part 1: Models and terminology. International Society for Measurement and Control; 1995.
Joomla content management system. www.joomla.org; 2013 [accessed 01.04.14].
Junker B, Maheshwari G, Ranheim T, Altaras N, Stankevicz M, Harmon L, Rios S, D'Anjou M. Design-for-six-sigma to develop a bioprocess knowledge management framework. PDA J Pharm Sci Technol 2011;65:140–65.
Kepler web site. www.kepler-project.org; 2013 [accessed 01.04.14].
Laínez JM, McLennan M, Mockus L, Reklaitis GV. Linking simulation tools using the PharmaHUB workflow management functionality. In: AIChE Annual Meeting; 2011.
McLennan M, Kennell R. HUBzero: a platform for dissemination and collaboration in computational science and engineering. Comput Sci Eng 2010;12(2):48–52.
Moreau L, Clifford B, Freire J, Futrelle J, Gil Y, Groth P, Kwasnikowska N, Miles S, Missier P, Myers J, Plale B, Simmhan Y, Stephan E, Van den Bussche J. The Open Provenance Model core specification (v1.1). Future Gener Comput Syst 2011;27(6):743–56.
Oinn T, Greenwood M, Addis M, Alpdemir MN, Ferris J, Glover K, Goble C, Goderis A, Hull D, Marvin D, Li P, Lord P, Pocock MR, Senger M, Stevens R, Wipat A, Wroe C. Taverna: lessons in creating a workflow environment for the life sciences. Concurr Comput Pract Exp 2006;18(10):1067–100.
Schwartz DG, editor. Encyclopedia of knowledge management. Idea Group Inc (IGI); 2006.
Senthil MB, Santhosh KV. A survey of workflow management tools for Grid platform. Adv Inform Technol Manage 2012;1(1):1–3.
SMLC Workshop web site. https://smartmanufacturingcoalition.org/readingmaterials/presentations-workshop-materials; 2013 [accessed 01.03.14].
Westerberg AW, Hutchison HP, Motard RL, Winter P. Process flowsheeting. Cambridge University Press; 1979.
