Information Systems Frontiers 1:3, 279±301 (1999) # 2000 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Enterprise Data Quality: A Pragmatic Approach Amjad Umar, George Karabatis, Linda Ness and Bruce Horowitz Telcordia Technologies, 445 South Street, Morristown, NJ, 07960-6439 E-mail:
[email protected]
Ahmed Elmagardmid* Department of Computer Sciences, Purdue University, West Lafayette, IN 47907
Abstract. Enterprise dataÐthe data that is created, used and shared by a corporation in conducting businessÐis a critical business resource that must be analyzed, architected and managed with data quality as a guiding principle. This paper presents results, practical insights, and lessons learned from a large scale study conducted in the telecommunications industry that synthesizes data quality issues into an architectural and management approach. We describe the real life case study and show how requirements for data quality were collected, how the data quality metrics were de®ned, what guidelines were established for intersystem data ¯ows, what COTS (commercial off-the-shelf ) technologies were used, and what results were obtained through a prototype effort. As a result of experience gained and lessons learned, we propose a comprehensive data quality approach that combines data quality and data architectures into a single framework with a series of steps, procedures, checklists, and tools. Our approach takes into account the technology, process, and people issues and extends the extant literature on data quality. Key Words. data quality, data quality metrics, data quality methodology, process ¯ow through, distributed systems
1.
Introduction
Data quality has emerged as a major issue recently due to its potential severe impact on organizational effectiveness. For example, a leading computer industry information service indicated that it expects most business process reengineering initiatives to fail due to a lack of data quality (Wand, 1996). In addition, Redman (1998) shows how poor data quality results in the operational impacts (e.g., lowered customer satisfaction, increased cost, employee dissatisfaction),
tactical impacts (e.g., poorer decision making, more dif®culties in building data warehouses, more dif®culties in reengineering, increased organizational mistrust), and strategic impacts (dif®culties in setting and executing strategies, diversion of management attention). In some cases, in particular the ®nancial statements, the quality of data must be certi®ably free of certain types of errors (Kaplan, 1998). As other examples, wrong price data in retail databases may cost American industry as much as $2.5 billion in overcharges annually (English, 1996) and dirty data is causing major problems with data warehouses (i.e., many users are retrieving wrong information from their data warehouses) (Cellco, 1995). Although data quality is gaining some recognition, views about data quality and approaches to improve data quality differ widely (English, 1996; Wang, 1998; Orr, 1998; Redman, 1992). The fundamental problem with the extant concepts, methodologies, and case studies is that a linkage is not made with the existing day-to-day activities. For example, state of the practice in data architectures typically does not include data quality issuesÐthus most data quality considerations are left out of the regular day-to-day corporate practices. There is a need to extend the data architecture work by explicitly including the following two data quality considerations: * *
How to measure data quality (i.e., metrics). How to improve data quality.
*Participated in project while at Telcordia Technologies.
279
280
Umar, Karabatis and Ness
This paper presents an overall strategy that synthesizes data quality issues into an architectural vision that is supported by a series of steps, procedures, checklists, and tools. This strategy attempts to bridge multiple areas by ®rst de®ning key concepts and outlining a framework for discussion (Section 2). In Section 3, we describe an extensive data quality case study that consists of several interrelated activities: (a) problem analysis that involved a series of workshops with project collaborators from participating companies and identi®cation/prioritization of key problems; (b) design for the prevention and amelioration of poor data quality and inconsistency based on data ¯owthrough and data inconsistency/data quality metrics; (c) assessment of COTS (commercial offthe-shelf ) and near-COTS technology that ranged from data cleanup tools to work¯ows and multidatabase transactions/multisystem integration technologies; and (d) demonstration of the impact of selected technology on real-life large scale data quality problems. Section 4 proposes, based on our experience gained and lessons learned. an extensive approach that combines data quality and data architectures into a single conceptual framework. In particular, our experience has shown that the proposed strategy is quite effective because it takes into account the technology, process, and people issues and can be used to design data quality as an integral part of enterprise data. The focus of this paper is on pragmatic issues and practical insights gained through involvement in large scale data quality problems.
of ``data quality can best be de®ned as ®tness for use'' resonates with this de®nition. However, it needs further re®nement and elaboration. For example, Wand and Ward (1996) show 26 dimensions of data quality, Wang (1998) de®nes quality in terms of information product (i.e., it re¯ects the quality of the characteristics and processes that manufacture the information product), Orr (1998) contends that the use of data must be increased to improve data quality, and Larry English (1996) suggests that data quality should be considered in terms of data de®nition, data content, and data presentation. We focus on the following quality attributes of data, initially proposed by Ballou and Pazer (1985) since they include other views: *
*
* *
These attributes are related to the data content and provide the core data quality metrics. Additional important data quality attributes are: *
*
*
2.
Data Quality Concepts and Approaches
There are different views and de®nitions of data quality (see, for example, English, 1996; Wand, 1996; Moriaritty, 1996). We will use the following operational de®nition of data quality (Redman, 1992): A product, service, or datum X is of higher quality than product, service, or datum Y if X meets customer needs better than Y. This de®nition appears to be generally accepted (i.e., the quality must be de®ned in terms of customer satisfaction) and generally conforms to other de®nitions. For example, Tayi and Ballou's (1998) statement
Accuracy: re¯ects correctness with respect to real life. Consistency: two or more data items do not con¯ict with each other. Currency: how recent is the information. Completeness: degree to which values are present in a data collection.
De®nition: re¯ects that data must be clearly and unambiguously de®ned. Access: shows the ease with which the users can access the data. Presentation: re¯ects the style with which the data is presented.
Data quality needs to be improved for different reasons: *
*
* *
Economic reasons (i.e., how can I reduce the cost of service). Service quality reasons (i.e., how happy are my users). Process ef®ciency (i.e., maximizing throughput). Platform ef®ciency (i.e., maximizing investment in hardware/software).
To improve the quality of data, several assessment and improvement activities need to take place throughout its life cycle. The approaches to obtain data quality fall into the following broad categories:
Enterprise Data Quality * *
Data cleanup. Process cleanup.
Data cleanup involves use of a tool to identify ``bad data'' (i.e., not accurate/consistent/current/complete) and then elimination of bad data through automated and/or manual processes. A wide variety of ``data scrubbers'' are commercially available to perform this task. The main limitation of this approach is that some data cannot be easily veri®ed to be correct. For example, no edit can verify someone's home address. In addition, data cleanup needs to be a periodic effort that must be repeated several times over the life cycle of data. Process cleanup goes beyond data cleanup and concentrates on the activities that tend to pollute clean data. The main activities involved in process cleanup are: * * *
Establish quality metrics. Monitor the data life cycle for quality pollution. Use statistical quality control and process management to maintain desired data quality.
For total quality management, both approaches (data cleanup and process cleanup) are needed. Large and complex systems, as shown in Fig. 1, involve many components (e.g., the data, the application software that operates on the data, the underlying platforms on which the data resides and that are used to access the data, and the overall process for using and managing the system). Different players (the users of the system, the managers, the business customers, the developers, etc.) view these components at different levels of detail. Conceptually, different data quality views can be expressed in terms of each component to indicate its behavior and to improve its quality. Examples of the views, from a data quality perspective, are: *
*
*
*
Quality of the data itself (e.g., accuracy, currency, consistency, completeness). Quality of the application software that operates on the data (e.g., software bugs). Quality of the platform (e.g., the performance of the access mechanisms). Quality of the management and operational processes (e.g., manual interventions, errors, delays, ¯owthroughs, user satisfaction level).
Fig. 1 shows the main components and the key data
281
quality metrics associated with each component. The model categorizes these components in terms of processes, people and technologies because each of these categories need to be measured and improved. We also show conceptual inter-relationships between different levels of metrics. We use this model for our view of data quality. This view does include several existing views such as English (1996), Moriarity (1996), Orr (1998), Redman (1992), Redman (1998), Wand (1996) and Wang (1998).
3. Case Study: Data Quality in Telecommunications This section describes a case study of how a family of data quality problems in the telecommunications industry were addressed through a systematic approach. The case study is based on a project that was funded by the Regional Bell Operating Companies (RBOCs) to address the issues of data reconciliation speci®cally and data quality at large. Fig. 2 presents the project activities, discussed below, as a work¯ow diagram. 3.1. Problem identi®cation workshops As a starting point, we conducted a series of one-day workshops with project collaborators from participating companies to identify/prioritize key problems. The workshops were attended by dozen to three dozen staff members that included data architects, database administrators, managers of various customer service units, and technology consultants. Each workshop started with an overview of the data quality issues to introduce key concepts and to provide a common framework for discussion. The overview was followed by an open brainstorming session on the data quality problems encountered by the sponsors, a prioritization/classi®cation session to focus on the key problems, and a detailed discussion session to quickly analyze the high priority items. In all, more than 80 issues were raised by the sponsors at these workshops. In order to obtain a meaningful view of the issues, they were categorized into 15 broad areas that were ranked by the customers in terms of business need (see Table 1) The overall priority ranking of a category proportional to the number of issues in that category, summed over all sessions. For example, data quality metrics were of
282
Umar, Karabatis and Ness
Fig. 1. Conceptual model for metrics.
Enterprise Data Quality
283
Fig. 2. Data quality (DQ) project plan.
Table 1. Analysis of data quality problems Broad issues
Priority
Category of data quality problem
*Inconsistency among systems *Process improvement and mechanization *Need for metrics *Flow through between systems *Root causes and data redundancy *System architecture and system evolution Standardization Inconsistency with reality Data input/validation Data access and security Single primary source of data Communications/administrative complexity Ownership and accountability Methodology Recovery from data inconsistencies
H H H H H M M M M M M L L L L
Data Process Process Software Data Software Process Data Data Software Data Process Process Process Software
284
Umar, Karabatis and Ness
major concern to multiple participants in all workshops we conducted and were thus ranked very high. After the workshops, we also attempted to cast the 15 broad issues into various data quality problem categories such as data itself, the application software that operates on the data, the underlying platforms on which the data resides and that are used to access the data, and the overall process for using and managing the system (see Table 1, column 3). During this categorization, we noted that many high priority problems were not related to the quality of the data itself but were instead concerned with the process and platform issues. This was a surprise to us. In addition, many data quality problems discussed in these workshop were directly or indirectly related to location data (i.e., location of customers, telecom equipment, and work centers). This is because the location information plays a central role in the telecom systems. Let us now brie¯y review the broad category of issues that were raised in these workshops.
Inconsistency among systems. Inconsistent information between different systems causes serious problems. The same information does not exist in all required systems, and even when it does, it is represented differently in different systems. A better way is needed to keep replicated data synchronized. The corporate data solution of some clients was examined and it was found that different ferent kinds of addresses from different systems represent the same entity. This causes business problems (25% of the customers in some cases do not receive their bill due to bad address). There is also a low degree of automation in the process. This, coupled with a high degree of repetitive data entry to multiple systems, exposes the process to ubiquitous data reconciliation problems. In particular, it is likely that data entry errors will occur, causing data discrepancies between the systems. It is also likely that some or all updates will not be performed to a particular system, resulting in a system containing incomplete data. The following example illustrates this situation. When a central of®ce (CO) is sold, if that information is not entered promptly, then an automated inventory system will send equipment to the new owner but the old client pays. In addition, old client pays tax on the equipment, as it is still carried in their database. A
single interface to all systems is needed to access location information. Process improvement and mechanization. The processes need to be improved and automated for lasting effects. The common scenario is: a manager comes in, ®xes the databases, gets promoted, leaves, and the problems resurface. The issue is really one of integrating disparate processes. The customer data is critical because it is important to quantify return on any customer projects. The main issues discussed in this area were: *
*
*
Our clients need to manage issues between regulated and unregulated sides of the market. There is a need to mine information in customer databases that will allow our client to offer new products to new customer segments (and to identify those segments). Some clients need a uni®ed customer number. There is a need to know if someone is a customer and how to identify them. Name and address present several problems. Some databases use telephone numbers, while others use circuit IDs. There is a need to be able to ®nd all telephone numbers and circuit IDs that are associated with a customer.
Need for metrics. Metrics are needed to drive behavior to justify investment and detect self-funding opportunities. Monitoring of metrics is an important issue in data reconciliation. For example, one should be able to estimate the ®nancial impact due to data reconciliation. In particular, the following issues were raised: *
*
How to measure the impact of data quality problems? How to measure the improvement of data quality problems?
Metrics must include data reconciliation cost savings (e.g., labor costs, labor/error resolution, overtime labor/retry error resolution), performance (e.g., volume work orders, errors, backlog, average time/error), and operations (volume of orders per center, volume of errors per center, volume of backlog per center, average time of handling an error vs. ¯owthrough).
Enterprise Data Quality
Flowthrough between systems. Current ¯ow-design is a cause of data inconsistencies. One suggested solution was reversion to batch processing. Some clients noted that batch processing is inconsistent with their need to quickly propagate customer data. Root causes and data redundancy. Software evolution can cause data quality to decrease, since there are no procedures for making old data to be consistent with new integrity rules. One root cause of data inconsistencies among provisioning and operations databases is the fact that the processes ®rst developed as isolated islands and were gradually pieced together. Some systems have many duplicated databases. This is a cause of signi®cant data inconsistency. System architecture and system evolution. Redundant data architecture is a root cause of data inconsistencies. A well designed architecture is needed to eliminate data silos (i.e., ``smokestack systems''). Many applications and projects do local optimization, subverting global optimization. Some of the reasons are: * *
*
Unreliable synchronization technology. Time pressures (get things up and running quickly). Lack of cross-application responsibility, data and applications knowledge, and overall data architecture.
There is an economic justi®cation for a better architecture. Databases proliferate because people can't obtain access to needed data; they create a database and the project manager controls it. Architectures are very closely related to ¯owthroughs and ¯ow-design. Standardization. Uniform valid point of entry is needed. For example, some systems accept nonstandard format. In addition, a common format is essential. In some cases, address information is stored in more than 20 different data ®lesÐthe addresses are of various types (e.g., postal, client speci®c, etc.) and obey various rules. A uniform valid point of entry with a single owner is needed. Inconsistency with reality. Many information items are inconsistent with reality. For example, it is
285
dif®cult to ®gure out which equipment belongs to which user. Data input/validation. There is a need to clean incoming data. This includes single enforced input interface and management of downstream changes. Data input validation implies enforcement of standards. Thus, most of standardization issues apply here. Data access and security. A standard is needed for data extracts. There is the need to access legacy systems and a number of other databases to obtain information. People don't know enough details about these systems, and when they ask, they are faced with constant problems. Single primary source of data items. A single repository for data items is needed. In particular, this represents the need for reconciliation of metadata, by having a single system act as a steward. A single repository would provide the de®nitive source of metadata information for all the corporate data in the various systems. Communications/administrative complexity. Software releases and upgrades can impact other systems. There is a need for a repository for proposed changes that must be adequately documented. Ownership and accountability. Reference data should not reside in individual applications. Owner(s) should be identi®ed and propagated as needed from a single source. Data ownership is needed for data elements. Methodology. A methodology is needed for continuous quality improvement. Some of the items mentioned were: how to integrate data quality metrics and process, a forum is needed to discuss data quality issues, need coordination of existing efforts, need to integrate data quality into applications, and improve processes. During the implementation of process among various systems, one client needs to do a mapping between the databases and the data models. They require support for technical and implementation issues and performance, and they need to quantify
286
Umar, Karabatis and Ness
the economics of the applications. A methodology is needed for continuous quality improvement.
* * *
Recovery from data inconsistencies caused by software bugs. Software bugs cause data problems that sometimes never get cleaned up. No procedures for data cleanup (i.e., recovery) are included with software patches.
*
d. Application Software Metrics * *
3.2. Data quality metrics After the problem identi®cation, we decided to quickly focus on data quality metrics since it was rated the highest in the workshop. As shown in Fig. 1, and discussed previously, software systems involve many components (e.g., the data, the application software that operates on the data, the underlying platforms on which the data resides, and the overall process for using and managing the system). Conceptually, a large number of metrics can be kept for each component to measure its behavior and to monitor its improvement. Examples of the metrics, from a data quality point of view, are: *
*
*
Metrics of the data itself (e.g., correctness, currency, consistency, completeness). Metrics of the application software that operates on the data (e.g., software bugs). Metrics of the platform (e.g., the performance of the access mechanisms).
*
*
*
* *
*
* *
*
*
*
*
a. Customer Related Metrics *
Customer satisfaction (user view) *
b. Data Metrics *
* * * * *
Data accuracy (e.g., wrong V&H coordinates, wrong pairs available) Data currency Data completeness Data consistency (same ®eld in different systems) Which data items are the dirtiest? Data usage (which data items get used most often by whom)
*
*
* * * * * * *
c. Platform Metrics
Changes in data structures (e.g., ®eld lengths) Number of changes made to software due to changes in real world (e.g., changes in copper facilities) Application software bugs
e. Process Metrics
Candidate metrics for data quality
*
Storage occupied Data access: ease of use Data access: response time Platform reliability
*
Frequency of manual input (normal process, not request for manual access (RMA).) Errors between systems (i.e., OK for a system but an error when passed along to another one Money (business) lost due to wrong quotes Percentage of held orders that can be attributed to bad inventory data Field technician time wasted due to bad data (addresses, inventory) Accuracy of property records for tax purposes Accuracy of location data for tax/tariff purposes Number of telephone calls (RMAs) produced due to incomplete data supplied by ®eld engineers Number of calls received from customers due to missing information in the forms Number of sources of information looked at before a customer question can be answered Number of times a question is generated due to bad data Number of times ( percentage) the RMAs were not handled correctly (i.e., second order RMAs) Retries due to staff errors Successful ¯owthroughs and things done in less time than expected Fallouts in end-to-end ¯owthroughs for provisioning Critical dates met versus missed Labor/error resolution Overtime labor/retry error resolution Volume of service orders by circuit id Errors in dealing with service orders Backlogs Average time/error Volume of orders per center
Enterprise Data Quality * * *
Volume of errors per center Volume of backlog per center Metrics of the management and operational processes (e.g., RMA, errors, delays, ¯owthroughs, user satisfaction level)
Based on the feedback from the Requirement Solicitation Workshops, response to the survey questionnaire, and general literature surveys, we developed a set of key metrics for these components. The resulting metrics are listed in the sidebar ``Candidate Metrics for Data Quality''. Naturally, not all of these 40 metrics are neededÐyou need to reduce this list based on metric priority, method of measurement, frequency of measurement, cost of measurement, risk of ignoring, and other appropriate factors. The following table (Table 2) shows how you can select data quality metrics. The main idea is to focus on those metrics that have the highest business impact. The best candidates for selection are those metrics that have high priority, low cost of measurement and high risk of ignoring. We are illustrating the use of this table through a few metrics. We used a similar table to evaluate and select the most appropriate metrics out of the 40 we identi®ed. After a metric process has been established and implemented, a quality control process must be instituted to monitor, evaluate and control the process based on the metrics. The focus is on detection and elimination of special causes that appear in the process. In particular, the process performance is predicted/measured against: *
*
Process goals (e.g., manual interventions must not exceed 100 per week). General trends (e.g., gradual improvement of data quality).
*
287
Variation control (e.g., learn root causes for pronounced variations).
Numerous examples of quality control in manufacturing can be found. For example, many manufacturing organizations target 100 defects per million parts (Motorola has targeted 3 defects per million). The main challenge is to monitor and manage variations against targeted goals for improvement. 3.3. Intersystem data ¯ow Intersystem data ¯ow in large scale telecom systems was ranked as one of the most important problems in the workshops. In a data ¯ow process, data is generally received by a system (either from one or more other systems, or directly, as input data), processed (and perhaps altered) by that system, and then passed to one or more other systems. The processing system may even create new data. Data may be committed to permanent storage ( perhaps after transformation) within the processing system and/or may be sent to other systems. Although discussed frequently in the literature Umar (1998), it is generally not the case that data in multiple systems are accessed or modi®ed concurrently. In particular, read and/or write operations under the same transaction umbrella that involve multiple systems are seldom required in telecommunications systems. However, sometimes a system may need to request some reference data from one or more external systems. Typically, the same data is passed through and processed by many systems, thus the data must be viewed in a context larger than that of any individual system. Such a view should include both representation (syntax) and interpretation (semantics) of data.
Table 2. Quality metrics selection example Quality metric (examples)
Metric priority (L, M, H) in terms net cost
Method (tool) for measurement
Frequency of measurement suggested
Cost of measurement (L, M, H)
Risk of ignoring (L, M, H)
Selection decision (Yes//No)
Customer dissatisfaction
H
Customer surveys
H
H
Yes
System utility PAWS report
Once a month (or at the end of each assignment) Daily Daily
Disk storage overruns RMAs (requests for manual assistance) Retries
L H
L M
L H
No Yes
H
PAWS report
Daily
M
H
Yes
288
Umar, Karabatis and Ness
Data ¯ow is a complex problem, involving many issues that must be considered in order to design a ¯ow that will be successful and that will minimize data reconciliation problems. It is not suf®cient to consider data ¯ow in just a local context (i.e., with respect to the systems that a given system either directly receives data from or directly sends data to). Rather, it is critical that data ¯ow be considered in the context of the enterprise and its business goals and processes. In particular, it is important that such a global context be one in which unanticipated data ¯ows can be accommodated, as systems evolve and new systems are created. It is also important that we be able to accommodate legacy systems in the data ¯ow, even though they might not be architected in a way that would be desirable if we were designing them ``from scratch''. We developed a set of recommendations for effective data ¯ow. A few of these recommendations are listed in the sidebar ``Sample Data Flow Recommendations''.
interest to us are data scrubbers, data warehouse tools, replication servers, work¯ow systems and multidatabase systems. Table 3 summarizes the results of analyzing the potential impact of the various COTS technologies to address the issues that were raised during the requirements workshops. The potential impact is quanti®ed as high, medium and low to indicate the possible effectiveness of COTS technologies to address the issues raised during the workshops. As can be seen, there is no single solution to the entire problem, however a well balanced combination of the aforementioned technologies can alleviate the problem drastically.
Sample data ¯ow recommendations Major Global Organization Recommendations *
3.4. Assessment of commercial technologies for data quality A diverse array of technologies, available from a variety of vendors, can be used to address many of the data reconciliation problems identi®ed during the Workshops. These technologies come with varying price/performance ratios, offer a wide range of features, and are available on different computing platforms. Examples of the technologies of particular
*
*
Distinct systems should have distinct purposes, should not cause the same kinds of changes to the same data items, and should not cause the same kinds of physical phenomena to occur to the same real world objects. A system that requests modi®cations to particular data should have a high proximity to the system that stewards that data A system that has a need for highly accurate data should have high proximity to the system that stewards that data
Table 3. Relative impact of solution technologies (L, M, H) Issues *Location problems *Inconsistency among systems *Process improvement and mechanization *Metrics *Flow through between systems *Root causes and data redundancy *System architecture and system evolution Standardization Inconsistency with reality Data input validation Data access and security Single primary source of data Communications/administrative complexity Ownership and accountability Methodology Recovery from data inconsistencies
Data warehouses Replication servers Work¯ow systems Data scrubbers Multidatabase systems M L L
M H M
H H H
L H
H
M
M
H
M
H
M L
L
M H M L L
M M L L L H
H
Enterprise Data Quality *
*
*
*
*
*
Data should ¯ow from a system that requires higher precision to one that requires lower precision. Data should ¯ow from a more constrained system to a less constrained system for that data. Data should originate with the system that is the steward of the data, and should ¯ow to-systems that need to use the data. A tree structure is the preferred organization for data ¯ow (no loops). Race conditions should be designed out of a data ¯ow process. This includes scenarios where the same data is sent to a system from multiple sources, as well as the scenario where different, but dependent data are sent to a system from multiple sources. A mechanism (such as sequence numbers), should be provided with data transmissions, and/or a uniform format should be provided so that systems can determine an appropriate order in which to process the data.
*
*
*
*
All systems that contain the same data items should follow identical standards for the characteristics of those data items, as well as for integrity constraints on those data items. The same data should be converted by multiple systems in the same manner. Consistency of integrity constraints, business rules, and data models need to be considered in a global context.
Major Data Validation Recommendations *
*
*
*
*
*
Any particular data item should be manually entered into a single system. Data should be entered into the system that is the steward of that data If data is entered into a target system that is not the steward, that target system should essentially act as an interface system to the steward, and forward the data to the steward for entry into the stewarding system. Data should be validated within each system that the data resides in a ¯owthrough process. Data may need to be validated, relative to a business rule, as a set, and not just as individual facts. Data needs to be either rejected or marked as being invalid as soon as it is detected.
A feedback loop to the steward would help correct data that a downstream system considers invalid. The steward should also be sent an indication of why the data is invalid. Data that is determined to be invalid should not be passed on to other systems downstream.
Major Data Flow Enablement Recommendations *
*
Data that is highly dynamic should not be updated by multiple systems. Data that needs to be validated together should be sent as a unit.
Major Intersystem Access Recommendations *
*
Major Metadata Recommendations *
289
*
Distributed Transaction Processing is not currently required as an access paradigm for data ¯ow, but it may be appropriate to consider it at some point in the future. Data should never be simply sent to another system under the assumption that it will be received, some form of guaranteed delivery or acknowledgment should be used. Appropriate synchronization mechanisms should exist to ensure either that data is received in a correct order for processing or that the receiving system can determine the correct order to process incoming data.
Major Data Replication Recommendations *
*
*
*
Data replication should be minimized, and should only be done for very good reasons. If data is replicated in more than one place, then the most accurate source of the data is the system that stewards the primary enterprise copy of that data. Data that is highly dynamic is a poor candidate for replication. Data replication may involve more than simply producing a new version of a database.
Major Procedures and Design Process Recommendations *
*
Automation of data ¯ow should be implemented where it is economically feasible to do so Evolution should never occur without the knowledge and participation of representatives from other systems that could potentially be affected.
290 *
Umar, Karabatis and Ness
There needs to be a carefully managed schedule for phasing in new systems or new versions of systems.
Data Warehouse (DW) tools can be utilized in a variety of ways within the framework of data quality. Although DWs are voluminous and cannot replace day-to-day operational systems due to performance overhead, they can be used as a point of reference for other systems. DWs help in ``inconsistency among systems''. DWs can also provide some standardization that other systems can refer to, since they usually contain information from several other operation systems. In addition, DWs can help to a limited extent in process improvement and mechanization because they contain a lot of informational context. Replication servers can alleviate the inconsistency among systems in cases where there is replicated information stored in various systems. They can help in process improvement due to their automatic way of propagating information. They are also designed to support a scheme with a primary source of data and secondary replicas. Flowthroughs are improved in the presence of a replication server, and they provide an excellent tool to maintain consistent copies of replicated data. Some replication servers support conditional replication which is extremely useful in a telecommunication company. Speci®cally, one can decide how often the replicated data will be updated and under what speci®c conditions. Work¯ow systems mainly help in designing ¯ows among various systems. They contribute to Process Improvement and Automation because they provide tools to change ¯ows and re-direct them among various systems easily, with no signi®cant programming effort. Updating ¯owthroughs can become more effective with the presence of a work¯ow system. Work¯ow systems can support system evolution speci®cally due to their ¯exibility of ¯ow re-design. Data scrubbers provide useful tools for data entry validation and intrasystem cleanup, one at a time. This capability of validation provide a means of standardizing interfaces and help minimize inconsistency among systems. Multidatabase technology can be used in a variety of facets in data reconciliation. Particularly they are useful as ``middleware'' that provide connectivity between systems. They can help in access and authorization of systems. Although the extant COTS technologies exhibit
a great deal of potential to address many of the issues raised during the workshops, the full bene®ts may be somewhat limited due to the following limitations: 1. In general, data reconciliation requires a mixture of COTS. Many COTS do not work with each other (i.e., work¯ow systems do not work with replication servers). Integration of COTS is an area of future work. 2. Many existing tools do not work on legacy data sources such as IMS, VSAM, etc. However, many RBOC data reconciliation problems require legacy data reconciliation. 3. Work¯ows with added capabilities for distributed transaction processing are needed to deal with the issues of long (Saga) transactions. 4. Web interfaces to existing tools are just beginning to emerge (e.g., Web access to work¯ows, Web access to data warehouses). This area needs to mature somewhat to be effective. 5. A potential area of future work is development of an intelligent, Web-based interface that invokes different technologies (e.g., data scrubbers, replication servers, work¯ows, multidatabase middleware, data warehouse tools) based on a problem type. 3.5. Implementation: the pilot project To demonstrate how some of the technologies can work together to solve large scale data quality problems, we created a pilot project. Its purpose was to design and implement a prototype to improve the quality of data being used in operational systems of telecommunication applications. The pilot project evolved around two existing Telcordia operational systems: CLONES and LOC. CLONES is a system responsible for creating unique codes to name and identify various entities such as equipment, building locations, etc. The entire system is built on top of a relational database (INFORMIX in Unix environments) which manages these unique codes on behalf of Telcordia clients. LOC is another Bellcore system comprising of applications built on top of a relational database (DB2 on MVS) that stores and maintains location information. In particular, LOC contains geographical location information about planning and forecasting areas, telecommunication equipment inventory, ven-
Enterprise Data Quality
dors, customers, etc. LOC also stores relationships among these locations. When a new entity is given a location code in CLONES, it must also be inserted into LOC, so there is an information ¯ow from CLONES to LOC regarding locations. However, CLONES is a separate application that does not have detailed information about locations, regarding a new entity. This information is added in LOC but often does not exist in CLONES. Since CLONES and LOC are independently updated, the databases can become inconsistent. The consequence of such type of low data quality becomes evident when some application gets invalid information from either CLONES or LOC and continues based on erroneous or invalid information. In this Section we will outline the replication and data scrubbing technologies we used to improve the data quality in this environment.
Setting up the environment. Since we did not want to interfere with the daily operations of the actual CLONES and LOC while they were on operation, we decided to get copies of these databases and set up our pilot system on these copies. This allowed us more ¯exibility with experiments such as schema changes, workarounds, etc. on copies of the actual data stored in our local PC databases. Since CLONES is implemented on an INFORMIX database running on UNIX, and LOC is on a DB2 running on MVS, we had to get the CLONES and LOC schemas as well as data extracts out of the original databases. We also chose to use a commercial replicator between CLONES and LOC. For the purposes of the pilot project Microsoft SQL Server 6.5 proved to be adequate to carry out the experiment since it comes with a data replicator that can send data to another MS SQL Server, or to another database that has 32 bit ODBC drivers compatible with those of MS SQL Server (for example ORACLE). Therefore we created extracts from CLONES and LOC which we inserted into two separate SQL Server databases running on different PCs. *
CLONES: Data for several tables were extracted as ASCII delimited ®les and were sent to our environment via FTP. Then they were entered into the SQL Server in a schema that was identical to the INFORMIX one, using a ``bulk copy'' utility of SQL Server. However, dates were creating
*
291
errors and we had to enter the date columns as strings. Translation back to dates could be done later on. LOC: Data was extracted from the LOC database, which is a DB2 database on MVS. However, several tables in LOC contained columns that represented data that was not present in CLONES.
Fig. 3 depicts the overall architecture of the pilot project. We addressed data quality in three major categories: Input validation, data scrubbing and data replication. We added an input validator on top of CLONES and LOC in order to clean new data entries. We also addressed ``dirty'' data already existing in the databases through a data scrubber. Once the data in CLONES had better quality it was replicated into LOC automatically. However, LOC was adding more columns to the data and has more detailed information about locations. To avoid the case of overwriting such additional information through automatic replication from CLONES we also sent such data back to CLONES to update its records with more recent and correct information. In the following we will describe in more detail each tool that we used to improve data quality of those systems. Input validation, standardization and scrubbing. One of the major sources to ``dirty'' the data in the databases is due to human error, especially when data is being entered into the system. Our philosophy was to ``trap'' such occurrences as early as possible. For this reason, we decided to ®lter and standardize name and address information after being entered by the user, and before being stored into the database, thus providing input validation. Therefore, we prototyped one of the CLONES data entry screensÐthe screen used to insert a building code into CLONES. The user enters the building name and address in free format. The name and address are then passed to a software scrubbing tool for decomposition into their constituent components, and standardization into a predetermined format. The use of standardization tools ensures that address information in the database is entered in a prede®ned way and minimizes human input errors. For example, ``Main St.'' and ``Main Street'' represent the same entity, however, from the database viewpoint they could create two separate rows. In cases that keys were automatically created (such as surrogate keys) the situation could easily lead into
292
Umar, Karabatis and Ness
Fig. 3. Final pilot project architecture.
unwanted cases of having two distinct keys representing the same entity. We used a commercial tool for data standardization from Innovative Systems, Inc. The ISI Edit tool standardizes and decomposes names and address into components that in turn are used in SQL statements to query or update the database. The ISI Edit routine
classi®es an input line into one of 12 line types. For our prototype, the line types of interest were: ``O''ÐOrganizational (used for building name, e.g., ``Premier Widget Corp''). ``S''ÐStreet (the address line containing a street name, e.g., ``445 South Street'').
Enterprise Data Quality
``C''ÐCity (the address line containing the city, e.g., ``Morristown, New Jersey 07960''). For each input line, ISI Edit returns a standardized line. ISI Edit standardizes organizational suf®xes (which may appear in the building name), street types, directional pre®xes, directional suf®xes, subaddress identi®ers, rural identi®ers, box identi®ers, building identi®ers, states and provinces. For certain line types (including ``S'', and ``C''), the system also returns ®xed-®eld data elements and codes. For example, the street line (i.e., type ``S'') is decomposed into street address number, street name and street type, street pre®x, and street suf®x. Since the street name and street type are separate ®elds in the CLONES table, we had to split the output returned by ISI Edit ourselves. Table 4 shows a few sample records that were standardized using ISI Edit. Although our attempt to provide input validation was successful we still had to clean the data already residing in the database. We utilized the standardization capabilities of the ISI Edit tool and we created an application that scans the entire table in the database and standardizes each record. If the original and the standardized record were not identical then we needed to take action. At this point there were two venues to take: Either proceed in batch mode and replace the original record with the standardized one, or present the result to a user who chooses whether to update the record or not. We chose the latter approach to visually examine the differences between the original and the standardized record. Schema manipulations for quality metrics. To be able to assess the quality of the data in a table, we performed some simple additions to the schema, to ease the creation of metrics for quality: *
Addition of expiration date column: Some addresses in CLONES may have a temporary
Table 4. Illustration of data cleanup Input data
Output data
Fleet National Association 1033 North Tower Hill Road Morris, New York 10011
FLEET NATIONAL ASSOC 1033 N TOWER HILL RD MORRIS, NY 10011
Hasbro Corporation 1033 Winding Lane North Sparta, NJ 07999
HASBRO CORP 1033 WINDING LN N SPARTA, NJ 07999
*
293
nature. For example, when a new subdivision is being built and no street names have been assigned yet, the data that is entered into CLONES is temporary (e.g., BLOCK 5, LOT 3) since no better description of the location exists when a location code is created. In such cases, the person entering data into CLONES can specify an expiration date for this particular CLLI code. Later on, when a real street name and number have been given to that location, it is very easy to query for expired addresses, and to correct them. In addition, one can ®nd out very easily how many addresses have expired, thus getting some metrics on the quality of the data. Addition of quality column: When new location information is entered into CLONES, the person entering it, knows how precise the information is compared with other addresses already in the database. This way, a new column is created into the schema of the address table in CLONES which identi®es the quality of the information entered. Then it is easy to assess the data quality of the table by issuing a straightforward SQL statement.
Data replication. We used Microsoft SQL Server 6.5 Replicator to experiment with data that are entered in CLONES and then are sent to LOC. In our prototype implementation we concentrated on data that populate the SITE table in LOC. This table contains information originating from four tables in CLONES. In general, the convention for replicating from one database to another is via a one-to-one mapping. SQL Server provides an easy way to replicate on a one-tomany scenario where the primary table is replicated to many secondary sites. However, in our particular example we faced a completely different situation: Several tables in CLONES are mapped into a single table in LOC. That is, we have a many-to-one scenario to deal with. MS SQL Server provides for many options regarding what to replicate, (selected columns, rows, or combination of both) and how often to replicate (instantly, periodically, every x number of transactions on the primary copy, etc.) However, it is not straightforward to combine a set of tables for replication into a single one. One way would be to create a view on CLONES and replicate the view; however, only base tables can be replicated, not views. The approach we took was to replicate each table in CLONES to LOC and reconstruct the SITE
294
Umar, Karabatis and Ness
table in LOC. This approach worked ®ne and replication proved to be working in a very satisfactory way. During the initial synchronization phase, all data, (about 30 Mb) were sent to LOC without any problems. Subsequent updates on CLONES are immediately propagated into LOC with no time delay. In short, the replication technology proved to be mature enough for our purposes. The following are the main ®ndings for the pilot project effort:
*
Front-end data validation using ISI Edit worked well. Data was successfully entered into the ``CLONES'' database. Data was successfully replicated to the ``LOC'' database via MS SQL Server Replicator.
*
*
*
*
*
*
4. Lessons Learned and Suggested Approach Our interactions with a large population of customers and working with numerous tools and methodologies have taught us the following lessons: *
*
*
*
Data quality should not be discussed in vaccuum. Many other ``non-data'' issues such as platforms, processes, and software architectures must be taken into account while considering data quality. This observation has been articulated in the literature [Orr 1998]. In discussions with end users and customers about data quality, the data quality issues get intermixed with other issues such as software quality and quality of service. Despite increased attention, enterprise data quality is not a well established and recognized corporate activity with concomitant corporate responsibility. Without such responsibility, it is dif®cult to improve the quality of enterprise data In terms of priority, we have found that more problems were related to process than data. For example, many high priority problems were not related to the quality of the data itself but were instead concerned with the process and platform issues.
*
*
Different organizations naturally have different data that is of primary concern for data quality. In the telecom industry, it happens to be location data that shows the location of customers, equipment, and work centers. Our observations contradict Ken Orr's assertion ``Data quality: use it (the data) or lose it'' (Orr, 1998). We have found that critical data is used frequently but is less apt to be of higher quality because people tend to create more copies of critical data and thus create data quality problems. For continuous improvement of data quality, it is essential that data quality procedures be meshed with existing corporate practices (i.e., data quality should not be a separate activity that may or may not be conducted). We suggest that integration of data quality with enterprise data architectures is an appropriate aproach. Issues are not all technical. We found several issues that are purely administrative. In particular, an owner of data quality must be appointed to handle the administrative issues. Different views on data quality exist and expectations vary widely among users. In addition, many hidden costs (e.g., personnel training) are encountered and should be taken into account. The COTS technologies address different parts of the problem. However, complete solutions require integration of several technologiesÐa non-trivial task.
Our approach, based on these observations and lessons, suggests a comprehensive data quality methodology. In addition, we have synthesized the main ideas from the extant literature to augment our thinking. For example, Wang (1998) has developed the concepts, principles, and procedures for de®ning, measuring, analyzing and improving information products. Based on these and cumulated research efforts, Wang presents a Total Data Quality Management (TDQM) methodology, and illustrates how this methodology can be applied in practice. TDQM consists of ®ve steps; de®ne, measure, analyze, and improve. However, it is not clear how this methodology can be integrated with the day-today corporate activities of, say, data architectures. On the other hand, Orr (1998) emphasizes that the primary focus of data quality projects is to increase the internal controls involved in entering and editing data. These efforts, according to Orr, are doomed to
Enterprise Data Quality
fail, as are one-shot attempts to clean up data, unless the use of of that data is increased. Orr contends ``data quality: use it or lose it'', i.e., if an organization wants to improve data quality, it needs to ensure that there is stringent use of each data element because use-based data quality provides a theoretically sound and practically achievable means to improve data quality. Although we have found, in our practical experience, that highly used data tends to become inconsistent more frequently ( people create multipe copies), we do agree that data quality should not be conducted in vaccuum and should be integrated with other activities to increase use. The basic premise of our approach is that data quality should be integrated with data architecture activities and data architecture should be planned with quality in mind. In particular, all activities in data architecture (data requirements, data modelling, partitioning, allocation, data ¯ows and access/interoperability) must keep data quality considerations at high priority. Our approach, depicted in Fig. 4, is presented as a sequence of steps that can be used as a checklist for a comprehensive data quality strategy. This approach is designed to address the large scale data quality problem discussed in Section 3. The following discussion brie¯y reviews these steps with particular attention to data quality.
Step 1: Information Requirements and Modeling for Data Quality Data quality must start with business requirements that drive the technology requirements. Examples of business requirements for prevention and amelioration of data quality include reduction of cost of doing business and minimization of bad exposure. The information requirements specify the data, the natural relationships between data, the transactions to access and manipulate the data, and software platform for database implementation. During this step, it is important to develop an understanding of the following requirements that drive data quality before proceeding: *
*
Data content requirements (i.e., accuraccy, consistency, currency). Size information (number of users who will access the database, database size).
* * * * *
*
*
*
* *
295
Response time (average, worst case) requirements. Scaling and growth requirements. Data security requirements. Data availability restrictions. Data synchronization requirements (i.e., how frequently duplicate data should be synchronized: immediately, hourly, daily, etc.). Connectivity needed from user sites (e.g., from PCs, Macs, UNIX). Interoperability and interfaces with other (existing/ proposed) systems. Portability between computing platforms (data portability, processing portability, user interface portability). Backup/recovery needs and disaster handling. Policy restrictions (e.g., standards and policies to be followed, application control and auditing restrictions).
After information requirements, a data model is developed to represent the view of data at several levels of detail. From a data quality point of view, data must be modeled, named and de®ned with a singular de®nition and uniform business rules to support all customers. In essence, the following steps in development of a data model should keep the data quality in mind: 1. De®ne entities (the principal data objects about which information is to be collected). 2. Identify relationships (the real-world associations) between entities. Relationships between entities can be one to one, one to many, and many to many. 3. Identify and develop attributes (characteristics of entities). 4. Represent this information by using an ERA diagram.
Step 2: Planning for Data Quality Effective planning and organizing is the key to improving data quality. This step involves the following activities: *
Choose the most appropriate metrics and measurement instruments. A few key metrics need to be selected for tracking because not all metrics can be or should be tracked. We, for
296
Umar, Karabatis and Ness
Fig. 4. Methodology to build quality in data architectures.
*
example, assembled a list of 40 metrics that could be used for measuring data quality. You can customize this list based on metric priority, method of measurement, frequency of measurement, cost of measurement, risk of ignoring, and other appropriate factors. The key issue of cost and bene®ts of gathering metrics should be considered. Well known sampling techniques can be used so that a small sample can be used to determine the key problems. Selection of proper instrumentation is an important aspect of metrics. Establish a solid replication strategy. Decide what data will be replicated, why and how? This
*
means deciding what data will be moving between sites and in one direction or bi-direction. Most organizations have some choices in what pieces of information should be placed where (i.e., how much information will be in the replicated in your home grown database and how frequently it will be synchronized). Determine a synchronization interval for different classes of data. A synchronization interval is the amount of time you can afford (from a business point of view) the data to be out of synch. Synchronization intervals can be tight (i.e., transaction level) or relaxed (i.e., daily, weekly).
Enterprise Data Quality
*
*
Establishing synchronization intervals can impact data quality. Adopt a cleanup strategy. This includes the initial cleanup and a ``cleanup interval'' for periodic scrubbing of data. For example, establish a strategy to make the location addresses compliant to the postal addresses and how frequently to repeat the cleanup process. Architect interfaces to other applications and ¯owthroughs. How the data will be exchanged between different applications (open interfaces) and how will it smoothly ¯ow. In particular, how will the information ¯ow between different databases and the back-end systems.
Establish end-user interfaces/access strategies. How will the users access the data (try to close the back doors). This issue was brought up in some of our workshops.
paper. Interested reader should refer to Umar (1997) for details. Intersystem data ¯ow in large scale distributed systems is of vital importance in establishing a vable enterprise data architectures. This issue also impacts data quality because ¯owthroughs can cause delays and data inconsistencies. Data ¯ow, as discussed in Section 3.3, is a complex problem, involving many issues that must be considered in order to design a ¯ow that will be successful and that will minimize data reconciliation problems. Brie¯y, data ¯ow must be considered in the context of the enterprise and its business goals and processes. In particular, it is important that such a global context be one in which unanticipated data ¯ows can be accommodated, as systems evolve, new systems are created, and legacy systems need to be interfaced with. We characterize the issues for data ¯ow design into several categories. The categories include *
Step 3: Data Allocation and Inter-System Flow Issues Data allocation can cause duplication and thus result in data quality problems (i.e., different copies showing different and con¯icting information). The following general guidelines should be used for data quality: *
*
*
*
Keep data replication as minimal as possible. Large number of replicates can cause serious problems in two phase commit (2PC) as well as Replication Servers. If data must be synchronized as a transaction, then keep the number of copies small (experience shows not more than 3) and use 2PC. If concurrency requirements outweigh ``subsecond'' data integrity requirements (i.e., data can be synchronized periodically) then use Replication Servers). If the network and nodes are unreliable, then use Replication Servers.
*
*
*
*
*
In addition to these guidelines, you should systematically study and evaluate various data allocation/ duplication strategies and eliminate unacceptable options as quickly as possible. Discussion of data allocation strategies is far beyond the scope of this
297
*
Global Organization of Systems and Data (i.e., how should systems be logically organized regarding data ¯ow, how should data be logically organized regarding data ¯ow, how should timing constraints and race conditions be handled?). Metadata (i.e., what standards should there be for data, what does a system need to know about other systems, where is this information stored and maintained, what is required of metadata in a global schema). Data Validation (i.e., where should data be entered into a set of systems, where should data be validated in a set of systems, what should happen if data fails a validation test). Data Flow Enablement (i.e., what must happen between systems to realize the data ¯ow, what is the role of immediate vs. batch data propagation, how should dynamic data be treated in relation to reference data). Access Paradigms (i.e., what should the intersystem access paradigms be, how can we know that the data ¯ow is successful, how do we deal with system or network failures). Data Replication (i.e., under what circumstances should we replicate data, how should we manage replicated data). Procedures and Design Process (i.e., what should be automated, what procedures are required for job scheduling, how can we gracefully evolve a set of
298
Umar, Karabatis and Ness
systems, what is required of data ¯ow documentation, what is the role of people and organizations in data stewardship). For each category, a set of guidelines should be developed. We discussed a few of these requirements in Section 3.3.
Step 4: Technologies Assessment and Selection Commercial off-the-shelf (COTS) technologies selection should be viewed in terms of data quality. In particular, software infrastructure should be chosen for: *
*
Quick and ef®cient access of data from a wide array of users residing at different sites. Interoperability of data between heterogeneous systems.
Data cleanup Access of data from a wide array of users residing at different sites is currently being provided through Web-technologies. The Web browsers access these databases through the Web server that may invoke SQL middleware. ODBC/JDBC drivers are commercially available to support ubiquitous access to heterogeneous data sources from Web browsers. Web-based tools with ODBC/JDBC should be considered for data access quality. Data interpretability is greatly in¯uenced by the exchange protocol (i.e., the format of the data and the rules of exchange) that is used between clients and servers. Let us ask the following question: can a database stored in vendor X database exchange information with vendor Y tools? In general, the answer to this question is no because the exchange protocols are largely proprietary at the time of this writing (i.e., clients from vendor X can only access vendor X database). This mismatch has led to ``database gateways'' that convert vendor X protocols into vendor Y protocol. DRDA (Distributed Relational Database Architecture) Gateways are an example of such gateways (DRDA Gateways from Informix, Oracle and other vendors convert the Informix and Oracle protocols to DB2 protocols). ODBC/JDBC
drivers can also be used for data interoperability. Selection of appropriate exchange protocol is also part of the data quality work. A diverse array of technologies, available from a variety of vendors, can be used to address many of the data quality problems. These technologies come with varying price/performance ratios, offer a wide range of features, and are available on different platforms. Examples of the technologies of particular interest to us are data scrubbers, data warehouse tools, replication servers, work¯ow managers and multidatabase systems. We discussed these technologies brie¯y in Section 3.4. As noted previously, there is no single solution to the entire problem, however a well balanced combination of the aforementioned technologies can alleviate the problem drastically.
Step 5: Implementation, Deployment, and Support An effective measurement and control process is the key to improving data quality during the implementation, deployment and support stage. This involves: *
*
*
*
Keep focus on the root causes. The root causes, may include enterprise architecture issues such as design of data ¯ow/synchronization, architecture of legacy products, number of manual interventions needed, and availability of well trained staff. Select and deploy most appropriate data quality improvement tools. Data quality improvement tools fall into the following categories: Ð Data scrubbers that correct incomplete and incorrect data through rules Ð Compare and stare tools that help in making the data consistent with reality Ð Data replicators to keep the data consistent We have discussed these tools in an earlier deliverable. Consider network issues. What type of loads and availability requirements will be placed on the network (several replication schemes require highly available networks). Re-evaluate platform issues. On what platforms replicated data should reside, what type of middle ware is available on these platform (ties into COTS). We have reviewed platform issues in the previous step.
Enterprise Data Quality *
*
*
Establish Quality Control. After a metric process has been established and implemented, a quality control process must be instituted to monitor, evaluate and control the process based on the metrics. The focus is on detection and elimination of special causes that appear in the process. In particular, the process performance is predicted/ measured against: Ð Process goals (e.g., RMAs must not exceed 100 per week) Ð General trends (e.g., gradual improvement of data quality) Ð Variation control (e.g., learn root causes for pronounced variations) Statistical quality control (SQC) is used to detect and eliminate unnatural causes. For example, SQC can be used to monitor the key metrics such as RMAs. If the RMAs for a particular system are ¯uctuating signi®cantly, then it is important to understand the reason and determine the root causes for this variation (a process owner should take responsibility for SQC). Improve Data Quality. The basic principle is to focus on critical success factors (key metrics) and systematically improve them by identifying root causes and eliminating them. From a business perspective, examples of these key metrics are customer satisfaction, RMAs, retries, average time to complete an order, etc. In addition to a long range systematic process improvement that is re¯ected in trends, unusual system behavior as displayed by SQC charts should be considered. Establish a change management procedure. This includes changes in data, the metadata as well as the underlying platforms.
*
*
*
*
Step 6: Management and Staf®ng Issues Appropriate management initiatives coupled with skilled and motivated staff are extremely important in improving data quality. The following management/staf®ng issues should be considered: *
Assign roles and responsibilities. Data owner versus quality ``inspectors'' roles. Basically, who is responsible for what during the data life cycle (data creation, design, implementation, mainte-
*
299
nance, retirement). Some companies have established ``Enterprise data architecture groups'' that watch out for the health of corporate data throughout its life cycle. Another idea is to treat data management as an asset managementÐthe managers are paid based on increase in the value of the asset (similar to other assets in organization). Establish a data quality owner. It is important to identify someone who is responsible for measuring and improving the quality of data. For example, as the value of customer and location information increases due to the New Telecom Bill, someone should own the data quality improvement task to minimize business impact. The owner may be a person or, in the case of large projects, a team that is responsible for data quality (this may be part of a total quality management program). The process owner should be viewed as manager of the data asset and should be rewarded and reprimanded as this asset appreciates or depreciates. Develop organizational procedures. In particular, develop feedback loop to continuously improve data quality (the notion of asset management). These organizational procedures may include a variety of items (based on business drivers) such as determine and isolate backdoors, do high level analysis of root causes, identi®cation of missing information, and routing of feedback by roles (i.e., next person to contact customer and stewards for data of type x, etc. are ®rst in the feedback loop). Organizational procedures should include a reward/reprimand system. Establish scheduling scenarios. Decide on events and conditions that will trigger data movement, time needed for data movement, considerations about time zone differences. This may require some organizational considerations. Include a reward/reprimand system to motivate people. Speci®cally, automated ¯owthroughs between systems (e.g., Telcordia's LOCATION Flowthrough product) can result in signi®cant changes, and in some cases, elimination of several tasks performed by the CLLI coordinators and other related folks. There has to be a reward system for these people to be motivated about implementing work¯ows ( perhaps responsibility shifts to monitor the process through metrics. Do not overlook administrative details. Take care of additional factors such as the following:
300
Umar, Karabatis and Ness
Ð Staff training; Ð Position descriptions (roles and responsibilities); Ð Responsibility shifts (impact on organizations); Ð Communication/administrative issues (i.e., documentation).
5.
Concluding Comments
We have described a case study of data quality in the telecommunication industry and showed how requirements for data quality were collected, how the data quality metrics were de®ned, what guide lines were established for intersystem data ¯ows, what COTS technologies were used, and what were the results of a prototype. As a result of experience gained and lessons learned, we have developed a comprehensive approach that synthesizes data quality issues into an architectural vision that is supported by a series of steps, procedures, checklists, and tools. Our approach takes into account the technology, process, and people issues and can be used to design data quality as an integral part of data architectures. In addition, we have synthesized the main ideas from the extant literature to augment our thinking. The results reported in this paper are a summarization of a series of technical reports, listed in the sidebar ``Data Quality Project Deliverable Reports''. In the future, we are planning to publish results from these reports as our experience grows in this area.
Data QualityÐACritical Information Systems Considerations. Data Management Review 1996. English L. Data Quality; Meeting Customer Needs. Data Management Review 1996;44±51. Kaplan D, Krishnan R, Padman R, Peters J. Assessing Data Quality in Accounting Information Systems. CACM 1998;41(2):72±78. Orr K. Data Quality and Systems Theory. CACM 1998;41(2):66±71. Redman T. Data Quality. Bantam Books, 1992. Redman T. Data Quality for the Information Age. Artech House, 1992. Redman T. The Impact of Poor Data Quality on the Typical Enterprise. CACM 1998;41(2):79±83. Moriarity T. Better Business Practices. Database Programming and Design 1996;59±61. Moriarity T. Barriers to Data Quality Part II; Business Practices. Database Programming and Design 1996;61±63. Tayi G, Ballou D. Examining Data Quality. CACM 1998;41(2):54± 57. Umar A. Application (R)Engineering: Building New Web-based Applications and Dealing with Legacies. Prentice Hall, 1997. Wang RY. A Product Perspective on Total Data Quality Management}. CACM 1998;41(2):58±96. Wand Y, Wang R. Anchoring Data Quality Dimensions in Ontological Foundations. Communications of ACM 1996;86±95.
References
Amjad Umar is Director of Distributed Objects Engineering and Research Group at Telcordia Technologies. He is also an Adjunct Professor at Rutgers University and Stevens Institute of Technology where he develops and teaches graduate level courses in distributed systems, object-oriented systems, networks, and databases. At Telcordia, he leads projects in distributed object technologies, advanced middleware, mobile computing, electronic commerce, legacy data access, data integration, and data warehouses. His previous experience includes consulting assignments in the manufacturing industry and faculty position at the University of Michigan. He has authored more than 20 technical papers and three Prentice Hall books Distributed Computing and Client-Server Systems, Application (Re)Engineering: Building Web-based Applications and Dealing with Legacies, and Object Oriented Client/Server Internet Environments. He has an M.S. in Computer, Information and Control Engineering and a Ph.D. in Information Systems (Industrial and Operations Engineering Department) from the University of Michigan. He is a member of ACM and IEEE.
Ballou D, Pazer H. Modeling Data and Process Quality in Multiinput, Multi-output Information Systems. Management Science 1985;31(2):150±162. Cellco J. Don't Warehouse Dirty Data. Datamation 1995;42±52.
George Karabatis is a Research Scientist at Telcordia Technologies since 1995, where he conducts research
Acknowledgment We would like to thank Josephine Micallef for her substantial help in this project, which would not have been complete without her tireless efforts in implementing our prototype.
Enterprise Data Quality
on heterogeneous information resources, work¯ow systems and distributed databases. He received his Ph.D. and M.S. in Computer Science from the University of Houston, and his B.S. in Applied Mathematics from the Aristotle University in Greece. His research interests include integration of heterogeneous systems, data warehouses, system architectures, work¯ow systems, quality of data, consistency maintenance between interdependent data, and transaction management. He has authored numerous publications in refereed journals, book chapters, and conference proceedings in database management, work¯ows and distributed
301
systems. He is a member of IEEE, IEEE/CS and ACM. Linda Ness is an Executive Director at Telcordia Technologies. In this capacity, she oversees tactical and strategic research projects at Telcordia. Her areas of interest span data quality, work¯ow systems, Internet Telephony, data warehouses, and data integration. In her previous assignments at Telcordia Technologies, she has managed research in data quality and data reconciliation. She also has held various faculty positions. She holds a Ph.D. in Mathematics from Harvard University.