A Framework to Assess Data Quality for Reliability Variables
Paper presented at the World Congress of Engineering Asset Management 2006
Authors: M. Hodkiewicz (1), P. Kelly (2), J. Sikorska (3), L. Gouws (4)
(1) University of Western Australia, (2) CASWA Pty Ltd, (3) IMES Group Ltd, (4) Melikon Pty Ltd
This paper presents a framework for assessing the impact of the data collection process on the validity of key measures in reliability. The quality of data is affected by many organisational and behavioural factors. The aims of developing this framework are to (1) identify inputs/steps that have the most significant impact on the quality of key performance indicators such as MTTF (mean time to failure) and MTTR (mean time to repair), (2) identify ‘weak’ links in the data collection process, and (3) identify potential remedial actions. Development of this framework will assist the understanding of assumptions used in reliability calculations and improve the quality of underlying data and the data collection process. Consequently, this is a vital step in the continued development and use of data based decision-making models for reliability assessment.
Engineering Asset Managers rely on data to support decisions. There has been little published work on how to assess the quality of maintenance data and its fitness for purpose in supporting decisions. This problem is compounded by ambiguity over who is responsible for assuring data quality and appropriate use of the data. The quality of the data is dependent on various attributes including timeliness, accuracy, relevance, completeness and accessibility. There is currently no widely accepted methodology for (i) assessing the quality of maintenance-related data, (ii) confirming that data is fit for purpose or (iii) identifying and managing changes to the data collection, storage and use system. Many of the typical problems affecting data quality are discussed in earlier case-study based work and in the experiences reported from establishing the OREDA project.
Previous work has identified three key elements required for data quality: knowing-what, knowing-how, and knowing-why. It also suggests that there are three roles associated with the data process: data collectors, custodians, and consumers. This research work also showed that data quality is highly dependent on data collectors knowing-why. For equipment maintenance and reliability data, the data collector is predominantly a maintenance or production technician. The custodians are the IT support staff responsible for managing various enterprise databases and systems, whilst the main data consumers are (i) reliability engineers who use the data when determining long-term maintenance strategies, and (ii) maintenance engineers and maintenance supervisors who use the data when addressing day-to-day maintenance issues. The framework described herein emphasises the importance of communication between data collectors and consumers, not only about the use of the data but also about the implications of poor quality data for the business process.
THE 8 STEP PLAN
A framework for evaluating data requirements is depicted in Figure 1.
Step 1: Identify the business need
Quality data is simply defined as “data that is fit for purpose”. Therefore it must, by definition, be assessed in the context of a specific business need (its purpose). After all, multiple consumers of a particular set of data may use it for different purposes and therefore data that is of sufficient quality for one user may not be appropriate for another. Consequently, before any data quality improvement process can be initiated, there must be a clear understanding of the business decisions that must be supported by the data. We describe this as a context-oriented approach to data quality. Business needs will generally be defined using words such as “measure”, “quantify”, “improve”, “reduce” or “control”. To control and improve a business process, its status first needs to be determined and methods to measure the effect of changes then determined. The word “measure” and its various synonyms imply an inherent data requirement.
Figure 1: Framework for assessing and improving maintenance data quality
Step 2: Identify the metrics that will meet the business need
Once a business need has been identified, it is possible to determine what metrics can be used to meet that need. Some of the most common Reliability-Availability-Maintainability (RAM) metrics include mean time to failure (MTTF), mean time between failures (MTBF), mean time to repair (MTTR), availability, reliability, logistical delay time and a variety of maintenance-related costs. In addition to identifying what must be measured, it is also important to identify the level of accuracy required and the frequency at which the metrics are to be recalculated (e.g. once a month, once a day, 1000 times per second, etc.). Again, these are context-specific issues.
Step 3: Identify the data required for each metric
Having selected the metrics, the data required to produce them to the required level of accuracy can be determined. In practice, this involves three steps, as illustrated in Figure 2.
A: Identify metric variables. By implication, metrics involve some changing variables or dependencies. It is therefore important to identify what the variables are, and how often they need to be acquired.
B: Identify the data fields that correspond to variables. Does the variable correspond to a single data field in a database or a single gauge to be inspected, or must it be deduced from a number of sources? As the number of data sources increases, so does the risk of error, especially when data is collected manually.
C: Determine where and how the data is stored. Data required for RAM analyses is stored in a variety of formats and locations, including equipment databases, computerized maintenance management systems (CMMS) and/or associated ERP systems, spreadsheets, paper records and/or the memories of maintenance and operations personnel. In this paper, we refer to the various software systems collectively as Asset Management Databases (AMD).
Figure 2: Breakdown of Step 3 and Step 4
Step 4: Analyze Data
Once the data has been sourced, its context-specific quality can be assessed. Again, this step can be divided into two smaller steps, shown in Figure 2. Data sources first need to be manually examined and the current contents qualitatively assessed. Are the fields full or empty? Is a field codified or free-text? Do the fields contain data relevant to the problem? This overview influences the development and/or selection of more specific questions for more comprehensive quantitative assessment. It is suggested that context-specific questions be formulated to collectively determine the following:
• Is the data accurate? (Intrinsic data quality)
• Is the data appropriate to the business need? (Contextual data quality)
• Is the data represented appropriately? (Representational data quality)
• Is the data accessible but secure? (Accessibility data quality)
The second step is to develop measures to assess data attributes for the data fields. Required attributes for each of these categories are given in Table 1. The structure and nature of the questions should be dependent on the data variable under investigation and its context. Wherever possible, questions should be designed to elicit a quantitative response. Answers can then be viewed individually and/or as weighted sums, resulting in an overall score of data quality for that particular field.
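As a rough illustration of the weighted-sum idea, the sketch below combines a handful of question responses into a single score for one data field. The weights, the 0 to 1 scaling of answers and the helper names (`Response`, `yes_no`, `scaled`, `field_quality_score`) are assumptions made for this example; the paper does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Response:
    question: str
    value: float   # answer mapped onto 0..1
    weight: float  # relative importance, chosen by the data consumer (assumed)


def yes_no(answer: bool) -> float:
    """Map a yes/no or true/false reply onto 1.0 / 0.0."""
    return 1.0 if answer else 0.0


def scaled(answer: int, low: int = 1, high: int = 5) -> float:
    """Map a scaled reply (1-5 by default) onto 0..1."""
    if not low <= answer <= high:
        raise ValueError(f"answer {answer} is outside {low}..{high}")
    return (answer - low) / (high - low)


def field_quality_score(responses: list[Response]) -> float:
    """Weighted mean of all responses for one data field, in the range 0..1."""
    total_weight = sum(r.weight for r in responses)
    if total_weight <= 0:
        raise ValueError("at least one response needs a positive weight")
    return sum(r.value * r.weight for r in responses) / total_weight


# Hypothetical replies for the functional location field.
responses = [
    Response("The data is accurate", yes_no(True), weight=3.0),
    Response("Data is complete", yes_no(False), weight=2.0),
    Response("FL data definitions are clear to all users", scaled(4), weight=1.0),
]
score = field_quality_score(responses)
print(f"Functional location quality score: {score:.3f}")
```

Keeping the individual responses alongside the aggregate lets the answers still be viewed one by one, as suggested above, rather than only as an overall score.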
Table 1: A conceptual framework for data quality (Adapted from )
Data quality category | Attribute | Description
Intrinsic (Does the data have any inherent quality in its own right?) | Believability | Real and credible
 | Accuracy | Accurate, correct, reliable, errors can be easily identified
 | Objectivity | Unbiased and objective
 | Reputation | Reputation of the data source and data
Contextual (Is the data appropriate to the business need?) | Relevancy | Applicable to task at hand, usable
 | Timeliness | Age of the data is appropriate to the task at hand
 | Completeness | Breadth, depth and scope of information contained in data
 | Appropriate amount of data | Quantity and volume of available data is appropriate
 | Value-added | Data gives a competitive edge, adds value to the operation
Representational (Is the data represented appropriately?) | Interpretability | Data are in appropriate language and units and data definitions are clear
 | Ease of understanding | Easily understood, clear, readable
 | Representational consistency | Consistently formatted and represented, data are compatible with previous format
 | Concise representation | Concise, well organized, appropriate format of the data
Accessibility (Is the data accessible but secure?) | Accessibility | Accessible, retrievable, speed of access
 | Access security | Access to data can be restricted, data is secure
Step 5: Identify how to improve the quality of each data element
It is likely that respondents will give very different answers to the data assessment questions posed in the previous step. Ultimately, it will be up to the data consumer to decide whether a piece of data is “fit for purpose”. Where the data is not fit for purpose, the first step is to meet with the collectors of the data to understand why this is so. One of the main reasons for erroneous data is a lack of understanding or appreciation by data collectors of the quality requirements for a piece of data.
Step 6: Implement changes
Unfortunately, a data quality audit will probably result in the identification of more problems than can reasonably be resolved within available timeframes and/or budgets. Therefore, it is suggested that improvements be prioritised based on the complexity of the change required, and the aggregated error caused by data in its present form on the metric and business need(s). A Pareto chart is a good tool to help visualise the changes and decide where time and money should be invested.
Step 7: Assess improvements
By revisiting Step 4 and using the same measures of data quality, it is possible to express the success or failure of Step 6 quantitatively. This step (and the next) requires a long-term focus, as there may be a significant delay before a meaningful amount of data has accumulated after changes have been made. One of the benefits of Step 7 is the confidence of having hard data to support “the cause” of improving data quality in the organization.
Step 8: Establish periodic review/audit process to verify how well the business need is being met
By periodically reviewing the metrics required to meet the current business needs, the relevance of the data is maintained. Realistically, if Step 7 is considered difficult and is rarely implemented, then this stage acquires a mythical status. The reality, however, is that business requirements change for non-engineering reasons. For example, a decision is made to run the plant at minimum cost, or to ensure availability for a critical period. Or maybe a new CEO is appointed. As mentioned in Step 5, the most important people to keep abreast of current business needs and corresponding data requirements are the data collectors; improving their understanding has a profound follow-on effect.
APPLYING THE 8-STEP PLAN TO ASSET MANAGEMENT DATA
The process described above will be illustrated using one common metric, mean time to failure.
Step 1: Define the business need
For the purposes of illustrating the process, we assume that the following business need has been defined by a refinery’s management team: “To measure and improve reliability of rolling element bearings in refinery centrifugal pumps.”
Step 2: Identify the metrics
The selected metric is the mean time to failure (MTTF) of bearings in pumps.
Step 3: Identify data required
MTTF = Total number of bearing operating hours / Total number of bearing failures
In a perfect system from which only MTTF would ever be calculated, each bearing failure event would have (a) a single unique designator classifying its failure completely (i.e. pump bearing failure), and (b) the age of the bearing at the time of failure. Unfortunately, data storage systems do not always record this information directly. Instead, failure classification or operating hours need to be deduced from a number of fields, as shown in Figure 3. If age at failure (operating hours) is not available, the date of failure and date of installation are often used instead. Data requirements for this bearing example include the functional location (or equipment number), failure mode (i.e. bearing failure), failure type, event date/time and operating profile. Operating profile is helpful because it can be used to explain extraordinary events (i.e. data outliers) that may not reasonably represent the life of bearings at the refinery, but can significantly skew results.
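Where operating hours are not recorded, the age at failure is often approximated from calendar dates, as noted above. The following is a minimal sketch of that approximation. The timestamps are hypothetical, and the calculation assumes the pump ran continuously between installation and failure, which overstates the operating hours of a pump that spends time on standby.

```python
from datetime import datetime


def age_at_failure_hours(installed: datetime, failed: datetime) -> float:
    """Approximate bearing age at failure as elapsed calendar hours."""
    if failed < installed:
        raise ValueError("failure date precedes installation date")
    return (failed - installed).total_seconds() / 3600.0


def mttf_hours(ages: list[float]) -> float:
    """MTTF = total bearing operating hours / total number of bearing failures."""
    if not ages:
        raise ValueError("no failures recorded; MTTF is undefined")
    return sum(ages) / len(ages)


# Hypothetical (installation, failure) timestamps for three bearings
# fitted in turn at the same functional location.
events = [
    (datetime(2005, 1, 1, 8, 0), datetime(2005, 1, 22, 8, 0)),
    (datetime(2005, 1, 22, 8, 0), datetime(2005, 3, 5, 8, 0)),
    (datetime(2005, 3, 5, 8, 0), datetime(2005, 3, 26, 20, 0)),
]
ages = [age_at_failure_hours(installed, failed) for installed, failed in events]
print(f"Ages at failure (h): {ages}")
print(f"MTTF (h): {mttf_hours(ages):.1f}")
```

An inaccurate failure date feeds directly into this estimate, which is one reason the event date/time field appears among the key data requirements.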
[Figure 3: mind map centred on "MTTF - pump bearing". Nodes: functional location (ID of failed equipment unit; ID of failed equipment sub-unit, e.g. drive end bearing); event date/time (date, time and clock hours taken out of service; date and time of return to service); failure type (suspension (S), functional failure (FF), potential failure (PF)); functional failure code, by operator (e.g. overheating, fail to start, vibration); failure description/observed cause of failure, by maintainer; throughput/load/sales orders; human factors (e.g. sick days, holidays).]
Figure 3: Example of the data required to calculate the MTTF metric for a pump bearing
In Figure 3, the font format (normal/italic) and box colour (grey, yellow, blue) are used to indicate the ‘data collector’. In the context of this example:
1. The operator/production technician (normal font, blue box) records the date the pump is removed from and returned to service, the functional location of the pump, and the functional failure code. This code identifies ‘why’ the unit was removed from service.
2. The maintainer (italic font, grey box) identifies the maintainable item that is the cause of the equipment being removed from service, in this case, the bearing. Further information is then required to describe the failure; example codes are provided in ISO 14224. Another vital piece of data is whether the event for the maintainable item is a functional failure, potential failure or suspension. For this example, a functional failure occurs if the bearing has failed, causing the pump to shut down. If the bearing is damaged but still functioning at the time of removal, this is described as a potential failure. A suspension describes the situation in which the bearing is replaced in the course of a pump overhaul while it is still functioning. This may occur if the pump is removed from service to replace a mechanical seal or rebuild the wet-end. The recording of suspensions is vital for the accurate estimation of MTTF. For example, if one pump operating for 1000 hours experiences 2 bearing failures and has a further 2 functioning bearings replaced during replacement of mechanical seals, the true MTTF (accounting for suspensions) is 1000/2 = 500 hours. If the suspensions are incorrectly counted as failures, the calculated MTTF is 1000/4 = 250 hours. Therefore, omitting the suspension/failure distinction can result in a dramatic under-estimation of the bearing MTTF.
3. The operating profile (normal font, yellow box) comes from neither the operator nor the maintainer. It may include, but is not limited to, (1) indicators of the load on the equipment, for example, plant throughput, equipment load, sales orders, (2) the conditions of operation, and (3) who operated the unit. This information is likely to reside in databases outside the CMMS which can be accessed through data warehouses, reports and other IT solutions.
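The effect of the suspension/failure distinction in the 1000-hour example can be reproduced with a few lines of code. This is a sketch only: the `EventType` codes follow the S/FF/PF labels of Figure 3, while the function name and the choice of which event types count as failures are assumptions for illustration (the treatment of potential failures is left to the analyst).

```python
from enum import Enum


class EventType(Enum):
    FUNCTIONAL_FAILURE = "FF"
    POTENTIAL_FAILURE = "PF"
    SUSPENSION = "S"


def mttf(total_operating_hours, events,
         failure_types=frozenset({EventType.FUNCTIONAL_FAILURE})):
    """MTTF = total operating hours / number of events counted as failures."""
    failures = sum(1 for event in events if event in failure_types)
    if failures == 0:
        raise ValueError("no failures recorded; MTTF is undefined")
    return total_operating_hours / failures


# One pump, 1000 operating hours: two bearing failures and two functioning
# bearings replaced during mechanical seal changes (suspensions).
events = [
    EventType.FUNCTIONAL_FAILURE,
    EventType.FUNCTIONAL_FAILURE,
    EventType.SUSPENSION,
    EventType.SUSPENSION,
]

correct = mttf(1000.0, events)
# Miscoded data: every replacement is treated as a failure.
naive = mttf(1000.0, events, failure_types=frozenset(EventType))
print(f"MTTF with suspensions recorded:   {correct:.0f} h")
print(f"MTTF with suspensions as failures: {naive:.0f} h")
```

The two results correspond to the 500-hour and 250-hour figures quoted in the example.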
Figure 4: Entity relationship diagram for data shown in the Mind Map in Figure 3
Figure 4 shows the MTTF metric, and its contributing factors from Figure 3, expressed as an Entity Relationship (ER) diagram. Database administrators and software developers use ER diagrams to specify the design of data tables and queries. Translating the data collector’s view into the data custodian’s view without loss of information or form is important because it facilitates accurate communication between data consumers and custodians. This leads to better IT systems, which in turn lead to better quality data.
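To make the custodian's view more concrete, the sketch below expresses a small part of such a model as relational tables and derives the MTTF with a query. The table names, column names and the pump tag `P-101` are hypothetical and are not taken from Figure 4; the data reuse the 1000-hour example given earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
CREATE TABLE functional_location (
    fl_id           TEXT PRIMARY KEY,
    description     TEXT NOT NULL,
    operating_hours REAL NOT NULL
);
CREATE TABLE maintenance_event (
    event_id          INTEGER PRIMARY KEY,
    fl_id             TEXT NOT NULL REFERENCES functional_location(fl_id),
    maintainable_item TEXT NOT NULL,
    failure_type      TEXT NOT NULL CHECK (failure_type IN ('FF', 'PF', 'S'))
);
""")
conn.execute(
    "INSERT INTO functional_location VALUES ('P-101', 'Centrifugal pump', 1000.0)"
)
conn.executemany(
    "INSERT INTO maintenance_event (fl_id, maintainable_item, failure_type) "
    "VALUES (?, ?, ?)",
    [
        ("P-101", "bearing", "FF"),
        ("P-101", "bearing", "FF"),
        ("P-101", "bearing", "S"),
        ("P-101", "bearing", "S"),
    ],
)

MTTF_QUERY = """
SELECT fl.operating_hours * 1.0 / COUNT(ev.event_id)
FROM functional_location AS fl
JOIN maintenance_event AS ev ON ev.fl_id = fl.fl_id
WHERE fl.fl_id = ?
  AND ev.maintainable_item = 'bearing'
  AND ev.failure_type = 'FF'
GROUP BY fl.fl_id
"""


def bearing_mttf(connection, fl_id):
    """Return the bearing MTTF in hours, or None if no functional failures exist."""
    row = connection.execute(MTTF_QUERY, (fl_id,)).fetchone()
    return None if row is None else row[0]


print(bearing_mttf(conn, "P-101"))
```

Because the failure type is a constrained column in this sketch, an event cannot be stored without the suspension/failure distinction that the metric depends on.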
Step 4: Analyse the data
Step 4a: Examine data in the data fields.
Investigation of the data may reveal issues such as:
• Inaccurate failure date.
• Nonsense free-text data (for example “Broken” as a fault description).
• Incorrectly selected options.
• Prevalence of “other” or “none of the above” selections in restricted entry fields.
• Missing data.
• Referential integrity problems.
• Inappropriate selection of functional location level.
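Several of the issues listed above (missing data, heavy use of “other”, uninformative free text) can be counted directly once the records are exported. The sketch below assumes the records are available as a list of dictionaries; the field names, the exemption codes and the list of vague descriptions are hypothetical and would need to be adapted to the CMMS in question.

```python
EXEMPTION_CODES = {"other", "none of the above"}  # assumed exemption codes
VAGUE_DESCRIPTIONS = {"broken"}                   # assumed uninformative text


def summarise_field(records, field, flagged_values):
    """Return the fraction of records where `field` is missing or flagged."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to assess")
    missing = 0
    flagged = 0
    for record in records:
        raw = record.get(field)
        text = "" if raw is None else str(raw).strip()
        if not text:
            missing += 1
        elif text.lower() in flagged_values:
            flagged += 1
    return {"missing": missing / total, "flagged": flagged / total}


# Hypothetical work order extracts.
records = [
    {"failure_mode": "Bearing failure",
     "description": "Inner race spalled, grease discoloured"},
    {"failure_mode": "Other", "description": "Broken"},
    {"failure_mode": "", "description": "broken "},
    {"failure_mode": None, "description": ""},
]

mode_summary = summarise_field(records, "failure_mode", EXEMPTION_CODES)
text_summary = summarise_field(records, "description", VAGUE_DESCRIPTIONS)
print("failure_mode:", mode_summary)
print("description:", text_summary)
```

Counts of this kind only give the overview described in Step 4a; issues such as an inaccurate failure date or a wrongly chosen functional location level still need comparison against other sources.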
Step 4b: Develop measures to assess data attributes in those fields
In order to measure the data quality of each of the four main fields (functional location, event date, failure mode and failure type) for this example, the authors propose the application of questions about each of the relevant data quality attributes from Table 1. This is illustrated in Table 2. For each attribute, one or more questions are required to assess the quality of that attribute in the context of the data field under investigation, the functional location (FL) in this example. Examples of questions that address each of the data attributes are shown. The questions can be framed to give either yes/no, true/false, scaled (1-5) or numerical (%) replies depending on the data variable under investigation. This illustration is not intended to be comprehensive but illustrative of a general approach.
Table 2: Application of selected data quality attributes to the Functional Location (FL) data field
Data quality categories and attributes: Intrinsic (Accuracy, Reputation); Representational (Concise representation, Ease of understanding, Representational consistency); Accessibility (Accessibility, Access security)
Question for the Functional location field in the MTTF calculation | Marking
The data is correct | Y/N
The data is accurate | Y/N
I know who collected the data | Y/N
I know who entered the data | Y/N
Is the selection of FL affected by who collects the data? | Y/N
Data in this field is relevant for this analysis | T/F
Data in this field is appropriate for this analysis | Y/N
Data is sufficiently current for our work | Y/N
Data is complete | Y/N
Are FL data definitions clear to all users (collector, custodian and user)? | Y/N
Is the information easily retrievable? | Y/N
Are drop-down menus well structured? | Y/N
If there have been changes in the FL structure as the CMMS has been upgraded, is there a translation between old and new systems? | Y/N
Is each FL entered in exactly the same format? | Y/N
Is the maximum number in any one drop-down menu less than 8? | Y/N
Is the information easily accessible? | Y/N
Is the information easily obtainable? | Y/N
Is there a system to prevent unauthorized changes to the system? | Y/N
Step 5: Identify how to improve the data quality of each element
Based on the results of Step 4, we identify that there are some issues with missing data and that some fields are being incorrectly filled in. To address these issues we decide on the following measures:
• Auditing of data at the time of collection.
• Establishing a process for data vetting.
• Careful selection of classifications; avoid presenting codes.
• Reduce manual data entry; source date data from other, automated sources (e.g. purchasing system).
• Data entry close to the job in time and space.
• Training.
• Introduce a workflow/process to derive change from exceptions such as selecting “other”.
Step 6: Implement changes
The proposed changes may be ranked in terms of their efficacy and their cost or difficulty. This allows changes that have high impact and low to medium cost or difficulty to be tackled first.
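One simple way to turn such a ranking into an ordered work list is sketched below, using the options listed in Table 3. The mapping of the qualitative levels onto the numbers 1 to 4 and the "efficacy minus cost" ordering are assumptions made for illustration; any other prioritisation rule agreed by the stakeholders could be substituted.

```python
# Assumed ordinal scale for the qualitative ratings.
LEVELS = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}

# (change, efficacy, cost/difficulty), as listed in Table 3.
options = [
    ("Alter IT systems to check for known garbage values", "High", "Medium"),
    ("Two week audit drive on work order data", "Medium", "Low"),
    ("Identify data owners and 'goal' them on data quality", "High", "Very High"),
    ("Install automated vibration sensors on worst performing pumps",
     "Very High", "Medium"),
]


def rank_options(candidates):
    """Order options by (efficacy - cost), breaking ties on higher efficacy."""
    def sort_key(option):
        _, efficacy, cost = option
        net_benefit = LEVELS[efficacy] - LEVELS[cost]
        return (-net_benefit, -LEVELS[efficacy])
    return sorted(candidates, key=sort_key)


ranked = rank_options(options)
for position, (change, efficacy, cost) in enumerate(ranked, start=1):
    print(f"{position}. {change} (efficacy {efficacy}, cost/difficulty {cost})")
```

Under these assumptions the very costly option of goaling data owners drops to the bottom of the list, in line with the advice to tackle high-impact, low to medium cost changes first.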
Table 3: Example of table to rank options for improving data quality
Change | Efficacy | Cost/Difficulty
Alter IT systems to check for known garbage values | High | Medium
Two week audit drive on work order data | Medium | Low
Identify data owners and ‘goal’ them on data quality | High | Very High
Install automated vibration sensors on worst performing pumps | Very High | Medium
Step 7: Assess improvements to data quality
Select a suitable period for review and engage an outside body to apply the metrics of data quality developed in Step 4 to information collected after the completion of Step 6. The resulting independent report will highlight changes that have been effective in improving data quality and assist in identifying where to focus more effort.
Step 8: Establish periodic review/audit process
The executive summaries of all the data quality projects are reviewed annually to assess their relevance to the current business needs. Operations staff are present at this meeting and feed back their experiences and thoughts on the changes that have been implemented. Where it is clear that the business need has changed, the need to revisit the data quality requirements is discussed. In this case MTTF is a useful base statistic, relevant to a number of business needs. As such, improving the quality of its underlying data is likely to deliver value for some time to come.
DISCUSSION: GLOBAL ISSUES
Data field links
Figure 5 illustrates how a specific functional location at the unit level is linked to other fields in the CMMS and other databases. The selected equipment component, in this example a pump bearing, will determine (1) the fields of the sub-unit and maintainable items, (2) the equipment-specific corrective maintenance and preventative maintenance activity types, and (3) the equipment-specific failure descriptors and causes. In addition, the functional location may be linked to a serial number and perhaps also a separate equipment identification number (though equipment ID and equipment unit level functional location are often the same number). These numbers are related to installation and design data in the equipment database. There may also be additional links to databases housing reliability centred maintenance analysis or root cause failure analysis information. Careful thought needs to go into both the design and use of functional location structures and the design of links to related fields internal and external to the main database. This mind-map approach to identifying all major users of specific data fields, coupled with a knowledge of the data required for each of the calculated reliability variables, can be used to identify (1) key data fields and (2) the stakeholders in the data collection, storage and use sequence that need to be consulted if changes are required.
Data collection versus data use
Work by Lee and Strong has shown that data collectors with “why-knowledge” about the data production process contribute to producing better quality data. Overall, improving the knowledge of data collectors is more critical than that of data custodians. Knowing-why is gained from experience and understanding of the objectives and cause-effect relationships underlying activities (knowing-what) and procedures (knowing-how) involved in work processes in organisations. In other words, it is important for the collectors of data required for reliability analysis (the maintainers and operators) to understand why they need to collect the data and how it will be used. If, as is common, the use of the data has not been defined, either copious amounts of data are collected without any defined or proven purpose or, alternatively, insufficient data of any kind is collected. Either situation results in little data of use being available for decision making. The solution is to identify the variables on which key decisions are made, identify the data required for these variables, and communicate both the process and the results to the data collectors. Unless the data collectors appreciate the process and know how, why and by whom the data is being used, they will have no motivation to collect quality data. Furthermore, the more direct and timely the feedback to the data collector about the quality of the data and its direct effect on the business need, the more influential this feedback process becomes. Data collectors should know what their data is used for, and be given feedback on outcomes following the analysis based on the collected data.
[Figure 5: mind map centred on "FL - Equipment unit level (pump)". Nodes: links to equipment database (design data, installation data); links to equipment unit level failure data (functional failure categories by operator, method of failure detection/observation, failure severity class (critical/non-critical), failure type (functional, partial, suspension)); links to maintenance events, with corrective maintenance activity codes (replace, repair, modify, adjust, refit, restart/reset but no maintenance action) and preventative maintenance activity codes (replace, periodic service, periodic function test, periodic inspection, major overhaul); determines sub-unit classification (pump unit, power transmission, lubrication system, cooling system); determines maintainable item classification; links to corrective maintenance records, failure descriptors and failure causes.]
Figure 5: Illustration of the links between a functional location at the unit level and the fields in other parts of the asset management system.
Codes versus text
Many database systems rely on data collectors entering much of their data by making selections from predefined lists, for example, a list of failure mode codes or functional locations. The downside of this is that very little description is supplied to the data collector to help them make their selection, resulting in increased incidence of miscoding or excessive use of exemption codes (e.g. “other”). In extreme cases, normally when using older systems, only the raw codes are presented, with explanations given in associated paper manuals or help files. Paradoxically, entering data in a pre-codified form does not make it easier for the computer system to deal with. Choosing a value from a list of options and having the computer generate the code is no harder and represents a very small cost when compared to the subsequent entry and application of inappropriate or incorrect data. Where possible, the authors recommend that data collectors be presented with a rich description of available options. Where there are several similar options available, clarifying text and images should be provided in a context-sensitive manner using popup windows or “mouse over” actions.
Skilled data entry operators can achieve data entry error rates below 1%. If we charitably describe our operators as being skilled in data entry, that means our 100,000 row maintenance history table may contain up to 1000 garbage records. The reality is often much more frightening.
To pick up on the example presented earlier, screens that allow the input/selection of functional location (equipment/tag/plant ID) are frequently poorly implemented. In the presentation of this data by computer systems, a hierarchical approach is ubiquitous, with much effort and meeting-time devoted to the design of the tree. A problem arises because different users of the system perceive the organization of equipment in different ways. As an example, the electrical system engineer may think of a faulty light globe as ‘100W globe in socket 7 of cable run 3 from distribution panel 8’. Meanwhile an operator sees it as ‘100W globe at east end of MCR3 plant west’, and the stores technician as ‘100W globe, Supplier 3, electrical consumables’. The important point is that these are all valid ways of looking at the issue. Nevertheless, many CMMS present only one hierarchy, thereby disadvantaging users who view the data structure differently. To overcome this problem, a project is currently underway to trial a data-mining portal which allows multiple trees to be used to select items of interest. Results will be presented in a future paper.
Data ownership
In many ways it is unfortunate that the term “data ownership” was ever coined. The concept of data owner as a role implicitly assigns data maintenance tasks to the data owner. Therefore, by checking our own job descriptions to ensure that the words “data owner” are absent, we, as RAMS professionals, inevitably blame data quality problems on someone else. Yet, as the greatest users of RAMS data, our worth as analysts, engineers and professionals is directly linked to the quality of the data on which we base our contribution.
The reticence of individuals to take on data ownership roles is often not due to an unwillingness to undertake the vetting of incoming data, but rather a fear of opening a Pandora’s Box of data quality issues that may then become a significant burden for an individual. It is thus important to define responsibilities when assigning/accepting data ownership roles. Just because someone is prepared to maintain data does not necessarily mean they should be, individually, tasked with rectifying the sins of the past. We suggest that, where significant remediation or changes to data management practice are identified, these are written up as separate business cases or project proposals. This helps establish the cost (and hopefully the value) of these changes in the context of the business needs.
There are a great many users of the same maintenance and failure data and considerable care should be taken before any changes are made to the data collection, data structure or data storage processes. This paper presents and illustrates a framework to assess the quality of data and the potential impact of data collection processes on the validity of key metrics used in assessing equipment reliability. We describe the new framework as a ‘context-oriented’ approach to data quality as the process assesses the data in individual fields in the context of its fitness for purpose; the purpose is the calculation of a specific reliability metric which must be necessary to support a specific business need. Advantages of this approach include: (1) a structured prescriptive methodology that can be applied to many situations, (2) a direct link between data quality and the business process, and (3) identification of all data inputs and their associated data collector, data storage format and process. In the authors’ opinion much more thought and attention needs to be given to the process of collecting, storing and using data for maintenance and reliability decisions. Using the framework described herein will assist in understanding assumptions used in reliability calculations and improve the quality of underlying data and data collection processes.
REFERENCES
1. Wang, R. Y. & Strong, D. M. (1996) Beyond accuracy: What data quality means to consumers. Journal of Management Information Systems, 12(4), 5-34.
2. Koronios, A. & Lin, S. (2004) Key issues in achieving data quality in Asset Management. VETOMAC-3/ACSIM-2004 (Vibration Engineering & Technology of Machinery, Asia-Pacific Conference on System Integrity & Maintenance 2004, December 6-9), New Delhi.
3. Sandtorv, H. A., Hokstad, P. & Thompson, D. W. (1996) Practical experiences with a data collection project: the OREDA project. Reliability Engineering and System Safety, 51, 159-167.
4. Lee, Y. W. & Strong, D. M. (2004) Knowing-Why about Data processes and Data quality. Journal of Management Information Systems, 20(3), 13-39.
5. International Standard: Petroleum and natural gas industries – Collection and exchange of reliability and maintenance data for equipment, ISO 14224 (1999).
6. Hodkiewicz, M. R., Coetzee, J. L., Dwight, R. A. & Sharp, J. M. (2006) The importance of knowledge management to the Asset Management Process. Business Briefings: Oil and Gas Processing Review 2006, 43-45.
7. Gilb, T. & Weinberg, G. (1977) Humanized Input: Technique for Reliable Keyed Input. Winthrop Publishers, Inc.