Data quality aspects of a database for abdominal septic shock patients

Computer Methods and Programs in Biomedicine (2004) 75, 23–30

Jürgen Paetz a,b,*, Björn Arlt a,b, Kerstin Erz b,c, Katharina Holzer b, Rüdiger Brause a, Ernst Hanisch b,d

a Fachbereich Biologie und Informatik, Institut für Informatik, J.W. Goethe-Universität Frankfurt am Main, Robert-Mayer-Straße 11–15, D-60054 Frankfurt am Main, Germany
b Zentrum der Chirurgie, Klinikum der J.W. Goethe-Universität Frankfurt am Main, Theodor-Stern-Kai 7, D-60590 Frankfurt am Main, Germany
c Chirurgische Klinik, Städtische Kliniken Frankfurt-Höchst, Gotenstraße 6–8, D-65929 Frankfurt am Main, Germany
d Asklepios Klinik, Röntgenstraße, D-63225 Langen, Germany

Received 10 August 2003; received in revised form 15 September 2003; accepted 15 September 2003

KEYWORDS Retrospective data; Medical data mining; Database; Data quality; Preprocessing

Summary For many years, medical researchers have investigated the mechanisms that may cause septic shock. Although many approaches have analyzed smaller parts of the relevant data or single variables, no larger database with all the possibly relevant data existed. Our work aimed to bridge this gap. We built a large database for abdominal septic shock patients. While building it, we were confronted with many problems concerning the database realization and the data quality. We therefore demonstrate how we built our database and how we assured data quality. This is of interest to all medical or computer scientists who are concerned with building medical databases from retrospective data, e.g. for data mining purposes. © 2003 Elsevier Ireland Ltd. All rights reserved.

* Corresponding author.

1. Introduction

Septic shock, an outcome of an immune system reaction that can occur, for example, after an operation, has a mortality of about 50% in intensive care units (ICU). It was formally defined by a consensus conference [1]. Medical information about septic shock can be found in [2–4]. Its proportions are visualized in Fig. 1. Many large databases were built for ICU patients, e.g. for evaluating the quality of care by medical scores, only storing a small number of patient variables ([...] 80%) for 24 and 12 h resampling rates. Even for a resampling rate of 1 h, the share of available sample values stays above 60%. But the loss of available samples is 4% when moving from a 24 h to a 12 h resampling rate, and about 20% when moving from 24 h to 1 h. For all the other variables, a resampling rate of 12 h leads to a high loss of available samples compared to 24 h resampling. Thus, a multi-dimensional analysis of the variables is only meaningful with a 24 h resampling rate. This is the quantitative reason why medical scores are often calculated only once a day: these scores need all of these variables. Of course, variables such as ferrum cannot be analyzed at all, since an availability of 2.66% of the samples (24 h resampling) is much too low. We believe that at least 80% of the samples of a variable should be available so that the missing values can be replaced in a meaningful way (see Section 4.3). To analyze variables such as ferrum, prospective measurements are needed. Although PCWP is measured on average every 25 h, its availability is only 6.41% for a 24 h resampling rate. The reason is that PCWP is not measured once every day. It is measured whenever needed, for a short time and very often within this short time. The rest of the time it is not measured at all, i.e. PCWP is not measured regularly, which makes a retrospective analysis very difficult.
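The availability figures discussed above can be illustrated with a small sketch. It assumes that "availability" means the share of resampling intervals that contain at least one measurement; this excerpt does not give the exact definition. The function name, the input format (hours since admission) and the toy data are hypothetical and are not taken from the MEDAN database.

```python
import math

MIN_AVAILABILITY = 0.80  # minimum share of available samples suggested in the text


def availability(timestamps_h, stay_length_h, interval_h):
    """Share of resampling intervals that contain at least one measurement.

    timestamps_h: measurement times in hours since admission (assumed format).
    """
    n_intervals = math.ceil(stay_length_h / interval_h)
    # Index of the resampling interval each measurement falls into.
    filled = {int(t // interval_h) for t in timestamps_h if 0 <= t < stay_length_h}
    return len(filled) / n_intervals


# Toy data for a 240 h (10 day) stay, invented for illustration only.
regular = [8 * i for i in range(30)]        # one measurement every 8 h
burst = [100 + 0.5 * i for i in range(10)]  # ten measurements within five hours

for name, ts in (("regular", regular), ("burst", burst)):
    for interval in (24, 12, 1):
        share = availability(ts, 240, interval)
        usable = share >= MIN_AVAILABILITY
        print(f"{name:8s} {interval:2d} h: {share:6.1%} usable={usable}")
```

In this toy example, the burst variable is measured once per 24 h on average (ten measurements in 240 h), yet it fills only one of the ten 24 h intervals. This is the same effect the text describes for PCWP.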

4.3. Handling of missing values

After resampling the data with a suitable resampling rate, the missing values problem remains: not every time interval has a mean value. Many variables showed a high number of missing values (internally coded with −9999), caused by faults or simply by rare or irregular measurements (see Section 4.2). How can we handle missing values?

One easy strategy is to delete all data vectors that contain a missing value. This is only sensible if little data has to be deleted in relation to the total amount of data [14]. Since we analyzed data vectors with up to twenty dimensions, almost all data vectors contain at least one missing value, so the deletion strategy is not applicable in our case.

Another possibility is to replace the missing data in the best possible manner using the present data of the other data vectors. This can be done by regression. The advantage is that all data vectors can be used, but there is an unwanted effect: since we allow documentation rates down to 80%, a replacement with the best possible values would cause a biased, overly good result. Of course, we do not want to lose analysis performance through the handling of missing values, but we also do not want to achieve a higher analysis performance, e.g. a higher classification performance, through artificially introduced data. Such a replacement strategy, like taking the mean of the existing values, can be dangerous because it suggests better results than those that can be achieved with real data without missing values. Thus, we want to use a replacement strategy that neither weakens nor improves analysis performance.

We analyze the distribution of every variable. Usually, three kinds of (one-peaked) distributions can be observed: almost normally distributed values with the peak in the middle (e.g. blood pressure, heart frequency), and distributions whose peak is clearly shifted to the right (e.g. O2 saturation) or to the left (e.g. liver values GGT, GOT). Then, we calculate the median of a variable, especially for shifted distributions. In the case of an approximately normal distribution, the mean value can be used. A Q-Q plot can be used to check for normality. To obtain an easy insertion procedure that does not change the real distribution in a critical way, we insert randomly chosen data (random noise) from a suitable normal distribution N(m, s), with m as the median and s as the standard deviation, so that our algorithm cannot learn from the added noise data. Since we have shifted distributions, the normal distribution N(m, s) should have a clearly lower standard deviation than the original distribution. Visually, the filling procedure should clearly change only three of eleven equally sized histogram bars, since we do not want to fill in artificial outliers. At most five histogram bars should be changed in total. Otherwise, the random value should be rejected and a new one generated. With this strategy, we obtained reliable classification results [5]. If we inserted a fixed normal value, a neural network would adapt to these artificially inserted normal values. Thus, it is better to insert randomized, different values.

Some other well-known strategies for missing value replacement in one dimension (e.g. means, regression, nearest neighbor, random values) are compared in [15]. Since our data samples have missing values in more than one dimension, those results cannot be transferred directly to our situation. Multiple imputation strategies [14,16] can be applied if more statistical information about the imputation error is needed.
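The noise-based filling described in this section might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The shrink factor of 0.5 for the standard deviation is an arbitrary choice. The histogram rule is interpreted here as "accept a random value only if it falls into one of the five bars (of eleven equal-width bars over the observed range) centered on the median's bar". The names `fill_missing` and `bar_index` are hypothetical.

```python
import random
import statistics

MISSING = -9999  # internal missing-value code mentioned in the text
N_BARS = 11      # number of equally sized histogram bars mentioned in the text


def bar_index(x, lo, hi, n_bars=N_BARS):
    """Index of the equal-width histogram bar over [lo, hi] that contains x."""
    if hi == lo:
        return 0
    i = int((x - lo) / (hi - lo) * n_bars)
    return min(max(i, 0), n_bars - 1)


def fill_missing(values, shrink=0.5, max_bar_distance=2, rng=None):
    """Replace MISSING entries by random draws from N(median, shrink * stdev).

    A draw is rejected and regenerated if it leaves the observed range or
    lands more than max_bar_distance bars away from the median's bar
    (an assumed reading of the histogram rule in the text).
    """
    rng = rng or random.Random()
    observed = [v for v in values if v != MISSING]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    m = statistics.median(observed)
    s = shrink * statistics.stdev(observed)  # clearly lower than the original spread
    lo, hi = min(observed), max(observed)
    center = bar_index(m, lo, hi)

    filled = []
    for v in values:
        if v != MISSING:
            filled.append(v)
            continue
        while True:
            draw = rng.gauss(m, s)
            in_range = lo <= draw <= hi
            if in_range and abs(bar_index(draw, lo, hi) - center) <= max_bar_distance:
                break  # accepted: no artificial outlier is introduced
        filled.append(draw)
    return filled


# Toy values invented for illustration only.
sample = [80, 85, 90, MISSING, 95, 100, MISSING, 110, 120]
print(fill_missing(sample, rng=random.Random(0)))
```

As a side note on the deletion strategy: if each of twenty variables were independently available in 80% of the intervals (an idealized assumption), only about 0.8^20 ≈ 1.2% of the twenty-dimensional data vectors would be complete. This is consistent with the observation that almost all vectors contain at least one missing value.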

5. Conclusion and outlook

We have described a data mining approach in the medical area with abdominal septic shock patient data. We discussed the medical data mining cycle and its advantage over the statistical approach. Then, we showed how to build a relational database for medical data with many different data types, such as symbolic or metric data and temporal or non-temporal data. In the future, more standardized bedside hospital information systems would be helpful to avoid the electronic documentation of handwritten patient records.

The fundamental relevance of preprocessing in computer science driven data analysis is attracting more and more interest [17]. We discussed the preprocessing in greater detail in Section 4. We showed how important it is to carry out preprocessing steps such as resampling and missing value handling with care to obtain valid results. In conclusion, it is almost impossible to get 100% clean data from an enormous amount of different patient records. The same result is presented in [18], where an overview of causes for errors is given. Nevertheless, we showed how to preprocess the data to obtain reliable results. It is no wonder that preprocessing may take a major share of the time of the whole medical data mining process. If too many errors occur for some variables (too many missing values or too many values above or below limit values), then the only possibility is to exclude these variables from the analysis.

To reduce the time for preprocessing, more standardized preprocessing tools should be developed. Although it is desirable that such tools are understandable for non-statisticians, we recommend that data analysis should not be performed without a fundamental statistical education. Usually, a user with little or no statistical background is interested in producing good results in a short time (e.g. by deleting problematic data sets), while a statistically educated person is interested in obtaining valid results. Thus, new tools should mainly support the generation of valid results.

With the database, we have already made many different analyses [9,19,20]. The main work was building an alarm system prototype; the more methodological aspects of this system are presented in [5]. With the database, more questions about the dynamics of variables in abdominal septic shock will be analyzed. Finally, we emphasize that demanding a prospective analysis would be no solution in our case. First, it would be too time consuming for a physician to document all variables regularly for a prospective analysis. Second, since no hypotheses are at hand about which values of which variable combinations cause death in abdominal septic shock, all variables need to be considered for analysis. Our database is available for download [21].

Acknowledgements The work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project MEDAN (http://www.medan.de, Ref. no. HA 1456/7-2). The authors thank the involved medical students for supporting our work.

References

[1] R.C. Bone, R.A. Balk, R.A. Cerra, R.P. Dellinger, A.M. Fein, W.A. Knaus, et al., The ACCP/SCCM consensus conference. Definition for sepsis and organ failure and guidelines for the use of the innovative therapies, Chest 101 (1992) 1644–1655.
[2] A.M. Fein, E.M. Abraham, R.A. Balk, D.R. Dantzker, R.C. Bone (Eds.), Sepsis and Multiorgan Failure, Williams & Wilkins, Baltimore, 1997.
[3] R.M. Hardaway, A review of septic shock, Am. Surg. 66 (1) (2000) 22–29.
[4] S. Wade, M. Büssow, E. Hanisch, Epidemiology of SIRS, sepsis and septic shock in surgical intensive care patients, Chirurg 69 (1998) 648–655.
[5] J. Paetz, B. Arlt, A neuro-fuzzy based alarm system for septic shock patients with a comparison to medical scores, in: Proceedings of the Third International Symposium on Medical Data Analysis (ISMDA), LNCS vol. 2526, Rome, Italy, Springer, Berlin, 2002, pp. 42–52.
[6] T. Villmann, H. Wieland, M. Geyer, Data mining and knowledge discovery in medical applications using self-organizing maps, in: Proceedings of the International Symposium on Medical Data Analysis (ISMDA), LNCS vol. 1933, Frankfurt am Main, Germany, Springer, Berlin, 2000, pp. 138–151.
[7] N. Lavrac, Machine learning for data mining in medicine, in: Proceedings of the Seventh Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making (AIMDM), Aalborg, Denmark, 1999, pp. 57–60.
[8] C. Ordonez, E. Omiecinski, L. de Braal, C.A. Santana, N. Ezquerra, et al., Mining constrained association rules to predict heart disease, in: Proceedings of the First International Conference on Data Mining (ICDM), IEEE Computer Society Press, San Jose, USA, 2001, pp. 433–440.
[9] J. Paetz, F. Hamker, S. Thöne, About the analysis of septic shock patient data, in: Proceedings of the First International Symposium on Medical Data Analysis (ISMDA), LNCS vol. 1933, Frankfurt am Main, Germany, Springer, Berlin, 2000, pp. 130–137.
[10] S. Tsumoto, Clinical knowledge discovery in hospital information systems: two case studies, in: Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Lyon, France, Springer, Berlin, 2000, pp. 652–656.
[11] J.-L. Vincent, R. Moreno, J. Takala, S. Willats, A. De Mendonca, H. Bruining, C.K. Reinhart, P.M. Suter, L.G. Thijs, The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure, Intensive Care Med. 22 (1996) 707–710.
[12] The MEDAN Project Homepage, http://www.medan.de, accessed 4 September 2003.
[13] M. Suistoma, A. Kari, E. Ruokonen, J. Takala, Sampling rate causes bias in APACHE II and SAPS II scores, Intensive Care Med. 26 (2000) 1773–1778.
[14] J.L. Schafer, M.K. Olsen, Multiple imputation for multivariate missing-data problems: a data analyst's perspective, Multivariate Behav. Res. 33 (1998) 545–571.
[15] E. Pesonen, M. Eskelinen, M. Juhola, Treatment of missing data values in a neural network based decision support system for acute abdominal pain, Artif. Intell. Med. 13 (1998) 139–146.
[16] D.B. Rubin, Multiple Imputation for Nonresponse in Surveys, Wiley, New York, 1987.
[17] First International Workshop on Data Cleaning and Preprocessing, held at the Second International Conference on Data Mining, Maebashi City, Japan, 2002.
[18] D.G.T. Arts, N.F. de Keizer, G.-J. Scheffer, Defining and improving data quality in medical registries: a literature review, case study, and generic framework, J. Am. Med. Inform. Assoc. 9 (2002) 600–611.
[19] J. Paetz, Metric rule generation with septic shock patient data, in: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM), IEEE Computer Society Press, San Jose, USA, 2001, pp. 637–638.
[20] J. Paetz, Intersection based generalization rules for the analysis of symbolic septic shock patient data, in: Proceedings of the Second IEEE International Conference on Data Mining (ICDM), Maebashi City, Japan, 2002, pp. 673–676.
[21] The MEDAN Database, http://www.medan.de/db/medandb.zip, accessed 4 September 2003.
