Real-time prognosis of ICU physiological data streams
Daby Sow, Jimeng Sun, Jianying Hu, Shahram Ebadollahi, Alain Biem
IBM T.J. Watson Research Center, New York
{sowdaby, jimeng, jyhu, ebad, biem}@us.ibm.com
Abstract: This paper presents a system capable of predicting, in real time, the evolution of Intensive Care Unit (ICU) physiological patient data streams. It leverages a state-of-the-art stream computing platform to host the analytics that produce this prognosis. The core of the prediction technique applies Fading Memory Polynomial filters [10] in the frequency domain to predict windows of ICU data streams. We report on the performance of this approach when applied to traces of more than 1500 ICU patients obtained from the MIMIC-II database [1].
I. INTRODUCTION

Intensive care units are complex environments where patients are instrumented with many devices generating large volumes of physiological data. These data come from several streams of diverse types and rates, ranging from highly sampled streams (i.e., sampled at several kHz) like electrocardiograms, electroencephalograms and respiration signals, to moderately sampled event streams (i.e., sampled at a few Hz) like blood pressure, pulse oximetry, respiration rates and heart rates. The wide use of such monitoring devices in ICUs aims to make physicians more aware of the state of their patients at any given point in time and to help them provide better care. Such monitoring devices also introduce a new set of challenges for physicians, who now have to cope with large amounts of streaming physiological data. Prior research has shown that these streams have very valuable information buried in them (see, e.g., [9]). Yet in most medical institutions, the vast majority of the data collected by these monitoring systems is stored locally on the monitoring devices for only several hours, then dropped and lost forever.

Several research efforts have addressed this problem [6], [3], [4], [7]. In particular, at IBM, we have developed a framework [4] specifically tailored for ICU environments that leverages a state-of-the-art stream computing platform to apply real-time analytics on physiological data streams. Typical analytical applications of this framework [7] make heavy use of classification algorithms, where features are extracted from the raw signals before being presented to classification rules either specified by a domain expert or learned from historical data. The typical goal of such rules is to detect early onsets of complications. In [5], we turn our attention away from classification issues and address prognosis questions.
More specifically, we propose simple techniques to predict the evolution of the physiological data streams of a given patient based on models derived from data obtained from similar patients. We report in [5] on the use of such similarity metrics to make prognoses from persisted patient records. This paper focuses on real-time prognosis of a patient's physiological data streams. We propose approaches rooted in stream computing concepts for real-time prognosis as data collected from monitoring devices are streamed towards our system. We apply time series forecasting algorithms to evolve predictors capable of accurately estimating physiological streams. We report on the efficacy of the approach with a series of experiments performed on more than 1500 patients obtained from the MIMIC-II database [1].

II. SYSTEM OVERVIEW

Figure 1 presents an overview of the real-time prognosis system that we have designed. At the heart of this prognosis system is a state-of-the-art stream processing middleware called InfoSphere Streams [2], [8] (Streams). Streams is a programmable software platform facilitating the development of stream computing applications. It is a highly scalable platform capable of ingesting massive amounts of streaming data and analyzing these data in real time. Streams is designed to support both unstructured and structured data streams. Our prognosis system is essentially a Streams application, equipped with mechanisms capable of reading in physiological data streams from the outside world, processing these data with a library of signal processing and time series forecasting operators, and exporting the results to the outside world.

Programming Streams to perform these tasks is greatly facilitated by a declarative programming language called SPADE [8]. SPADE shields developers from the complexity of the internal Streams APIs. SPADE applications are modeled by directed graphs. In our system, we instantiate such graphs on a per-patient basis, as illustrated in Figure 1. Nodes of these graphs are SPADE operators. They form the building blocks of SPADE applications and encapsulate atomic stream processing logic.
Each operator has a set of input ports, where streaming data is received, and a set of output ports, where streaming data is produced. Directed arcs between these nodes represent streams flowing between operators. For example, in Figure 1 each patient graph contains a Fast Fourier Transform (FFT) operator that computes an FFT expansion of the vectors of real numbers read on its input port and publishes the results on its output port. Using a publish/subscribe paradigm, downstream operators can subscribe to the results of this FFT operator.
As shown in Figure 1, our prognosis application can be decomposed into three analytical parts. The first part consists of base SPADE operators used to pre-process incoming streams. The second part performs time domain prediction, while the last part performs frequency domain prediction. More details on each of the three analytical parts are presented in the next section. The remaining operators shown outside of these three parts are not part of the proposed analysis technique. They are used to interface with the external patient monitors (i.e., the source operators) and to externalize the predictions made by the system (i.e., the sink operators).

III. ANALYTICS FOR ONLINE PROGNOSIS

We represent physiological streams as $x_k^m[n]$, where $0 \le k < K$ indexes the patients, $K$ being the maximum number of patients supported by the system, and $0 \le m < M$ indexes the different types of physiological streams processed by the system. Typical examples are blood pressure, SpO2, respiration rate and heart rate. In this work, we deal exclusively with regularly sampled physiological streams, and throughout this paper $n$ is an integer representing discrete time. Consecutive samples $x_k^m[i], x_k^m[i+1], x_k^m[i+2], \ldots, x_k^m[j]$ are denoted $x_k^m[i \to j]$.

For each stream $x_k^m$, at a given time $n$, our goal is to design techniques capable of producing an estimate of future samples of $x_k^m$, denoted $\hat{x}_k^m[n, \tau]$, where $n$ is the time at which the prediction is made and $\tau$ is a forecasting parameter: the amount of time in the future that we are predicting. In other words, $\hat{x}_k^m[n, \tau]$ is a prediction for $x_k^m[n + \tau]$ made at time $n$. Consecutive predictions made at time $n$, $\hat{x}_k^m[n, i], \hat{x}_k^m[n, i+1], \hat{x}_k^m[n, i+2], \ldots, \hat{x}_k^m[n, j]$, are denoted $\hat{x}_k^m[n, i \to j]$.

The aim of this work is to design computationally efficient prediction schemes computing $\hat{x}_k^m$. To enable the execution of the method in real time, we attempted to keep the approach as simple as possible.
We are also constrained to use single-pass techniques, restricting the amount of state information that needs to be managed within the SPADE graphs. The rest of this section describes our approach in greater detail.

A. DATA PRE-PROCESSING

As streams enter the system, they first undergo data pre-processing steps whose goal is to prepare the data for effective prediction. More specifically, we start by removing obvious outliers from the data by applying simple thresholding techniques that filter out samples outside the normal range for a given signal type. For example, we know that the operational range of an SpO2 signal is between 0 and 100; hence any samples reported by the monitoring device that are outside of this range are excluded. Furthermore, simple smoothing operations are applied to the data to detect and correct statistical outliers. We found by experience that computing the median on a short segment of data $x_k^m[n - \delta \to n + \delta]$, with $\delta > 0$ small and typically set to 2, is a good and simple way to detect and correct such outliers. As a result, we leveraged the existence
of such an operator in the Streams time series toolkit to clean our signals.

B. PREDICTION WITH FADING-MEMORY POLYNOMIAL FILTERS

The basic prediction technique used in this work leverages Fading-Memory Polynomial (FMP) filters as described in [10]. This technique is stateful and can be expressed by the following recursive equations. Let $\varepsilon_n$ be an estimate of the prediction error measured at time $n$:

$$\varepsilon_n = x_k^m[n] - \hat{x}_k^m[n-1, 1]$$

Let $\hat{x}_k^{m(1)}[n, 1]$ denote the first-order derivative of $\hat{x}_k^m[n, 1]$ and let $\theta$ be a parameter that effectively defines the time constant of the FMP filter. $\hat{x}_k^m$ is estimated as follows:

$$\begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \hat{x}_k^m[n, 1] \\ \hat{x}_k^{m(1)}[n, 1] \end{pmatrix} = \begin{pmatrix} \hat{x}_k^m[n-1, 1] \\ \hat{x}_k^{m(1)}[n-1, 1] \end{pmatrix} + \varepsilon_n \begin{pmatrix} 1 - \theta^2 \\ (1 - \theta)^2 \end{pmatrix}$$

Essentially, this technique tracks the evolution of the estimation error, together with the rate of change of the raw signal, to estimate future samples. We adopt it in this work because its low computational overhead makes it well suited to streaming applications where estimates must be computed on the fly, in real time.

C. PREDICTING PHYSIOLOGICAL DATA STREAMS

We leverage the FMP prediction scheme outlined above to estimate future values of physiological time series data.

1) The Time Domain Approach: As its name indicates, this approach is a straightforward application of the FMP scheme in the time domain. We directly apply FMP filters to the signals $x_k^m[n]$ to generate predictions for the next sample, $\hat{x}_k^m[n, 1]$. Let $\Psi(\cdot)$ denote the application of an FMP filter to a time series $x_k^m$, so that $\hat{x}_k^m[n, 1] = \Psi(x_k^m[n])$. To predict further into the future, we compose applications of $\Psi$, resulting in the following estimates for $x_k^m$:

$$\hat{x}_k^m[n, \tau] = \underbrace{\Psi \circ \Psi \circ \cdots \circ \Psi}_{\tau \text{ times}} \left( x_k^m[n] \right)$$

where $\Psi$ is composed with itself $\tau$ times. Clearly, the maximal value of $\tau$ that provides satisfactory predictive accuracy depends on the error introduced at each application of $\Psi$ to the input signals. We expect this approach to give satisfactory results for short-term predictions and to become unstable as we look further into the future (i.e., as $\tau$ increases).

2) The Frequency Domain Approach: In order to produce longer-term predictions, we transform the problem into a frequency domain prediction. Leveraging basic SPADE operators, we aggregate incoming stream elements $x_k^m[n]$ into windows of elements $x_k^m[n \to n + W]$, where $W$ is the window size. Before performing a Fourier expansion of these windows of samples, we subtract the average value of $x_k^m$ in the previous window. In other words, we compute:

$$y_k^m[n \to n + W] = x_k^m[n \to n + W] - \frac{1}{W} \sum_{u=1}^{W} x_k^m[n - u]$$
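The FMP recursion of Section III-B and the composition used by the time domain approach can be sketched in a few lines of Python. This is a minimal illustration, not the SPADE implementation used in the system: the names `FMPFilter` and `cascade_predict`, the default `theta=0.5`, and the choice to initialise the prediction with the first sample are all assumptions made here. The composition of $\Psi$ is read as a cascade in which each filter consumes the output stream of the previous one; the paper does not spell out this detail, so other readings are possible.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FMPFilter:
    """First-degree fading-memory polynomial filter (one-step predictor).

    Hypothetical helper written for illustration; not a Streams operator.
    """
    theta: float           # fading parameter, 0 < theta < 1
    level: float = 0.0     # current one-step-ahead prediction
    slope: float = 0.0     # predicted change per sample (first derivative)
    primed: bool = False   # False until the first sample has been seen

    def update(self, sample: float) -> float:
        """Consume one sample and return the prediction for the next one."""
        if not self.primed:
            # Assumption: start from the first sample with zero slope.
            self.level, self.primed = sample, True
            return self.level
        err = sample - self.level                      # epsilon_n
        self.slope += (1.0 - self.theta) ** 2 * err    # derivative update
        self.level += self.slope + (1.0 - self.theta ** 2) * err
        return self.level


def cascade_predict(samples: Iterable[float], tau: int,
                    theta: float = 0.5) -> List[float]:
    """Cascade tau FMP filters; the j-th output estimates x[n + j]."""
    filters = [FMPFilter(theta) for _ in range(tau)]
    predictions = []
    for sample in samples:
        value = sample
        for fmp in filters:          # each stage predicts one more step ahead
            value = fmp.update(value)
        predictions.append(value)
    return predictions
```

On a noise-free linear ramp, the output of `cascade_predict(ramp, tau)` converges to the exact value of the ramp `tau` samples ahead, since a first-degree filter tracks straight lines without steady-state error. On noisy data each stage adds its own estimation error, which is the amplification effect the time domain approach is expected to suffer from as `tau` grows.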
Fig. 1. System Overview
The removal of the average value of the previous window $x_k^m[n - W \to n - 1]$ is performed to control the scale of the signal that we predict. $y_k^m[n \to n + W]$ is then fed to a Fourier expansion operator producing $W$ coefficients corresponding to the spectrum of $y_k^m$. Let $Y_k^m[w \to w + W]$ denote this Fourier expansion ($w$ indexing spectral components). We then use $W$ FMP filters in parallel to estimate each sample of $Y_k^m[w \to w + W]$ independently. To compute $\hat{x}_k^m$, we simply apply an inverse Fourier transform to the $W$ predicted Fourier coefficients to obtain the estimate $\hat{y}_k^m[n, W \to 2W]$. Finally, we reconstruct $\hat{x}_k^m[n, W \to 2W]$ by adding $\hat{y}_k^m[n, W \to 2W]$ to the average of $x_k^m$ over the current window being monitored:

$$\hat{x}_k^m[n, W \to 2W] = \hat{y}_k^m[n, W \to 2W] + \frac{1}{W} \sum_{u=0}^{W-1} x_k^m[n + u]$$
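The frequency domain steps above (mean removal, Fourier expansion, one FMP filter per coefficient, inverse transform, mean restoration) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the Streams operators used in the system: the class name `SpectralFMP` and the default `theta=0.5` are hypothetical, the first window is centred on its own mean because no previous window exists yet, and the filter state is initialised with the first observed spectrum.

```python
import numpy as np


class SpectralFMP:
    """Predict the next window by running one first-degree FMP filter
    per FFT coefficient (hypothetical helper, for illustration only)."""

    def __init__(self, window: int, theta: float = 0.5):
        self.theta = theta
        self.level = np.zeros(window, dtype=complex)  # predicted spectrum
        self.slope = np.zeros(window, dtype=complex)  # its rate of change
        self.primed = False
        self.prev_mean = None                         # mean of previous window

    def update(self, win) -> np.ndarray:
        """Consume the current window, return the predicted next window."""
        win = np.asarray(win, dtype=float)
        # Subtract the previous window's mean (own mean for the first window).
        baseline = win.mean() if self.prev_mean is None else self.prev_mean
        spectrum = np.fft.fft(win - baseline)
        if not self.primed:
            # Assumption: start from the first spectrum with zero slope.
            self.level, self.primed = spectrum.copy(), True
        else:
            err = spectrum - self.level               # per-coefficient error
            self.slope += (1.0 - self.theta) ** 2 * err
            self.level += self.slope + (1.0 - self.theta ** 2) * err
        self.prev_mean = win.mean()
        # Back to the time domain, then restore the current window's mean.
        y_hat = np.fft.ifft(self.level).real
        return y_hat + win.mean()
```

Because the same scalar recursion is applied elementwise to the complex coefficient vector, the `W` filters run independently of each other, and the cost per window is dominated by the two FFTs.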
$\hat{x}_k^m[n, W \to 2W]$ is an estimate for $x_k^m[n + W \to n + 2W]$.

IV. EXPERIMENTAL RESULTS

We have implemented the method outlined above on Streams. We have used physiological data streams from 1527 patients obtained from the MIMIC-II database [1]. Data for each patient consist of several streams that include heart rate, mean arterial blood pressure, pulse oximetry data, and respiration rate. The traces for each patient are variable in length and typically span several days, corresponding to the length of stay of the patient in the ICU. The downloaded data has been fast-forwarded into our prognosis system for evaluation. Due to space constraints, we report results on the prediction of mean arterial blood pressure streams. Similar results were obtained on the other stream types.

The metrics that we have used to measure the effectiveness of the prediction are:

• The absolute error rate: $\varepsilon_{abs,k}(\tau) = \frac{1}{N_k} \sum_{u=0}^{N_k - 1} \left| x_k^m[u] - \hat{x}_k^m[u-1, \tau] \right|$, where $N_k$ is the total number of samples in the stream $x_k^m$.
• The average absolute error rate across patients: $\bar{\varepsilon}_{abs}(\tau) = \frac{1}{K} \sum_{k=0}^{K-1} \varepsilon_{abs,k}(\tau)$
• The window absolute error rate: $\varepsilon_{abs,k}^{W}(n) = \frac{1}{W} \sum_{u=0}^{W-1} \left| x_k^m[n+u] - \hat{x}_k^m[n+u, W] \right|$
• The window average absolute error rate across patients: $\bar{\varepsilon}_{abs}^{W}(n) = \frac{1}{K} \sum_{k=0}^{K-1} \varepsilon_{abs,k}^{W}(n)$

Fig. 2. Average absolute error rate as a function of the forecasting parameter ($\bar{\varepsilon}_{abs}(\tau)$ vs $\tau$).
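All four metrics defined above reduce to a mean absolute error over one stream or one window, followed by a mean over patients. A minimal sketch is given below; the function names are hypothetical, and it assumes that each prediction has already been aligned with the sample it targets, so the index bookkeeping of the formulas is left to the caller.

```python
from typing import Sequence


def absolute_error_rate(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Mean absolute error between samples and their aligned predictions.

    Applied to a whole stream this gives the per-patient absolute error
    rate; applied to one window of W samples it gives the window rate.
    """
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_across_patients(per_patient_errors: Sequence[float]) -> float:
    """Average a per-patient error rate over the K patients."""
    if len(per_patient_errors) == 0:
        raise ValueError("at least one patient is required")
    return sum(per_patient_errors) / len(per_patient_errors)
```

For example, `absolute_error_rate([80, 82, 85], [81, 82, 83])` averages the absolute errors 1, 0 and 2 and returns 1.0.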
For these experiments the FFT window size $W$ was set to 32 minutes, and the prediction task consists of predicting the next 32 minutes of mean arterial blood pressure given the last 32 minutes of observed data. We slid these windows by 8 minutes as data was received by the SPADE applications.

Figure 2 compares the predictive capabilities of the time domain approach and the frequency domain approach. The X-axis of these plots represents different values of the forecasting parameter $\tau$ in minutes, i.e., the number of minutes ahead that we are predicting. The Y-axis of these plots shows $\bar{\varepsilon}_{abs}(\tau)$. In general, the frequency domain approach outperforms the time domain approach. As expected, the time domain curve grows exponentially with $\tau$. Indeed, cascading FMP filters has an amplification effect on the prediction errors. Nevertheless, the cascade of FMP filters performs quite well in the time domain for small values of $\tau$ ($\tau < 5$ minutes). The time domain approach could be used to efficiently recover small segments of missing values in these physiological data streams. The frequency domain curve shows very good performance, with only a slight increase in the error rate as $\tau$ grows. This near-constant error rate is attributed to the FFT expansion and the independent use of $W$ FMP filters on each of the $W$ FFT coefficients. These FFT coefficients each play an equal role in the reconstruction of each of the time domain samples, thus spreading their prediction error over all the time domain samples.

A global view of the performance of the frequency domain approach on our entire data set is presented in Figure 3. Figure 3 plots the distribution of $\varepsilon_{abs,k}^{W}(n)$ for all patients and all the time windows computed. Our data set contained a total of 717720 windows. Most of the mass of this distribution is concentrated around low values of $\varepsilon_{abs,k}^{W}(n)$. The few exceptions stretching the distribution towards higher error values are mainly due to discontinuities resulting from missing values in the raw signals obtained from the MIMIC-II database. We believe that adding more sophisticated interpolation schemes dealing with such missing values would further improve the performance of the system.

Fig. 3. Average error distribution

Figure 4 illustrates how fast the prediction converges to low error rates as more and more data are analyzed by the FMP filters, for the frequency domain approach. More specifically, the X-axis represents the time $n$ while the Y-axis represents the window average absolute error rate across patients, $\bar{\varepsilon}_{abs}^{W}(n)$. The rapid decline of $\bar{\varepsilon}_{abs}^{W}(n)$ shows that our frequency domain predictor does not require much data to reach reasonable levels of performance. Indeed, after an hour or so of monitoring, the system is capable of making predictions with reasonable absolute error rates.

Fig. 4. Evolution of the average prediction error across patients as a function of time

V. CONCLUSIONS AND FUTURE WORK

We have presented a real-time prognosis system implemented on a state-of-the-art stream computing platform to produce estimates for future values of physiological streams commonly monitored in modern ICUs. We have tested the approach on a large data set consisting of real ICU patient data obtained from PhysioNet [1]. Initial results indicate that this system is capable of making accurate predictions at least half an hour into the future. Future work will explore prediction schemes able to leverage cross-stream correlations and data from similar patients for real-time prognosis.

REFERENCES

[1] MIMIC II Database. http://physionet.org/physiobank/database/mimic2db/.
[2] L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani. SPC: A distributed, scalable platform for data mining. In Workshop on Data Mining Standards, Services and Platforms, DM-SSP, Philadelphia, PA, 2006.
[3] A. Bar-Or, D. Goddeau, J. Healey, L. Kontothanassis, B. Logan, A. Nelson, and J. Van Thong. Biostream: A system architecture for real-time processing of physiological signals. In IEEE Engineering in Medicine and Biology Conference, 2004.
[4] M. Blount, M. Ebling, M. Eklund, A. James, C. McGregor, N. Percival, K. Smith, and D. Sow. Analysis of physiological data streams in intensive care units. In IEEE Engineering in Medicine and Biology, March-April 2010.
[5] J. Sun, D. Sow, J. Hu, and S. Ebadollahi. A system for mining temporal physiological data streams for advanced prognostic decision support. Submitted to the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
[6] C.-M. Chen, H. Agrawal, M. Cochinwala, and D. Rosenblut. Stream query processing for healthcare bio-sensor applications. In 20th International Conference on Data Engineering, pages 791–794, 2004.
[7] D. Sow, B. Alain, M. Blount, M. Ebling, and O. Verscheure. Body sensor data processing using stream computing. In 11th ACM SIGMM International Conference on Multimedia Information Retrieval, 2010.
[8] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. In International Conference on Management of Data, ACM SIGMOD, Vancouver, Canada, 2008.
[9] M. P. Griffin and J. R. Moorman. Toward the early diagnosis of neonatal sepsis and sepsis-like illness using novel heart rate analysis. Pediatrics, 107(1):97–104, 2001.
[10] N. Morrison. Introduction to Sequential Smoothing and Prediction. McGraw-Hill, 1969.