1
Multiscale Denoising of Biological Data: A Comparative Analysis M. N. Nounou*, Senior Member, IEEE, H. N. Nounou, Senior Member, IEEE, N. Meskin, Member, IEEE, A. Datta, Fellow, IEEE, and E. R. Dougherty, Fellow, IEEE
I. A BSTRACT Measured microarray genomic and metabolic data are a rich source of information about the biological systems they represent. For example, timeseries biological data can be used to construct dynamic genetic regulatory network models, which can be used to design intervention strategies to cure or manage major diseases. Also, copy number data can be used to determine the locations and extent of aberrations in chromosome sequences. Unfortunately, measured biological data are usually contaminated with errors that mask the important features in the data. Therefore, these noisy measurements need to be filtered to enhance their usefulness in practice. Waveletbased multiscale filtering has been shown to be a powerful denoising tool. In this work, different batch as well as online multiscale filtering techniques are used to denoise biological data contaminated with white or colored noise. The performances of these techniques are demonstrated and compared to those of some conventional low pass filters using two case studies. The first case study uses simulated dynamic metabolic data, while the second case study uses real copy number data. Simulation results show that significant improvement can be achieved using multiscale filtering over conventional filtering techniques. Index Terms—Wavelets, Multiscale filtering, Metabolic data, Copy number data.
II. I NTRODUCTION Recent advances in DNA technologies have enabled the generation of microarray genomic and metabolic data from various biological systems. These data are a rich source of information about the biological systems they are collected from. For example, the availability of timeseries biological data has made it possible to estimate dynamic models that can not only be used to characterize the behavior of such biological systems and the interactions among different genes [1]–[6], but also to design intervention strategies for curing/managing major disease phenotypes [7], [8]. Also, the availability of copy number data has made it possible to determine the *M. N. Nounou is with the Chemical Engineering Program, Texas A&M University at Qatar, Doha, Qatar (email:
[email protected]). H. N. Nounou is with the Electrical and Computer Engineering Program, Texas A&M University at Qatar, Doha, Qatar (email:
[email protected]). N. Meskin is with the Electrical Engineering Department, Qatar University, Doha, Qatar (email:
[email protected]). A. Datta is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA (email:
[email protected]). E. R. Dougherty is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA, and also with the Translational Genomics Research Institute, Phoenix, AZ 85004 USA (email:
[email protected]).
locations and extent of aberrations (amplifications or deletions) in DNA sequences. Unfortunately, biological data are usually contaminated with measurement noise that can greatly degrade the usefulness of the data. For example, using noisy timeseries biological data to construct a dynamic genetic regulatory network will not only affect the accuracy of the estimated model, but also the effectiveness of any intervention technique based on it [4], [5]. Also, the noise contribution in copy number data can be very high (the signal to noise ratio can reach as low as unity), which makes it challenging to make good inference about the presence and extent of aberrations in the DNA [9]. This means that measurement errors in biological data need to filtered to enhance the usefulness of these data in practice. In general, filtering techniques can be classified into three main categories: filtering with a model, filtering with an empirical model, and filtering without a model. Modelbased filtering techniques minimize the error between the measured and filtered data while requiring the filtered data to satisfy the available model. Methods in this category include Kalman filtering [10], Moving Horizon Estimation, and particle filtering [11]. Of course, the quality of the filtered data depends on the accuracy of the models used. Also, in biological systems, the challenge is that models are not usually available a priori. In the absence of a fundamental model and in the case of multivariate filtering, an empirical model that is extracted from the relationship between the measured variables can also be used in data filtering. Methods in this category include Principal Component Analysis (PCA) [12]. Since accurate process models usually are not easily obtained, the most widely used filtering methods do not rely on fundamental or empirical models; instead, they rely on information about the nature of the errors or the smoothness of the underlying signal. Examples of modelfree filters include the wellknown low pass filters, such as Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters [13]. Examples of FIR and IIR filters include the Mean Filter (MF) and the Exponentially Weighted Moving Average (EWMA) filter, respectively. These are very popular filters as they are computationally efficient and can be implemented online. However, they are not very effective when used to filter data containing features with varying contributions over time and frequency. This is because low pass filters define a frequency threshold above which all features are considered noise. Therefore, they may eliminate important features having higher frequencies than the threshold value and retain noise components that are changing at lower frequencies than the threshold value. An example of a high
2
frequency feature is a sharp change in the data, while an example of low frequency noise is correlated (or colored) measurement errors. Thus, filtering practical biological data that may contain features that span wide ranges in time and frequency requires multiscale filtering algorithms that can account for this multiscale nature of the data. Wavelets and multiscale representation have been widely used in the analysis and investigation of various biological systems. For example, wavelets have been used to analyze genomic or DNA sequences to detect specific patterns [14]– [19]. Also, wavelets have been applied to various aspects of protein structural investigations, including primary sequence evolution [20], [21], secondary and tertiary structure determination [22]–[25], refinement of Xray crystallography [26], [27], and drug design and visualization [24], [28]. Moreover, wavelets have been used in the analysis of microarray data [29]–[31] and functional magnetic resonance imaging (fMRI) data [32]. Other applications of wavelets in biological systems include the analysis of 3D biological shapes using conformal mapping and spherical wavelets [33]. Finally, the authors in [34] developed a new thresholding filter to be used in multiscale denoising of biological data. In this work, various filtering techniques are compared when applied to filter biological data. Low pass as well as different multiscale filtering techniques are used. Examples of low pass filtering methods include mean filtering (MF) and EWMA filtering, which can be implemented online. Examples of multiscale filtering methods, on the other hand, include Standard Multiscale (SMS) filtering, Online Multiscale (OLMS) filtering, Translation Invariant (TI) filtering, and Boundary Corrected TI (BCTI) filtering, which will be described in Section IV. The remainder of this paper is organized as follows. In Section III, a brief description of some of the low pass filtering techniques is presented, followed by a description of mutltiscale representation of data and some of the multiscale filtering methods in Section IV. Then, in Section V, the various low pass and multiscale filtering methods are compared through two case studies, one using simulated timeseries metabolic data and the other using real copy number data. Finally, concluding remarks are presented in Section VI. III. L INEAR DATA F ILTERING Linear filtering techniques filter the data by computing a weighted sum of previous measurements in a window of finite or infinite length. These techniques include finite impulse response (FIR) and infinite impulse response (IIR) filtering, which are popular because they are computationally efficient and can be implemented online [13]. A linear filter can be written as follows, x ˆk = where,
∑
N −1 ∑
bi xk−i
(1)
i=0
bi = 1, and N is the FIR filter length. A well know
be expressed as follows, x ˆk = αxk + (1 − α) x ˆk−1
where, α is a smoothing parameter between 0 and 1, where a value of one corresponds to no filtering and a value of zero corresponds to only keeping the first measured data point. A more detailed discussion of different types of filters is presented in [35]. In linear filtering, the basis functions representing raw measured data have a temporal localization equal to the sampling interval. This means that linear filters are singlescale in nature since all the basis functions have the same fixed timefrequency localization. Consequently, these methods face a tradeoff between accurate representation of temporally localized changes and efficient removal of temporally global noise. Therefore, simultaneous noise removal and accurate feature representation of measured signals cannot be effectively achieved by singlescale filtering methods [36]. Enhanced denoising can be achieved using multiscale filtering as will be described next. IV. M ULTISCALE DATA F ILTERING In this section, several multiscale filtering method are described. However, since multiscale filtering relies on multiscale representation of data using wavelets and scaling functions, a brief introduction to multiscale representation will be presented first. A. Multiscale Representation of Data A signal can be represented at multiple resolutions by decomposing the signal on a family of wavelets and scaling functions. Consider the time series measurements of the concentration of a particular metabolite in a biological system which are shown in Figure 1(a) (see the description of biological system in section VA). The signals in Figures 1(b, d, and f) are at increasingly coarser scales compared to the original signal in Figure 1(a). These scaled signals are determined by projecting the original signal on a set of orthonormal scaling functions of the form, √ ϕij (t) = 2−j ϕ(2−j t − k) (3) or equivalently by filtering the signal using a low pass filter of length r, hf = [h1 , h2 , .., hr ], derived from the scaling functions. On the other hand, the signals in Figures 1(c, e, and g), which are called the detail signals, capture the details between any scaled signal and the scaled signal at the finer scale. These detail signals are determined by projecting the original signal on a set of wavelet basis functions of the form, √ (4) ψij (t) = 2−j ψ(2−j t − k) or equivalently by filtering the scaled signal at the finer scale using a high pass filter of length r, gf = [g1 , g2 , .., gr ] , derived from the wavelet basis functions. Therefore, the original signal can be represented as the sum of all detail signals at all scales and the scaled signal at the coarsest scale as follows,
i
FIR filter is the mean filter, where bi = N1 . An example of IIR filter, on the other hand, is the EWMA filter, which can
(2)
x(t) =
−J n2 ∑
k=1
−j
aJk ϕJk (t) +
J n2 ∑ ∑
j=1 k=1
djk ψjk (t)
(5)
3
1 (a) Original signal
2
3
4
1
2
3
4
1
2
3
4
1
2
3
g
h
(c) First detail signal
(b) First scaled signal
4
g
h
(e) Second detail signal
(d) Second scaled signal
Fig. 2. A schematic diagram of the translation mechanism used in TI filtering.
g h
(g) Third detail signal
(f) Third scaled signal
Fig. 1.
A schematic diagram of data representation at multiple scales.
where j, k, J, and n are the dilation parameter, translation parameter, maximum number of scales (or decomposition depth), and the length of the original signal, respectively [37]– [39]. B. Standard Multiscale (SMS) Filtering Multiscale filtering using wavelets is based on the observation that random errors in a signal are present over all wavelet coefficients while deterministic changes get captured in a small number of relatively large coefficients [40]–[44]. Thus, stationary Gaussian noise may be removed by a three step method [42]: 1) Transform the noisy signal into the timefrequency domain by decomposing the signal on a selected set of orthonormal wavelet basis functions. 2) Threshold the wavelet coefficients by suppressing coefficients smaller than a selected value. 3) Transform the thresholded coefficients back into the original domain. Donoho and coworkers have studied the statistical properties of wavelet thresholding and have shown that for a noisy signal of length n, the filtered signal will have an error within O(log n) of the error between the noisefree signal and the signal filtered with a priori knowledge about the smoothness of the underlying signal [41]. Selecting the proper value of the threshold is a critical step in this filtering process, and several methods have been devised. For good visual quality of the filtered signal, the Visushrink method determines the threshold as [45], √ (6) tj = σj 2 log n where, n is the signal length and σj is the standard deviation of the errors at scale j, which can be estimated from the wavelet coefficients at that scale using the following relation, σj =
1 median {djk } . 0.6745
(7)
Other methods for determining the value of the threshold are described in [46].
C. Transaltion Invariant (TI) Filtering In standard multiscale filtering, when the wavelet basis function used to represent a certain feature in the signal is not aligned with the feature itself, an artifact, which is not present in the original signal, can be created in the filtered signal. Translation invariant (TI) filtering is one approach that was proposed in [47] to deal with this problem by shifting the signal several times, filtering it, and then averaging out all translations to suppress these artifacts and improve the smoothness of the filtered data (see Figure 2). This approach, however, suffers from the creation of end effects because TI filtering considers the signal to be a cyclic signal, which results in artificial discontinuities when the two end points in the signal are very different in magnitude. This disadvantage will be overcome by the boundary corrected TI (BCTI) filtering method, which will be discussed in a later section. D. Online Multiscale (OLMS) Filtering The standard and TI multiscale filtering techniques described earlier are batch, i.e., they require the entire data (which has to be of dyadic length) a priori, and thus can not be implemented online. When filtering timeseries metabolic data, batch filtering would be acceptable if the filtered data are needed to estimate the parameters of a genetic regulatory network. However, if the filtered data are used within an intervention framework, then the metabolic data need to be filtered online as they are measured. To deal with this problem, an online multiscale (OLMS) filtering algorithm, that is based on multiscale filtering of data in a moving window of dyadic length, has been developed [36]. The OLMS algorithm is summarized below: 1) Decompose the measured data within a window of dyadic length using a causal boundary corrected wavelet filter. 2) Threshold the wavelet coefficients and reconstruct the filtered signal. 3) Retain only the last data point of the reconstructed signal for online use. 4) When new measured data are available, move the window in time to include the most recent measurement while maintaining the maximum dyadic window length (as illustrated in Figure 3). Note from Figure 3 that the moving window always seeks to keep the largest number of dyadic samples available, which would provide a more accurate estimation of the threshold value. Also, note that if a wavelet filter other than Haar is
4
1
2
i
ith filtered sample
x2
x3 +
2
3
2
3
x5 1

x1
4
x4
Fig. 3. A schematic diagram of the moving window mechanism used in OLMS and BCTI filtering.
used in OLMS filtering, a boundary corrected version of the filter is needed to avoid inaccuracies at the boundaries. This is particularly important because OLMS only retains the last filtered data point from each window. E. Boundary Corrected TI (BCTI) Filtering If all the filtered signals from all moving windows in OLMS (shown in Figure 3) are averaged, a much smoother filtered signal is obtained. This is in fact similar to TI filtering, but since it does not assume the signal to be a cyclic list, it overcomes the problem of boundary effects encountered in TI, and thus is called Boundary Corrected TI (BCTI). Another advantage is that BCTI averages less number of data points when computing the final estimated signals (compare the mechanisms of TI and BCTI in Figures 2 and 3, respectively), and thus BCTI is computationally less expensive than TI. The performances of all filtering techniques are illustrated and compared through the following case studies. V. C ASE S TUDIES In this section, the performances of the various low pass and multiscale filtering techniques are compared through two case studies, one using simulated time series metabolic data and the other using real copy number data.
Fig. 4.
Generic branched pathway with four dependent variables.
i.e., the noisy data are computed simply by adding the noise to the noisefree data. Different types of noise (white as well as colored) and different levels (different variances) are used to provide a good comparison between the performances of the different filtering techniques. For the white noise case, two levels of zeromean Gaussian noise, which correspond to signaltonoise ratios (SNR) of 10 and 20, are used. The SNR is the ratio of the variance of the noisefree data to the variance of the added noise. On the other hand, for the colored noise case, the noise is generated using the following model, nk = 0.5nk−1 + bek , where ek is an independent sample from a zero mean Gaussian distribution with a variance of one. Two values of b are used, which result in SN R values of 10 and 20. Various filtering techniques are used to filter these noisy data, which include mean filtering (MF), exponentially weighted moving average (EWMA) filtering, standard multiscale (SMS) filtering, translation invariant (TI) filtering, and boundary corrected TI (BCTI) filtering. All filtering techniques are optimized using cross validation [46], where the data are split into two sets, odd and even. Then, each set (odd and even) is filtered separately, and then the following cross validation mean square error is minimized to determine the optimum filter parameters: 1 ∑{ 2 (¯ xodd,k − xodd,k ) N/2 N/2
CV M SE
=
k=1
+ (¯ xeven,k − xeven,k )
A. Case Study 1: Filtering of Dynamic Genomic Data In this case study, the various filtering algorithms discussed in this paper are compared through their application using simulated timeseries biological data, which represent the concentrations of four metabolites that are related by the branched pathway shown in Figure 4. These concentrations are the four dependent variables x1 , ..., x4 in the following Ssystem [48], which also involves one independent variable x5 : 0.75 x˙ 2 = 0.08x0.5 1 − 0.03x2 0.2 x˙ 3 = 0.03x0.75 − 0.05x0.5 2 3 x4
(8)
0.8 x˙ 4 = 0.02x0.5 1 − 0.06x4
The data are generated by first discretizing the the Ssystem (8) using a sampling time of 3.9s, and then simulating the discretized model using the following initial conditions: x1 (0) = 5.6, x1 (0) = 3.1, x1 (0) = 2.9, and x1 (0) = 3.1, and assuming x5 = 1. The simulated data, which are assumed to be noisefree, are then contaminated with additive noise,
} (9)
where x ¯odd,k = 12 (ˆ xeven,k+1 + x ˆeven,k ), x ¯even,k = 1 (ˆ x + x ˆ ), and x ˆ and x ˆ are the k th odd odd,k+1 odd,k odd,k even,k 2 and even filtered data samples, respectively, and N is the total number of data samples. The performances of the various filtering techniques, on the other hand, are compared by computing the mean square error with respect to the noisefree N ∑ 2 data, i.e., M SE = N1 (ˆ xk − x ˜k ) , where x ˆk and x ˜k are the k=1
x˙ 1 = 0.2x−0.8 x15 − 0.1x0.5 1 3
2
k th filtered and noisefree data samples, respectively. To make statistically valid conclusions about the performances of all techniques, a Monte Carlo simulation using 1000 realizations is performed, and the results are summarized in Tables I and II and Figures 58. Table I and Figures 5 and 6 compare the performances of the different techniques for the case of white noise, while Table II and Figures 7 and 8 compare their performances for the case of colored noise. In all Figures, x1 is in blue, x2 is in dark green, x3 is in red, and x4 is in light green. Tables I and II show that the OLMS filtering method out
5
TABLE I C OMPARISON BETWEEN THE FILTERING MEAN SQUARE ERRORS (MSE) FOR THE VARIOUS FILTERING TECHNIQUES IN THE CASE OF WHITE NOISE
x4 0.0117 0.0117 0.0060 0.0058 0.0086 0.0023 0.0082 0.0074 0.0039 0.0039 0.0049 0.0015
MF EWMA OLMS SMS TI BCTI
0.0382 0.0333 0.0215 0.0163 0.0226 0.0138
SN R x2 0.0666 0.0629 0.0574 0.0360 0.0322 0.0180 SN R 0.0363 0.0332 0.0246 0.0173 0.0143 0.0142
4
2
0
x4 0.0181 0.0160 0.0154 0.0078 0.0160 0.0043 0.0104 0.0092 0.0060 0.0046 0.0063 0.0037
0
100
200
300
400
4
2
0
500
6
4
4
OLMS
6
2
0
= 10 x3 0.0199 0.0187 0.0264 0.0151 0.0086 0.0057 = 20 0.0108 0.0102 0.0091 0.0053 0.0045 0.0045
6 Mean Filtering
6
0
100
200
300
400
500
0
100
200
300
400
500
2
0
100
200
300
400
0
500
Fig. 5. Comparison between the performances of various online filtering methods for the case of white noise with SN R = 10.
6
6
4
2
0
0
100
200
300
400
TI Filtering
2
0
100
200
300
400
500
0
100
200
300
400
500
6
4
2
0
4
0
500
6
B. Case Study 2: Filtering of DNA Copy Number Data In this case study, the various filtering techniques are used to filter noisy DNA copy number data. Copy number aberrations (amplifications or deletions in the DNA) are characteristics of many genomic diseases such as cancer [9], [49]. Microarray Comparative Genomic Hybridization (CGH) is an experimental approach for scanning DNA copy number, which allows simultaneous measurements of DNA relative changes at thousands of DNA loci. In these experiments, a target (to be tested) DNA sequence is hybridized with another normal DNA sequence, and then digital images are captured to calculate
x1 0.0660 0.0601 0.0518 0.0250 0.0616 0.0157
SMS Filtering
performs the conventional low pass filters (MF and EWMA), which can all be implemented online. This advantage can be helpful if the filtered data are utilized within an intervention framework, where the measured concentrations need to be filtered online. This advantage of OLMS can be seen in Figures 5 and 7, for the cases of white noise and colored noise, respectively. If more time is available to filter the data, other batch multiscale filtering methods can perform even better. For example, the SMS technique provides a clear improvement over all online methods, including OLMS. This is demonstrated in Figures 6 and 8, for the cases of white noise and colored noise, respectively. Other batch multiscale filtering techniques can provide improved smoothness, such as in TI. However, due to the cyclespinning of the data, TI results in boundary effects, especially if the two ends of the data are significantly different in magnitude, such as in the case of x1 (in blue) and x4 (in light green) in Figures 6 and 8, for white and colored noise. This drawback of TI is accounted for in BCTI, which provides similar smoothness without the boundary effects, and thus results in smaller mean square errors as shown in Tables I and II. The advantages of these batch multiscale filtering techniques can be helpful if the filtered biological data are used to enhance the accuracy of biological models which are estimated from these data. It is also clear that the advantages of all multiscale techniques is clearer for colored noise than for white noise. This is because multiscale representation decorrelates correlated measurements at multiple scales, which provides an advantage for multiscale filtering methods over single scale ones [44].
Technique MF EWMA OLMS SMS TI BCTI
BCTI Filtering
0.0258 0.0251 0.0131 0.0121 0.0172 0.0043
= 10 x3 0.0119 0.0113 0.0098 0.0069 0.0050 0.0027 = 20 0.0069 0.0067 0.0062 0.0038 0.0029 0.0016
Noisy Data
MF EWMA OLMS SMS TI BCTI
SN R x2 0.0403 0.0381 0.0270 0.0223 0.0167 0.0087 SN R 0.0231 0.0228 0.0177 0.0130 0.0091 0.0053
EWMA
x1 0.0415 0.0416 0.0210 0.0186 0.0320 0.0072
FOR THE VARIOUS FILTERING TECHNIQUES IN THE CASE OF COLORED NOISE
Noisy Data
Technique MF EWMA OLMS SMS TI BCTI
TABLE II C OMPARISON BETWEEN THE FILTERING MEAN SQUARE ERRORS (MSE)
0
100
200
300
400
500
4
2
0
Fig. 6. Comparison between the performances of various batch multiscale filtering methods for the case of white noise with SN R = 10.
0
0
100
200
300
400
6
−0.5 −1
2
0
5
10
0 −0.5 −1
0
5
4
0
100
200
300
400
500
6
0 −0.5 −1
0
5
x 10 0 −0.5 −1
10
0
5
4
2
0 −0.5 −1
0
0
100
200
300
400
0
500
0
100
200
300
400
500
Fig. 7. Comparison between the performances of various online filtering methods for the case of colored noise with SN R = 10.
6 SMS Filtering
Noisy Data
6
4
2
0
0
100
200
300
400
0
100
200
300
400
500
0
100
200
300
400
500
6 BCTI Filtering
TI Filtering
2
0
500
6
4
2
0
4
0
100
200
300
400
500
4
2
0
Fig. 8. Comparison between the performances of various batch multiscale filtering methods for the case of colored noise with SN R = 10.
the ratios of fluorescence intensities at different locations in the DNA sequence. This ratio reflects the relative DNA copy number between the target and the normal sequences. Therefore, when the log2 of the ratio equals zero, the target sequence is the same as the normal sequence; however, when the log2 is greater or smaller than zero at particular locations, we expect amplification or deletion of the DNA, respectively, at these locations. Copy number data are usually very noisy, where the signal to noise ratio can reach as low as unity [9]. Therefore, they need to be denoised for effective interpretation of such data. In this case study, the various filtering techniques are used to filter real copy number data. These copy number data are physically Bacterial Artificial Chromosome (BAC) array data on 15 fibroblast cell lines [49]. Specifically, two chromosome data sets are used in this case study: chromosome 9 of MPE600 and chromosome 14 of GM01750. These data sets, which are of
x 10 BCTI Filtering
OLMS
OLMS
EWMA
2
4
0
5 position
10
10 4
x 10 4
10 4
x 10 0
500
0
TI Filtering
2
4
EWMA
4
SMS Filtering
6 Mean Filtering
Noisy Data
6
Mean Filtering
6
0 −0.5 −1
0
4
x 10
5 position
10 4
x 10
Fig. 9. Filtering Chromosome 9 log2 copy number data using various filtering techniques.
lengths 106 and 64, respectively, can be freely downloaded at www.nature.com/ng/journal/v29/n3/suppinfo/ng754 S1.html. The log2 ratio data for chromosomes 9 and 14 are filtered using the various filtering techniques and the results are shown in Figures 9 and 10. All filters are optimized using cross validation as described in the first case study (equation 9). Figures 9 and 10, which show the filtered signals versus the position along the chromosomes, clearly show the advantages of multiscale filtering over low pass filtering, which help interpret these log2 data for both chromosomes. For example, all batch multiscale filtering techniques (SMS, TI and BCTI) provide greater smoothness than low pass filters, and thus allow more accurate determination of the extent of amplification (as in the case of chromosome 14 at positions smaller than 1.5 × 104 ) or deletion (as in the case on chromosome 9 at positions smaller than 5 × 104 ). Figures 9 and 10 also show that BCTI and TI provide smoother filtering than SMS, which is due to the averaging used in these techniques, and that BCTI results in more accurate filtering than TI at the edges. The results also show that, in order to capture the changes in these data (from nonzero values to zero values or vise versa), which happen at high frequency, the low pass filtering techniques had to retain a lot more noise than multiscale filtering. This is due to the fact that low pass filters define a frequency threshold above which all features are removed, regardless of whether they are important features or just noise. So, in order to preserve the high frequency features (such as the abrupt changes in the data), low pass filters tend to keep more noise, which can be clearly seen in Figures 9 and 10. Finally, the results also show that multiscale filtering can outperform conventional filtering methods using relatively small data sets, such as the chromosome 14 data which consist of 64 data samples. VI. C ONCLUSION This paper presented a comparison between various filtering techniques when used to denoise biological data contaminated
0.4 0.2 0 −0.2
SMS Filtering
Mean Filtering
7
0
2
4
6
8
R EFERENCES 0.4 0.2 0 −0.2
0
2
4
6
4
x 10 TI Filtering
EWMA
0.4 0.2 0 −0.2
0
2
4
6
8
0.4 0.2 0 −0.2
0
2
4
6
4
OLMS
x 10 BCTI Filtering
0
2
4 6 position
8 4
x 10
8 4
x 10 0.4 0.2 0 −0.2
8 4
x 10
0.4 0.2 0 −0.2
0
2
4 6 position
8 4
x 10
Fig. 10. Filtering Chromosome 14 log2 copy number data using various filtering techniques.
with white or correlated noise. Various low pass as well as multiscale filtering techniques are compared. Examples of low pass filters include the Mean Filter (MF) and the Exponentially Weighted Moving Average (EWMA) filter, while examples of multiscale filters include the Online Multiscale (OLMS), the Standard Multiscale (SMS), the Translation Invariant (TI), and the Boundary Corrected TI (BCTI) filtering techniques. This comparison is conducted using two case studies, one using simulated timeseries biological data and the other using real copy number data. The results clearly show the advantages of the multiscale techniques over the conventional low pass filters using both simulated and real biological data. For the simulated dynamic data, the results show that if filtering has to be implemented online (such as within an intervention framework), OLMS outperforms the online MF and EWMA filters. However, if the filtered data are to be used to model the biological system, where more time is available, the batch multiscale filtering techniques (SMS, TI, and BCTI) can provide even better performances. TI and BCTI provide smoother filtered data than SMS because of the averaging they utilize, while BCTI gives more accurate filtering at the edges. The results of the copy number case study, on the other hand, show that the multiscale filtering techniques, especially TI and BCTI, provide greater smoothness of the data than low pass filter, which helps determine the extent of aberrations in the chromosome sequences more accurately. Also, multiscale techniques show better noiseremoval abilities than low pass filters. This is because, in order to preserve the abrupt changes in the data (which happen at high frequency), low pass filters tend also to keep more of the high frequency noise. VII. ACKNOWLEDGEMENT This work was made possible by NPRP grant NPRP081483051 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.
[1] H. Jong, “Modeling and simulation of genetic regulatory systems: A literature review,” J. Computational Biology, vol. 9, no. 1, pp. 67–103, 2002. [2] I.C. Chou, H. Martens, and E. O. Voit, “Parameter estimation in biochemical systems models with alternating regression,” Theoretical Biology and Medical Modelling, vol. 3, no. 25, 2006. [3] O. R. G. et al, “Parameter estimation using simulated annealing for ssystem models of biochemical networks,” Bioinformatics, vol. 23, no. 4, pp. 480–486, 2007. [4] Z. Kutalik, W. Tucker, and V. Moulton, “Ssystem parameter estimation for noisy metabolic profiles using newtonflow analysis,” IET System Biology, vol. 1, no. 3, pp. 174–180, 2007. [5] H. Wang, L. Qian, and E. Dougherty, “Inference of gene regulatory networks using Ssystems: a unified approach,” IET System Biology, vol. 4, no. 2, pp. 145–156, 2010. [6] N. Meskin, H. Nounou, M. Nounou, A. Dattta, and E. R. Dougherty, “Parameter estimation of biological phenomena modeled by ssystems: An extended kalman filter approach,” IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, pp. 4424– 4429, 2011. [7] A. ErvadiRadhakrishnan and E. O. Voit, “Controllabilty of nonlinear biochemical systems,” Mathematical Biosciences, vol. 196, pp. 99–123, 2005. [8] N. Meskin, H. Nounou, M. Nounou, A. Dattta, and E. R. Dougherty, “Intervention in biological phenomena modeled by ssystems,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 1, pp. 1260–1267, 2011. [9] A. Alqallaf and A. Tewfik, “DNA copy number detection and sigma filter,” IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS, pp. 1–4, 2007. [10] H. W. Sorenson, Kalman Filtering: Theory and Applications. IEEE Press, New York, 1985. [11] J. B. Rawlings and B. R. Bakshi, “Particle filtering and moving horizon estimation,” Computers and Chemical Engineering Journal, vol. 30, no. 1012, p. 15291541, 2006. [12] M. A. Kramer and R. S. H. Mah, Model Based Monitoring. Proc. Int. Conf. on Foundations of Comput. Aided Process Oper., D. Rip pen, J. Hale, and J. Davis, eds., CACHE, Austin, TX, 1994. [13] M. T. Tham and A. Parr, “Succeed at online validation and reconstruction of data,” Chem. Eng. Prog., vol. 90, no. 5, p. 46, 1994. [14] A. Arneodo, Y. d’Aubenton Carafa, B. Audit, E. Bacry, J. Muzy, and C. Thermes, “Wavelet based fractal analysis of dna sequences,” Physica D, vol. 1328, pp. 1–30, 1996. [15] A. Arneodo, Y. d’Aubenton Carafa, E. Bacry, P. Graves, J. Muzy, and C. Thermes, “What can we learn with wavelet about dna sequences,” Physica A, vol. 249, pp. 439–448, 1998. [16] G. Dodin, P. Vandergheynst, P. Levoir, C. Cordier, and L. Marcourt, “Fourier and wavelet transform analysis, a tool for visualizing regular patterns in dna sequences,” Journal of Theoretical Biology, vol. 206, pp. 323–326, 2000. [17] P. Li, “Wavelets in bioinformatics and computational biology: state of art and perspectives,” Bioinformatics, vol. 19, pp. 2–9, 2003. [18] N. N. O. S. Huang, H. and A. Vo, “Array cgh data modeling and smoothing in stationary wavelet packet transform domain,” BMC Genomics, vol. 9, S2S17, 2008. [19] N. Nguyen, “Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for arraycgh data smoothing,” Journal of Compuational Biology, vol. 17, no. 2, pp. 139–152, 2010. [20] P. Morozov, T. Sitnikova, G. Churchill, F. Ayala, and A. Rzhetsky, “A new method for characterizing replacement rate variation in molecular sequences,” Genetics, vol. 154, pp. 381–395, 2000. [21] A. R. amd P. Morozov, “Markov chain monte carlo computation of confidence intervals for substitutionrate variation in proteins,” Proceedings of Pacific Symposium of Biocomputing, pp. 203–214, 2001. [22] P. Lio and M. Vannucci, “Wavelet changepoint prediction of transmembrane proteins,” Bioinformatics, vol. 16, pp. 376–382, 2000. [23] K. Murray, D. Gorse, and J. Thornton, “Wavelet transforms for the characterization and detection of repeating motifs,” Journal of Molecular Biology, vol. 316, pp. 341–363, 2002. [24] A. Mandell, M. Owens, K. Selz, W. Morgan, M. Shesinger, and C. Nemeroff, “Mode matches in hydrophobic free energy eigenfunctions predicts peptideprotein interactions,” Biopolymers, vol. 46, pp. 89–101, 1998.
8
[25] H. Hirakawa, S. Muta, and S. Kuhara, “The hydrophobic cores of proteins predicted by wavelet analysis,” Bioinformatics, vol. 15, pp. 141– 148, 1999. [26] P. Main and J. Wilson, “Lowresolution phase extension using wavelet analysis,” Acta Crystallographica, Section D, Crystallography Journals, vol. 56, pp. 1324–1331, 2000. [27] J. Ferrer, M. Roth, and A. Antoniadis, “Data compression for diffraction patterns,” Acta Crystallographica, Section D, Crystallography Journals, vol. 54, pp. 184–199, 1998. [28] M. Carson, “Wavelets and molecular structure,” Journal of ComputerAided Molecular Design, vol. 10, pp. 273–283, 1996. [29] R. Klevecz, “Dynamic architecture of the yeast cell cycle uncovered by wavelet decomposition,” Functional and Integrative Genomics, vol. 1, pp. 186–192, 2000. [30] R. Jornsten and B. Yu, “Comprestimation: microarray images in abundance,” Proceedings of the Conference on Information Sciences and Systems, Princeton University, pp. 186–192, March 1517, 2000. [31] E. Myasnikova, A. Samsonova, K. Kozlov, M. Samsonova, and J. Reinitz, “Registration of the expression patterns of drosophila segmentation genes by two independent methods,” Bioinformatics, vol. 17, pp. 3–12, 2001. [32] I. Dinov, J. Boscardin, M. Mega, E. Sowell, and A. Toga., “A waveletbased statistical analysis of fmri data: I. motivation and data distribution modeling,” NeuroInformatics, vol. 3, no. 4, pp. 319–343, 2005. [33] D. Nain, S. Haker, A. Bobick, and A. Tannenbaum, “Multiscale 3d shape representation and segmentation using spherical wavelets,” IEEE Transactions on Medical Imaging, vol. 26, pp. 598–618, 2007. [34] V. Prasad, P. Siddaiah, and B. P. Rao, “A new wavelet based method for denoising of biological signals,” International Journal of Computer Science and Network Security, vol. 8, no. 1, pp. 238–244, 2008. [35] R. D. Strum and D. E. Kirk, First Principles of Discrete Systems and Digital Signal Processing. AddisonWesley, Reading, MA, 1989. [36] M. Nounou and B. Bakshi, “Online multiscale filtering of random and gross errors without process models,” AIChE Journal, vol. 45, no. 5, pp. 1041–1058, 1999. [37] G. Strang, “Wavelets and dilation equations,” SIAM Rev., vol. 31, pp. 614–627, 1989. [38] I. Daubechies, “Orthonormal bases for compactly supported wavelets,” Commun. on Pure and Applied Math., vol. 41, pp. 909–996, 1988. [39] S. Mallat, “A theory of multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, p. 764, 1989. [40] A. Cohen, I. Daubechies, and V. Pierre, “Wavelets on the interval and fast wavelet transforms,” Appl. Comput. Harmonic Analysis, vol. 1, no. 7, pp. 64–81, 1993. [41] D. Donoho and I. Johnstone, “Ideal denoising in an orthonormal basis chosen from a library of bases,” Technical Report, Department of Statistics, Stanford University, 1994. [42] D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard, “Wavelet shrinkage: Asymptotia?” J. Royal Stat. Soc. B, vol. 57, no. 2, pp. 301– 369, 1995. [43] B. Bakshi, “Multiscale analysis and modeling using wavelets,” Chemometrics, vol. 13, no. 34, pp. 415–434, 1999. [44] M. Nounou and B. Bakshi, “Multiscale methods for denoising and compression,” Wavelets in Analytical Chemistry, ed. B. Walczak, Elsevier, Amsterdam, pp. 119–150, 2000. [45] D. Donoho and I. Johnstone, “Ideal spatial adaptation via wavelet shrinkage,” Biometrika, vol. 81, pp. 425–455, 1994. [46] G. Nason, “Wavelet shrinkage using crossvalidation,” J. Royal Stat. Soc. B, vol. 58, p. 463, 1996. [47] R. Coifman and D. Donoho, “Translationinvariant denoising,” Lecture Notes in Statistics, SpringerVeralag, vol. 103, p. 125, 1995. [48] E. O. Voit and J. Almeida, “Decoupling dynamical systems for pathway identification from metabolic profiles,” Bioinformatics, vol. 20, no. 11, pp. 1670–1681, 2004. [49] A. Snijders, N. Nowak, R. Segraves, S. Blackwood, N. Brown, J. Conroy, G. Hamilton, A. Hindle, B. Huey, K. Kimura, S. Law, K. Myambo, J. Palmer, B. Ylstra, J. Yue, J. Gray, A. Jain, D. Pinkel, and D. Albertson, “Assembly of microarrays for genomewide measurement of DNA copy number,” Nature Gentics, vol. 29, pp. 263–264, 2001.