A temporal domain audio watermarking technique


IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 4, APRIL 2003

A Temporal Domain Audio Watermarking Technique

Aweke Negash Lemma, Javier Aprea, Werner Oomen, and Leon van de Kerkhof

Abstract—Audio watermarking techniques can be used to embed extra information into audio signals. The goal is to hide prespecified data carrying some information in the audio stream such that it is not audible to the human ear (i.e., transparent) and is, at the same time, resistant to removal attacks (i.e., robust). In the currently known watermarking systems, these challenges are not always adequately resolved. In this paper, we present an alternative audio watermarking technique that mitigates these and other related shortcomings. The system is referred to as modified audio signal keying (MASK). In MASK, the short-time envelope of the audio signal is modified in such a way that the change is imperceptible to the human listener. The MASK system can easily be tailored for a wide range of applications. Moreover, informal experimental results show that it has good robustness and audibility behavior.

Index Terms—Human auditory system, psycho-acoustic model, robustness, transparency, transparency-robustness plane.

I. INTRODUCTION

TODAY, multitudes of digital audio content are available to the public. Although digital signals can offer much better quality and more flexibility than analog ones, record companies are reluctant to use new digital media because they pose the danger of unrestricted (illegal) duplication and redistribution of “original quality” material. In this context, audio watermarking techniques can be used to embed copyright and copy control information into the so-called clear-text audio signals. Apart from this, watermarking can also be used for various other applications such as data authentication, broadcast monitoring, data indexing, and so forth.

In audio watermarking, the goal is to hide a prespecified data stream in the audio signal such that it meets some prespecified audibility requirements (e.g., transparency) and is, at the same time, robust enough to survive various audio processing operations (e.g., removal attacks). In addition to these, there are other considerations, including complexity, payload, and effects on audio compression systems.

In principle, all audio watermarking systems exploit the irrelevant properties present in audio signals and their representations to systematically hide extra information. To this end, several approaches are known.

As a first category, we find a set of schemes that embed watermarks by adding predefined very weak noise signals to audio such that the changes are inaudible. The more advanced ones employ noise-shaping techniques [10] to inaudibly embed stronger noise signals. The shortcoming of these sorts of watermarks is that they are fragile to most signal processing attacks, including audio compression tools.

Second, one might identify a group of watermarking techniques that exploit the insensitivity of the human ear to very short delay echoes. Although there are a few variations within this group [5], we collectively refer to them as echo hiding techniques. One outstanding shortcoming of these approaches is that the watermark embedding success is signal and/or delay dependent. Moreover, resolving this problem tends to make the system unacceptably complex.

As a third set, one may identify a group of watermarks that are embedded in the transform domains [1], [13]. In these approaches, the phase and/or amplitude of the transform domain coefficients are modified in a certain way to carry the desired watermark information. The known transforms are the FFT, DCT, and wavelet transforms. The main problem of these approaches is that they show unsatisfactory robustness in signals with very few transform domain components.

Finally, one may identify watermarking techniques that employ the spread spectrum concept of communication systems [9]. The main issue here is the computational complexity needed to lock on to the watermark signal, i.e., the synchronization overhead may be unacceptably high.

Although most of these approaches work well for a relatively wide range of signals, they do not always resolve the robustness–transparency challenge adequately [8], [7]. This and other practical considerations such as complexity have motivated us to look into other alternatives to audio watermarking. This led to the development of an audio watermarking technique referred to as modified audio signal keying (MASK). In MASK, a watermark is embedded by modifying the envelope of the audio with an appropriately conditioned and scaled version of a predefined random sequence carrying some information (a payload). On the detector side, the watermark symbols are extracted by estimating the short-time envelope energy. To this end, first, the incoming audio is subdivided into frames, and then, the energy of the envelope is estimated. The watermark is extracted from this energy function.

The MASK watermarking system can easily be tailored for a wide range of applications. Moreover, informal experimental results show that it has good robustness and audibility behavior. Other strong features of the MASK watermark are that it has a simple embedder and detector, it can be embedded in any audio, re-embedding of a watermark is simple, and its embedder can easily be controlled with a psychoacoustic model of the human auditory system.

A. Outline

In the following two sections, we give detailed descriptions of the MASK watermark embedding and detecting systems. We also derive mathematical models describing the MASK system and some functions quantifying its audibility quality and robustness. In Section IV, we present some experimental results showing the performance and the characteristics of MASK. Finally, in Section V, we give some final remarks and conclusions.

Manuscript received February 4, 2002; revised January 7, 2003. The associate editor coordinating the review of this paper and approving it for publication was Dr. Ahmed Tewfik.
The authors are with Philips Digital Systems Lab. (PDSL), 5600 JB Eindhoven, The Netherlands (e-mail: [email protected]; Javier.Aprea@Philips.com; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TSP.2003.809372

1053-587X/03$17.00 © 2003 IEEE

II. WATERMARK EMBEDDER

In MASK, the temporal envelope of the audio signal is modified according to a certain prespecified information signal referred to as the watermark. The block diagram in Fig. 1 shows the digital signal processing needed for embedding a multibit payload watermark $w[n]$ into a host signal $x[n]$. First, the filter $H$ extracts the part of the audio signal that is suitable to carry the watermark information. We denote the output of $H$ with $x_b[n]$. The watermarked audio signal $y[n]$ is then obtained by adding an appropriately scaled version of the product of $x_b[n]$ and $w[n]$ to the host signal

Fig. 1. Watermark embedder.

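$$y[n] = x[n] + \alpha\, w[n]\, x_b[n] \tag{1}$$

where $x[n]$ is the host signal, $x_b[n]$ is the output of the filter $H$, $w[n]$ is the watermark, $y[n]$ is the watermarked signal, and $\alpha$ is the embedding gain. We choose the watermark $w[n]$ in such a way that multiplying it with $x_b[n]$ predominantly modifies the short-time envelope of $x_b[n]$. Let $\bar{x}_b[n]$ be defined such that $x[n] = x_b[n] + \bar{x}_b[n]$, and let $y_b[n]$ be defined such that $y[n] = y_b[n] + \bar{x}_b[n]$; then, (1) can be written as $y[n] = (1 + \alpha w[n])\, x_b[n] + \bar{x}_b[n]$, and the envelope-modulated portion of the watermarked signal is given as

$$y_b[n] = (1 + \alpha\, w[n])\, x_b[n]. \tag{2}$$

When the filter $H$ is a bandpass filter,¹ $x_b[n]$ and $\bar{x}_b[n]$ are the in-band and out-of-band components of the host signal, respectively. For the MASK system to work properly, the host signal $x[n]$ and the filter output $x_b[n]$ must be in phase. This is achieved by appropriately compensating for the phase distortion introduced by the filter $H$. The gain $\alpha$ in the modulation factor $1 + \alpha w[n]$ controls the audibility–robustness tradeoff. It may be a constant or, as shown in Fig. 1, can be automatically adapted according to a properly chosen audibility cost function, e.g., a psycho-acoustic model of the human auditory system (HAS) [14].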
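As a minimal sketch of the embedding rule in (1) and (2), the Python fragment below adds a scaled product of the watermark and the filtered host to the host signal. The names `embed`, `x_b`, `w`, and `alpha`, the toy 1 kHz host, and the use of an identity filter in place of $H$ are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def embed(x, x_b, w, alpha):
    """Return y[n] = x[n] + alpha * w[n] * x_b[n] for equal-length sequences."""
    if not (len(x) == len(x_b) == len(w)):
        raise ValueError("x, x_b and w must have the same length")
    return [xn + alpha * wn * xbn for xn, xbn, wn in zip(x, x_b, w)]


# Toy host: a 1 kHz tone sampled at 8 kHz (hypothetical values).
fs = 8000
x = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(64)]

# Assumption of this sketch: H is the identity, so the in-band part equals x.
x_b = list(x)

# Slowly varying watermark: +1 over the first half, -1 over the second half.
w = [1.0] * 32 + [-1.0] * 32
alpha = 0.05

y = embed(x, x_b, w, alpha)
```

With an identity filter the out-of-band component is zero, so the result reduces to the envelope-modulated form of (2): each sample is the host sample scaled by `1 + alpha * w[n]`.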

Fig. 2. Watermark conditioning circuit.

A. Watermark Signal

The multibit payload watermark signal $w[n]$ is produced as follows. First, we generate a finite-length, zero-mean, uniformly distributed random sequence $w_s[k]$ for $k = 0, \ldots, L_w - 1$ by means of a random number generator with an initial seed known to both the embedder and the detector. We then apply circular shifts $d_1$ and $d_2$ to the sequence $w_s[k]$ to obtain the watermark sequences $w_{d_1}[k]$ and $w_{d_2}[k]$, respectively. Each sequence is subsequently converted into a periodic, slowly varying narrowband signal of length $L_w T_s$ by the so-called watermark conditioning circuit.

The multirate system of Fig. 2 shows the implementation of the watermark conditioning circuit. First, the up-sampler raises the sampling frequency of the input watermark sequence by the factor $T_s$. The factor $T_s$ is referred to as the watermark symbol period and represents the span of one watermark symbol in the audio signal. Finally, a window shaping function $s[n]$ (typically, a biphase window) of length $T_s$ samples is convolved with the up-sampled watermark sequence to generate a slowly varying watermark signal $w_i[n]$, $i = 1, 2$. Let $w_{i,k}[n]$ be the $k$th frame of the signal $w_i[n]$; then, according to Fig. 2, we may write

$$w_{i,k}[n] = w_{d_i}[k]\, s[n], \qquad n = 0, \ldots, T_s - 1 \tag{3}$$

where $s[n]$ is the window shaping function, and $w_{d_i}[k]$ is the input to the watermark conditioning circuit. The so-generated watermark signals $w_1[n]$ and $w_2[n]$ are then added up with a certain relative delay $T_r$ to give the multibit payload watermark signal, i.e.,

$$w[n] = w_1[n] + w_2[n - T_r] \tag{4}$$

where the relative delay $T_r$ is such that the zero-crossings of $w_1[n]$ are aligned with the maximum amplitude points of $w_2[n - T_r]$ and vice versa. This results in a composite signal with minimum cross-interference between its two underlying signals. For the raised cosine and the biphase window functions, this is achieved with $T_r = T_s/2$ and $T_r = T_s/4$, respectively (see Section II-B for the definition of these windows). The finite-length watermark signal $w[n]$ is embedded throughout the audio signal by repeating it end-to-end.

During detection, the composite watermark signal $w[n]$ generates two correlation peaks corresponding to its two underlying signals $w_1[n]$ and $w_2[n]$. These two signals are separated by considering two framings displaced by $T_r$ (see Section III-A5). As shown in Fig. 9, the two correlation peaks are separated by a distance that depends on the relative circular shift between $w_{d_1}[k]$ and $w_{d_2}[k]$. This separation is part of the payload and is defined in (5). In addition to this separation, extra information is encoded by changing the relative signs of the embedded watermarks.

¹Although there is no restriction on the shape of $H$, it is commonly chosen to be a bandpass filter.

In the detector, this

Fig. 3. Shaping window function. (Left) Raised cosine. (Right) Biphase.

Fig. 4. Frequency spectra for the watermark sequence $w[k] = \{1, 1, -1, 1, -1, -1\}$ conditioned with (a) a raised cosine and (b) a biphase shaping window.

Fig. 5. Characteristic curves for robustness $R$ and audio quality $Q$ as a function of gain $\alpha$ and symbol period $T_s$.

is seen as a relative sign between the correlation peaks. This sign information can take four possible values and may be defined through the pair $(\operatorname{sign}(c_1), \operatorname{sign}(c_2))$, where $c_1$ and $c_2$ are the values of the correlation peaks corresponding to $w_{d_1}[k]$ and $w_{d_2}[k]$, respectively. The overall watermark payload is then given as a combination of the sign information and the peak separation of (5), as expressed in (6). The maximum information that can be carried by a watermark sequence of a given length therefore follows from the number of distinct peak separations and the four sign combinations.

B. Parameters Controlling the Watermark Performance

In MASK, one may identify four main parameters that control the robustness and audibility behavior. These are
1) the watermark shaping function $s[n]$ used in the watermark conditioning circuit;
2) the embedding strength $\alpha$;
3) the watermark symbol period $T_s$;
4) the passband of the bandpass filter $H$.

Although it is difficult to make a quantitative analysis of the effects of each parameter on audibility and robustness for all sorts of audio signals, it is possible to give a qualitative analysis in the form of characteristic curves.

Let us first consider the two alternative window shaping functions, namely, the raised cosine and the biphase windows, given in Fig. 3. Unlike the raised cosine window, the biphase window shaping function always results in a quasi-DC-free watermark signal. This is clearly seen in Fig. 4, where the frequency spectra corresponding to the watermark sequence conditioned, respectively, with a raised cosine and a biphase window shaping function are shown. To understand the effect on the watermark performance, it is important to note that the useful information is contained

Fig. 6. Characteristic curves for robustness $R$ and audio quality $Q$ as a function of the bandpass filter cut-off frequencies.

only in the non-DC component of the watermark. This means that for the same watermark energy, the biphase window carries more useful information than the raised cosine window. As a result, the biphase window allows superior audibility performance for the same robustness, or conversely, better robustness for the same audio quality.

In Fig. 5, the effects of the embedding strength $\alpha$ and the watermark symbol period $T_s$ on the robustness (denoted by $R$) and audio quality (denoted by $Q$) are shown for a given window shaping function and given filter cut-off frequencies. It is possible to see from these curves that there exists a point beyond which the audio quality degrades quickly with increasing $\alpha$, whereas robustness does not increase significantly. As functions of $T_s$, both $R$ and $Q$ exhibit inflection points.

In Fig. 6, the characteristics of $R$ and $Q$ as functions of the filter cut-off frequencies are shown for given $\alpha$, $T_s$, and window shaping function. From these, it is possible to derive the behavior of the watermark for a given passband. One interesting behavior is the existence of an inflection point below which both robustness and audio quality show positive slopes (normally, $R$ and $Q$ have opposite slopes).

The above curves characterize the watermark system behavior in terms of robustness and audio quality. With these curves, it is possible to find a set of parameters optimally fulfilling specific robustness and audio quality requirements. For some of these parameters, however, one also has to consider other practical constraints such as detection time and complexity.
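The conditioning circuit of Fig. 2 and the DC behavior of the two shaping windows can be illustrated with a short sketch. The window formulas below are assumptions: the raised cosine is taken as a Hann-like window and the biphase window as one period of a sine (a positive lobe followed by a negative lobe), which matches the shapes described for Fig. 3 but is not confirmed by the paper. The function names and the four-symbol test sequence are likewise hypothetical.

```python
import math


def raised_cosine(ts):
    """Raised cosine window of ts samples (zero at the frame start)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / ts)) for n in range(ts)]


def biphase(ts):
    """Assumed biphase shape: a positive lobe followed by a negative lobe."""
    return [math.sin(2.0 * math.pi * n / ts) for n in range(ts)]


def condition(symbols, window):
    """Up-sample by len(window), then convolve with the window (cf. Fig. 2)."""
    ts = len(window)
    upsampled = []
    for sym in symbols:
        upsampled.append(float(sym))
        upsampled.extend([0.0] * (ts - 1))
    out = [0.0] * len(upsampled)
    for i, u in enumerate(upsampled):
        if u == 0.0:
            continue
        for j, h in enumerate(window):
            if i + j < len(out):
                out[i + j] += u * h
    return out


def mean(values):
    return sum(values) / len(values)


ts = 16
symbols = [1, 1, -1, 1]  # hypothetical sequence with a nonzero mean
w_rc = condition(symbols, raised_cosine(ts))
w_bp = condition(symbols, biphase(ts))

print("DC level, raised cosine:", mean(w_rc))
print("DC level, biphase:", mean(w_bp))
```

For this sequence the raised cosine signal keeps a DC level of about 0.25, while the biphase signal averages to zero up to rounding, because each biphase frame sums to zero regardless of the symbol value.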


C. Automatic Control of the Embedding Strength

Given that the main requirement of a watermarking system is to be inaudible while being robust to attacks, it is necessary to find an objective cost function that can be used to control the embedding strength. Otherwise, to ensure inaudibility, we may have to limit the maximum watermark energy to that acceptable for critical fragments, or conversely, we may have to limit the range of audio tracks that may be watermarked to those where the watermark is not audible.

To overcome this problem in MASK, the embedding strength $\alpha$ is automatically controlled by comparing the power spectrum of the watermark signal against the so-called masking curve of the host audio, generated using a psycho-acoustic model of the HAS (see Fig. 1). In this way, embedding is done optimally, i.e., the watermark is embedded with the maximum possible energy that is still inaudible.

When re-embedding (i.e., removing the existing payload and embedding a different one) is necessary, the watermark strength can be derived from detection. In this way, re-embedding can be made to closely resemble the embedded watermark strength without having to use the psycho-acoustic model of the human auditory system for calculating the gain $\alpha$.

D. Determining the Masking Threshold

Tuning the psycho-acoustic model for achieving CD-quality watermarked audio has been done with the help of a masking threshold experiment based on an adaptive 2-alternative forced-choice (2AFC) measurement paradigm [4]. In this test, a listener is presented with a number of sequential trials for an audio fragment. At each trial, there are two stimuli (a reference and a target) presented in a random order. The reference stimulus is a nonwatermarked version of the audio fragment. The target is a watermarked version of the same audio fragment for which the level of the masking curve is adjusted adaptively throughout the test. The listener's task is to identify the correct presentation order at each trial.

If the listener does not identify the order correctly, the artifact level is increased (e.g., by adjusting the masking threshold upwards). Conversely, if the listener identifies the presentation order correctly, the artifact level is decreased. Performance is measured as the number of correct responses. After each trial, the listener is given limited feedback on his performance. The procedure is repeated until a sufficient number of reversals in the changes of the artifact level is attained within a maximum number of trials. The threshold is determined by, e.g., taking the mean level over the last reversals.

To enable a better visualization of the procedure, a typical example of the progression of the 2AFC test is graphically shown in Fig. 7. The test starts with an artifact level well above the expected threshold. The level is gradually varied through the test in steps of variable size according to a set of convergence rules. The chosen rule dictates, among other things, the convergence point on the so-called psychometric function curve. This curve is a sigmoid function defined between two limits, referred to as the chance-performance level and the perfect-performance level (100% point). For a 2AFC procedure, the chance-performance level is at 50%. The central part of the psychometric function

Fig. 7. Progression of the 2AFC test. “+” represents a correct identification, and “−” represents an incorrect identification.

is shown on the left side of Fig. 7. Commonly, the convergence threshold is set midway between the 50% and the 100% performance levels. The example shown in Fig. 7 uses the so-called two-down–one-up rule that converges to the 71% performance level. In Section IV-A, we present experimental results for this test.

III. WATERMARK DETECTOR

In this section, we discuss the MASK watermark detector. To facilitate the discussion, we first consider in Section III-A detection in the unlikely case of synchronous operation between the embedder and the detector. In Section III-B, we consider a more realistic condition where there is a possible time offset and time-scale modification between the embedder and the detector.

A. Temporally Synchronized Detection

By temporal synchronism, we mean that there is no time offset or time-scale modification between the embedder and the detector. Under this assumption, the MASK watermark detector looks like the one shown in Fig. 8. It consists of two stages: a) the symbol extraction stage and b) the correlation and decision stage. In the symbol extraction stage, the incoming watermarked signal $y[n]$ is processed to generate an estimate of the watermark sequence. In the correlation and decision stage, the generated watermark sequence is correlated with the reference watermark, and the correlation peak is compared against a threshold to determine the detection truth value. In the following, we give a detailed description of each of the processing stages and derive mathematical models characterizing the principal components.

1) Watermark Symbol Extraction Stage: The first process in the symbol extraction stage is the filter $H$, which is typically a bandpass filter and has the same behavior as the corresponding filter in the watermark embedder. Let $y_b[n]$ be the output of the filter $H$. Assuming linearity, it follows that [viz. (1) and (2)]

$$y_b[n] = (1 + \alpha\, w[n])\, x_b[n]. \tag{7}$$

After filtering, the audio signal is segmented into frames of length $T_s$. Let $y_{b,k}[n]$ be the $n$th sample of the $k$th frame of the filtered signal, and let $w_k[n]$ be the $n$th sample of the $k$th frame of the watermark signal $w[n]$. Then, noting that $w_k[n] = w_d[k]\, s[n]$ [viz. (3)], from (7), it follows that

$$y_{b,k}[n] = (1 + \alpha\, w_d[k]\, s[n])\, x_{b,k}[n] \tag{8}$$

where $s[n]$ is the window shaping function as defined in Section II-A, $x_{b,k}[n]$ is the $k$th frame of the filtered host signal, and $w_d[k]$ is an estimate of the $k$th symbol of the embedded watermark sequence

$w_{d_1}[k]$ or $w_{d_2}[k]$.² The next processing step depends on the window function $s[n]$ used. In the following, we show how $w_d[k]$ can be estimated for the two proposed window shaping functions: the raised cosine window shaping function and the biphase window shaping function.

Fig. 8. Simple MASK watermark detector.

²Note that depending on the alignment, only $w_{d_1}[k]$ or $w_{d_2}[k]$ is estimated. However, by considering two frames with a relative displacement of $T_r$, both $w_{d_1}[k]$ and $w_{d_2}[k]$ can be estimated (see Section III-A5).

2) Raised Cosine Window Shaping Function: As shown in Fig. 8, the next step after framing is the energy computation stage. Let $E[k]$ be the energy corresponding to the $k$th frame signal $y_{b,k}[n]$, i.e.,

$$E[k] = \sum_{n} |y_{b,k}[n]|^2.$$

Substituting (8) into the above equation, we obtain

$$E[k] = \sum_{n} (1 + \alpha\, w_d[k]\, s[n])^2\, |x_{b,k}[n]|^2 \tag{9}$$

where $\alpha$ is the embedding gain and $x_{b,k}[n]$ is the $k$th frame of the filtered host signal. Note that for the raised cosine window function, most of the watermark energy is concentrated near the central part of the frame, referred to as the region of significance. To maximize detection, it is thus advantageous to compute the energy function only over this region of significance. Usually, the central one-third portion of the frame is considered as the region of significance. In this region, we can approximate the raised cosine window with $s[n] \approx 1$. Thus, ignoring higher order terms of $\alpha$ in (9) and after some elaboration, we obtain

$$E[k] \approx (1 + 2\alpha\, w_d[k]) \sum_{n} |x_{b,k}[n]|^2. \tag{10}$$

Solving for $w_d[k]$ in the above equation, we obtain the following approximation:

$$w_d[k] \approx \frac{1}{2\alpha} \left( \frac{E[k]}{\sum_{n} |x_{b,k}[n]|^2} - 1 \right). \tag{11}$$

Note that the denominator of (11) contains a term that requires knowledge of the host (unwatermarked) signal. Since the host signal is not available to the detector, we first need to estimate the denominator of (11). To this end, note that the embedded watermark sequence is white noise. This means that any correlation between neighboring samples of the energy function $E[k]$ is not contributed by the watermark. In fact, it is assumed here that the watermark contributes only to the noisy part of the energy function $E[k]$, and the slowly varying part is predominantly the contribution of the envelope of the original audio. Thus, we may approximate (11) by

$$w_d[k] \approx \frac{1}{2\alpha} \left( \frac{E[k]}{\operatorname{Lowpass}(E[k])} - 1 \right) \tag{12}$$

where Lowpass is a lowpass filter function. The above relation, in fact, represents the functionality of the whitening stage for the case of a raised cosine window shaping function.

3) Biphase Window Shaping Function: When a biphase window is employed, a different approach is used to estimate the envelope of the original audio. To be more specific, consider the biphase window function given in Fig. 3. It is seen that when the audio envelope is shaped with this window function, the first and the second halves of the frame are scaled in opposite directions. In the detector, we utilize this property to estimate the envelope energy of the host signal. To this end, first, the audio frame is subdivided into two halves. Let $E_1[k]$ and $E_2[k]$ be the energy functions corresponding to the first and second half frames, respectively, i.e.,

$$E_1[k] = \sum_{n=0}^{T_s/2 - 1} |y_{b,k}[n]|^2 \tag{13}$$

and

$$E_2[k] = \sum_{n=T_s/2}^{T_s - 1} |y_{b,k}[n]|^2 \tag{14}$$

respectively. For the biphase window shaping function, we identify two significance regions corresponding to the two lobes. Let the first and the second halves of the window shaping function be denoted by $s_1[n]$ and $s_2[n]$, respectively. Then, over the significance regions, we may approximate $s_1[n] \approx 1$ and $s_2[n] \approx -1$. Thus, ignoring higher order terms of $\alpha$ in (13) and (14) and after some elaboration, we obtain

$$E_1[k] \approx (1 + 2\alpha\, w_d[k]) \sum_{n=0}^{T_s/2 - 1} |x_{b,k}[n]|^2$$

and

$$E_2[k] \approx (1 - 2\alpha\, w_d[k]) \sum_{n=T_s/2}^{T_s - 1} |x_{b,k}[n]|^2.$$

Thus, assuming that the host energies in the two half frames are approximately equal, it can be shown that $w_d[k]$ may be approximated by

$$w_d[k] \approx \frac{1}{2\alpha} \cdot \frac{E_1[k] - E_2[k]}{E_1[k] + E_2[k]}. \tag{15}$$

Unless stated otherwise, we will assume in all subsequent discussions that a biphase window shaping function is used. It should be noted, however, that all the discussions also apply to the raised cosine window shaping function.

4) Correlation and Decision Stage: As shown in Fig. 8, the final decision on the existence of a watermark is made by comparing the peak of the correlation function against a certain threshold. A typical behavior of the correlation function is shown in Fig. 9. The horizontal line in the figure represents the detection threshold. Its value controls the false alarm rate.
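The biphase symbol estimate and the correlation-and-decision stage can be sketched as follows. This is an illustrative sketch under several assumptions that are not from the paper: the biphase window is modeled as one sine period, the energies are taken over whole half frames rather than over the regions of significance, the unknown factor $1/(2\alpha)$ is dropped because it only scales the correlation, a Barker-13 sequence stands in for the random watermark sequence so the example is deterministic, and the false positive estimate is the standard one-sided Gaussian tail for a threshold expressed in standard deviations.

```python
import math
import statistics


def biphase(ts):
    """Assumed biphase shape: a positive lobe followed by a negative lobe."""
    return [math.sin(2.0 * math.pi * n / ts) for n in range(ts)]


def estimate_symbols(y_b, ts):
    """Estimate one symbol per frame from the two half-frame energies."""
    half = ts // 2
    estimates = []
    for start in range(0, len(y_b) - ts + 1, ts):
        e1 = sum(v * v for v in y_b[start:start + half])
        e2 = sum(v * v for v in y_b[start + half:start + ts])
        estimates.append((e1 - e2) / (e1 + e2))
    return estimates


def circular_correlation(estimates, ref):
    """Correlate the estimates with every circular shift of the reference."""
    length = len(ref)
    return [sum(estimates[k] * ref[(k - d) % length] for k in range(length))
            for d in range(length)]


def false_positive_probability(threshold):
    """One-sided Gaussian tail for a threshold given in standard deviations."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


# Toy setup (all values hypothetical).
ts, alpha, shift = 16, 0.1, 5
ref = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]  # Barker-13 stand-in
embedded = [ref[(k - shift) % len(ref)] for k in range(len(ref))]
window = biphase(ts)
host = [math.sin(2.0 * math.pi * n / 8.0 + 0.3) for n in range(ts * len(ref))]
y_b = [(1.0 + alpha * embedded[n // ts] * window[n % ts]) * host[n]
       for n in range(len(host))]

estimates = estimate_symbols(y_b, ts)
corr = circular_correlation(estimates, ref)
peak_lag = max(range(len(corr)), key=lambda d: corr[d])
peak_sigma = (corr[peak_lag] - statistics.mean(corr)) / statistics.pstdev(corr)

print("peak lag:", peak_lag)
print("normalized peak:", peak_sigma)
print("Gaussian tail at the peak:", false_positive_probability(peak_sigma))
```

In this toy case the host energy is identical in both half frames, so each estimate has the sign of the embedded symbol and the correlation peaks at the circular shift that was applied. Real audio violates the equal-energy assumption to some degree, which is one reason the decision is made on a thresholded correlation peak rather than on individual symbols.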

Fig. 9. Typical shape of the correlation function.

Basically, there are two kinds of false alarms: the false positive rate, defined as the probability of detecting a watermark in nonwatermarked items, and the false negative rate, defined as the probability of not detecting a watermark in watermarked items. Generally, the requirement on the false positive rate is more stringent than that on the false negative rate. Let $\rho$ be the normalized correlation peak. Then, assuming a normally distributed correlation function, the false positive probability is given by

$$p = \frac{1}{2} \operatorname{erfc}\left( \frac{\rho}{\sqrt{2}} \right) \tag{16}$$

where erfc is the complementary error function [6].

5) Estimating the Payload: Note that the watermark is embedded by modulating the audio signal with the composite signal $w[n]$ given in (4). It has been discussed above that the energy computation is generally conducted over the so-called significance region of about one third of the central portion of each lobe of the biphase window shaping function. This means that the significance regions of the two underlying watermark signals, which are displaced by $T_r$, have very little cross-interference. This behavior is used to separate the envelope portions corresponding to $w_{d_1}[k]$ and $w_{d_2}[k]$, respectively. To be specific, consider two framings: one starting at a frame boundary and the other displaced from it by $T_r$. The sequence estimated using the first framing corresponds to $w_{d_1}[k]$, and that estimated using the second framing corresponds to $w_{d_2}[k]$. Once the two sequences are separated, the correlation functions are computed. The resulting correlation functions typically look like that shown in Fig. 9. The payload is then computed using the relations given in (5) and (6).

B. Temporally Nonsynchronized Detection

Generally, the watermark signal received at the detector is a delayed and time-scale modified version of the watermark signal transmitted at the embedder side. Thus, in designing robust audio watermarks, it is important to make sure that the watermark detector can synchronize to the watermark sequence inserted in the host signal and is also able to resolve time-scale modifications. In this section, we discuss how this can be achieved in the MASK detector.

The generalized watermark detector arranged to resolve possible time-offset and time-scale modifications is shown in Fig. 10. In the figure, the energy functions of the first and second half frames are defined for each of $N_b$ subframe positions by (17) and (18). This means that the energy functions are computed over 50% overlapping subframes. Subsequently, each pair of energy functions produces an estimate of the embedded watermark sequence.

Fig. 10. MASK watermark detector.

1) Resolving Time-Offset Problem: Let the energies be defined as in (17) and (18). Then, we generate $N_b$ estimates of the watermark sequence, one per pair of energy functions, as given in (19). Since, for each lobe of the biphase window shaping function, the energy of the embedded watermark is concentrated near the center of the lobe, the subframe best aligned with the center of the lobe results in a distinctly better watermark symbol estimate compared with all the other subframes. Thus, of the $N_b$ so-generated sequences, the one that best fits the reference watermark is chosen. The corresponding correlation peak value is then used to determine the truth value of the detection. For the MASK watermark, the length of each of the buffers is typically between 2048 and 8192, and $N_b$ is typically 2 to 8.

By choosing $N_b$ to be a multiple of 4, the two embedded sequences $w_{d_1}[k]$ and $w_{d_2}[k]$ are estimated automatically during the time-offset alignment process. Thus, the value of $N_b$ is preferably chosen to be even.
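The idea behind the time-offset alignment can be sketched as a search over a small set of candidate framings. The sketch below is a simplification and not the paper's subframe scheme: it simply re-frames the received signal at `n_b` evenly spaced offsets within one symbol period, estimates the symbols for each framing with the half-frame energy ratio, and keeps the framing with the largest correlation peak. The sine-shaped biphase window, the Barker-13 reference, the constant-envelope toy host, and all function names are assumptions made for illustration.

```python
import math


def biphase(ts):
    """Assumed biphase shape: a positive lobe followed by a negative lobe."""
    return [math.sin(2.0 * math.pi * n / ts) for n in range(ts)]


def estimate_symbols(signal, ts, offset, count):
    """Half-frame energy-ratio estimates for frames starting at `offset`."""
    half = ts // 2
    estimates = []
    for k in range(count):
        start = offset + k * ts
        e1 = sum(v * v for v in signal[start:start + half])
        e2 = sum(v * v for v in signal[start + half:start + ts])
        estimates.append((e1 - e2) / (e1 + e2))
    return estimates


def correlation_peak(estimates, ref):
    """Largest circular correlation between the estimates and the reference."""
    length = len(ref)
    return max(sum(estimates[k] * ref[(k - d) % length] for k in range(length))
               for d in range(length))


def best_offset(signal, ref, ts, n_b):
    """Try n_b framings spaced ts // n_b apart and return the best one."""
    step = ts // n_b
    peaks = [correlation_peak(estimate_symbols(signal, ts, j * step, len(ref)), ref)
             for j in range(n_b)]
    best = max(range(n_b), key=lambda j: peaks[j])
    return best * step, peaks


# Toy setup (all values hypothetical).
ts, alpha, n_b = 16, 0.1, 4
ref = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]  # Barker-13 stand-in
window = biphase(ts)
total = 3 * len(ref) * ts  # watermark repeated end-to-end three times
host = [1.0 if n % 2 == 0 else -1.0 for n in range(total)]  # constant envelope
watermarked = [(1.0 + alpha * ref[(n // ts) % len(ref)] * window[n % ts]) * host[n]
               for n in range(total)]

received = watermarked[12:]  # delay unknown to the detector
offset, peaks = best_offset(received, ref, ts, n_b)
print("selected framing offset:", offset)
```

Dropping the first 12 samples moves the true frame boundaries to positions 4, 20, 36, and so on in the received signal, so the framing at offset 4 is the aligned one and yields a clearly larger correlation peak than the three misaligned framings.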


2) Improving the Accuracy of the Watermark Detection: Usually, the length of each buffer is three to four times that of the watermark sequence, and each watermark symbol is constructed by taking the average of several estimates of a given watermark symbol. This averaging process is referred to as smoothing, and the number of times the averaging is done is referred to as the smoothing factor $s_f$. Thus, given the smoothing factor $s_f$ and the watermark sequence length $L_w$, the buffer length $L_b$ is such that

$$L_b = s_f\, L_w. \tag{20}$$

Taking a large smoothing factor improves the accuracy of the detection. However, this factor cannot be increased arbitrarily, as the detector tends to become prohibitively complex. Moreover, as will soon be seen, a large smoothing factor makes the watermark more sensitive to time-scale modification. Note that the smoothing process presupposes that the watermark is embedded by repeating it end-to-end, as discussed in Section II-A. If this is not the case, then smoothing can be omitted altogether at the expense of robustness.

3) Efficient Time-Scale Search: In digital devices, there can exist up to 1% drift in the sampling (clock) frequency. For audio equipment, this drift is normally manifested as a stretch or shrink in the time domain signal (i.e., a linear time-scale change). A watermark embedded in the audio signal will be affected by this time stretch or shrink as well, which may make watermark detection very difficult or even impossible. In MASK, an efficient time-scale search that is based on the manipulation of the buffered sequences is implemented. More specifically, we use a linear interpolation technique to realize the time-scale search.

First, the $N_b$ buffers are multiplexed into a single buffer. Second, an interpolated sequence is generated from the multiplexed buffer by linear interpolation, where the interpolation coefficient is determined by the expected time-scale modification. The interpolated watermark sequences are subsequently generated by demultiplexing the interpolated sequence into the $N_b$ buffers, so that each entry of each interpolated watermark sequence is taken from the corresponding position of the interpolated multiplexed sequence. Finally, the content of each buffer is correlated with a reference watermark sequence, and the maximum of the correlation peaks is compared against a threshold to determine the detection truth value. The scale search is realized by repeating the above procedure (i.e., interpolation, demultiplexing, correlation) until a positive detection truth value is attained or until all the time scales under consideration are exhausted.

4) Finding the Appropriate Scale Search Step Size: It has been discussed above that for proper watermark detection, one has to perform an appropriate time-scale search. Let $\Delta$ be the scale search step size, and let us assume that we want the watermark to survive all the time-scale modifications in a given interval. Then, in the worst case, a total of $N_s$ time scales, given by (21), need to be visited before the detection truth value is determined. To minimize $N_s$, we would like to find the maximum value of $\Delta$ that can still allow an exhaustive scale search. To this end, experimental results show that the detection performance is not significantly affected if the time scaling does not exceed the inverse of half the buffer length, i.e., $2/L_b$. This bounds the step size $\Delta$ for an exhaustive scale search. Putting this bound into (21) gives the minimum number of scales, (22), that must be visited in order to conduct an exhaustive scale search. Note that the scale search can be time consuming. Thus, one has to take the complexity issue into account while choosing the watermark embedding/detection parameters.

5) Detection Threshold and False Alarm Rate: Note that to resolve the time offset, one has to check which of the $N_b$ estimated sequences (buffer values) best fits the reference watermark. Moreover, for each buffer, one has to conduct $N_s$ experiments to check which of the scales best fits the reference watermark. This means that in the worst case, one conducts a total of $N_b N_s$ experiments for each detection. Thus, the false positive rate under time-offset alignment and time-scale search is given by multiplying (16) by $\kappa N_b N_s$:

$$p = \frac{\kappa\, N_b\, N_s}{2} \operatorname{erfc}\left( \frac{\rho}{\sqrt{2}} \right) \tag{23}$$

where $\kappa$ is a correction factor that takes into account the fact that the random variables stored in the different buffers are not entirely independent of each other. This means that for the same normalized correlation peak, the false positive rate associated with time-offset and time-scale search is $\kappa N_b N_s$ times larger than that associated with no time-offset and no time-scale search. In other words, to attain the same false positive rate, the threshold associated with the former has to be set at a higher value than that associated with the latter. For instance, for a given parameter setting and target false positive rate, the detection threshold normalized with respect to the standard deviation of the correlation function has to be set at one value for the case where there is no time-scale and offset search and at a higher value for the case where there are time-offset and time-scale search.

IV. EXPERIMENTAL RESULTS

In this section, the performance of the MASK watermark is evaluated in terms of robustness and audibility. In order to enable applicability of the watermark in a wide range of systems, it is desired that the watermark is both inaudible and robust. Audibility and robustness are tightly connected parameters and have therefore been jointly optimized during the design of the watermarking scheme. First, a 2AFC subjective

LEMMA et al.: TEMPORAL DOMAIN AUDIO WATERMARKING TECHNIQUE

TABLE I SELECTED AUDIO CLIPS. ALL ITEMS ARE STEREO, ABOUT 10 s LONG, AND SAMPLED AT 44.1 kHz—16 BIT RESOLUTION

1095

TABLE II DESCRIPTION OF THE WATERMARKING ATTACKS

test was conducted to determine the hearing threshold level for the psycho-acoustic model. Then, an embedding operating point with a certain masking margin from this threshold was chosen, and the robustness of the watermark for this setting was assessed. At the end of this section, a joint representation for audibility and robustness is given. The representation visualizes the tradeoffs between these two opposite interests. The results are based on only a limited set of audio clips. This was primarily because of a limited time resource. In order to make these results as representative as possible, we have selected four different types of audio clips that are critical for the current watermarking system. In Table I, the name and the property of each of the considered clips are listed. A. 2AFC Test Here, we give results of the 2AFC test conducted to determine the masking threshold. The masking threshold is an invariant reference energy level corresponding to the hearing threshold of the watermark artifacts. It is determined only once and all watermark embeddings are done with reference to this value. The 2AFC test was done using the three-down one-up decision rule. This matching pattern converges to the 79% performance point on the psychometric function. This threshold is somewhat higher than the commonly assumed midpoint performance (i.e., the 75% performance point). During the test, an immediate feedback was given, notifying the subject if he had identified the item correctly or not. However, no feedback is given about the progression of the experiment. Based on a preliminary analysis, we first chose a certain expected masking threshold. We then set the initial threshold level at 9 dB above this expected value. Moreover, the step size of the change in the masking threshold was set to an initial value dB and decreased by half until a minimum of of dB was reached. 
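As an illustration only, the adaptive procedure described so far can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' test software: the function and parameter names are hypothetical, the step sizes and the number of averaged reversals are placeholder values, and halving the step at each reversal is an assumption (the text only states that the step was halved down to a minimum). The listener is simulated with a simple logistic psychometric function.

```python
import math
import random


def three_down_one_up(respond, start_level_db, initial_step_db, min_step_db,
                      max_trials=100, n_avg_reversals=6):
    """Three-down one-up staircase (hypothetical helper, for illustration).

    respond(level_db) returns True when the listener picks the correct item.
    Returns (threshold_estimate_db, list_of_reversal_levels_db).
    """
    level = start_level_db
    step = initial_step_db
    correct_run = 0
    last_direction = 0  # -1: last move was down, +1: up, 0: no move yet
    reversals = []

    for _ in range(max_trials):
        if respond(level):
            correct_run += 1
            if correct_run < 3:
                continue  # need three correct answers in a row to go down
            correct_run = 0
            direction = -1
        else:
            correct_run = 0
            direction = +1  # a single wrong answer moves the level up

        if last_direction and direction != last_direction:
            reversals.append(level)
            # Assumption: the step is halved at every reversal.
            step = max(step / 2.0, min_step_db)
        last_direction = direction
        level += direction * step

    tail = reversals[-n_avg_reversals:]
    threshold = sum(tail) / len(tail) if tail else level
    return threshold, reversals


def make_listener(true_threshold_db, rng, slope=1.0):
    """Simulated 2AFC listener: 50 % correct by guessing, rising to 100 %."""
    def respond(level_db):
        p_detect = 1.0 / (1.0 + math.exp(-slope * (level_db - true_threshold_db)))
        return rng.random() < 0.5 + 0.5 * p_detect
    return respond


if __name__ == "__main__":
    listener = make_listener(true_threshold_db=0.0, rng=random.Random(1))
    # Start 9 dB above the expected threshold, as in the text; steps are placeholders.
    estimate, revs = three_down_one_up(listener, 9.0, 4.0, 0.5)
    print(f"estimated threshold: {estimate:.2f} dB after {len(revs)} reversals")
```

Because three consecutive correct answers are needed to lower the level while one error raises it, the track settles where the probability of a correct answer p satisfies p^3 = 0.5, i.e., near the 79% point mentioned above.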
The final convergence threshold is determined as the average over the last reversals, and the maximum number of trials was set to 100. A total of eight listeners (one female and seven male), aged 25 to 35 and with experience in subjective listening tests, participated in the 2AFC test. Of these, only four came close to or below the expected threshold, and for only two of the four selected items. After all the data were analyzed, the masking threshold was determined. To allow for measurement errors as well as unpredictable errors, a masking margin of 1.5 dB was allowed between the determined masking threshold and the watermark embedder operating point.

B. Robustness

The watermark robustness was tested against the different signal processing and audio coding attacks listed in Table II.

Fig. 11. Robustness results.

The corresponding results are shown in Fig. 11. For each attack, the vertical bar represents the distribution of the normalized correlation peaks of the four items (the lower end of the bar indicates the minimum, the upper end shows the maximum, and the dashed point represents the median). The results show that, for the considered clips, the MASK watermark survives every single attack.

C. Subjective Audibility Test

In order to evaluate the audibility of the MASK watermark, we have selected the “Double blind, triple stimulus, with hidden reference” test methodology [11]. In this test, the subject is presented with a trial of three clips: R/A/B. The first is the reference R, which is the original signal. The watermarked signal and the hidden reference are assigned to the second signal (A) and third signal (B) in random order. In order to detect positional effects, each trial is presented twice in the test, as R/A/B and R/B/A. All clips and trials are randomized. For each trial, the subject has to identify the watermarked clip and grade it according to the ITU-R BS.1284 standardized 5-point grading scale:

• 5.0 imperceptible;
• 4.0 perceptible but not annoying;
• 3.0 slightly annoying;
• 2.0 annoying;
• 1.0 very annoying.

By default, the item identified as the reference is given the score 5.0.
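The trial construction and the default scoring rule above can be sketched as follows. This is only one possible reading of the procedure, with hypothetical function names and clip labels: each clip yields two trials (watermarked signal in slot A, then in slot B), the whole list is shuffled, and the slot the subject does not pick automatically receives 5.0.

```python
import random


def build_trials(clip_names, seed=None):
    """Return a shuffled list of trials; each clip appears in both slot orders."""
    rng = random.Random(seed)
    trials = []
    for clip in clip_names:
        # R is always the original; the watermarked version sits in A or in B.
        trials.append({"clip": clip, "watermarked_slot": "A"})
        trials.append({"clip": clip, "watermarked_slot": "B"})
    rng.shuffle(trials)
    return trials


def grades_for_trial(trial, picked_slot, grade):
    """Return (grade of the watermarked clip, grade of the hidden reference).

    picked_slot is the slot ("A" or "B") the subject believes is watermarked,
    and grade is the score on the 5-point scale given to that slot.
    """
    if picked_slot not in ("A", "B"):
        raise ValueError("picked_slot must be 'A' or 'B'")
    if not 1.0 <= grade <= 5.0:
        raise ValueError("grade must lie on the 1.0 to 5.0 scale")
    other_slot = "B" if picked_slot == "A" else "A"
    # The slot identified as the reference gets 5.0 by default.
    grades = {picked_slot: grade, other_slot: 5.0}
    wm_slot = trial["watermarked_slot"]
    ref_slot = "B" if wm_slot == "A" else "A"
    return grades[wm_slot], grades[ref_slot]


if __name__ == "__main__":
    for t in build_trials(["clip-1", "clip-2"], seed=0):
        print(t)
```

Returning the two grades keyed by what each slot actually contains, rather than by what the subject believed, keeps misidentified trials visible in the later analysis.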


Fig. 12. Audibility results. Vertical axis shows the diffgrade.

In total, seven listeners with previous experience in subjective listening tests participated. All these listeners are employees of Philips PDSL, Eindhoven, The Netherlands. For each trial, the so-called diffgrade is calculated. The diffgrade is defined as the difference between the grades given to the watermarked signal and to the hidden reference. Note that if the subject identifies the watermarked clip as the original, a positive diffgrade is assigned to that trial. For each item, the diffgrades of all subjects are grouped, and the average and the 95% confidence interval (cf. [2]) are calculated. The test results are graphically presented in Fig. 12. Note that the vertical axis is zoomed to the “Perceptible but not annoying” range.

According to the EBU criteria [3] (designed for high-quality audio coders), the overall results show that the watermarked items are statistically indistinguishable from the originals (cf. [12]). However, for item-4 (Stravinsky), the watermarked audio is statistically distinguishable from the original.

As stated earlier, the best settings of the embedding parameters are attained by jointly optimizing the audibility and robustness. To this end, one can clearly see from Fig. 11 that we still have some robustness headroom that can be traded for audibility. This can be achieved by choosing a lower masking margin (less than the selected 1.5 dB) or by conducting a more conservative 2AFC test. A more conservative 2AFC test is attained by tightening the convergence procedure. For example, we may choose a two-down–one-up rule instead of the one used in this paper (the three-down–one-up rule).

Finally, it should be noted that the 2AFC and the subjective listening tests were conducted with only a limited number of items and listeners. In such a small experiment, an individual error in the measurements can significantly affect the end result.
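The diffgrade statistics used in this subsection can be sketched as below. The helper names are hypothetical, and the confidence-interval recipe (mean plus or minus a Student-t critical value times the standard error) is an assumption, since the text only refers to [2] for the method. The critical value is passed in explicitly; for example, 2.160 is the two-sided 95% value for 13 degrees of freedom, which would apply if an item collected 14 diffgrades.

```python
import math
import statistics


def diffgrade(grade_watermarked, grade_reference):
    """Grade of the watermarked clip minus grade of the hidden reference."""
    return grade_watermarked - grade_reference


def mean_and_ci95(diffgrades, t_crit):
    """Return (mean, (lower, upper)) for one item's diffgrades.

    t_crit is the two-sided 95 % Student-t critical value for
    len(diffgrades) - 1 degrees of freedom (supplied by the caller).
    """
    n = len(diffgrades)
    if n < 2:
        raise ValueError("need at least two diffgrades")
    mean = statistics.fmean(diffgrades)
    std_err = statistics.stdev(diffgrades) / math.sqrt(n)
    half_width = t_crit * std_err
    return mean, (mean - half_width, mean + half_width)


def includes_zero(ci):
    """True when the interval contains the zero-diffgrade line."""
    lower, upper = ci
    return lower <= 0.0 <= upper


if __name__ == "__main__":
    # Made-up diffgrades for one item, purely to exercise the functions.
    sample = [0.0, -1.0, 0.0, 1.0, -0.5, 0.0, 0.0, -1.0]
    mean, ci = mean_and_ci95(sample, t_crit=2.365)  # 7 degrees of freedom
    print(mean, ci, includes_zero(ci))
```

Under this reading, an item whose interval contains zero would be treated as statistically indistinguishable from the original.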
The sensitivity to individual errors can be reduced by conducting the test with sufficiently many items and listeners, such that the effects of singular errors are minimized.

D. Objective Audibility Test

In this section, we give objective audibility test results. Note that an objective test is by no means a substitute for a subjective test. However, since the former is fully reproducible, it is quite useful for comparison and benchmarking purposes. To this end, in Fig. 12, we have shown the results of an objective test conducted according to the ITU-PEAQ standardized tool [15]. From the plot, we see that all the items but harpsichord attained objective scores higher than the zero–diffgrade line (i.e., the nonaudibility line). Apart from Stravinsky, the objective results of all items overlap the confidence intervals of the corresponding subjective scores. There is a slight mismatch between the two measurements for the Stravinsky item.

E. Robustness Versus Audibility

To visualize the interaction between audibility and robustness, in Fig. 13, the so-called robustness–audibility plane is shown. The horizontal and the vertical axes of the plane represent the audibility quality grade and the normalized correlation peak value, respectively. For each item, the elliptical-shaped region characterizes the audibility-robustness behavior of the watermark. The horizontal axis of the ellipse is equal to the 95% confidence interval of the subjective listening test result, and its vertical axis is equal to the range of the robustness scores. If the ellipse crosses the zero–diffgrade line, the corresponding watermarked signal is said to be statistically indistinguishable from the original. If, at the same time, the ellipse is entirely above the robustness threshold line, the watermark is also said to be robust for the considered clip and attacks. In this representation, the robustness threshold is set to 8.95. For the considered settings of , , , , and , this corresponds to a false positive rate of . From the figure, we see that all the ellipses but the one corresponding to the Stravinsky clip intersect the transparency line and are at the same time above the robustness threshold. Thus, we can conclude for all the clips except Stravinsky that the watermark is robust to the considered attacks, and the watermarked signals are statistically indistinguishable from the originals.

Fig. 13. Robustness versus audibility plane.

V. CONCLUSION

In this paper, we have presented a new watermarking scheme that is based on envelope modulation. The system is shown to be simple, with good audibility and robustness behavior.
We have given a qualitative analysis of the behavior of the watermarking system with respect to different control parameters. On the basis of this analysis, one can choose a set of parameters that is optimal for a given application. We have also presented test results for determining the threshold of the masking curve generated with the psychoacoustic model of the human auditory system. The subsequent listening and robustness tests have revealed that there is still room for improvement of the watermarking system. Although the informal subjective listening test was limited in terms of the number of clips and the number of subjects, its results give us the confidence to organize an independent formal test. Results of such a formal test will be published when available.

REFERENCES

[1] W. Bender, D. Gruhl, and N. Morimoto, “Techniques for data hiding,” 1994.
[2] M. H. DeGroot, Probability and Statistics. Reading, MA: Addison-Wesley, 1986.
[3] CCIR document number TG 10-2/3, “Basic audio quality requirements for digital audio bit-rate reduction systems for broadcast emission and primary distribution,” Oct. 28, 1991.
[4] S. A. Gelfand, Hearing: An Introduction to Psychological and Physiological Acoustics, 3rd ed. Basel, Switzerland: Marcel Dekker, 1998.
[5] D. Gruhl, W. Bender, and A. Lu, “Echo hiding,” in Information Hiding: 1st International Workshop (Lecture Notes in Computer Science), R. J. Anderson, Ed. Cambridge, U.K.: Isaac Newton Inst., Cambridge Univ., 1996, vol. 1117, pp. 295–315.
[6] S. Haykin, An Introduction to Analog and Digital Communication. New York: Wiley, 1989.
[7] [Online]. Available: http://www.sdmi.org/cfp.htm
[8] [Online]. Available: http://www.sdmi.org/pr/Amsterdam_May_18_2001_PR.htm
[9] I. J. Cox, J. Kilian, T. Leighton, and T. Shamoon, “Secure spread spectrum watermarking for multimedia,” IEEE Trans. Image Processing, vol. 6, pp. 1673–1687, Dec. 1997.
[10] W. Oomen, M. E. Groenewegen, R. G. van der Waal, and R. N. J. Veldhuis, “A variable-bit-rate buried-data channel for compact disc,” in Proc. 96th AES Conv., Amsterdam, The Netherlands, Feb. 26-Mar. 1, 1994, p. 9.4.
[11] ITU-R Rec. BS.1116 (rev1), “Method for the subjective assessment of small impairments in audio systems including multi-channel sound systems,” Int. Telecommun. Union, Geneva, Switzerland, 1997.
[12] G. A. Soulodre, T. Grusec, M. Lavoie, and L. Thibault, “Subjective evaluation of state-of-the-art 2-channel audio codecs,” in Proc. 104th AES Conv., Amsterdam, The Netherlands, May 1998, p. 4740 (P11–5).
[13] M. van der Veen, F. Bruekers, J. Haitsma, T. Kalker, A. N. Lemma, and W. Oomen, “Robust, multi-functional and high-quality audio watermarking technology,” in Proc. 110th AES Conv., Amsterdam, The Netherlands, May 2001.
[14] E. Zwicker and U. T. Zwicker, “Audio engineering and psychoacoustics: Matching signals to the final receiver, the human auditory system,” J. Audio Eng. Soc., vol. 39, pp. 115–126, Mar. 1991.
[15] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. G. Beerends, C. Colomes, M. Keyhl, G. Stoll, K. Brandenburg, and B. Feiten, “PEAQ-The ITU standard for objective measurement of perceived audio quality,” J. Audio Eng. Soc., vol. 48, no. 1/2/3, pp. 3–29, Jan. 2000.


Aweke Negash Lemma was born in Arba Minch, Ethiopia, on September 7, 1965. He received the B.Sc. degree in 1988 (with great distinction) from the Department of Electrical Engineering, Addis Ababa University, Addis Ababa, Ethiopia, the M.Sc. degree in 1994 (with great distinction) from the Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands, and the Chartered Designer degree in 1996 and the Ph.D. degree in 2000 from the Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands. Between 1988 and 1992, he was a Lecturer at Addis Ababa University. From 1994 to 2000, he worked as a Researcher and Assistant Lecturer with the Signal Processing Group, Delft University of Technology, in the fields of multirate signal processing, speech coding, and statistical and array signal processing. Currently, he is with the Philips Digital Systems Laboratory, Eindhoven. His current research focuses on watermarking techniques for multimedia signals.

Javier Aprea was born in Sao Paulo, Brazil, on August 13, 1965. He received the B.E.E. degree in 1987 from the Escola de Engenharia, Universidade Federal do RGS, Porto Alegre, Brazil, the M.S. degree in computer science in 1991 from the Instituto de Informatica at the same university, and the M.E.E. degree in 1994 from the Eindhoven International Institute, Eindhoven University of Technology, Eindhoven, The Netherlands. From 1986 to 1992, he worked at several institutions and companies in Brazil, developing hardware and software for measurement and control, industrial automation, biomedical engineering, and image processing. In 1994, he joined Philips Medical Systems, Best, The Netherlands, working on the development of sensors for MRI scanners. In 1998, he joined Philips Consumer Electronics, Eindhoven, working as system designer in the Sound Coding Group of the Philips Digital Systems Laboratory. His current interests are audio watermarking and electronic music distribution.

Werner Oomen was born in Oosterhout, The Netherlands, on February 20, 1967. He received the B.S. degree in electronics from the Polytechnical School, Breda, The Netherlands, in 1989 and the Ing. degree in electronics from the University of Eindhoven, Eindhoven, The Netherlands, in 1992. In 1992, he joined the Philips Research Laboratories, Eindhoven, in the Digital Signal Processing Group, where he worked on audio source coding algorithms. Since 1999, he has been working at the Philips Digital Systems Laboratory, Eindhoven, on different topics related to digital signal processing of audio signals.

Leon van de Kerkhof was born in 1958 in Eindhoven, The Netherlands. He received the B.S. degree in electrical engineering in 1981 from the Eindhoven Institute of Technology and the M.S. degree from the Eindhoven University of Technology in 1987. In 1978, he joined Philips Research Laboratories, where he worked on acoustic noise control and the use of adaptive filters in acoustics. In 1987, he moved to Philips Consumer Electronics, working on the development and implementation of the MPEG-1 and MPEG-2 audio standards, digital audio broadcasting (DAB), and super audio CD. Currently, he is manager of the Sound Coding Group of the Philips Digital Systems Laboratory, Eindhoven.
