Automatic transcription of piano music by sparse representation of magnitude spectra




Cheng-Te Lee, Yi-Hsuan Yang, and Homer Chen
National Taiwan University
[email protected], [email protected], [email protected]

ABSTRACT

Assuming that the waveforms of piano notes are pre-stored and that the magnitude spectrum of a piano signal segment can be represented as a linear combination of the magnitude spectra of the pre-stored piano waveforms, we formulate the automatic transcription of polyphonic piano music as a sparse representation problem. First, the note candidates of the piano signal segment are found by using heuristic rules. Then, the sparse representation problem is solved by l1-regularized minimization, followed by temporal smoothing of the frame-level results based on hidden Markov models. Evaluation against three state-of-the-art systems using ten classical music recordings of a real piano is performed to show the performance improvement of the proposed system.

Index Terms— F0 estimation, multiple pitch estimation, sparse representation, l1-regularized minimization.

1. INTRODUCTION

Music transcription is the process of converting a musical recording to a musical score (i.e., a symbolic representation). In the strict musical sense, this process entails a number of complicated tasks to extract musical attributes, including the pitch, starting time, duration, instrument type, and dynamics of each note, as well as the tempo, meter signature, and tonality (key signature) of the music. In its simplest form, music transcription reduces to the estimation of the pitch and timing of each individual note. Such information, although only a small subset of the full set of musical attributes, is useful for content-based music retrieval tasks such as query by humming, query by example, and cover song identification. This paper deals mainly with this simplified version of music transcription.

Like the work described in [1]-[3], we limit the audio source to piano (either real or synthesized). Although only a single type of audio source is considered, the problem is by no means trivial. In fact, Peeters [4] showed that the transcription of piano music is more challenging than that of other instruments because pianos have a wide pitch range and high polyphony. We are also motivated to tackle the transcription of piano music because the piano is the most popular musical instrument worldwide. The fact that usually no more than six of the 88 notes of a piano are played concurrently inspires us to use sparse representation to solve the transcription problem.

(This work was supported by the National Science Council of Taiwan under contract NSC 97-2221-E-002-111-MY3.)

1.1 Previous work

Music signals can be monophonic or polyphonic. For polyphonic music, one has to deal with issues such as source number ambiguity and octave ambiguity that do not arise in monophonic music, so polyphonic music transcription is the more complicated task. Monophonic music transcription is relatively easy and mature [5], but polyphonic music transcription still has much room for improvement.

Moorer [6] made the first attempt to transcribe polyphonic music. Although his system was limited to duets, it inspired many interesting ideas. For example, Marolt [1] applied neural networks and oscillator networks to track partials, Klapuri [7] adopted an iterative estimation and cancellation scheme to estimate concurrent fundamental frequencies (F0s), Bello et al. [2] applied a rule-based framework to find meaningful groups of spectral peaks, Poliner et al. [3] adopted a machine learning approach to determine whether a particular note is present, and Duan et al. [8] presented a probabilistic approach to fundamental frequency estimation. These systems used batch learning to learn the relationships between the spectra of music signals and the underlying notes; the resulting models therefore cannot adapt to different types of music without retraining. Moreover, generating the ground truth data for learning is time-consuming [9]. Abdallah et al. [10] and Cont [11] introduced a sparsity constraint into their transcription systems, but each used only one base element per note. As shown by Wright et al. [12], the performance of sparse representation depends strongly on the overcompleteness of the dictionary of base elements. Here, sparse representation refers to an expression of the input signal as a linear combination of base elements such that the resulting representation has only a few non-zero terms.

To address these issues, we propose a system that adapts to different pianos without retraining: it requires only that WAVE [13] files of a dozen individual notes of a new piano be added to the piano sound database. In addition, the system uses more than one base element per note to improve transcription performance.

Fig. 1. The proposed automatic piano transcription system.

Table 1. Tuning algorithm

Input: magnitude spectrum M of a sample, tuning factor τ
Output: tuned magnitude spectrum Mt

1. For i = 1 to the dimension of M
2.   Mt[i] = 0
3. For i = 1 to the dimension of M
4.   n = round((i − 1) × τ) + 1
5.   /* round(x) rounds x to the nearest integer */
6.   Mt[n] = Mt[n] + M[i]
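For concreteness, here is a minimal NumPy sketch of the Table 1 procedure. The bounds check on the target bin is our addition; the pseudocode leaves out-of-range bins (which can occur when τ > 1) unspecified.

```python
import numpy as np

def tune_spectrum(M: np.ndarray, tau: float) -> np.ndarray:
    """Retune a magnitude spectrum M by the tuning factor tau (Table 1).

    tau = 1.0 corresponds to standard tuning (A4 = 440 Hz).
    """
    Mt = np.zeros_like(M)
    for i in range(len(M)):          # 0-based form of the 1-based loop in Table 1
        n = int(round(i * tau))      # bin i maps to bin round(i * tau)
        if 0 <= n < len(Mt):         # guard against out-of-range bins (our addition)
            Mt[n] += M[i]
    return Mt
```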

1.2 Organization of the paper

The organization of this paper is as follows. In Section 2, we propose a piano transcription system based on sparse representation of magnitude spectra that requires no model training. In Section 3, the proposed system is applied to transcribe ten recordings of solo piano music [3] and compared with previous approaches. Section 4 concludes this paper.

2. PROPOSED SYSTEM

Fig. 2. The original and the tuned spectra of a sample of note C4 (MIDI number 60). The estimated tuning factor is 0.982.

Fig. 1 shows the schematic diagram of our transcription system. The system takes a piano WAVE file as input and generates the transcription in MIDI file format [14]. The system first normalizes the root-mean-square (RMS) amplitude of the input WAVE file and estimates the tuning factor of the piano; the samples in the piano sound database are then tuned accordingly. During this preprocessing stage, the system also decomposes the audio content of the normalized WAVE file into short-time frames and applies the fast Fourier transform (FFT) to each frame. Note candidates are selected according to the resulting spectrum and the tuning factor. Then, the sparse representation of the frame spectrum in terms of the spectra of the note candidates is computed. Because we formulate the sparse representation as an l1-regularized minimization problem, the solution contains some small non-zero coefficients that are equivalent to noise; the system eliminates such noise by thresholding. Finally, the system applies temporal smoothing to the resulting frame-level sparse representation to generate the transcription. The details of each component are described below.

2.1 Data preprocessing

Fig. 3. Illustration of the peak detection algorithm.


Before the transcription process starts, a data preprocessing step is performed. First, a database of piano note waveforms with standard tuning is built. There is at least one waveform for each piano note in the database, and there can be several waveforms generated by different pianos for the same note. Each waveform is divided into short-time samples, and the energy of each sample is normalized. Then, the RMS amplitude of the input piano WAVE file is normalized, and the tuning of the piano is estimated using the method proposed in [1], which uses adaptive oscillators to find partials in the input WAVE file and calculates the tuning factor by comparing the frequencies of the partials to the frequencies of the notes of a piano with standard tuning (i.e., A4 = 440 Hz). After the tuning factor is obtained, the samples in the database are tuned using the algorithm shown in Table 1. Note that the tuning factor is a real number and that its value is 1.0 for standard tuning. Fig. 2 shows the difference between the original and the tuned spectra of a note sample. The spectra of the tuned samples serve as the base elements of the sparse representation.

Fig. 4. Spectra of note D4 (MIDI number 62) of two different pianos. The magnitude of the fundamental frequency is marked by a circle. (a) Strong-fundamental type. (b) Weak-fundamental type.

2.2 Note candidate selection

For each frame, the system selects a number of note candidates to reduce the search space of the sparse representation coefficients and, in turn, the time complexity of computing the sparse representation. First, the system detects significant spectral peaks in a manner similar to the one proposed in [15], illustrated in Fig. 3. A smoothed spectrum is obtained by convolving the original spectrum with a moving Gaussian filter. A frequency bin is considered a significant spectral peak if its magnitude satisfies two requirements: the original magnitude must be higher than a predefined global threshold, and (a local requirement) the original magnitude must exceed that of its smoothed version by a predefined margin.

In addition to peak detection, the system determines the frequency of each note. The frequency f of the nth key of a piano is given by

    f = (440 × τ) × 2^((n − 49) / 12),    (1)

where τ is the tuning factor.
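The following Python sketch implements Eq. (1) and the two-requirement peak test described above. The use of scipy's gaussian_filter1d with the Table 2 radius as its sigma, and the default threshold values (taken from our reconstruction of Table 2), are assumptions rather than the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def key_frequency(n: int, tau: float = 1.0) -> float:
    """Frequency of the n-th piano key (n = 1..88) under tuning factor tau, Eq. (1)."""
    return (440.0 * tau) * 2.0 ** ((n - 49) / 12.0)

def significant_peaks(spectrum: np.ndarray,
                      global_threshold: float = 1.0,  # Table 2 value (reconstructed)
                      margin: float = 0.75,           # Table 2 value (reconstructed)
                      radius: float = 10.0) -> np.ndarray:
    """Indices of bins satisfying both peak requirements: above the global
    threshold, and above the Gaussian-smoothed spectrum by the margin."""
    smoothed = gaussian_filter1d(spectrum, sigma=radius)  # moving Gaussian filter
    mask = (spectrum > global_threshold) & (spectrum > smoothed + margin)
    return np.flatnonzero(mask)
```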

The system at this point can select note candidates by the following two heuristic rules:

• If the magnitude of a significant spectral peak at frequency fp is higher than the magnitudes of the spectrum at the integer multiples (harmonics) of fp, the note whose frequency is nearest to the spectral peak is selected as a candidate.
• Following the first rule, if the magnitude of the spectrum at fp/2 is higher than a predefined threshold, the note whose frequency is nearest to fp/2 is also selected as a candidate; the same procedure is applied for fp/3.

The two rules are designed for the two types of harmonic series of piano notes: strong-fundamental and weak-fundamental. As shown in Fig. 4, the spectra of the same note on different pianos may look different. Fig. 4(a) shows a note of the strong-fundamental type, whose spectral envelope has its maximum magnitude at the fundamental frequency (marked by a circle). The first rule detects notes of this type, while the second rule detects notes whose spectral envelope has its maximum magnitude at one of the harmonics, as in Fig. 4(b); such notes belong to the weak-fundamental type. Yeh et al. [16] give a similar classification of harmonic series. A code sketch of the two rules is given below.
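This sketch builds on key_frequency() and significant_peaks() above. The set of harmonics checked (2fp, 3fp, 4fp), the nesting of the second rule inside the first, and the sub-harmonic threshold are our reading of the prose, not a specification from the paper.

```python
import numpy as np

def select_candidates(spectrum: np.ndarray, freqs: np.ndarray,
                      peak_bins: np.ndarray, tau: float = 1.0,
                      sub_threshold: float = 0.75) -> list:
    """Map significant peaks to candidate piano keys (1..88) via the two rules."""
    note_freqs = np.array([key_frequency(n, tau) for n in range(1, 89)])
    candidates = set()
    for b in peak_bins:
        fp = freqs[b]
        # Rule 1: the peak dominates its harmonics -> nearest note is a candidate.
        harm_bins = [int(np.argmin(np.abs(freqs - h * fp)))
                     for h in (2, 3, 4) if h * fp <= freqs[-1]]
        if all(spectrum[b] > spectrum[hb] for hb in harm_bins):
            candidates.add(1 + int(np.argmin(np.abs(note_freqs - fp))))
            # Rule 2: strong energy at fp/2 (or fp/3) suggests a weak fundamental.
            for d in (2, 3):
                sb = int(np.argmin(np.abs(freqs - fp / d)))
                if spectrum[sb] > sub_threshold:
                    candidates.add(1 + int(np.argmin(np.abs(note_freqs - fp / d))))
    return sorted(candidates)
```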

2.3 Computation of sparse representation

We assume that the magnitude spectrum of a frame can be represented as a linear combination of the base elements, which are the spectra of samples of the note candidates in the tuned database. However, the coefficients of such a linear combination may not be unique because of linear dependencies among the base elements. To address this issue, we impose a sparsity constraint requiring that the number of non-zero coefficients be minimal. The sparsity constraint is reasonable because, as argued in [12], "the sparsest representation is naturally discriminative." For the transcription problem dealt with in this paper, the constraint implies that the spectrum of a frame is most compactly expressed using the base elements of the notes present in the frame. Fig. 5 illustrates the idea of sparse representation.

Denote the dictionary of base elements for the ith key of a piano by Ai = [ai,1 | ai,2 | … | ai,Ni], where ai,k is a column vector containing the magnitudes of the Fourier coefficients of the kth short-time sample of that key and Ni is the number of such samples. Let the column vector y contain the magnitudes of the Fourier coefficients of a frame. The problem is then to find a sparse coefficient vector x* such that

    x* = arg min_x ||x||_0  subject to  y = Ax,    (2)

where A contains the dictionaries of the note candidates. Because finding the solution of (2) is NP-hard, it is not practical to solve it directly.

Fig. 5. Illustration of the sparse representation of a frame spectrum for music transcription. (a) The magnitude spectrum of a frame. (b) The coefficients of the sparse representation. (c) The base elements. (d) The residues.

Fortunately, Donoho [17] demonstrated that if the solution of (2) is sparse enough, it is equal to the solution of the following l1-minimization problem:

    x* = arg min_x ||x||_1  subject to  y = Ax.    (3)

It is shown in [18] that the solution of (3) is close to that of the following l1-regularized minimization problem:

    x* = arg min_x ||y − Ax||_2^2 + λ ||x||_1,    (4)

where the regularization parameter λ is a non-negative real number. We use the method described in [19] to solve (4). After obtaining the sparse coefficient vector x* of (4), we consider a note present in the frame if the sum of the coefficients corresponding to that note, its activation index, is larger than a threshold.
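The paper solves (4) with the method of [19]; purely to illustrate the objective, here is a generic iterative soft-thresholding (ISTA) sketch that minimizes the same functional. The step-size choice and iteration count are our assumptions.

```python
import numpy as np

def ista_l1(A: np.ndarray, y: np.ndarray, lam: float = 100.0,
            n_iter: int = 500) -> np.ndarray:
    """Minimize ||y - Ax||_2^2 + lam * ||x||_1 by iterative soft thresholding.

    A stand-in for the interior-point solver of [19]; same objective as Eq. (4).
    """
    L = np.linalg.norm(A, 2) ** 2          # squared spectral norm of A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)     # gradient of the quadratic term
        z = x - grad / (2.0 * L)           # gradient step (step size 1 / (2L))
        x = np.sign(z) * np.maximum(np.abs(z) - lam / (2.0 * L), 0.0)  # shrinkage
    return x
```

With A the concatenated dictionaries of the note candidates and y the frame's magnitude spectrum, a note would then be reported for the frame when its activation index (the sum of its coefficients in x) exceeds the threshold.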

2.4 Temporal smoothing

The FFT described above treats the short-time frames independently, leaving the temporal structure of the music unexploited. To address this issue, we use two-state (on and off) hidden Markov models (HMMs) to model the attack and decay of each note [3]. For each note, we want to maximize

    ∏_t p(x_t | q_t) p(q_t | q_{t−1}),    (5)

where q_t is the state at time t, x_t is the frame beginning at time t, p(x_t | q_t) is the probability of x_t given q_t, and p(q_t | q_{t−1}) is the transition probability between states. Although we do not know p(x_t | q_t), from conditional probability we have

    p(q_t | x_t) ∝ p(x_t | q_t) p(q_t).    (6)

Therefore, we can maximize

    ∏_t [ p(q_t | x_t) p(q_t | q_{t−1}) / p(q_t) ]    (7)

instead of (5). p(q_t | x_t) is obtained by dividing the activation index of the note by the maximum activation index of x_t. Both the prior p(q_t) and the state transition probability p(q_t | q_{t−1}) can be learned from training data. Note that in this problem formulation, the learning concerns the music structure rather than the timbre characteristics of a particular piano. We apply the Viterbi algorithm to find the solution of (7). After the temporal smoothing by HMMs, the system generates the transcription of the input WAVE file in MIDI format.
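A log-domain Viterbi sketch for the per-note two-state smoothing of (7). The array layout and the log-domain computation (to avoid numerical underflow) are our implementation choices; the paper only specifies the quantities involved.

```python
import numpy as np

def viterbi_two_state(posterior: np.ndarray, prior: np.ndarray,
                      trans: np.ndarray) -> np.ndarray:
    """Best on/off state path for one note under Eq. (7).

    posterior[t, s] ~ p(q_t = s | x_t), prior[s] ~ p(q_t = s),
    trans[i, j] ~ p(q_t = j | q_{t-1} = i); states: 0 = off, 1 = on.
    """
    eps = 1e-12
    T = posterior.shape[0]
    emit = np.log(posterior + eps) - np.log(prior + eps)   # log p(q|x)/p(q)
    log_tr = np.log(trans + eps)
    score = np.empty((T, 2))
    back = np.zeros((T, 2), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_tr              # cand[i, j]
        back[t] = np.argmax(cand, axis=0)                  # best previous state
        score[t] = cand[back[t], np.arange(2)] + emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score[-1]))                   # backtrack from the end
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```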

Table 2. Parameter values used in the proposed system

Parameter                                                            Value
The RMS amplitude for the normalization of the input WAVE file       0.25
The length of each base element (|ai,k|)                             1.0
The radius of the moving Gaussian filter                             10
The global threshold for the selection of significant spectral peaks 1.0
The minimal margin for the selection of significant spectral peaks   0.75
Regularization parameter (λ)                                         100.0

3. EVALUATION

We evaluate the performance of the proposed system against three state-of-the-art systems [1], [7], [20] using ten one-minute classical music recordings (in the form of WAVE files) of a Yamaha Disklavier playback grand piano [3]. The executable files of [1] and [7] were provided by their authors, and the evaluation result of [20] was provided by its author. Each recording comes with a MIDI file that serves as the ground truth. There are 4952 notes in the recordings, and their average polyphony is 3.5. Klapuri's system [7] achieved the best performance in the Multiple Fundamental Frequency Estimation & Tracking task of the Music Information Retrieval Evaluation eXchange (MIREX, http://www.music-ir.org/mirex/) in 2008, and Yeh's system [20] achieved the best performance in the same task in 2010.

3.1 Experiment set-up


Table 3. Result of the frame-based evaluation

                     F-measure   Precision   Recall
Proposed system      70.2%       74.4%       66.5%
Marolt's system      66.1%       78.6%       57.1%
Klapuri's system     62.2%       72.4%       54.6%

Table 4. Result of the note-based evaluation

                     F-measure   Precision   Recall
Proposed system      73.0%       74.6%       71.6%
Yeh's system         67.1%       57.2%       81.1%

Fig. 6. Evaluation results in terms of the polyphony of the frames. The numbers in parentheses are the relative frequencies of the polyphony. (a) F-measure. (b) Precision. (c) Recall.

The waveforms in the database are synthesized by two commercial software tools, and three different piano timbres are used (Piano 1 and Piano 2 of Native Instruments' Bandstand, and Acoustic Piano 1 of Arobas's Guitar Pro). There are 646 base elements in the database, and each note has six to eight corresponding base elements. The test WAVE files are monaural, and the sampling rate is 8 kHz. The frame size for the FFT is 100 milliseconds, and the hop size between successive frames is 10 milliseconds. The training data used for learning the prior and the transition probabilities is also provided in [3]; note that the training data is disjoint from the test data. The values of the parameters of the proposed system are listed in Table 2.

3.2 Frame-based evaluation

We use three typical metrics to evaluate the transcription performance, namely precision, recall, and F-measure, defined as follows:

    Precision = Ntp / (Ntp + Nfp),    (8)

    Recall = Ntp / (Ntp + Nfn),    (9)

    F-measure = (2 × Precision × Recall) / (Precision + Recall),    (10)

where Ntp is the number of correctly transcribed notes (true positives), Nfp is the number of unvoiced notes transcribed as voiced (false positives), and Nfn is the number of voiced notes transcribed as unvoiced (false negatives), counted over all frames.
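A direct transcription of Eqs. (8)-(10) in Python, with guards against empty denominators added for safety:

```python
def precision_recall_f(n_tp: int, n_fp: int, n_fn: int):
    """Evaluation metrics per Eqs. (8)-(10)."""
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) > 0 else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) > 0 else 0.0
    f_measure = (2.0 * precision * recall / (precision + recall)
                 if (precision + recall) > 0 else 0.0)
    return precision, recall, f_measure
```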

Table 3 shows the average results of the frame-based evaluation. The proposed system achieves a better F-measure, which is a combined metric of recall and precision. It is also observed that the high recall of our system comes at the cost of slightly lower precision than Marolt's system [1]. Fig. 6 shows the details of the transcription performance with respect to polyphony. Both the F-measure and the recall of the proposed system are consistently better than those of Klapuri's system [7] and Marolt's system. Compared with Klapuri's system, the proposed system has better precision for polyphony 2, 4, and 6 or more; compared with Marolt's system, it has better precision for polyphony 2. As mentioned, there is a tradeoff between precision and recall, and the best tradeoff is achieved when the F-measure is maximized. To further improve the F-measure, a more sophisticated method that learns the music structure, say, the relationships between concurrent notes, should be investigated in future work.

3.3 Note-based evaluation

In addition to the frame-based evaluation, we also evaluate the performance of the proposed system in a note-based manner and compare it with Yeh's system [20]. A note output by the system is considered correctly transcribed if its MIDI number matches the ground truth and the system switches it on within 100 milliseconds of the onset specified in the ground truth MIDI file. The evaluation is performed for each of the ten music recordings described earlier. At the end of the evaluation, the number of false negatives is obtained by subtracting the number of correctly transcribed notes from the total number of ground truth notes, and the number of false positives is obtained by subtracting the number of correctly transcribed notes from the total number of notes output by the system. We do not take note offsets into consideration because they have little perceptual importance [3].
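A sketch of the note-based counting described above; the greedy one-to-one matching is our assumption, since the paper does not spell out how multiple near-coincident notes are paired.

```python
def match_notes(est, ref, tol: float = 0.1):
    """Count true/false positives and false negatives for the note-based criterion.

    est, ref: lists of (midi_number, onset_seconds); a match requires the same
    MIDI number and an onset within tol seconds (100 ms) of the ground truth.
    """
    used = [False] * len(ref)
    n_tp = 0
    for midi, onset in est:
        for j, (r_midi, r_onset) in enumerate(ref):
            if not used[j] and midi == r_midi and abs(onset - r_onset) <= tol:
                used[j] = True          # each reference note matches at most once
                n_tp += 1
                break
    return n_tp, len(est) - n_tp, len(ref) - n_tp   # tp, fp, fn
```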

Table 4 shows the result of the note-based evaluation. The F-measure and precision of the proposed system are better than those of Yeh's system. In summary, the performance of the proposed system is consistent and unbiased, and the improvement over the three state-of-the-art systems is significant under the one-tailed t-test (p-value

