Transcoded Speech Contemporary Objective Quality Measurements Reliability

July 15, 2017 | Author: Jan Holub | Category: Quality Measures

Jan Holub, Ľubica Blašková
Department of Measurement 13138, FEE CTU, Prague, Czech Republic
[email protected]

Abstract
Speech transcoding (coder tandeming) is an unavoidable source of transmission quality degradation in ad-hoc or permanently interoperating networks. This paper describes the test methodology and the results of voice transmission quality measurements for transcoded voice in coder tandem arrangements. The objective tests have been performed using the ITU-T P.862 (PESQ) and P.563 (3SQM) algorithms. Subjective test results based on ITU-T P.800 are also presented. The objective results confirm the additional degradation caused by the above-mentioned transcoding and highlight the coder combinations that cause extreme impairments. The subjective tests quantify the accuracy of the objective measurements.

1. Introduction
Voice transmission in any call in a telecommunication network is affected by many impairments, including delay, echo, various kinds of noise, speech (de)coding distortions and artifacts, and temporal and amplitude clipping. Each transmission impairment has a certain perceptual impact on the speech transmission quality. The overall quality can be evaluated and expressed in terms of a Mean Opinion Score (MOS) covering the range from 1 (bad) to 5 (excellent). To measure voice transmission quality, objective methods such as ITU-T P.862 [2] are widely deployed, both to compare different coding and transmission technologies and to monitor network performance. The traditionally proven but expensive subjective methods [1], in which human listeners assess many speech samples, have been partially replaced by objective measurements based on digital signal processing algorithms. These either compare the original undistorted signal with the transmitted one [2] (so-called intrusive or double-sided algorithms) or process only the transmitted version [3]. All these methods have been designed and tested on past and contemporary telecommunication transmission standards that are widely used in common mobile and fixed telecommunication networks, e.g. those using 'toll quality' voice encoding. The application of objective methods based on digital signal processing to any other area, such as special radio communication networks that deploy low bit-rate speech coding and transcoding, must be carefully verified by proper testing and by comparing the results with subjective assessment.

978-1-4244-1870-1/08/$25.00 (c)2008 IEEE

1.1. Listening and Conversational Tests
A trivial method of measuring the quality of transmitted voice would be to ask callers for their opinion after a call has been made. Due to the obvious practical problems of this approach, listening and conversational tests have been standardized instead as the methods for subjective determination of transmission quality. These tests relate real-world distortions, recreated in a laboratory environment, to the subjectively perceived quality. For example, Recommendation P.800 [1] describes approved methods that are considered suitable for determining how satisfactorily given connections may be expected to perform. It contains recommended subjective evaluation procedures for conversational and listening-only tests.

1.2. Intrusive Objective Measurements
Intrusive measurements of speech transmission quality usually require special test calls generated by the measurement system, and the original (undistorted) speech sample must be available to the measurement algorithm. The algorithm compares the original and transmitted speech samples, then identifies and integrates the perceptual differences between them. Known psycho-acoustic aspects of human hearing (loudness perception, frequency resolution and sensitivity of the human ear, temporal and frequency masking, etc.) should be modeled by the algorithm so that it estimates the subjectively perceived quality as the MOS value that would have been obtained in a listening test. A typical example of an intrusive algorithm is PESQ [2],[4]. The correlation coefficient between the PESQ MOS estimate and the corresponding MOS from formal listening tests is above 0.9 in most cases. PESQ was validated for various transmission and coding technologies, including commercial mobile networks and Voice over Internet Protocol (VoIP) transmissions, generally using coders with bitrates higher than 4 kb/s.

1.3. Non-Intrusive Objective Measurements
Passive monitoring of ongoing calls in the network is the basic principle of 3SQM (ITU-T P.563 [3]). 3SQM (Single-Sided Speech Quality Measurement) combines three non-intrusive algorithms and achieves a correlation coefficient with listening tests of around 0.8.

2. Tasks Performed and Results
To verify the applicability of existing objective speech transmission quality measurement algorithms to transcoded speech, the following tasks had to be performed:
• Selection of coders and their combinations
• Transcoded database recording
• Objective testing
• Subjective testing

2.1. Selection of Coder Tandems and Recording
A database of speech sentences has been recorded under two background noise conditions (no noise and Hoth noise at +10 dB SNR) on coder tandems (and triples) as per Table 1. The combinations have been selected according to scenarios that may occur in reality on permanent or ad-hoc interconnections of networks. Several networks and test systems were used for this work: a deployable Tetra network using ACELP coders, dedicated PR transceivers deploying MELPe coders, and a VoIP test bed enabling real-time simulations of GSM FR and G.729 coders. The original voice sample contained 4 short sentences in the Czech language, spoken by four talkers (two male and two female) and recorded in a studio environment. The voice samples were preceded by an initial training period to enable the Automatic Gain Control circuits to settle correctly. Each sentence was recorded 5 times for each transmission/impairment setting (see Table 1).

2.2. Objective Testing
Two up-to-date measurement methods have been applied to the recorded samples: PESQ-LQ (ITU-T P.862 [2], with the scores additionally re-processed in accordance with P.862.1) as a widely accepted [4] example of an intrusive method, and 3SQM (ITU-T P.563 [3]) as an advanced non-intrusive measurement algorithm. Each of the 5 recordings (of each sentence for any given transmission combination) has been evaluated separately and the average score has been calculated. The 95% confidence intervals (CI95%) have also been calculated and are shown in the graphs.

Table 1. Selected coder combinations

Single coders tested:
  ACELP
  MELPe
  GSM Full Rate
  G.729

Coder tandems tested:
  ACELP - MELPe
  MELPe - ACELP
  ACELP - GSM
  GSM - ACELP
  ACELP - VoIP G.729
  VoIP G.729 - ACELP
  MELPe - VoIP G.729
  VoIP G.729 - MELPe
  MELPe - GSM
  GSM - MELPe

Three-coder combinations tested:
  MELPe - VoIP G.729 - ACELP
  ACELP - VoIP G.729 - MELPe
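The averaging over 5 recordings and the CI95% computation described in Section 2.2 can be sketched as follows. The paper does not state how the confidence interval was computed, so this sketch assumes a two-sided Student t interval on the sample standard deviation; the function name `mean_and_ci95` and the example scores are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student t quantile for 4 degrees of freedom
# (n = 5 recordings per condition). Assumption: the paper's CI95%
# may have been computed differently.
T_CRIT_95 = {4: 2.776}


def mean_and_ci95(scores):
    """Return (mean, half-width of the 95% confidence interval)."""
    n = len(scores)
    if n - 1 not in T_CRIT_95:
        raise ValueError("no t quantile stored for this sample size")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, half_width


# Hypothetical objective scores of 5 recordings of one condition
scores = [2.04, 2.08, 2.01, 2.10, 2.07]
mean, half = mean_and_ci95(scores)
print(f"MOS = {mean:.2f} +/- {half:.3f} (CI95%)")
```

With only 5 recordings per condition, the t quantile is noticeably larger than the normal value of 1.96, which widens the interval accordingly.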

For the PESQ tests, the final results have been recalculated by 2nd-order polynomial regression as recommended in P.862.3. For the 3SQM tests, no such recalculation was meaningful due to the very low correlation with subjective tests (see Table 6). Therefore, the 3SQM results are presented as the original (raw) algorithm output values.

Table 2. Objective test results (no background noise)
MOS-LQOn1: PESQ-LQ, regressed; MOS-LQOn2: 3SQM, not regressed

Technology           MOS-LQOn1   STD1    MOS-LQOn2   STD2
ACELP                3,69        0,069   2,17        0,224
MELPe                2,42        0,035   4,53        0,289
GSM                  4,07        0,008   3,49        0,060
G.729                4,28        0,004   4,02        0,108
ACELP-MELPe          2,06        0,045   3,86        0,302
MELPe-ACELP          2,61        0,031   3,96        0,154
ACELP-GSM            3,84        0,023   3,30        0,102
GSM-ACELP            3,63        0,044   3,63        0,135
ACELP-G.729          3,94        0,023   3,66        0,153
G.729-ACELP          3,40        0,042   3,65        0,110
MELPe-G.729          3,01        0,028   4,48        0,111
G.729-MELPe          2,77        0,038   4,26        0,103
MELPe-GSM            2,94        0,043   4,09        0,227
GSM-MELPe            2,69        0,032   4,14        0,097
MELPe-G.729-ACELP    2,79        0,031   3,69        0,136
ACELP-G.729-MELPe    2,42        0,054   3,83        0,173

Table 3. Objective test results (Hoth +10dB SNR)
MOS-LQOn1: PESQ-LQ, regressed; MOS-LQOn2: 3SQM, not regressed

Technology           MOS-LQOn1   STD1    MOS-LQOn2   STD2
ACELP                3,17        0,072   2,73        0,259
MELPe                1,74        0,030   1,24        0,159
GSM                  2,46        0,004   2,93        0,313
G.729                2,76        0,024   3,27        0,260
ACELP-MELPe          1,85        0,043   2,70        0,232
MELPe-ACELP          1,82        0,056   2,32        0,208
ACELP-GSM            3,12        0,010   2,82        0,079
GSM-ACELP            2,89        0,014   2,98        0,129
ACELP-G.729          3,20        0,009   3,01        0,153
G.729-ACELP          2,90        0,007   3,16        0,114
MELPe-G.729          2,04        0,022   2,27        0,173
G.729-MELPe          2,01        0,016   3,05        0,217
MELPe-GSM            1,96        0,025   1,85        0,223
GSM-MELPe            1,88        0,022   2,23        0,279
MELPe-G.729-ACELP    2,19        0,031   2,56        0,233
ACELP-G.729-MELPe    2,11        0,025   3,08        0,254
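The 2nd-order polynomial regression applied to the PESQ results in Section 2.2 can be illustrated with a small least-squares fit. This is only a sketch under assumptions: it fits a quadratic mapping from objective to subjective scores via the normal equations, which may differ in detail from the procedure in P.862.3. The function names and the example scores are hypothetical, since the paper does not list the raw PESQ values.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y ~ a*x**2 + b*x + c; returns (a, b, c)."""
    s = [sum(xi ** k for xi in x) for k in range(5)]  # s[k] = sum of x^k
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented normal-equation matrix for the unknowns (c, b, a)
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("need at least 3 distinct x values")
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def apply_quadratic(coeffs, x):
    """Map one objective score through the fitted polynomial."""
    a, b, c = coeffs
    return a * x * x + b * x + c


# Hypothetical raw objective scores and matching subjective MOS values
raw_objective = [1.8, 2.3, 2.9, 3.4, 3.9]
subjective = [1.3, 2.2, 3.1, 3.6, 4.2]
coeffs = fit_quadratic(raw_objective, subjective)
regressed = [apply_quadratic(coeffs, v) for v in raw_objective]
print([round(v, 2) for v in regressed])
```

Such a mapping removes systematic offset and curvature between the two scales for a given experiment, which is why it is applied before objective and subjective results are compared.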

2.3. Subjective Testing
Subjective tests as per ITU-T P.800 [1] have always been performed on the first recording of the given condition, as expert listening confirmed that all 5 recordings of each condition are subjectively of equal quality. The purpose of the tests was to validate the applicability of the objective algorithms for assessing the quality of the given coder combinations.

The subjective listening-only tests have been performed in a critical listening room where up to 8 listeners can be seated. The reverberation time of the room is 185 ms and the natural background noise is less than 10 dB SPL (A). Multiple sessions have been run, always with different listeners. In total, 38 votes per sample have been obtained.

Table 4. Subjective test results (no background noise)

Technology           MOS-LQSn   CI95%
ACELP                4,25       0,166
MELPe                2,21       0,184
GSM                  3,64       0,198
G.729                4,00       0,188
ACELP-MELPe          1,30       0,121
MELPe-ACELP          2,89       0,199
ACELP-GSM            3,68       0,199
GSM-ACELP            3,85       0,191
ACELP-G.729          4,22       0,204
G.729-ACELP          3,63       0,203
MELPe-G.729          3,44       0,200
G.729-MELPe          2,46       0,168
MELPe-GSM            3,08       0,197
GSM-MELPe            2,33       0,189
MELPe-G.729-ACELP    3,13       0,203
ACELP-G.729-MELPe    1,75       0,182

Table 5. Subjective test results (Hoth +10dB SNR)

Technology           MOS-LQSn   CI95%
ACELP                3,09       0,162
MELPe                1,12       0,079
GSM                  2,49       0,221
G.729                2,82       0,189
ACELP-MELPe          1,26       0,149
MELPe-ACELP          1,28       0,149
ACELP-GSM            3,04       0,163
GSM-ACELP            2,85       0,184
ACELP-G.729          3,04       0,189
G.729-ACELP          2,79       0,170
MELPe-G.729          1,51       0,146
G.729-MELPe          1,41       0,171
MELPe-GSM            1,53       0,170
GSM-MELPe            1,46       0,149
MELPe-G.729-ACELP    1,66       0,157
ACELP-G.729-MELPe    1,55       0,141

The correlation of the subjective results with the results of the objective tests shows the accuracy of the objective methods described above.

Table 6. Subjective versus objective test results

                            PESQ (P.862 + P.862.1, regressed)   3SQM (P.563)
Correlation                 0,836                               0,370
Maximum pos. difference     1,060                               3,288
Maximum neg. difference     -1,550                              -2,722
RMSE                        0,560                               1,072
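The agreement measures reported for the objective versus subjective comparison (correlation, maximum positive and negative difference, RMSE) can be computed as in the sketch below. Two points are assumptions rather than statements from the paper: the correlation is taken to be the Pearson coefficient, and differences are taken as objective minus subjective. The function name `agreement_metrics` is hypothetical. The example uses only the four single-coder rows of the no-noise condition, so its output is not expected to reproduce the values reported for the full data set.

```python
import math


def agreement_metrics(objective, subjective):
    """Return (pearson_r, max_pos_diff, max_neg_diff, rmse) for paired scores."""
    if len(objective) != len(subjective) or len(objective) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    n = len(objective)
    mean_o = sum(objective) / n
    mean_s = sum(subjective) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(objective, subjective))
    var_o = sum((o - mean_o) ** 2 for o in objective)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_o == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant scores")
    r = cov / math.sqrt(var_o * var_s)
    # Assumed sign convention: objective minus subjective
    diffs = [o - s for o, s in zip(objective, subjective)]
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return r, max(diffs), min(diffs), rmse


# Single coders, no background noise: ACELP, MELPe, GSM, G.729
pesq_lq = [3.69, 2.42, 4.07, 4.28]     # regressed PESQ-LQ scores
mos_lqs = [4.25, 2.21, 3.64, 4.00]     # subjective MOS-LQSn scores
r, max_pos, max_neg, rmse = agreement_metrics(pesq_lq, mos_lqs)
print(f"r={r:.3f} max+={max_pos:.2f} max-={max_neg:.2f} RMSE={rmse:.3f}")
```

Reporting the extreme differences alongside the correlation is useful because a high correlation can still hide individual conditions where the objective estimate is off by more than one MOS point.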

[Figure 1: plot titled "PESQ-LQ after 2-nd order regression"; x-axis: MOS-LQSn (1 to 5), y-axis: MOS-LQOn (PESQ-LQ, regr.) (1 to 5)]

Figure 1. Results of PESQ-LQ after 3-rd order regression versus subjective test results

[Figure 2: plot titled "3SQM (no regression)"; x-axis: MOS-LQSn (1 to 5), y-axis: MOS-LQOn (3SQM, no regr.) (1 to 5)]

Figure 2. Results of 3SQM (without regression) versus subjective test results

3. Conclusions
It is evident that all tandem setups perform with decreased speech transmission quality in comparison with cases where only single coders are used. It is also worth noting that both directions (coders "A-to-B" and "B-to-A") must always be tested, as the final transmission quality may differ significantly (see e.g. ACELP-MELPe or ACELP-G.729-MELPe). The comparison between subjective and objective results shows that neither PESQ-LQ nor 3SQM can be used reliably for objective voice QoS monitoring in the case of multiple coder tandeming where at least one low bit-rate coder is used. However, PESQ-LQ after proper regression shows at least a reasonable correlation with the subjective data (0.84).

Acknowledgment
This work has been supported by the Czech Ministry of Education: MSM 6840770014 "Research in the Area of the Prospective Information and Navigation Technologies" and by WINTSEC (Wireless Interoperability for Security), a project under the EU's PASR 2006. The authors would like to thank C3A NATO for invaluable help and support during the project.

References
[1] ITU-T Rec. P.800, "Methods for subjective determination of transmission quality", International Telecommunication Union, Geneva, 1996.
[2] ITU-T Rec. P.862, "Perceptual Evaluation of Speech Quality", International Telecommunication Union, Geneva, 2001.
[3] ITU-T Rec. P.563, "Single-ended method for objective speech quality assessment in narrowband telephony applications", International Telecommunication Union, Geneva, 2004.
[4] Pennock, S.: "Accuracy of the Perceptual Evaluation of Speech Quality (PESQ) Algorithm", MESAQIN 2002, Praha, CTU.
[5] Street, M. and Collura, J.: "Interoperable voice communications: test and selection of STANAG 4591", RTO-IST conf. on 'Military communications', Warsaw, Poland, 2001.
[6] Holub, J., Street, M., Šmíd, R.: "Intrusive Speech Transmission Quality Measurements for Low Bit Rate Coded Audio Signals", AES115 Convention, New York, October 2003.
