
Joint Optimization of Neural Network-based WPE Dereverberation and Acoustic Model for Robust Online ASR

Abstract:
Signal dereverberation using the Weighted Prediction Error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. First proposed as an iterative algorithm, follow-up works have reformulated it as a recursive least squares algorithm and therefore enabled its use in online applications. For this algorithm, the estimation of the power spectral density (PSD) of the anechoic signal plays an important role and strongly influences its performance. Recently, we showed that using a neural network PSD estimator leads to improved performance for online automatic speech recognition. This, however, comes at a price. To train the network, we require parallel data, i.e., utterances simultaneously available in clean and reverberated form. Here we propose to overcome this limitation by training the network jointly with the acoustic model of the speech recognizer. To be specific, the gradients computed from the cross-entropy loss between the target senone sequence and the acoustic model network output are backpropagated through the complex-valued dereverberation filter estimation to the neural network for PSD estimation. Evaluation on two databases demonstrates improved performance for online processing scenarios while imposing fewer requirements on the available training data and thus widening the range of applications.
Date of Conference: 12-17 May 2019
Date Added to IEEE Xplore: 16 April 2019
Publisher: IEEE
Conference Location: Brighton, United Kingdom

SECTION 1.

INTRODUCTION

Reverberation has a severe impact on the intelligibility of a speech signal and, despite all recent advances in acoustic modeling, still deteriorates the performance of automatic speech recognition (ASR) systems significantly, even when they are trained on large-scale data [1], [2]. In these challenging far-field scenarios, signal processing to dereverberate and thus enhance the signal can help to mitigate the performance losses. Many techniques have been proposed for signal dereverberation; they can be broadly categorized into linear filtering approaches and spectral-subtraction-like approaches which manipulate the magnitude or power spectrum [3].

Weighted prediction error (WPE) falls into the first category: it dereverberates the signal by estimating an inverse filter which is used to subtract the reverberation tail from the observation. It can operate either on a single channel or in a multiple-input multiple-output fashion on multi-channel data. The quality of the estimated filter coefficients mainly depends on the estimation of the PSD of the "anechoic speech", i.e., the direct speech signal and its early reflections. Since this signal is unknown, the vanilla WPE works iteratively by alternating between two steps: (Step 1) dereverberating the signal using the current estimate of the anechoic speech PSD, and (Step 2) estimating the anechoic speech PSD using the current estimate of the dereverberated signal. Alternating these two steps gradually improves the estimates of both the (dereverberated) target signal and the anechoic speech PSD. This, however, inherently makes the vanilla WPE an offline method and computationally expensive.

To overcome this dependency issue – and enable an online usage of WPE – we recently proposed to utilize a neural network to directly estimate the PSD from the observation [4], [5]. We could show that this leads to improved performance for low-latency solutions compared to a simpler PSD estimation [6]. However, we now need parallel reverberated and non-reverberated data in order to train the PSD estimation network, limiting the applicability of the approach.

In this work, we lift this restriction by combining the WPE front-end with the acoustic model already during training. This allows us to train the estimator directly with a suitable ASR loss. Apart from an expected performance improvement, as we saw with, e.g., beamforming [7], the motivation for this is threefold:

  1. Calculating a training target for the PSD estimator requires a corpus of parallel data. Building such a corpus is practically only possible by simulating the observed data, inevitably leading to a mismatch between training and test data.

  2. There is no clear notion of which part of the signal should be considered anechoic or, respectively, the target signal. Thus, the training target for the PSD estimator is not well-defined. When trained jointly, in contrast, the network can adjust to the needs of the acoustic model.

  3. If other directed noise sources are present, their contribution to the covariance statistics can be controlled by an appropriate weighting of the time-frequency bins in question.

We investigate the performance of such a joint system and compare it with one separately trained on oracle PSD information.

SECTION 2.

SCENARIO AND SIGNAL MODEL

Using $D$ microphones, we observe a signal which is represented as the $D$-dimensional vector $\mathbf{y}_{t,f}$ at time frame index $t$ and frequency bin index $f$ in the short-time Fourier transform (STFT) domain. In a far-field scenario, this signal is impaired by (convolutive) reverberation. We assume that, for ASR, the early part of the room impulse response (RIR) is beneficial, whereas the reverberation tail deteriorates recognition and should therefore be suppressed. Specifically, we consider the first 50 ms after the main peak of the RIR ($h^{(\text{early})}$) to contribute to the anechoic signal, whereas the remaining part ($h^{(\text{tail})}$) is assumed to cause the distortions. In the STFT domain we denote this model as follows:

$$y_{t,f} = x^{(\text{early})}_{t,f} + x^{(\text{tail})}_{t,f}, \tag{1}$$

where $x^{(\text{early})}_{t,f}$ and $x^{(\text{tail})}_{t,f}$ are the STFTs of the source signal convolved with the early part of the RIR and with the late reflections, respectively. Note that we explicitly allow RIRs longer than the length of a DFT window.
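To make the model concrete, the following sketch (not from the paper) splits a measured RIR at 50 ms after its main peak and forms the two STFT components of Eq. (1); the RIR `h`, the dry source signal `s`, and the sampling rate `fs` are assumed to be given:

```python
import numpy as np
from scipy.signal import fftconvolve, stft

def split_rir(h, fs, early_ms=50):
    """Split an RIR at early_ms after its main peak into early part and tail."""
    peak = np.argmax(np.abs(h))
    cut = peak + int(early_ms * fs / 1000)
    h_early, h_tail = h.copy(), h.copy()
    h_early[cut:] = 0.0  # direct path + early reflections
    h_tail[:cut] = 0.0   # reverberation tail to be suppressed
    return h_early, h_tail

h_early, h_tail = split_rir(h, fs)
x_early = fftconvolve(s, h_early)[:len(s)]
x_tail = fftconvolve(s, h_tail)[:len(s)]
# STFTs with the window/shift used later in the experiments (512 / 128 samples)
_, _, X_early = stft(x_early, fs, nperseg=512, noverlap=512 - 128)
_, _, X_tail = stft(x_tail, fs, nperseg=512, noverlap=512 - 128)
Y = X_early + X_tail  # Eq. (1): the observed reverberant STFT
```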

SECTION 3.

WEIGHTED PREDICTION ERROR

WPE estimates the reverberation tail of the signal from previous samples and subtracts it from the observation to obtain an optimal estimate of the anechoic speech in a maximum likelihood sense:

$$\hat{x}^{(\text{early})}_{t,f,d} = y_{t,f,d} - \mathbf{g}^{\mathsf{H}}_{f,d}\,\tilde{\mathbf{y}}_{t-\Delta,f}, \tag{2}$$

where $\mathbf{g}_{f,d}$ and $\tilde{\mathbf{y}}_{t-\Delta,f}$ are stacked representations of the filter taps and the (delayed) observation, respectively, and $d$ is the microphone index.

Using a delay $\Delta \geq 1$ avoids whitening of the speech source. WPE maximizes the likelihood of the model under the assumption that the anechoic signal is a realization of a zero-mean circularly-symmetric complex Gaussian with an unknown time-varying variance $\lambda_{t,f}$.

3.1. Iterative WPE

There is no closed-form solution for the likelihood optimization, but there exists an iterative procedure which alternates between estimating the filter coefficients $\mathbf{g}_{f,d}$ and the time-varying variance $\lambda_{t,f}$:

Step 1)

$$R_f = \sum_t \frac{\tilde{\mathbf{y}}_{t-\Delta,f}\,\tilde{\mathbf{y}}^{\mathsf{H}}_{t-\Delta,f}}{\lambda_{t,f}}, \tag{3}$$

$$\mathbf{p}_{f,d} = \sum_t \frac{\tilde{\mathbf{y}}_{t-\Delta,f}\,y^{*}_{t,f,d}}{\lambda_{t,f}}, \tag{4}$$

$$\mathbf{g}_{f,d} = R_f^{-1}\,\mathbf{p}_{f,d}. \tag{5}$$

Step 2)

$$\lambda_{t,f} = \frac{1}{(\delta + 1 + \delta)\,D} \sum_{\tau = t-\delta}^{t+\delta} \sum_{d} \bigl|\hat{x}^{(\text{early})}_{\tau,f,d}\bigr|^2. \tag{6}$$

The heuristic context of (δ + 1 + δ) frames helps to improve the variance estimate in this iterative scheme [8].
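A compact, unoptimized sketch of the two alternating steps for a single frequency bin may help to make Eqs. (2) – (6) concrete; the small diagonal loading for numerical stability is our addition and not part of the paper's equations:

```python
import numpy as np

def iterative_wpe_bin(Y, taps=10, delay=3, iterations=3, delta=1, eps=1e-10):
    """One frequency bin of iterative WPE. Y: (T, D) complex STFT frames."""
    T, D = Y.shape
    # stacked delayed observations y~_{t-Delta,f}, shape (T, taps*D)
    Y_tilde = np.zeros((T, taps * D), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        Y_tilde[shift:, k * D:(k + 1) * D] = Y[:T - shift]
    X = Y.copy()  # initial dereverberated estimate
    for _ in range(iterations):
        # Step 2 (Eq. 6): PSD over (delta+1+delta) frames and all D microphones
        power = np.abs(X) ** 2
        lam = np.array([power[max(0, t - delta):t + delta + 1].mean()
                        for t in range(T)]) + eps
        # Step 1 (Eqs. 3-5): weighted correlations and filter coefficients
        R = (Y_tilde.T / lam) @ Y_tilde.conj() + eps * np.eye(taps * D)  # Eq. (3)
        P = (Y_tilde.T / lam) @ Y.conj()       # Eq. (4), all microphones at once
        G = np.linalg.solve(R, P)              # Eq. (5), columns are g_{f,d}
        X = Y - Y_tilde @ G.conj()             # Eq. (2)
    return X
```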

3.2. Recursive WPE

To derive a recursive formulation, the correlation matrix is estimated with a decaying window:

$$R_{t,f} = \sum_{\tau=0}^{t} \alpha^{t-\tau}\, \frac{\tilde{\mathbf{y}}_{\tau-\Delta,f}\,\tilde{\mathbf{y}}^{\mathsf{H}}_{\tau-\Delta,f}}{\lambda_{\tau,f}}. \tag{7}$$

This leads to a recursive solution with the following rank-one updates [9]:

$$\mathbf{K}_{t,f} = \frac{R^{-1}_{t-1,f}\,\tilde{\mathbf{y}}_{t-\Delta,f}}{\alpha\,\lambda_{t,f} + \tilde{\mathbf{y}}^{\mathsf{H}}_{t-\Delta,f}\,R^{-1}_{t-1,f}\,\tilde{\mathbf{y}}_{t-\Delta,f}}, \tag{8}$$

$$R^{-1}_{t,f} = \frac{1}{\alpha}\left(R^{-1}_{t-1,f} - \mathbf{K}_{t,f}\,\tilde{\mathbf{y}}^{\mathsf{H}}_{t-\Delta,f}\,R^{-1}_{t-1,f}\right), \tag{9}$$

$$G_{t,f} = G_{t-1,f} + \mathbf{K}_{t,f}\,\hat{\mathbf{x}}^{(\text{early})\,\mathsf{H}}_{t,f}. \tag{10}$$

Here, $G_{t,f}$ denotes the stacked filter taps $\mathbf{g}_{f,d}$ for each microphone. Note that these are now time-variant. This is in essence a Recursive Least Squares (RLS) adaptive filter for the reverberation estimation. The authors of [6] approximate the PSD of the target signal by a smoothed PSD of the observation, averaged over the microphones, using a left and right context $\delta_L$ and $\delta_R$:

$$\lambda_{t,f} = \frac{1}{D}\,\frac{1}{\delta_L + 1 + \delta_R} \sum_{\tau = t-\delta_L}^{t+\delta_R} \sum_{d} \bigl|y_{\tau,f,d}\bigr|^2. \tag{11}$$
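The following sketch illustrates the frame-by-frame recursion of Eqs. (8) – (11) for a single frequency bin; the identity initialization of $R^{-1}$ and the use of the a priori prediction error in the filter update are standard RLS choices and assumptions on our part:

```python
import numpy as np

def recursive_wpe_bin(Y, taps=10, delay=3, alpha=0.9999,
                      delta_L=1, delta_R=0, eps=1e-10):
    """One frequency bin of recursive WPE (Eqs. 8-11). Y: (T, D) complex."""
    T, D = Y.shape
    KD = taps * D
    R_inv = np.eye(KD, dtype=Y.dtype)     # inverse correlation matrix estimate
    G = np.zeros((KD, D), dtype=Y.dtype)  # time-variant stacked filter taps
    X = np.zeros_like(Y)
    power = np.abs(Y) ** 2
    for t in range(T):
        # Eq. (11): smoothed observation PSD as estimate of lambda_{t,f}
        lam = power[max(0, t - delta_L):t + delta_R + 1].mean() + eps
        # stacked delayed observation y~_{t-Delta,f}
        y_tilde = np.zeros(KD, dtype=Y.dtype)
        for k in range(taps):
            idx = t - delay - k
            if idx >= 0:
                y_tilde[k * D:(k + 1) * D] = Y[idx]
        X[t] = Y[t] - G.conj().T @ y_tilde  # Eq. (2) with the previous G_{t-1,f}
        # rank-one RLS updates
        Ry = R_inv @ y_tilde
        K = Ry / (alpha * lam + y_tilde.conj() @ Ry)                    # Eq. (8)
        R_inv = (R_inv - np.outer(K, y_tilde.conj() @ R_inv)) / alpha   # Eq. (9)
        G = G + np.outer(K, X[t].conj())                                # Eq. (10)
    return X
```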

SECTION 4.

PROPOSED FRAMEWORK

4.1. PSD estimation

Given the statistics $\lambda_{t,f}$ of the underlying anechoic signal, the optimal filter coefficients for WPE can be calculated in closed form with Eq. (5) or adaptively with Eq. (10). But since we can only observe the reverberant signal, these statistics have to be estimated. Consistent with [4] and [5], we focus on a deep neural network (DNN) for PSD estimation.

In particular, we use the same network architecture as in the works above. The network consists of a long short-term memory (LSTM) layer with 512 units, two linear layers with 2048 units and ReLU activation functions, and a final linear layer with 513 units. It operates on a single channel, and the final estimate is obtained by averaging over all channels, making the approach independent of the number of channels and the microphone configuration.
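A sketch of this estimator in current TF/Keras notation (the paper used TensorFlow r1.10); the layer sizes are taken from the text, while the input features and all remaining details are assumptions:

```python
import tensorflow as tf

F_BINS = 513  # frequency bins of a 512-point DFT

psd_estimator = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, return_sequences=True,
                         input_shape=(None, F_BINS)),  # (time, frequency) input
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(F_BINS),  # per-bin estimate related to lambda_{t,f}
])

# Channel independence: apply the network to each channel separately and
# average, e.g.: lam = tf.reduce_mean(tf.stack([psd_estimator(c) for c in channels]), 0)
```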

As a baseline, we consider estimating λt,f by a comparably simple smoothing of the spectrum as specified by Eq. 11 which has also shown good performance in [6] and [5] (with δL = 1 and δR = 0).

4.2. Acoustic model

Our acoustic model is a wide bi-directional residual network (WBRN) as proposed in [5]. It consists of several convolutional layers with residual connections, followed by two BLSTM layers and two linear layers. The hyper-parameters as well as the initial training procedure were adapted from [10]. The model is trained on frame-wise senone targets and shows very competitive performance on the task at hand. Note that the acoustic model itself operates offline, since we focus on the effects of the front-end, but it can be replaced by an online version to achieve a fully online system.
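Since the actual hyper-parameters come from [10], the following skeleton is only meant to illustrate the described layer sequence; all widths, kernel sizes, layer counts, and the senone inventory size are placeholders, not the values of [10]:

```python
import tensorflow as tf
L = tf.keras.layers

def residual_block(x, channels):
    y = L.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = L.Conv2D(channels, 3, padding="same")(y)
    return L.ReLU()(L.Add()([x, y]))

feats = L.Input(shape=(None, 80, 1))   # (time, feature bins, 1); features assumed
x = L.Conv2D(64, 3, padding="same", activation="relu")(feats)
for _ in range(3):                     # "several" residual conv layers (count assumed)
    x = residual_block(x, 64)
x = L.TimeDistributed(L.Flatten())(x)  # per-frame feature vector
for _ in range(2):                     # the two BLSTM layers
    x = L.Bidirectional(L.LSTM(512, return_sequences=True))(x)
x = L.Dense(1024, activation="relu")(x)  # first of the two linear layers
logits = L.Dense(2000)(x)                # frame-wise senone logits (size assumed)
acoustic_model = tf.keras.Model(feats, logits)
```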

4.3. Training

The acoustic model is first trained using multi-condition data of the respective corpus until convergence. For the DNN-based PSD estimator, we train different variants.

The first one (A1) is our baseline and we train the PSD estimator separately as described in [5] and [4]. The anechoic speech PSD is used as the target for a mean-squared-error (MSE) training.

For the second one (B1), we utilize the acoustic model loss to fine-tune the separately trained estimator. This model still needs the parallel data (for the initialization) but might result in improved overall system performance, as we directly optimize for the ASR target.

Third (C1), we train a PSD estimator from random initialization using the state-level cross-entropy (CE) loss but keep all parameters of the acoustic model fixed, i.e., the acoustic model merely serves as a loss function for the PSD estimator.

Note that we backpropagate either through the offline equations (Eq. 3 – Eq. 5) or through the recursive formulation (Eq. 8 – Eq. 10), respectively. Because the backpropagation through the online variant needs a lot of memory, this calculation always runs on the CPU, where we can utilize the system memory. Since it is also computationally very expensive, we do not train it directly from scratch but rather first train the offline system from scratch and then switch to the online variant after an initial training phase.
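A hedged sketch of variant (C1) in modern TensorFlow (the paper used r1.10): the senone cross-entropy gradient flows through the dereverberation into the PSD estimator while the acoustic model parameters stay fixed; `wpe_dereverb` and `feature_extraction` are hypothetical placeholders for a differentiable WPE op and the feature pipeline:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step_c1(Y, senone_targets):
    """Variant (C1): CE gradients flow through WPE into the PSD estimator."""
    with tf.GradientTape() as tape:
        # Y: a single-channel batch of complex STFT frames is assumed here
        lam = psd_estimator(tf.math.log(tf.abs(Y) + 1e-10))
        with tf.device("/cpu:0"):  # recursive backprop is memory hungry (see text)
            X = wpe_dereverb(Y, lam)  # hypothetical differentiable WPE op
        logits = acoustic_model(feature_extraction(X))  # AM weights stay fixed
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=senone_targets, logits=logits))
    # only the PSD estimator is updated; the AM merely provides the loss
    grads = tape.gradient(loss, psd_estimator.trainable_variables)
    optimizer.apply_gradients(zip(grads, psd_estimator.trainable_variables))
    return loss
```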

Finally, we fine-tune the acoustic model for all variants with a learning rate of $10^{-5}$ using the respective WPE front-end. These systems are referred to as (A2) – (C2), each of which corresponds to the variant (A1) – (C1). If applicable, we jointly optimize both models in this step. Otherwise, just the acoustic model is fine-tuned. To increase the diversity of the training data, we sample the delay $\Delta$ to be in a range between 1 and 4 and the number of taps $K$ to be between 5 and 10 during this step, as sketched below.
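The per-batch sampling of the WPE hyper-parameters during this step could look as follows; `wpe_dereverb` and its keyword arguments (reusing names from the previous sketch) remain hypothetical placeholders:

```python
import random

delay = random.randint(1, 4)   # Delta sampled from {1, ..., 4}
taps = random.randint(5, 10)   # K sampled from {5, ..., 10}
X = wpe_dereverb(Y, lam, taps=taps, delay=delay)
```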

4.4. Implementation

All models were implemented in TensorFlow r1.10. We use the publicly available WPE implementation of [11]. For joint training, we found it crucial to use 128-bit complex precision (i.e., 64 bits each for the real and imaginary part) for the complex values involved in the calculation of WPE.
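In a current TensorFlow sketch, this precision requirement amounts to casting before and after the WPE graph (again with the hypothetical `wpe_dereverb` and names from the sketches above):

```python
import tensorflow as tf

Y128 = tf.cast(Y, tf.complex128)   # 64-bit real + 64-bit imaginary parts
lam64 = tf.cast(lam, tf.float64)
X = tf.cast(wpe_dereverb(Y128, lam64), tf.complex64)  # back to single precision
```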

SECTION 5.

EVALUATION

To demonstrate the versatility of the described approach, we evaluate the proposed systems in terms of WERs on the data of the REVERB challenge as well as on WSJ+VoiceHome data.

The REVERB challenge dataset [12] contains simulated and real utterances. For the simulated data, WSJCAM0 utterances [13] are convolved with measured RIRs. Noise is added at 20 dB signal-to-noise ratio (SNR). Reverberation times (T60) are in the range of 0.25 – 0.7 s. The real data consists of utterances from the MC-WSJ-AV corpus [14], which were recorded in a noisy reverberant room with a reverberation time of 0.7 s. The corpus is known for its mismatch between the simulated data used during training and the real recordings for evaluation. To reduce this discrepancy, we randomly sample the SNR to be in the range of 5 dB – 30 dB and scale the signal with 0.2 for the training of the PSD estimators and all fine-tuning experiments [4]. The initial acoustic model is trained on unscaled data without SNR perturbation.

For WSJ+VoiceHome, we convolve WSJ utterances (5 k vocabulary) with VoiceHome RIRs and add VoiceHome background noise [15], with reverberation times (T60) in the range of 395 – 585 ms. This is similar to the simulation setup proposed by Bertin et al. [16]. The VoiceHome background noise is very dynamic and typical of households, e.g., a vacuum cleaner, dish washing, or interviews on television; the SNR ranges from 0 dB to 10 dB.

We evaluate the performance for two and eight microphone channels. Since WPE preserves the number of channels, we always take the first one and use it for decoding with the acoustic model. For decoding, we use the 3-gram language model from the WSJ0 corpus without any rescoring afterwards.

Table 1. WERs/% for all systems evaluated on REVERB (real data eval set, averaged over near and far) and WSJ+VoiceHome. For the MSE loss, parallel data is needed to calculate the anechoic PSD target, while the CE loss uses senone targets.

For WPE we use a DFT window size of 512 (32 ms) and a shift of 128 (8 ms). For the recursive WPE variant, we set α = 0.9999, and for vanilla WPE the number of iterations to 3. For all variants, we vary the delay parameter in a range between 1 and 4, and the number of filter taps is set to either 5 or 10. These values are determined on the development set and can differ for each configuration.

Our baselines for evaluation are Unprocessed, Iteration and Smooth. These use the first channel without any enhancement, vanilla WPE iterations, and the smoothing PSD estimator (see Eq. 11), respectively. They are compared with the neural network variants trained as described in Subsec. 4.3.

These systems are evaluated for two different latency constraints (where applicable): offline and online. Offline means that the whole utterance is available for processing; this is our baseline scenario. For the online setting, which is our target scenario, we use the recursive formulation of WPE (Eq. 8 – Eq. 10) and the system operates on a frame-by-frame basis.

All results are shown in Tbl. 1. First, note that the unprocessed baseline itself already achieves very good performance. For comparison, the recently updated Kaldi system achieves a WER of around 19.7 % on the REVERB dataset. All fine-tuned systems improve upon the unprocessed baseline irrespective of the PSD estimator or the latency constraint, showing the effectiveness of a WPE front-end. As another general tendency, we can see that the DNN-supported systems outperform the baselines, especially in the online case we were focusing on. For the offline use-case, though, using the vanilla WPE formulation with iterations seems to be most suitable, especially when considering the system complexity.

For the online use-case, however, using a DNN-based PSD estimator and training it with the ASR criterion is more effective. One reason for this might be that the PSD can be implicitly tuned for faster convergence of the filter estimation in this case. The best results are achieved if the PSD estimator is pre-trained using parallel data and then fine-tuned jointly with the acoustic model. If no parallel data is available, the PSD estimator can also be trained from scratch with little to no loss in performance. All of this applies to both tested corpora, and therefore to highly reverberant as well as noisy, reverberant household-like scenarios.

SECTION 6.

CONCLUSIONS

In this paper we demonstrate that jointly optimizing a DNN-based PSD estimator and the acoustic model improves the performance of online dereverberation with WPE by 8 % – 18 % in highly reverberant as well as noisy reverberant scenarios, compared to a baseline smoothing PSD estimator. Being able to backpropagate through the filter estimation additionally lifts the requirement for parallel training data and potentially allows training the PSD estimator on real data to reduce the mismatch between training and inference, with minimal impact on the overall performance.

ACKNOWLEDGEMENTS

This work was in part supported by a Google Faculty Research Award. Computational resources were provided by the Paderborn Center for Parallel Computing.
