Saturday, May 31, 2014

[paper] Joint noise adaptive training for robust automatic speech recognition

Link to the paper:

This paper studied
1) an alternative way of using the output of speech separation to improve ASR performance;
2) training strategies that unify separation and the backend acoustic modeling.

Microsoft's noise-aware training (NAT) was proposed to improve the noise robustness of DNNs by giving them an estimate of the noise. However, it uses a rather crude estimate, obtained by averaging the first and last few frames of each utterance. The noise statistics are then simply appended to the original input features to form the input to the new DNN.
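To make the NAT-style input concrete, here is a minimal NumPy sketch of the crude estimate described above. The function name `append_crude_noise_estimate` and the parameter `n_edge` (how many frames to take from each end) are my own placeholders, not names or values from either paper.

```python
import numpy as np

def append_crude_noise_estimate(feats, n_edge=10):
    """Append a per-utterance noise estimate to every frame.

    feats  : (T, D) array of per-frame features for one utterance.
    n_edge : hypothetical number of frames taken from each end of the
             utterance (the actual value is not given in this post).
    """
    feats = np.asarray(feats, dtype=float)
    k = min(n_edge, feats.shape[0])
    # Average the first k and the last k frames into one (D,) vector.
    edge_frames = np.concatenate([feats[:k], feats[-k:]], axis=0)
    noise = edge_frames.mean(axis=0)
    # Repeat the same noise vector for every frame and append it.
    tiled = np.tile(noise, (feats.shape[0], 1))
    return np.concatenate([feats, tiled], axis=1)  # (T, 2 * D)
```

The same noise vector is attached to every frame of the utterance, so this estimate cannot follow noise that changes over time.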

In this paper, the authors use their speech separation module, which estimates ideal ratio masks (IRMs), to compute better noise statistics. Given an estimate of the IRM, $\boldsymbol{m}(t)$, the following speech and noise estimates can be derived:

Noise estimation: $\boldsymbol{n}(t) = ( 1 - \boldsymbol{m}(t) ) \odot \boldsymbol{y}(t)$

Noise removed speech estimation: $\boldsymbol{x}(t) = \boldsymbol{m}(t)^{\alpha} \odot \boldsymbol{y}(t)$

Clean speech estimation: $\bar{\boldsymbol{x}}(t) = f(\boldsymbol{x}(t), \boldsymbol{y}(t))$

where $\boldsymbol{y}(t)$ is the original noisy speech feature vector and $\odot$ denotes element-wise multiplication. $\alpha$ is a tunable exponent (<1) that scales up the IRM estimates, thereby reducing the distortion introduced by masking; in this work, however, $\alpha$ was set to 1. $f(\cdot)$ is the reconstruction function that undoes the distortion introduced by channel or microphone mismatch between training and testing.
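The first two estimates are simple enough to sketch in a few lines of NumPy. This is only an illustration: the function name `mask_estimates` is mine, I assume $\boldsymbol{y}(t)$ holds non-negative spectral-domain features, and I clip the mask to $[0, 1]$ as a safety step. The reconstruction function $f$ is left out because its form is not described here.

```python
import numpy as np

def mask_estimates(y, m, alpha=1.0):
    """Return (noise estimate n, noise-removed speech estimate x).

    y     : (T, D) noisy features, assumed non-negative (spectral domain).
    m     : (T, D) estimated ideal ratio mask.
    alpha : exponent applied to the mask; 1.0 matches the setting
            reported in the post.
    """
    y = np.asarray(y, dtype=float)
    # Clipping is an added safeguard, not something stated in the post.
    m = np.clip(np.asarray(m, dtype=float), 0.0, 1.0)
    n = (1.0 - m) * y        # n(t) = (1 - m(t)) ⊙ y(t)
    x = (m ** alpha) * y     # x(t) = m(t)^alpha ⊙ y(t)
    return n, x
```

With `alpha=1.0` the two estimates sum back to `y`. With `alpha < 1` the mask values move toward 1, so less of the noisy input is suppressed.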

The Aurora4 baseline system reported in this paper achieves a word error rate (WER) of 11.7% with a 7H-1024D DNN (ReLU hidden layers, no RBM pre-training, dropout). The authors attribute the gains mainly to their DNN training frame labels, which are obtained by aligning the corresponding clean training set rather than the noisy data itself.

The authors also showed that their noise estimates work slightly better than the crude noise estimate adopted in the Microsoft paper, mainly in the noisy + channel-mismatched conditions.

The final best Aurora4 performance of 11.1% was obtained by averaging two systems.

The joint training is formulated by treating the processing steps (masking, applying the log, sentence-level mean normalization, adding deltas, splicing, and global MVN) as DNN layers. The whole system is then treated as a single DNN, and the classification error is back-propagated all the way to the input of the speech separation module.
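To see why these steps can act as layers, it helps to write the front end as a chain of differentiable array operations. The NumPy sketch below shows only the forward pass. The function names, the two-point delta formula, the edge padding, the context width, and the `eps` floor inside the log are all my assumptions, not details from the paper, and I add only first-order deltas for brevity.

```python
import numpy as np

def deltas(feats):
    """First-order deltas as a simple two-point difference (an assumption)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[2:] - padded[:-2])

def splice(feats, context):
    """Stack each frame with `context` neighbours on both sides."""
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    shifted = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(shifted, axis=1)

def front_end(y, m, global_mean, global_std, context=5, eps=1e-8):
    """Forward pass from noisy features and mask to the acoustic-model input.

    y, m        : (T, D) noisy features and estimated mask.
    global_mean : (2 * D * (2 * context + 1),) training-set mean.
    global_std  : same shape, training-set standard deviation.
    """
    x = m * y                                         # masking
    logx = np.log(x + eps)                            # log compression
    logx = logx - logx.mean(axis=0, keepdims=True)    # sentence-level mean norm
    feats = np.concatenate([logx, deltas(logx)], axis=1)  # add deltas
    spliced = splice(feats, context)                  # context splicing
    return (spliced - global_mean) / global_std       # global MVN
```

Every operation here is either linear or an element-wise smooth function, so an automatic differentiation framework could propagate the classification error through `front_end` back to the mask `m`, and from there into the separation network.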