Paper Link.

This paper addresses the problem of reverberant speech recognition by extending a noise-robust feature enhancement method based on NMF.

While speech recognition in noisy environments has been widely studied, many proposed systems are limited by the underlying assumption that the observed signal is an additive mixture of speech and noise, often with the noise having spectral characteristics unlike those of speech. The distortions introduced by the multiple reflected signals inherent in reverberation do not fit this model well.

Bounded conditional mean imputation is used to reconstruct the unreliable regions, under the assumption that the observed value is an upper bound for the signal of interest.
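
To illustrate the upper-bound idea, here is a minimal sketch for a single univariate Gaussian prior on the clean feature. This is a simplification made for illustration only: the function name `bounded_conditional_mean` and the single-Gaussian prior are assumptions, and the imputation model actually used in the paper is not reproduced here. The estimate is the mean of the prior truncated to values at or below the observation.

```python
import math


def bounded_conditional_mean(mu, sigma, upper):
    """Mean of N(mu, sigma^2) restricted to x <= upper.

    Hypothetical helper: a one-dimensional, single-Gaussian illustration
    of imputing an unreliable feature with the observation as upper bound.
    """
    beta = (upper - mu) / sigma
    # Standard normal density and cumulative distribution at beta.
    pdf = math.exp(-0.5 * beta * beta) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(beta / math.sqrt(2.0)))
    if cdf < 1e-12:
        # Almost no prior mass lies below the bound; fall back to the bound.
        return upper
    # Truncated-normal mean for an upper truncation point.
    return mu - sigma * pdf / cdf
```

When the observation lies far above the prior mean, the estimate stays close to the prior mean; when it lies below, the estimate is pulled under the observed value, so it never exceeds the bound.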

Two types of masks are evaluated:

1) AGC feature based masks

Denote the $b$-th Mel channel component of frame $t$ of the reverberant observation as $y(t, b)$. First, $y(t,b)$ is compressed by raising it to the power of 0.3 and then processed with a band-pass modulation filter with 3 dB cutoff frequencies of 1.5 Hz and 8.2 Hz. Automatic gain control is then applied, followed by a normalization that subtracts a channel-specific constant chosen so that the minimum value of each channel over a single utterance is 0. The resulting feature is referred to as the AGC feature, $y_{bp}^{agc}(t,b)$, and the mask is defined as:

\[
m_R (t, b) = \left \{
\begin{array}{l l}
1 & \text{if} \quad y_{bp}^{agc}(t,b) > \theta(b), \\
0 & \text{otherwise},
\end{array} \right .
\]

where the threshold $\theta(b)$ for Mel channel $b$ is selected for each utterance based on the "blurredness" metric $B$ as

\[
\theta(b) = \gamma \frac{\frac{1}{N} \sum_{t=1}^N y_{bp}^{agc} (t, b)}{1 + \exp (-\alpha (B - \beta))}.
\]
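
The threshold and mask definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `agc_mask` and the nested-list layout `agc[t][b]` are assumptions, the AGC features and the blurredness value $B$ are taken as already computed, and no particular values of $\gamma$, $\alpha$, or $\beta$ are implied.

```python
import math


def agc_mask(agc, blurredness, gamma, alpha, beta):
    """Compute per-channel thresholds theta(b) and the binary mask m_R.

    agc[t][b] holds the AGC feature for frame t and Mel channel b
    (hypothetical layout). Returns (mask, theta).
    """
    n_frames = len(agc)
    n_channels = len(agc[0])
    # Sigmoid of the blurredness metric, shared by all channels.
    sigmoid = 1.0 / (1.0 + math.exp(-alpha * (blurredness - beta)))
    theta = []
    for b in range(n_channels):
        # Utterance-level mean of the AGC feature in channel b.
        channel_mean = sum(agc[t][b] for t in range(n_frames)) / n_frames
        theta.append(gamma * channel_mean * sigmoid)
    # A time-frequency unit is marked reliable (1) when it exceeds theta(b).
    mask = [
        [1 if agc[t][b] > theta[b] else 0 for b in range(n_channels)]
        for t in range(n_frames)
    ]
    return mask, theta
```

Since the sigmoid grows with $B$ (for positive $\alpha$), a more blurred utterance gets a higher threshold, so fewer units are labeled reliable.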

2) A computational auditory model motivated mask

Reverberation tails are located in the signal $y(t,b)$ by first estimating the smoothed temporal envelope in each channel, $y_{lp} (t - \tau_d, b)$, using a second-order low-pass Butterworth filter with a cutoff frequency of 10 Hz, and then identifying regions for which the derivative $y_{lp}' (t - \tau_d, b) < 0$. The parameter $\tau_d$ corrects for the filter delay. The amount of energy in each decaying region of one frequency channel is quantified by

\[
L(t,b) = \left \{
\begin{array}{l l}
\frac{1}{|n(t, b)|} \sum_{k\in n(t,b)} y(k, b) & \text{if} \quad y_{lp}' (t - \tau_d, b) < 0, \\
0 & \text{otherwise},
\end{array} \right .
\]

where $n(t,b)$ is the set of contiguous time indices around $t$ where the derivative for channel $b$ remains negative. Under the assumption that reverberant signals result in greater $L(t,b)$ values than dry speech, the $m_{LP}$ mask is defined as:

\[
m_{LP}(t,b) = \left \{
\begin{array}{l l}
1 & \text{if} \quad L(t,b) < \theta_{LP}, \\
0 & \text{otherwise}.
\end{array} \right.
\]
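
A minimal sketch of the decay-region energy and the resulting mask for a single channel is given below. It makes several assumptions: the smoothed envelope `y_lp` is taken as already low-pass filtered and delay-compensated, the derivative is approximated by a first difference, and the names `decay_energy` and `lp_mask` are hypothetical.

```python
def decay_energy(y, y_lp):
    """Mean energy of y over each contiguous decaying region of y_lp.

    y and y_lp are equal-length lists for one channel. Frames outside
    decaying regions get 0, matching the piecewise definition.
    """
    n = len(y)
    # First-difference approximation of the envelope derivative (assumption).
    deriv = [0.0] + [y_lp[t] - y_lp[t - 1] for t in range(1, n)]
    energy = [0.0] * n
    t = 0
    while t < n:
        if deriv[t] < 0:
            start = t
            # Extend the region while the derivative stays negative.
            while t < n and deriv[t] < 0:
                t += 1
            mean_energy = sum(y[start:t]) / (t - start)
            for k in range(start, t):
                energy[k] = mean_energy
        else:
            t += 1
    return energy


def lp_mask(energy, theta_lp):
    """Mark a frame reliable (1) when its decay energy is below theta_lp."""
    return [1 if value < theta_lp else 0 for value in energy]
```

With a positive threshold, frames outside decaying regions have zero decay energy and are therefore always labeled reliable; only decaying regions with high average energy are masked out.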

The paper uses GMM- and SVM-based mask estimators to estimate these two types of masks.

The mask estimators are trained on a subset of the multi-condition training set, along with the corresponding clean speech signals.