Paper.

This is the paper that proposed the error back-propagation algorithm for neural networks. It was originally published in Nature.

## Thursday, June 12, 2014

## Sunday, June 1, 2014

### [paper] Recognition of reverberant speech by missing data imputation and NMF feature enhancement

Paper Link.

This paper addressed the problem of reverberant speech recognition by extending a noise-robust feature enhancement method based on NMF.

While speech recognition in noisy environments has been widely studied, many proposed systems are limited by the underlying assumption that the observed signal is an additive mixture of speech and noise, often with the latter having spectral characteristics unlike those of speech. The distortions introduced by the multiple reflected signals inherent in reverberation do not fit this model well.

Bounded conditional mean imputation is used to reconstruct the unreliable regions, under the assumption that the observed value is an upper bound for the signal of interest.
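As a toy illustration (my own construction, not the paper's estimator), the "bounded" part can be sketched by clipping whatever estimate we have for an unreliable bin from above by the observed value. The per-channel reliable mean below is only a placeholder for the paper's conditional mean under a trained speech prior:

```python
import numpy as np

def bounded_imputation(y, mask):
    """Toy sketch of bounded imputation for unreliable spectral bins.

    y:    (T, B) observed Mel-spectral magnitudes
    mask: (T, B) reliability mask, 1 = reliable, 0 = unreliable

    The per-channel mean of reliable bins is a crude stand-in estimate;
    the key point is the upper bound: estimate <= observation.
    """
    y = np.asarray(y, dtype=float)
    reliable = np.asarray(mask, dtype=bool)
    # crude stand-in estimate: mean of the reliable bins in each channel
    ch_mean = np.nanmean(np.where(reliable, y, np.nan), axis=0)
    # unreliable bins: estimate, clipped so it never exceeds the observation
    return np.where(reliable, y, np.minimum(ch_mean, y))
```

Reliable bins pass through unchanged; an unreliable bin receives the smaller of the estimate and the observation.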

Two types of masks are investigated:

1) AGC feature based masks

Denote the $b$-th Mel channel component of frame $t$ of the reverberant observation as $y(t, b)$. $y(t,b)$ is first compressed by raising it to the power of 0.3, then processed with a band-pass modulation filter with 3 dB cutoff frequencies of 1.5 Hz and 8.2 Hz. Automatic gain control is then applied, followed by a normalization that subtracts a channel-specific constant chosen so that the minimum value of each channel over a single utterance is 0. The resulting feature is referred to as the AGC feature, $y_{bp}^{agc}(t,b)$, and the mask is defined as:

\[

m_R (t, b) = \left \{

\begin{array}{l l}

1 & \text{if} \quad y_{bp}^{agc}(t,b) > \theta(b), \\

0 & \text{otherwise}.

\end{array} \right .

\]

where the threshold $\theta(b)$ for Mel channel $b$ is selected for each utterance based on the "blurredness" metric $B$ as

\[

\theta(b) = \gamma \frac{\frac{1}{N} \sum_{t=1}^N y_{bp}^{agc} (t, b)}{1 + \exp (-\alpha (B - \beta))}.

\]
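A rough NumPy/SciPy sketch of this pipeline. The AGC stage is not fully specified in these notes, so a simple envelope normalization stands in for it, and the frame rate, $\gamma$, $\alpha$, $\beta$, and the blurredness value $B$ are placeholder assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def agc_mask(y, fps=100.0, gamma=1.0, alpha=1.0, beta=0.5, blurredness=0.5):
    """Sketch of the m_R mask from AGC-processed Mel features.

    y: (T, B) non-negative reverberant Mel observation y(t, b),
       at an assumed frame rate `fps`.
    """
    # 1) power-law compression
    y_c = y ** 0.3
    # 2) band-pass modulation filter along time (1.5-8.2 Hz)
    b_bp, a_bp = butter(2, [1.5 / (fps / 2), 8.2 / (fps / 2)], btype="band")
    y_bp = filtfilt(b_bp, a_bp, y_c, axis=0)
    # 3) crude AGC stand-in: divide by a smoothed magnitude envelope
    b_lp, a_lp = butter(2, 1.0 / (fps / 2))
    env = np.maximum(np.abs(filtfilt(b_lp, a_lp, np.abs(y_bp), axis=0)), 1e-8)
    y_agc = y_bp / env
    # 4) shift so the per-channel minimum over the utterance is 0
    y_agc = y_agc - y_agc.min(axis=0, keepdims=True)
    # 5) per-channel threshold theta(b) from the utterance mean and blurredness B
    theta = gamma * y_agc.mean(axis=0) / (1.0 + np.exp(-alpha * (blurredness - beta)))
    return (y_agc > theta).astype(int)  # m_R(t, b)
```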

2) A mask motivated by a computational auditory model

Reverberation tails are located in the signal $y(t,b)$ by first estimating the smoothed temporal envelope in each channel, $y_{lp} (t - \tau_d, b)$, using a 2nd-order low-pass Butterworth filter with a cutoff frequency of 10 Hz, and then identifying regions for which the derivative $y_{lp}' (t - \tau_d, b) < 0$. The parameter $\tau_d$ corrects for the filter delay. The amount of energy in each decaying region of a frequency channel is quantified by

\[

L(t,b) = \left \{

\begin{array}{l l}

\frac{1}{|n(t, b)|} \sum_{k\in n(t,b)} y(k, b) & \text{if} \quad y_{lp}' (t - \tau_d, b) < 0, \\

0 & \text{otherwise},

\end{array} \right .

\]

where $n(t,b)$ is the set of contiguous time indices around $t$ where the derivative for channel $b$ remains negative. Under the assumption that reverberant signals result in greater $L(t,b)$ values than dry speech, the $m_{LP}$ mask is defined as:

\[

m_{LP}(t,b) = \left \{

\begin{array}{l l}

1 & \text{if} \quad L(t,b) < \theta_{LP}, \\

0 & \text{otherwise}.

\end{array} \right.

\]
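My reading of this procedure, sketched below. The threshold $\theta_{LP}$ and the frame rate are placeholders, and using zero-phase `filtfilt` makes the explicit delay correction $\tau_d$ unnecessary:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lp_mask(y, fps=100.0, theta_lp=0.5):
    """Sketch of the m_LP mask: flag frames that are NOT in energetic decay regions.

    y: (T, B) Mel observation; theta_lp is a free decay-energy threshold.
    """
    T, B = y.shape
    # smoothed temporal envelope per channel (2nd-order low-pass at 10 Hz);
    # filtfilt is zero-phase, so no explicit delay correction is needed
    b_lp, a_lp = butter(2, 10.0 / (fps / 2), btype="low")
    y_lp = filtfilt(b_lp, a_lp, y, axis=0)
    dy = np.gradient(y_lp, axis=0)           # envelope derivative
    L = np.zeros((T, B))
    for ch in range(B):
        neg = dy[:, ch] < 0                  # decaying frames
        t = 0
        while t < T:
            if neg[t]:
                k = t
                while k < T and neg[k]:      # contiguous decaying region n(t, b)
                    k += 1
                L[t:k, ch] = y[t:k, ch].mean()  # mean observed energy over the region
                t = k
            else:
                t += 1
    return (L < theta_lp).astype(int)        # m_LP(t, b): 1 = reliable
```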

GMM- and SVM-based mask estimators were used in the paper to estimate these two types of masks.

The mask estimators are trained on a subset of the multi-condition training set, along with the corresponding clean speech signals.

### [paper] Factorized adaptation for deep neural network

Paper Link.

This paper proposes a novel method to adapt a context-dependent DNN-HMM with only a limited number of parameters, by taking into account the underlying factors that contribute to the distorted speech signal.

The paper generally classified the existing work on adapting neural networks into five groups:

1) LIN, LON, fDLR

2) LHN, oDLR

3) Activation function with different shapes: Hermitian based hidden activation functions

4) Regularization based approaches, such as L2 regularization [*Regularized adaptation of discriminative classifiers*] and KL-divergence regularization

5) speaker code.

The three major components contributing to the excellent performance of CD-DNN-HMM are:

1) modeling senones directly even though there might be thousands or even tens of thousands of senones;

2) using DNNs instead of shallow MLPs;

3) using a long context window of frames as the input.

The HMM's state emission probability density function $p(\boldsymbol{x}|s)$ is computed by converting the state posterior probability $p(s|\boldsymbol{x})$ to

$p(\boldsymbol{x}|s) = \frac{p(s|\boldsymbol{x})}{p(s)} p(\boldsymbol{x})$

where $p(s)$ is the prior probability of state $s$, and $p(\boldsymbol{x})$ is independent of the state and can be dropped during evaluation. [This paragraph is simply for my reference, as one of my paper reviewers did not like the term "scaled likelihood" when I discussed this process. I should follow this description in the future whenever it is needed.]
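In code, the conversion is just an element-wise division by the state priors (the numbers below are made up for illustration):

```python
import numpy as np

# Convert DNN state posteriors p(s|x) into HMM emission scores p(s|x)/p(s);
# the state-independent p(x) is dropped during evaluation.
posteriors = np.array([0.7, 0.2, 0.1])   # p(s|x) from the softmax output
priors = np.array([0.5, 0.3, 0.2])       # p(s), e.g. from alignment counts
emission_scores = posteriors / priors    # proportional to p(x|s)
log_scores = np.log(emission_scores)     # decoding usually works in the log domain
```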

The method proposed in this paper is termed **Acoustic Factorization VTS (AFVTS)**.

Denote the input feature vector as $\boldsymbol{y}$ and the output vector right before the softmax activation as $\boldsymbol{r}$. The complex nonlinearity realized by the DNN model to convert $\boldsymbol{y}$ to $\boldsymbol{r}$ is represented by the function $R(.)$, i.e.

$\boldsymbol{r} = R( \boldsymbol{y})$

and the posterior probability vector is hence computed by $\text{softmax}(\boldsymbol{r})$.

To adapt an existing DNN to a new environment, the vector $\boldsymbol{r}$ is compensated by removing those unwanted parts caused by acoustic factors. Specifically, the modified vector $\boldsymbol{r}'$ is obtained by

$\boldsymbol{r}' = R(\boldsymbol{y}) + \sum_n Q_n f_n$

where $f_n$ is the underlying $n$-th acoustic factor and $Q_n$ is the corresponding loading matrix. Then $\boldsymbol{r}'$ instead of the original $\boldsymbol{r}$ is used to compute the final posterior probabilities.

The factors $[ f_1, \cdots , f_N ]$ are extracted from adaptation utterances and the loading matrices $[ Q_1, \cdots, Q_N ]$ are obtained from training data using EBP.

From the viewpoint of VTS, the above model can be derived as follows. Suppose the corresponding clean speech vector is $\boldsymbol{x}$ and the noise is $\boldsymbol{n}$. All these features are in the log filter-bank domain. They have the following relationship:

$\boldsymbol{x} = \boldsymbol{y} + \log( 1 - \exp( \boldsymbol{n} - \boldsymbol{y}) ) $

[Note the difference from the commonly used VTS equation, where the noisy speech is expressed in terms of the clean one.] This relationship can be expanded with a 1st-order VTS at $(\boldsymbol{y}_0, \boldsymbol{n}_0)$ as

$\boldsymbol{x} \approx \boldsymbol{y} + \log (1 - \exp (\boldsymbol{n}_0 - \boldsymbol{y}_0) ) + \boldsymbol{A} (\boldsymbol{y} - \boldsymbol{y}_0) + \boldsymbol{B} (\boldsymbol{n} - \boldsymbol{n}_0)$,

where

\[

\boldsymbol{A} = \frac{\partial \log(1-\exp(\boldsymbol{n}-\boldsymbol{y}))}{\partial \boldsymbol{y}} \Big|_{(\boldsymbol{y}_0, \boldsymbol{n}_0)}, \\

\boldsymbol{B} = \frac{\partial \log(1-\exp(\boldsymbol{n}-\boldsymbol{y}))}{\partial \boldsymbol{n}} \Big|_{(\boldsymbol{y}_0, \boldsymbol{n}_0)}.

\]
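A quick numerical check (mine, not the paper's) that the relation above exactly inverts the usual additive model $\boldsymbol{y} = \log(\exp(\boldsymbol{x}) + \exp(\boldsymbol{n}))$ element-wise:

```python
import numpy as np

# Verify x = y + log(1 - exp(n - y)) recovers the clean features from the
# additive log-domain mixture y = log(exp(x) + exp(n)).
rng = np.random.default_rng(1)
x = rng.normal(size=5)                       # clean log filter-bank features
n = x - 3.0                                  # noise a few units below the speech
y = np.log(np.exp(x) + np.exp(n))            # noisy observation
x_rec = y + np.log(1.0 - np.exp(n - y))      # relation used in these notes
print(np.allclose(x_rec, x))                 # prints True
```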

Then $R(\boldsymbol{x})$ can be expanded with 1st order VTS as

\[

R(\boldsymbol{x}) \approx R(\boldsymbol{x}_0) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{x}_0} (\boldsymbol{x} - \boldsymbol{x}_0)

\]

Using the noisy speech $\boldsymbol{y}$ as $\boldsymbol{x}_0$ and substituting the 1st-order VTS expansion of $\boldsymbol{x}$ above, we have

\[

R(\boldsymbol{x}) \approx R(\boldsymbol{y}) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} (\boldsymbol{x} - \boldsymbol{y}) \\

\approx R(\boldsymbol{y}) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} (\log (1 - \exp (\boldsymbol{n}_0 - \boldsymbol{y}_0) ) + \boldsymbol{A} (\boldsymbol{y} - \boldsymbol{y}_0) + \boldsymbol{B} (\boldsymbol{n} - \boldsymbol{n}_0)) \\

= R(\boldsymbol{y}) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} ( \boldsymbol{A} \boldsymbol{y} + \boldsymbol{B} \boldsymbol{n} + const.)

\]

Assuming that $\frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}}$ is constant, the above equation can be simplified to:

\[

R(\boldsymbol{x}) \approx R(\boldsymbol{y}) + \boldsymbol{C} \boldsymbol{y} + \boldsymbol{D} \boldsymbol{n} + const.

\]

Hence in addition to the noise factor $\boldsymbol{n}$, the distorted input feature $\boldsymbol{y}$ should also be used as a factor to adjust the noisy output vector $R(\boldsymbol{y})$ to obtain the corresponding clean one $R(\boldsymbol{x})$.

In the experiments, 24-D log Mel filter-bank features with their 1st- and 2nd-order derivatives are used. The noise $\boldsymbol{n}$ is a 72-D vector obtained by averaging the first and last 20 frames of each utterance. Within an utterance, each frame thus has a frame-invariant noise factor $\boldsymbol{n}$ and a frame-variant factor $\boldsymbol{y}$.
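The compensation step can be sketched as below. `R_y` stands in for the DNN's pre-softmax output on the noisy input, and the random loading matrices are placeholders for the EBP-trained $Q_n$; only the dimensions and the noise-averaging follow the notes:

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Sketch of the factorized compensation r' = R(y) + Q_y y + Q_n n.
rng = np.random.default_rng(2)
T, D, S = 100, 72, 10                      # frames, feature dim, senones (S is made up)
Y = rng.normal(size=(T, D))                # utterance of 72-D log Mel features + deltas
R_y = rng.normal(size=(T, S))              # hypothetical DNN logits R(y) per frame

# frame-invariant noise factor: average of the first and last 20 frames
n = np.concatenate([Y[:20], Y[-20:]]).mean(axis=0)

Q_n = 0.01 * rng.normal(size=(S, D))       # placeholder loading matrix for the noise factor
Q_y = 0.01 * rng.normal(size=(S, D))       # placeholder loading matrix for the frame factor

R_comp = R_y + Y @ Q_y.T + n @ Q_n.T       # compensated pre-softmax output r'
post = softmax(R_comp)                     # adapted senone posteriors
```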

In this paper, only the simple additive noise factor is used. The authors claim that further improvements are possible if some estimated channel factors are also used.
