This paper proposes a novel method for adapting a context-dependent DNN-HMM using only a limited number of parameters, by taking into account the underlying factors that contribute to the distorted speech signal.

The paper broadly classifies existing work on adapting neural networks into five groups:

1) LIN, LON, fDLR

2) LHN, oDLR

3) Activation function with different shapes: Hermitian based hidden activation functions

4) Regularization-based approaches, such as L2 regularization [*Regularized adaptation of discriminative classifiers*] and KL-divergence regularization

5) Speaker codes.

The three major components contributing to the excellent performance of CD-DNN-HMM are:

1) modeling senones directly even though there might be thousands or even tens of thousands of senones;

2) using DNNs instead of shallow MLPs;

3) using a long context window of frames as the input.
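To make the third component concrete, here is a minimal sketch of frame splicing, i.e. stacking each frame with its neighbours to form the DNN input. The function name `splice_frames`, the window sizes, and the edge padding by frame repetition are my own assumptions; the paper does not specify them in the part summarized here.

```python
import numpy as np

def splice_frames(feats, left=5, right=5):
    """Stack each frame with `left` past and `right` future frames.

    feats: (num_frames, dim) array. Edges are padded by repeating the
    first/last frame (an assumption, other padding schemes exist).
    Returns an array of shape (num_frames, (left + 1 + right) * dim).
    """
    num_frames, _ = feats.shape
    padded = np.concatenate([
        np.repeat(feats[:1], left, axis=0),
        feats,
        np.repeat(feats[-1:], right, axis=0),
    ], axis=0)
    window = left + 1 + right
    return np.stack([padded[t:t + window].reshape(-1) for t in range(num_frames)])

# Toy example: 6 frames of 2-D features, context of +/- 1 frame.
feats = np.arange(12, dtype=float).reshape(6, 2)
spliced = splice_frames(feats, left=1, right=1)
```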

The HMM's state emission probability density function $p(\boldsymbol{x}|s)$ is computed by converting the state posterior probability $p(s|\boldsymbol{x})$ to

$p(\boldsymbol{x}|s) = \frac{p(s|\boldsymbol{x})}{p(s)} p(\boldsymbol{x})$

where $p(s)$ is the prior probability of state $s$, and $p(\boldsymbol{x})$ is independent of the state and can be dropped during evaluation. [This paragraph is simply for my own reference, as one of my paper reviewers did not like the term "scaled likelihood" when I discussed this process. I should follow this description in the future whenever it is needed.]
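A minimal sketch of this conversion in the log domain, dropping the state-independent $p(\boldsymbol{x})$ term. The function name and the toy numbers are illustrative assumptions; in practice the priors would have to be estimated from training data, which is not covered here.

```python
import numpy as np

def log_emission_scores(log_posteriors, log_priors):
    """Return log p(s|x) - log p(s), i.e. log p(x|s) up to the
    state-independent term log p(x), which is dropped."""
    return log_posteriors - log_priors

# Toy example: 2 frames, 3 states (values are made up).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.1, 0.8]])
priors = np.array([0.5, 0.3, 0.2])  # assumed state priors
scores = log_emission_scores(np.log(posteriors), np.log(priors))
```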

The method proposed in this paper is termed **Acoustic Factorization VTS (AFVTS)**.

Denote the input feature vector as $\boldsymbol{y}$ and the output vector right before the softmax activation as $\boldsymbol{r}$. The complex nonlinearity realized by the DNN model to convert $\boldsymbol{y}$ to $\boldsymbol{r}$ is represented by the function $R(.)$, i.e.

$\boldsymbol{r} = R( \boldsymbol{y})$

and the posterior probability vector is hence computed by $\text{softmax}(\boldsymbol{r})$.

To adapt an existing DNN to a new environment, the vector $\boldsymbol{r}$ is compensated by removing those unwanted parts caused by acoustic factors. Specifically, the modified vector $\boldsymbol{r}'$ is obtained by

$\boldsymbol{r}' = R(\boldsymbol{y}) + \sum_n Q_n f_n$

where $f_n$ is the underlying $n$-th acoustic factor and $Q_n$ is the corresponding loading matrix. Then $\boldsymbol{r}'$ instead of the original $\boldsymbol{r}$ is used to compute the final posterior probabilities.

The factors $[ f_1, \cdots , f_N ]$ are extracted from adaptation utterances, and the loading matrices $[ Q_1, \cdots, Q_N ]$ are obtained from training data using error back-propagation (EBP).
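The compensation step can be sketched as follows. This is only an illustration of $\boldsymbol{r}' = R(\boldsymbol{y}) + \sum_n Q_n f_n$ followed by a softmax; the helper names, shapes, and random values are assumptions, and the training of the $Q_n$ is not shown.

```python
import numpy as np

def softmax(r):
    """Numerically stable softmax over the last axis."""
    z = r - r.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def compensate(r, loadings, factors):
    """Compute r' = r + sum_n Q_n f_n.

    r:        (T, S) pre-softmax outputs R(y) for T frames, S senones.
    loadings: list of Q_n, each of shape (S, d_n).
    factors:  list of f_n, each of shape (d_n,) for an utterance-level
              factor or (T, d_n) for a frame-level factor.
    """
    r_adapted = r.copy()
    for Q, f in zip(loadings, factors):
        r_adapted = r_adapted + f @ Q.T  # broadcasts over frames if f is 1-D
    return r_adapted

# Toy example with random numbers standing in for a real DNN output.
rng = np.random.default_rng(0)
T, S = 4, 5
r = rng.normal(size=(T, S))
Q1, f1 = rng.normal(size=(S, 3)), rng.normal(size=3)       # utterance-level factor
Q2, f2 = rng.normal(size=(S, 2)), rng.normal(size=(T, 2))  # frame-level factor
r_adapted = compensate(r, [Q1, Q2], [f1, f2])
posteriors = softmax(r_adapted)
```

Since the original network $R(\cdot)$ is left untouched, the only adaptation parameters in this sketch are the loading matrices.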

From the view of VTS, the above model could be derived as follows. Suppose the corresponding clean speech vector is $\boldsymbol{x}$ and noise is $\boldsymbol{n}$. All these features are in the log filter-bank domain. They have the following relationship:

$\boldsymbol{x} = \boldsymbol{y} + \log( 1 - \exp( \boldsymbol{n} - \boldsymbol{y}) ) $

[Note the difference from the commonly used VTS equation, where the noisy speech is expressed in terms of the clean speech.] This relationship can be expanded with a 1st-order VTS at $(\boldsymbol{y}_0, \boldsymbol{n}_0)$ as

$\boldsymbol{x} \approx \boldsymbol{y} + \log (1 - \exp (\boldsymbol{n}_0 - \boldsymbol{y}_0) ) + \boldsymbol{A} (\boldsymbol{y} - \boldsymbol{y}_0) + \boldsymbol{B} (\boldsymbol{n} - \boldsymbol{n}_0)$,

where

\[

\begin{aligned}
\boldsymbol{A} &= \left. \frac{\partial \log(1-\exp(\boldsymbol{n}-\boldsymbol{y}))}{\partial \boldsymbol{y}} \right|_{(\boldsymbol{y}_0, \boldsymbol{n}_0)} \\
\boldsymbol{B} &= \left. \frac{\partial \log(1-\exp(\boldsymbol{n}-\boldsymbol{y}))}{\partial \boldsymbol{n}} \right|_{(\boldsymbol{y}_0, \boldsymbol{n}_0)}
\end{aligned}

\]
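As a sanity check for myself, the relationship and its first-order expansion can be verified numerically. The sketch below assumes element-wise operations on log filter-bank vectors, so that $\boldsymbol{A}$ and $\boldsymbol{B}$ are diagonal with $\boldsymbol{A} = \mathrm{diag}\left(\frac{\exp(\boldsymbol{n}_0-\boldsymbol{y}_0)}{1-\exp(\boldsymbol{n}_0-\boldsymbol{y}_0)}\right)$ and $\boldsymbol{B} = -\boldsymbol{A}$. The function names and toy values are mine, not the paper's.

```python
import numpy as np

def clean_from_noisy(y, n):
    """x = y + log(1 - exp(n - y)), element-wise; requires n < y."""
    return y + np.log1p(-np.exp(n - y))

def vts_jacobians(y0, n0):
    """Diagonal Jacobians of log(1 - exp(n - y)) evaluated at (y0, n0)."""
    g = np.exp(n0 - y0) / (1.0 - np.exp(n0 - y0))
    A = np.diag(g)    # derivative with respect to y
    B = np.diag(-g)   # derivative with respect to n
    return A, B

# Build noisy speech from toy clean speech and noise: exp(y) = exp(x) + exp(n).
x = np.array([1.0, 2.0, 0.5])
n = np.array([0.0, -1.0, 0.2])
y = np.logaddexp(x, n)
x_rec = clean_from_noisy(y, n)  # should recover x

# First-order expansion around (y0, n0) = (y, n) for a small perturbation.
A, B = vts_jacobians(y, n)
dy = np.full(3, 1e-3)
dn = np.full(3, -1e-3)
x_lin = (y + dy) + np.log1p(-np.exp(n - y)) + A @ dy + B @ dn
x_exact = clean_from_noisy(y + dy, n + dn)
```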

Then $R(\boldsymbol{x})$ can be expanded with 1st order VTS as

\[

R(\boldsymbol{x}) \approx R(\boldsymbol{x}_0) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{x}_0} (\boldsymbol{x} - \boldsymbol{x}_0)

\]

Using the noisy speech $\boldsymbol{y}$ as $\boldsymbol{x}_0$ together with the 1st-order VTS approximation, we have

\[

\begin{aligned}
R(\boldsymbol{x}) &\approx R(\boldsymbol{y}) + \left. \frac{\partial R}{\partial \boldsymbol{x}} \right|_{\boldsymbol{y}} (\boldsymbol{x} - \boldsymbol{y}) \\
&\approx R(\boldsymbol{y}) + \left. \frac{\partial R}{\partial \boldsymbol{x}} \right|_{\boldsymbol{y}} \left( \log (1 - \exp (\boldsymbol{n}_0 - \boldsymbol{y}_0) ) + \boldsymbol{A} (\boldsymbol{y} - \boldsymbol{y}_0) + \boldsymbol{B} (\boldsymbol{n} - \boldsymbol{n}_0) \right) \\
&= R(\boldsymbol{y}) + \left. \frac{\partial R}{\partial \boldsymbol{x}} \right|_{\boldsymbol{y}} ( \boldsymbol{A} \boldsymbol{y} + \boldsymbol{B} \boldsymbol{n} + const.)
\end{aligned}

\]

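Assuming that $\frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}}$ is constant, and writing $\boldsymbol{C} = \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} \boldsymbol{A}$ and $\boldsymbol{D} = \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} \boldsymbol{B}$, the above equation simplifies to: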

\[

R(\boldsymbol{x}) \approx R(\boldsymbol{y}) + \boldsymbol{C} \boldsymbol{y} + \boldsymbol{D} \boldsymbol{n} + const.

\]

Hence in addition to the noise factor $\boldsymbol{n}$, the distorted input feature $\boldsymbol{y}$ should also be used as a factor to adjust the noisy output vector $R(\boldsymbol{y})$ to obtain the corresponding clean one $R(\boldsymbol{x})$.

In the experiments, 24D log Mel filter-bank features with their 1st- and 2nd-order derivatives are used. The noise $\boldsymbol{n}$ is a 72D vector obtained by averaging the first and last 20 frames of each utterance. Within an utterance, each frame therefore has a frame-invariant noise factor $\boldsymbol{n}$ and a frame-variant factor $\boldsymbol{y}$.
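A rough sketch of how the two factors could be formed and applied, under my own assumptions: 20 frames are taken from each end of the utterance, the factor $\boldsymbol{y}$ is the 72D feature of the current frame (without context), and the matrices `C`, `D` and the bias are random placeholders standing in for trained parameters.

```python
import numpy as np

def estimate_noise(feats, num_edge_frames=20):
    """Average the first and last `num_edge_frames` frames of an utterance.

    feats: (T, dim) array. For utterances shorter than
    2 * num_edge_frames the two slices overlap and the shared frames
    are counted twice.
    """
    edges = np.concatenate([feats[:num_edge_frames], feats[-num_edge_frames:]], axis=0)
    return edges.mean(axis=0)

def afvts_adjust(r, y, n, C, D, bias):
    """Compute R(y) + C y + D n + const for every frame.

    r: (T, S) pre-softmax outputs, y: (T, dim) frame-variant factor,
    n: (dim,) frame-invariant noise factor, C and D: (S, dim), bias: (S,).
    """
    return r + y @ C.T + n @ D.T + bias

# Toy utterance: 100 frames of 72-D features, 10 output units.
rng = np.random.default_rng(0)
T, dim, S = 100, 72, 10
y = rng.normal(size=(T, dim))
n = estimate_noise(y)
r = rng.normal(size=(T, S))          # placeholder for R(y)
C = 0.01 * rng.normal(size=(S, dim)) # placeholder loading matrices
D = 0.01 * rng.normal(size=(S, dim))
bias = np.zeros(S)
r_adjusted = afvts_adjust(r, y, n, C, D, bias)
```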

In this paper, only the simple additive noise factor is used. The authors claim that further improvements are possible if some estimated channel factors are also used.