Saturday, May 31, 2014

[paper] Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network

Link to paper: http://research.microsoft.com/apps/pubs/?id=215422

The main focus of this paper is to limit the number of parameters in both the adaptation transforms and the speaker-adapted models. The strong performance of CD-DNN-HMMs requires a huge number of parameters, which makes adaptation very challenging, especially with limited adaptation data.

This paper is based on the previous work of restructuring the DNN weights using SVD.

The following review of speaker adaptation for DNNs is useful to me:

[Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems] applies affine transformations to the inputs and outputs of a neural network.

[Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training] applies a linear transformation to the activations of the internal hidden layers.

[Hermitian polynomial for speaker adaptation of connectionist speech recognition systems] changes the shape of the activation function to better fit the speaker specific features.

[KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition] adapts the model conservatively: a KL-divergence regularizer forces the senone distributions estimated by the adapted model to stay close to those estimated by the speaker-independent (SI) model.

[Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code] uses a small speaker code that is learned for each particular speaker, together with a large adaptation network obtained from the training data.

[Factorized adaptation for deep neural network] uses factorized adaptation to limit the number of parameters by taking the underlying factors into account.

KL-Divergence regularized DNN:


The standard cross entropy objective function of DNNs is:

$\mathcal{E}=\frac{1}{N} \sum_{t=1}^N \sum_s p(l_t = s | x_t) \log p(y_t = s | x_t)$

where $l_t$ is the reference label and $y_t$ is the DNN prediction. As written, without a leading minus sign, this quantity is maximized during training.

By adding the KL-divergence between the posteriors of the speaker-independent (SI) model and those of the adapted model, the new objective is:

$\mathcal{E}=\frac{1}{N} \sum_{t=1}^N \sum_s \big( (1-\rho) p(l_t = s | x_t) + \rho p^{\tt SI}(y_t=s | x_t) \big) \log p(y_t = s | x_t)$

Comparing these two equations, applying the KL divergence regularization is equivalent to changing the target probability distribution to be a linear interpolation of the distribution estimated from the SI model and the ground truth alignment of the adaptation data.
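
The interpolation of targets can be illustrated with a small NumPy sketch. This is only my reading of the two equations above, not code from the paper: the function names, the toy posteriors, and the `eps` guard inside the logarithm are all assumptions made for illustration.

```python
import numpy as np

def interpolated_targets(onehot_labels, si_posteriors, rho):
    """Target distribution: (1 - rho) * ground truth + rho * SI posterior."""
    return (1.0 - rho) * onehot_labels + rho * si_posteriors

def objective(targets, adapted_posteriors, eps=1e-12):
    """Frame average of sum_s target(s) * log p(y_t = s | x_t); to be maximized."""
    # eps is an assumption of this sketch; it only guards against log(0).
    return np.mean(np.sum(targets * np.log(adapted_posteriors + eps), axis=1))

# Toy data: N = 2 frames, 3 senones (all values are made up).
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])      # one-hot alignment p(l_t = s | x_t)
p_si = np.array([[0.7, 0.2, 0.1],
                 [0.3, 0.5, 0.2]])        # SI model posteriors
p_adapted = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1]])   # adapted model posteriors

# rho = 0 recovers the standard objective with one-hot targets.
plain = objective(interpolated_targets(labels, p_si, rho=0.0), p_adapted)

# rho = 0.5 mixes the alignment and the SI posteriors in equal parts.
targets = interpolated_targets(labels, p_si, rho=0.5)
regularized = objective(targets, p_adapted)
```

With $\rho = 0$ the adaptation data is trusted completely; with $\rho = 1$ the targets are just the SI posteriors, so the objective pulls the adapted model back toward the SI model.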

SVD bottleneck adaptation:

The DNN's $m*n$ ($m \geq n$) weight matrix $W_{(m*n)}$ is decomposed using SVD:

$W_{(m*n)} = U_{(m*n)} \Sigma_{(n*n)} V_{(n*n)}^T$

where $\Sigma_{(n*n)}$ is a diagonal matrix with $W_{(m*n)}$'s singular values on the diagonal. If $W_{(m*n)}$ is close to low rank (the paper describes this as $W_{(m*n)}$ being sparse), only $k$ of its singular values, with $k \ll n$, are significantly different from zero. Keeping only the $k$ largest singular values gives

$W_{(m*n)} \approx U_{(m*k)} \Sigma_{(k*k)} V_{(n*k)}^T = U_{(m*k)} N_{(k*n)}$

where $N_{(k*n)} = \Sigma_{(k*k)} V_{(n*k)}^T$. This acts as if a linear bottleneck layer with far fewer units had been inserted between the original layers.
For SVD bottleneck adaptation, one more linear layer with $k$ units is inserted in between. That is

$W_{(m*n)} \approx U_{(m*k)} S_{(k*k)} N_{(k*n)}$

where $S_{(k*k)}$ is set to the identity matrix for the SI model and updated for each speaker, giving $k*k$ speaker-specific parameters for this layer.
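
The factorization and the extra $k*k$ layer can be sketched in NumPy as follows. This is a toy illustration under my own assumptions, not the paper's implementation: the sizes are arbitrary, $W$ is constructed with exact rank $k$ so that the truncation is lossless (for real DNN weights it is only an approximation), and the random perturbation of $S$ merely stands in for a real gradient-based update on a speaker's adaptation data, with $U$ and $N$ kept fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 64, 48, 8  # toy sizes, chosen only for illustration

# Toy weight matrix of exact rank k, so truncating to k singular values is exact here.
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

# Thin SVD: U is m x n, sigma holds n singular values (descending), Vt is n x n.
U, sigma, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the k largest singular values.
U_k = U[:, :k]                          # m x k
N_k = np.diag(sigma[:k]) @ Vt[:k, :]    # k x n, i.e. Sigma_k V_k^T

# Extra k x k linear layer: identity for the SI model.
S_si = np.eye(k)
W_si = U_k @ S_si @ N_k

# Stand-in for adaptation: only S changes, U_k and N_k stay fixed (assumption).
S_speaker = S_si + 0.01 * rng.standard_normal((k, k))
W_sa = U_k @ S_speaker @ N_k

# Parameter counts for this single layer.
params_full = m * n               # original weight matrix
params_factored = k * (m + n)     # U_k plus N_k
params_per_speaker = k * k        # S only
```

For these toy sizes the layer shrinks from 3072 to 896 parameters, and each speaker adds only 64 on top of the shared factors.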

SVD delta compression:


This technique is mainly used to reduce the number of parameters required to be stored for the adapted model. It uses the same SVD trick to decompose the difference of the weight matrices between the adapted model and the SI model:
\[
\begin{aligned}
\Delta W_{(m*n)} &= W_{(m*n)}^{\tt SA} - W_{(m*n)}^{\tt SI} \\
&= U_{(m*n)} \Sigma_{(n*n)} V_{(n*n)}^T \\
&\approx U_{(m*k)} \Sigma_{(k*k)} V_{(n*k)}^T \\
&= U_{(m*k)} N_{(k*n)}
\end{aligned}
\]
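
A minimal NumPy sketch of the delta compression, again under assumptions of my own: the helper names `compress_delta` and `reconstruct` are hypothetical, and the adapted matrix is synthesized as the SI matrix plus a low-rank change and a little noise, which is far friendlier to compression than a real adapted model is likely to be.

```python
import numpy as np

def compress_delta(W_sa, W_si, k):
    """Return rank-k factors (U_k, N_k) of the difference W_sa - W_si."""
    delta = W_sa - W_si
    U, sigma, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :k], np.diag(sigma[:k]) @ Vt[:k, :]

def reconstruct(W_si, U_k, N_k):
    """Rebuild an approximate adapted matrix from the SI matrix and stored factors."""
    return W_si + U_k @ N_k

rng = np.random.default_rng(0)
m, n, k = 64, 48, 4  # toy sizes, chosen only for illustration

W_si = rng.standard_normal((m, n))
# Synthetic adapted weights: SI weights + rank-k change + small full-rank noise.
low_rank_change = 0.1 * (rng.standard_normal((m, k)) @ rng.standard_normal((k, n)))
W_sa = W_si + low_rank_change + 1e-4 * rng.standard_normal((m, n))

U_k, N_k = compress_delta(W_sa, W_si, k)
W_hat = reconstruct(W_si, U_k, N_k)

# Error of the rank-k approximation, relative to the size of the delta itself.
rel_error = np.linalg.norm(W_sa - W_hat) / np.linalg.norm(W_sa - W_si)

# Per-speaker storage for this layer: k * (m + n) values instead of m * n.
stored = U_k.size + N_k.size
```

Only `U_k` and `N_k` need to be stored per speaker; the SI matrix is shared, and the adapted weights are rebuilt on demand.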

The results suggest that SVD bottleneck adaptation is the more effective of the two techniques, and that combining them only works when adapting with a small amount of data.