Link to the paper: http://www.crim.ca/perso/patrick.kenny/Gupta_ICASSP2014.pdf

This paper shows that the i-vector representation of speech segments can be used to perform blind speaker adaptation of hybrid DNN-HMM systems. Acoustic features are augmented by the corresponding i-vectors before being presented to the DNN. The same i-vector is used for all acoustic feature vectors aligned with a given speaker.

The paper also shows that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used.

i-vectors are a fixed-dimensional representation of speech segments (the dimensionality is independent of the segment duration). During training and recognition, one i-vector per speaker is computed as an additional input to the DNN. All the frames corresponding to this speaker have the same i-vector appended to them. Speaker adaptation during decoding is completely unsupervised, but a diarization step is needed in order to extract an i-vector for each speaker in the audio file.

TRAP features are used in their work. The computation process is as follows:

1) the 23D filterbank features are normalized to zero mean per speaker;

2) 31 frames of these features are spliced together to form a 713D feature vector (31 × 23 = 713);

3) a Hamming window is applied to the 713D feature vector;

4) a discrete cosine transform is applied and the dimensionality is reduced to 368D;

5) global mean and variance normalization is then carried out.

After this processing, the final 368D feature vector is used as the input to the DNN.
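The five steps above can be sketched in NumPy as follows. This is only an illustration under my own assumptions, not the authors' code: I assume the Hamming window and DCT are applied along the time axis of each filterbank band, that 368D comes from keeping 16 DCT coefficients per band (23 × 16 = 368), and that utterance edges are padded by repeating the first and last frame. The function names are hypothetical.

```python
import numpy as np

def trap_features(fbank, context=15, n_dct=16):
    """fbank: (T, 23) filterbank frames of ONE speaker. Returns (T, 23 * n_dct)."""
    num_frames, num_bands = fbank.shape
    win = 2 * context + 1                                  # 31 frames

    # 1) zero-mean normalization per speaker
    feats = fbank - fbank.mean(axis=0, keepdims=True)

    # 2) splice 31 frames around each frame (edge padding is an assumption)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    spliced = np.stack([padded[t:t + win] for t in range(num_frames)])  # (T, 31, 23)

    # 3) Hamming window along the time axis of each band (assumed layout)
    spliced = spliced * np.hamming(win)[None, :, None]

    # 4) DCT-II along time, keeping the first n_dct coefficients per band
    n = np.arange(win)
    k = np.arange(n_dct)
    dct_basis = np.cos(np.pi * (n[None, :] + 0.5) * k[:, None] / win)   # (n_dct, 31)
    coeffs = np.einsum("kn,tnb->tbk", dct_basis, spliced)               # (T, 23, n_dct)
    return coeffs.reshape(num_frames, num_bands * n_dct)

def global_mvn(feats):
    """5) global mean and variance normalization over all given frames."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8
    return (feats - mean) / std

rng = np.random.default_rng(0)
toy_fbank = rng.normal(size=(50, 23))          # 50 random frames as stand-in data
dnn_input = global_mvn(trap_features(toy_fbank))
```

With the assumed 16 coefficients per band, `dnn_input` has 368 columns, matching the dimensionality quoted in the notes.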

Using $\boldsymbol{i}$ to denote the i-vector and $\boldsymbol{s}$ to denote the utterance supervector, the probability model for the supervector is

$\boldsymbol{s} = \boldsymbol{m} + \boldsymbol{T} \boldsymbol{i}, \quad \boldsymbol{i} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$

where $\boldsymbol{m}$ is the supervector defined by a universal background model (UBM) and the columns of the matrix $\boldsymbol{T}$ are the eigenvectors. The estimation of an i-vector extractor is actually the estimation of $\boldsymbol{T}$.

From the paper, the i-vector model is clear, but the paper doesn't give detailed explanation of the estimation. I may need to read up more about i-vectors to really understand it.

Some useful findings from the experiments in the paper are:

1) The length normalized i-vectors gave better performance than the unnormalized ones. The normalization adopted in their work is simply dividing the i-vector by the square root of the sum of the squares of its elements.

2) The i-vector based adaptation is effective for both seen and unseen speakers.

3) The i-vectors with higher dimensionality give better performance. In their experiments with 100D, 200D and 400D i-vectors, the 400D i-vector performed best.
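The length normalization in finding 1) and the way the i-vector is appended to every frame of a speaker can be sketched as below. This is a toy illustration with made-up numbers and hypothetical function names; the real features are 368D and the i-vectors up to 400D.

```python
import math

def length_normalize(ivector):
    """Divide the i-vector by the square root of the sum of its squared elements."""
    norm = math.sqrt(sum(v * v for v in ivector))
    return [v / norm for v in ivector]          # assumes a non-zero i-vector

def append_ivector(frames, ivector):
    """Append the same (length-normalized) speaker i-vector to every frame."""
    ivec = length_normalize(ivector)
    return [list(frame) + ivec for frame in frames]

frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]     # toy 3D acoustic frames of one speaker
ivector = [3.0, 4.0]                            # toy 2D i-vector of that speaker
augmented = append_ivector(frames, ivector)
```

Every row of `augmented` ends with the same two values, the normalized i-vector, which now has unit length.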

## Saturday, May 31, 2014

### [paper] Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network

Link to paper: http://research.microsoft.com/apps/pubs/?id=215422

The main focus of this paper is to limit the number of parameters in both the adaptation transforms and the speaker adapted models. The outstanding performance of CD-DNN-HMMs requires a huge number of parameters, which makes adaptation very challenging, especially with limited adaptation data.

This paper is based on the previous work of restructuring the DNN weights using SVD.

The following review of speaker adaptation for DNNs is useful to me:

[*Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems*] applies affine transformations to the inputs and outputs of a neural network.

[*Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training*] applies a linear transformation to the activations of the internal hidden layers.

[*Hermitian polynomial for speaker adaptation of connectionist speech recognition systems*] changes the shape of the activation function to better fit the speaker specific features.

[*KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition*] uses regularized adaptation to conservatively adapt the model, forcing the senone distributions estimated by the adapted model to stay close to those estimated by the speaker independent (SI) model through KL-divergence.

[*Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code*] uses a small speaker code learned separately for each particular speaker, together with a large adaptation network obtained from the training data.

[*Factorized adaptation for deep neural network*] uses factorized adaptation to limit the number of parameters by taking the underlying factors into consideration.

#### KL-Divergence regularized DNN:

The standard cross entropy objective function of DNNs is:

$\mathcal{E}=\frac{1}{N} \sum_{t=1}^N \sum_s p(l_t = s | x_t) \text{log} p(y_t = s | x_t)$

where $l_t$ is the reference label and $y_t$ is the DNN prediction.

By adding the KL-divergence between the posterior vector of the adapted model and the SI model, the new objective is:

$\mathcal{E}=\frac{1}{N} \sum_{t=1}^N \sum_s \big( (1-\rho) p(l_t = s | x_t) + \rho p^{\tt SI}(y_t=s | x_t) \big) \text{log} p(y_t = s | x_t)$

Comparing these two equations, applying the KL divergence regularization is equivalent to changing the target probability distribution to be a linear interpolation of the distribution estimated from the SI model and the ground truth alignment of the adaptation data.
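The interpolated target can be illustrated for a single frame as follows. This is a minimal sketch, assuming the ground truth alignment gives a one-hot target (a hard label per frame); the function names and numbers are made up for the example.

```python
import math

def kld_targets(label, si_posterior, rho):
    """(1 - rho) * one-hot alignment target + rho * SI model posterior."""
    return [(1.0 - rho) * (1.0 if s == label else 0.0) + rho * p_si
            for s, p_si in enumerate(si_posterior)]

def frame_objective(targets, adapted_posterior):
    """Inner sum for one frame: sum_s target(s) * log p(y_t = s | x_t)."""
    return sum(t * math.log(p)
               for t, p in zip(targets, adapted_posterior) if t > 0.0)

si_post = [0.7, 0.2, 0.1]          # toy SI posterior over 3 senones
targets = kld_targets(label=1, si_posterior=si_post, rho=0.5)
value = frame_objective(targets, adapted_posterior=[0.3, 0.6, 0.1])
```

Setting `rho=0` recovers the plain cross entropy target, while `rho=1` makes the adapted model simply imitate the SI model.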

#### SVD bottleneck adaptation:

The DNN's $m*n$ ($m \geq n$) weight matrix $W_{(m*n)}$ is decomposed using SVD:

$W_{(m*n)} = U_{(m*n)} \Sigma_{(n*n)} V_{(n*n)}^T$

where $\Sigma_{(n*n)}$ is a diagonal matrix with $W_{(m*n)}$'s singular values on the diagonal. Assuming $W_{(m*n)}$ is a sparse matrix, the number of its non-zero singular values will be $k$, where $k \ll n$. Then we can rewrite

$W_{(m*n)} = U_{(m*k)} \Sigma_{(k*k)} V_{(n*k)}^T = U_{(m*k)} N_{(k*n)}$

It acts as if a linear bottleneck layer with far fewer units has been added between the original layers.

To do the SVD bottleneck adaptation, another linear layer is added with $k$ units in-between. That is

$W_{(m*n)} = U_{(m*k)} S_{(k*k)} N_{(k*n)}$

where $S_{(k*k)}$ is set to the identity matrix for the SI model and updated for each speaker.
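The factorization and the role of $S_{(k*k)}$ can be sketched with NumPy. This is a toy example, not the paper's setup: the weight matrix is built to have exact rank $k$ so that the truncated SVD reproduces it, and the "adaptation" is a random perturbation standing in for a real gradient update.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 6, 2

# Toy weight matrix of exact rank k (an assumption made for the illustration).
W = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))

U, sigma, Vt = np.linalg.svd(W, full_matrices=False)   # U: (m, n), sigma: (n,), Vt: (n, n)
U_k = U[:, :k]                                         # U_(m*k)
N_k = np.diag(sigma[:k]) @ Vt[:k, :]                   # N_(k*n) = Sigma_(k*k) V_(n*k)^T

S = np.eye(k)                      # SI model: S is the identity matrix
W_si = U_k @ S @ N_k               # equals W for this rank-k toy matrix

# Per-speaker adaptation only changes the k*k entries of S.
S_speaker = S + 0.01 * rng.normal(size=(k, k))         # stand-in for a trained update
W_sa = U_k @ S_speaker @ N_k
```

Here each speaker needs only `k * k = 4` stored numbers for this layer, compared with `m * n = 48` for a full adapted weight matrix.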

#### SVD delta compression:

This technique is mainly used to reduce the number of parameters required to be stored for the adapted model. It uses the same SVD trick to decompose the difference of the weight matrices between the adapted model and the SI model:

\[
\begin{aligned}
\Delta W_{(m*n)} &= W_{(m*n)}^{\tt SA} - W_{(m*n)}^{\tt SI} \\
&= U_{(m*n)} \Sigma_{(n*n)} V_{(n*n)}^T \\
&\approx U_{(m*k)} \Sigma_{(k*k)} V_{(n*k)}^T \\
&= U_{(m*k)} N_{(k*n)}
\end{aligned}
\]

The results suggest that SVD bottleneck adaptation is the more effective of the two, and that combining the two techniques only works for adaptation with a small amount of data.
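The storage saving of the SVD delta compression described above can be sketched as follows. The helper names are hypothetical and the matrices are toy data; the adapted matrix is deliberately built so that its difference from the SI matrix has rank $k$, which makes the truncated reconstruction exact here (in general it is only an approximation).

```python
import numpy as np

def compress_delta(W_sa, W_si, k):
    """Keep only the top-k singular directions of the weight difference."""
    delta = W_sa - W_si
    U, sigma, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :k], np.diag(sigma[:k]) @ Vt[:k, :]     # U_(m*k), N_(k*n)

def restore(W_si, U_k, N_k):
    """Rebuild the speaker-adapted matrix from the SI matrix and the stored delta."""
    return W_si + U_k @ N_k

rng = np.random.default_rng(1)
m, n, k = 8, 6, 2
W_si = rng.normal(size=(m, n))
W_sa = W_si + 0.1 * rng.normal(size=(m, k)) @ rng.normal(size=(k, n))

U_k, N_k = compress_delta(W_sa, W_si, k)
stored = U_k.size + N_k.size       # k * (m + n) numbers per speaker
```

For these toy sizes `stored` is 28 instead of the 48 entries of the full matrix; the saving grows quickly when $k \ll n$.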

### [paper] Joint noise adaptive training for robust automatic speech recognition

Link to the paper: http://www.cse.ohio-state.edu/~dwang/papers/Narayanan-Wang.icassp14.pdf

This paper studies:

1) an alternative way of using the output of speech separation to improve ASR performance;

2) training strategies that unify separation and the backend acoustic modeling.

Microsoft's noise-aware training (NAT) was proposed to improve the noise robustness of DNNs using estimates of the noise. However, it uses a rather crude estimate, obtained by averaging the first and the last few frames of each utterance, and the noise statistics are simply appended to the original input features to form the input to the new DNN.

In this paper, **the authors use their speech separation module, which generates ideal ratio masks (IRM), to compute better noise statistics**. Given an estimate of the IRM, $\boldsymbol{m}(t)$, the following speech and noise estimates can be derived:

Noise estimation: $\boldsymbol{n}(t) = ( 1 - \boldsymbol{m}(t) ) \odot \boldsymbol{y}(t)$

Noise removed speech estimation: $\boldsymbol{x}(t) = \boldsymbol{m}(t)^{\alpha} \odot \boldsymbol{y}(t)$

Clean speech estimation: $\bar{\boldsymbol{x}}(t) = f(\boldsymbol{x}(t), \boldsymbol{y}(t))$

where $\boldsymbol{y}(t)$ is the original noisy speech feature vector and $\odot$ represents the element-wise multiplication. The $\alpha$ parameter is a tunable parameter (<1) that exponentially scales up IRM estimates, thereby reducing the distortion introduced by masking. However, in their work, $\alpha$ was set to 1. $f(.)$ is the reconstruction function that undoes the distortion introduced by channel or microphone mismatch between training and testing.
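The first two estimates are simple element-wise operations on one frame, sketched below with made-up numbers. The function name is hypothetical, the reconstruction function $f(.)$ is not modeled, and I do not attempt to reproduce the exact feature domain in which the paper applies the mask.

```python
def mask_estimates(noisy, mask, alpha=1.0):
    """Return (noise estimate, noise-removed speech estimate) for one frame."""
    noise = [(1.0 - m) * y for m, y in zip(mask, noisy)]       # (1 - m) ⊙ y
    speech = [(m ** alpha) * y for m, y in zip(mask, noisy)]   # m^alpha ⊙ y
    return noise, speech

y = [4.0, 2.0, 1.0]          # toy noisy feature vector y(t)
m = [0.75, 0.5, 0.0]         # toy IRM estimate m(t), values in [0, 1]
noise, speech = mask_estimates(y, m)              # alpha = 1, as in the paper
_, speech_soft = mask_estimates(y, m, alpha=0.5)  # alpha < 1 raises the mask values
```

With `alpha=1` the two estimates add up to the noisy input; with `alpha=0.5` the mask value 0.75 becomes about 0.87, so less of the signal is removed.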

The Aurora4 baseline system reported in this paper achieves a word error rate of 11.7% with a 7H-1024D DNN (ReLU hidden layers, no RBM pre-training, dropout). The authors claim that the gains mainly come from **their DNN training frame labels, which are obtained by aligning the corresponding clean training set instead of the noisy data themselves**.

The authors also show that using their noise estimates is slightly better than the crude noise estimation adopted by the Microsoft paper, mainly in noisy+channel mismatched conditions.

The final best Aurora4 performance of 11.1% was obtained by averaging two systems.

The joint training is formulated by treating the processing steps of masking, applying the log, sentence-level mean normalization, adding deltas, splicing and global MVN as DNN layers. The whole system is then treated as a single DNN, and the classification error is back-propagated all the way to the input of the speech separation module.

## Saturday, May 24, 2014

### [paper] Two Microphone Binary Mask speech enhancement: application to diffuse and directional noise fields

Link to the paper: http://etrij.etri.re.kr/etrij/common/GetFile.do?method=filedownload&fileid=ERY-1398669567602

This paper utilizes the binary masking technique to address two kinds of noise: diffuse noise and directional noise. Diffuse noise is certainly a noise signal, but directional noise could correspond to either a true noise or a disturbing speech source.

Binary masking methods emulate the human ear's ability to mask a weaker signal by a stronger one. [Moore, Brian CJ, and Brian C. Moore. *An introduction to the psychology of hearing*. Vol. 5. San Diego: Academic press, 2003.]

Spatial cues such as the interaural time difference (ITD) and the interaural level difference (ILD) are highly useful in source separation.

Many two-microphone systems rely on localization cues for speech segregation. However, these cues are only useful when each sound source is located at a single point, so that each signal arrives from a specific direction. Although this condition holds for speech and directional noise sources (such as car and street noise, or a competing speaker), in various environments the noise is diffuse and does not arrive from a specific direction (e.g., restaurants and large malls).

The main contribution of this paper is the two features proposed to estimate the masks: the Coherence Feature and the Phase Error Feature.

1) The Coherence Feature of two spectra $X_1(\lambda, k)$ and $X_2(\lambda, k)$ is defined as

$COH(\lambda, k)=\frac{|P_{X_1,X_2}(\lambda,k)|}{\sqrt{|P_{X_1}(\lambda, k)| |P_{X_2}(\lambda, k)|}}$

where $P_{X_i}(\lambda, k)$ is the smoothed power spectrum of signal $X_i$, $i \in \{1,2\}$, and is calculated as:

$P_{X_i}(\lambda, k) = \alpha P_{X_i}(\lambda - 1, k) + (1 - \alpha) |X_i (\lambda, k)|^2$.

$P_{X_1, X_2}(\lambda, k)$ is the smoothed cross power spectral density of the two spectra, which is computed as:

$P_{X_1, X_2}(\lambda, k) = \alpha P_{X_1, X_2}(\lambda - 1, k) + (1 - \alpha) X_1(\lambda, k) X_2^*(\lambda, k)$,

where $X_2^*$ denotes the complex conjugate of $X_2$.

The Coherence of two signals shows the level of correlation or similarity of two signals. In case of a directional source, the signals received at the two microphones are highly similar to each other (they only differ in their time of arrival and amplitude attenuation). So their Coherence is near 1.

But in case of a diffuse noise source, the received signals have lower similarity, and so, their Coherence is noticeably smaller than 1.

2) The Phase Error Feature is defined as

$PE(\lambda, k) = \Delta \phi(\lambda, k) - 2 \pi * ITD$

where $\Delta \phi(\lambda, k) = \angle X_1(\lambda, k) - \angle X_2(\lambda, k)$.
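The behaviour of the Coherence Feature for directional and diffuse sources can be checked with a small simulation of a single frequency bin. This is a sketch on synthetic data, not the paper's experiment: the signals are unit-magnitude values with random phases, the function name and the smoothing constant `alpha=0.9` are my own choices, and the cross term uses the complex conjugate of $X_2$, which is the standard definition of a cross power spectral density.

```python
import cmath
import math
import random

def coherence_track(X1, X2, alpha=0.9):
    """Coherence per frame for one frequency bin k, from complex STFT values."""
    p1 = p2 = 0.0          # smoothed power spectra P_X1 and P_X2
    p12 = 0j               # smoothed cross power spectral density P_X1,X2
    coh = []
    for x1, x2 in zip(X1, X2):
        p1 = alpha * p1 + (1 - alpha) * abs(x1) ** 2
        p2 = alpha * p2 + (1 - alpha) * abs(x2) ** 2
        p12 = alpha * p12 + (1 - alpha) * x1 * x2.conjugate()
        coh.append(abs(p12) / (math.sqrt(p1 * p2) + 1e-12))
    return coh

random.seed(0)

def random_phase():
    return cmath.rect(1.0, random.uniform(-math.pi, math.pi))

X1 = [random_phase() for _ in range(200)]
directional = [0.8 * x * cmath.exp(-0.3j) for x in X1]   # attenuated, delayed copy of X1
diffuse = [random_phase() for _ in range(200)]           # phases unrelated to X1

coh_dir = coherence_track(X1, directional)
coh_dif = coherence_track(X1, diffuse)
```

`coh_dir` stays at essentially 1 for every frame, while `coh_dif` drops well below 1 once the recursion has warmed up (the very first frame always gives 1 because the smoothed estimates start from zero).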

In this paper, the authors use these two sets of features to estimate binary masks. They experimented with neural networks (2 hidden layers), decision trees, Gaussian mixture models and support vector machines, which did not give large differences in performance.

The evaluation criterion adopted in this paper is the mask estimation hit and false alarm rates. Personally, however, I don't think it is a good one, because the binary masks depend on the thresholds selected. In different applications, we may want to use different thresholds, which will lead to different masks, and the hit and false alarm rates will vary accordingly. Moreover, for recognition purposes, whether by humans or machines, it is hard to relate the hit and false alarm rates to recognition performance, and in particular to tell whether improvements in the hit and false alarm rates translate into different recognition performance.

## Thursday, May 22, 2014

### [paper] Probabilistic linear discriminant Analysis for acoustic model

Link to the paper: http://homepages.inf.ed.ac.uk/srenals/plda-spl2014.pdf

PLDA is formulated as a generative model, in which an acoustic feature vector $\boldsymbol{y}_t$ from the $j$-th HMM state at time index $t$ can be expressed as

$\boldsymbol{y}_t | j, m = \boldsymbol{U}_m \boldsymbol{x}_{jmt} + \boldsymbol{G}_m \boldsymbol{z}_{jm} + \boldsymbol{b}_m + \epsilon_{mt}$,

where $m$ is the Gaussian component index of the GMM for state $j$.

$\boldsymbol{z}_{jm}$ is the component dependent variable, shared by the whole set of acoustic feature frames generated by the $j$-th state's $m$-th Gaussian.

$\boldsymbol{x}_{jmt}$ is the channel variable which explains the per-frame variations.

In their work, the prior distributions of $\boldsymbol{z}_{jm}$ and $\boldsymbol{x}_{jmt}$ are assumed to be $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$.

$\boldsymbol{b}_m$ denotes the bias.

$\epsilon_{mt}$ is the residual noise, which is Gaussian with zero mean and diagonal covariance, i.e. $\epsilon_{mt} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\lambda})$.
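As a quick consequence of this model (my own working, assuming the channel variable, the component variable and the residual noise are mutually independent, which these notes do not state explicitly), integrating out the per-frame channel variable while keeping $\boldsymbol{z}_{jm}$ fixed gives a Gaussian for each frame:

$\boldsymbol{y}_t | \boldsymbol{z}_{jm}, j, m \sim \mathcal{N}(\boldsymbol{G}_m \boldsymbol{z}_{jm} + \boldsymbol{b}_m, \; \boldsymbol{U}_m \boldsymbol{U}_m^T + \boldsymbol{\lambda})$

So $\boldsymbol{G}_m \boldsymbol{z}_{jm} + \boldsymbol{b}_m$ plays the role of the mean of the $m$-th component of state $j$, while $\boldsymbol{U}_m \boldsymbol{U}_m^T + \boldsymbol{\lambda}$ is a covariance shared across states for component $m$.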

### [paper] Deep learning in neural networks: an overview

Link to the paper: http://arxiv.org/abs/1404.7828

It's a good, information-rich overview of the deep learning research area. It has plenty of content, well organized in chronological order, as reflected by the references, which take up 40 of its 60 pages.

It also has a line or two mentioning HTM, which I am interested in exploring.

## Wednesday, May 7, 2014

### Principles of Analytic Graphics

1. Show Comparisons

2. Show causality, mechanism, explanation

3. Show multivariate data

4. Integrate multiple modes of evidence

5. Describe and document the evidence

6. Content is king

From the first lecture of https://www.coursera.org/course/exdata.

Book: Edward Tufte (2006). Beautiful Evidence, Graphics Press LLC. http://www.edwardtufte.com/tufte/

### Install the latest R on Ubuntu 12.04

1) add deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu precise/ to /etc/apt/sources.list

2) run:

sudo apt-get update

sudo apt-get install r-base

To install the latest 3.0 version of R, we need to do the following after modifying the sources.list file:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

sudo add-apt-repository ppa:marutter/rdev

sudo apt-get update

sudo apt-get upgrade

sudo apt-get install r-base

This is from: http://askubuntu.com/questions/218708/installing-latest-version-of-r-base
