Thursday, June 12, 2014

[paper] Learning representations by back-propagating errors


This is the paper proposed the error back-propagation algorithm for neural network. And the paper was initially published to Nature.

Sunday, June 1, 2014

[paper] Recognition of reverberant speech by missing data imputation and NMF feature enhancement

Paper Link.

This paper addressed the problem of reverberant speech recognition by extending a noise-robust feature enhancement method based on NMF.

While the topic of speech recognition in noisy environments has been widely studies, many proposed systems are limited by an underlying assumption that the observed signal is an additive mixture of speech and noise, often with the latter having spectral characteristics unlike those of speech. The distortions introduced by the multiple reflected signals inherent in reverberation do not fit this model well.

The bounded conditional mean imputation is used to reconstruct the unreliable regions by assuming the that the observed value is an upper bound for the signal of interest.

Two types of masks are experimented:

1) AGC feature based masks

Denote the $b$-th Mel channel component of frame $t$ of the reverberant observation as $y(t, b)$. Then $y(t,b)$ is first compressed by raising to the power of 0.3, then processed with a band-pass modulation filter with 3 dB cutoff frequencies of 1.5Hz and 8.2Hz. An automatic gain control is further applied and a normalization by subtracting a channel specific constant selected so that the minimum value for each channel over a single utterance is 0. The resulting feature is referred to as the AGC feature, $y_{bp}^{agc}(t,b)$ and the mask is hence defined as:

m_R (t, b) = \left \{
\begin{array}{l l}
1 & \text{if} \quad y_{bp}^{agc}(t,b) > \theta(b), \\
0 & \text{otherwise}.
\end{array} \right .

where the threshold $\theta(b)$ for Mel channel $b$ is selected for each utterance based on the "blurredness" metric $B$ as

\theta(b) = \gamma \frac{\frac{1}{N} \sum_{t=1}^N y_{bp}^{agc} (t, b)}{1 + \exp (-\alpha (B - \beta))}.

2) A computational auditory model motivated mask

Reverberation tails are located in the signal $y(t,b)$ by first estimating the smoothed temporal envelope in each channel, $y_{lp} (t - \tau_d, b)$, using a 2nd order low-pass Butterworth filter with cutoff frequency at 10Hz, and identifying regions for which the derivative $y_{lp}' (t - \tau_d, b) < 0$. The parameter $\tau_d$ corrects for the filter delay. The amount of energy in each decaying region of one frequency channel is quantified by

L(t,d) = \left \{
\begin{array}{l l}
\frac{1}{|n(t, b)|} \sum_{k\in n(t,b)} y(k, b) & \text{if} \quad y_{lp}' (t - \tau_d, b) < 0, \\
0 & \text{otherwise},
\end{array} \right .

where $n(t,b)$ is the set of contiguous time indices around $t$ where the derivative for channel $b$ remains negative. Under the assumption that reverberant signals result in greater $L(t,b)$ values than dry speech, the $m_{LP}$ mask is defined as:

m_{LP}(t,b) = \left \{
\begin{array}{l l}
1 & \text{if} \quad L(t,d) < \theta_{LP}, \\
0 & \text{otherwise}.
\end{array} \right.

GMM and SVM mask estimators were used in the paper for the estimation of these two types of masks.

The mask estimators are trained on a subset of the multi-condition training set, along with the corresponding clean speech signals.

[paper] Factorized adaptation for deep neural network

Paper Link.

A novel method was proposed in this paper to adapt context dependent DNN-HMM with only limited number of parameters by taking into account the underlying factors that contributes to the distorted speech signal.

The paper generally classified the existing work on adapting neural networks into five groups:
2) LHN, oDLR
3) Activation function with different shapes: Hermitian based hidden activation functions
4) Regularization based approaches, such as L2 regularization [Regularized adaptation of discriminative classifiers], KL-divergence regularization
5) speaker code.

The three major components contributing to the excellent performance of CD-DNN-HMM are:
1) modeling senones directly even though there might be thousands or even tens of thousands of senones;
2) using DNNs instead of shallow MLPs;
3) using a long context window of frames as the input.

The HMM's state emission probability density function $p(\boldsymbol{x}|s)$ is computed by converting the state posterior probability $p(s|\boldsymbol{x})$ to

$p(\boldsymbol{x}|s) = \frac{p(s|\boldsymbol{x})}{p(s)} p(\boldsymbol{x})$

where $p(s)$ is the prior probability of state $s$, and $p(\boldsymbol{x})$ is independent of state and can be dropped during evaluation. [This paragraph is simply for my reference, as one of my paper reviewers do not like the term "scaled likelihood" when I discussed this process. I should follow this description in future whenever it is needed.]

The method proposed in this paper is termed as Acoustic Factorization VTS (AFVTS).

Denote the input feature vector as $\boldsymbol{y}$ and the output vector right before the softmax activation as $\boldsymbol{r}$. The complex nonlinearity realized by the DNN model to convert $\boldsymbol{y}$ to $\boldsymbol{r}$ is represented by the function $R(.)$, i.e.

$\boldsymbol{r} = R( \boldsymbol{y})$

and the posterior probability vector is hence computed by $\text{softmax}(\boldsymbol{r})$.

To adapt an existing DNN to a new environment, the vector $\boldsymbol{r}$ is compensated by removing those unwanted parts caused by acoustic factors. Specifically, the modified vector $\boldsymbol{r}'$ is obtained by

$\boldsymbol{r}' = R(\boldsymbol{y}) + \sum_n Q_n f_n$

where $f_n$ is the underlying $n$-th acoustic factor and $Q_n$ is the corresponding loading matrix. Then $\boldsymbol{r}'$ instead of the original $\boldsymbol{r}$ is used to compute the final posterior probabilities.

The factors $[ f_1, \cdots , f_N ]$ are extracted from adaptation utterances and the loading matrices $[ Q_1, \cdots, Q_N ]$ are obtained from training data using EBP.

From the view of VTS, the above model could be derived as follows. Suppose the corresponding clean speech vector is $\boldsymbol{x}$ and noise is $\boldsymbol{n}$. All these features are in the log filter-bank domain. They have the following relationship:

$\boldsymbol{x} = \boldsymbol{y} + \log( 1 - \exp( \boldsymbol{n} - \boldsymbol{y}) ) $

[note the difference between the commonly used VTS equation, where the noisy speech is represented by the clean one.]and can be expanded with 1st order VTS at $(\boldsymbol{y}_0, \boldsymbol{n}_0)$ as

$\boldsymbol{x} \approx \boldsymbol{y} + \log (1 - \exp (\boldsymbol{n}_0 - \boldsymbol{y}_0) ) + \boldsymbol{A} (\boldsymbol{y} - \boldsymbol{y}_0) + \boldsymbol{B} (\boldsymbol{n} - \boldsymbol{n}_0)$,


\boldsymbol{A} = \frac{\partial \log(1-\exp(\boldsymbol{n}-\boldsymbol{y}))}{\partial \boldsymbol{y}} |_{(\boldsymbol{y}_0, \boldsymbol{n}_0)} \\
$\boldsymbol{B} = \frac{\partial \log(1-\exp(\boldsymbol{n}-\boldsymbol{y}))}{\partial \boldsymbol{n}} |_{(\boldsymbol{y}_0, \boldsymbol{n}_0)}

Then $R(\boldsymbol{x})$ can be expanded with 1st order VTS as

R(\boldsymbol{x}) \approx R(\boldsymbol{x}_0) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{x}_0} (\boldsymbol{x} - \boldsymbol{x}_0)

Use the noisy speech $\boldsymbol{y}$ as the $\boldsymbol{x}_0$ and the 1st order VTS approaximation, we have

R(\boldsymbol{x}) \approx R(\boldsymbol{y}) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} (\boldsymbol{x} - \boldsymbol{y}) \\
\approx R(\boldsymbol{y}) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} (\log (1 - \exp (\boldsymbol{n}_0 - \boldsymbol{y}_0) ) + \boldsymbol{A} (\boldsymbol{y} - \boldsymbol{y}_0) + \boldsymbol{B} (\boldsymbol{n} - \boldsymbol{n}_0)) \\
= R(\boldsymbol{y}) + \frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}} ( \boldsymbol{A} \boldsymbol{y} + \boldsymbol{B} \boldsymbol{n} + const.)

Assuming that $\frac{\partial R}{\partial \boldsymbol{x}}|_{\boldsymbol{y}}$ is constant, the above equation could be simplified as:

R(\boldsymbol{x}) \approx R(\boldsymbol{y}) + \boldsymbol{C} \boldsymbol{y} + \boldsymbol{D} \boldsymbol{n} + const.

Hence in addition to the noise factor $\boldsymbol{n}$, the distorted input feature $\boldsymbol{y}$ should also be used as a factor to adjust the noisy output vector $R(\boldsymbol{y})$ to obtain the corresponding clean one $R(\boldsymbol{x})$.

In the experiments conducted, 24D log Mel filter-bank features with their 1st and 2nd order derivatives are used. The noise $\boldsymbol{n}$ is a 72D vector obtained by averaging the first and last 20 frames of each utterance. For each frame, we have a frame-invariant noise factor $\boldsymbol{n}$ and a frame variant factor $\boldsymbol{y}$ within an utterance.

In this paper, only the simple additive noise factor is used. The authors claim that further improvements are possible if some estimated channel factors are also used.

Saturday, May 31, 2014

[paper] i-vector based speaker adaptation of deep neural networks for French broadcast audio transcription

Link to the paper:

This paper show that the i-vector representation of speech segments can be used to perform blind speaker adaptation of hybrid DNN-HMM systems. Acoustic feature are augmented by the corresponding i-vectors before being presented to the DNN. The same i-vector is used for all acoustic feature vectors aligned with a given speaker.

The paper also shows that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used.

i-vectors are a fixed dimensional representation of speech segments (the dimensionality is independent of the segment duration). During training and recognition, one i-vector per speaker is computed as an additional input to the DNN. All the frames corresponding to this speaker have the same i-vector appended to them. Speaker adaptation during decoding is completely unsupervised but a diarization step is needed in order to extract an i-vector for each speaker in the audio file.

The TRAP features are used in their work. The computation process is as follows:
1) normalize the 23D filterbank features to zero mean per speaker;
2) 31 frames of these features are spliced together to form a 713D feature vector;
3) a hamming window is applied to the 713D feature vector;
4) a discrete cosine transform is applied and the dimensional is reduced to 368D;
5) a global mean and variance normalization is further carried out;
After these processing, the final 368D feature vector is used as the input to the DNN.

Using $\boldsymbol{i}$ to denote i-vector and $\boldsymbol{s}$ to denote utterance supervectors, probability model for supervector is

$\boldsymbol{s} = \boldsymbol{m} + \boldsymbol{T} \boldsymbol{i}, \quad \boldsymbol{i} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$

where $\boldsymbol{m}$ is the supervector defined by a universal background model (UBM) and the columns of the matrix $\boldsymbol{T}$ are the eigenvectors. The estimation of an i-vector extractor is actually the estimation of $\boldsymbol{T}$.

From the paper, the i-vector model is clear, but the paper doesn't give detailed explanation of the estimation. I may need to read up more about i-vectors to really understand it.

Some useful findings from the experiments in the paper are:
1) The length normalized i-vectors gave better performance than the unnormalized ones. The normalization adopted in their work is simply dividing the i-vector by the square root of the sum of the squares of its elements.

2) The i-vector based adaptation is effective for both seen and unseen speakers.

3) The i-vectors with higher dimensionality  give better performance. As in their experiments with 100D, 200D and 400D, the 400D i-vector performs the best.

[paper] Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network

Link to paper:

The main focus of this paper is to limit the number of parameters in both the adaptation transforms and the speaker adapted models. The outstanding performance of CD-DNN-HMM requires huge number of parameters, which makes adaptation very challenging, especially with limited adaptation data.

This paper is based on the previous work of restructuring the DNN weights using SVD.

The following review of speaker adaptation for DNNs is useful to me:

[Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems] applies affine transformations to the inputs and outputs of a neural network.

[Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training] applies a linear transformation to the activations of the internal hidden layers.

[Hermitian polynomial for speaker adaptation of connectionist speech recognition systems] changes the shape of the activation function to better fit the speaker specific features.

[KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition] uses regularized adaptation to conservatively adapt the model by forcing the senone distributions estimated by the adapted model to be close to that estimated from the speaker independent model through KL-divergence.

[Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code] uses a separate small size of speaker code that is learned from each particular speaker and a large adaptation network obtained from the training data.

[Factorized adaptation for deep neural network] uses factorized adaptation to limit the number of parameters by taking into consideration of the underlying factors.

KL-Divergence regularized DNN:

The standard cross entropy objective function of DNNs is:

$\mathcal{E}=\frac{1}{N} \sum_{t=1}^N \sum_s p(l_t = s | x_t) \text{log} p(y_t = s | x_t)$

where $l_t$ is the reference label and $y_t$ is the DNN prediction.

By adding the KL-divergence between the posterior vector of the adapted model and the SI model, the new objective is:

$\mathcal{E}=\frac{1}{N} \sum_{t=1}^N \sum_s \big( (1-\rho) p(l_t = s | x_t) + \rho p^{\tt SI}(y_t=s | x_t) \big) \text{log} p(y_t = s | x_t)$

Comparing these two equations, applying the KL divergence regularization is equivalent to changing the target probability distribution to be a linear interpolation of the distribution estimated from the SI model and the ground truth alignment of the adaptation data.

SVD bottleneck adaptation:

The DNN's $m*n$ ($m \geq n$)weight matrix $W_{(m*n)}$ is decomposed using SVD:

$W_{(m*n)} = U_{(m*n)} \Sigma_{(n*n)} V_{(n*n)}^T$

where $\Sigma_{(n*n)}$ is a diagonal matrix with $W_{(m*n)}$'s singular values on the diagonal. Assuming $W_{(m*n)}$ is sparse matrix, the number of $W_{(m*n)}$'s non-zero singular values will be $k$, where $k \ll n$. Then we can rewrite

$W_{(m*n)} = U_{(m*k)} \Sigma_{(k*k)} V_{(n*k)}^T = U_{(m*k)} N_{(k*n)}$

It acts as if a linear bottleneck layer with much fewer units has been added between the original layers.
To do the SVD bottleneck adaptation, another linear layer is added with $k$ units in-between. That is

$W_{(m*n)} = U_{(m*k)} S_{(k*k)} N_{(k*n)}$

where $S_{(k*k)}$ is set to the identity matrix for the SI model and updated for each speaker.

SVD delta compression:

This technique is mainly used to reduce the number of parameters required to be stored for the adapted model. It uses the same SVD trick to decompose the difference of the weight matrices between the adapted model and the SI model:
\Delta W_{(m*n)} = W_{(m*n)}^{\tt SA} - W_{(m*n)}^{\tt SI} \\
=U_{(m*n)} \Sigma_{(n*n)} V_{(n*n)}^T \\
\approx U_{(m*k)} \Sigma_{(k*k)} V_{(n*k)}^T \\
= U_{(m*k)} N_{(k*n)}

The results suggest the SVD bottleneck adaptation is more effective and the combination of these two techniques only work for adaptation with small amount of data.

[paper] Joint noise adaptive training for robust automatic speech recognition

Link to the paper:

This paper studied
1) the alternative way of using the output of speech separation to improve ASR performance;
2) training strategies that unify separation and the backend acoustic modeling.

Microsoft's noise-aware training (NAT) was proposed to improve the noise robustness of DNNs with estimations of noise. However, they used a rather crude estimation, which is obtained by averaging the first and the last few frames of each utterance. And the noise statistics are simply appended to the original input features as the input to the new DNN.

In this paper, the authors utilized their speech separation module which generates ideal ratio masks (IRM) to compute a better noise statistics. Given an estimate of the IRM, $\boldsymbol{m}(t)$, the following speech and noise estimations can be derived:

Noise estimation: $\boldsymbol{n}(t) = ( 1 - \boldsymbol{m}(t) ) \odot \boldsymbol{y}(t)$

Noise removed speech estimation: $\boldsymbol{x}(t) = \boldsymbol{m}(t)^{\alpha} \odot \boldsymbol{y}(t)$

Clean speech estimation: $\bar{\boldsymbol{x}}(t) = f(\boldsymbol{x}(t), \boldsymbol{y}(t))$

where $\boldsymbol{y}(t)$ is the original noisy speech feature vector and $\odot$ represents the element-wise multiplication. The $\alpha$ parameter is a tunable parameter (<1) that exponentially scales up IRM estimates, thereby reducing the distortion introduced by masking. However, in their work, $\alpha$ was set to 1. $f(.)$ is the reconstruction function that undoes the distortion introduced by channel or microphone mismatch between training and testing.

The Aurora4 baseline system reported in this paper is 11.7% with a 7H-1024D DNN (ReLU hidden layers, no RBM pre-training, Dropout). The authors claimed that the gains are mainly coming from their DNN training frame labels which are obtained by aligning the corresponding clean training set instead of the noisy data themselves.

The authors also showed that the use of their noise estimates is slightly better than the crude noise estimation adopted by the Microsoft paper, mainly in noisy+channel mismatched conditions.

The final best Aurora4 performance of 11.1% was obtained by averaging two systems.

The joint training is formulated by treating the processing steps of masking, applying log, sentence level mean normalization, adding deltas, splicing and global MVN as DNN layers. Then the wholes system is treated as a single DNN and back-propagte the classification error all the way back to the input of the speech separation input.

Saturday, May 24, 2014

[paper] Two Microphone Binary Mask speech enhancement: application to diffuse and directional noise fields

Link to the paper:

This paper utilize the Binary Masking technique to address two kinds of noises: namely the diffuse noise and the directional noise. The diffuse noise is certainly a noise signal, but a directional noise could correspond to a true noise or a disturbing speech source.

Binary masking methods emulate the human's ear capability to mask a weaker signal by a stronger one. [Moore, Brian CJ, and Brian C. Moore. An introduction to the psychology of hearing. Vol. 5. San Diego: Academic press, 2003.]

Spatial cues such as interaural-time-different (ITD) and interaural-level-difference (ILD) are highly useful in source separation.

Many two-micriphone systems rely on localization cues for speech segregation. But, there cues are only useful when each sound source is located at a single point and so, each signal arrives from a specific direction. Although this condition holds for speech and directional noise sources (such as car and street noise, and a competing speaker), in various environments the noise is diffuse and does not arrive from a specific direction (e.g., consider restaurants, and large malls).

The main contributions of this paper is the two features proposed to estimate the masks. Two features are the Coherence Feature and the Phase Error Feature.

1) The Coherence Feature of two spectra $X_1(\lambda, k)$ and $X_2(\lambda, k)$ is defined as

$COH(\lambda, k)=\frac{|P_{X_1,X_2}(\lambda,k)|}{\sqrt{|P_{X_1}(\lambda, k)| |P_{X_2}(\lambda, k)|}}$

where $P_{X_i}(\lambda, k)$ is the smoothed spectrum of signal $X_i, i\in{1,2}$, and is calculated as:

$P_{X_i}(\lambda, k) = \alpha P_{X_i}(\lambda - 1, k) + (1 - \alpha) |X_i (\lambda, k)|^2$.

$P_{X_1, X_2}(\lambda, k)$ is the smoothed cross power spectral density of the two spectra, which is computed as:

$P_{X_1, X_2}(\lambda, k) = \alpha P_{X_1, X_2}(\lambda - 1, k) + (1 - \alpha) X_1(\lambda, k) X_2(\lambda, k)$.

The Coherence of two signals shows the level of correlation or similarity of two signals. In case of a directional source, the signals received at the two microphones are highly similar to each other (they only differ in their time of arrival and amplitude attenuation). So their Coherence is near 1.

But in case of a diffuse noise source, the received signals have lower similarity, and so, their Coherence is noticeably smaller than 1.

2) The Phase Error Feature is defined as

$PE(\lambda, k) = \Delta \phi(\lambda, k) - 2 \pi * ITD$

where $\Delta \phi(\lambda, k) = \angle X_1(\lambda, k) - \angle X_2(\lambda, k)$.

In this paper, the authors utilize these two sets of features for the estimation of binary masks. They have experimented with neural networks (2 hidden layers), decision tree, Gaussian mixture model and support vector machines, which didn't give large differences.

The evaluation criterion adopted in this paper is the mask estimation hit and false alarm rate. However, personally, I don't think it is a good one. As the binary masks depend on the thresholds selected. In different applications, we may want to use different thresholds, which will lead to different masks. The hid and false alarm rates will also vary. Moreover, for recognition purpose, either human or machines, it is hard to relate the hid and false alarm rates to the recognition performance. Especially, whether the improvements in the hit and false alarm rates cause different recognition performance.

Thursday, May 22, 2014

[paper] Probabilistic linear discriminant Analysis for acoustic model

Link to the paper:

PLDA is formulated by a generative model, where an acoustic feature vector $\boldsymbol{y}_t$ from the $j$-th HMM state at time index $t$ can be expressed as

$\boldsymbol{y}_t | j, m = \boldsymbol{U}_m \boldsymbol{x}_{jmt} + \boldsymbol{G}_m \boldsymbol{z}_{jm} + \boldsymbol{b}_m + \epsilon_{mt}$,

where $m$ is the Gaussian component index of the GMM for state $j$.

$\boldsymbol{z}_{jm}$ is the component dependent variable, shared by the whole set of acoustic feature frames generated by the $j$-th state's $m$-th Gaussian.

$\boldsymbol{x}_{jmt}$ is the channel variable which explains the per-frame variations.

In their work, the prior distributions of $\boldsymbol{z}_{jm}$ and $\boldsymbol{x}_{jmt}$ are assumed to be $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$.

$\boldsymbol{b}$ denotes the bias.

$\epsilon_t$ is the residual noise which is Gaussian with a zero mean and diagonal covariance, i.e. $\epsilon_t \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\lambda})$

[paper] Deep learning in neural networks: an overview

Link to the paper:

It's a good and information rich overview of the deep learning research area. It has plenty of contents well organized in the chronological order, which is reflected by the 40 out of totally 60 pages' references.

It also has a line or two mentioning about the HTM, which I am interested in exploring.

Wednesday, May 7, 2014

Principles of Analytic Graphics

1. Show Comparisons
2. Show causality, mechanism, explanation
3. Show multivariate data
4. Integrate multiple models of evidence
5. Describe and document the evidence
6. Content is king

From the first lecture of
Book: Edward Tufte (2006). Beautiful Evidence, Graphics Press LLC.

Install the latest R on Ubuntu 12.04

Following the default instructions on to install the R with steps below will install an out-dated version (for may case is 2.14):

1) add deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu precise/ to /etc/apt/sources.list
2) run:
sudo apt-get update
sudo apt-get install r-base

To install the latest 3.0 version of R, we need to do the following after modifying the sources.list file :

sudo apt-key adv --keyserver --recv-keys E084DAB9
sudo add-apt-repository ppa:marutter/rdev
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install r-base

This is from:

Saturday, March 22, 2014

Long, Deep and Wide

While deep neural nets are getting popular, researchers in speech recognition communities start revisiting neural nets and searching for new directions. Length and width besides the depth start appearing.

In Long, Deep and Wide Artificial Neural Nets for Dealing with Unexpected Noise in Machine Recognition of Speech, Hermansky argues that benefits can be also seen in expanding the nets longer in temporal direction, and wider into multiple parallel processing streams.

While the DNN generated speech sound likelihood estimates are demonstrated to be better that the earlier used likelihoods derived by generative Gaussian Mixture Models, unexpected signal distortions that were not seen in the training data can still make the acoustic likelihoods unacceptably low. A step towards addressing the unreliable acoustic evidence might be in expanding the net architectures not only into deeper but also into longer and wider structures, where substantial temporal context attempts to cover whole co-articulation patterns of speech sounds, and multiple processing paths, attending to multiple parts of information-carrying space, attempt to capitalize on redundancies of coding of information in speech, possibly allowing for adaptive alleviation of corrupted processing streams.

This paper suggests that MLP-based estimation of posterior probabilities of speech sounds should be done from relatively long segments of speech signal, and in many parallel interacting streams, resulting on MLP architectures that are not only deep but also long and wide. The streams should describe the speech signal in different ways, capitalizing on the redundant way the message is coded in the signal. Given the constantly changing acoustic environment, the choice of the best streams for the final decision about the message should be done adaptively.

In the book Speech and Hearing in Communication, Fletcher suggests that human speech recognition is carried out in individual frequency bands and the final error in recognition is given by a product of probabilities of errors in the individual frequency streams. Based on similar studies, researchers in ASR community proposed multi-stream ASR. The fundamental motivation is that when message cues are conflicting or corrupted in some processing streams, such a situation can be identified and a corrective action can focus on the more reliable streams that still provide enough cues to facilitate the recognition. (This actually reminds me about our previous study on spectral masking technique for noisy speech recognition. It assumes every input feature is noisy and tries to first identify the "components" that are more speech-dominated, then keeps only those information and throws away noise components. The following recognition is purely based on those partial information. The main bottleneck of that approach is the mask estimation.)

Morgan also reviewed various ASR systems developed prior to the development of DNNs in Deep and Wide: Multiple Layers in Automatic Speech Recognition, with the emphasis on the use of multiple streams of highly dimensioned layers. That paper ultimately concludes that while the  deep processing structures can provide improvements for ASR systems, choice of features and the structure with which they are incorporated, including layer width, can also be significant factors. The have typically found that using an insufficient number of units per layer can have a very effect on the word error rate although this saturates or can even slightly decline with too large  a layer.

In this paper, Morgan also pointed out that the ability to use many more parameters for a given amount of data without overfitting was one of the major design aims for deep learning networks.

Furthermore, they investigated the effect of using different depth and width in DNNs with a fixed total number of model parameters on the Aurora2 task in Deep vs. Wide: Depth on a Budget for Robust Speech Recognition. Adding layers generally resulted in better accuracy, but the number of parameters was increased with every layer added, so that it was not clear what was the main contributing factor to the good results - the depth, or the large number of parameters. However, a shallow model with the same number of parameters usually performs worse than a deeper one.

One interesting paper they referred to is the HNN/ACID approach of Fritsch's paper ACID/HNN: A Framework for Hierarchical Connectionist Acoustic Modeling. He used a tree of networks in order to estimate a large number of context-dependent classes, using the simple factoring trick expounded in Morgan's paper Factoring networks by a statistical method.

Saturday, March 15, 2014


A very detailed explanation of the WFST-based lattice indexing and searing for STD can be found in Lattice indexing for spoken term detection. It is one of the best papers I have read, both informative and well written. The Kaldi KWS tools are based on this paper and some implementation details could be found at

First, just to clear the confusion between the decoding beam and lattice beam for my self:
1) decoding beam is the beam width for search the best path through the complete search graph;
2) lattice beam is the beam width for pruning the whole lattice based on the obtained best path.

Intuitively, we need to find the best path first and then do another round of lattice pruning using the lattice beam. Practically, the pruning is done periodically rather than waiting for the end of the file ( Kaldi does it every 25 frames). The approach used in Kaldi is equivalent to link all currently active states to a "dummy" final state and then pruning in the normal way. [Generating exact lattices in the WFST framework].

Hence for the STD task, we can use a larger decoding beam to ensure the search for the best hypotheses and a smaller lattice beam to limit the lattice used for STD indexing and searching.

For STD, we need to construct an exact inverted index for ASR lattices with time information. All substrings seen in the lattices along with their time information must be indexed.

A factor transducer is an inverted index of the set of substrings (which is referred to as factors) of the set of strings comprising a document. A timed factor transducer stores the time information on the arc weights.

The whole procedure is as follows: we start with preprocessing each input automaton (ASR lattice) to obtain a posterior lattice in which non-overlapping arc clusters are separately labelled. Then from each processed input automaton we construct an intermediate factor transducer which recognizes exactly the set of factor of the input. We convert these intermediate structures into deterministic transducers by augmenting each factor with a disambiguation symbol and then applying weighted automaton optimization. Finally, we take the union of these deterministic transducers and further optimize the result to obtain a deterministic inverted index of the entire dataset.

The path weights of the ASR lattices correspond to joint probabilities assigned by the language and acoustic models. A general weight-pushing algorithm is applied to convert these weights to desired negated log posterior probabilities given the lattice.

For STD, we would like to keep separate index entries for non-overlapping occurrences in an utterance since we no longer search for the utterances but the exact time intervals containing the query term. A clustering algorithm is applied to achieve this separation, which assigns the arcs with the same input label and overlapping time-spans a unique class id. The clustering is done as follows:
1) sort the collected (start time, end time) pairs with respect to end times;
2) identify the largest set of non-overlapping (start time, end time) pairs and assign them as cluster heads;
3) classify the rest of the arcs according to maximal overlap.
After clustering, each arc carries a cluster identifier on the output label.

Next, all the factors are indexed in the following way:
1) map each arc to (negated log probability, start time, end time)
2) create a unique new initial state
3) create a unique new final state
4) for each state, $q$ in the automaton, create two new arcs: a) an initial arc from the new initial state to to $q$; b) a final arc from $q$ to the new final state.

Then the paths carrying the same factor-pair are merged and weighted $\epsilon$-removal, determinization and minimization are applied.  Followed by the removal of cluster identifiers on the non-final arcs and insertion of disambiguation symbols into the final arcs. The resulting factor transducer is further optimized by applying weighted determinization and minimization.

The timed factor transducer of the entire dataset is constructed by:
1) taking the union of each TFT;
2) encoding the input-output labels as a single label and applying weighted $\epsilon$-removal, determinization,  and minimization;
3) remove the disambiguation symbols on the final transitions.

With this generated TFT index, we can utilize factor selection filters in WFST form to restrict, transform or re-weight the index entries, for example:
A) a pronunciation dictionary which maps words to phone sequences can be applied to word lattices to obtain phonetic lattices.
B) a simple grammar which restricts the factors can be used. We can use such a grammar to reject the silence symbol, i.e., factors including the silence symbol are removed from the index.

Finally the search over the TFT index is the construction of the response automaton given a query automaton X:
1) composing X with the index TFT on the input side and projecting the resulting transducer onto its output labels;
2) removing the $\epsilon$ transitions and finally sorting with the shortest path algorithm.
Each successful path in the resulting response transducer is a single arc which carries an automaton identifier on its label and a (-log posterior probability, start time, end time) triplet on its weight. A simple traverse in the arc order gives sorted results.

In the paper, the authors find a lattice beam width of 4 is ideal for their STD experiments.

Wednesday, March 12, 2014


Some nice explanations on RBM and DBN from the paper Application of Deep Belief Network for Natural Language Understanding:

RBMs can be trained using unlabeled data and they can learn stochastic binary features which are good for modeling the higher-order statistical structure of a dataset. Even thought these features are discovered without considering the discriminative task for which they will be used, some of them are typically very useful for classification as well as for generation.

After training the network consisting of the visible layer and the first hidden layer, which we will refer to as $ \text{RBM}_1 $, its learned parameters, $ \theta_1 $, define $ p(\boldsymbol{v}, \boldsymbol{h}|\theta_1), p(\boldsymbol{v}|\theta_1), p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ and $ p(\boldsymbol{h}|\boldsymbol{v}, \theta_1) $ via equations
\[ p(h_j = 1 | \boldsymbol{v}) = \sigma(a_j + \sum_i v_i w_{ij}) \]
\[ p(v_i =1 | \boldsymbol{h}) = \sigma(b_i + \sum_j h_j w_{ij}) \].
The parameters of $ \text{RBM}_1$ also define a prior distribution over hidden vectors, $ p(\boldsymbol{h}|\theta_1)$, which is obtained by marginalizing over the space of visible vectors. This allows $p(\boldsymbol{v}|\theta_1)$ to be written as:
\[ p(\boldsymbol{v}|\theta_1) = \sum_{\boldsymbol{h}} p(\boldsymbol{h}|\theta_1) p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)\].
The idea behind training a DBN by training a stack of RBMs is to keep the $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ defined by $\text{RBM}_1$, but to improve $p(\boldsymbol{v})$ by replacing $p(\boldsymbol{h}|\theta_1)$ by a better prior over the hidden vectors. To improve $p(\boldsymbol{v})$, this better prior must have a smaller KL divergence than $p(\boldsymbol{h}|\theta_1)$ from the "aggregated posterior", which is the equally weighted mixture of the posterior distributions over the hidden vectors of $\text{RBM}_1$ on all $N$ of the training cases:
\[ \frac{1}{N} \sum_{\boldsymbol{v}\in \textbf{train}} p(\boldsymbol{h}|\boldsymbol{v}, \theta_1) \]. The analogous state for Gaussian Mixture models is that the updated mixing proportion of a component should be closer to the average posterior probability of that component over all training cases.

Now consider training $\text{RBM}_2$, which is the network formed by using the samples from the averaged posterior of $\text{RBM}_1$ as training data. It is easy to ensure that the distribution which $\text{RBM}_2$ defines over its visible units is identical to $p(\boldsymbol{h}|\theta_1)$: we simply initialize $\text{RBM}_2$ to be an upside-down version of $\text{RBM}_1$ in which the roles of visible and hidden units have been swapped. So $\text{RBM}_2$ has $\boldsymbol{h}$ as a visible vector and $\boldsymbol{h}_2$ as a hidden vector. Then we train $\text{RBM}_2$ which makes $p(\boldsymbol{h}|\theta_2)$ be a better model of the aggregated posterior than $p(\boldsymbol{h}|\theta_1)$.

After training $\text{RBM}_2$, we can combine the two RBMs to create a hybrid of a directed and an undirected model. $p(\boldsymbol{h}|\theta_2)$ is defined by the undirected $\text{RBM}_2$, buy $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ is defined by directed connections from the first hidden layer to the visible units. In this hybrid model, which we call a deep belief net, extract inference of $p(\boldsymbol{h}|\boldsymbol{v}, \theta_1, \theta_2)$ is no longer easy because the prior over the hidden vectors is no longer defined by $\theta_1$. However, if we perform approximate inference for the first hidden layer by using equation \[p(h_j=1|\boldsymbol{v})=\sigma(a_j + \sum_i v_i w_{ij}) \], there is a variational lower bound on the log probability of the training data that is improved every time we add another layer to the DBN, provided we add it in the appropriate way.

After training a stack of RBMs, the bottom up recognition weights of the resulting DBN can be used to initialize the weights of a multi-layer feed-forward neural network, which can then be discriminatively fine-tuned by backpropagating error derivatives. The feed-forward network is given a final "softmax" layer that computes a probability distribution over class labels and the derivative of the log probability  of the correct class is backpropagated to train the incoming weights of the final layer and to discriminatively fine-tune the weights in all lower layers.

In principle, adding more layers improves modeling power, unless the DBN already perfectly models the data. In practice, however, little is gained by using more than about 3 hidden layers.

Friday, January 24, 2014

Kaldi lattices

Kaldi is based on WFST for decoding so is the lattices. Firstly, a WFST has a set of states with one distinguished start state; each state has a final cost; and there is a set of arcs between the states, where each arc has an input label, an output label, and a weight, which is also referred to as a cost and is usually the negated log-probability.

To decode an utterance with T frames, the WFST interpretation is as follows. We construct an accepter, or WFSA (WFST with the same input and output symbol for all the arcs). It has T+1 states, with an arc for each cobmination of (time, context-dependent HMM state). The costs on there arcs correspond to negated and scaled acoustic log-likelihood. Call this acceptor U, then the complete search space is:
S= U * HCLG, where * represents the composition operation and HCLG is the decoding graph generated for Kaldi. HCLG integrates the HMM transition, context expansion, lexicon and most importantly the language model. In composition, the weights are added together. As they are negated log-likelihoods, it is effectively multiplying the original probabilities.

There are two representations of lattices in Kaldi:

1) the Lattice type

Each lattice is stored as an FST, with the following format:

[start state id] [end state id] [input symbol] [output symbol] [weight]

Usually the input symbols are the transition-ids and output symbols are words. The weight is usually:

[the graph cost],[the acoustic cost]

[the graph cost] : a sum of the LM cost, the (weighted) transition probabilities, and any pronunciation cost. Since they are mixed together, in order to do things like LM rescoring Kaldi typically first subtract the "old" LM cost and then add in the "new" LM cost to avoid touching the other parts of the cost.
[the acoustic cost] : the acoustic cost is the unscaled raw scores. The scaling is applied before any algorithm and unscale them again before writing them out.

0       616     273     2402    10.9106,2539.22
0       667     345     3       4.09183,365.022
0       672     273     1268    13.7098,2659.86
0       726     273     758     12.6504,2686.83
0       780     273     3157    14.3474,2700.35
0       834     273     1992    16.2729,2677.13
1       888     281     3       10.8443,4811.55
... ...

2) the CompactLattice type

This is also in FSA, acceptor, format, with an arc format as:

[start state id] [end state id] [input & output symbol] [weight]

The input/output symbol are usually the word ids and the weight is:

[the graph cost],[the acoustic cost],[a string sequence of integers]

The integers represent the transition ids, i.e. the frame-alignment for this word symbol.

0       20      2402    10.9106,2539.22,273_282_280_280_280_280_280_280_280_280_280_288_63704_63703_63712_63732_63731_63731_58806_58805_58805_58805_58805_58864_58863_58888_15286_15285_15285_15285_15285_15285_15285_15285_15285_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375
0       16      3       4.09183,365.022,345_354_352_352_352_352
0       9       1268    13.7098,2659.86,273_282_280_280_280_280_280_280_280_280_280_280_288_42724_42723_42736_42776_42775_3698_3697_3712_3711_3711_3711_3711_3711_3711_3762_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273
0       5       758     12.6504,2686.83,273_282_280_280_280_280_280_280_280_280_280_280_288_64680_64702_64708_64707_64707_3680_3679_3679_3724_3723_3723_3762_3761_3761_3761_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273
0       3       3157    14.3474,2700.35,273_282_280_280_280_280_280_280_280_280_280_280_288_3186_3185_3185_3185_3185_3185_3226_3225_3225_3225_3225_3246_3245_3245_3245_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273

... ...

The conversion can be done using the lattice-copy with the option --write-compact. Besides, in Kaldi, these two types of lattices are not distinguished for I/O purpose, that's to say the tools dealing with lattices can take in and output any of these two formats. However, typically, the CompactLattice is used as the default output format.

Many algorithms on lattices (for instance, taking the best path, or pruning) are most efficient to do using the Lattice type rather than the CompactLattice type.

In general, the words in Kaldi lattices are not synchronized with to the transition-ids, meaning that the transition-ids on an arc won't necessarily all belong to the word whose label is on that arc. This means the times you get from the lattice will be inexact. It is also true of the weights. To obtain the exact times, you should run the program lattice-align-words. It only works if the system is based on word-position-dependent phones and it requires certain command line options to tell which phones are in which position in the word. An alternative program, lattice-align-words-lexicon can be used if the system does not have word-position-dependent phones.

An HTK lattice has the exact time information. An example HTK lattice is:

lmscale=20.00 wdpenalty=-30.00
N=31 L=56
I=0 t=0.00
I=1 t=0.36
I=2 t=0.75
I=3 t=0.81
... etc
I=30 t=2.48
J=0 S=0 E=1 W=SILENCE v=0 a=-3239.01 l=0.00
J=1 S=1 E=2 W=FOUR v=0 a=-3820.77 l=0.00
... etc
J=55 S=29 E=30 W=SILENCE v=0 a=-246.99 l=-1.20

The first 5 lines comprise a header which records names of the files used to generate the lattice along with the settings of the language model scale and penalty factors. Each node in the lattice represents a point in time measured in seconds and each arc represents a word spanning the segment of the input starting at the time of its start node and ending at the time of its end node. For each such span, v gives the number of the pronunciation used, a gives the acoustic score and l gives the language model score.

The language model scores in output lattices do not include the scale factors and penalties. These are removed so that the lattice can be used as a constraint network for subsequent recognizer testing.

Thursday, January 16, 2014


I always believe the best way to start a new exploration is to follow what has been done and get familiar and understand their choices. I hence started with reproducing the system included in Kaldi for the limitedLP Vietnamese task.

1). Lexicon preparation.

In the language specification, there are totally 25 consonants and 45 vowels (12 monophthongs,  25 diphthongs and 8 triphthongs). The total number of phonemes is 54 which may be too much for a "limited" language. In Kaldi's setup, both the diphthongs and triphthongs are mapped back to the monophthongs with the following option:

phoneme_mapping="i@U=i @ U;oaI=o a I;oaI:=o a I:;u@I=u @ I;uI@= u I @;1@I=1 @ I;1@U=1 @ U;
  a:I=a: I; a:U=a: U; aU=a U; @U=@ U; aI=a I; @I=@ I; EU=E U; eU=e U; i@=i @; iU=i U; Oa:=O a: ; Oa=O a;
  OE=O E; OI=O I; oI=o I; @:I=@: I; u@=u @; 1@=1 @; ue=u e; uI=u I; 1I=1 I; u@:=u @:; 1U=1 U; ui:=u i:"

Through this processing, the number of individual phonemes are reduced to the number of consonants and monophthongs.

2). Tone information.

Although tones are only applied to vowels in theory, in the Kaldi setup, the tone is applied to all the phonemes of the corresponding syllable. One possible explanation is that the use of tones to vowels may also affect the realization of consonants due to the co-articulation effects.

One original lexicon item: Amway   a: m _1 . w aI _1
The corresponding Kaldi item: Amway   a:_1 m_1         w_1 a_1 I_1

The period in the pronunciation indicates the syllable boundary. With this tonal information, the number of phonemes increased to around 6 times as there are totally 6 different phonemes.

3). Position dependent phonemes.

The phonemes used in Kaldi are further distinguished using their positions in words. Four positions marker are used: (B)egin, (E)nd, (I)nternal and (S)ingleton . For this setup, even SIL is marked to have following variations: SIL SIL_B SIL_E SIL_I SIL_S.

4). Features.

In the Kaldi's setup, PLP features are used together with Pitch features and/or FFV (fundamental frequency variations).

Wednesday, January 15, 2014

Extract Chinese texts from Wiki

To build a local Mandarin speech corpus, the wiki texts are decided to use as prompts for speakers to read. Following is the major steps involved in extracting Mandarin sentences from the Chinese wiki dump, which are learned from

1. Wiki Dump downloading:

The one we used is (although at the time of writing it is no longer the latest one).

2. Main body text extraction

The extraction uses the developed initially for Italian Wiki text extraction. A simple command will do:

bzcat zhwiki-20131221-pages-articles-multistream.xml.bz2 | python -b 1000M -o texts > output.log

The extracted file will be automatically split into files with size less than 1000M (option "-b 1000M") and will be saved in the folder "texts" (option "-o texts").

3. Traditional Chinese to Simplified Chinese

The conversion is done using the open source tool opencc :

opencc -i wiki_00 -o wiki_00_chs -c zht2zhs.ini

wiki_00 is the input file obtained from the previous step and wiki_00_chs is the converted output file and the zht2zhs.ini is simply specifying which configuration file to use. No need to create it on your own.

Till now, we have obtained the wiki texts in a single file with each article in a "<doc..> ... </doc>" entry.

Following are steps specific to the generation of sentences used for speech recording.

Assuming an average reading speed of 240 characters per minutes ( and an average sentences length of 20 characters (15~25), one sentence will lead to a speech recording of 5 seconds. For a 150 hours speech corpus, we hence would need 108,000 sentences. Let's prepare 150,000 or 200,000 sentences.

The text normalization is relatively easier as we can simple discard those with unwanted variations for the purpose of sentence selection. The steps we have done are as follows:

a) Remove the in-sentence comments of the format "(...)"
b) Replace the Chinese coded alphanumeric strings to traditional ascii ones, such as "a" to "a"
c) Convert numerical year representation to Chinese representation, such as "1976年" to "一九七六年"
d) Convert percentage numbers to Chinese character representation
e) Convert numbers to Chinese characters
f) Split paragraph into sentences based on the boundary symbols: "。!;?!;?"
g) For sentences with more than 50 characters (twice the number of the maximum length we want to use), they will be further split based on commas: ",,"
h) Remove all the left punctuation
i) Remove spaces in the sentences
j) Final check of whether the sentence is made up of Chinese characters only (zhon.unicode.HAN_IDEOGRAPHS from
k) save the sentence if it is not empty string

Detailed Python implementation could be found at

The last step is to use an existing LM to compute the perplexity for each sentences and the final selection is based on that score. The per utterance perplexity computation could be done with the ngram tool of the srilm package.

Tuesday, January 14, 2014


To start the participation of this year's openKWS, setup the dry-run of the submission is carried out first before having any system yet.

Detailed instructions could be found at

Following are the steps done:

1. Vietnamese data download, provided by Prof.
2. IndusDB - not available
3. SCTK installed
4. JobRunner extracted
5. F4DE: so much staff, maybe only care about the KWSEval is enough, but just in case, install all!
         sudo apt-get install gnu-plot libxml2 sqlite3
         make perl_install
6. Account application - needs PI

Following are some notes from the doc ( to be kept in mind:

1. the KWS task is to final all of the occurrences of a keyword, a sequence of one or more words in a corpus of un-segmented speech data.

2. the lexicon provided in the "build pack" for training contains entries for both the training and development test data. The lexical items that exist only in the development test data must be excluded during model training. 

3. keywords, a sequence of contiguous lexical items, will be specified in the language's UTF-8 encoded, native orthographic representation.

4. Homographs, words with the same written form but different meanings, will not be differentiated. Morphological variations of a keyword will not be considered positive variations.

5. transcript comparisons will be case insensitive

6. the silence gap between adjacent words in a keyword must be <= 0.5 second