While deep neural nets are getting popular, researchers in speech recognition communities start revisiting neural nets and searching for new directions. Length and width besides the depth start appearing.

In Long, Deep and Wide Artificial Neural Nets for Dealing with Unexpected Noise in Machine Recognition of Speech, Hermansky argues that benefits can be also seen in expanding the nets longer in temporal direction, and wider into multiple parallel processing streams.

While the DNN generated speech sound likelihood estimates are demonstrated to be better that the earlier used likelihoods derived by generative Gaussian Mixture Models, unexpected signal distortions that were not seen in the training data can still make the acoustic likelihoods unacceptably low. A step towards addressing the unreliable acoustic evidence might be in expanding the net architectures not only into deeper but also into longer and wider structures, where substantial temporal context attempts to cover whole co-articulation patterns of speech sounds, and multiple processing paths, attending to multiple parts of information-carrying space, attempt to capitalize on redundancies of coding of information in speech, possibly allowing for adaptive alleviation of corrupted processing streams.

This paper suggests that MLP-based estimation of posterior probabilities of speech sounds should be done from relatively long segments of speech signal, and in many parallel interacting streams, resulting on MLP architectures that are not only deep but also long and wide. The streams should describe the speech signal in different ways, capitalizing on the redundant way the message is coded in the signal. Given the constantly changing acoustic environment, the choice of the best streams for the final decision about the message should be done adaptively.

In the book Speech and Hearing in Communication, Fletcher suggests that human speech recognition is carried out in individual frequency bands and the final error in recognition is given by a product of probabilities of errors in the individual frequency streams. Based on similar studies, researchers in ASR community proposed multi-stream ASR. The fundamental motivation is that when message cues are conflicting or corrupted in some processing streams, such a situation can be identified and a corrective action can focus on the more reliable streams that still provide enough cues to facilitate the recognition. (This actually reminds me about our previous study on spectral masking technique for noisy speech recognition. It assumes every input feature is noisy and tries to first identify the "components" that are more speech-dominated, then keeps only those information and throws away noise components. The following recognition is purely based on those partial information. The main bottleneck of that approach is the mask estimation.)

Morgan also reviewed various ASR systems developed prior to the development of DNNs in Deep and Wide: Multiple Layers in Automatic Speech Recognition, with the emphasis on the use of multiple streams of highly dimensioned layers. That paper ultimately concludes that while the deep processing structures can provide improvements for ASR systems, choice of features and the structure with which they are incorporated, including layer width, can also be significant factors. The have typically found that using an insufficient number of units per layer can have a very effect on the word error rate although this saturates or can even slightly decline with too large a layer.

In this paper, Morgan also pointed out that the ability to use many more parameters for a given amount of data without overfitting was one of the major design aims for deep learning networks.

Furthermore, they investigated the effect of using different depth and width in DNNs with a fixed total number of model parameters on the Aurora2 task in Deep vs. Wide: Depth on a Budget for Robust Speech Recognition. Adding layers generally resulted in better accuracy, but the number of parameters was increased with every layer added, so that it was not clear what was the main contributing factor to the good results - the depth, or the large number of parameters. However, a shallow model with the same number of parameters usually performs worse than a deeper one.

One interesting paper they referred to is the HNN/ACID approach of Fritsch's paper ACID/HNN: A Framework for Hierarchical Connectionist Acoustic Modeling. He used a tree of networks in order to estimate a large number of context-dependent classes, using the simple factoring trick expounded in Morgan's paper Factoring networks by a statistical method.

## Saturday, March 22, 2014

## Saturday, March 15, 2014

### WFST for STD

A very detailed explanation of the WFST-based lattice indexing and searing for STD can be found in Lattice indexing for spoken term detection. It is one of the best papers I have read, both informative and well written. The Kaldi KWS tools are based on this paper and some implementation details could be found at http://kaldi.sourceforge.net/kws.html.

First, just to clear the confusion between the decoding beam and lattice beam for my self:

1) decoding beam is the beam width for search the best path through the complete search graph;

2) lattice beam is the beam width for pruning the whole lattice based on the obtained best path.

Intuitively, we need to find the best path first and then do another round of lattice pruning using the lattice beam. Practically, the pruning is done periodically rather than waiting for the end of the file ( Kaldi does it every 25 frames). The approach used in Kaldi is equivalent to link all currently active states to a "dummy" final state and then pruning in the normal way. [Generating exact lattices in the WFST framework].

Hence for the STD task, we can use a larger decoding beam to ensure the search for the best hypotheses and a smaller lattice beam to limit the lattice used for STD indexing and searching.

For STD, we need to construct an exact inverted index for ASR lattices with time information. All substrings seen in the lattices along with their time information must be indexed.

A factor transducer is an inverted index of the set of substrings (which is referred to as factors) of the set of strings comprising a document. A timed factor transducer stores the time information on the arc weights.

The whole procedure is as follows: we start with preprocessing each input automaton (ASR lattice) to obtain a posterior lattice in which non-overlapping arc clusters are separately labelled. Then from each processed input automaton we construct an intermediate factor transducer which recognizes exactly the set of factor of the input. We convert these intermediate structures into deterministic transducers by augmenting each factor with a disambiguation symbol and then applying weighted automaton optimization. Finally, we take the union of these deterministic transducers and further optimize the result to obtain a deterministic inverted index of the entire dataset.

The path weights of the ASR lattices correspond to joint probabilities assigned by the language and acoustic models. A general weight-pushing algorithm is applied to convert these weights to desired negated log posterior probabilities given the lattice.

For STD, we would like to keep separate index entries for non-overlapping occurrences in an utterance since we no longer search for the utterances but the exact time intervals containing the query term. A clustering algorithm is applied to achieve this separation, which assigns the arcs with the same input label and overlapping time-spans a unique class id. The clustering is done as follows:

1) sort the collected (start time, end time) pairs with respect to end times;

2) identify the largest set of non-overlapping (start time, end time) pairs and assign them as cluster heads;

3) classify the rest of the arcs according to maximal overlap.

After clustering, each arc carries a cluster identifier on the output label.

Next, all the factors are indexed in the following way:

1) map each arc to (negated log probability, start time, end time)

2) create a unique new initial state

3) create a unique new final state

4) for each state, $q$ in the automaton, create two new arcs: a) an initial arc from the new initial state to to $q$; b) a final arc from $q$ to the new final state.

Then the paths carrying the same factor-pair are merged and weighted $\epsilon$-removal, determinization and minimization are applied. Followed by the removal of cluster identifiers on the non-final arcs and insertion of disambiguation symbols into the final arcs. The resulting factor transducer is further optimized by applying weighted determinization and minimization.

The timed factor transducer of the entire dataset is constructed by:

1) taking the union of each TFT;

2) encoding the input-output labels as a single label and applying weighted $\epsilon$-removal, determinization, and minimization;

3) remove the disambiguation symbols on the final transitions.

With this generated TFT index, we can utilize factor selection filters in WFST form to restrict, transform or re-weight the index entries, for example:

A) a pronunciation dictionary which maps words to phone sequences can be applied to word lattices to obtain phonetic lattices.

B) a simple grammar which restricts the factors can be used. We can use such a grammar to reject the silence symbol, i.e., factors including the silence symbol are removed from the index.

Finally the search over the TFT index is the construction of the response automaton given a query automaton X:

1) composing X with the index TFT on the input side and projecting the resulting transducer onto its output labels;

2) removing the $\epsilon$ transitions and finally sorting with the shortest path algorithm.

Each successful path in the resulting response transducer is a single arc which carries an automaton identifier on its label and a (-log posterior probability, start time, end time) triplet on its weight. A simple traverse in the arc order gives sorted results.

In the paper, the authors find a lattice beam width of 4 is ideal for their STD experiments.

First, just to clear the confusion between the decoding beam and lattice beam for my self:

1) decoding beam is the beam width for search the best path through the complete search graph;

2) lattice beam is the beam width for pruning the whole lattice based on the obtained best path.

Intuitively, we need to find the best path first and then do another round of lattice pruning using the lattice beam. Practically, the pruning is done periodically rather than waiting for the end of the file ( Kaldi does it every 25 frames). The approach used in Kaldi is equivalent to link all currently active states to a "dummy" final state and then pruning in the normal way. [Generating exact lattices in the WFST framework].

Hence for the STD task, we can use a larger decoding beam to ensure the search for the best hypotheses and a smaller lattice beam to limit the lattice used for STD indexing and searching.

For STD, we need to construct an exact inverted index for ASR lattices with time information. All substrings seen in the lattices along with their time information must be indexed.

A factor transducer is an inverted index of the set of substrings (which is referred to as factors) of the set of strings comprising a document. A timed factor transducer stores the time information on the arc weights.

The whole procedure is as follows: we start with preprocessing each input automaton (ASR lattice) to obtain a posterior lattice in which non-overlapping arc clusters are separately labelled. Then from each processed input automaton we construct an intermediate factor transducer which recognizes exactly the set of factor of the input. We convert these intermediate structures into deterministic transducers by augmenting each factor with a disambiguation symbol and then applying weighted automaton optimization. Finally, we take the union of these deterministic transducers and further optimize the result to obtain a deterministic inverted index of the entire dataset.

The path weights of the ASR lattices correspond to joint probabilities assigned by the language and acoustic models. A general weight-pushing algorithm is applied to convert these weights to desired negated log posterior probabilities given the lattice.

For STD, we would like to keep separate index entries for non-overlapping occurrences in an utterance since we no longer search for the utterances but the exact time intervals containing the query term. A clustering algorithm is applied to achieve this separation, which assigns the arcs with the same input label and overlapping time-spans a unique class id. The clustering is done as follows:

1) sort the collected (start time, end time) pairs with respect to end times;

2) identify the largest set of non-overlapping (start time, end time) pairs and assign them as cluster heads;

3) classify the rest of the arcs according to maximal overlap.

After clustering, each arc carries a cluster identifier on the output label.

Next, all the factors are indexed in the following way:

1) map each arc to (negated log probability, start time, end time)

2) create a unique new initial state

3) create a unique new final state

4) for each state, $q$ in the automaton, create two new arcs: a) an initial arc from the new initial state to to $q$; b) a final arc from $q$ to the new final state.

Then the paths carrying the same factor-pair are merged and weighted $\epsilon$-removal, determinization and minimization are applied. Followed by the removal of cluster identifiers on the non-final arcs and insertion of disambiguation symbols into the final arcs. The resulting factor transducer is further optimized by applying weighted determinization and minimization.

The timed factor transducer of the entire dataset is constructed by:

1) taking the union of each TFT;

2) encoding the input-output labels as a single label and applying weighted $\epsilon$-removal, determinization, and minimization;

3) remove the disambiguation symbols on the final transitions.

With this generated TFT index, we can utilize factor selection filters in WFST form to restrict, transform or re-weight the index entries, for example:

A) a pronunciation dictionary which maps words to phone sequences can be applied to word lattices to obtain phonetic lattices.

B) a simple grammar which restricts the factors can be used. We can use such a grammar to reject the silence symbol, i.e., factors including the silence symbol are removed from the index.

Finally the search over the TFT index is the construction of the response automaton given a query automaton X:

1) composing X with the index TFT on the input side and projecting the resulting transducer onto its output labels;

2) removing the $\epsilon$ transitions and finally sorting with the shortest path algorithm.

Each successful path in the resulting response transducer is a single arc which carries an automaton identifier on its label and a (-log posterior probability, start time, end time) triplet on its weight. A simple traverse in the arc order gives sorted results.

In the paper, the authors find a lattice beam width of 4 is ideal for their STD experiments.

## Wednesday, March 12, 2014

### RBM and DBN

Some nice explanations on RBM and DBN from the paper Application of Deep Belief Network for Natural Language Understanding:

RBMs can be trained using unlabeled data and they can learn stochastic binary features which are good for modeling the higher-order statistical structure of a dataset. Even thought these features are discovered without considering the discriminative task for which they will be used, some of them are typically very useful for classification as well as for generation.

After training the network consisting of the visible layer and the first hidden layer, which we will refer to as $ \text{RBM}_1 $, its learned parameters, $ \theta_1 $, define $ p(\boldsymbol{v}, \boldsymbol{h}|\theta_1), p(\boldsymbol{v}|\theta_1), p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ and $ p(\boldsymbol{h}|\boldsymbol{v}, \theta_1) $ via equations

\[ p(h_j = 1 | \boldsymbol{v}) = \sigma(a_j + \sum_i v_i w_{ij}) \]

and

\[ p(v_i =1 | \boldsymbol{h}) = \sigma(b_i + \sum_j h_j w_{ij}) \].

The parameters of $ \text{RBM}_1$ also define a prior distribution over hidden vectors, $ p(\boldsymbol{h}|\theta_1)$, which is obtained by marginalizing over the space of visible vectors. This allows $p(\boldsymbol{v}|\theta_1)$ to be written as:

\[ p(\boldsymbol{v}|\theta_1) = \sum_{\boldsymbol{h}} p(\boldsymbol{h}|\theta_1) p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)\].

The idea behind training a DBN by training a stack of RBMs is to keep the $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ defined by $\text{RBM}_1$, but to improve $p(\boldsymbol{v})$ by replacing $p(\boldsymbol{h}|\theta_1)$ by a better prior over the hidden vectors. To improve $p(\boldsymbol{v})$, this better prior must have a smaller KL divergence than $p(\boldsymbol{h}|\theta_1)$ from the "aggregated posterior", which is the equally weighted mixture of the posterior distributions over the hidden vectors of $\text{RBM}_1$ on all $N$ of the training cases:

\[ \frac{1}{N} \sum_{\boldsymbol{v}\in \textbf{train}} p(\boldsymbol{h}|\boldsymbol{v}, \theta_1) \]. The analogous state for Gaussian Mixture models is that the updated mixing proportion of a component should be closer to the average posterior probability of that component over all training cases.

Now consider training $\text{RBM}_2$, which is the network formed by using the samples from the averaged posterior of $\text{RBM}_1$ as training data. It is easy to ensure that the distribution which $\text{RBM}_2$ defines over its visible units is identical to $p(\boldsymbol{h}|\theta_1)$: we simply initialize $\text{RBM}_2$ to be an upside-down version of $\text{RBM}_1$ in which the roles of visible and hidden units have been swapped. So $\text{RBM}_2$ has $\boldsymbol{h}$ as a visible vector and $\boldsymbol{h}_2$ as a hidden vector. Then we train $\text{RBM}_2$ which makes $p(\boldsymbol{h}|\theta_2)$ be a better model of the aggregated posterior than $p(\boldsymbol{h}|\theta_1)$.

After training $\text{RBM}_2$, we can combine the two RBMs to create a hybrid of a directed and an undirected model. $p(\boldsymbol{h}|\theta_2)$ is defined by the undirected $\text{RBM}_2$, buy $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ is defined by directed connections from the first hidden layer to the visible units. In this hybrid model, which we call a deep belief net, extract inference of $p(\boldsymbol{h}|\boldsymbol{v}, \theta_1, \theta_2)$ is no longer easy because the prior over the hidden vectors is no longer defined by $\theta_1$. However, if we perform approximate inference for the first hidden layer by using equation \[p(h_j=1|\boldsymbol{v})=\sigma(a_j + \sum_i v_i w_{ij}) \], there is a variational lower bound on the log probability of the training data that is improved every time we add another layer to the DBN, provided we add it in the appropriate way.

After training a stack of RBMs, the bottom up recognition weights of the resulting DBN can be used to initialize the weights of a multi-layer feed-forward neural network, which can then be discriminatively fine-tuned by backpropagating error derivatives. The feed-forward network is given a final "softmax" layer that computes a probability distribution over class labels and the derivative of the log probability of the correct class is backpropagated to train the incoming weights of the final layer and to discriminatively fine-tune the weights in all lower layers.

In principle, adding more layers improves modeling power, unless the DBN already perfectly models the data. In practice, however, little is gained by using more than about 3 hidden layers.

RBMs can be trained using unlabeled data and they can learn stochastic binary features which are good for modeling the higher-order statistical structure of a dataset. Even thought these features are discovered without considering the discriminative task for which they will be used, some of them are typically very useful for classification as well as for generation.

After training the network consisting of the visible layer and the first hidden layer, which we will refer to as $ \text{RBM}_1 $, its learned parameters, $ \theta_1 $, define $ p(\boldsymbol{v}, \boldsymbol{h}|\theta_1), p(\boldsymbol{v}|\theta_1), p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ and $ p(\boldsymbol{h}|\boldsymbol{v}, \theta_1) $ via equations

\[ p(h_j = 1 | \boldsymbol{v}) = \sigma(a_j + \sum_i v_i w_{ij}) \]

and

\[ p(v_i =1 | \boldsymbol{h}) = \sigma(b_i + \sum_j h_j w_{ij}) \].

The parameters of $ \text{RBM}_1$ also define a prior distribution over hidden vectors, $ p(\boldsymbol{h}|\theta_1)$, which is obtained by marginalizing over the space of visible vectors. This allows $p(\boldsymbol{v}|\theta_1)$ to be written as:

\[ p(\boldsymbol{v}|\theta_1) = \sum_{\boldsymbol{h}} p(\boldsymbol{h}|\theta_1) p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)\].

The idea behind training a DBN by training a stack of RBMs is to keep the $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ defined by $\text{RBM}_1$, but to improve $p(\boldsymbol{v})$ by replacing $p(\boldsymbol{h}|\theta_1)$ by a better prior over the hidden vectors. To improve $p(\boldsymbol{v})$, this better prior must have a smaller KL divergence than $p(\boldsymbol{h}|\theta_1)$ from the "aggregated posterior", which is the equally weighted mixture of the posterior distributions over the hidden vectors of $\text{RBM}_1$ on all $N$ of the training cases:

\[ \frac{1}{N} \sum_{\boldsymbol{v}\in \textbf{train}} p(\boldsymbol{h}|\boldsymbol{v}, \theta_1) \]. The analogous state for Gaussian Mixture models is that the updated mixing proportion of a component should be closer to the average posterior probability of that component over all training cases.

Now consider training $\text{RBM}_2$, which is the network formed by using the samples from the averaged posterior of $\text{RBM}_1$ as training data. It is easy to ensure that the distribution which $\text{RBM}_2$ defines over its visible units is identical to $p(\boldsymbol{h}|\theta_1)$: we simply initialize $\text{RBM}_2$ to be an upside-down version of $\text{RBM}_1$ in which the roles of visible and hidden units have been swapped. So $\text{RBM}_2$ has $\boldsymbol{h}$ as a visible vector and $\boldsymbol{h}_2$ as a hidden vector. Then we train $\text{RBM}_2$ which makes $p(\boldsymbol{h}|\theta_2)$ be a better model of the aggregated posterior than $p(\boldsymbol{h}|\theta_1)$.

After training $\text{RBM}_2$, we can combine the two RBMs to create a hybrid of a directed and an undirected model. $p(\boldsymbol{h}|\theta_2)$ is defined by the undirected $\text{RBM}_2$, buy $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ is defined by directed connections from the first hidden layer to the visible units. In this hybrid model, which we call a deep belief net, extract inference of $p(\boldsymbol{h}|\boldsymbol{v}, \theta_1, \theta_2)$ is no longer easy because the prior over the hidden vectors is no longer defined by $\theta_1$. However, if we perform approximate inference for the first hidden layer by using equation \[p(h_j=1|\boldsymbol{v})=\sigma(a_j + \sum_i v_i w_{ij}) \], there is a variational lower bound on the log probability of the training data that is improved every time we add another layer to the DBN, provided we add it in the appropriate way.

After training a stack of RBMs, the bottom up recognition weights of the resulting DBN can be used to initialize the weights of a multi-layer feed-forward neural network, which can then be discriminatively fine-tuned by backpropagating error derivatives. The feed-forward network is given a final "softmax" layer that computes a probability distribution over class labels and the derivative of the log probability of the correct class is backpropagated to train the incoming weights of the final layer and to discriminatively fine-tune the weights in all lower layers.

In principle, adding more layers improves modeling power, unless the DBN already perfectly models the data. In practice, however, little is gained by using more than about 3 hidden layers.

Subscribe to:
Posts (Atom)