As deep neural nets grow in popularity, researchers in the speech recognition community are revisiting neural nets and searching for new directions. Besides depth, length and width are starting to receive attention.
In Long, Deep and Wide Artificial Neural Nets for Dealing with Unexpected Noise in Machine Recognition of Speech, Hermansky argues that there are also benefits in making the nets longer in the temporal direction and wider, with multiple parallel processing streams.
While the DNN-generated speech sound likelihood estimates are demonstrated to be better than the earlier used likelihoods derived by generative Gaussian Mixture Models, unexpected signal distortions that were not seen in the training data can still make the acoustic likelihoods unacceptably low. A step towards addressing the unreliable acoustic evidence might be in expanding the net architectures not only into deeper but also into longer and wider structures, where substantial temporal context attempts to cover whole co-articulation patterns of speech sounds, and multiple processing paths, attending to multiple parts of information-carrying space, attempt to capitalize on redundancies of coding of information in speech, possibly allowing for adaptive alleviation of corrupted processing streams.
This paper suggests that MLP-based estimation of posterior probabilities of speech sounds should be done from relatively long segments of the speech signal, and in many parallel interacting streams, resulting in MLP architectures that are not only deep but also long and wide. The streams should describe the speech signal in different ways, capitalizing on the redundant way the message is coded in the signal. Given the constantly changing acoustic environment, the choice of the best streams for the final decision about the message should be made adaptively.
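To make the idea of adaptive stream selection concrete, here is a minimal sketch of one simple heuristic: weighting each stream's posterior vector by the inverse of its entropy, so that confident streams count more than near-uniform ones. This is an assumption chosen for illustration, not necessarily the method proposed in the paper; the function names and the toy posteriors are hypothetical.

```python
import math


def entropy(posterior):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)


def combine_streams(stream_posteriors, eps=1e-8):
    """Fuse per-stream posteriors with inverse-entropy weights.

    stream_posteriors: one posterior vector per stream, all defined
    over the same set of speech-sound classes.
    Returns the fused posterior and the normalized stream weights.
    """
    # A low-entropy (confident) stream gets a large weight;
    # eps avoids division by zero for a one-hot posterior.
    weights = [1.0 / (entropy(p) + eps) for p in stream_posteriors]
    total = sum(weights)
    weights = [w / total for w in weights]

    n_classes = len(stream_posteriors[0])
    fused = [
        sum(w * p[k] for w, p in zip(weights, stream_posteriors))
        for k in range(n_classes)
    ]
    return fused, weights


# Hypothetical example: one reliable stream, one stream corrupted by noise.
clean_stream = [0.90, 0.05, 0.05]
noisy_stream = [0.34, 0.33, 0.33]
fused, weights = combine_streams([clean_stream, noisy_stream])
```

In this toy case the near-uniform stream receives the smaller weight, so the fused decision follows the more reliable stream.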
In the book Speech and Hearing in Communication, Fletcher suggests that human speech recognition is carried out in individual frequency bands and that the final recognition error is given by the product of the error probabilities in the individual frequency streams. Based on similar studies, researchers in the ASR community proposed multi-stream ASR. The fundamental motivation is that when message cues are conflicting or corrupted in some processing streams, such a situation can be identified and a corrective action can focus on the more reliable streams that still provide enough cues to facilitate the recognition. (This actually reminds me of our previous study on a spectral masking technique for noisy speech recognition. It assumes every input feature is noisy and first tries to identify the "components" that are more speech-dominated, then keeps only that information and throws away the noise components. The subsequent recognition is based purely on that partial information. The main bottleneck of that approach is the mask estimation.)
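Going back to Fletcher's product rule, a small numeric sketch shows why it motivates multi-stream processing. The per-band error probabilities below are made-up values for illustration only, and the function name is hypothetical.

```python
import math


def fletcher_error(band_errors):
    """Overall error as the product of per-band error probabilities."""
    return math.prod(band_errors)


# Hypothetical error probabilities for four frequency bands.
band_errors = [0.5, 0.4, 0.9, 0.6]
overall = fletcher_error(band_errors)

# A fully corrupted band (error probability 1.0) leaves the product unchanged:
# under this rule it contributes nothing, but it also does no harm.
with_corrupted_band = fletcher_error(band_errors + [1.0])
```

Under this rule the overall error can never exceed the error of the best single band, which is the property multi-stream ASR tries to approximate by relying on the streams that remain reliable.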
Morgan also reviewed various ASR systems developed prior to the development of DNNs in Deep and Wide: Multiple Layers in Automatic Speech Recognition, with an emphasis on the use of multiple streams of highly dimensioned layers. That paper ultimately concludes that while deep processing structures can provide improvements for ASR systems, the choice of features and the structure with which they are incorporated, including layer width, can also be significant factors. The authors have typically found that using an insufficient number of units per layer can have a very large effect on the word error rate, although this saturates or can even slightly decline with too large a layer.
In the same paper, Morgan also points out that the ability to use many more parameters for a given amount of data without overfitting was one of the major design aims for deep learning networks.
Furthermore, in Deep vs. Wide: Depth on a Budget for Robust Speech Recognition, they investigated the effect of different depths and widths in DNNs with a fixed total number of model parameters on the Aurora2 task. Adding layers generally resulted in better accuracy, but the number of parameters increased with every layer added, so it was not clear what the main contributing factor to the good results was: the depth, or the large number of parameters. With a matched number of parameters, however, a shallow model usually performs worse than a deeper one.
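To see what "depth on a budget" means in practice, the following sketch computes the hidden-layer width that keeps a fully connected MLP within a fixed parameter budget as depth grows. It assumes equal-width hidden layers with biases; the input size, output size, and budget are hypothetical values, not the configuration used in the paper.

```python
def mlp_params(n_in, n_out, depth, width):
    """Weights plus biases of an MLP with `depth` hidden layers of equal `width`."""
    params = (n_in + 1) * width                   # input -> first hidden layer
    params += (depth - 1) * (width + 1) * width   # hidden -> hidden layers
    params += (width + 1) * n_out                 # last hidden layer -> output
    return params


def width_for_budget(n_in, n_out, depth, budget):
    """Largest hidden width whose parameter count stays within `budget`."""
    width = 0
    while mlp_params(n_in, n_out, depth, width + 1) <= budget:
        width += 1
    return width


# Hypothetical setup: 351 input features, 40 output classes, 1M parameters.
N_IN, N_OUT, BUDGET = 351, 40, 1_000_000
for depth in range(1, 6):
    width = width_for_budget(N_IN, N_OUT, depth, BUDGET)
    print(depth, width, mlp_params(N_IN, N_OUT, depth, width))
```

Each added layer forces the remaining layers to become narrower, which is what lets depth and parameter count be separated as explanations for accuracy gains.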
One interesting paper they referred to is Fritsch's ACID/HNN: A Framework for Hierarchical Connectionist Acoustic Modeling. Fritsch used a tree of networks to estimate a large number of context-dependent classes, using the simple factoring trick described in Morgan's paper Factoring networks by a statistical method.
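As I understand it, the factoring trick rests on the chain rule: the posterior of a fine-grained class is the product of the conditional posteriors along its path in the tree, for example P(cluster, class | x) = P(cluster | x) * P(class | cluster, x). The sketch below illustrates this with hard-coded numbers standing in for network outputs; the tree layout, node names, and values are hypothetical and are not taken from either paper.

```python
def leaf_posteriors(tree, prefix=1.0):
    """Multiply conditional posteriors along each root-to-leaf path.

    `tree` maps a node name to (conditional posterior, subtree);
    a subtree of None marks a leaf, i.e. a final context-dependent class.
    """
    out = {}
    for name, (prob, subtree) in tree.items():
        if subtree is None:
            out[name] = prefix * prob
        else:
            out.update(leaf_posteriors(subtree, prefix * prob))
    return out


# Stand-ins for network outputs at one frame: a root network chooses a
# cluster, and one small network per cluster chooses a class within it.
tree = {
    "cluster_A": (0.7, {"class_A1": (0.6, None), "class_A2": (0.4, None)}),
    "cluster_B": (0.3, {"class_B1": (0.5, None), "class_B2": (0.5, None)}),
}
posteriors = leaf_posteriors(tree)
```

Because each node's conditional posteriors sum to one, the leaf posteriors also sum to one, so no single network ever needs an output layer covering all classes.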