Saturday, May 31, 2014

[paper] i-vector based speaker adaptation of deep neural networks for French broadcast audio transcription

Link to the paper: http://www.crim.ca/perso/patrick.kenny/Gupta_ICASSP2014.pdf

This paper shows that the i-vector representation of speech segments can be used to perform blind speaker adaptation of hybrid DNN-HMM systems. Acoustic features are augmented with the corresponding i-vectors before being presented to the DNN. The same i-vector is used for all acoustic feature vectors aligned with a given speaker.

The paper also shows that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used.

i-vectors are a fixed-dimensional representation of speech segments (the dimensionality is independent of the segment duration). During training and recognition, one i-vector per speaker is computed as an additional input to the DNN. All the frames corresponding to this speaker have the same i-vector appended to them. Speaker adaptation during decoding is completely unsupervised, but a diarization step is needed in order to extract an i-vector for each speaker in the audio file.
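The frame-level augmentation described above can be sketched in a few lines of NumPy. This is only an illustration, not the paper's code: the function name `append_ivector` and the toy dimensions (368D features, 100D i-vector) are assumptions chosen to match numbers mentioned elsewhere in these notes.

```python
import numpy as np

def append_ivector(frames, ivector):
    """Append the same speaker i-vector to every acoustic frame.

    frames:  (num_frames, feat_dim) acoustic features of one speaker
    ivector: (ivec_dim,) i-vector computed for that speaker
    returns: (num_frames, feat_dim + ivec_dim) DNN input
    """
    frames = np.asarray(frames, dtype=float)
    ivector = np.asarray(ivector, dtype=float)
    # Repeat the i-vector once per frame, then concatenate along the feature axis.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.hstack([frames, tiled])

# Toy data: 5 frames of 368D features and one 100D i-vector.
rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 368))
ivector = rng.standard_normal(100)
dnn_input = append_ivector(frames, ivector)
print(dnn_input.shape)  # (5, 468)
```

Every row of `dnn_input` ends with the same 100 values, so the DNN sees a constant speaker descriptor next to the time-varying acoustic features.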

The TRAP features are used in their work. The computation process is as follows:
1) normalize the 23D filterbank features to zero mean per speaker;
2) 31 frames of these features are spliced together to form a 713D feature vector;
3) a Hamming window is applied to the 713D feature vector;
4) a discrete cosine transform is applied and the dimensionality is reduced to 368D;
5) a global mean and variance normalization is further carried out.
After this processing, the final 368D feature vector is used as the input to the DNN.
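The five steps above can be sketched roughly as follows. Several details are my assumptions, not taken from the paper: the Hamming window and the DCT are applied along the time axis of each filterbank band, 16 DCT coefficients are kept per band (since 23 x 16 = 368), edge frames are padded by repetition, and the "global" statistics are computed on the toy data itself rather than on a training set. All function names are hypothetical.

```python
import numpy as np

NUM_BANDS = 23   # filterbank dimension (from the paper)
CONTEXT = 31     # spliced frames (from the paper): 23 * 31 = 713
NUM_DCT = 16     # assumption: coefficients kept per band, 23 * 16 = 368

def dct_basis(num_coeffs, length):
    """First `num_coeffs` rows of an (unnormalized) DCT-II basis."""
    n = np.arange(length)
    k = np.arange(num_coeffs)[:, None]
    return np.cos(np.pi / length * (n + 0.5) * k)  # (num_coeffs, length)

def trap_features(fbank):
    """fbank: (num_frames, 23) filterbank features of one speaker."""
    fbank = np.asarray(fbank, dtype=float)
    num_frames = fbank.shape[0]
    # 1) zero mean per speaker
    fbank = fbank - fbank.mean(axis=0, keepdims=True)
    # 2) splice 31 frames around each frame (edge padding is an assumption)
    half = CONTEXT // 2
    padded = np.pad(fbank, ((half, half), (0, 0)), mode="edge")
    spliced = np.stack([padded[t:t + CONTEXT] for t in range(num_frames)])
    # spliced has shape (num_frames, 31, 23), i.e. 713 values per frame
    # 3) Hamming window along the time axis
    windowed = spliced * np.hamming(CONTEXT)[None, :, None]
    # 4) DCT along time for each band, keeping 16 coefficients per band
    basis = dct_basis(NUM_DCT, CONTEXT)                 # (16, 31)
    coeffs = np.einsum("kc,tcb->tbk", basis, windowed)  # (num_frames, 23, 16)
    return coeffs.reshape(num_frames, NUM_BANDS * NUM_DCT)

def global_mvn(features, eps=1e-8):
    """5) mean and variance normalization (statistics from `features` itself)."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Toy data standing in for one speaker's filterbank features.
rng = np.random.default_rng(0)
fbank = rng.standard_normal((50, NUM_BANDS))
feats = global_mvn(trap_features(fbank))
print(feats.shape)  # (50, 368)
```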

Using $\boldsymbol{i}$ to denote the i-vector and $\boldsymbol{s}$ to denote the utterance supervector, the probability model for the supervector is

$\boldsymbol{s} = \boldsymbol{m} + \boldsymbol{T} \boldsymbol{i}, \quad \boldsymbol{i} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$

where $\boldsymbol{m}$ is the supervector defined by a universal background model (UBM) and the columns of the matrix $\boldsymbol{T}$ are the eigenvectors. Estimating an i-vector extractor therefore amounts to estimating $\boldsymbol{T}$.
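To make the model concrete, here is a toy numerical sketch of $\boldsymbol{s} = \boldsymbol{m} + \boldsymbol{T} \boldsymbol{i}$. The dimensions and the random $\boldsymbol{m}$ and $\boldsymbol{T}$ are made up for illustration. The least-squares recovery at the end only works because the toy supervector is noise-free; as far as I understand, a real extractor instead computes the i-vector as a posterior estimate from statistics collected with the UBM, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
SUPERVECTOR_DIM = 60  # toy size; in practice (number of Gaussians) x (feature dim)
IVECTOR_DIM = 4       # toy size; the paper uses 100D to 400D

m = rng.standard_normal(SUPERVECTOR_DIM)                 # stand-in for the UBM supervector
T = rng.standard_normal((SUPERVECTOR_DIM, IVECTOR_DIM))  # stand-in for a trained T

def supervector(i):
    """Generate a supervector from an i-vector under s = m + T i."""
    return m + T @ i

i = rng.standard_normal(IVECTOR_DIM)  # i ~ N(0, I)
s = supervector(i)

# With a noise-free s and a known T, i can be recovered by least squares.
i_hat, *_ = np.linalg.lstsq(T, s - m, rcond=None)
print(np.allclose(i_hat, i))
```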

From the paper, the i-vector model is clear, but the paper doesn't give a detailed explanation of how it is estimated. I may need to read more about i-vectors to really understand it.

Some useful findings from the experiments in the paper are:
1) Length-normalized i-vectors gave better performance than unnormalized ones. The normalization adopted in their work simply divides the i-vector by its Euclidean norm (the square root of the sum of the squares of its elements).

2) The i-vector based adaptation is effective for both seen and unseen speakers.

3) i-vectors with higher dimensionality give better performance. In their experiments with 100D, 200D and 400D i-vectors, the 400D i-vector performed best.
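The length normalization from finding 1) is easy to write down. A minimal sketch, where the function name and the small `eps` guard against an all-zero vector are my own additions:

```python
import numpy as np

def length_normalize(ivector, eps=1e-12):
    """Divide an i-vector by the square root of the sum of its squared elements."""
    ivector = np.asarray(ivector, dtype=float)
    norm = np.sqrt(np.sum(ivector ** 2))
    # eps only protects against division by zero for an all-zero input.
    return ivector / max(norm, eps)

v = length_normalize([3.0, 4.0])
print(v)  # [0.6 0.8]
```

After normalization every i-vector has unit length, so only its direction is passed to the DNN.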