Friday, January 24, 2014

Kaldi lattices

Kaldi decoding is based on WFSTs (weighted finite-state transducers), and so are its lattices. A WFST has a set of states, one of which is the distinguished start state; each state has a final cost; and there is a set of arcs between the states, where each arc has an input label, an output label, and a weight. The weight is also referred to as a cost and is usually a negated log-probability.

To decode an utterance with T frames, the WFST interpretation is as follows. We construct an acceptor, or WFSA (a WFST whose input and output symbols are the same on every arc). It has T+1 states, with an arc for each combination of (time, context-dependent HMM state). The costs on these arcs correspond to negated and scaled acoustic log-likelihoods. Call this acceptor U; then the complete search space is:
S = U * HCLG, where * represents the composition operation and HCLG is the decoding graph built by Kaldi. HCLG integrates the HMM transitions, the context expansion, the lexicon and, most importantly, the language model. In composition, the weights of the matched arcs are added together. Since they are negated log-likelihoods, this is effectively multiplying the original probabilities.
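
The equivalence between adding costs and multiplying probabilities can be checked with a few lines of Python. This is only an illustrative sketch: the two probabilities are made-up numbers, not values from a real decoding graph.

```python
import math

# Hypothetical probabilities for one arc of U (acoustic) and one arc of HCLG (graph).
p_acoustic = 0.02
p_graph = 0.1

# Costs are negated log-probabilities.
cost_acoustic = -math.log(p_acoustic)
cost_graph = -math.log(p_graph)

# Composition adds the costs of the matched arcs ...
combined_cost = cost_acoustic + cost_graph

# ... which is the negated log of the product of the probabilities.
assert math.isclose(combined_cost, -math.log(p_acoustic * p_graph))
```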

There are two representations of lattices in Kaldi:

1) the Lattice type

Each lattice is stored as an FST, with the following format:

[start state id] [end state id] [input symbol] [output symbol] [weight]

Usually the input symbols are the transition-ids and output symbols are words. The weight is usually:

[the graph cost],[the acoustic cost]

[the graph cost] : a sum of the LM cost, the (weighted) transition probabilities, and any pronunciation cost. Since these are mixed together, to do things like LM rescoring Kaldi typically first subtracts the "old" LM cost and then adds the "new" LM cost, which avoids touching the other parts of the cost.
[the acoustic cost] : the unscaled raw acoustic score. Scaling is applied before running any algorithm on the lattice, and the scores are unscaled again before being written out.

0       616     273     2402    10.9106,2539.22
0       667     345     3       4.09183,365.022
0       672     273     1268    13.7098,2659.86
0       726     273     758     12.6504,2686.83
0       780     273     3157    14.3474,2700.35
0       834     273     1992    16.2729,2677.13
1       888     281     3       10.8443,4811.55
... ...
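
As a minimal sketch, one arc line of the text form shown above can be split into its fields as follows. The names LatticeArc, parse_lattice_arc and scaled_cost are invented for this illustration, the acoustic scale of 0.1 is an arbitrary assumption, and final-state lines (which have fewer fields) are not handled.

```python
from typing import NamedTuple


class LatticeArc(NamedTuple):
    src: int
    dst: int
    transition_id: int   # input symbol
    word_id: int         # output symbol
    graph_cost: float
    acoustic_cost: float


def parse_lattice_arc(line: str) -> LatticeArc:
    """Parse one arc line: src dst ilabel olabel graph_cost,acoustic_cost."""
    src, dst, ilabel, olabel, weight = line.split()
    graph_cost, acoustic_cost = weight.split(",")
    return LatticeArc(int(src), int(dst), int(ilabel), int(olabel),
                      float(graph_cost), float(acoustic_cost))


def scaled_cost(arc: LatticeArc, acoustic_scale: float) -> float:
    """Combine the two costs after scaling the raw acoustic cost."""
    return arc.graph_cost + acoustic_scale * arc.acoustic_cost


arc = parse_lattice_arc("0       616     273     2402    10.9106,2539.22")
total = scaled_cost(arc, acoustic_scale=0.1)  # 0.1 is an assumed scale
```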

2) the CompactLattice type

This is also in FSA (acceptor) format, with the following arc format:

[start state id] [end state id] [input & output symbol] [weight]

The input/output symbols are usually the word ids and the weight is:

[the graph cost],[the acoustic cost],[a string sequence of integers]

The integers represent the transition ids, i.e. the frame-alignment for this word symbol.

0       20      2402    10.9106,2539.22,273_282_280_280_280_280_280_280_280_280_280_288_63704_63703_63712_63732_63731_63731_58806_58805_58805_58805_58805_58864_58863_58888_15286_15285_15285_15285_15285_15285_15285_15285_15285_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375
0       16      3       4.09183,365.022,345_354_352_352_352_352
0       9       1268    13.7098,2659.86,273_282_280_280_280_280_280_280_280_280_280_280_288_42724_42723_42736_42776_42775_3698_3697_3712_3711_3711_3711_3711_3711_3711_3762_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273
0       5       758     12.6504,2686.83,273_282_280_280_280_280_280_280_280_280_280_280_288_64680_64702_64708_64707_64707_3680_3679_3679_3724_3723_3723_3762_3761_3761_3761_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273
0       3       3157    14.3474,2700.35,273_282_280_280_280_280_280_280_280_280_280_280_288_3186_3185_3185_3185_3185_3185_3226_3225_3225_3225_3225_3246_3245_3245_3245_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273

... ...
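
A CompactLattice arc line can be split in the same spirit. The sketch below uses an invented helper name (parse_compact_arc) and assumes that each transition-id covers one frame, so the length of the sequence gives the number of frames attached to the arc; final-state lines are again not handled.

```python
def parse_compact_arc(line):
    """Parse one arc line: src dst word_id graph_cost,acoustic_cost,tid_tid_..."""
    src, dst, word_id, weight = line.split()
    graph_cost, acoustic_cost, tids = weight.split(",")
    # The third weight field may be empty when no transition-ids are attached.
    transition_ids = [int(t) for t in tids.split("_")] if tids else []
    return {
        "src": int(src),
        "dst": int(dst),
        "word_id": int(word_id),
        "graph_cost": float(graph_cost),
        "acoustic_cost": float(acoustic_cost),
        "transition_ids": transition_ids,
    }


compact_arc = parse_compact_arc(
    "0       16      3       4.09183,365.022,345_354_352_352_352_352")
num_frames = len(compact_arc["transition_ids"])  # one transition-id per frame
```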

The conversion between the two can be done using lattice-copy with the option --write-compact. Kaldi does not distinguish between the two types for I/O purposes; that is, the tools dealing with lattices can read and write either format. Typically, however, CompactLattice is the default output format.

Many algorithms on lattices (for instance, taking the best path, or pruning) are most efficient when run on the Lattice type rather than the CompactLattice type.

In general, the words in Kaldi lattices are not synchronized with the transition-ids, meaning that the transition-ids on an arc won't necessarily all belong to the word whose label is on that arc. As a result, the times you get from the lattice will be inexact, and the same is true of the weights. To obtain exact times, you should run the program lattice-align-words. It only works if the system is based on word-position-dependent phones, and it requires certain command-line options to tell it which phones are in which position in the word. An alternative program, lattice-align-words-lexicon, can be used if the system does not have word-position-dependent phones.

An HTK lattice has the exact time information. An example HTK lattice is:

lmscale=20.00 wdpenalty=-30.00
N=31 L=56
I=0 t=0.00
I=1 t=0.36
I=2 t=0.75
I=3 t=0.81
... etc
I=30 t=2.48
J=0 S=0 E=1 W=SILENCE v=0 a=-3239.01 l=0.00
J=1 S=1 E=2 W=FOUR v=0 a=-3820.77 l=0.00
... etc
J=55 S=29 E=30 W=SILENCE v=0 a=-246.99 l=-1.20

The lines before the first node definition form a header; in a complete HTK lattice file it records the names of the files used to generate the lattice along with the settings of the language model scale and penalty factors (only the scale/penalty line and the node/link counts are shown above). Each node in the lattice represents a point in time measured in seconds, and each arc represents a word spanning the segment of the input that starts at the time of its start node and ends at the time of its end node. For each such span, v gives the number of the pronunciation used, a gives the acoustic score and l gives the language model score.

The language model scores in output lattices do not include the scale factors and penalties. These are removed so that the lattice can be used as a constraint network for subsequent recognizer testing.
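
To illustrate how the node times give exact word boundaries, the sketch below reads a shortened copy of the example (only the lines needed for the first two arcs) and looks up the start and end time of each arc. The variable and function names are invented for this illustration, and the parsing assumes every field has the simple key=value form seen in the example.

```python
htk_text = """\
lmscale=20.00 wdpenalty=-30.00
N=31 L=56
I=0 t=0.00
I=1 t=0.36
I=2 t=0.75
J=0 S=0 E=1 W=SILENCE v=0 a=-3239.01 l=0.00
J=1 S=1 E=2 W=FOUR v=0 a=-3820.77 l=0.00
"""


def parse_fields(line):
    """Turn 'J=0 S=0 E=1 ...' into {'J': '0', 'S': '0', 'E': '1', ...}."""
    return dict(field.split("=", 1) for field in line.split())


node_times = {}   # node id -> time in seconds
arcs = []         # raw field dictionaries of the J= lines
for line in htk_text.splitlines():
    fields = parse_fields(line)
    if "I" in fields:
        node_times[int(fields["I"])] = float(fields["t"])
    elif "J" in fields:
        arcs.append(fields)

# Each word spans from the time of its start node to the time of its end node.
spans = [(a["W"], node_times[int(a["S"])], node_times[int(a["E"])]) for a in arcs]
for word, start, end in spans:
    print(word, start, end)
```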

Thursday, January 16, 2014


I have always believed that the best way to start a new exploration is to follow what has already been done, get familiar with it, and understand the choices that were made. I therefore started by reproducing the system included in Kaldi for the limitedLP Vietnamese task.

1). Lexicon preparation.

In the language specification, there are 25 consonants and 45 vowels (12 monophthongs, 25 diphthongs and 8 triphthongs) in total. The total number of phonemes is 54, which may be too many for a "limited" language. In Kaldi's setup, both the diphthongs and the triphthongs are mapped back to monophthongs with the following option:

phoneme_mapping="i@U=i @ U;oaI=o a I;oaI:=o a I:;u@I=u @ I;uI@= u I @;1@I=1 @ I;1@U=1 @ U;
  a:I=a: I; a:U=a: U; aU=a U; @U=@ U; aI=a I; @I=@ I; EU=E U; eU=e U; i@=i @; iU=i U; Oa:=O a: ; Oa=O a;
  OE=O E; OI=O I; oI=o I; @:I=@: I; u@=u @; 1@=1 @; ue=u e; uI=u I; 1I=1 I; u@:=u @:; 1U=1 U; ui:=u i:"

Through this processing, the number of distinct phonemes is reduced to the number of consonants and monophthongs.
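
The effect of the option can be sketched in Python. The string below is only a short excerpt of the mapping above, and parse_mapping and apply_mapping are invented helper names, not part of the Kaldi scripts.

```python
# A short excerpt of the phoneme_mapping option; the full string uses the same syntax.
phoneme_mapping = "i@U=i @ U;oaI=o a I; aI=a I; @U=@ U; u@=u @"


def parse_mapping(spec):
    """Turn 'X=a b;Y=c d' into {'X': ['a', 'b'], 'Y': ['c', 'd']}."""
    mapping = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        source, target = entry.split("=", 1)
        mapping[source.strip()] = target.split()
    return mapping


def apply_mapping(phones, mapping):
    """Replace every mapped phone by its monophthong sequence."""
    result = []
    for phone in phones:
        result.extend(mapping.get(phone, [phone]))
    return result


mapping = parse_mapping(phoneme_mapping)
print(apply_mapping(["w", "aI"], mapping))
```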

2). Tone information.

Although in theory tones apply only to vowels, in the Kaldi setup the tone is applied to all the phonemes of the corresponding syllable. One possible explanation is that applying tones to vowels may also affect the realization of the consonants due to co-articulation effects.

One original lexicon item: Amway   a: m _1 . w aI _1
The corresponding Kaldi item: Amway   a:_1 m_1         w_1 a_1 I_1

The period in the pronunciation indicates the syllable boundary. With this tonal information, the number of phonemes increases roughly sixfold, as there are 6 different tones in total.
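
The conversion of the Amway entry can be reproduced with a small sketch. It assumes that every syllable ends with its tone token (such as "_1") and that diphthongs are split with a mapping like the one in the lexicon preparation step; add_tones and diphthong_map are invented names for this illustration.

```python
def add_tones(pronunciation, mapping):
    """Attach the syllable tone to every phoneme of that syllable."""
    kaldi_phones = []
    for syllable in pronunciation.split("."):
        tokens = syllable.split()
        if not tokens:
            continue
        # Assumption: the last token of a syllable is its tone marker.
        *phones, tone = tokens
        for phone in phones:
            for part in mapping.get(phone, [phone]):
                kaldi_phones.append(part + tone)
    return " ".join(kaldi_phones)


diphthong_map = {"aI": ["a", "I"]}  # excerpt of the full mapping
kaldi_item = add_tones("a: m _1 . w aI _1", diphthong_map)
print(kaldi_item)
```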

3). Position dependent phonemes.

The phonemes used in Kaldi are further distinguished by their position in the word. Four position markers are used: (B)egin, (E)nd, (I)nternal and (S)ingleton. For this setup, even SIL is marked, giving the following variations: SIL SIL_B SIL_E SIL_I SIL_S.
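
A minimal sketch of how the markers could be attached to the phonemes of one word is given below; add_word_positions is an invented name, and the actual Kaldi lexicon scripts may differ in detail.

```python
def add_word_positions(phones):
    """Append _B, _I, _E or _S according to the position in the word."""
    if len(phones) == 1:
        return [phones[0] + "_S"]
    marked = []
    for index, phone in enumerate(phones):
        if index == 0:
            suffix = "_B"
        elif index == len(phones) - 1:
            suffix = "_E"
        else:
            suffix = "_I"
        marked.append(phone + suffix)
    return marked


print(add_word_positions(["a:_1", "m_1", "w_1", "a_1", "I_1"]))
print(add_word_positions(["SIL"]))
```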

4). Features.

In Kaldi's setup, PLP features are used together with pitch features and/or FFV (fundamental frequency variation) features.

Wednesday, January 15, 2014

Extract Chinese texts from Wiki

To build a local Mandarin speech corpus, we decided to use wiki texts as prompts for speakers to read. The following are the major steps involved in extracting Mandarin sentences from the Chinese wiki dump, which are learned from

1. Wiki Dump downloading:

The one we used is the 20131221 dump, zhwiki-20131221-pages-articles-multistream.xml.bz2 (although at the time of writing it is no longer the latest one).

2. Main body text extraction

The extraction uses the developed initially for Italian Wiki text extraction. A simple command will do:

bzcat zhwiki-20131221-pages-articles-multistream.xml.bz2 | python -b 1000M -o texts > output.log

The extracted file will be automatically split into files with size less than 1000M (option "-b 1000M") and will be saved in the folder "texts" (option "-o texts").

3. Traditional Chinese to Simplified Chinese

The conversion is done using the open source tool opencc :

opencc -i wiki_00 -o wiki_00_chs -c zht2zhs.ini

wiki_00 is the input file obtained from the previous step, wiki_00_chs is the converted output file, and zht2zhs.ini simply specifies which configuration file to use; there is no need to create it yourself.

So far, we have obtained the wiki texts in a single file, with each article in a "<doc..> ... </doc>" entry.

The following steps are specific to generating the sentences used for speech recording.

Assuming an average reading speed of 240 characters per minute and an average sentence length of 20 characters (15~25), one sentence leads to a speech recording of about 5 seconds. For a 150-hour speech corpus, we would hence need 108,000 sentences. Let's prepare 150,000 or 200,000 sentences.

The text normalization is relatively easy, as we can simply discard sentences with unwanted variations for the purpose of sentence selection. The steps we took are as follows:
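
The arithmetic behind these figures, written out as a short sketch (the variable names are illustrative only):

```python
chars_per_minute = 240     # assumed average reading speed
chars_per_sentence = 20    # assumed average sentence length
target_hours = 150         # size of the planned corpus

# Seconds of speech produced by one sentence.
seconds_per_sentence = 60 * chars_per_sentence / chars_per_minute

# Number of sentences needed to fill the target duration.
sentences_needed = target_hours * 3600 / seconds_per_sentence

print(seconds_per_sentence, sentences_needed)
```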

a) Remove the in-sentence comments of the format "(...)"
b) Replace full-width (Chinese-coded) alphanumeric characters with their ASCII equivalents, such as "ａ" to "a"
c) Convert numerical year representation to Chinese representation, such as "1976年" to "一九七六年"
d) Convert percentage numbers to Chinese character representation
e) Convert numbers to Chinese characters
f) Split each paragraph into sentences based on the boundary symbols: "。！；？!;?"
g) Sentences with more than 50 characters (twice the maximum length we want to use) are further split on commas: "，,"
h) Remove all the remaining punctuation
i) Remove spaces in the sentences
j) Final check of whether the sentence is made up of Chinese characters only (zhon.unicode.HAN_IDEOGRAPHS from the zhon package)
k) Save the sentence if it is not an empty string
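
The steps above can be sketched roughly as follows. This is not the original implementation: it skips the general number and percentage conversion of steps d) and e), so sentences that still contain digits are simply rejected by the final check, and it uses the basic CJK Unicode range as a stand-in for zhon.unicode.HAN_IDEOGRAPHS. All names are invented for this illustration.

```python
import re
import unicodedata

DIGITS = "零一二三四五六七八九"
MAX_LEN = 50
PAREN_COMMENT = re.compile(r"[(（][^()（）]*[)）]")
YEAR = re.compile(r"(\d{4})年")
SENTENCE_END = re.compile(r"[。！；？!;?]")
COMMAS = re.compile(r"[，,]")
# Assumption: the basic CJK block approximates the set of Han ideographs.
HAN_ONLY = re.compile(r"[\u4e00-\u9fff]+")
# Full-width digits and Latin letters mapped to their ASCII counterparts.
FULLWIDTH_ALNUM = {
    code: code - 0xFEE0
    for start, end in ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A))
    for code in range(start, end + 1)
}


def year_to_chinese(match):
    """Spell a four-digit year digit by digit, e.g. 1976年 -> 一九七六年."""
    return "".join(DIGITS[int(d)] for d in match.group(1)) + "年"


def extract_sentences(paragraph):
    text = PAREN_COMMENT.sub("", paragraph)       # a) drop "(...)" comments
    text = text.translate(FULLWIDTH_ALNUM)        # b) full-width -> ASCII
    text = YEAR.sub(year_to_chinese, text)        # c) years
    sentences = []
    for sentence in SENTENCE_END.split(text):     # f) sentence boundaries
        if len(sentence) > MAX_LEN:               # g) split long sentences
            pieces = COMMAS.split(sentence)
        else:
            pieces = [sentence]
        for piece in pieces:
            # h) remove remaining punctuation
            piece = "".join(ch for ch in piece
                            if not unicodedata.category(ch).startswith("P"))
            piece = "".join(piece.split())        # i) remove spaces
            if piece and HAN_ONLY.fullmatch(piece):   # j) and k)
                sentences.append(piece)
    return sentences


print(extract_sentences("他出生于１９７６年（丙辰年）。他喜欢Python！"))
```

In this example the second sentence is dropped because it still contains Latin letters after normalization.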

Detailed Python implementation could be found at

The last step is to use an existing LM to compute the perplexity of each sentence; the final selection is based on that score. The per-utterance perplexity can be computed with the ngram tool of the SRILM package.
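
As a rough illustration of the selection step, the sketch below turns a total log10 probability into a perplexity and keeps the sentence with the lowest value. The scores and token counts are invented, and how tokens are counted (for example, whether the end-of-sentence token is included) is an assumption that depends on the tool producing the scores.

```python
def perplexity(total_log10_prob, num_tokens):
    """Perplexity from a total log10 probability over num_tokens tokens."""
    return 10 ** (-total_log10_prob / num_tokens)


# Hypothetical (log10 probability, token count) pairs for two sentences.
scored = {"sentence_a": (-20.0, 10), "sentence_b": (-45.0, 10)}

ppl = {name: perplexity(logprob, count) for name, (logprob, count) in scored.items()}

# Keep the sentences the LM finds most predictable (lowest perplexity first).
selected = sorted(ppl, key=ppl.get)[:1]
print(ppl, selected)
```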

Tuesday, January 14, 2014


To start participating in this year's OpenKWS, a dry run of the submission was set up first, before having any system.

Detailed instructions could be found at

Following are the steps done:

1. Vietnamese data download, provided by Prof.
2. IndusDB - not available
3. SCTK installed
4. JobRunner extracted
5. F4DE: so much stuff; maybe KWSEval alone is enough, but just in case, install everything!
         sudo apt-get install gnuplot libxml2 sqlite3
         make perl_install
6. Account application - needs PI

Following are some notes from the doc to be kept in mind:

1. the KWS task is to find all of the occurrences of a keyword (a sequence of one or more words) in a corpus of un-segmented speech data.

2. the lexicon provided in the "build pack" for training contains entries for both the training and development test data. The lexical items that exist only in the development test data must be excluded during model training. 

3. keywords, a sequence of contiguous lexical items, will be specified in the language's UTF-8 encoded, native orthographic representation.

4. Homographs, words with the same written form but different meanings, will not be differentiated. Morphological variations of a keyword will not be considered positive variations.

5. transcript comparisons will be case insensitive

6. the silence gap between adjacent words in a keyword must be <= 0.5 second
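
The gap rule in note 6 can be expressed as a small check. This is only a sketch: is_valid_occurrence and the word timings are invented for this illustration and are not part of the evaluation tools.

```python
MAX_GAP_SECONDS = 0.5


def is_valid_occurrence(word_times):
    """word_times: (start, end) in seconds for each word of the keyword, in order."""
    return all(
        next_start - prev_end <= MAX_GAP_SECONDS
        for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:])
    )


close_words = [(1.00, 1.30), (1.45, 1.90)]   # gap of about 0.15 s
far_words = [(1.00, 1.30), (2.00, 2.40)]     # gap of about 0.70 s
print(is_valid_occurrence(close_words), is_valid_occurrence(far_words))
```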