Thursday, January 16, 2014

Learning

I have always believed that the best way to start a new exploration is to follow what has already been done, get familiar with it, and understand the choices that were made. I therefore started by reproducing the system included in Kaldi for the limitedLP Vietnamese task.

1). Lexicon preparation.

The language specification lists 25 consonants and 45 vowels (12 monophthongs, 25 diphthongs and 8 triphthongs). The total number of phonemes is 54, which may be too many for a "limited" language. In Kaldi's setup, both the diphthongs and the triphthongs are split back into their component monophthongs with the following option:

phoneme_mapping="i@U=i @ U;oaI=o a I;oaI:=o a I:;u@I=u @ I;uI@= u I @;1@I=1 @ I;1@U=1 @ U;
  a:I=a: I; a:U=a: U; aU=a U; @U=@ U; aI=a I; @I=@ I; EU=E U; eU=e U; i@=i @; iU=i U; Oa:=O a: ; Oa=O a;
  OE=O E; OI=O I; oI=o I; @:I=@: I; u@=u @; 1@=1 @; ue=u e; uI=u I; 1I=1 I; u@:=u @:; 1U=1 U; ui:=u i:"

Through this processing, the number of distinct phonemes is reduced to the number of consonants plus monophthongs.
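
To make the mapping concrete, here is a minimal Python sketch of how such an option string could be parsed and applied to a pronunciation. It is not Kaldi's own code: the function names `parse_mapping` and `apply_mapping` are made up for illustration, and only a handful of entries from the option above are reproduced.

```python
# Illustrative sketch only; parse_mapping and apply_mapping are hypothetical helpers,
# not functions from the Kaldi recipe.
# A few entries copied from the phoneme_mapping option above.
PHONEME_MAPPING = "i@U=i @ U;oaI=o a I;aI=a I;a:U=a: U;u@=u @"


def parse_mapping(spec):
    """Turn 'X=a b;Y=c d' into {'X': ['a', 'b'], 'Y': ['c', 'd']}."""
    mapping = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';' or blank entries
        source, target = entry.split("=", 1)
        mapping[source.strip()] = target.split()
    return mapping


def apply_mapping(phones, mapping):
    """Replace each mapped phoneme by its components; keep the others unchanged."""
    out = []
    for phone in phones:
        out.extend(mapping.get(phone, [phone]))
    return out


mapping = parse_mapping(PHONEME_MAPPING)
print(apply_mapping(["w", "aI"], mapping))  # ['w', 'a', 'I']
```

After the mapping, no diphthong or triphthong symbol remains in the pronunciation, so the phoneme inventory only has to cover consonants and monophthongs.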

2). Tone information.

Although in theory tones apply only to vowels, the Kaldi setup attaches the tone to every phoneme of the corresponding syllable. One possible explanation is that a tone on the vowel may also affect the realization of the neighbouring consonants through co-articulation.

One original lexicon item: Amway   a: m _1 . w aI _1
The corresponding Kaldi item: Amway   a:_1 m_1         w_1 a_1 I_1

The period in the pronunciation indicates the syllable boundary. With this tonal information, the number of phonemes increases roughly sixfold, since there are 6 different tones.
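
The Amway example above can be reproduced with a short Python sketch. This is my own reading of the transformation, not code taken from Kaldi: the function name `spread_tone` is hypothetical, and I assume that a tone marker is any token starting with "_" and that each syllable carries at most one.

```python
# Illustrative sketch only; spread_tone is a hypothetical helper, not part of Kaldi.
def spread_tone(pron, mapping=None):
    """Attach each syllable's tone marker to every phoneme of that syllable."""
    mapping = mapping or {}

    # Split the token stream into syllables at the '.' boundary marker.
    syllables, current = [], []
    for token in pron.split():
        if token == ".":
            syllables.append(current)
            current = []
        else:
            current.append(token)
    syllables.append(current)

    out = []
    for syllable in syllables:
        # Assumption: tone markers are the tokens starting with '_'.
        tones = [t for t in syllable if t.startswith("_")]
        tone = tones[0] if tones else ""
        for phone in syllable:
            if phone.startswith("_"):
                continue
            # Split diphthongs/triphthongs first, then tag every resulting phoneme.
            for part in mapping.get(phone, [phone]):
                out.append(part + tone)
    return out


print(" ".join(spread_tone("a: m _1 . w aI _1", {"aI": ["a", "I"]})))
# a:_1 m_1 w_1 a_1 I_1
```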

3). Position dependent phonemes.

The phonemes used in Kaldi are further distinguished by their position in the word. Four position markers are used: (B)egin, (E)nd, (I)nternal and (S)ingleton. In this setup, even SIL gets the following variants: SIL SIL_B SIL_E SIL_I SIL_S.
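
As a rough illustration of the position marking, the Python sketch below tags the phonemes of a single word. The function name `add_position_markers` is made up, and the exact symbol format (the marker appended after the tone suffix) is my assumption, so the names in the real phone set may differ.

```python
# Illustrative sketch only; add_position_markers is a hypothetical helper and the
# symbol format (marker appended after the tone suffix) is an assumption.
def add_position_markers(phones):
    """Tag each phoneme of one word with _B, _I, _E, or _S for a one-phoneme word."""
    if len(phones) == 1:
        return [phones[0] + "_S"]
    marked = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        if i == 0:
            suffix = "_B"  # word-initial
        elif i == last:
            suffix = "_E"  # word-final
        else:
            suffix = "_I"  # word-internal
        marked.append(phone + suffix)
    return marked


print(add_position_markers(["a:_1", "m_1", "w_1", "a_1", "I_1"]))
# ['a:_1_B', 'm_1_I', 'w_1_I', 'a_1_I', 'I_1_E']
```

Since every phoneme can appear in up to four positions, this step multiplies the size of the symbol set once more, on top of the tonal expansion.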

4). Features.

In Kaldi's setup, PLP features are used together with pitch features and/or FFV (fundamental frequency variation) features.