Wednesday, January 15, 2014

Extract Chinese texts from Wiki

To build a local Mandarin speech corpus, we decided to use wiki texts as prompts for speakers to read. The following are the major steps involved in extracting Mandarin sentences from the Chinese wiki dump, which were learned from

1. Wiki Dump downloading:

The one we used is zhwiki-20131221-pages-articles-multistream.xml.bz2 (although at the time of writing it is no longer the latest one).

2. Main body text extraction

The extraction uses a tool developed initially for Italian Wiki text extraction. A simple command will do:

bzcat zhwiki-20131221-pages-articles-multistream.xml.bz2 | python -b 1000M -o texts > output.log

The extracted text is automatically split into files smaller than 1000M (option "-b 1000M"), which are saved in the folder "texts" (option "-o texts").

3. Traditional Chinese to Simplified Chinese

The conversion is done using the open source tool opencc:

opencc -i wiki_00 -o wiki_00_chs -c zht2zhs.ini

Here wiki_00 is the input file obtained from the previous step, wiki_00_chs is the converted output file, and zht2zhs.ini specifies which configuration file to use. There is no need to create it on your own.
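Since the extraction step produces several files, the same opencc command has to be repeated for each of them. A minimal sketch of such a loop, assuming the extracted files are named wiki_NN somewhere under the "texts" folder and that opencc is on the PATH (the helper names opencc_command and convert_all are made up for this illustration):

```python
import subprocess
from pathlib import Path


def opencc_command(src):
    """Build the opencc call from the post for one input file."""
    src = Path(src)
    # Same naming as in the post: wiki_00 -> wiki_00_chs
    dst = src.with_name(src.name + "_chs")
    return ["opencc", "-i", str(src), "-o", str(dst), "-c", "zht2zhs.ini"]


def convert_all(folder="texts"):
    """Run opencc on every wiki_* file below folder (assumed layout)."""
    for src in sorted(Path(folder).rglob("wiki_*")):
        # Skip outputs of an earlier run so they are not converted twice.
        if src.is_file() and not src.name.endswith("_chs"):
            subprocess.run(opencc_command(src), check=True)


print(opencc_command("texts/wiki_00"))
```

Calling convert_all() would then convert every extracted file in one go; check=True makes the loop stop at the first file opencc fails on.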

Up to this point, we have obtained the wiki texts in a single file, with each article in a "<doc..> ... </doc>" entry.
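To process the articles one at a time, the "<doc..> ... </doc>" entries can be read in a streaming fashion. The sketch below assumes that the opening and closing tags each sit on a line of their own; the attributes in the sample (id, title) are made up for illustration and are not used by the code.

```python
def iter_articles(lines):
    """Yield the body text of each <doc ...> ... </doc> entry."""
    body = None  # None means we are currently outside any <doc> entry
    for line in lines:
        if line.startswith("<doc"):
            body = []
        elif line.startswith("</doc>"):
            if body is not None:
                yield "".join(body).strip()
            body = None
        elif body is not None:
            body.append(line)


# Made-up sample in the assumed layout
sample = (
    '<doc id="1" title="A">\n第一篇。\n</doc>\n'
    '<doc id="2" title="B">\n第二篇。\n</doc>\n'
)
articles = list(iter_articles(sample.splitlines(keepends=True)))
print(articles)
```

Because the function only needs an iterable of lines, an open file object can be passed directly, so a file of close to 1000M never has to be held in memory as a whole.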

The following steps are specific to generating the sentences used for speech recording.

Assuming an average reading speed of 240 characters per minute and an average sentence length of 20 characters (15~25), one sentence will lead to a speech recording of about 5 seconds. For a 150-hour speech corpus, we would hence need 108,000 sentences. Let's prepare 150,000 or 200,000 sentences.
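The estimate can be reproduced with a few lines of Python, which also shows how the required count moves across the 15~25 character range. The variable and function names are chosen for this illustration only.

```python
CHARS_PER_MINUTE = 240  # assumed average reading speed
CORPUS_HOURS = 150      # target corpus size


def sentences_needed(chars_per_sentence):
    """Number of sentences needed to fill the target corpus."""
    seconds_per_sentence = 60 * chars_per_sentence / CHARS_PER_MINUTE
    return CORPUS_HOURS * 3600 / seconds_per_sentence


for length in (15, 20, 25):
    print(length, sentences_needed(length))
```

With 15-character sentences the requirement rises to 144,000, which is one reason to prepare noticeably more than 108,000 sentences.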

The text normalization is relatively easy, as we can simply discard sentences with unwanted variations for the purpose of sentence selection. The steps we have done are as follows:

a) Remove the in-sentence comments of the format "(...)"
b) Replace full-width (Chinese-coded) alphanumeric characters with their plain ASCII counterparts, such as "ａ" to "a"
c) Convert numerical year representation to Chinese representation, such as "1976年" to "一九七六年"
d) Convert percentage numbers to Chinese character representation
e) Convert numbers to Chinese characters
f) Split each paragraph into sentences based on the boundary symbols: "。！；？!;?"
g) Sentences with more than 50 characters (twice the maximum length we want to use) are further split at commas: "，,"
h) Remove all the remaining punctuation
i) Remove spaces in the sentences
j) Final check of whether the sentence is made up of Chinese characters only (zhon.unicode.HAN_IDEOGRAPHS from
k) Save the sentence if it is not an empty string
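The steps above can be sketched in a few functions. This is an illustrative reconstruction, not the implementation referred to in this post, and it makes several assumptions: steps d) and e) are left out, so sentences that still contain digits are simply discarded by the final check; the digit zero in years is written as "零"; and the Han check uses only the basic CJK Unified Ideographs range as a stand-in for zhon.unicode.HAN_IDEOGRAPHS. Full-width punctuation is written with \u escapes, with the character named in the comment.

```python
import re

# a) in-sentence comments in ( ) or full-width parentheses (U+FF08, U+FF09)
COMMENT_RE = re.compile(r"\([^()]*\)|\uff08[^\uff08\uff09]*\uff09")
# c) four ASCII digits directly before "年" are treated as a year
YEAR_RE = re.compile(r"([0-9]{4})年")
DIGITS = "零一二三四五六七八九"
# f) boundaries: ideographic full stop, full-width ! ; ? and ASCII ! ; ?
BOUNDARY_RE = re.compile(r"[\u3002\uff01\uff1b\uff1f!;?]")
# g) full-width comma (U+FF0C) and ASCII comma
COMMA_RE = re.compile(r"[\uff0c,]")
# j) stand-in for zhon.unicode.HAN_IDEOGRAPHS (basic CJK block only)
HAN_ONLY_RE = re.compile(r"[\u4e00-\u9fff]+")
MAX_LEN = 50


def to_ascii(text):
    """b) Map full-width letters and digits to their ASCII counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        if (0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            ch = chr(code - 0xFEE0)
        out.append(ch)
    return "".join(out)


def year_to_chinese(text):
    """c) Spell year digits one by one, e.g. 1976年 -> 一九七六年."""
    def spell(match):
        return "".join(DIGITS[int(d)] for d in match.group(1)) + "年"
    return YEAR_RE.sub(spell, text)


def extract_sentences(paragraph):
    """Apply steps a) to c) and f) to k) to one paragraph."""
    text = year_to_chinese(to_ascii(COMMENT_RE.sub("", paragraph)))
    sentences = []
    for piece in BOUNDARY_RE.split(text):
        parts = COMMA_RE.split(piece) if len(piece) > MAX_LEN else [piece]
        for part in parts:
            # h) and i): drop remaining punctuation, symbols and whitespace
            part = re.sub(r"[\W_]", "", part)
            # j) and k): keep only non-empty, Han-only sentences
            if HAN_ONLY_RE.fullmatch(part):
                sentences.append(part)
    return sentences


# Made-up paragraph with full-width digits, a comment and an English sentence
paragraph = ("他生于\uff11\uff19\uff17\uff16年\uff08丙辰年\uff09\uff0c"
             "是一位作家\u3002He was born! 天气 很好\uff1f")
print(extract_sentences(paragraph))
```

Note that a short sentence keeps its clauses together once the commas are removed in step h), while the English sentence is dropped by the final check.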

Detailed Python implementation could be found at

The last step is to use an existing LM to compute the perplexity of each sentence; the final selection is based on that score. The per-utterance perplexity can be computed with the ngram tool of the srilm package.
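Once every sentence has a perplexity score, the selection itself is a small ranking step. The sketch below assumes the scores have already been collected into (sentence, perplexity) pairs and that sentences with lower perplexity are preferred; neither the exact criterion nor the function name select_sentences comes from this post, and the sample scores are made up.

```python
import heapq


def select_sentences(scored, n):
    """Return the n sentences with the lowest perplexity, best first."""
    best = heapq.nsmallest(n, ((ppl, sentence) for sentence, ppl in scored))
    return [sentence for _, sentence in best]


# Made-up scores for illustration
scored = [("甲", 120.5), ("乙", 35.2), ("丙", 480.0), ("丁", 60.1)]
print(select_sentences(scored, 2))
```

Using heapq.nsmallest avoids sorting the full list when only 150,000 or 200,000 sentences are needed out of a much larger pool.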