A very detailed explanation of the WFST-based lattice indexing and searing for STD can be found in Lattice indexing for spoken term detection. It is one of the best papers I have read, both informative and well written. The Kaldi KWS tools are based on this paper and some implementation details could be found at http://kaldi.sourceforge.net/kws.html.
First, just to clear the confusion between the decoding beam and lattice beam for my self:
1) decoding beam is the beam width for search the best path through the complete search graph;
2) lattice beam is the beam width for pruning the whole lattice based on the obtained best path.
Intuitively, we need to find the best path first and then do another round of lattice pruning using the lattice beam. Practically, the pruning is done periodically rather than waiting for the end of the file ( Kaldi does it every 25 frames). The approach used in Kaldi is equivalent to link all currently active states to a "dummy" final state and then pruning in the normal way. [Generating exact lattices in the WFST framework].
Hence for the STD task, we can use a larger decoding beam to ensure the search for the best hypotheses and a smaller lattice beam to limit the lattice used for STD indexing and searching.
For STD, we need to construct an exact inverted index for ASR lattices with time information. All substrings seen in the lattices along with their time information must be indexed.
A factor transducer is an inverted index of the set of substrings (which is referred to as factors) of the set of strings comprising a document. A timed factor transducer stores the time information on the arc weights.
The whole procedure is as follows: we start with preprocessing each input automaton (ASR lattice) to obtain a posterior lattice in which non-overlapping arc clusters are separately labelled. Then from each processed input automaton we construct an intermediate factor transducer which recognizes exactly the set of factor of the input. We convert these intermediate structures into deterministic transducers by augmenting each factor with a disambiguation symbol and then applying weighted automaton optimization. Finally, we take the union of these deterministic transducers and further optimize the result to obtain a deterministic inverted index of the entire dataset.
The path weights of the ASR lattices correspond to joint probabilities assigned by the language and acoustic models. A general weight-pushing algorithm is applied to convert these weights to desired negated log posterior probabilities given the lattice.
For STD, we would like to keep separate index entries for non-overlapping occurrences in an utterance since we no longer search for the utterances but the exact time intervals containing the query term. A clustering algorithm is applied to achieve this separation, which assigns the arcs with the same input label and overlapping time-spans a unique class id. The clustering is done as follows:
1) sort the collected (start time, end time) pairs with respect to end times;
2) identify the largest set of non-overlapping (start time, end time) pairs and assign them as cluster heads;
3) classify the rest of the arcs according to maximal overlap.
After clustering, each arc carries a cluster identifier on the output label.
Next, all the factors are indexed in the following way:
1) map each arc to (negated log probability, start time, end time)
2) create a unique new initial state
3) create a unique new final state
4) for each state, $q$ in the automaton, create two new arcs: a) an initial arc from the new initial state to to $q$; b) a final arc from $q$ to the new final state.
Then the paths carrying the same factor-pair are merged and weighted $\epsilon$-removal, determinization and minimization are applied. Followed by the removal of cluster identifiers on the non-final arcs and insertion of disambiguation symbols into the final arcs. The resulting factor transducer is further optimized by applying weighted determinization and minimization.
The timed factor transducer of the entire dataset is constructed by:
1) taking the union of each TFT;
2) encoding the input-output labels as a single label and applying weighted $\epsilon$-removal, determinization, and minimization;
3) remove the disambiguation symbols on the final transitions.
With this generated TFT index, we can utilize factor selection filters in WFST form to restrict, transform or re-weight the index entries, for example:
A) a pronunciation dictionary which maps words to phone sequences can be applied to word lattices to obtain phonetic lattices.
B) a simple grammar which restricts the factors can be used. We can use such a grammar to reject the silence symbol, i.e., factors including the silence symbol are removed from the index.
Finally the search over the TFT index is the construction of the response automaton given a query automaton X:
1) composing X with the index TFT on the input side and projecting the resulting transducer onto its output labels;
2) removing the $\epsilon$ transitions and finally sorting with the shortest path algorithm.
Each successful path in the resulting response transducer is a single arc which carries an automaton identifier on its label and a (-log posterior probability, start time, end time) triplet on its weight. A simple traverse in the arc order gives sorted results.
In the paper, the authors find a lattice beam width of 4 is ideal for their STD experiments.