5. Threading the language model
Let's look at the last step, threading (or adding back-off
indices) to the pruned language model.
$ make m3_thread
./slmthread ../swap/lm_sc.3gm ../data/lm_sc.t3g
The goal of threading is to accelerate the searching in language
model. Take trigram as an example, if we want to calculate the
probability of sentence "ABCDE。", P(S) =
P(A).P(B|A).P(C|AB).P(D|BC).P(E|CD).P(<EOS>|DE), and assume
no back-off needed when evaluating all these conditional
probabilities.
First, we get P(A) from level[1] (unigram) via a binary search; and
find B from A's children via a binary search, to get P(B|A); and
then find C from B's children via another binary search, to get
P(C|AB). The next step is to calculate P(D|BC). Then we need go
back to unigram to find B, and then get C from B's children, then
get P(D|BC), all these done by binary searches. Obviously, the
performance is low. When we evaluating P(C|AB), if we could
directly locate to (B, C) from node C, we could get the search much
faster.
We use (level, idx) to denote a node, the information on this node
is described as (W, h', p, b, c):
- W:word id. The location of this node implies its history
information.
- h':The back-off node, described as (level', idx').
- p:p(W|h)
- b:bow(h)
- c:The start index of child node in next level.
Certainly, there is no b and c on leaf node. Now, the back-off
structure becomes a graph from a tree. Its basic operation is: on
current history state hi(level,idx), by given an input
word W, it transfer to another history state hj(level',
idx'), and return P(W|h). The level' on hj, not always
equals to level-1. E.g., C(ABC)>0, but (B, C) is not seen in
training, then the h' on this node is the C node on level[1]. If C
is not seen either, h' is then the pseudo root (level[0][0]).
Let's look at the following snippet,
level+1
h (level,idx) ----------->
W1,p1,h'1,[b,c]
h',bow
\
... ...
\
Wi,pi,h'i,[b,c]
\
... ...
-------->
Wk,pk,h'k,[b,c]
Given an input word W. If W is one child of h, denoted as N
(level+1, idx'), its p value is just P(W|h).
If N has children, then N is the new history
node;
If not, the h' on N becomes the new history
node.
If W is not child of h, execute the same processing from h' on
(level, idx), find the new history node, and times the probability
with bow(h), return it as P(W|h). Note, it's a recursive
process.
Let's look at the code:
slmthread.cpp:
CSIMSlmWithIteration::getIdString(it, history):
The m_history member (std::vector<int>) holds the
indices for each level, save the word ID on this node to passed in
argument history. Method next() increase the last element in
m_history (i.e., the index in current level), and call
adjustIterator() to find the preceding nodes, save the indices to
m_history.
CSIMSlmWithIteration::findBackOffState(n, hw, bol,
bon):
Find the back-off node (h') of a specified n-gram (held
in hw[1..n]), and return its level and idx. Call findState() to get
the idx of hw[2..n] on level[n-1], if the index is not less than 0
then this node (n-1, idx) does have child node, return the location
of h'. Otherwise, call findState() to get the idx of hw[3..n] on
level[n-2], ... If the loop reach hw[n], return the pseudo
root.
E.g., find the back-off node for trigram (A, B, C). Find out if (B,
C) exists, if so, return (2, idx_BC). Otherwise, find out if (C)
exists, if so return (1, idx_C). Otherwise, return (0,
0).
CSIMSlmWithIteration::findState(n, hw):
Find the index of specified n-gram (held in hw[1..n])
on level[n]. If it does not exist, return -1.
We also perform the compression for 32-bits floating numbers
when threading. bow values are compressed to 14 bits, pr values are
compressed to 16 bits. The basic idea is to count all bow (or pr)
values, combine the clustering floating numbers, make the total
number is under 1<<BITS_BOW (or 1<<BITS_PR); in the
generated language model binary file, there are two floating
tables, we need to search the table to get the original value. I'm
not so clear about this algorithm, hopes the original author
Phill
Zhang to introduce more details.
Now, we get the final language model -- lm_sc.t3g. You could use
tslminfo to look at the data in plain text format:
$ make m3_tslminfo
./tslminfo -p -v -l ../raw/dict.utf8 ../data/lm_sc.t3g
>../swap/lm_sc.t3g.arpa
$ tail ../swap/lm_sc.t3g.arpa
点击 率 也 0.036363638937 (2,839922)
点击 率 最高 0.081818044186 (2,840054)
点击 率 的 0.096969485283 (2,840080)
点击 率 达到 0.036363638937 (2,840122)
点击 量 达 0.485714286566 (2,1132198)
点击 鼠标 , 0.400000005960 (1,1)
点击 鼠标 左 0.309090882540 (2,1186378)
率队 取得 <unknown> 0.479999989271 (2,366031)
率队 在 <unknown> 0.130769222975 (2,431213)
率队 打 了 0.479999989271 (2,642183)
We could use
CThreadSlm to access or use the language model. In next
section, we will use
slmseg as an example, to see how we construct the lattice and
search by leveraging the generated language model.