听往事如风
星期五 九月 05, 2008

是在Easy Morning上听到小娟的,配乐简单,歌声也很干净无雕琢,属于随心而发、都市民谣的类型。浅吟低唱,我最爱的是陈升。小娟也有翻唱陈升的歌,完全不同的感觉。推荐大家找来听听看。
这是他们的blog,http://blog.sina.com.cn/xiaojuan1234321music

是在Easy Morning上听到小娟的,配乐简单,歌声也很干净无雕琢,属于随心而发、都市民谣的类型。浅吟低唱,我最爱的是陈升。小娟也有翻唱陈升的歌,完全不同的感觉。推荐大家找来听听看。
这是他们的blog,http://blog.sina.com.cn/xiaojuan1234321music
Just enjoy it 
| pkg:/SUNWscim-anthy@1.2.4,5.11-0.96:20080825T192330Z | Info | Manifest |
| pkg:/SUNWscim-chewing@0.3.1,5.11-0.96:20080825T192330Z | Info | Manifest |
| pkg:/SUNWscim-hangul@0.3.2,5.11-0.96:20080825T192331Z | Info | Manifest |
| pkg:/SUNWscim-pinyin@0.5.91,5.11-0.96:20080825T192331Z | Info | Manifest |
| pkg:/SUNWscim-sunpinyin@1.0,5.11-0.96:20080825T192335Z | Info | Manifest |
| pkg:/SUNWscim-tables-chinese@0.5.7,5.11-0.96:20080825T192355Z | Info | Manifest |
| pkg:/SUNWscim-tables-extra@0.5.7,5.11-0.96:20080825T192401Z | Info | Manifest |
| pkg:/SUNWscim-tables-india@0.5.7,5.11-0.96:20080825T192401Z | Info | Manifest |
| pkg:/SUNWscim-tables-japanese@0.5.7,5.11-0.96:20080825T192401Z | Info | Manifest |
| pkg:/SUNWscim-tables-korean@0.5.7,5.11-0.96:20080825T192402Z | Info | Manifest |
| pkg:/SUNWscim-tables@0.5.7,5.11-0.96:20080825T192354Z | Info | Manifest |
| pkg:/SUNWscim-thai@0.1.0,5.11-0.96:20080825T192402Z | Info | Manifest |
| pkg:/SUNWscim@1.4.7,5.11-0.96:20080825T192322Z | Info | Manifest |
Just found that from opensolaris_b93, an SMF service for querying/updating gtk-im-modules is available, svc:/application/desktop-cache/input-method-cache:default. This service is online by default, and would be executed on every booting. However,
it's not available on Nevada yet.
Just got the existing news from Huang Peng, ibus platform is approaching to be finished. Huang added XIM server agent recently, and also implemented ibus-chewing, ibus-hangul. Really impressive and rapid progress!
Gnu-iconv supports more encoding conversions and provides better performances for some conversions over the Solaris-iconv. E.g., currently, Solaris-iconv does not support the conversions between GB18030 and UCS-2BE, UCS-4LE/BE, UTF-16LE/BE; and the conversions of GB18030<->UTF-8 (UCS-2LE) in Gnu-iconv is two times faster. And Gnu-iconv has a star-shaped structure with some exceptions, which uses UCS-4 as the intermediary encoding. While Solaris-iconv has a peer-2-peer structure (with alias), it's really painful to add a new encoding.
So, you may want to use Gnu-iconv library. For Solaris 10/Nevada, you could download&install gnu-libiconv from www.sunfreeware.com, for opensolaris, you could install SUNWgnu-libiconv from pkg.opensolaris.org, but the OS.o package does not contain the header files.
You may notice that, the function symboles in gnu-libiconv, had been added the prefix of "lib", e.g., iconv_open -> libiconv_open. So, LD_PRELOAD and RUNPATH are not sufficient for replacing iconv(3) routines in libc. You need to make sure to include the "iconv.h" from gnu-libiconv.
In many application systems, shared objects are loaded as sub-modules via dlopen(3C). In case it's written by C++, you need to be careful about throwing exceptions. The client may try unload this module (by dlclose(3C)) in exception handling block, so that the application may crash when C++ runtime routines try to clean the exception (by calling its destructor) before leaving current catching function. In practice, exception class is usually defined in header file, so that its code body maybe included both in app and sub-module. Even in this case, linker or loader may choose the copy in sub-module. (E.g., the -Bdirect flag of ld(1)).
The conclusion is, do not unload a shared object when catching an exception thrown from it, or not to throw exceptions from a shared object. This is the root cause that SCIM x11 frontend would crash for the 2nd launching.
In previous step, we generated the threaded language model. The next question is how to use this model? Let's look at the header file of CThreadSlm class, slm.h, here are the public methods:
Now, let's look at slmseg to see how to searching by leveraging CThreadSlm class.
$ make slmids3
./slmseg -d ../raw/dict.utf8 -f bin -s 10 -m ../data/lm_sc.t3g ../raw/corpus.utf8 >../swap/lm_sc.ids
The source code of slmseg/slmseg.cpp:
struct TLatticeWord {
int m_left;
int m_right; //[m_left, m_right) is the index range on lattice for this word
int m_wordId; //word id
};
struct TLatticeStateValue{
double m_pr; //the probability on this state node
TLatticeWord* mp_btword; //the back-trace word, i.e., the given word causing the transformation
CThreadSlm::TState m_btstate; //the back-trace state, i.e., where the current state comes from
};
/* The mapping of SLM state node, and its Lattice state information */
typedef std::map<CThreadSlm::TState, TLatticeStateValue> TLatticeColumnStates;
/* represent a column of lattice */
struct TLatticeColumn {
TLatticeWordVec m_wordstarting; //The words starting from this column, these are actually the input words when expending this column,
//the newly transferred state node, located at lattice[word.m_right].
TLatticeColumnStates m_states; //the SLM state nodes and their corresponding Lattice states
};
processSingleFile():
Read the corpus sentence by sentence. For each sentence, call buildLattice() to build the searching lattice, and call searchBest() to search on the lattice, and get the best segmentation by getBestPath(), and call output() to write out the best segmentation.
buildLattice(sentence, lattice):
Set the SLM state on lattice[0] as pseudo root. Initialize i as 0, perform FMM for sntnc[i..n], get the word length -- len. And call getAmbiLen() to get the length of maximum length of crossing-ambiguities -- ambilen. If ambilen <= len, then there is no crossing-ambiguities, then call insertLatticeWord() to insert ([i, i+len), Wid) to the word list of lattice[i]; set i=i+len, continue the loop. If there is crossing-ambiguities, call fullSegbuildLattice() to perform a full segmentation for sub-sentence sntnc[i..i+ambilen), for all segmented words (including all single-character word), assuming one's location is [left, right), insert it to lattice[left]; and set i=i+ambilen, continue the loop. After the iteration is done, add an end-of-sentence id (we use "。" for this id in following examples) at the end.
Use the example in section 2,
Perform the FMM for this sample sentence, the first word we get is "发扬", and no ambiguity, so insert "发扬" to lattice[0]. Then i=2, continue the loop. The FMM result is "为人民", but the ambilen is 6, so we need perform full segmentation for sntnc[2..7]. Insert all possible segments on idx (2<=idx<7) to latttice[idx]. Then i=2+6=8, continue the loop. The FMM result is "的", and no ambiguity. Then i=9, the FMM result is "精神" and no ambiguity. After the loop is finished, add an end-of-sentence ID at the tail. We could see, there is no word on lattice[1], and the words on lattice[2] are "为", "为人", and "为人民". And so on...
searchBest(lattice):
Iterate the lattice from 0. For all lattice state nodes on lattcie[i], use the words starting from this lattice ([left, right), Wid) to expand, save the newly transferred state node h' (CThreadSlm::historify(his)) to lattice[word.m_right].m_states. Note, given a word D, two different state nodes (A, C) and (B, C) may result in the same h' (i.e., (C, D)), so we only need to keep the path who has bigger probability.

Let's start from lattice[0], we have set it as pseudo root (0, 0) in buildLatice() method. There is only one word on lattice[0], "发扬"; call CThreadSlm::trnasferNegLog((0,0), wid_of("发扬")), get the new SLM state node and the probability P("发扬"), save its h' (lvl=1, idx) and its lattice state information (probability and back-trace information) to lattice[2]. Then look at lattice[1], there is no state node to be expanded. Then going forward to lattice[2], there is one state node, and three words ("为", "为人", "为人民"), save the newly transferred three nodes to lattice[3], lattice[4] and lattice[5]. And so on...
This is a dynamic programming problem, and what we used here is Viterbi Lattice algorithm.
getBestPath(lattice, segResult):
From end to start, back-trace the lattice, save the word id on best path to segResult. By reversing segResult, we got the best segmentation. For our example, it's the blue path on above figure.
总体来说,北京奥运会的开幕式很完美,一开始有些拙劣的山水画作,在中途由孩子们添上了色彩和活力(特别是太阳变笑脸的细节),再最后由各国运动员的脚踩踏出绚烂的彩虹,的确是很好的创意。主题曲节奏悠缓,但是给人的印象不深,未能使人过耳不忘。中国代表团入场的时候,我一直期待姚明会将四川的那个小朋友抱起来放到肩膀上,不过未能看到。另外关于主火炬点燃,我本来猜想会是由海峡两岸三地的中国运动员共同点燃,那样也许更有意义且震撼人心。李宁空中跑动的姿态很潇洒,只是最后点燃的当口不知什么原因停顿了一小段时间。
中国,加油!北京,加油!!
Let's look at the last step, threading (or adding back-off
indices) to the pruned language model.
$ make m3_thread
./slmthread ../swap/lm_sc.3gm ../data/lm_sc.t3g
The goal of threading is to accelerate the searching in language
model. Take trigram as an example, if we want to calculate the
probability of sentence "ABCDE。", P(S) =
P(A).P(B|A).P(C|AB).P(D|BC).P(E|CD).P(<EOS>|DE), and assume
no back-off needed when evaluating all these conditional
probabilities.
First, we get P(A) from level[1] (unigram) via a binary search; and
find B from A's children via a binary search, to get P(B|A); and
then find C from B's children via another binary search, to get
P(C|AB). The next step is to calculate P(D|BC). Then we need go
back to unigram to find B, and then get C from B's children, then
get P(D|BC), all these done by binary searches. Obviously, the
performance is low. When we evaluating P(C|AB), if we could
directly locate to (B, C) from node C, we could get the search much
faster.
We use (level, idx) to denote a node, the information on this node
is described as (W, h', p, b, c):
Certainly, there is no b and c on leaf node. Now, the back-off
structure becomes a graph from a tree. Its basic operation is: on
current history state hi(level,idx), by given an input
word W, it transfer to another history state hj(level',
idx'), and return P(W|h). The level' on hj, not always
equals to level-1. E.g., C(ABC)>0, but (B, C) is not seen in
training, then the h' on this node is the C node on level[1]. If C
is not seen either, h' is then the pseudo root (level[0][0]).
Let's look at the following snippet,
level+1
h (level,idx) ----------->
W1,p1,h'1,[b,c]
h',bow
\
... ...
\
Wi,pi,h'i,[b,c]
\
... ...
-------->
Wk,pk,h'k,[b,c]
Given an input word W. If W is one child of h, denoted as N
(level+1, idx'), its p value is just P(W|h).
If N has children, then N is the new history
node;
If not, the h' on N becomes the new history
node.
If W is not child of h, execute the same processing from h' on
(level, idx), find the new history node, and times the probability
with bow(h), return it as P(W|h). Note, it's a recursive
process.
Let's look at the code:
slmthread.cpp:
CSIMSlmWithIteration::getIdString(it, history):
The m_history member (std::vector<int>) holds the indices for each level, save the word ID on this node to passed in argument history. Method next() increase the last element in m_history (i.e., the index in current level), and call adjustIterator() to find the preceding nodes, save the indices to m_history.
CSIMSlmWithIteration::findBackOffState(n, hw, bol,
bon):
Find the back-off node (h') of a specified n-gram (held in hw[1..n]), and return its level and idx. Call findState() to get the idx of hw[2..n] on level[n-1], if the index is not less than 0 then this node (n-1, idx) does have child node, return the location of h'. Otherwise, call findState() to get the idx of hw[3..n] on level[n-2], ... If the loop reach hw[n], return the pseudo root.
E.g., find the back-off node for trigram (A, B, C). Find out if (B, C) exists, if so, return (2, idx_BC). Otherwise, find out if (C) exists, if so return (1, idx_C). Otherwise, return (0, 0).
CSIMSlmWithIteration::findState(n, hw):
Find the index of specified n-gram (held in hw[1..n]) on level[n]. If it does not exist, return -1.
We also perform the compression for 32-bits floating numbers
when threading. bow values are compressed to 14 bits, pr values are
compressed to 16 bits. The basic idea is to count all bow (or pr)
values, combine the clustering floating numbers, make the total
number is under 1<<BITS_BOW (or 1<<BITS_PR); in the
generated language model binary file, there are two floating
tables, we need to search the table to get the original value. I'm
not so clear about this algorithm, hopes the original author
Phill
Zhang to introduce more details.
Now, we get the final language model -- lm_sc.t3g. You could use
tslminfo to look at the data in plain text format:
$ make m3_tslminfo
./tslminfo -p -v -l ../raw/dict.utf8 ../data/lm_sc.t3g
>../swap/lm_sc.t3g.arpa
$ tail ../swap/lm_sc.t3g.arpa
点击 率 也 0.036363638937 (2,839922)
点击 率 最高 0.081818044186 (2,840054)
点击 率 的 0.096969485283 (2,840080)
点击 率 达到 0.036363638937 (2,840122)
点击 量 达 0.485714286566 (2,1132198)
点击 鼠标 , 0.400000005960 (1,1)
点击 鼠标 左 0.309090882540 (2,1186378)
率队 取得 <unknown> 0.479999989271 (2,366031)
率队 在 <unknown> 0.130769222975 (2,431213)
率队 打 了 0.479999989271 (2,642183)
We could use
CThreadSlm to access or use the language model. In next
section, we will use
slmseg as an example, to see how we construct the lattice and
search by leveraging the generated language model.