素心如何天上月 (Yong Sun's Blog)

听往事如风

星期五 九月 05, 2008

小娟 - 如风往事

是在Easy Morning上听到小娟的,配乐简单,歌声也很干净无雕琢,属于随心而发、都市民谣的类型。浅吟低唱,我最爱的是陈升。小娟也有翻唱陈升的歌,完全不同的感觉。推荐大家找来听听看。

这是他们的blog,http://blog.sina.com.cn/xiaojuan1234321music

[0] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

scim packages are available on pkg.opensolaris.org

星期五 九月 05, 2008

Just enjoy it :)

pkg:/SUNWscim-anthy@1.2.4,5.11-0.96:20080825T192330Z Info Manifest
pkg:/SUNWscim-chewing@0.3.1,5.11-0.96:20080825T192330Z Info Manifest
pkg:/SUNWscim-hangul@0.3.2,5.11-0.96:20080825T192331Z Info Manifest
pkg:/SUNWscim-pinyin@0.5.91,5.11-0.96:20080825T192331Z Info Manifest
pkg:/SUNWscim-sunpinyin@1.0,5.11-0.96:20080825T192335Z Info Manifest
pkg:/SUNWscim-tables-chinese@0.5.7,5.11-0.96:20080825T192355Z Info Manifest
pkg:/SUNWscim-tables-extra@0.5.7,5.11-0.96:20080825T192401Z Info Manifest
pkg:/SUNWscim-tables-india@0.5.7,5.11-0.96:20080825T192401Z Info Manifest
pkg:/SUNWscim-tables-japanese@0.5.7,5.11-0.96:20080825T192401Z Info Manifest
pkg:/SUNWscim-tables-korean@0.5.7,5.11-0.96:20080825T192402Z Info Manifest
pkg:/SUNWscim-tables@0.5.7,5.11-0.96:20080825T192354Z Info Manifest
pkg:/SUNWscim-thai@0.1.0,5.11-0.96:20080825T192402Z Info Manifest
pkg:/SUNWscim@1.4.7,5.11-0.96:20080825T192322Z Info Manifest

Our next plan is to integrate libm17n and scim-m17n. Stay tuned.

[1] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

An SMF service for querying/updating gtk-im-modules

星期三 八月 27, 2008

Just found that from opensolaris_b93, an SMF service for querying/updating gtk-im-modules is available, svc:/application/desktop-cache/input-method-cache:default. This service is online by default, and would be executed on every booting. However, it's not available on Nevada yet.


刚刚发现,从opensolaris_b93开始,已经含有一个用来查询/更新gtk输入法模块的SMF服务了,svc:/application/desktop-cache/input-method-cache:default。这个服务缺省就是打开的,并且在每次启动时都会被执行。不过,在Nevada中还没有集成这个服务。

[3] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

The latest status of ibus

星期一 八月 25, 2008

Just got the existing news from Huang Peng, ibus platform is approaching to be finished. Huang added XIM server agent recently, and also implemented ibus-chewing, ibus-hangul. Really impressive and rapid progress!


刚刚从Huang Peng那里听到一个好消息,ibus已经接近完成了。Huang Peng最近刚刚为ibus添加了XIM服务器的agent模块,并实现了ibus-chewing和ibus-hangul。真是进展神速啊!

[5] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

小小一周岁了!

星期四 八月 21, 2008

时间过的飞快,小小转眼已经一周岁了...

[4] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

Using Gnu-iconv on Solaris and OpenSolaris

星期一 八月 18, 2008

Gnu-iconv supports more encoding conversions and provides better performances for some conversions over the Solaris-iconv. E.g., currently, Solaris-iconv does not support the conversions between GB18030 and UCS-2BE, UCS-4LE/BE, UTF-16LE/BE; and the conversions of GB18030<->UTF-8 (UCS-2LE) in Gnu-iconv is two times faster. And Gnu-iconv has a star-shaped structure with some exceptions, which uses UCS-4 as the intermediary encoding. While Solaris-iconv has a peer-2-peer structure (with alias), it's really painful to add a new encoding.

So, you may want to use Gnu-iconv library. For Solaris 10/Nevada, you could download&install gnu-libiconv from www.sunfreeware.com,  for opensolaris, you could install SUNWgnu-libiconv from pkg.opensolaris.org,  but the OS.o package does not contain the header files.

You may notice that, the function symboles in gnu-libiconv, had been added the prefix of "lib", e.g., iconv_open -> libiconv_open. So, LD_PRELOAD and RUNPATH are not sufficient for replacing iconv(3) routines in libc. You need to make sure to include the "iconv.h" from gnu-libiconv.



和Solaris的iconv相比较,Gnu-iconv支持更多的编码转换,并且在某些编码转换上有更好的性能。例如,目前Solaris-iconv不支持从GB18030到UCS-2BE、UCS-4LE/BE和UTF-16LE/BE之间的转换;而GB18030<->UTF-8 (UCS-2LE)在Gnu-iconv中的转换速度,是Solaris-iconv的两倍。并且Gnu-iconv是一种星型结构(也有某些点到点的例外情况),它使用UCS-4作为中间转换的介质。而Solaris-iconv是一种点到点的结构(支持别名),因此添加一个新的编码实在是有些痛苦。

因此,你可能希望使用Gnu-iconv程序库。对Solaris 10/Nevada来说,你可以从www.sunfreeware.com下载并安装gnu-libiconv,对opensolaris你可以用pkg(1)从pkg.opensolaris.org上安装SUNWgnu-libiconv的程序包,不过这个包没有包括头文件。

你可能已经注意到了,gnu-libiconv中的符号名,都被加上了"lib"的前缀,例如iconv_open->libiconv_open。因此LD_PRELOAD和RUNPATH并不能替换libc中的iconv(3)调用。你必须确保include gnu-libiconv中的"iconv.h"头文件。

[0] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

C++ exceptions and shared objects

星期二 八月 12, 2008

In many application systems, shared objects are loaded as sub-modules via dlopen(3C). In case it's written by C++, you need to be careful about throwing exceptions. The client may try unload this module (by dlclose(3C)) in exception handling block, so that the application may crash when C++ runtime routines try to clean the exception (by calling its destructor) before leaving current catching function. In practice, exception class is usually defined in header file, so that its code body maybe included both in app and sub-module. Even in this case, linker or loader may choose the copy in sub-module. (E.g., the -Bdirect flag of ld(1)).

The conclusion is, do not unload a shared object when catching an exception thrown from it, or not to throw exceptions from a shared object. This is the root cause that SCIM x11 frontend would crash for the 2nd launching.

[0] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

SunPinyin Code Tour (6)

星期二 八月 12, 2008

6. Using the generated language model

In previous step, we generated the threaded language model. The next question is how to use this model? Let's look at the header file of CThreadSlm class, slm.h, here are the public methods:

  1. load(): Load the language model to memory
  2. free(): Release the memory allocated for the language model
  3. isUseLogPr(): If the language model use log values for probabilities
  4. transfer(history, wid, result): By given a word id wid, transfer from history state (history) to new state (result), and return P(wid|history).
  5. transferNegLog(history, wid, result): Similar with above method, but returns -log(p(wid|history).
  6. history_state_of(st): Get the h' of st, and return.
  7. historify(st): Set st to its h'.
  8. lastWordId(st): Return the last word id of state st. Assuming st is (lvl, idx), if lvl>=N, then return m_Levels[N][idx]; if 0<lvl<N, return m_level[lvl][idx]; if level==0 and idx==0, return the word id on pseudo root (i.e., 0), if idx>0 just return idx (we will discuss this situation when introducing history cache.

Now, let's look at slmseg to see how to searching by leveraging CThreadSlm class.

$ make slmids3
./slmseg -d ../raw/dict.utf8 -f bin -s 10 -m ../data/lm_sc.t3g ../raw/corpus.utf8 >../swap/lm_sc.ids

The source code of slmseg/slmseg.cpp:

struct TLatticeWord {
  int m_left;
  int m_right;   //[m_left, m_right) is the index range on lattice for this word
  int m_wordId;  //word id
};

struct TLatticeStateValue{
  double             m_pr;       //the probability on this state node
  TLatticeWord*      mp_btword;  //the back-trace word, i.e., the given word causing the transformation
  CThreadSlm::TState m_btstate;  //the back-trace state, i.e., where the current state comes from
};

/* The mapping of SLM state node, and its Lattice state information */
typedef std::map<CThreadSlm::TState, TLatticeStateValue> TLatticeColumnStates;

/* represent a column of lattice */
struct TLatticeColumn {
  TLatticeWordVec m_wordstarting;  //
The words starting from this column, these are actually the input words when expending this column,
                                   //the newly transferred state node, located at lattice[word.m_right].
  TLatticeColumnStates m_states;   //the SLM state nodes and their corresponding Lattice states
};

processSingleFile():

Read the corpus sentence by sentence. For each sentence, call buildLattice() to build the searching lattice, and call searchBest() to search on the lattice, and get the best segmentation by getBestPath(), and call output() to write out the best segmentation.

buildLattice(sentence, lattice):

Set the SLM state on lattice[0] as pseudo root. Initialize i as 0, perform FMM for sntnc[i..n], get the word length -- len. And call getAmbiLen() to get the length of maximum length of crossing-ambiguities -- ambilen. If ambilen <= len, then there is no crossing-ambiguities, then call insertLatticeWord() to insert ([i, i+len), Wid) to the word list of lattice[i]; set i=i+len, continue the loop. If there is crossing-ambiguities, call fullSegbuildLattice() to perform a full segmentation for sub-sentence sntnc[i..i+ambilen), for all segmented words (including all single-character word), assuming one's location is [left, right), insert it to lattice[left]; and set i=i+ambilen, continue the loop. After the iteration is done, add an end-of-sentence id (we use "。" for this id in following examples) at the end.

Use the example in section 2,

Perform the FMM for this sample sentence, the first word we get is "发扬", and no ambiguity, so insert "发扬" to lattice[0]. Then i=2, continue the loop. The FMM result is "为人民", but the ambilen is 6, so we need perform full segmentation for sntnc[2..7]. Insert all possible segments on idx (2<=idx<7) to latttice[idx]. Then i=2+6=8, continue the loop. The FMM result is "的", and no ambiguity. Then i=9, the FMM result is "精神" and no ambiguity. After the loop is finished, add an end-of-sentence ID at the tail. We could see, there is no word on lattice[1], and the words on lattice[2] are "为", "为人", and "为人民". And so on...

searchBest(lattice):

Iterate the lattice from 0. For all lattice state nodes on lattcie[i], use the words starting from this lattice ([left, right), Wid) to expand, save the newly transferred state node h' (CThreadSlm::historify(his)) to lattice[word.m_right].m_states. Note, given a word D, two different state nodes (A, C) and (B, C) may result in the same h' (i.e., (C, D)), so we only need to keep the path who has bigger probability.


Let's start from lattice[0], we have set it as pseudo root (0, 0) in buildLatice() method. There is only one word on lattice[0], "发扬"; call CThreadSlm::trnasferNegLog((0,0), wid_of("发扬")), get the new SLM state node and the probability P("发扬"), save its h' (lvl=1, idx) and its lattice state information (probability and back-trace information) to lattice[2]. Then look at lattice[1], there is no state node to be expanded. Then going forward to lattice[2], there is one state node, and three words ("为", "为人", "为人民"), save the newly transferred three nodes to lattice[3], lattice[4] and lattice[5]. And so on...

This is a dynamic programming problem, and what we used here is Viterbi Lattice algorithm.

getBestPath(lattice, segResult):

From end to start, back-trace the lattice, save the word id on best path to segResult. By reversing segResult, we got the best segmentation. For our example, it's the blue path on above figure.

[0] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

完美中的小小遗憾

星期六 八月 09, 2008



总体来说,北京奥运会的开幕式很完美,一开始有些拙劣的山水画作,在中途由孩子们添上了色彩和活力(特别是太阳变笑脸的细节),再最后由各国运动员的脚踩踏出绚烂的彩虹,的确是很好的创意。主题曲节奏悠缓,但是给人的印象不深,未能使人过耳不忘。中国代表团入场的时候,我一直期待姚明会将四川的那个小朋友抱起来放到肩膀上,不过未能看到。另外关于主火炬点燃,我本来猜想会是由海峡两岸三地的中国运动员共同点燃,那样也许更有意义且震撼人心。李宁空中跑动的姿态很潇洒,只是最后点燃的当口不知什么原因停顿了一小段时间。

中国,加油!北京,加油!!

[3] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg

SunPinyin Code Tour (5)

星期二 八月 05, 2008

5. Threading the language model

Let's look at the last step, threading (or adding back-off indices) to the pruned language model.

$ make m3_thread
./slmthread ../swap/lm_sc.3gm ../data/lm_sc.t3g

The goal of threading is to accelerate the searching in language model. Take trigram as an example, if we want to calculate the probability of sentence "ABCDE。", P(S) = P(A).P(B|A).P(C|AB).P(D|BC).P(E|CD).P(<EOS>|DE), and assume no back-off needed when evaluating all these conditional probabilities.

First, we get P(A) from level[1] (unigram) via a binary search; and find B from A's children via a binary search, to get P(B|A); and then find C from B's children via another binary search, to get P(C|AB). The next step is to calculate P(D|BC). Then we need go back to unigram to find B, and then get C from B's children, then get P(D|BC), all these done by binary searches. Obviously, the performance is low. When we evaluating P(C|AB), if we could directly locate to (B, C) from node C, we could get the search much faster.

We use (level, idx) to denote a node, the information on this node is described as (W, h', p, b, c):

  • W:word id. The location of this node implies its history information.
  • h':The back-off node, described as (level', idx').
  • p:p(W|h)
  • b:bow(h)
  • c:The start index of child node in next level.

Certainly, there is no b and c on leaf node. Now, the back-off structure becomes a graph from a tree. Its basic operation is: on current history state hi(level,idx), by given an input word W, it transfer to another history state hj(level', idx'), and return P(W|h). The level' on hj, not always equals to level-1. E.g., C(ABC)>0, but (B, C) is not seen in training, then the h' on this node is the C node on level[1]. If C is not seen either, h' is then the pseudo root (level[0][0]).

Let's look at the following snippet,

                                level+1
h (level,idx)    ----------->   W1,p1,h'1,[b,c]
    h',bow       \              ... ...
                  \             Wi,pi,h'i,[b,c]
                   \            ... ...
                    -------->   Wk,pk,h'k,[b,c]

Given an input word W. If W is one child of h, denoted as N (level+1, idx'), its p value is just P(W|h).
    If N has children, then N is the new history node;
    If not, the h' on N becomes the new history node.

If W is not child of h, execute the same processing from h' on (level, idx), find the new history node, and times the probability with bow(h), return it as P(W|h). Note, it's a recursive process.

Let's look at the code:

slmthread.cpp:

CSIMSlmWithIteration::getIdString(it, history):

The m_history member (std::vector<int>) holds the indices for each level, save the word ID on this node to passed in argument history. Method next() increase the last element in m_history (i.e., the index in current level), and call adjustIterator() to find the preceding nodes, save the indices to m_history.

CSIMSlmWithIteration::findBackOffState(n, hw, bol, bon):

Find the back-off node (h') of a specified n-gram (held in hw[1..n]), and return its level and idx. Call findState() to get the idx of hw[2..n] on level[n-1], if the index is not less than 0 then this node (n-1, idx) does have child node, return the location of h'. Otherwise, call findState() to get the idx of hw[3..n] on level[n-2], ... If the loop reach hw[n], return the pseudo root.

E.g., find the back-off node for trigram (A, B, C). Find out if (B, C) exists, if so, return (2, idx_BC). Otherwise, find out if (C) exists, if so return (1, idx_C). Otherwise, return (0, 0).

CSIMSlmWithIteration::findState(n, hw):

Find the index of specified n-gram (held in hw[1..n]) on level[n]. If it does not exist, return -1.

We also perform the compression for 32-bits floating numbers when threading. bow values are compressed to 14 bits, pr values are compressed to 16 bits. The basic idea is to count all bow (or pr) values, combine the clustering floating numbers, make the total number is under 1<<BITS_BOW (or 1<<BITS_PR); in the generated language model binary file, there are two floating tables, we need to search the table to get the original value. I'm not so clear about this algorithm, hopes the original author Phill Zhang to introduce more details.

Now, we get the final language model -- lm_sc.t3g. You could use tslminfo to look at the data in plain text format:

$ make m3_tslminfo
./tslminfo -p -v -l ../raw/dict.utf8 ../data/lm_sc.t3g >../swap/lm_sc.t3g.arpa

$ tail ../swap/lm_sc.t3g.arpa
点击 率 也   0.036363638937 (2,839922)
点击 率 最高   0.081818044186 (2,840054)
点击 率 的   0.096969485283 (2,840080)
点击 率 达到   0.036363638937 (2,840122)
点击 量 达   0.485714286566 (2,1132198)
点击 鼠标 ,   0.400000005960 (1,1)
点击 鼠标 左   0.309090882540 (2,1186378)
率队 取得 <unknown>   0.479999989271 (2,366031)
率队 在 <unknown>   0.130769222975 (2,431213)
率队 打 了   0.479999989271 (2,642183)

We could use CThreadSlm to access or use the language model. In next section, we will use slmseg as an example, to see how we construct the lattice and search by leveraging the generated language model.

[2] 评论
Like this post? del.icio.us | furl | slashdot | technorati | digg