星期六 三月 21, 2009

I just committed the python binding of CPinyinTrie class in SunPinyin -- pytrie.pyx (written in cython). Together with pyslm extension, you (and I off course) are able to start building the input method engine with python. And I'm also prototyping some new stuff for SunPinyin-2.0 with Python (in very early stage).

Who has interests in porting SunPinyin to ibus? ;)

To build it on Solaris with SunStudio,
$ export CC=/opt/SUNWspro/bin/CC
$ export LDFLAGS="-lCrun -lCstd"

星期三 二月 11, 2009

I tried to find some backup tools for jroller, some only backs up the recent 30 entries in RSS feed, some does not handul unicode characters. By referring the python/xmlrpclib and MetaWeblog API documents, I wrote a simple backup tool, you could download it here. (But I did not find a good way to get tags for entries.) You may need to modify the global variables, like blogid/user/passwd.
#!/usr/bin/python

import xmlrpclib

blogid = 'yongsun'
user = 'yong.sun@sun.com'
passwd = '**********'
host = 'http://blogs.sun.com'
server = xmlrpclib.ServerProxy(host+'/roller-services/xmlrpc', use_datetime=True)
num = 1000
... ...
... ...
Please note that the passwd is the "Weblog Client API Password", it's probably your original login password, anyway, you could set it in your profile page.

星期二 一月 13, 2009

Double-Array-Trie is wildly used in dictionary constructing, and there are some implementations out there, such as, darts (a plain implementation, does not have tail pool), darts-clone, dastrie, libdatrie etc. While for a small trie (like pinyin table), an implementation without tail pool is good enough. I write a simple constructing program in python, you could have a look at the full source here.

It's basically a breadth-first traverse, get the head node from queue, find a proper base for it and mark the check array for its child nodes, then append the child nodes to queue, continue the above processing till the queue is empty.
class DATrie (object):
    ... ...

    def find_base (self, s, children, i=1):
        if s == 0:
            return 0

        if not children:
            return s

        while True:
            for ch in children:
                k = i + self.encode_character (ch)
                if self.base[k] or self.check[k] or k == s:
                    i += 1
                    break
            else:
                break

        return i
    ... ...

    def construct_from_trie (self, trie, with_value=True):
        nodes = [(trie.root, 0)]

        while nodes:
            trienode, s = nodes.pop(0)
            b = self.find_base (s, trienode.trans)
            self.base[s] = -b if trienode.val else b
            if with_value: self.value[s] = trienode.val

            for ch in trienode.trans:
                c = self.encode_character (ch)
                t = abs(self.base[s]) + c
                self.check[t] = s if s else -1

                nodes.append ((trienode.trans[ch], t))

        for i in xrange (self.encode_character (max(trie.root.trans))+1):
            if self.check[i] == -1:
                self.check[i] = 0
For constructing larger dictionaries, I'd recommend you to use darts-clone, dastrie (only for a static trie), or libdatrie (only supports 16-btis value data).

References:
  1. An Implementation of Double-Array Trie
  2. An Efficient Digital Search Algorithm by Using a Double-Array Structure.

星期二 十二月 09, 2008

import gettext

''' 
gettext module will use the encoding specified in Content-Type header for 
Gnu mo files, and convert the message strings to unicode. Here you could 
sepcify the *output* encoding to others.
'''
gettext.bind_textdomain_codeset('gedit', codeset='UTF-8')

''' 
gettext module will try to retrieve messages from /usr/share/locale by 
default, otherwise you need to explicitly set it. 
'''
gettext.bindtextdomain ('gedit', '/usr/share/locale')

_ = lambda msg: gettext.dgettext ('gedit', msg)
N_ = lambda msg: msg

print _("Save")

星期四 十月 09, 2008

Spectral Clustering is the last topic of our NLP learning group activity, hosted by Feng. Here is my homework, you may refer to this tutorial for the symbols used in this simple program. While I still have no idea about the underlying principles in the algorithm.

#!/usr/bin/python
# copyright (c) 2008 Feng Zhu, Yong Sun

import heapq
from functools import partial
from numpy import *
from scipy.linalg import *
from scipy.cluster.vq import *
import pylab

def line_samples ():
    vecs = random.rand (120, 2)
    vecs [:,0] *= 3
    vecs [0:40,1] = 1
    vecs [40:80,1] = 2
    vecs [80:120,1] = 3
    return vecs

def gaussian_simfunc (v1, v2, sigma=1):
    tee = (-norm(v1-v2)**2)/(2*(sigma**2))
    return exp (tee)

def construct_W (vecs, simfunc=gaussian_simfunc):
    n = len (vecs)
    W = zeros ((n, n))
    for i in xrange(n):
        for j in xrange(i,n):
            W[i,j] = W[j,i] = simfunc (vecs[i], vecs[j])
    return W

def knn (W, k, mutual=False):
    n = W.shape[0]
    assert (k>0 and k<(n-1))

    for i in xrange(n):
        thr = heapq.nlargest(k+1, W[i])[-1]
        for j in xrange(n):
            if W[i,j] < thr:
                W[i,j] = -W[i,j]

    for i in xrange(n):
        for j in xrange(i, n):
            if W[i,j] + W[j,i] < 0:
                W[i,j] = W[j,i] = 0
            elif W[i,j] + W[j,i] == 0:
                W[i,j] = W[j,i] = 0 if mutual else abs(W[i,j])

vecs = line_samples()
W = construct_W (vecs, simfunc=partial(gaussian_simfunc, sigma=2))
knn (W, 10)
D =
diag([reduce(lambda x,y:x+y, Wi) for Wi in W])
L = D - W

evals, evcts = eig(L,D)
vals = dict (zip(evals, evcts.transpose()))
keys = vals.keys()
keys.sort()

Y = array ([vals[k] for k in keys[:3]]).transpose()
res,idx = kmeans2(Y, 3, minit='points')

colors = [(1,2,3)[i] for i in idx]
pylab.scatter(vecs[:,0],vecs[:,1],c=colors)
pylab.show()

星期三 十月 08, 2008

It's fairly easy to run k-means clustering in python, refer to $pydoc scipy.cluster.vq.kmeans (or kmeans2). While the initial selected centers affect the performance a lot. Thanks Feng Zhu, that introduced k-means++ to us, which is a very good and effective way to select the initial centers.

While it's quite confusing for its 1b step, selecting ci=x', with probability
\frac{D(x')^2} {\sum_{x \in X} D(x)^2}
But from authors' c++ implementation, the processing (Utils.cpp:chooseSmartCenters()) seems a little different with the description in paper. Looks like we only need to minimize the sum_{x in X} min (D(x)^2, ||x-xi||^2).

Here comes my simple python version:

def kinit (X, k):
    'init k seeds according to kmeans++'
    n = X.shape[0]

    'choose the 1st seed randomly, and store D(x)^2 in D[]'
    centers = [X[randint(n)]]
    D = [norm(x-centers[0])**2 for x in X]

    for _ in range(k-1):
        bestDsum = bestIdx = -1

        for i in range(n):
            'Dsum = sum_{x in X} min(D(x)^2,||x-xi||^2)'
            Dsum = reduce(lambda x,y:x+y,
                          (min(D[j], norm(X[j]-X[i])**2) for j in xrange(n)))

            if bestDsum < 0 or Dsum < bestDsum:
                bestDsum, bestIdx  = Dsum, i

        centers.append (X[bestIdx])
        D = [min(D[i], norm(X[i]-X[bestIdx])**2) for i in xrange(n)]

    return array (centers)

'to use kinit() with kmeans2()'
res,idx = kmeans2(Y, kinit(Y,3), minit='points')

星期五 九月 12, 2008

You may know that the interactive shell of python uses Gnu-libreadline (in GPL) to implement the history retrieving feature. However, does this impact your own python program or application with python interpreter embedded in?

I did a little study on the python interpreter, libraries and built-in modules on linux (which were built with libreadline), looks like there is only one extension module links to libreadline, /usr/lib/python2.5/lib-dynload/readline.so. And this module would only be loaded by the interactive python shell. When I launch an ordinary python script directly, or embed python interpreter to a C/C++ application, this module would not be loaded, unless the script imports the readline module explicitly. (P.S., I used pmap(1) to verify if libreadline.so is loaded in runtime.)

I think only the readline ext module is under GPL term, and it should be safe to build python with libreadline. :)

星期三 九月 10, 2008

from functools import wraps

def interface (fn):
    @wraps
    def to_be_implemented (*args):
        raise Exception ("Interface '%s' is not implemented!" % fn.__name__)
    return to_be_implemented

class Foo (object):
    @interface
    def test (self): pass

class Bar(Foo):
    def test (self): pass

class Qux(Foo):
    pass

Bar().test()
Qux().test()

星期一 八月 25, 2008

Just got the existing news from Huang Peng, ibus platform is approaching to be finished. Huang added XIM server agent recently, and also implemented ibus-chewing, ibus-hangul. Really impressive and rapid progress!

刚刚从Huang Peng那里听到一个好消息,ibus已经接近完成了。Huang Peng最近刚刚为ibus添加了XIM服务器的agent模块,并实现了ibus-chewing和ibus-hangul。真是进展神速啊!

星期三 七月 23, 2008

As I blog'd before, dbus+python maybe a wonderful technical choice for input method framework. I discussed with Huang Peng (the author of scim-python), we both thought it's worth to have a try. Huang Peng is an expert in dbus and python, after a short period, the project has grown to a remarkable level. That's the ibus project, licensed in LGPLv2.1.

Huang Peng contributed the dbus server API python binding to dbus community, and implemented gtk and qt input method modules basing on glib-dbus and qt-dbus, then implemented the input method bus system with python-dbus, and ported the pinyin IME from scim-python, added python binding for libanthy and libm17n, then added this two IMEs to ibus platform. Maybe the only missing part now, is an XIM frontend. I only contributed some design ideas :$. ibus leveraged many good designs from scim and imbus, it's a project which has huge potentials, and it's deserved to be named as "the next generation input method framework".

You could check out the latest code from http://github.com/phuang, then follow the instructions on http://code.google.com/p/ibus/wiki/ReadMe to build the project. BTW, the gnome.asia submit would be held in Oct. in Beijing, we may have a session about input methods, and we would invite the active input method developers to have a forum, looking forward to see you! :)

This blog copyright 2009 by yongsun