Wednesday, July 23, 2008

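As I mentioned before, dbus+python may be a very good technical choice for implementing an input method framework. I discussed the idea with Huang Peng, the author of scim-python, and we both thought it was worth a try. Huang Peng knows both dbus and python in depth, and not long after he started, the implementation had already taken shape. This is now the ibus project, released under the LGPLv2.1 license.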

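Huang Peng contributed the python binding of the dbus server API to the dbus community, implemented the gtk and qt input method modules on top of glib-dbus and qt-dbus, implemented the input method BUS platform with python-dbus, ported the pinyin input method over from scim-python, wrote python bindings for anthy and m17n, and added those two input methods to the ibus platform. Probably the only thing still missing is an XIM front end. I only offered an occasional suggestion, which is rather embarrassing. ibus borrows many design ideas from scim and imbus and is a very promising open source project. Calling it the "next generation input method framework" is no exaggeration.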

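You can download the latest source code from http://github.com/phuang and build the project by following the instructions at http://code.google.com/p/ibus/wiki/ReadMe. Also, the gnome.asia summit will be held in Beijing this October, and there may be a session on input methods. We have invited many developers who are active in the input method community to join the discussion, and we hope to see you there! :)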

Wednesday, May 07, 2008

Your Python interpreter may have been compiled with --enable-unicode=ucs2, in which case the built-in unichr(i) function raises a ValueError if the given value is larger than 0xFFFF. The 'ucs2' here actually means utf16, which is a variable-length encoding, so you need a simple function to convert utf32/ucs4 to utf16. Here is an example code snippet:

def ucs4chr(codepoint):
    try:
        return unichr(codepoint)
    except ValueError:
        hi, lo = divmod (codepoint-0x10000, 0x400)
        return unichr(0xd800+hi) + unichr(0xdc00+lo)

def ucs4ord(s):
    if len(s)==1:
        return ord(s)
    if len(s)==2:
        # Surrogate pair: the high unit carries the top 10 bits, the low unit the bottom 10.
        hi, lo = ord(s[0])-0xd800, ord(s[1])-0xdc00
        return hi*0x400+lo+0x10000
    raise TypeError("ucs4ord() expected a valid ucs4 character")
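
On Python 3 there is no unichr and str always holds full code points, so the snippet above does not apply directly. The surrogate arithmetic itself can still be sketched on plain integers. The names to_surrogates and from_surrogates below are illustrative only and are not part of any library.

```python
def to_surrogates(codepoint):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    # 20 bits remain after subtracting 0x10000: 10 for each surrogate.
    hi, lo = divmod(codepoint - 0x10000, 0x400)
    return 0xD800 + hi, 0xDC00 + lo


def from_surrogates(hi, lo):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
```

Note that the low surrogate contributes the bottom 10 bits; leaving it out of the sum maps every character in a 1024-character block to the same code point.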

Sunday, April 20, 2008

We just added pyslm, the Python binding for SunPinyin's SLM (statistical language model), along with several alternative utilities written in Python. To build the extension, you need to download and install Cython. To build the pyslm extension on Solaris with SunStudio, you may need the following settings (the distutils module on Solaris seems to have trouble setting the compiler and flags properly):

$ export CC=/opt/SUNWspro/bin/CC
$ export LDFLAGS="-lCrun"

I followed the tutorial "Wrapping C++ Classes in Pyrex/Cython" to write the extension. The article only implicitly covers how to deal with reference parameters (I also figured it out by myself :)). For example, suppose we have a new member function for Rectangle:

void getPosition (int &x, int &y) 

Then we can declare this member function in the "pyx" file as:

ctypedef struct c_Rectangle "Rectangle":
    int x0, y0, x1, y1
    ... ...
    void getPosition (int x, int y)

This trick fools Cython/Pyrex into passing the arguments properly. It also works for C structures that have bit-fields.

Monday, March 31, 2008

I have recently been thinking about a new input method platform: python+dbus.

The advantages are that dbus has a python binding, which makes it easier to write an IM server daemon in python. On the client side, dbus also has glib and QT bindings, which makes it easier to write the input method modules for gtk and QT. For input method developers, writing the input method in python (optionally with a C/C++ python extension) is also a nice thing.

How about the performance? No idea, but I think it's worth a try to build a prototype. :)

Wednesday, December 12, 2007

Implementing recursion in the lambda calculus requires the Y combinator. Below is an implementation in Python.

Y = lambda F: (lambda f: F(lambda x: f(f)(x))) (lambda f: F(lambda x: f(f)(x)))
F = lambda f: (lambda x: 1 if x==0 else x*f(x-1))

fact = Y(F)
print fact(10)


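I find this Y combinator truly dizzying and baffling. For the derivation and the related theory, see the excellent article by Liu Weipeng (刘未鹏) and the related posts on the 负暄琐话 blog (1, 2). After several readings I still only vaguely understand a little of it, especially the diagonalization part near the end. I'll read it again, and again...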
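
As a small aid to intuition, here is the same combinator written for Python 3 and applied to two different functions. This is only a sketch: the parameter name recur and the fib example are my own additions for illustration.

```python
def Y(F):
    # Python evaluates arguments eagerly, so f(f) must be wrapped in
    # "lambda x: f(f)(x)" to delay the self-application until it is called.
    # Writing F(f(f)) directly would recurse forever before F ever runs.
    return (lambda f: F(lambda x: f(f)(x)))(lambda f: F(lambda x: f(f)(x)))


# Each function receives "itself" as the argument recur; none refers to its own name.
fact = Y(lambda recur: lambda n: 1 if n == 0 else n * recur(n - 1))
fib = Y(lambda recur: lambda n: n if n < 2 else recur(n - 1) + recur(n - 2))

print(fact(10))  # 3628800
print(fib(10))   # 55
```

The point to take away is that Y(F) behaves like F applied to Y(F): every time the inner function calls recur, the combinator rebuilds one more layer of F on demand.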

Monday, December 03, 2007

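I need to access a very large array in a Python program, where each item is an (int, int, float, int) record. Storing it directly in a list takes a huge amount of memory (not only is every number an object, each tuple is an object too). Python provides the array module for storing numeric values more efficiently, but it supports only a single data type per array; for example, you cannot create an array object like this: a = array.array('2lfl').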
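
To make the record layout concrete, the sketch below packs such records into fixed-size binary strings with the struct module, written for Python 3. The format "=2ifi" is my assumption (standard-size ints and float with no padding); the '2lfl' in the text would use the platform's native long size instead. The helper names are illustrative.

```python
import struct

# Two ints, one float, one int; '=' selects standard sizes and no padding.
RECORD = struct.Struct("=2ifi")


def pack_records(records):
    """Concatenate records into one contiguous bytes object."""
    buf = bytearray()
    for rec in records:
        buf += RECORD.pack(*rec)
    return bytes(buf)


def unpack_record(buf, idx):
    """Read back the idx-th record without unpacking the others."""
    return RECORD.unpack_from(buf, idx * RECORD.size)


data = pack_records([(1, 2, 0.5, 3), (4, 5, 1.5, 6)])
print(RECORD.size, len(data))  # 16 32
print(unpack_record(data, 1))  # (4, 5, 1.5, 6)
```

Each record costs exactly RECORD.size bytes, compared with a tuple object plus four number objects per record in a plain list.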

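I thought of storing the data in a file and accessing it through mmap. Apart from mmap, I don't know whether Python has any other way to get a block of raw memory, and mmap also has some advantages in performance and efficiency. After some detours, I ended up with the following code: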

import mmap, os, struct, tempfile

class MMArray:
    __file = __mem = __fname = None
    __realsize = __capsize = 0

    def __init__(self, type='B', fname=None, capsize=1024*1024):
        self.__elmsize = struct.calcsize(type)

        if not fname:
            # No file given: back the array with a temporary file.
            fno, self.__fname = tempfile.mkstemp("-mmarray", "pyslm-")
            self.__file = os.fdopen (fno, "w+b")
            self.__enlarge(capsize)
        else:
            self.fromfile(fname)

    def fromfile(self, fname):
        if not os.path.exists(fname):
            raise IOError("The file '%s' does not exist!" % fname)

        fsize = os.path.getsize(fname)
        if fsize == 0:
            raise IOError("The size of file '%s' is zero!" % fname)

        if self.__mem: self.__mem.close()
        if self.__file: self.__file.close()

        self.__file = open (fname, "r+b")
        self.__mem = mmap.mmap(self.__file.fileno(), fsize)
        self.__realsize = self.__capsize = fsize // self.__elmsize

    def tofile(self, fname):
        if fname == self.__file.name:
            raise ValueError("Can not dump the array to currently mapping file!")
        tf = open(fname, "wb")
        bsize = self.__realsize * self.__elmsize
        tf.write (self.__mem[:bsize])
        tf.close()

    def __enlarge(self, capsize):
        if self.__capsize >= capsize:
            return

        # Grow the file by writing one byte at the new end, then remap it.
        self.__capsize = capsize
        self.__file.seek(self.__elmsize * self.__capsize - 1)
        self.__file.write('\0')
        self.__file.flush()

        if (self.__mem): self.__mem.close()
        self.__mem = mmap.mmap(self.__file.fileno(), self.__file.tell())

    def __del__ (self):
        # Unmap first, then shrink the file to the part actually in use.
        if self.__mem: self.__mem.close()
        if self.__file:
            self.__file.truncate (self.__realsize * self.__elmsize)
            self.__file.close()
        # Only the temporary file created in __init__ is ours to delete.
        if self.__fname: os.remove(self.__fname)

    def __getitem__(self, idx):
        if idx < 0 or idx >= self.__realsize:
            raise IndexError
        return self.__access(idx)

    def __setitem__(self, idx, buf):
        if idx < 0 or idx >= self.__realsize:
            raise IndexError
        if type(buf) != type("") or len(buf) != self.__elmsize:
            raise TypeError("Not a string, or the buffer size is incorrect!")
        self.__access(idx, buf)

    def __access (self, idx, buf=None):
        start = idx * self.__elmsize
        end = start + self.__elmsize
        if buf is None: return self.__mem[start:end]
        self.__mem[start:end] = buf

    def size(self):
        return self.__realsize

    def append(self, buf):
        if type(buf) != type("") or len(buf) != self.__elmsize:
            raise TypeError("Not a string, or the buffer size is incorrect!")

        if self.__realsize >= self.__capsize:
            self.__enlarge(self.__capsize*2)

        self.__access(self.__realsize, buf)
        self.__realsize += 1

    def __iter__(self):
        for i in xrange(0, self.__realsize):
            yield self.__access(i)

    def truncate(self, tsize):
        if self.__realsize >= tsize:
            self.__realsize = tsize

Of course, there is still a lot to improve, such as supporting indexing from the end (i.e., index<0), slicing, and so on.
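
One possible shape for those two improvements is sketched below for Python 3. It is a separate, simplified class rather than a patch to MMArray: RecordArray and its methods are hypothetical names, the size is fixed at creation, and it uses an anonymous mapping (mmap.mmap(-1, length)) instead of a backing file.

```python
import mmap
import struct


class RecordArray:
    """Fixed-size records in an anonymous mmap, with negative indices and slices."""

    def __init__(self, fmt, count):
        self._struct = struct.Struct(fmt)
        self._count = count
        # Passing -1 as the file descriptor requests anonymous, zero-filled memory.
        self._mem = mmap.mmap(-1, self._struct.size * count)

    def __len__(self):
        return self._count

    def _offset(self, idx):
        # Map a negative index onto the end of the array, as list does.
        if idx < 0:
            idx += self._count
        if not 0 <= idx < self._count:
            raise IndexError("record index out of range")
        return idx * self._struct.size

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # slice.indices() clamps start/stop/step to the current length.
            return [self[i] for i in range(*idx.indices(self._count))]
        start = self._offset(idx)
        return self._struct.unpack(self._mem[start:start + self._struct.size])

    def __setitem__(self, idx, record):
        start = self._offset(idx)
        self._mem[start:start + self._struct.size] = self._struct.pack(*record)
```

Unlike MMArray, this sketch packs and unpacks tuples itself instead of taking pre-packed strings; that is a design choice for the example, not a claim about how the original should work.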

Sunday, November 18, 2007

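Python's built-in dict (dictionary) class uses a hash algorithm, so its keys are not ordered. In C++, std::map and std::set use a balanced binary tree (usually a red-black tree), so their keys are ordered. After searching the web, I found python-rbtree, a red-black tree module implemented in a mix of C and pyrex.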

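I wrote a very simple test program and ran it on Solaris x86 + python 2.4.4, using dict and rbtree in turn to insert two million records (the key is 3 integers and the value is 1 integer; you can probably guess what I'm doing :)). After the dict insertions finished, I called dict.keys().sort() to sort its keys (this is Python's built-in list sort, a merge-sort variant rather than quicksort). The result: the two approaches use about the same amount of memory (around 200M), but the hash-based approach is more than twice as fast. With five million records the result is much the same: similar memory use, and hashing about twice as fast.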
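
The original test program is not shown, so the following Python 3 sketch only illustrates the shape of such a comparison. Since python-rbtree is a third-party module, a list kept sorted with bisect.insort stands in for the ordered container; its insertion cost is linear, so its timings say nothing about a real red-black tree. All function names and the key range are assumptions.

```python
import bisect
import random
import time


def make_records(n, seed=0):
    """Generate n records with a 3-integer key and an integer value."""
    rng = random.Random(seed)
    return [((rng.randrange(1000), rng.randrange(1000), rng.randrange(1000)), i)
            for i in range(n)]


def dict_then_sort(records):
    """Insert everything into a dict, then sort the keys once at the end."""
    table = {}
    for key, value in records:
        table[key] = value
    return sorted(table)


def keep_sorted(records):
    """Stand-in for an ordered container: keep the keys sorted on every insert."""
    keys, table = [], {}
    for key, value in records:
        if key not in table:
            bisect.insort(keys, key)
        table[key] = value
    return keys


def timed(fn, records):
    """Return fn's result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(records)
    return result, time.perf_counter() - start
```

Calling timed(dict_then_sort, make_records(n)) and timed(keep_sorted, make_records(n)) for increasing n gives two timings to compare; both functions should return the same sorted key list.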

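At least at this scale, the built-in dict performs better. I also tried another red-black tree implementation in pure Python, RBTree.py, and the result was disappointing: with a large number of records it seemed unable to produce correct results at all.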

Conclusion: Python's dict can be trusted!

This blog copyright 2009 by yongsun