I wrote a simple Python script to extract the contents from the Sogou corpus.

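#!/usr/bin/python
# Python 2 script: print the text of every <content> line as UTF-8.

import codecs
import sys

usage = """
Usage:
    sogou_corpus_conv.py corpus_in_xml > contents_in_txt
"""

try:
    # The corpus is GB18030-encoded; decode it to unicode while reading.
    corpus = codecs.open(sys.argv[1], "r", "GB18030")
except (IndexError, IOError):
    # Missing argument or unreadable file.
    print usage
    sys.exit(1)

for line in corpus:
    if line.startswith("<content>"):
        # Drop the line ending, then the opening and closing tags.
        line = line.rstrip("\r\n")
        start, end = len("<content>"), -len("</content>")
        line = line[start:end].replace(u'\ue525', '')
        print line.encode("UTF-8")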

With the extracted contents, you can go on to build the SunPinyin SLM (statistical language model).
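
The script above uses Python 2 syntax. For readers on Python 3, the same extraction might be sketched roughly as follows. The names `extract_contents`, `main`, `OPEN_TAG` and `CLOSE_TAG` are hypothetical and not part of the original script. The sketch also assumes that each `<content>` element sits on a single line and ends with `</content>`, which is a slightly stricter check than the prefix test above.

```python
import sys

OPEN_TAG, CLOSE_TAG = "<content>", "</content>"


def extract_contents(lines):
    """Yield the text between <content> and </content> for each matching line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(OPEN_TAG) and line.endswith(CLOSE_TAG):
            text = line[len(OPEN_TAG):len(line) - len(CLOSE_TAG)]
            # Drop the private-use character U+E525, as the script above does.
            yield text.replace("\ue525", "")


def main(path):
    # Assumption: the corpus file is GB18030-encoded, as in the script above.
    with open(path, encoding="gb18030") as corpus:
        for text in extract_contents(corpus):
            # Write UTF-8 bytes directly so the output does not depend on the locale.
            sys.stdout.buffer.write(text.encode("utf-8") + b"\n")
```

Calling `main(sys.argv[1])` at the end of the file would reproduce the command-line behaviour of the script above. Keeping `extract_contents` separate from the file handling makes it easy to try on a few sample lines first.
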

Comments:

this input-method seems more and more interesting...
how is the progress of porting SunPinyin to SCIM?
can't wait.

Posted by BlueF on November 4, 2007, 12:52 AM CST

Hi BlueF, the SCIM port is almost finished; the only missing feature is the configuration UI.

Posted by Yong Sun on November 4, 2007, 08:53 AM CST


This blog copyright 2009 by yongsun