A simple script to extract the contents from the Sogou corpus:
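#!/usr/bin/python
# Python 2 script: prints the text of every <content> element as UTF-8.
import codecs
import sys

usage = """
Usage:
sogou_corpus_conv.py corpus_in_xml > contents_in_txt
"""

try:
    # The Sogou corpus is distributed in GB18030 encoding.
    corpus = codecs.open(sys.argv[1], "r", "GB18030")
except (IndexError, IOError):
    # No file name given, or the file could not be opened.
    print usage
    sys.exit(1)

for line in corpus:
    if line.startswith("<content>"):
        # Skip the opening tag; drop the closing tag plus the trailing newline.
        start, end = len("<content>"), -len("</content>") - 1
        # U+E525 is a private-use character that appears in the corpus.
        line = line[start:end].replace(u'\ue525', '')
        print line.encode("UTF-8")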
With the extracted contents, you could continue to build the SunPinyin SLM.
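
The script above targets Python 2. As a rough sketch only, the same extraction might look like this on Python 3. The names extract_content and extract_contents are made up for this illustration and are not part of SunPinyin, and the sketch assumes, like the script, that each <content> element sits on a single line.

```python
import io

OPEN_TAG = "<content>"
CLOSE_TAG = "</content>"


def extract_content(line):
    """Return the text of a <content> line, or None for any other line."""
    if not line.startswith(OPEN_TAG):
        return None
    # Drop the line ending first so the closing tag is the last thing left.
    text = line.rstrip("\r\n")[len(OPEN_TAG):]
    if text.endswith(CLOSE_TAG):
        text = text[:-len(CLOSE_TAG)]
    # Remove the private-use character U+E525, as the original script does.
    return text.replace("\ue525", "")


def extract_contents(path, encoding="GB18030"):
    """Yield the text of every <content> line in the corpus file at path."""
    with io.open(path, "r", encoding=encoding) as corpus:
        for line in corpus:
            text = extract_content(line)
            if text is not None:
                yield text
```

Unlike the original, this version strips the line ending explicitly, so it does not depend on each line ending in exactly one newline character. To write the result as UTF-8, iterate over extract_contents(path) and write each item to a file opened with encoding="UTF-8".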


This input method seems more and more interesting...
How is the progress of porting SunPinyin to SCIM?
Can't wait.
Posted by BlueF on November 4, 2007, 12:52 AM CST #
Hi BlueF, the SCIM port is almost finished; the only missing feature is the configuration UI.
Posted by Yong Sun on November 4, 2007, 08:53 AM CST #