Saturday, October 11, 2008
Schedule:
And here is the PDF version of the presentation slides. Many thanks to the co-authors, Peng Wu, Peng Huang, and Feng Zhu, and thanks to James for reviewing. Looking forward to meeting you in the forum; please share your thoughts and comments with us!
Thursday, October 09, 2008
Spectral clustering is the last topic of our NLP learning group activity, hosted by Feng. Here is my homework; you may refer to this tutorial for the symbols used in this simple program. I still have no idea about the principles underlying the algorithm, though.
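#!/usr/bin/python
# copyright (c) 2008 Feng Zhu, Yong Sun
import heapq
from functools import partial
from numpy import *
from scipy.linalg import *
from scipy.cluster.vq import *
import pylab

def line_samples ():
    # 120 random points on three horizontal lines: y = 1, 2, 3, x in [0, 3)
    vecs = random.rand (120, 2)
    vecs [:,0] *= 3
    vecs [0:40,1] = 1
    vecs [40:80,1] = 2
    vecs [80:120,1] = 3
    return vecs

def gaussian_simfunc (v1, v2, sigma=1):
    # Gaussian similarity: exp(-||v1-v2||^2 / (2*sigma^2))
    tee = (-norm(v1-v2)**2)/(2*(sigma**2))
    return exp (tee)

def construct_W (vecs, simfunc=gaussian_simfunc):
    # symmetric similarity matrix over all pairs of samples
    n = len (vecs)
    W = zeros ((n, n))
    for i in xrange(n):
        for j in xrange(i,n):
            W[i,j] = W[j,i] = simfunc (vecs[i], vecs[j])
    return W

def knn (W, k, mutual=False):
    # sparsify W in place into a (mutual) k-nearest-neighbor graph
    n = W.shape[0]
    assert (k>0 and k<(n-1))
    # pass 1: in each row, negate the entries below the (k+1)-th largest
    # (k+1 because the row also contains the self-similarity W[i,i])
    for i in xrange(n):
        thr = heapq.nlargest(k+1, W[i])[-1]
        for j in xrange(n):
            if W[i,j] < thr:
                W[i,j] = -W[i,j]
    # pass 2: symmetrize; drop an edge rejected by both ends, and keep an
    # edge accepted by only one end unless mutual=True
    for i in xrange(n):
        for j in xrange(i, n):
            if W[i,j] + W[j,i] < 0:
                W[i,j] = W[j,i] = 0
            elif W[i,j] + W[j,i] == 0:
                W[i,j] = W[j,i] = 0 if mutual else abs(W[i,j])

vecs = line_samples()
W = construct_W (vecs, simfunc=partial(gaussian_simfunc, sigma=2))
knn (W, 10)
# degree matrix D and unnormalized graph Laplacian L = D - W
D = diag([reduce(lambda x,y:x+y, Wi) for Wi in W])
L = D - W
# generalized eigenproblem L v = lambda D v
evals, evcts = eig(L,D)
# order the eigenvectors (columns of evcts) by eigenvalue, keep the 3 smallest
vals = dict (zip(evals, evcts.transpose()))
keys = vals.keys()
keys.sort()
Y = array ([vals[k] for k in keys[:3]]).transpose()
# cluster the rows of Y with k-means and plot the result
res,idx = kmeans2(Y, 3, minit='points')
colors = [(1,2,3)[i] for i in idx]
pylab.scatter(vecs[:,0],vecs[:,1],c=colors)
pylab.show()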
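For readers on a current toolchain, here is a minimal sketch of the core steps of the program above (similarity matrix, Laplacian, generalized eigenproblem), assuming Python 3 with a recent NumPy and SciPy. It is not a port of the homework: the names gaussian_affinity and spectral_embedding are made up for illustration, it swaps in scipy.linalg.eigh for eig (W is symmetric and every degree is positive, so the eigenvalues come back real and already sorted), and it skips the knn() sparsification and the final k-means step.

```python
import numpy as np
from scipy.linalg import eigh

def gaussian_affinity(points, sigma=1.0):
    """W[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)), computed by broadcasting."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def spectral_embedding(W, k):
    """Solve L v = lambda D v and return the k smallest eigenpairs."""
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # unnormalized graph Laplacian
    evals, evecs = eigh(L, D)    # real eigenvalues in ascending order
    return evals[:k], evecs[:, :k]

# Illustrative data only: 30 random points in the unit square.
rng = np.random.default_rng(0)
points = rng.random((30, 2))
evals, Y = spectral_embedding(gaussian_affinity(points, sigma=2.0), 3)
print(Y.shape)  # one 3-dimensional row per sample
```

The rows of Y are what would be handed to kmeans2. Because the graph here is fully connected, the smallest eigenvalue should be numerically zero.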
Wednesday, October 08, 2008
It's fairly easy to run k-means clustering in Python; refer to $pydoc scipy.cluster.vq.kmeans (or kmeans2). However, the initially selected centers affect the performance a lot. Thanks to Feng Zhu, who introduced k-means++ to us; it is a very good and effective way to select the initial centers.
Its step 1b is quite confusing, though: select ci = x' with probability
D(x')^2 / sum_{x in X} D(x)^2
But judging from the authors' C++ implementation, the processing (Utils.cpp:chooseSmartCenters()) seems a little different from the description in the paper. It looks like we only need to pick the xi that minimizes
sum_{x in X} min (D(x)^2, ||x-xi||^2).
Here comes my simple Python version:
from numpy import array
from numpy.random import randint
from scipy.linalg import norm
from scipy.cluster.vq import kmeans2

def kinit (X, k):
    'init k seeds according to kmeans++'
    n = X.shape[0]
    # choose the 1st seed randomly, and store D(x)^2 in D[]
    centers = [X[randint(n)]]
    D = [norm(x-centers[0])**2 for x in X]
    for _ in range(k-1):
        bestDsum = bestIdx = -1
        for i in range(n):
            # Dsum = sum_{x in X} min(D(x)^2,||x-xi||^2)
            Dsum = reduce(lambda x,y:x+y,
                          (min(D[j], norm(X[j]-X[i])**2) for j in xrange(n)))
            if bestDsum < 0 or Dsum < bestDsum:
                bestDsum, bestIdx = Dsum, i
        centers.append (X[bestIdx])
        D = [min(D[i], norm(X[i]-X[bestIdx])**2) for i in xrange(n)]
    return array (centers)

# to use kinit() with kmeans2()
res,idx = kmeans2(Y, kinit(Y,3), minit='points')
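For comparison, the randomized rule from the paper's step 1b (pick the next center with probability proportional to D(x)^2) could be sketched as below. This is only an illustration in Python 3 with NumPy, not the authors' C++ code and not the greedy variant above; the name kmeanspp_init and the rng parameter are made up here, and it assumes X has at least k distinct rows so the D^2 weights never sum to zero.

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """Pick k rows of X as initial centers by D^2-weighted sampling."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # d2[i] is D(x_i)^2, the squared distance to the nearest chosen center.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # Step 1b: sample x' with probability D(x')^2 / sum_x D(x)^2.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)

# Illustrative data only: three well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0], [10.1, 0.0]])
centers = kmeanspp_init(X, 3, rng=np.random.default_rng(0))
print(centers)
```

A point that is already a center has D(x)^2 = 0, so it can never be drawn again; far-away points are the most likely picks, which is the whole idea of the seeding.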
Wednesday, October 08, 2008
Five years ago, on Oct 8, 2003, I joined Sun China ERI and started working in the Asian Globalization Center. It has been a really amazing time for me at Sun and AGC. I love the open culture, the awesome teammates and colleagues, the comfortable and beautiful working environment, the coffee and tea times ...
It's a new start for me, and I'm looking forward to my 10-year, 15-year, and 20-year anniversaries.
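Tuesday, September 30, 2008
Actually, I quite understand Fan Paopao ("Runner Fan"): he loves his infant daughter, still in swaddling clothes, very, very much. So young a life makes parents feel they must protect him or her at any cost, and that cost includes their own lives, and perhaps also their adherence to professional ethics. If his daughter were already 8 or 9 years old and beginning to grow up, perhaps he could have calmly faced death at that critical moment.
After Xiaoxiao was born, I also thought back to that classic question: if your child, your wife, and your parents all fell off a boat into the water, whom would you save first? My wife and I gave the same answer: the child first. When I used to travel abroad on business, the beneficiary of my insurance was my wife; now it has been changed to Xiaoxiao (and I thank my wife for understanding this). If some great disaster befell me, I would do everything I could to survive. This will to survive makes you strong, and it also makes you fragile.
I cannot imagine, and will not allow, Xiaoxiao being away from his parents for a long time before he turns 6. Although my parents and my parents-in-law are confident they can take good care of him, and we have no great objection to that, we also firmly believe that we can do better.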
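Monday, September 22, 2008
CCTV said that melamine was detected in Yili products sold on the market, but that the dairy products supplied to the Olympic Games had no problems. The implication is that the products supplied to the Olympics were different, which also suggests that Yili actually knew the products sold on the market carried quality risks. Yili, for its part, stated that the products supplied to the Olympics and those sold on the market are of the same quality. Yet the test results for 69 batches of products from 22 companies tell us that the dairy companies actually knew that the milk supply from certain places might have quality problems, and that this has more or less become an unwritten rule of the industry.
I hope the whole affair will be thoroughly investigated, and will not end with just pulling products from the shelves, destroying them, and paying compensation. There must be severe punishment!
Copied from Phoenix TV: the Chinese people have completed their chemistry education through food:
- From rice we learned about paraffin wax
- From ham we learned about dichlorvos
- From salted duck eggs and chili sauce we learned about Sudan Red
- From turbot we learned about malachite green
- From hot pot we learned about formalin
- From white fungus and candied dates we learned about sulfur
- From wood ear fungus we learned about copper sulfate
- And today Sanlu has taught our compatriots the chemical effects of melamine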
Monday, September 22, 2008
According to the man page, iconv(3C) returns the number of non-identical conversions (non-reversible, in GNU's vocabulary) it performed, when any occur. But what is a non-identical/non-reversible conversion? Try the following example:
$ echo "abc测试" | iconv -f UTF-8 -t ASCII
abc??
As you can see, the last two characters are outside the range of ASCII, although they are valid UTF-8 characters. In this case, iconv(3C) converts each of them to a substitute character (the non-identical conversion character, in some cases '?') and returns 2. GNU's iconv, however, raises an error in this case (it returns -1 and sets errno to EILSEQ), even though, according to its man page, EILSEQ indicates an invalid multibyte sequence in the input.
So relying on this return value is not a portable way to tell whether the target encoding can represent the source content. Nor should you rely on the count of non-identical conversions: an otherwise successful conversion may return -1 with errno set to E2BIG when the output buffer is exhausted, even though non-identical conversions have already happened along the way. And there is no flag in iconv(3C)/iconv_open(3C) to control whether to perform non-identical conversions or to raise an error.
In general, iconv(3C) is not a well-defined interface.
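To make the two behaviors concrete without depending on any particular iconv implementation, here is a small Python 3 sketch. It does not call iconv(3C) at all; it uses Python's own codecs, which let the caller choose explicitly between substituting and failing. The helper name count_unrepresentable is made up for illustration.

```python
def count_unrepresentable(text, encoding):
    """Count the characters of `text` that `encoding` cannot represent."""
    count = 0
    for ch in text:
        try:
            ch.encode(encoding)  # strict mode: raises on an unencodable character
        except UnicodeEncodeError:
            count += 1
    return count

text = "abc测试"

# Substituting, like the "abc??" output shown above.
print(text.encode("ascii", errors="replace"))  # b'abc??'

# Counting the characters that needed substitution.
print(count_unrepresentable(text, "ascii"))    # 2

# Failing instead of substituting, like the EILSEQ behavior.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

The point of the sketch is only that the substitute-or-fail choice is made by the caller here, which is exactly the control that iconv(3C) does not offer.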
P.S. This post summarizes a discussion between the JDS/Evolution team and me while we were trying to locate and isolate an iconv(3C)-related bug.
Friday, September 19, 2008
Just saw the message on [osol-discuss] that David is working on porting GNU libc (glibc) to Solaris/OpenSolaris and has made really impressive progress. Like the effort to port glibc to BSD, it makes GNU/kOpenSolaris (the OpenSolaris kernel plus a GNU userland) a reality.
Here are the references:
- http://csclub.uwaterloo.ca/~dtbartle/opensolaris
- https://savannah.nongnu.org/projects/glibc-bsd
Wednesday, September 17, 2008

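Plenty of highlights: 21 megapixels, H.264 1080p HD video with AF available while recording, and Microdrive support. Looking forward to a big price drop on the D700, haha...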
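Tuesday, September 16, 2008
Xinwen Lianbo (the CCTV evening news) reported that melamine was detected in 69 batches of products from 22 infant formula companies, including first-tier brands such as Synutra, Yili, Mengniu, and Yashili, as well as many second-tier regional brands. Truly shocking! Meanwhile, the same companies' products supplied to the Olympics or exported overseas had no problems, which is truly infuriating! "When the youth are strong, China is strong," yet now even the lives and safety of the nation's children cannot be guaranteed.
Not only should the individuals and companies involved be severely punished; the heads of quality inspection departments at every level should resign collectively to take the blame and be prosecuted for dereliction of duty!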