It's fairly easy to run k-means clustering in python, refer to $pydoc scipy.cluster.vq.kmeans (or kmeans2). While the initial selected centers affect the performance a lot. Thanks Feng Zhu, that introduced k-means++ to us, which is a very good and effective way to select the initial centers.

While it's quite confusing for its 1b step, selecting ci=x', with probability
\frac{D(x')^2} {\sum_{x \in X} D(x)^2}
But from authors' c++ implementation, the processing (Utils.cpp:chooseSmartCenters()) seems a little different with the description in paper. Looks like we only need to minimize the sum_{x in X} min (D(x)^2, ||x-xi||^2).

Here comes my simple python version:

def kinit (X, k):
    'init k seeds according to kmeans++'
    n = X.shape[0]

    'choose the 1st seed randomly, and store D(x)^2 in D[]'
    centers = [X[randint(n)]]
    D = [norm(x-centers[0])**2 for x in X]

    for _ in range(k-1):
        bestDsum = bestIdx = -1

        for i in range(n):
            'Dsum = sum_{x in X} min(D(x)^2,||x-xi||^2)'
            Dsum = reduce(lambda x,y:x+y,
                          (min(D[j], norm(X[j]-X[i])**2) for j in xrange(n)))

            if bestDsum < 0 or Dsum < bestDsum:
                bestDsum, bestIdx  = Dsum, i

        centers.append (X[bestIdx])
        D = [min(D[i], norm(X[i]-X[bestIdx])**2) for i in xrange(n)]

    return array (centers)

'to use kinit() with kmeans2()'
res,idx = kmeans2(Y, kinit(Y,3), minit='points')
评论:

发表一条评论:
该日志评论功能被禁用了。

This blog copyright 2009 by yongsun