- for each center we identify the subset of training points (its cluster) that is closer to it than any other center;
- the means of each feature for the data points in each cluster are computed, and this mean vector becomes the new center for that cluster.

*x*can be assigned to the cluster of the closest prototype.

The Scipy library provides a good implementation of the K-Means algorithm. Let's see how to use it:

from pylab import plot,show from numpy import vstack,array from numpy.random import rand from scipy.cluster.vq import kmeans,vq # data generation data = vstack((rand(150,2) + array([.5,.5]),rand(150,2))) # computing K-Means with K = 2 (2 clusters) centroids,_ = kmeans(data,2) # assign each sample to a cluster idx,_ = vq(data,centroids) # some plotting using numpy's logical indexing plot(data[idx==0,0],data[idx==0,1],'ob', data[idx==1,0],data[idx==1,1],'or') plot(centroids[:,0],centroids[:,1],'sg',markersize=8) show()The result should be as follows:

In this case we splitted the data in 2 clusters, the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.

Let's see try to split the data in 3 clusters:

# now with K = 3 (3 clusters) centroids,_ = kmeans(data,3) idx,_ = vq(data,centroids) plot(data[idx==0,0],data[idx==0,1],'ob', data[idx==1,0],data[idx==1,1],'or', data[idx==2,0],data[idx==2,1],'og') # third cluster points plot(centroids[:,0],centroids[:,1],'sm',markersize=8) show()This time the the result is as follows:

Your blog is awesome and has helped me discover all kinds of useful tidbits in scipy and numpy. Thank you!

ReplyDeleteThanks for the post. I have used kmeans to identify clusters (rings) in a matrix of sea surface height. The objective is to identify the rings and to determine their centroids. But kmeans requires as input parameter the number of clusters to be sought. That is a problem because I usually do not know previously how many rings will be present in the area. So, I was wondering how to avoid this kmeans limitation. Do you have any idea?

ReplyDeleteRegards,

Alex

Hello Alexandre,

Deletehow to find a good choice for k is a studied problem in machine learning. I suggest you to read this paper: http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf

Thanks for your comment!

from where did you get the module "from scipy.cluster.vq import kmeans,vq"? where can I get it?

ReplyDeletethis stuff looks great, btw.

It come with the scipy installation:

Deletehttp://docs.scipy.org/doc/scipy/reference/cluster.vq.html#module-scipy.cluster.vq

Excellent blog, JG. I love it. I use it and recommend it to others. One question. Many of the k-means tutorials that I've found rely on self-made, perfectly configured data -- i.e., they'll "generate" numbers in just the form necessary to get an easy k-means clustering. Whereas in the real world many of us are using k-means to cluster documents (using NLP) or other information that requires a great deal of work/formatting before we can even use k-means. For example, on documents first one must create a vector (tf-idf for instance) and then complete a similarity measurement (euclidean for instance).

ReplyDeleteThus, the greatest challenge to performing a k-means is often just getting the data into a format that the kmeans calls can work with. For instance, tutorials always seem to have their data in the following list-within-a-list format:[[number1, number2], [number1, number2], [number1, number2], [number1, number2]... ]. How would you recommend getting documents (i.e., natural language) into that format? Note that cosine similarity performed on a tf-idf vector always returns a single list, which isn't "clusterable" (to my knowledge).

Any help appreciated and thank you in advance for all of your work.

Hello,

DeleteYou could take a look to the vector space models. Those models make you able to build a fixed length vector using the frequencies of some words you're interested in. This post is a good introduction:

http://pyevolve.sourceforge.net/wordpress/?p=1589

I'll check it out. Thanks much, JG.

Deletei am not able to view the result what can be done..

ReplyDeleteshow() command is not working

sorted out..

DeleteExcellent! Help me much to understand k-means clustering.

ReplyDeleteIs it possible to use this in a capacitated-VRP (vehicle routing problem) ?

In which, each node has "demand", and there is a fixed vehicle's "capacity".

Subject to: sum of all node's demands (in a cluster) is smaller-or-equal to vehicle's capacity.

Any helps very much appreciated. Thank you.

I mean as explained in here, "Cluster-First Route-Second Method"

Deletehttp://neo.lcc.uma.es/vrp/solution-methods/heuristics/cluster-first-route-second-method/

Thanks...

This is excellent material, and your code explaining how to use the scipy implementation is beautiful and clear. Just recently I've implemented the algorithm in python myself, it's a lot of fun to play around with configurations to see the clustering in action: http://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

ReplyDeleteCould you please discuss a bit the role of 'whitening' which seems to be kind of highly reccommended by the scipy tutorial?

ReplyDeleteHi, you can find some general information about the whitening transformation here: http://en.wikipedia.org/wiki/Whitening_transformation

DeleteUsually, when you have a data matrix, you use whitening in order to have unit variance across all the samples. This operation could improve the result of the k-means algorithm.

Thank you very much for your post, it's very helpful. Here, your dataset has two variables that you partition into 2 and then 3 clusters, so it makes sense to plot the k-means clusters like this, with 1 variable on X-axis and one on the Y-axis. but what about if your dataset has more dimensions? Do you have any suggestions about ways to look at the output under these circumstances?

ReplyDeleteThanks,

SS

Hi, if you have more than two variables and you need to visualize your data in a 2D space, you need a dimensionality reduction method. Here's an example on how to use Isomap: http://glowingpython.blogspot.co.uk/2012/05/manifold-learning-on-handwritten-digits.html

DeleteInteresting, I'll check it out. Thanks for the speedy response!

DeleteThis comment has been removed by the author.

ReplyDeleteThis post has been very helpful. Thank you very much

ReplyDeleteIf I would like say 100 clusters from k-means, too many to do the plotting by hand as you have in this example, how would you visualise/plot the results from the clustering?

ReplyDeleteYou have to defined a list with a color for each cluster and plot each cluster separately with a for loop choosing the appropriate color each time.

DeleteGreat post.

ReplyDeleteDo you know of anything similar that can find clusters in 3D space? For example, I have a set of particles each with x, y, z co-ordinates. How would I go about finding clusters (of various densities) in such a dataset?

Any thoughts or ideas?

kmeans works in 3D spaces. You just have to modify the input matrix.

DeleteFantastic. Will investigate it further. Thanks.

DeleteNice demonstration. But can you tell me how to use this scipy.cluster.vq module to generate codebook for an array of mfcc feature vectors. I've extracted the MFCC feature vectors (13 coefficients) ...now i wish to use vq to perform pattern matching stuff. any idea ?

ReplyDeleteOf course, you could cluster your MFCC vectors in order to find similarities in some parts of your signal. But the applications that you can realize are related to the kind of signal you have (speech, music, ...).

DeleteThis comment has been removed by a blog administrator.

ReplyDeleteHi I have a problem using k-mean clustering with scipy.

ReplyDeleteI have a set of data as x-axis and y-axis

[[-0.0365, 0.0121],

[ 0.0623, -0.0019],

[ 0.0352, -0.0007],

[ 0.0609, -0.0096]]

If i use the k-mean function from matlab it clusters it properly(i mean as it is expected) i.e 1st row and last row comes under one cluster and middle two rows comes under one cluster.

But, when i use scipy as it is told in this blog, results are not as expected i.e 1st row comes under one cluster and last 3 rows comes under another cluster. Can any one pls tell me why is it so?

Tnx in advance:)

Hi Shambhulingayya, just try to plot your data. You'll see that your first sample is an outlier (it's even the only one with a negative x and a positive y) and that the other 3 samples are close each other. The clustering provided by scipy makes more sense to me.

DeleteWell, BTW FYI, I am using it for pattern recognition. The data sets i mentioned above are the output from the PCA(reduced dimensional dataset).

DeleteBut, in reality my patterns which i know by it's appearances are, (1st and 4th) are similar and (2nd and 4th are similar).

Why i am doubting this because, if i cluster the same data using matlab then, clustering is happening perfectly right i.e 1st and 4th under one cluster and 2nd and 3rd under another cluster.

Any thoughts in this...?

FYI, here r the K-mean output of matlab and python which differ each other for same data set

DeleteMatlab - https://drive.google.com/file/d/0B4XLQnqhjPJiUWp2MFRLU3F4aWM/view?usp=sharing

Python - https://drive.google.com/file/d/0B4XLQnqhjPJiaWFGc0hvNDdGM3c/view?usp=sharing

Try to use the same starting centroids and the same number of iterations on both the implementations.

DeleteThis comment has been removed by the author.

ReplyDeleteVery Informative Blog! Could you please tell me how to identify which cluster stands for what? For eg, if I have 4 features for t-shirt sizes (age, weight, height, gender) and if I get 3 clusters, how to find which cluster out of the 3 stands for small size, medium size and large size?

ReplyDeleteThe values of the centroids represent your variables. If you look at them you will be able to understand which centroid represents the size of the related cluster.

DeleteI am new to scipy. Can any one tell me how do I extract the data points belonging to each cluster? Here is the code-

ReplyDeletedata = vstack(arr1)

centroids,_ = kmeans(data,4)

idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',data[idx==1,0],data[idx==1,1],'or',data[idx==2,0],data[idx==2,1],'og',data[idx==3,0],data[idx==3,1],'oy')

plot(centroids[:,0],centroids[:,1],'sg',markersize=8)

Thank you so much for this writeup. It has helped me dip my toes into kmeans and scipy. I look forward to continuing.

ReplyDeleteI had one quick question about labeling points. I have k=8 as my best fit for my data, and can differentiate the clusters well. I imported data from a pandas dataframe with an index, then subsequently into a numpy array to perform clustering and plotting-by-idx. Can you suggest a method to take the index.values from the dataframe and label the plot accordingly so I can associate the specific points with their sample of origin?

I BDavis, DataFrame has the attribute index which gives what you probably want.

DeleteThank you for the prompt reply!

DeleteI am able to obtain the ordered index, no problem. I am having difficulty integrating the sample titles contained within with the logical index plotting methods you implement. I was trying to iterate over it in a list, but associating it with the coordinates within the cluster data is elusive.

What is the similarity measure of this implementation of k-means? Thank you.

ReplyDeleteHi, this algorithm uses the euclidian distance.

DeleteHey.

DeleteIs it possible to use some other metrics like Pearson correlation?

Or can I use precalculated distance matrix for the clustering?

Hi L, not with this implementation.

DeleteWorks perfectly for me. thanks

ReplyDeletehey

ReplyDeleteEvery time you perform the algo, the centroid number happens to change making the plots colouring different at every run. Does anybody knows how to fix that?

Thank you

The initial centroid are randomly chosen. You can fix the initial random seed using this function: numpy.random.seed

DeleteThanks a lot, It's a nice trick.

DeleteAnother question, when you get the centroids from the "whiten" data, how do you get back the real values for these centroids? ... maybe not clear as a question

It depends on how you "whithen" your data. If you simply removed the mean and divided by the standard deviation, you can just multiply by the old standard deviation and add the old mean again.

DeleteOk, I used python whiten which just divide each column by its std ... I thought it subtract the mean firstly to get a standard score... Thats why I couldnt get back to my real data. Thanks a lot !

Delete