The Glowing Python: k-nearest neighbor search

Saturday, April 14, 2012

k-nearest neighbor search

A k-nearest neighbor search identifies the top k nearest neighbors to a query. The problem is: given a dataset D of vectors in a d-dimensional space and a query point x in the same space, find the closest point in D to x. The following function performs a k-nearest neighbor search using the euclidean distance:

from numpy import random,argsort,sqrt
from pylab import plot,show

def knn_search(x, D, K):
 """ find K nearest neighbours of data among D """
 ndata = D.shape[1]
 K = K if K < ndata else ndata
 # euclidean distances from the other points
 sqd = sqrt(((D - x[:,:ndata])**2).sum(axis=0))
 idx = argsort(sqd) # sorting
 # return the indexes of K nearest neighbours
 return idx[:K]

The function computes the euclidean distance between every point of D and x then returns the indexes of the points for which the distance is smaller.
Now, we will test this function on a random bidimensional dataset:

# knn_search test
data = random.rand(2,200) # random dataset
x = random.rand(2,1) # query point

# performing the search
neig_idx = knn_search(x,data,10)

# plotting the data and the input point
plot(data[0,:],data[1,:],'ob',x[0,0],x[1,0],'or')
# highlighting the neighbours
plot(data[0,neig_idx],data[1,neig_idx],'o',
  markerfacecolor='None',markersize=15,markeredgewidth=1)
show()

The result is as follows:

The red point is the query vector and the blue ones represent the data. The blue points surrounded by a black circle are the nearest neighbors.

16 comments:

Michael LinApril 19, 2012 at 5:36 PM
How does this compare with using Scipy's cKDtree? http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html
ReplyDelete
Replies
JustGlowingApril 19, 2012 at 5:49 PM
Hi Michael, the class scipy.spatial.cKDTree implements another algorithm for the nearest-neighbor search based on KDTrees. Of course, KDTrees have pro and cons. For example, the search cost using a KDTree is logarithmic (so, it's faster than the naive algorithm implemented here) but you have to build the tree and if need to delete or insert points in your dataset, you have to modify the tree.
If you need more details look at this: http://en.wikipedia.org/wiki/K-d_tree
ReplyDelete
Replies
UnknownDecember 12, 2012 at 11:02 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownDecember 12, 2012 at 11:17 AM
"""Thanks a lot, I made a small change so that the user can make several queries with one call of knn_serach(...). Sorry I don't know how to format code here"""

from numpy import random,argsort,sqrt,array,ones
from pylab import plot,show

# The function computes the euclidean distance between every point of D and x then returns the indexes of the points for which the distance is smaller.
def knn_search(x, D, K):
""" find K nearest neighbours of data among D """
ndata = D.shape[0]
# num of query points
queries=x.shape
K = K if K < ndata else ndata
# euclidean distances from the other points
diff=array(D*ones(queries,int)).T - x[:,:ndata].T
sqd=sqrt(((diff.T)**2).sum(axis=2))
# sorting
idx=argsort(sqd)
# return the indexes of K nearest neighbours
return idx[:,:K]

# Now, we will test this function on a random bidimensional dataset:
data = random.rand(200,2) # random dataset
x = array([[[0.4,0.4]],[[0.6,0.8]],[[0.9,0.2]],[[0.2,0.9]]]) # query points

# Performing the search
neig_idx = knn_search(x,data,10)

# Plotting the data and the input points
plot(data[:,0],data[:,1],'ob',x.T[0,0],x.T[1,0],'or')

# Highlighting the neighbours for each input
plot(data[neig_idx,0],data[neig_idx,1],'o', markerfacecolor='None',markersize=15,markeredgewidth=1)
#plot(data[neig_idx[1],0],data[neig_idx[1],1],'xk', markerfacecolor='None',markersize=15,markeredgewidth=1)
show()
ReplyDelete
Replies
AnonymousMarch 22, 2013 at 11:58 AM
Awesome code - this really helped me out! Thanks for sharing!
ReplyDelete
Replies
AnonymousMay 23, 2013 at 7:57 PM
GP,

Great tutorial. Thanks as always for uploading these. One question, tho: It's not clear to me (a beginner) what form "x" should take when it's passed into knn_search function. You say that x is "a query point," but what does a query point look like?? Is it a slice of ndata -- a point within the features?? Thank you for your thoughts!
ReplyDelete
Replies
AnonymousDecember 4, 2013 at 3:02 PM
I have my training data in a csv file. The data contains 35 points corresponding to 3D vector in 3 columns x,y, and z and a feature 'color' in the fourth column. Not being a great pythonista, how do I modify your code here to employ my data to test a random new vector?
ReplyDelete
Replies
UnknownFebruary 2, 2016 at 4:19 PM
should be axis=1 instead of axis=0 for euclidean distance
ReplyDelete
Replies
AnonymousSeptember 2, 2020 at 1:52 PM
Great post!

the only problem is, it only works for 2D spaces, while it would be much more usefull if it worked in higher dimentions!
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

Saturday, April 14, 2012

k-nearest neighbor search

16 comments:

Quote