Talk:k-nearest neighbors algorithm


Comment[edit]

First citation needs to be fixed. It actually doesn't mention Cover at all.

Comment[edit]

I will write a more complete article for this when the exam season has passed.


Are there not any weighted kNN algorithms, that will take (to some degree) the actual distances into account? If the 5 closest neighbors are 0.1, 4000, 10000, 20000 and 55570 units away, we might say the very close first match has more importance than the others.

  • For an example of a weighted kNN algorithm, see F. Nigsch et al., Journal of Chemical Information and Modeling, 46, 2412-2422 (2006), DOI: 10.1021/ci060149f

So what is the complexity of kNN? I'm guessing O(n^2) but I'm not 100 % sure...

A kd-tree can be built in O(n log(n)) time, and queried in O(log(n)) time. A Ball-tree provides better guarantees and can be even faster than a kd-tree for finding neighbors. It's only O(n^2) if you're lazy and you code it up by comparing distances between every pair of points. Also, yes weighted k-NN is very common. If you weight by the inverse of the distance, it is called linear weighting because it is equivalent to using linear interpolation. This weighting is especially good for regression problems. For classification, it is more common to weight by the inverse of the square of the distance because it's faster to compute and tends to classify slightly better. This article also requires some significant discussion of feature-scaling, which can make k-NN significantly more accurate and more robust to irrelevant attributes. I like to use a simple hill-climber with cross-validation on the training set to adjust feature scaling factors. It improves accuracy by a significant amount. --128.187.80.2 (talk) 22:26, 9 December 2008 (UTC)[reply]
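To make the kd-tree query and the linear (inverse-distance) weighting described above concrete, here is a minimal sketch assuming SciPy's cKDTree and NumPy are available; the arrays X, y and the query point q are hypothetical stand-ins for real training data, not anything from the article:

 import numpy as np
 from scipy.spatial import cKDTree
 
 # Hypothetical training data: n points in d dimensions with real-valued targets.
 rng = np.random.default_rng(0)
 X = rng.normal(size=(1000, 3))                          # training points
 y = X.sum(axis=1) + rng.normal(scale=0.1, size=1000)    # regression targets
 q = np.zeros(3)                                         # query point
 
 # Build the kd-tree once in O(n log n); each query is then roughly O(log n).
 tree = cKDTree(X)
 dist, idx = tree.query(q, k=5)
 
 # Inverse-distance ("linear") weighting for regression, as described above.
 eps = 1e-12                                             # guard against a zero distance
 w = 1.0 / (dist + eps)
 prediction = np.sum(w * y[idx]) / np.sum(w)
 print(prediction)

For classification, the same weights (or their squares, per the comment above) would be summed per class instead of averaged over targets.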
For a single classification result, it's O(n):
  • Step 1: calculate the distances: O(n)
  • Step 2: find the k smallest: also O(n)--see selection algorithm
  • Step 3: count them: O(k)
If there are multiple classification results, this can be made better (e.g. by first binning or otherwise indexing the training data--this is briefly mentioned in the article), but then we have to introduce another variable: m, say, for the number of classification results (I call them "test points")... A minimal sketch of the three steps above follows below. Peteymills (talk) 15:17, 15 June 2010 (UTC)[reply]
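Here is a minimal sketch of the three steps above for a single query, assuming NumPy; X, labels, and the query point are hypothetical illustrations, not a reference implementation:

 import numpy as np
 
 def knn_classify(X, labels, q, k=5):
     """Classify one query point q by majority vote among its k nearest neighbors."""
     # Step 1: distances from q to every training point -- O(n).
     d = np.linalg.norm(X - q, axis=1)
     # Step 2: indices of the k smallest distances via a selection algorithm -- O(n).
     nearest = np.argpartition(d, k)[:k]
     # Step 3: count the class labels of those k neighbors -- O(k).
     classes, counts = np.unique(labels[nearest], return_counts=True)
     return classes[np.argmax(counts)]
 
 # Hypothetical data: 100 two-dimensional points in two classes.
 rng = np.random.default_rng(1)
 X = rng.normal(size=(100, 2))
 labels = (X[:, 0] > 0).astype(int)
 print(knn_classify(X, labels, np.array([0.5, -0.2]), k=5))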

Reply to another comment[edit]

The complexity of kNN is O(n^2), I think, but that is only a guess. The complexity of NNS (nearest neighbor search) for finding one item with a kd-tree is O(n log n).

I am studying kNN across multiple nodes, and I have built a variant of kNN that I call Hk-NN, which reduces processing time effectively. I would like to share it and discuss it in order to find a better Hk-NN method. I do not know the cost of kNN, and everyone who claims to know the cost has done nothing but guess.

If you know kNN's processing cost accurately, please send me a message by e-mail.

My e-mail: hanjjon@gmail.com

Maybe this draft version of an information retrieval book will help you. Chapter 14 discusses kNN. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html Hope this helps. --GrandiJoos 07:59, 1 October 2007 (UTC)[reply]

Kriging?[edit]

How did any of this Kriging stuff get in here (history)? —Preceding unsigned comment added by 66.92.232.105 (talk) 23:15, 16 March 2008 (UTC)[reply]

I think the Kriging stuff was given too much prominence. Is there a consensus on its relevance and whether it would be better further down the article?

Jbom1 (talk) 16:45, 17 March 2008 (UTC)[reply]

Listen - Kriging has nothing to do with K-nearest neighbor, and shouldn't be in this article. KNN is a classifier: it takes a set of labeled multidimensional points and a user choice of K, then assigns a new point to the class most common among its K nearest neighbors. Kriging is a way to take a set of multidimensional points and, for a new point with one dimension unknown, interpolate to generate that unknown. Simply, KNN does not rely on Kriging, and no relationship between them is apparent from the paragraph on Kriging in this article. —Preceding unsigned comment added by 158.130.14.30 (talk) 16:58, 17 March 2008 (UTC)[reply]

Split this topic in two?[edit]

I agree with the previous comment. KNN is a statistical clustering algorithm which uses density linkage. Maybe the two subjects should be treated separately? 196.36.217.137 (talk) 17:23, 3 December 2008 (UTC)[reply]

This article is all about instance learning, which is perhaps the most common reason to find the k-nearest-neighbors, but there are many other uses as well. For example, the k-nearest-neighbors are often used to establish neighborhood metrics for manifold learning. I propose that we make another article called "knn_instance_learner" and move all the stuff about regression and classification into it. This article should be about algorithms for finding the k-nearest neighbors (like kd-trees, Ball-trees, etc.), no matter what the reason is for doing so. --128.187.80.2 (talk) 22:31, 9 December 2008 (UTC)[reply]
No, keep it here IMO. The other article you have in mind is Nearest neighbor search --mcld (talk) 20:42, 20 October 2009 (UTC)[reply]

"Furthest" Neighbors?[edit]

I removed the reference to "furthest" neighbours from the overview. kNN regression is done on the basis of a (possibly weighted) average of the properties of nearest neighbours. See the Nigsch et al. article cited earlier in the discussion.

neighbor vs neighbour[edit]

I suggest paying attention to which spelling the authors use. Both British and American spellings are currently used in the article.

Wassermann7 (talk) 21:48, 19 July 2010 (UTC)[reply]

Schematic image is not clear[edit]

To me this looks more like a Parzen window. To make the representation clear, I think the circles have to pass exactly through the center of the kth sample. Maybe the black of the circle boundary could also be softened a bit towards gray. Some links from the test sample to the training samples could also be drawn, together with counting numbers, to make clear that the radius of the neighbourhood depends only on the distance to the closest training samples. 193.52.240.233 (talk) 02:29, 10 January 2011 (UTC)[reply]

Optimal k = 10[edit]

I removed the line about the optimal k being 10. Following the paper trail, the best that can be said is that when analyzing some kinds of forestry data, a k between 10 and 15 has been found to work well, nothing more. Helpre (talk) 17:37, 1 April 2011 (UTC)[reply]

Plural neighbor[edit]

The good-faith move to k-nearest neighbors algorithm has been reverted, because both scholarly and non-scholarly searches show that the singular name for the algorithm is somewhat more widely used in the vernacular. Remember, the issue here is the name of the algorithm, not how many neighbors are computed. If anyone is in disagreement with this revert, then please don't hesitate to voice your opinion. It is, of course, okay with me if anyone would like to open a move request discussion. – Paine Ellsworth CLIMAX! 20:03, 7 September 2013 (UTC)[reply]

For regression[edit]

under the section "For regression" it says:

"Find a heuristically optimal number k of nearest neighbors, based on RMSE. This is done using cross validation."

Is there a reference or further details for this hinted-at method? I am searching now without much luck; it seems to me it should be referenced.

37.11.59.68 (talk) 20:55, 13 December 2013 (UTC)[reply]
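For what it's worth, here is a minimal sketch of what that passage seems to describe (choosing k by cross-validated RMSE), assuming NumPy; the leave-one-out loop and the data X, y are illustrative stand-ins, not a reference implementation of the article's method:

 import numpy as np
 
 def loo_rmse(X, y, k):
     """Leave-one-out RMSE of unweighted kNN regression for a given k."""
     n = len(X)
     errors = np.empty(n)
     for i in range(n):
         d = np.linalg.norm(X - X[i], axis=1)
         d[i] = np.inf                         # exclude the held-out point itself
         nearest = np.argpartition(d, k)[:k]   # k nearest among the remaining points
         errors[i] = y[i] - y[nearest].mean()  # kNN regression estimate
     return np.sqrt(np.mean(errors ** 2))
 
 # Hypothetical data; pick the k with the smallest cross-validated RMSE.
 rng = np.random.default_rng(2)
 X = rng.uniform(size=(200, 2))
 y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)
 best_k = min(range(1, 21), key=lambda k: loo_rmse(X, y, k))
 print(best_k)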

Non-parametric? Really?[edit]

Maybe k-NN is somehow related to non-parametric statistics, but for sure it's not a non-parametric method. The first sentence is at least misleading, and at best hilarious since the algorithm has its parameter explicitly in its name... :) BartoszKP (talk) 14:40, 1 April 2014 (UTC)[reply]

Exactly what I wanted to write in this discussion. --178.12.145.158 (talk) 11:53, 1 April 2017 (UTC)[reply]
The fact that these two comments were made on the same day three years apart, and in particular on this specific day, is somewhat curious but otherwise irrelevant. --178.12.145.158 (talk) 11:54, 1 April 2017 (UTC)[reply]
I'm confused about method and model here. Nonparametric statistics lists kNN as a model. Would that be more appropriate? — Preceding unsigned comment added by Rakeroot (talkcontribs) 19:11, 19 February 2019 (UTC)[reply]

Re: "Are there not any weighted kNN algorithms?"[edit]

Yes, there are weighted kNN algorithms, and some can be VERY good.

The one used in the CRM114 KNN classifier uses a 1/r^2 weighting, so all known samples contribute a little, but the contribution of far-away examples is extremely small.

I would put an entry in, but since I'm the author of CRM114, it would be a little too close to being a primary source. So - if someone else would care to read the following paper from the US NIST repository (that being somewhat reputable):

 http://trec.nist.gov/pubs/trec15/papers/crm.spam.final.pdf

especially the section titled "Hypothesis 4", and then write it up here.

Thanks! (I hope this is the right way to handle "no primary publication- give good citations" principles) Dr. Crash (talk) 20:08, 15 June 2016 (UTC)[reply]
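As an illustration of the 1/r^2 weighting scheme described above (every training sample votes, with far-away samples contributing very little), here is a minimal sketch assuming NumPy; this is only a sketch of the idea, not CRM114's actual implementation:

 import numpy as np
 
 def inverse_square_vote(X, labels, q):
     """Classify q by summing 1/r^2 weights over ALL training samples, per class."""
     r2 = np.sum((X - q) ** 2, axis=1)
     w = 1.0 / (r2 + 1e-12)                  # small constant guards against r == 0
     classes = np.unique(labels)
     scores = [w[labels == c].sum() for c in classes]
     return classes[int(np.argmax(scores))]
 
 # Hypothetical two-class data.
 rng = np.random.default_rng(3)
 X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
 labels = np.array([0] * 50 + [1] * 50)
 print(inverse_square_vote(X, labels, np.array([0.8, 0.6])))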

The recent merge in from Nearest neighbour classifiers brings in a section on weighted kNN algorithms. Klbrain (talk) 13:13, 1 September 2016 (UTC)[reply]

Make formulas more useful to newbies?[edit]

Is there some way that we could make the formulas in "Statistical Setting" a little more inviting? Perhaps writing them again in simple English or linking to some of the concepts? Not having a background in statistics, I have no idea what "fancy R raised to the d" means, nor how to search for it. The rest is similarly meaningless to me. I understand that the format used is concise and precise for its target audience, but it's a huge roadblock for people outside that audience. — Preceding unsigned comment added by TomCerul (talkcontribs) 17:54, 28 April 2022 (UTC)[reply]

k_n := k diverges?[edit]

There's a mention in the text of the condition that "k_n := k diverges." I don't know what this means and I'm not familiar with the := notation. And I'm a pretty advanced reader! Can someone clarify this? — Preceding unsigned comment added by 2603:8080:2807:EDC:9DD9:872F:900D:7D64 (talk) 09:22, 4 June 2022 (UTC)[reply]
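For readers with the same question: ":=" just means "is defined as", and the divergence condition being referred to is, assuming the passage is stating the usual kNN consistency requirement, that the number of neighbors grows without bound but more slowly than the sample size:

 k_n \to \infty \quad \text{and} \quad \frac{k_n}{n} \to 0 \quad \text{as } n \to \infty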

bagged nearest neighbour classifier[edit]

There is an unhelpful, possibly old, link to another Wikipedia page. The outbound link is in the section called, "The weighted nearest neighbour classifier." The phrase "bagged nearest neighbour classifier" is linked to a page called, "Bootstrap aggregating". Not only is the linked page title different from the linked text, the linked page only mentions K nearest neighbors once and in an off-hand way. The linked page does not shed light on the topic of this page. In fact, it seems to contradict the statement made in this page. I suggest deleting the sentence that contains the link. Jason.Rafe.Miller (talk) 12:15, 23 June 2022 (UTC)[reply]