Abstract—We propose a formula that quantifies the reliability of kNN classification in the two-class case. Let the training set consist of $N_1$ instances of class 1 and $N_2$ instances of class 2, and let the $k$ nearest neighbors of the test instance contain $k_1$ instances of class 1 and $k_2$ instances of class 2, with $k = k_1 + k_2$, $k_1 \ll N_1$, $k_2 \ll N_2$. We derive, under some additional assumptions, an estimate for the probability that the test instance belongs to class 1.

Index Terms—classification, kNN, confidence, probability, error

I. INTRODUCTION

k-Nearest Neighbor (kNN) is a powerful method of nonparametric discrimination, or supervised learning [1]-[4]. Each object, or instance, to be classified is characterized by $d$ values $x_i$, $i = 1, \ldots, d$, and is thus represented by a point in $d$-dimensional space. The distance between two instances can be defined in different ways, the simplest being the usual Euclidean metric $\sqrt{\sum_i (x_i - x_i')^2}$. Given a training set (a set of instances with known class assignments) and a positive (odd) integer $k$, classification of the test object is performed as follows.

1. In the training set, find the $k$ training instances nearest to the test object.
2. Each of these $k$ instances belongs to one of the two classes. As $k$ is assumed odd, one class has more members than the other in the $k$-neighborhood of the test instance.
3. Classify the test instance as belonging to this class.

This simple algorithm has two noticeable drawbacks. First, it does not properly take into account the number of instances of each class in the training set: simply adding more instances of a given class to the training set would bias classification results in favor of that class. Thus the algorithm in the above simple form is only applicable when each class in the training set is represented by an equal number of instances. Second, the algorithm provides no information on the confidence of the class assignment for individual test instances.
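The three steps above can be sketched in a few lines of Python. This is a minimal illustration of plain majority-vote kNN with the Euclidean metric, not the probability estimate derived in this paper; the function and variable names are our own.

```python
import math
from collections import Counter

def knn_classify(train, test_point, k):
    """Classify test_point by majority vote among its k nearest
    training instances. `train` is a list of (point, label) pairs;
    points are equal-length tuples of floats. k should be odd so
    that a two-class vote cannot tie."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # Step 1: the k nearest training instances.
    neighbors = sorted(train, key=lambda pl: dist(pl[0], test_point))[:k]
    # Steps 2-3: count class labels and return the majority class.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: class 1 near the origin, class 2 near (5, 5).
train = [((0, 0), 1), ((1, 0), 1), ((0, 1), 1),
         ((5, 5), 2), ((5, 6), 2), ((6, 5), 2)]
print(knn_classify(train, (0.5, 0.5), k=3))  # -> 1
```

Note that this sketch exhibits exactly the two drawbacks discussed above: duplicating class-2 training points would shift the vote, and the returned label carries no confidence information.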
Consider, for example, the case of $k = 15$ and two classes. It is intuitively clear that the confidence of class assignment in the 15:0 situation is much higher than in the 8:7 situation. In many applications, such as those related to clinical diagnostics, it is very important to be able to characterize the confidence of each individual class assignment.

[Footnote: Manuscript received July 22, 2007. M. Tsypin (phone: 970-846-6953; fax: 970-870-9045; e-mail: mtsypin@biodesix.com) and H. Röder (e-mail: hroder@biodesix.com) are with Biodesix, Steamboat Springs, CO 80487 USA.]

Here we address these problems by providing an estimate of the probability that the test instance belongs to each of the classes, based on the class labels of the $k$ nearest neighbors from the training set. We restrict ourselves to the case of two classes. We provide two derivations: one within the kernel density estimation framework (a fixed vicinity of the test instance determines the number of neighbors), the other within the kNN framework (a fixed number of neighbors determines the size of the vicinity). Both lead to the same result for the probability estimate of the test instance belonging to each of the classes.

II. PROBABILITY ESTIMATE WITHIN THE KERNEL APPROACH

Consider the case of two classes, which we denote as 1 and 2. Each instance is represented by a point $\vec{x} = (x_1, \ldots, x_d)$ in a metric $d$-dimensional space. Denote the full $d$-dimensional space by $\Omega$. Class 1 is characterized by the (unknown) probability density $p_1(\vec{x})$, with $\int_\Omega p_1(\vec{x})\,d\vec{x} = 1$. Class 2 is characterized by the (unknown) probability density $p_2(\vec{x})$, with $\int_\Omega p_2(\vec{x})\,d\vec{x} = 1$. The training set consists of $N_1$ points drawn from class 1 and $N_2$ points drawn from class 2. Denote a vicinity of the test point (representing the test instance) by $\omega$. (In case the Euclidean distance is used, $\omega$ is a sphere centered at the test point, but this is irrelevant for the following.)
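The kernel-style setup can be made concrete with a short sketch: fix a vicinity $\omega$ (here, a Euclidean ball of given radius) and count how many training points of each class fall inside it. This is our own illustrative code under the assumptions of this section; the names are hypothetical.

```python
import math

def counts_in_vicinity(train, test_point, radius):
    """Count training instances of each class inside the ball
    omega of the given radius around test_point. In the kernel
    framework the vicinity is fixed and the counts (k1, k2) are
    the random quantities."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    k1 = sum(1 for p, label in train
             if label == 1 and dist(p, test_point) <= radius)
    k2 = sum(1 for p, label in train
             if label == 2 and dist(p, test_point) <= radius)
    return k1, k2

# Toy training set: class 1 near the origin, class 2 near (5, 5).
train = [((0, 0), 1), ((1, 0), 1), ((0, 1), 1),
         ((5, 5), 2), ((5, 6), 2), ((6, 5), 2)]
print(counts_in_vicinity(train, (0.0, 0.0), radius=1.5))  # -> (3, 0)
```

This contrasts with the kNN framework of the later derivation, where $k$ is fixed and the radius of $\omega$ is the random quantity.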
For a given realization of the training set, we observe $k_1$ points in $\omega$ from class 1 and $k_2$ points in $\omega$ from class 2. We make the following assumptions and approximations.

1. Existence of probability densities. As already stated above, the training set is considered to be a sample drawn from the underlying probability distributions with densities $p_1(\vec{x})$ and $p_2(\vec{x})$.
2. Fixed $\omega$. The vicinity $\omega$ of the test point is considered fixed: it can depend on the position of the test point, on the probability distributions from which the training set is drawn, and on $N_1$ and $N_2$, but is assumed to stay the same for each realization of the training set.
3. Uniformity within $\omega$. The vicinity $\omega$ is sufficiently small that the probability densities of the classes, $p_1(\vec{x})$ and $p_2(\vec{x})$, are approximately constant within $\omega$.
4. Poisson approximation. For each class, we make the approximation that the number of training-set instances of that class in $\omega$ is drawn from a Poisson distribution. This

On the Reliability of kNN Classification
Maxim Tsypin and Heinrich Röder
Proceedings of the World Congress on Engineering and Computer Science 2007 (WCECS 2007), October 24-26, 2007, San Francisco, USA. ISBN: 978-988-98671-6-4
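The Poisson approximation of assumption 4 can be checked numerically: if $N$ points are drawn uniformly on the unit square and $\omega$ is a small disk of area $a$, the count in $\omega$ is binomial$(N, a)$, which for small $a$ is close to Poisson with mean $Na$. The following simulation (our own, with hypothetical names) illustrates the Poisson signature that the mean and variance of the count nearly coincide.

```python
import math
import random

def simulate_counts(n_points, radius, trials, seed=0):
    """Per trial, draw n_points uniform on the unit square and count
    how many fall in the disk of the given radius around (0.5, 0.5).
    For a small disk the counts are approximately Poisson with
    mean n_points * pi * radius**2."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        c = 0
        for _ in range(n_points):
            x, y = rng.random(), rng.random()
            if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= radius ** 2:
                c += 1
        counts.append(c)
    return counts

counts = simulate_counts(n_points=1000, radius=0.05, trials=500)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# For a Poisson distribution the mean equals the variance;
# both should be near 1000 * pi * 0.05**2 ≈ 7.85.
print(round(mean, 2), round(var, 2))
```

The approximation degrades when $\omega$ contains a non-negligible fraction of the training set, which is why the abstract requires $k_1 \ll N_1$ and $k_2 \ll N_2$.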