ORIGINAL ARTICLE

Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification

Nenad Tomašev · Miloš Radovanović · Dunja Mladenić · Mirjana Ivanović

Received: 9 March 2012 / Accepted: 19 November 2012
© Springer-Verlag Berlin Heidelberg 2012

Abstract  Most data of interest in today's data-mining applications is complex and is usually represented by many different features. Such high-dimensional data is by its very nature often quite difficult for conventional machine-learning algorithms to handle, which is considered to be an aspect of the well-known curse of dimensionality. Consequently, high-dimensional data needs to be processed with care, and the design of machine-learning algorithms needs to take these factors into account. Furthermore, it has been observed that some of the arising high-dimensional properties can in fact be exploited to improve overall algorithm design. One such phenomenon, related to nearest-neighbor learning methods, is known as hubness and refers to the emergence of very influential nodes (hubs) in k-nearest neighbor graphs. A crisp weighted voting scheme for the k-nearest neighbor classifier that exploits this notion has recently been proposed. In this paper we go a step further by embracing the soft approach, and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express the fuzziness of elements appearing in the k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvement over the crisp weighted method as well as the standard kNN classifier.

Keywords  Classification · k-nearest neighbor · Fuzzy · Hubs · Curse of dimensionality

This is an extended version of the paper "Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification", which was presented at the MLDM 2011 conference [27].

N. Tomašev (✉) · D. Mladenić
Institute Jožef Stefan, Artificial Intelligence Laboratory, Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
e-mail: nenad.tomasev@ijs.si

D. Mladenić
e-mail: dunja.mladenic@ijs.si

M. Radovanović · M. Ivanović
Department of Mathematics and Informatics, University of Novi Sad, Trg D. Obradovića 4, 21000 Novi Sad, Serbia
e-mail: radacha@dmi.uns.ac.rs

M. Ivanović
e-mail: mira@dmi.uns.ac.rs

Int. J. Mach. Learn. & Cyber., DOI 10.1007/s13042-012-0137-1

1 Introduction

High-dimensional data is ubiquitous in modern applications. It arises naturally when dealing with text, images, audio, data streams, medical records, etc. The impact of this high dimensionality is manifold. It is a well-known fact that many machine-learning algorithms are plagued by what is usually termed the curse of dimensionality, a set of properties that tend to become more pronounced as the dimensionality of the data increases. First and foremost is the unavoidable sparsity of data: in high-dimensional spaces all data is sparse, meaning that there is not enough data to make reliable density estimates. Another detrimental influence comes from the concentration of distances, as all data points tend to become relatively more similar to each other as dimensionality increases. Such a decrease of contrast makes distinguishing between relevant and irrelevant points in queries much more difficult. This phenomenon has been thoroughly explored in the past [1, 10]. Usually, it only holds for data drawn from the same underlying probability distribution.
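Both of these effects are easy to observe empirically. The following minimal sketch, which is not part of the original paper and whose sample size, dimensionalities, and neighborhood size k are arbitrary illustrative choices, generates i.i.d. Gaussian points, measures the average relative contrast between each point's farthest and nearest neighbor, and counts k-occurrences N_k(x), i.e. how often each point appears among the k nearest neighbors of other points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5

for d in (3, 30, 300):
    X = rng.standard_normal((n, d))        # i.i.d. Gaussian points in d dimensions

    # Pairwise Euclidean distances via the squared-norm expansion.
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.nan)            # exclude self-distances

    # Concentration of distances: the relative contrast between the farthest
    # and the nearest neighbor of each point shrinks as dimensionality grows.
    d_min = np.nanmin(D, axis=1)
    d_max = np.nanmax(D, axis=1)
    contrast = (d_max - d_min) / d_min

    # Hubness: N_k(x) is the number of points that have x among their k nearest
    # neighbors; its distribution becomes increasingly skewed with growing d.
    knn = np.argsort(D, axis=1)[:, :k]     # NaN (self) sorts last, so it is skipped
    N_k = np.bincount(knn.ravel(), minlength=n)

    print(f"d={d:4d}  mean relative contrast={contrast.mean():6.2f}  "
          f"max N_k={N_k.max():3d}  (mean N_k = k = {k})")
```

As dimensionality grows, the printed contrast shrinks (distance concentration) while the largest N_k values grow far beyond their mean of k; this increasingly skewed k-occurrence distribution is the hubness phenomenon that the fuzzy measures proposed in this paper build upon.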