Classification of Gene Expression Data: A Hubness-aware Semi-supervised Approach

Krisztian Buza
Brain Imaging Center, Research Center for Natural Sciences
Hungarian Academy of Sciences, Budapest, Hungary
buza@biointelligence.hu
http://www.biointelligence.hu

Abstract

Background and Objective: Classification of gene expression data is the common denominator of various biomedical recognition tasks. However, obtaining class labels for large training samples may be difficult or even impossible in many cases. Therefore, semi-supervised classification techniques are required as semi-supervised classifiers take advantage of unlabeled data.

Methods: Gene expression data is high-dimensional which gives rise to the phenomena known under the umbrella of the curse of dimensionality, one of its recently explored aspects being the presence of hubs or hubness for short. Therefore, hubness-aware classifiers have been developed recently, such as Naive Hubness-Bayesian k-Nearest Neighbor (NHBNN). In this paper, we propose a semi-supervised extension of NHBNN which follows the self-training schema. As one of the core components of self-training is the certainty score, we propose a new hubness-aware certainty score.

Results: We performed experiments on publicly available gene expression data. These experiments show that the proposed classifier outperforms its competitors. We investigated the impact of each of the compo-