International Journal of Chemical Studies 2020; SP-8(4): 338-343
P-ISSN: 2349–8528
E-ISSN: 2321–4902
www.chemijournal.com
© 2020 IJCS
Received: 15-05-2020
Accepted: 19-06-2020

A survey of distance measures for mixed variables

Sudha Bishnoi and BK Hooda
Department of Mathematics and Statistics, CCS Haryana Agricultural University, Hisar, Haryana, India

Corresponding Author: Sudha Bishnoi

DOI: https://doi.org/10.22271/chemi.2020.v8.i4f.10087

Abstract
Distance measures form the basis of many statistical and data science methods and are applied in various fields of science. Mixed-variable data, a combination of continuous and categorical variables, occur frequently in fields such as medicine, agriculture, remote sensing, biology, marketing and ecology, but little work has been done on evaluating distance for such data. Because little literature is available on distance measures for mixed data, this paper studies and reviews the fundamental sources that give a comprehensive account of particular measures for mixed-variable data.

Keywords: distance measure, similarity measure, mixed data, heterogeneous data, k nearest neighbor, classification, discrimination

Introduction
Distance is a quantitative measure of how far apart two objects are; a synonym for distance is dissimilarity. The calculation of distance between individuals, or between two or more groups (also called populations), arises in many areas such as biology, psychology, ecology, medical diagnosis and agriculture. Some statistical techniques, such as discriminant analysis, classification and clustering, also use distance measures as their basis.
Further, distance measures are of vital importance in machine learning: they are the basis of many popular algorithms, such as k-nearest neighbor, a supervised learning technique, and k-means clustering, an unsupervised learning technique. When all the variables are continuous, the most commonly used distance measure is the Euclidean distance; when all the variables are categorical, the simple matching coefficient is the most common. Most research that requires the calculation of distance is confined to continuous variables, but real-world data are mostly a combination of continuous and categorical variables, also called mixed-variable data or heterogeneous data.

A vast literature on distance measures is available for data that are purely continuous (Cha, 2007) [3] or purely categorical (Boriah et al., 2008) [2]. When the data are a mix of continuous and categorical variables, however, most researchers either ignore the categorical nature of some variables and apply distance measures for continuous data, or transform the continuous variables into categorical ones and apply a distance measure for categorical data. Converting variables to a common scale involves a loss of information. If one wishes to retain the variables in their original form, a reasonable solution is to develop formulae specifically for mixed data types. Gordan (1981) [7] suggested analyzing each variable type separately and then combining the results. The distance measures available for mixed data are explained in detail in section 2, and section 3 concludes the paper.

Distance measures for mixed variables
We begin with a basic introduction to distance measures. A distance indicates how different two vectors are: it is a function that takes two input vectors and returns a non-negative real number, called the distance between the two vectors.
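As a minimal sketch of such a function, the two baseline measures named in the Introduction can be written as follows: the Euclidean distance for continuous vectors, and a dissimilarity derived from the simple matching coefficient (one minus the proportion of matching positions) for categorical vectors. The function names and the example vectors are illustrative assumptions and do not come from the paper.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def simple_matching_distance(x, y):
    """Dissimilarity based on the simple matching coefficient.

    Returns the proportion of positions at which the two categorical
    vectors disagree, i.e. 1 minus the simple matching coefficient.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if len(x) == 0:
        raise ValueError("vectors must not be empty")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


if __name__ == "__main__":
    # Continuous example (hypothetical values).
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
    # Categorical example (hypothetical values): one of four positions differs.
    print(simple_matching_distance(["red", "small", "yes", "A"],
                                   ["red", "small", "no", "A"]))
```

Neither function handles a record that contains both variable types at once, which is the gap that measures for mixed data are meant to fill.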
The value of this distance function should be small between similar data points and large between dissimilar data points. The mathematical definition of a distance measure includes three requirements, which are defined as: