International Journal of Chemical Studies 2020; SP-8(4): 338-343
P-ISSN: 2349–8528
E-ISSN: 2321–4902
www.chemijournal.com
© 2020 IJCS
Received: 15-05-2020
Accepted: 19-06-2020
Sudha Bishnoi
Department of Mathematics and
Statistics, CCS Haryana
Agricultural University, Hisar,
Haryana, India
BK Hooda
Department of Mathematics and
Statistics, CCS Haryana
Agricultural University, Hisar,
Haryana, India
Corresponding Author:
Sudha Bishnoi
Department of Mathematics and
Statistics, CCS Haryana
Agricultural University, Hisar,
Haryana, India
A survey of distance measures for mixed variables
Sudha Bishnoi and BK Hooda
DOI: https://doi.org/10.22271/chemi.2020.v8.i4f.10087
Abstract
Distance measures form the basis of many statistical and data science methods and are applied in
various fields of science. Mixed-variable data, a combination of continuous and categorical
variables, occur frequently in fields such as medicine, agriculture, remote sensing, biology, marketing
and ecology, but little work has been done on evaluating distance for such data. Because the
literature on distance measures for mixed data is limited, this paper studies and reviews the
fundamental sources that describe a particular measure for mixed-variable data in detail.
Keywords: distance measure, similarity measure, mixed data, heterogeneous data, k nearest neighbor,
classification, discrimination
Introduction
Distance is a quantitative measure of how far apart two objects are; a synonym for
distance is dissimilarity. The calculation of distance between individuals, or between two or
more groups (also called populations), arises in many areas such as biology, psychology,
ecology, medical diagnosis and agriculture. Several statistical techniques, such as discriminant
analysis, classification and clustering, are built on distance measures. Distance measures are
also of vital importance in machine learning, where they underlie many popular algorithms
such as k-nearest neighbor, a supervised learning technique, and k-means clustering, an
unsupervised learning technique.
When all the variables are continuous, the most commonly used distance measure is the
Euclidean distance; when all the variables are categorical, the simple matching coefficient is
the most common. Most research that requires the calculation of distance is confined to
continuous variables, but real-world data are mostly a combination of continuous and
categorical variables, also called mixed-variable data or heterogeneous data. A vast literature
on distance measures is available for purely continuous data (Cha, 2007) [3] and for purely
categorical data (Boriah et al., 2008) [2]. When the data are a mix of both types, however,
most researchers either ignore the categorical nature of some variables and proceed with
distance measures for continuous data, or transform the continuous variables into categorical
ones and proceed with a distance measure for categorical data. Converting variables to a
common scale in this way involves loss of information.
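
The two single-type measures named in this paragraph can be illustrated with a minimal Python sketch. The function names `euclidean_distance` and `simple_matching_distance` are chosen here for illustration and do not come from the paper, and the categorical measure is written as a dissimilarity (one minus the simple matching coefficient) so that both functions return a distance.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def simple_matching_distance(x, y):
    """Proportion of categorical variables on which x and y disagree.

    This equals 1 minus the simple matching coefficient, which is the
    proportion of variables on which the two records agree.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if not x:
        raise ValueError("vectors must not be empty")
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)


# Hypothetical records used only to exercise the functions.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(simple_matching_distance(["red", "small", "yes"],
                               ["red", "large", "no"]))
```

In this sketch the first call compares two continuous records and the second compares two categorical records that agree on one variable out of three.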
If one wishes to retain the variables in their original form, a reasonable solution is to
develop formulae specifically for mixed data types. Gordan (1981) [7] suggested analyzing
each variable type separately and then combining the results. The distance measures
available for mixed data are explained in detail in section 2, and section 3 concludes the
paper.
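
As a rough illustration of the separate-then-combine idea attributed to Gordan (1981) above, the following Python sketch computes one distance on the continuous part of two records and another on the categorical part, then merges them. The weighted sum, the `weight` parameter and the function name `mixed_distance` are assumptions made here for illustration; they are not a formula taken from the cited source. The sketch also assumes the continuous variables have already been scaled to a comparable range such as [0, 1].

```python
import math


def mixed_distance(x_cont, y_cont, x_cat, y_cat, weight=0.5):
    """Combine a continuous and a categorical distance (illustrative only).

    x_cont, y_cont: continuous parts of the two records (assumed pre-scaled).
    x_cat, y_cat:   categorical parts of the two records.
    weight:         hypothetical share given to the continuous part, in [0, 1].
    """
    if len(x_cont) != len(y_cont) or len(x_cat) != len(y_cat):
        raise ValueError("the two records must have matching lengths")
    if not x_cat:
        raise ValueError("at least one categorical variable is required")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    # Continuous part: Euclidean distance.
    d_cont = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_cont, y_cont)))
    # Categorical part: proportion of mismatching variables.
    d_cat = sum(a != b for a, b in zip(x_cat, y_cat)) / len(x_cat)
    # Assumed combination rule: a simple weighted sum of the two parts.
    return weight * d_cont + (1.0 - weight) * d_cat


# Hypothetical records: two continuous and two categorical variables each.
d = mixed_distance([0.2, 0.5], [0.5, 0.9], ["a", "b"], ["a", "c"])
print(d)
```

Other combination rules are possible; the weighted sum is used here only because it keeps each variable type in its original form while still producing a single number.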
Distance measures for mixed variables
We begin with a basic introduction to distance measures. A distance indicates how different
two vectors are: it is a function that takes two input vectors and returns a non-negative real
number, called the distance between the two vectors. The value of this function should be
small for similar data points and large for dissimilar data points. The mathematical
definition of a distance measure includes three requirements, which are defined as: