International Journal of Computer Applications (0975 – 8887) Volume 64 – No. 17, February 2013

A New Dissimilarity Measure between Feature-Vectors

Liviu Octavian Mafteiu-Scai
West University of Timisoara, Timisoara, Romania

ABSTRACT
Distance measures are very important in many clustering and machine learning techniques. At present there are many such measures for determining the dissimilarity between feature-vectors, and the choice among them depends strongly on the problem to be solved. This paper proposes a simple but robust distance measure, called Reference Distance Weighted, for calculating the distance between feature-vectors with real values. The basic attribute that distinguishes it from other measures is that the distance is measured from one of the feature-vectors, considered as a reference system, to the other feature-vectors. In fact, this reference vector belongs to a class of a classification system. A second distinctive attribute is that its value does not depend on the orders of magnitude of the different characteristics of the vectors. In addition, through a parameter called the relevance factor, each feature receives a weight reflecting its influence, because different features have different influence on the dissimilarity estimation depending on the final problem to be solved. An extension of the proposed distance allows working with hybrid vectors, i.e. real and logical values. Future research directions are also provided.

General Terms
Algorithms.

Keywords
classification, distance, dissimilarity, features

1. INTRODUCTION
Over time, a number of distances have been proposed for the processes of classification and recommendation to determine the dissimilarity between two feature-vectors, some of the most popular being: Hamming distance (DH) [1], Minkowski distance (DM) [2], Euclidean distance (DE), Manhattan distance (DMH) and Chebyshev distance (DC).
The Minkowski distance is a metric that generalizes the Euclidean, Manhattan and Chebyshev distances. For two feature-vectors u = (u_1, u_2, ..., u_n) and v = (v_1, v_2, ..., v_n), where n is the number of features:

D_M(u, v) = ( \sum_{i=1}^{n} |u_i - v_i|^p )^{1/p}   (1)

If p = 1, the Manhattan distance is obtained:

D_MH(u, v) = \sum_{i=1}^{n} |u_i - v_i|   (2)

If p = 2, the Euclidean distance is obtained:

D_E(u, v) = ( \sum_{i=1}^{n} (u_i - v_i)^2 )^{1/2}   (3)

If p = ±∞, by passing to the limit, the following are obtained:

D_C(u, v) = max_{i=1..n} |u_i - v_i|   (4)

and

D_C'(u, v) = min_{i=1..n} |u_i - v_i|   (4')

Despite the popularity of the indicators mentioned above, they do not always offer the best solution for all types of data and problems, as mentioned in [5] and [6]. There are many other measures dedicated to particular problems [7, 8, 9, 10, 11, etc.]. It is clear that all of them have advantages and disadvantages, as there is so far no general measure that is good/optimal for all types of problems.

2. THEORETICAL CONSIDERATIONS
The following proposes a new measure for evaluating the dissimilarity of two feature-vectors, called Reference Distance Weighted, denoted RDW. The term "reference" shows that the distance is measured from a reference system, i.e. from the feature-vector specific to a class of problems/objects to the feature-vector of the problem/object to be classified. The term "weighted" has two meanings: the first shows that each feature has a specific weight / relevance / importance in the final problem to be solved; the second refers to how big the difference between two features is relative to the reference feature value. The RDW indicator was designed for use in the classification of systems of equations, a process that depends on some characteristics of the associated matrices, such as: size, sparsity, number of non-zero values on the main diagonal, non-zero element distribution, symmetry, positivity, etc. Some of these features of matrices have been successfully used in other classification processes; relevant examples are given by Shuting Xu in [13], [14] and by T. George in [16]. RDW can be seen as a function: RDW : R^n × R^n → R_+.
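The classical distances of relations (1)–(4) can be sketched in a few lines of Python; the function names below are illustrative, not from the paper:

```python
# Minkowski distance and its special cases (Manhattan, Euclidean, Chebyshev),
# following relations (1)-(4), for two real-valued feature-vectors of equal length.

def minkowski(u, v, p):
    """General Minkowski distance of order p (relation (1))."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def manhattan(u, v):
    """Case p = 1 (relation (2)): sum of per-feature differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Case p = 2 (relation (3)): square root of summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def chebyshev(u, v):
    """Limit p -> +infinity (relation (4)): largest per-feature difference."""
    return max(abs(a - b) for a, b in zip(u, v))

u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]
print(manhattan(u, v))   # 7.0
print(euclidean(u, v))   # 5.0
print(chebyshev(u, v))   # 4.0
```

Note that for p = 2 the general form reduces exactly to the Euclidean case, which is why Minkowski is described as a generalization of the three named distances.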
The relation for computing the RDW value is:

RDW(u, v) = \sum_{i=1}^{n} \alpha_i \frac{|u_i - v_i|}{|u_i|}   (5)

with:
- n, the number of features considered;
- u = {u_1, u_2, ..., u_n}, the reference feature-vector, associated to a class, the vector from which the distance is measured;
- v = {v_1, v_2, ..., v_n}, the feature-vector associated to the problem that must be solved or the object that must be classified, the vector up to which the distance is measured;
- α = {α_1, α_2, ..., α_n}, a vector called the relevance vector, whose components α_i, called relevance factors, are parameters specific to each individual feature, proportional to the importance/weight of the respective feature under the conditions of the problem to be solved.

In relation (5) the case u_i = 0 is excluded to avoid dividing by zero.

Remark 1: The generalization of relation (5), i.e. including the situation u_i = 0, can be done by introducing a correction factor ε:

ε_i = 10^{-(r_i + 1)}

where r_i represents the magnitude (number of integer digits) of the v_i value. For example, if v_i ∈ [1, 9], we have r_i = 1, leading to ε_i = 0.01, a value that will