On the computation of the correlation integral for fractal dimension estimation

Zakiah Kalantan
King Abdulaziz University, Department of Statistics, Jeddah, Saudi Arabia
Durham University, Department of Mathematical Sciences, Durham, UK
Email: zkalanten@kau.edu.sa

Jochen Einbeck
Durham University, Department of Mathematical Sciences, DH1 3LE Durham, UK
Email: jochen.einbeck@durham.ac.uk

Abstract: Dimension reduction is a powerful technique which transforms data from a high-dimensional to a low-dimensional space. Usually, it requires fixing the intrinsic dimension (ID) of the low-dimensional subspace in advance. Fractal dimension estimation is a global ID estimation approach, which studies the geometry of the data set. The correlation dimension is a common way to obtain the fractal dimension, but its practical implementation is far from straightforward, since the correlation integral needs to be estimated for a ball of radius tending to 0. The aim of this paper is to develop approaches to approximate the correlation integral in this limit. Experimental results on real-world and simulated data are used to demonstrate the algorithms and to compare them with other methods. A simulation study which verifies the effectiveness of the proposed methods is also provided.

I. INTRODUCTION

Nowadays, pattern recognition and data mining algorithms have to deal with very high-dimensional data. High dimensionality not only increases computing time and storage requirements, but also poses considerable challenges to the statistical methods and algorithms themselves. To overcome such problems, usually referred to as the “curse of dimensionality”, one may investigate whether the high-dimensional data can be represented in some lower-dimensional space, ideally without losing much of the original information.
Though many approaches for dimension reduction exist, for instance projection techniques based on (linear) principal components or (non-linear) principal manifolds [15], most such methods require fixing the intrinsic dimension of the low-dimensional subspace in advance.

The intrinsic dimension (ID) can be defined as the minimum number of variables necessary to describe the data without much loss of information. ID estimation methods can be divided into two groups: local methods, which estimate the ID using information from the neighborhood of each pattern, and global methods, which estimate the ID using the whole data set.

Fractal methods are an example of global ID estimation approaches. The fractal dimension is a measure that describes the geometry of an irregular object (here: a data set) by a real number that need not be an integer, whereas the topological dimension is always an integer. It describes how the fractal object fills space, a property that can be used to construct ID estimators. The most important properties of fractals are self-similarity and symmetry. Fractal techniques [16] are widely used in many fields, such as snow accumulation in forests [1], tree crowns [5], computer vision applications [2], [3], chaos theory [7], medical imaging, and time series analysis. In some applications, fractals are constructed with computer graphics to produce realistic natural objects such as moons or planets.

Various fractal-based methods have been proposed, such as the quantization estimator [18], the kernel correlation method [12], the horizontal structuring element method, box-counting, and the correlation dimension [4], [5], [6]. Camastra presented a good survey of intrinsic dimension estimation methods, focusing on fractal-based methods [10], [11]. The correlation dimension is a popular method for computing the fractal dimension.
The method requires the construction of a so-called correlation integral, from which the ID is extracted using appropriate techniques. This step is not straightforward, since it requires counting the number of data pairs within a ball of radius tending to 0.

The objective of this paper is to estimate the intrinsic dimension of a data set by providing new approaches for the computation of the correlation dimension from the correlation integral. The paper is organized as follows. In Section II, the concept of correlation dimension is briefly reviewed. Section III presents the proposed methods: the Intercept method, the Slope method, and the Polynomial method. In Section IV, we provide case studies on two real data sets and a simulation study, which are used to assess the effectiveness of the methods. Finally, conclusions are drawn in Section V.

II. CORRELATION DIMENSION

The correlation dimension is commonly used to estimate the fractal dimension. The idea is to estimate the dimension from the pairwise distances between data points. Grassberger and Procaccia introduced the correlation integral algorithm [9], [11], known as the GP method, which is used to define the correlation dimension estimate for a given data set.

Let $\Omega = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^q$ denote a set of data points, and let $r$ be any positive number. The correlation integral is defined as

\[ C(r) = \lim_{n \to \infty} \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} I\left(\|x_j - x_i\| \le r\right) \tag{1} \]

where $I(\cdot)$ is the indicator function and $\|x_j - x_i\|$ denotes the Euclidean distance between the data points $x_j$ and $x_i$. In practice,