Using Correlation Dimension for Analysing Text Data

Ilkka Kivimäki⋆, Krista Lagus, Ilari T. Nieminen, Jaakko J. Väyrynen, and Timo Honkela

Adaptive Informatics Research Centre, Aalto University School of Science and Technology
firstname.lastname@tkk.fi
http://www.cis.hut.fi/research/

Abstract. In this article, we study the scale-dependent dimensionality properties and overall structure of text data with a method that measures correlation dimension at different scales. As experimental results, we present analyses of text data sets from the Reuters and Europarl corpora, which are also compared to artificially generated point sets. A comparison is also made with speech data. The results reflect some typical properties of the data, and we discuss how our method can be used to improve various data analysis applications.

Key words: correlation dimension, dimensionality calculation, dimensionality reduction, statistical natural language processing

1 Introduction

Knowing the intrinsic dimensionality of a data set can be beneficial, for instance, when choosing the parameters of a dimension reduction method. One popular technique for determining the intrinsic dimensionality of a finite data set is calculating its correlation dimension, which is a fractal dimension. This is usually done with the method introduced by Grassberger and Procaccia in [1]. Usually, the goal of these dimensionality calculations is to characterise a data set by a single statistic, and little emphasis is placed on the dependence of correlation dimension on the scale of observation. However, as we will show, the scale-dependent dimensionality properties can vary between data sets according to the nature of the data. Most neural network and statistical methods, such as singular value decomposition or the self-organising map, are usually applied without considering this fact. Even in papers studying dimensionality calculation methods (e.g. [2] and [3]), the scale-dependence of dimensionality is noted but usually left without further discussion.

We focus on natural language data. It has been observed that the intrinsic dimensionality of text data, such as term-document matrices, is often much

⋆ Has received funding from the Academy of Finland and a grant from the Department of Mathematics and Statistics at the University of Helsinki.
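As a concrete illustration of the Grassberger–Procaccia approach discussed above, the following Python sketch estimates the correlation dimension of a point set as the slope of the log–log correlation sum between two observation scales. The function names and the simple two-scale slope estimate are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def correlation_sum(points, r):
    """Correlation sum C(r): the fraction of point pairs closer than r."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(points), k=1)  # each pair counted once
    return np.mean(dists[iu] < r)

def correlation_dimension(points, r1, r2):
    """Estimate the correlation dimension as the slope of log C(r)
    versus log r between the two scales r1 < r2."""
    c1, c2 = correlation_sum(points, r1), correlation_sum(points, r2)
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))

# Sanity check: a uniform sample from a 2-D square embedded in 3-D space
# should give a correlation dimension close to 2 at these scales.
rng = np.random.default_rng(0)
pts = np.zeros((1000, 3))
pts[:, :2] = rng.random((1000, 2))
d = correlation_dimension(pts, 0.05, 0.2)
print(f"estimated dimension: {d:.2f}")
```

Because the slope depends on the scales r1 and r2 chosen, a scale-dependent analysis, as pursued in this paper, examines how this local slope varies across radii rather than reporting a single number.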