Computational Statistics and Data Analysis 54 (2010) 290–297

Contents lists available at ScienceDirect
Computational Statistics and Data Analysis
journal homepage: www.elsevier.com/locate/csda

Kernel-based machine learning for fast text mining in R

Alexandros Karatzoglou a,∗, Ingo Feinerer b

a LITIS, INSA de Rouen, Avenue de l'Université, 76801 Saint-Etienne du Rouvray, France
b Database and Artificial Intelligence Group, Institute of Information Systems, Vienna University of Technology, Austria

Article history: Received 5 November 2008; received in revised form 15 September 2009; accepted 19 September 2009; available online 25 September 2009.

Abstract

Recent advances in the field of kernel-based machine learning allow fast processing of text using string kernels built on suffix arrays. kernlab provides both kernel method infrastructure and a large collection of already implemented algorithms, including an implementation of suffix-array-based string kernels. Together with the text mining infrastructure provided by tm, these packages equip R with functionality for processing, visualizing and grouping large collections of text data using kernel methods. The emphasis is on the performance of various types of string kernels at these tasks.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Kernel-based methods are state-of-the-art machine learning methods which have gained prominence mainly through the successful application of Support Vector Machines (SVMs) in a wide range of domains. SVMs have also been applied to the classification of text documents, typically using the inner product between two vector representations of the documents.

1.1. Kernel methods

Kernel-based machine learning methods use an implicit mapping of the input data into a high-dimensional feature space defined by a kernel function, i.e., a function returning the inner product ⟨Φ(x), Φ(y)⟩ between the images of two data points x and y in the feature space.
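The identity between a kernel evaluation and an inner product of feature-space images can be verified numerically. The following is a minimal sketch in Python (used here purely to make the algebra concrete; the paper's own software is R) with a standard textbook choice of feature map, not one taken from the paper: the homogeneous degree-2 polynomial kernel k(x, y) = (x · y)² and its explicit map Φ(x) = (x₁², x₂², √2 x₁x₂).

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-d input:
    Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2.
    Works directly on the inputs; Phi is never constructed."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# Both routes give the same inner product in feature space:
# (1*3 + 2*4)^2 = 121, and <Phi(x), Phi(y)> = 9 + 64 + 48 = 121.
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

The point of the sketch is that `poly_kernel` touches only the 2-dimensional inputs, while the equivalent explicit computation passes through a 3-dimensional feature space; for higher degrees or string kernels the implicit route avoids an exponentially or combinatorially larger space.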
The learning then takes place in the feature space, and the data points only appear inside dot products with other points. This is often referred to as the ''kernel trick'' (Schölkopf and Smola, 2002). More precisely, if a projection Φ: X → H is used, the dot product ⟨Φ(x), Φ(y)⟩ can be represented by a kernel function k,

k(x, y) = ⟨Φ(x), Φ(y)⟩, (1)

which is computationally simpler than explicitly projecting x and y into the feature space H.

One interesting property of kernel-based learning systems is that, once a valid kernel function has been selected, one can work in spaces of practically any dimension without significant additional computational cost, since the feature mapping is never explicitly performed. In fact, one does not even need to know which features are being used.

Another advantage of kernel-based machine learning methods is that one can design a kernel for a particular problem that is applied directly to the data, without the need for a feature extraction process. This is particularly important in problems where much of the structure of the data is lost by the feature extraction process (e.g., text processing).

1.2. Kernel methods for text mining

New methods developed within the past years range from algorithms for, e.g., clustering (Shi and Malik, 2000; Ng et al., 2001) over dimensionality reduction (Song et al., 2008) to two-sample tests (Song et al., 2008). The availability of kernels that

∗ Corresponding author.
E-mail addresses: alexandros.karatzoglou@insa-rouen.fr (A. Karatzoglou), feinerer@dbai.tuwien.ac.at (I. Feinerer).

0167-9473/$ – see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.csda.2009.09.023
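As a concrete instance of a kernel applied directly to raw data with no feature extraction step, a p-spectrum string kernel counts matching length-p substrings of two documents. The sketch below is a naive Python illustration of that idea, not the paper's implementation: kernlab computes string kernels of this family via suffix arrays in roughly linear time, whereas this version is quadratic and serves only to show what quantity is being computed.

```python
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """Naive p-spectrum string kernel: the inner product of the
    p-mer count vectors of s and t, i.e. the number of pairs of
    matching length-p substrings.  Illustrative O(|s| * |t|) code;
    suffix-array-based implementations achieve this in linear time."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    # Sum over p-mers present in both strings: count_s(m) * count_t(m).
    return sum(cs[m] * ct[m] for m in cs if m in ct)

# "kernel" and "kernlab" share the 3-mers "ker" and "ern", so k = 2.
print(spectrum_kernel("kernel", "kernlab", p=3))
```

Note that the strings enter the kernel as-is; no vector representation of the documents is ever built by hand, which is exactly the property the paragraph above describes.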