Computer Methods and Programs in Biomedicine 104 (2011) e112–e121
Journal homepage: www.intl.elsevierhealth.com/journals/cmpb

KmL: A package to cluster longitudinal data

Christophe Genolini a,b,c,*, Bruno Falissard a,b,d

a Inserm, U669, Paris, France
b Univ Paris-Sud and Univ Paris Descartes, UMR-S0669, Paris, France
c Modal’X, Univ Paris-Nanterre, UMR-S0669, Paris, France
d AP-HP, Hôpital de Bicêtre, Service de Psychiatrie, Le Kremlin-Bicêtre, France

Article info
Article history: Received 21 May 2010; Received in revised form 4 May 2011; Accepted 25 May 2011
Keywords: Package presentation; Longitudinal data; k-Means; Cluster analysis; Non-parametric algorithm

Abstract
Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements are not restricted to single variables but can be seen as trajectories. Thus, an important question concerns the existence of homogeneous patient trajectories. KmL is an R package providing an implementation of k-means designed to work specifically on longitudinal data. It provides several different techniques for dealing with missing values in trajectories (classical ones like linear interpolation or LOCF but also new ones like copyMean). It can run k-means with distances specifically designed for longitudinal data (like Fréchet distance or any user-defined distance). Its graphical interface helps the user to choose the appropriate number of clusters when classic criteria are not efficient. It also provides an easy way to export graphical representations of the mean trajectories resulting from the clustering. Finally, it runs the algorithm several times, using various kinds of starting conditions and/or numbers of clusters to be sought, thus sparing the user a lot of manual re-sampling.
© 2011 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Cohort studies are becoming essential tools in epidemiological research.
In these studies, measurements collected for a single subject can be seen as trajectories. Thus, an important question concerns the existence of homogeneous patient trajectories. From a statistical point of view, many methods have been developed to deal with this issue [1–4]. In a survey [5], Warren-Liao divides these methods into five families: partitioning methods construct k clusters, each containing at least one individual; hierarchical methods work by grouping data objects into a tree of clusters; density-based methods make clusters grow as long as the density in the “neighborhood” exceeds a certain threshold; grid-based methods quantize the object space and perform the clustering operation on the resulting finite grid structure; model-based methods assume a model for each cluster and look for the best fit of data to the model.

* Corresponding author at: Inserm, U669, 97 Bd Port Royal, 75014 Paris, France. Tel.: +33 6 21 48 47 84. E-mail address: genolini@u-paris10.fr (C. Genolini).
0169-2607/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2011.05.008

The pros and cons of these approaches are regularly discussed [6,7], even if there is little data to show which method is indeed preferable in which situation. In this paper, we consider k-means, a well-known partitioning method [8,9]. In favor of an algorithm of this type the following points can be cited: (1) it does not require any normality or parametric assumptions within clusters (although it might be more efficient under certain assumptions), which might be of great interest when the aim is to cluster data on which no prior information is available; (2) it is likely to be more robust as regards numerical convergence; (3) in the particular