Computer Methods and Programs in Biomedicine 104 (2011) e112–e121
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
KmL: A package to cluster longitudinal data
Christophe Genolini a,b,c,*, Bruno Falissard a,b,d
a Inserm, U669, Paris, France
b Univ Paris-Sud and Univ Paris Descartes, UMR-S0669, Paris, France
c Modal’X, Univ Paris-Nanterre, UMR-S0669, Paris, France
d AP-HP, Hôpital de Bicêtre, Service de Psychiatrie, Le Kremlin-Bicêtre, France
Article info
Article history:
Received 21 May 2010
Received in revised form 4 May 2011
Accepted 25 May 2011
Keywords:
Package presentation
Longitudinal data
k-Means
Cluster analysis
Non-parametric algorithm
Abstract
Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements are not restricted to single variables but can be seen as trajectories. Thus, an important question concerns the existence of homogeneous patient trajectories.
KmL is an R package providing an implementation of k-means designed to work specifically on longitudinal data. It provides several techniques for dealing with missing values in trajectories (classical ones such as linear interpolation or LOCF, but also new ones such as copyMean). It can run k-means with distances specifically designed for longitudinal data (such as the Fréchet distance or any user-defined distance). Its graphical interface helps the user choose the appropriate number of clusters when classical criteria are not efficient. It also provides an easy way to export graphical representations of the mean trajectories resulting from the clustering. Finally, it runs the algorithm several times, using various kinds of starting conditions and/or numbers of clusters to be sought, thus sparing the user a lot of manual re-sampling.
© 2011 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements collected for a single subject can be seen as trajectories. Thus, an important question concerns the existence of homogeneous patient trajectories. From a statistical point of view, many methods have been developed to deal with this issue [1–4]. In a survey [5], Warren Liao divides these methods into five families: partitioning methods construct k clusters, each containing at least one individual; hierarchical methods work by grouping data objects into a tree of clusters; density-based methods make clusters grow as long as the density in the “neighborhood” exceeds a certain threshold; grid-based methods quantize the object space and perform the clustering operation on the resulting finite grid structure; model-based methods assume a model for each cluster and look for the best fit of the data to the model.
* Corresponding author at: Inserm, U669, 97 Bd Port Royal, 75014 Paris, France. Tel.: +33 6 21 48 47 84.
E-mail address: genolini@u-paris10.fr (C. Genolini).
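To make the partitioning family concrete, the following is a minimal sketch of k-means applied to trajectories, where each trajectory is treated as a vector of measurements taken at the same time points. It is written in Python for illustration only and is not the KmL implementation or its API; the function names (`euclidean`, `mean_trajectory`, `kmeans_trajectories`), the random initialization, and the toy data are all assumptions made for this example.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_trajectory(trajs):
    """Pointwise mean of a non-empty list of trajectories."""
    n = len(trajs)
    return [sum(values) / n for values in zip(*trajs)]


def kmeans_trajectories(trajs, k, seed=0, max_iter=100):
    """Partition trajectories into k clusters (Lloyd-style k-means).

    Hypothetical helper for illustration; assumes complete trajectories
    (no missing values) measured at common time points.
    """
    rng = random.Random(seed)
    # Initialize the centers with k trajectories drawn at random.
    centers = [list(t) for t in rng.sample(trajs, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each trajectory joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(t, centers[c]))
            for t in trajs
        ]
        if new_labels == labels:
            break  # the partition no longer changes
        labels = new_labels
        # Update step: each center becomes the mean trajectory of its cluster.
        for c in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = mean_trajectory(members)
    return labels, centers


# Toy data: three roughly flat trajectories and three rising ones.
trajs = [
    [1.0, 1.0, 1.0, 1.0],
    [1.2, 0.9, 1.1, 1.0],
    [0.8, 1.1, 0.9, 1.0],
    [1.0, 3.0, 5.0, 7.0],
    [1.1, 3.2, 4.9, 7.1],
    [0.9, 2.8, 5.1, 6.9],
]
labels, centers = kmeans_trajectories(trajs, k=2)
print(labels)
```

The distance function is passed through a single helper so that it could be swapped for another trajectory distance; the result of a run may depend on the starting centers, which is why repeated runs with different initializations are commonly used.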
The pros and cons of these approaches are regularly discussed [6,7], although there is little evidence showing which method is preferable in which situation. In this paper, we consider k-means, a well-known partitioning method [8,9]. The following points can be cited in favor of an algorithm of this type: (1) it does not require any normality or parametric assumptions within clusters (although it might be more efficient under certain assumptions), which might be of great interest when the aim is to cluster data on which no prior information is available; (2) it is likely to be more robust with regard to numerical convergence; (3) in the particular
0169-2607/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.cmpb.2011.05.008