Extracting and Rendering Representative Sequences⋆
Alexis Gabadinho, Gilbert Ritschard, Matthias Studer, and Nicolas S. Müller
Department of Econometrics and Laboratory of Demography, University of Geneva
40, bd du Pont-d’Arve, CH-1211 Geneva, Switzerland
alexis.gabadinho@unige.ch
http://mephisto.unige.ch/TraMineR
Abstract. This paper is concerned with the summarization of a set of categorical
sequences. More specifically, the problem studied is the determination of the
smallest possible number of representative sequences that ensure a given coverage
of the whole set, i.e. that together have a given percentage of sequences in
their neighbourhood. The proposed heuristic for extracting the representative
subset requires as main arguments a pairwise distance matrix, a representativeness
criterion and a distance threshold under which two sequences are considered
redundant or, equivalently, in the neighbourhood of each other. It first builds a
list of candidates using a representativeness score and then eliminates redundancy.
We also propose a visualization tool for rendering the results and quality measures
for evaluating them. The proposed tools have been implemented in our TraMineR R
package for mining and visualizing sequence data, and we demonstrate their
efficiency on a real-world example from the social sciences. The methods are
nonetheless by no means limited to social science data and should prove useful in
many other domains.
Keywords: Categorical sequences, Representatives, Pairwise dissimilarities,
Discrepancy of sequences, Summarizing sets of sequences, Visualization.
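The heuristic outlined in the abstract can be illustrated with a minimal Python sketch. It is not the TraMineR implementation. It assumes that the representativeness score is the neighbourhood density (the number of sequences closer than the threshold), that "under the threshold" means strictly smaller, and that candidates are examined in decreasing score order until the requested coverage is reached. The function names and the toy distance matrix are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def neighbourhood_density(dist: Matrix, radius: float) -> List[int]:
    """Score each sequence by the number of sequences closer than `radius`
    (assumed representativeness criterion; the sequence itself is counted)."""
    return [sum(1 for d in row if d < radius) for row in dist]


def coverage(dist: Matrix, reps: Sequence[int], radius: float) -> float:
    """Share of sequences with at least one representative in their neighbourhood."""
    n = len(dist)
    covered = sum(1 for i in range(n) if any(dist[i][r] < radius for r in reps))
    return covered / n


def extract_representatives(dist: Matrix, radius: float,
                            target_coverage: float) -> List[int]:
    """Return indices of representatives, best-scored first."""
    scores = neighbourhood_density(dist, radius)
    # Step 1: candidate list sorted by decreasing score (ties broken by index).
    candidates = sorted(range(len(dist)), key=lambda i: (-scores[i], i))
    reps: List[int] = []
    for c in candidates:
        if coverage(dist, reps, radius) >= target_coverage:
            break
        # Step 2: drop candidates that are redundant with a kept representative.
        if all(dist[c][r] >= radius for r in reps):
            reps.append(c)
    return reps


# Toy matrix (hypothetical data): sequences 0-2 form one group, 3-4 another.
DIST = [
    [0, 1, 1, 5, 5],
    [1, 0, 1, 5, 5],
    [1, 1, 0, 5, 5],
    [5, 5, 5, 0, 1],
    [5, 5, 5, 1, 0],
]

if __name__ == "__main__":
    print(extract_representatives(DIST, radius=2, target_coverage=1.0))
```

In this sketch, lowering `target_coverage` stops the scan earlier and yields fewer representatives, which mirrors the trade-off between the size of the representative set and the coverage it ensures.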
1 Introduction
In the social sciences, categorical sequences appear mainly as ordered lists of states
(employed/unemployed) or events (leaving parental home, marriage, having a child)
describing individual life trajectories, typically longitudinal biographical data such as
employment histories or family life courses. One widely used approach for extracting
knowledge from such sets consists in computing pairwise distances by means of
sequence alignment algorithms, and then clustering the sequences using these
distances [1]. The expected outcome of such a strategy is a typology, with each cluster
grouping cases with similar patterns (trajectories). An important aspect of sequence
analysis is also to compare the patterns of cases grouped according to the values of
covariates (for instance sex or socioeconomic position in the social sciences).
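As an illustration of the first step of this approach, the following Python sketch computes a pairwise distance matrix between state sequences. It assumes a plain edit distance with unit insertion, deletion and substitution costs, which is only the simplest alignment-based measure; the state codes (E for employed, U for unemployed) and the example sequences are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimal number of unit-cost insertions, deletions and substitutions
    needed to turn sequence `a` into sequence `b` (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


def pairwise_distances(seqs: Sequence[Sequence[str]]) -> List[List[int]]:
    """Symmetric matrix of edit distances between all pairs of sequences."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    return dist


# Hypothetical employment histories, one state per time unit.
SEQS = ["EEEE", "EEUE", "UUUU"]

if __name__ == "__main__":
    for row in pairwise_distances(SEQS):
        print(row)
```

Such a matrix can then be passed to any clustering procedure that accepts precomputed dissimilarities.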
A crucial task is then to summarize groups of sequences by describing the patterns
that characterize them. This could be done by resorting to graphical representations
⋆ This work is part of the Swiss National Science Foundation research project FN-122230
“Mining event histories: Towards new insights on personal Swiss life courses”.
A. Fred et al. (Eds.): IC3K 2010, CCIS 128, pp. 94–106, 2011.
© Springer-Verlag Berlin Heidelberg 2011