Extracting and Rendering Representative Sequences

Alexis Gabadinho, Gilbert Ritschard, Matthias Studer, and Nicolas S. Müller

Department of Econometrics and Laboratory of Demography, University of Geneva
40, bd du Pont-d’Arve, CH-1211 Geneva, Switzerland
alexis.gabadinho@unige.ch
http://mephisto.unige.ch/TraMineR

Abstract. This paper is concerned with the summarization of a set of categorical sequences. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of sequences in their neighbourhood. The proposed heuristic for extracting the representative subset requires as main arguments a pairwise distance matrix, a representativeness criterion and a distance threshold under which two sequences are considered redundant or, equivalently, in the neighbourhood of each other. It first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in our TraMineR R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless in no way limited to social science data and should prove useful in many other domains.

Keywords: Categorical sequences, Representatives, Pairwise dissimilarities, Discrepancy of sequences, Summarizing sets of sequences, Visualization.

1 Introduction

In the social sciences, categorical sequences appear mainly as ordered lists of states (employed/unemployed) or events (leaving parental home, marriage, having a child) describing individual life trajectories, typically longitudinal biographical data such as employment histories or family life courses.
One widely used approach for extracting knowledge from such sets consists in computing pairwise distances by means of sequence alignment algorithms, and then clustering the sequences using these distances [1]. The expected outcome of such a strategy is a typology, with each cluster grouping cases with similar patterns (trajectories). An important aspect of sequence analysis is also to compare the patterns of cases grouped according to the values of covariates (for instance sex or socioeconomic position in the social sciences). A crucial task is then to summarize groups of sequences by describing the patterns that characterize them. This could be done by resorting to graphical representations

This work is part of the Swiss National Science Foundation research project FN-122230 “Mining event histories: Towards new insights on personal Swiss life courses”.

A. Fred et al. (Eds.): IC3K 2010, CCIS 128, pp. 94–106, 2011. © Springer-Verlag Berlin Heidelberg 2011