Feature Selection for Temporal Health Records
Rohan A. Baxter, Graham J. Williams, and Hongxing He
CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra 2602, Australia, {Rohan.Baxter,Graham.Williams,Hongxing.He}@cmis.csiro.au

Abstract. In this paper we consider three alternative feature vector representations of patient health records. The longitudinal (temporal), irregular character of patient episode history, an integral part of a health record, poses challenges for applying data mining techniques. The present application involves the episode history of monitoring services for elderly patients with diabetes. The application task was to examine patterns of monitoring services for patients. This was approached by clustering patients into groups receiving similar patterns of care and visualising the features devised to highlight interesting patterns of care.

1 Introduction
We are interested in the problem of clustering individuals given observed data about the individuals, where the observed data does not naturally occur in vector form. Clustering algorithms are typically applied to data in vector form. For example, we may have k measurements on a set of patients, so that the measurements on each individual i are represented as a k-dimensional vector. For vector-form data, well-known and widely applied clustering techniques can be used. Such techniques are generally either model-based methods, such as mixture modelling [6], or distance-based methods [3].
Much real-world data is actually in non-vector form, consisting of observations of an individual that record information at particular time points. Such variable-length event sequence data is described in Sect. 2, but examples include a patient's usage of medical services and an individual's stock trading behaviour. The data is characterised as irregular events where each event may encapsulate a different type of action.
The data mining practitioner wishing to cluster event sequence data appears to have three options. The first option is to convert the event sequence data into feature vectors [4]. A problem with this approach is that information is inevitably lost in the vectorisation process. The second option is to use a distance-based clustering method which allows for non-vector data. An edit-distance metric [5], which uses insert, delete and replace operations to turn one sequence into another, is an example of this approach. A difficulty here is in defining an effective distance metric: a suitable distance metric needs to be created for each new application. The third option is to use mixtures of a generative probabilistic model [2, 1]. This is an attractive approach but is not explored further here.
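The first two options can be illustrated with a minimal sketch. It assumes events are coded as single symbols; the event codes "A", "B", "C" and the function names count_vector and edit_distance are hypothetical and chosen only for illustration. The count-based vectorisation is one simple choice among many, and the unit-cost (Levenshtein) edit distance is an assumption, not necessarily the metric of [5].

```python
from collections import Counter
from typing import List, Sequence

# Hypothetical event codes, for illustration only (not from the paper).
EVENT_TYPES = ["A", "B", "C"]


def count_vector(events: Sequence[str],
                 event_types: Sequence[str] = EVENT_TYPES) -> List[int]:
    """Option 1: map a variable-length event sequence to a fixed-length
    vector of event counts. The ordering of events is discarded."""
    counts = Counter(events)
    return [counts[t] for t in event_types]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Option 2: unit-cost edit distance, i.e. the minimum number of
    insert, delete and replace operations that turn sequence a into b."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i items into an empty prefix needs i deletions
        for j, y in enumerate(b, start=1):
            replace_cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,                   # delete x
                            curr[j - 1] + 1,               # insert y
                            prev[j - 1] + replace_cost))   # replace or match
        prev = curr
    return prev[-1]


# Two hypothetical patients with the same events in a different order.
patient_1 = ["A", "B", "A", "C"]
patient_2 = ["C", "A", "A", "B"]

same_counts = count_vector(patient_1) == count_vector(patient_2)
sequence_gap = edit_distance(patient_1, patient_2)
```

In this sketch the two hypothetical patients receive identical count vectors although their sequences differ, which is the kind of information loss noted for vectorisation. The edit distance does separate them, but its unit operation costs are arbitrary here; choosing costs that are meaningful for a given application is the difficulty noted above.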