Mining Aviation Safety Data: A Hybrid Approach

Eric Bloedorn
The MITRE Corporation
bloedorn@mitre.org

ABSTRACT

Data mining is broadly defined as the search for interesting patterns in large amounts of data. Techniques for performing data mining come from a wide variety of disciplines, including traditional statistics, machine learning, and information retrieval. While this means that for any given application there is probably some “data mining” technique for finding interesting patterns, it also means that there is a confusing array of possible data mining tools and approaches for any given application. This problem is exacerbated when the available data contains both structured and unstructured (free-text) data. For example, the aviation safety data used in the reported experiments contains records that include both free-text event descriptions and structured fields for phase-of-flight and location. Performing separate analyses on these different sources of data does not fully exploit the available information (e.g., clustering records without regard to narratives can match reports of total electrical failure with human factors problems). Unfortunately, currently available tools provide little support for analyzing these data types together. This paper describes one approach to combining the information available from all of these different types of data into a single ‘similarity’ score. The importance of picking tools appropriate to the types of data at hand is also stressed.

INTRODUCTION

As a new field, data mining has a large number of tools and techniques associated with it. While this means the field has much to offer, it makes it difficult for data owners who are new to this type of analysis to decide which tools are appropriate. One way to cope with this selection task is to let the data drive the decision: different data types require different analysis methods. One useful way of organizing the discussion of data types is to break these types down by structure.
Structure can refer either to the structure of the entire record or to the structure of an individual field. For the most part, this discussion assumes that each record is highly structured, i.e., that each record has a fixed number of fields (attributes) and that these fields appear in a known order. The problems of dealing with semi-structured data, in which a fixed schema cannot adequately describe the data organization, are described in Seligman et al. [1998]. The following sections outline four different data types: quantitative (interval and ratio), ordinal, nominal, and free-text. For each data type, a definition and suggested analysis methods are provided.

2000 The MITRE Corporation. ALL RIGHTS RESERVED.
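
To make the idea of a single similarity score over mixed data types concrete, the following is a minimal sketch, not the method reported in this paper. It assumes one simple per-field similarity for each of the four data types (range-scaled difference for quantitative, rank difference for ordinal, exact match for nominal, and token-overlap (Jaccard) similarity for free text) and combines them with a weighted average. The record fields, the value range, the number of severity levels, and the weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Hypothetical safety record with one field per data type."""
    altitude_ft: float     # quantitative (ratio)
    severity: int          # ordinal: 1 (minor) .. 5 (critical), assumed scale
    phase_of_flight: str   # nominal
    narrative: str         # free text


ALTITUDE_RANGE_FT = 40000.0  # assumed span used to scale altitude differences
SEVERITY_LEVELS = 5          # assumed number of ordinal levels


def quantitative_sim(a: float, b: float, value_range: float) -> float:
    """1.0 for equal values, falling linearly to 0.0 at the full range."""
    return 1.0 - min(abs(a - b) / value_range, 1.0)


def ordinal_sim(a: int, b: int, levels: int) -> float:
    """Similarity based on the distance between ranks."""
    return 1.0 - abs(a - b) / (levels - 1)


def nominal_sim(a: str, b: str) -> float:
    """Nominal values have no order, so only an exact match counts."""
    return 1.0 if a == b else 0.0


def text_sim(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def hybrid_similarity(r1: Report, r2: Report,
                      weights=(0.2, 0.2, 0.2, 0.4)) -> float:
    """Weighted average of the four per-field similarities (weights assumed)."""
    scores = (
        quantitative_sim(r1.altitude_ft, r2.altitude_ft, ALTITUDE_RANGE_FT),
        ordinal_sim(r1.severity, r2.severity, SEVERITY_LEVELS),
        nominal_sim(r1.phase_of_flight, r2.phase_of_flight),
        text_sim(r1.narrative, r2.narrative),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

With a score of this kind, two reports that agree on every structured field but describe unrelated events in their narratives score lower than two reports whose narratives also overlap, which is the distinction that analysis of the structured fields alone would miss.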