Interactive Methods for Taxonomy Editing and Validation

Scott Spangler
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
408-927-2887
email: spangles@us.ibm.com

Jeffrey Kreulen
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
408-927-2431
email: kreulen@us.ibm.com

ABSTRACT

Today’s enterprise understands that improved utilization of its collective knowledge assets leads to improved business performance. The proliferation of electronic information, together with pressure to produce more with fewer resources while performing increasingly complex tasks, makes this a continuous challenge. To address this challenge and create value where there is currently chaos, enterprises are building knowledge repositories and structuring them in ways that are meaningful to their organization, business and processes. This structuring typically manifests itself in the form of one or more taxonomies. The taxonomies are meaningful hierarchical categorizations of documents into topics reflecting the natural relationships between the documents and their business objectives. Improving the quality of these taxonomies and reducing the overall cost required to create them is therefore an important area of research. Supervised and unsupervised text clustering are important technologies, but they comprise only a part of a complete solution. There is also a great need for a human to be able to interact efficiently with a taxonomy during the editing and validation phase. We have developed a comprehensive approach to solving this problem, and implemented this approach in a software tool called eClassifier. eClassifier provides features to help the taxonomy editor understand and evaluate each category of a taxonomy and visualize the relationships between the categories. Multiple techniques allow the user to make changes at both the category and document level. Metrics then establish how well the resultant taxonomy can be modeled for future document classification.
eClassifier enables the development of multiple taxonomies so that multiple relationships in the documents can be modeled. In this paper, we present a comprehensive set of viewing, editing and validation techniques that we have implemented in the Lotus Discovery Server, resulting in a significant reduction in the time required to create a quality taxonomy.

1 INTRODUCTION

Businesses have been able to systematically increase the leverage gained from enterprise data through technologies such as relational database management systems and techniques such as data warehousing. Additionally, it is conjectured that the amount of knowledge encoded in electronic text far surpasses that available in data alone. However, the ability to take advantage of this wealth of knowledge is only beginning to meet the challenge. Businesses that can realize this potential will surely gain an advantage through increased efficiency.

One important step in achieving this potential has been to structure the inherently unstructured information in meaningful ways. A well-established first step in gaining understanding is to segment examples into meaningful categories [2]. This leads to the idea of taxonomies: natural hierarchical organizations of the information in alignment with the business goals, organization and processes. While there will be some commonality within some industries, these natural organizations will have significant diversity across domains and organizations.

Research to address this need for taxonomy development has concentrated largely on automated grouping techniques such as text clustering. While we believe that text clustering is an invaluable tool (indeed, it is part of our solution), we assert that by itself it is insufficient to meet the full challenge of taxonomy generation.
Our experience using variations of the K-Means [6][10] and Expectation Maximization (EM) [16][17] clustering algorithms has shown that they generate useful seed taxonomies, but rarely generate a satisfactory final taxonomy for a given business problem. For example, if you were to cluster a set of patents with the intent of creating a technology-based taxonomy, you would typically find that some of the clusters correspond to technologies while others are based on some other aspect or relationship found in the text, such as processes. One might postulate that the clustering algorithm is not in fact the issue, and that this is instead a feature selection problem. An alternative approach would be to leverage controlled vocabularies. However, we find this approach to be very labor intensive, and it would still yield results that need further refinement. Our approach to solving this problem focuses on the visualization, editing and validation of clustering results. We describe our approach in detail below, but first the problem and its relationship to cluster validation warrant further clarification. The problem we are attempting to solve has been referred to in the literature [4][7] as clustering validation. Validation methods have typically been based on one of three types of