Keyword-Based Browsing and Analysis of Large Document Sets

Ido Dagan and Ronen Feldman
Math and Computer Science Dept.
Bar-Ilan University
Ramat-Gan, ISRAEL
{feldman,dagan}@bimacs.cs.biu.ac.il

Haym Hirsh
Dept. of Computer Science
Rutgers University
Piscataway, NJ USA 08855
hirsh@cs.rutgers.edu

Abstract

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Knowledge Discovery in Texts. It is built on top of a text-categorization paradigm where text articles are annotated with keywords organized in a hierarchical structure. Knowledge discovery is performed by analyzing the co-occurrence frequencies of keywords from this hierarchy in the various documents. We show how this term-frequency approach supports a range of KDD operations, providing a general framework for knowledge discovery and exploration in collections of unstructured text.

Introduction

Traditional databases store large collections of information in the form of structured records, and provide methods for querying the database to obtain all records whose content satisfies the user's query. More recently, however, researchers in Knowledge Discovery in Databases (KDD) have provided a new family of tools for accessing information in databases (e.g. Brachman et al, 1993; Frawley et al, 1991; Kloesgen, 1992; Kloesgen, 1995b; Ezawa and Norton, 1995). The goal of KDD has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from given data" (Piatetsky-Shapiro and Frawley 1991).
Work in this area includes applying machine-learning and statistical-analysis techniques towards the automatic discovery of patterns in databases, as well as providing user-guided environments for exploration of data.

In Proceedings of the International Symposium on Document Analysis and Information Retrieval, 1996