168 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 1, JANUARY-FEBRUARY 1997

Learning to Predict: INC2.5

Mirsad Hadzikadic and Benjamin F. Bohren

Abstract—This paper discusses INC2.5, an incremental concept formation system. INC2.5 forms a hierarchy of concept descriptions from previously seen instances and uses it to predict the classification of a new instance description. Each subtree of the hierarchy consists of instances that are similar to each other; the further a grouping lies from the root, the greater the similarity among the instances it contains. The ability to classify instances based on their descriptions has many applications. For example, in the medical field doctors are required daily to diagnose patients, that is, to classify patients according to their symptoms. INC2.5 has been successfully applied to several domains, including breast cancer, general trauma, congressional voting records, and the monk's problems.

Index Terms—Concept formation, diagnosis, database mining, knowledge acquisition, similarity-based learning.

1 INTRODUCTION

THE ability to learn from observation and to accurately predict future instances is a faculty we expect of humans. Because of its complexity, however, simulating this task on a digital computer requires nonstandard algorithms. If successful, the computer would be a useful tool for people in need of an automated classification system. Such a tool could either help domain experts search for previously unknown patterns of behavior or provide expert advice where adequate resources are not available. The medical field is a good example of an area where such a tool would be beneficial. Areas with a shortage of doctors would gain full-time help with diagnosing diseases, residents could use it to broaden their experience, and experienced doctors could discover new disorders, thereby improving their own ability to correctly predict diseases.
Finally, all medical personnel could use it to confirm or weaken an opinion.

To be effective, the system described above should be unsupervised and incremental. The term unsupervised indicates that there is no teacher to decide on either the number or the identity of the categories to be learned by the system. This is especially useful in medicine, where new interpretations of previous findings appear almost as regularly as new diseases do. The term incremental, on the other hand, indicates that examples/instances are acquired one at a time. While it is certainly possible to form concepts by looking at all instances at once, such an approach would not be advisable in the context of medical decision making, where each medical practitioner can build his/her own decision tree, i.e., a hierarchy of categories, from his/her own previous cases. This incremental approach allows the system to properly represent a physician's understanding of both new diseases and modifications of existing ones, when and as they happen in the population.

One approach the tool could use to classify objects is known as incremental concept formation [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. The algorithm builds a hierarchy of the known instances. An instance is defined as any object, event, or place that can be described in terms of attribute/value pairs. Ideally, all instances that lead to the same consequence would be grouped in the same branch of the tree. Once obtained, the tree can be used to predict the classification of new instances. In the past, concept formation researchers have used soybean disease, congressional voting records [1], [7], [10], breast cancer [4], primary tumor [4], audiology [4], and the monk's problems [13] as domains for testing their algorithms. In this paper, we use breast cancer, general trauma, congressional voting records, and the monk's problems.
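The incremental hierarchy-building idea just described can be sketched as a short loop. This is an illustrative sketch only: the `Node` class, the Jaccard similarity, the greedy descent, and the 0.3 threshold are assumptions for exposition, not the INC2.5 operators, which Section 2 defines.

```python
# Minimal sketch of incremental concept formation: attribute/value instances
# arrive one at a time and are placed into a tree whose branches group
# similar instances. The similarity measure and placement policy here are
# toy stand-ins, not the paper's algorithm.

class Node:
    def __init__(self, instance=None):
        self.children = []                 # subconcepts of this node
        self.pairs = set(instance or [])   # union of (attribute, value) pairs

    def absorb(self, instance):
        self.pairs |= set(instance)

def jaccard(a, b):
    """Toy stand-in for a similarity function over feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def insert(root, instance, threshold=0.3):
    """Descend toward the most similar child, updating class summaries on
    the way; start a new branch when no child is similar enough. (Leaf
    splitting and the other tree operators are elided.)"""
    pairs = set(instance)
    node = root
    node.absorb(instance)
    while node.children:
        best = max(node.children, key=lambda c: jaccard(c.pairs, pairs))
        if jaccard(best.pairs, pairs) < threshold:
            break
        node = best
        node.absorb(instance)
    node.children.append(Node(instance))

root = Node()
for inst in [{("color", "red"), ("size", "big")},
             {("color", "red"), ("size", "small")},
             {("color", "blue"), ("shape", "round")}]:
    insert(root, inst)
# root now has one branch summarizing the two "red" instances and a
# separate "blue" leaf
```

With such a tree in hand, predicting the class of a new instance amounts to descending toward the most similar branch, which is the role the evaluation function of Section 2 plays in INC2.5.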
The remainder of this paper is divided into four sections. Section 2 introduces the architecture of INC2.5, an incremental similarity-based concept formation system, while Section 3 quantifies its prediction performance. Section 4 provides a brief overview of previous work in concept formation. Finally, a summary of future work is given in Section 5.

2 INC2.5

INC2.5, a descendant of INC2 [4], [5], [6], [7], builds a tree hierarchy via six operators and a similarity-based classification evaluation function. INC2.5 enhances INC2 by introducing two new operators and an improved classification algorithm.

Each node in the INC2.5 tree has a description consisting of the following data: the name of the node, a list of attributes and associated values, a measure of cohesiveness, the number of children, and the number of instances stored under the class. Each node has an identical list of attributes. For each attribute, the node stores all values found in its instance descriptions. A node containing a single instance will have zero or more values associated with each attribute. A class node is identical in structure to an instance node, but contains the union of all instance values stored under it. When a value occurs more than once, the number of occurrences of the attribute value in that branch of the tree is recorded in the attribute list. Upon calculating the similarity between two instances, the two attribute lists are compared, returning a number that reflects both distinctive and common features. The classification evaluation function, its components, and the tree-searching algorithms are explained in detail throughout this section.

2.1 Evaluation Function

The evaluation function can be broken into two components, similarity and cohesiveness. Similarity is used both for classifying previous instances and for predicting the class membership of new instances. Cohesiveness is the average similarity of all pairs of instances contained in a class.
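The node bookkeeping just described can be sketched as follows. This is a hedged illustration: the field names, the `Counter`-based occurrence tally, and the pairwise-average helper are assumptions chosen for clarity, since the paper gives no concrete data layout.

```python
from collections import Counter
from itertools import combinations

class ConceptNode:
    """Illustrative layout of an INC2.5-style node. A class node keeps,
    per attribute, every value seen in its instances together with its
    number of occurrences, mirroring the description above."""

    def __init__(self, name):
        self.name = name
        self.values = {}          # attribute -> Counter of observed values
        self.cohesiveness = 0.0   # average pairwise similarity (Section 2.1)
        self.children = []
        self.n_instances = 0

    def add_instance(self, instance):
        """Fold one attribute/value description into the class summary."""
        for attr, value in instance.items():
            self.values.setdefault(attr, Counter())[value] += 1
        self.n_instances += 1

    def p(self, attr, value):
        """Conditional probability p(f | A) of feature f = (attr, value)."""
        return self.values.get(attr, Counter())[value] / self.n_instances

def average_pairwise(instances, similarity):
    """Cohesiveness: the average similarity over all pairs of instances."""
    pairs = list(combinations(instances, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

node = ConceptNode("example-class")
node.add_instance({"clump_thickness": "low", "mitoses": "few"})
node.add_instance({"clump_thickness": "low", "mitoses": "many"})
# p("clump_thickness", "low") is 1.0; p("mitoses", "few") is 0.5
```

Note how the occurrence counts double as the conditional probabilities p(f|A): each count divided by the number of instances under the class is exactly the quantity the similarity function of Section 2.1.1 consumes.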
The next two sections formally describe these functions.

2.1.1 Similarity

The similarity s of two nodes is based on a comparison of their two sets of attribute/value pairs. The function is derived from the contrast model [14], which defines similarity as a linear combination of common and distinctive attribute/value pairs (features). Equation 2.1 gives the similarity function s(A, B), where A and B denote the descriptions of instance or class nodes a and b, respectively; c(A, B) represents the contribution of the features common to a and b; d(A, B) introduces the influence of the features of a not shared by b; |a| is the number of instances stored under the node associated with class a; |b| is similarly interpreted for class b; and p(f|A) is the conditional probability of feature f given A.

————————————————
M. Hadzikadic is with Orthopaedic Informatics Research, Carolinas Medical Center, and the Computer Science Department, University of North Carolina at Charlotte, Charlotte, NC 28223. E-mail: mirsad@uncc.edu.
B.F. Bohren is with Orthopaedic Informatics Research, Carolinas Medical Center, P.O. Box 32861, Charlotte, NC 28232. E-mail: bfbohren@uncc.edu.
Manuscript received Feb. 21, 1995; revised Jan. 16, 1996. For information on obtaining reprints of this article, please send e-mail to: transkde@computer.org, and reference IEEECS Log Number K96068.
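As background for Equation 2.1, the general form of Tversky's contrast model [14], from which s(A, B) is derived, can be written as follows; the symbols θ, α, β, and the salience measure f below belong to the general model, not to the paper's specific instantiation.

```latex
% Tversky's contrast model: similarity as a weighted combination of
% common features (A ∩ B) and distinctive features (A − B, B − A).
\[
s(a, b) \;=\; \theta\, f(A \cap B) \;-\; \alpha\, f(A - B) \;-\; \beta\, f(B - A),
\qquad \theta, \alpha, \beta \ge 0
\]
```

The paper's c(A, B) and d(A, B) terms play the roles of the common-feature and distinctive-feature contributions here, with the p(f|A) probabilities and the class sizes |a| and |b| determining the weighting in Equation 2.1.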