168 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 1, JANUARY-FEBRUARY 1997
Learning to Predict: INC2.5
Mirsad Hadzikadic and Benjamin F. Bohren
Abstract—This paper discusses INC2.5, an incremental concept
formation system. The goal of INC2.5 is to form a hierarchy of concept
descriptions from previously seen instances and to use that hierarchy to
predict the classification of a new instance description. Each subtree of
the hierarchy consists of instances that are similar to one another; the
farther a grouping lies from the root, the greater the similarity among the
instances it contains. The ability to classify instances from their
descriptions has many applications. For example, in the medical
field doctors are required daily to diagnose patients, that is, to
classify patients according to their symptoms. INC2.5 has been
successfully applied to several domains, including breast cancer,
general trauma, congressional voting records, and the monk’s
problems.
Index Terms—Concept formation, diagnosis, database mining,
knowledge acquisition, similarity-based learning.
———————— ✦ ————————
1 INTRODUCTION
THE ability to learn from observation and accurately predict future
instances is a feature expected of humans. Due to its complexity,
however, simulating this task on a digital computer requires
nonstandard algorithms. If successful, the computer would be a
useful tool for people in need of an automated classification
system. Such a tool could either help domain experts search
for previously unknown patterns of behavior or provide expert
advice where adequate resources are not available. The medical
field is a good example of an area where such a tool would be
beneficial. Areas with a shortage of doctors would gain full-time
help with diagnosing diseases, residents could use it to broaden
their experience, and experienced doctors could discover new
disorders, thereby improving their own ability to correctly
predict diseases. Finally, all medical personnel could use it to
confirm or weaken an opinion.
In order to be effective, the system/tool described above should
be unsupervised and incremental. The term unsupervised indicates
that there is no teacher to decide on either the number or the identity
of the categories to be learned by the system. This is especially useful
in medicine, where new interpretations of previous findings appear
almost as regularly as new diseases do. The term incremental, on
the other hand, indicates that examples/instances are acquired one
at a time. While it is certainly possible to form concepts by looking at
all instances at once, such an approach would not be advisable in the
context of medical decision making, where each medical practitioner
can build his/her own decision tree, i.e., a hierarchy of categories,
from his/her own previous cases. The incremental approach allows
the system to properly represent a physician’s understanding of
both new diseases and modifications of existing ones when and
as they happen in the population.
One approach the tool could use to classify objects is known as
incremental concept formation [1], [2], [3], [4], [5], [6], [7], [8], [9],
[10], [11], [12]. The algorithm builds a hierarchy of the known
instances. An instance is defined as any object, event, or place that
can be described in terms of attribute/value pairs. Ideally, all
instances that lead to the same consequence would be grouped in
the same branch of the tree. Once a tree is obtained, it can be
used to predict the classification of new instances. In the past,
concept formation researchers have used soybean disease, congres-
sional voting records [1], [7], [10], breast cancer [4], primary tumor
[4], audiology [4], and the monk’s problems [13] as domains for
testing their algorithms. In this paper, we use breast cancer,
general trauma, congressional voting records, and the monk’s
problems.
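As a minimal sketch of the representation assumed above, an instance is simply a set of attribute/value pairs; the attribute names and values below are hypothetical, not taken from the paper's data sets:

```python
# An instance: any object, event, or place describable as
# attribute/value pairs (names and values here are invented).
patient_a = {"age": "young", "tumor_size": "small", "node_caps": "no"}
patient_b = {"age": "young", "tumor_size": "small", "node_caps": "yes"}

# The attribute/value pairs the two instances share -- the basis on
# which similar instances end up in the same branch of the tree.
common = set(patient_a.items()) & set(patient_b.items())
print(sorted(common))  # → [('age', 'young'), ('tumor_size', 'small')]
```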
The remainder of this paper is divided into four sections. Sec-
tion 2 introduces the architecture of INC2.5, an incremental simi-
larity-based concept formation system, while Section 3 quantifies
its prediction performance. Section 4 of the paper provides a brief
overview of previous work in concept formation. Finally, a sum-
mary of future work is given in Section 5.
2 INC2.5
INC2.5, a descendant of INC2 [4], [5], [6], [7], builds a tree hierar-
chy via six operators and a similarity-based classification evalua-
tion function. INC2.5 enhances INC2 by introducing two new op-
erators and an improved classification algorithm. Each node in the
INC2.5 tree has a description consisting of the following data:
name of the node, list of attributes and associated values, measure of
cohesiveness, number of children, and number of instances stored under
this class.
Each node has an identical list of attributes, and for each attribute
the node stores all values found in its instance descriptions. A
node containing a single instance will have zero or more values
associated with each attribute. A class node is identical in structure
to an instance node, but contains the union of all instance values
stored under it. When a value occurs more than once, the number
of occurrences of that attribute value in that branch of the tree is
recorded in the attribute list. To compute the similarity between
two nodes, their attribute lists are compared, yielding a number
that reflects both distinctive and common features.
The classification evaluation function, its components, and the
tree-searching algorithms are explained in detail throughout this
section.
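The node description above can be sketched as a small data structure. The field names and the value-merging rule below are illustrative assumptions drawn from the prose, not the authors' implementation:

```python
from collections import Counter

class Node:
    """A node in an INC2.5-style tree (illustrative sketch)."""
    def __init__(self, name):
        self.name = name
        self.values = {}          # attribute -> Counter of observed values
        self.cohesiveness = 1.0   # average pairwise similarity (Section 2.1)
        self.children = []        # child nodes
        self.n_instances = 0      # instances stored under this class

    def add_instance(self, instance):
        """Fold one instance's attribute/value pairs into this class node,
        counting how often each value occurs in this branch of the tree."""
        for attr, value in instance.items():
            self.values.setdefault(attr, Counter())[value] += 1
        self.n_instances += 1

node = Node("example-class")
node.add_instance({"age": "young", "tumor_size": "small"})
node.add_instance({"age": "young", "tumor_size": "large"})
print(node.values["age"]["young"], node.n_instances)  # → 2 2
```

A class node built this way holds the union of its instances' values, with repeated values tracked by count, as the text describes.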
2.1 Evaluation Function
The evaluation function can be broken into two components: simi-
larity and cohesiveness. Similarity is used both for classifying previous
instances and for predicting the class membership of new instances.
Cohesiveness is the average similarity over all pairs of instances
contained in a class. The next two sections formally describe these
functions.
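Cohesiveness as described here, the average similarity over all distinct pairs of instances in a class, can be sketched as follows. The Jaccard-style overlap used for `similarity` is a stand-in assumption, not the contrast-model function s(A, B) defined in Section 2.1.1:

```python
from itertools import combinations

def similarity(a, b):
    """Placeholder similarity: shared attribute/value pairs over the
    union (a stand-in for the contrast-model s(A, B) of Section 2.1.1)."""
    pa, pb = set(a.items()), set(b.items())
    return len(pa & pb) / len(pa | pb)

def cohesiveness(instances):
    """Average similarity over all distinct pairs of instances in a class."""
    pairs = list(combinations(instances, 2))
    if not pairs:
        return 1.0  # a single-instance class is maximally cohesive
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

group = [{"x": 1, "y": 2}, {"x": 1, "y": 3}, {"x": 1, "y": 2}]
print(round(cohesiveness(group), 3))  # → 0.556
```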
2.1.1 Similarity
The similarity s of two nodes is based on a comparison between
their two sets of attribute/value pairs. The function is derived from
the contrast model [14], which defines similarity as a linear combi-
nation of common and distinctive attribute/value pairs (features).
Equation 2.1 reviews the similarity function s(A, B), where A and B
denote the descriptions of instance or class nodes a and b, respec-
tively; c(A, B) represents the contribution of the common features
of a and b; d(A, B) introduces the influence of the features of a not
shared by b; |a| is the number of instances stored under the node
associated with class a; |b| is similarly interpreted for class b; and
p(f|A) is the conditional probability of feature f given A.
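For orientation, the contrast model of [14], from which s(A, B) is derived, has the general form below, where A and B are the feature sets of the two objects, f is a salience measure over feature sets, and θ, α, β ≥ 0 weight the common and distinctive components; INC2.5's specific instantiation, in terms of c, d, |a|, |b|, and p(f|A) as defined above, is the one given in Equation 2.1:

```latex
S(a, b) \;=\; \theta\, f(A \cap B) \;-\; \alpha\, f(A \setminus B) \;-\; \beta\, f(B \setminus A)
```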
1041-4347/97/$10.00 © 1997 IEEE
————————————————
• M. Hadzikadic is with Orthopaedic Informatics Research, Carolinas Medical
Center and the Computer Science Department, University of North Carolina at
Charlotte, Charlotte, NC 28223. E-mail: mirsad@uncc.edu.
• B.F. Bohren is with Orthopaedic Informatics Research, Carolinas Medical Cen-
ter, P.O. Box 32861, Charlotte, NC 28232.
E-mail: bfbohren@uncc.edu.
Manuscript received Feb. 21, 1995; revised Jan. 16, 1996.
For information on obtaining reprints of this article, please send e-mail to:
transkde@computer.org, and reference IEEECS Log Number K96068.