An Integrated Characterization and Discrimination Scheme to Improve Learning Efficiency in Large Data Sets

Roberto GEMELLO, Franco MANA
CSELT - Centro Studi e Laboratori Telecomunicazioni S.p.A.
via G. Reiss Romoli, 274 - 10148 Torino (Italy)

Abstract

This work proposes a learning scheme which integrates Characterization and Discrimination activities with the aim of improving learning efficiency in large data sets. Characterization is considered to be a process which builds a rough concept description using only positive examples; this description already excludes most of the extreme negative examples. Discrimination is considered to be an incremental learning process which starts from the characteristic description and refines it so as to make it consistent with the negative examples (near misses) which are still covered. During this phase learning efficiency is greatly improved by considering only near misses as counter-examples. Finally, the description is simplified by dropping some characterizing but not discriminant parts of the description. This learning scheme is discussed and compared with traditional data reduction techniques. Some experimental results are reported which show the gain in efficiency obtained, particularly on real applicative domains.

1 Introduction

Machine Learning has been one of the main theoretical topics of AI research over the past decade, and several learning techniques have been widely investigated. Among them, Inductive Learning from Examples has reached an advanced level and is beginning to move out of the labs to face real applicative problems. In taking this step, learning systems can no longer rely on techniques designed for the few hand-coded examples of artificial domains; they must be adapted to manage the large amounts of real data present in the environment in which the learning system operates.
On the one hand, this impact with large databases of samples is positive for learning systems: they can demonstrate their ability to learn rules which are not merely a summary of the examples but incorporate an ability to make predictions, which, in turn, can be tested on statistically relevant test sets. On the other hand, an increased number of samples can cause inefficiency due to the complex computations involved, especially in systems which use a first order representation language, whose greater learning power calls for much greater computational effort. For this reason it is necessary to study techniques which allow the examples to be used more efficiently in the inductive process. These techniques have often been called Data Reduction Techniques [Michalski and Larson, 1978; Cramm, 1983; Pollack, 1983], and their aim is to cut down computational effort by reducing the number of examples involved in the learning process, without compromising the meaningfulness of the learned knowledge. Unfortunately, the methods which have been proposed up to now are either inadequate for a first order representation language or, in their turn, computationally too expensive.

After a brief review of the present state of the art, a new data reduction technique is presented, which adopts an approximation of the characterization as the evaluation criterion to select the counter-examples for each class. This technique returns to the classical concept of near miss introduced by Winston [1979] and proposes a more operational definition of that concept, which is used to reduce the number of counter-examples that have to be taken into account during the discrimination process.
The main idea is as follows: first, a costless approximation φ* of the characterization φ is computed for each class, using only the positive examples; then, the class counter-examples covered by φ* are defined as near misses of the class (the most useful counter-examples for the computation of a discriminant description of the class); finally, a discriminant description ψ of the class is obtained through an incremental learning process which, taking φ* as the starting hypothesis, specializes and simplifies it so as to make it consistent and less complex. The learned knowledge (a first order discriminant formula for each class) is proved to be complete and consistent (within a prescribed tolerance) with the original examples.
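The three-step flow above can be illustrated with a deliberately simplified sketch. The paper works in a first order representation language; the code below is a hypothetical attribute-value (propositional) analogue, where the approximate characterization φ* is just a per-attribute interval box covering all positives, near misses are the negatives that box still covers, and specialization shrinks one interval at a time. All names and the interval representation are illustrative assumptions, not the authors' actual formalism.

```python
# Toy attribute-value analogue of the characterize-then-discriminate scheme.
# phi* is represented as (lows, highs): one interval per numeric attribute.

def characterize(positives):
    """Approximate characterization phi*: the smallest interval box
    covering every positive example."""
    n = len(positives[0])
    lows = [min(ex[i] for ex in positives) for i in range(n)]
    highs = [max(ex[i] for ex in positives) for i in range(n)]
    return lows, highs

def covers(phi, example):
    """True if the description phi covers the example."""
    lows, highs = phi
    return all(lo <= x <= hi for lo, x, hi in zip(lows, example, highs))

def near_misses(phi, negatives):
    """Near misses: only the counter-examples still covered by phi*.
    Extreme negatives are discarded, which is where the efficiency
    gain of the scheme comes from."""
    return [ex for ex in negatives if covers(phi, ex)]

def discriminate(phi, misses):
    """Incrementally specialize phi* until no near miss is covered.
    Here specialization just nudges the interval bound closest to the
    miss -- a crude stand-in for the paper's refinement operators, and
    it may also uncover some positives (the paper allows a tolerance)."""
    lows, highs = list(phi[0]), list(phi[1])
    eps = 1e-9
    for ex in misses:
        while covers((lows, highs), ex):
            # pick the attribute where the miss sits closest to a bound
            i = min(range(len(ex)),
                    key=lambda j: min(ex[j] - lows[j], highs[j] - ex[j]))
            if ex[i] - lows[i] <= highs[i] - ex[i]:
                lows[i] = ex[i] + eps
            else:
                highs[i] = ex[i] - eps
    return lows, highs
```

Under this reading, `characterize` plays the role of φ*, `near_misses` implements the operational near-miss definition (a negative is a near miss iff φ* covers it), and `discriminate` produces the consistent description ψ.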