LEARNING CONCEPT DESCRIPTIONS FROM EXAMPLES WITH ERRORS

Jakub Segen
AT&T Bell Laboratories, Holmdel, N.J. 07733

ABSTRACT

This paper presents a scheme for learning complex descriptions, such as logic formulas, from examples with errors. The basis for learning is provided by a selection criterion which minimizes a combined measure of the discrepancy of a description with the training data and the complexity of the description. Learning rules for two types of descriptors are derived: one for finding descriptors with good average discrimination over a set of concepts, and a second for selecting the best descriptor for a specific concept. Once these descriptors are found, an unknown instance can be identified by a search that uses descriptors of the first type for fast screening of candidate concepts, and descriptors of the second type for the final selection of the closest concept.

1. Introduction

While the majority of AI work on learning concentrates on error-free domains, there is an acknowledged need for learning techniques directed towards noisy data [Dietterich and Michalski, 1983], [Mitchell, 1982]. A problem of major importance in learning from data with errors is the choice of the preference criterion for ranking competing descriptions. Criteria such as maximum likelihood, minimum error, or minimum estimated entropy, which are generally used for inference from noisy data, suffice for inferring simple parametric models but are not well suited to learning in the rich spaces of symbolic descriptions used in AI. These criteria minimize the discrepancy between a description and the training observations. If the language used to form descriptions is sufficiently rich to express the training data, such criteria will rank a description that exactly matches the training observations as better than or equal to any other description.
For example, if the space of descriptions includes predicate calculus expressions, a concept A represented in the training set by three instances, whose parameter "length" assumes the values 4.6, 5.2, and 5.7, might generate the description: (length(A) = 4.6 OR length(A) = 5.2 OR length(A) = 5.7). Any errors in the training data will be represented in such an overspecified description along with possible regularities. A version of this problem, known as the "curse of dimensionality", appears even with simple vector models when the number of dimensions is not specified [Kanal, 1974].

One way of preventing the inference process from generating overspecified descriptions is to include some measure of description complexity in the preference criterion, to bias it towards simple descriptions. This idea is well known in the philosophy of science (Occam's razor), and various measures of complexity (or simplicity) have been proposed in the AI literature [Michalski and Stepp, 1983], [Michalski, 1983], [Mitchell, 1980], [Buchanan and Mitchell, 1978]. A specific question is the trade-off between the complexity of a description and its discrepancy with the data. A criterion that objectively combines these two measures by relating both to Kolmogorov's complexity [Kolmogorov, 1968] was introduced in [Segen, 1980] and called the minimal representation criterion.

In this paper we apply the minimal representation criterion to derive general rules for learning concept descriptions from noisy training data. These rules can be used to learn symbolic descriptors, such as logic formulas, as well as parametric models. In Section 2 of this paper we summarize the minimal representation criterion; in Section 3 we apply it to derive selection rules for two types of descriptors, concept-specific descriptors and globally useful system descriptors, and to decide which descriptors should be used with default values.
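The trade-off between description complexity and discrepancy with data, as illustrated by the "length" example above, can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the procedure derived in this paper: it assumes that lengths are measured on a 0.1 grid over [0, 10), that each description induces a uniform distribution over the values it admits, and that writing down one constant costs log2(100) bits. The names `GRID_SIZE`, `discrepancy_bits`, `complexity_bits`, and `score` are hypothetical.

```python
import math

# Assumption: lengths lie on a grid 0.0 .. 9.9 with step 0.1, stored as
# integer tenths so that membership tests are exact.
GRID_SIZE = 100
BITS_PER_VALUE = math.log2(GRID_SIZE)  # assumed cost of naming one constant

training = [46, 52, 57]  # 4.6, 5.2, 5.7 from the example in the text


def discrepancy_bits(support, data):
    """Return -sum(log2 P(y)) under a uniform distribution on `support`."""
    p = 1.0 / len(support)
    total = 0.0
    for y in data:
        if y not in support:
            return math.inf  # the description rules this observation out
        total += -math.log2(p)
    return total


def complexity_bits(num_constants):
    """Bits needed to write down a description with this many constants."""
    return num_constants * BITS_PER_VALUE


# Description 1: length = 4.6 OR length = 5.2 OR length = 5.7 (3 constants).
memorizer = {"support": set(training), "constants": 3}
# Description 2: 4.6 <= length <= 5.7 (2 constants, 12 admissible values).
interval = {"support": set(range(46, 58)), "constants": 2}


def score(desc, data):
    """Return (discrepancy, complexity, combined cost), all in bits."""
    d = discrepancy_bits(desc["support"], data)
    c = complexity_bits(desc["constants"])
    return d, c, d + c


for name, desc in [("memorizer", memorizer), ("interval", interval)]:
    d, c, total = score(desc, training)
    print(f"{name}: discrepancy={d:.2f} complexity={c:.2f} total={total:.2f}")
```

Under these assumptions the exact-match description has the lower discrepancy, as any pure discrepancy criterion would prefer, while the interval description has the lower combined cost. With only three instances the margin is small, and it depends on the assumed grid and coding costs.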
In Section 4 we show how to apply both types of descriptors to classify instances using bottom-up and top-down strategies.

2. Minimal Representation Criterion

Consider the problem of finding a program for a Turing machine that generates a given finite sequence of observations. While there are infinitely many programs for any such sequence, it seems reasonable to choose the shortest program, since it represents the least commitment and minimum redundancy. If we treat a program as a randomly generated binary sequence with 0's and 1's having equal probability, then the shortest program is also the most probable one. The problem of selecting a probability model P(y) from a sequence of observations $y_1, \ldots, y_n$ can be recast as a case of the above problem by establishing an isomorphism between the class of probability distributions and a subset of programs for a Turing machine [Segen, 1980]. Selecting the shortest program in this subset corresponds to finding a probability distribution minimizing the expression

$-\sum_{i=1}^{n} \log P(y_i) + S(P(y))$    (1)

where S(P(y)) is the number of bits needed to specify the probability distribution P(y). All logarithms used in this paper are to the base 2.

The above criterion for estimating the probability distribution has been called the minimal representation criterion. Its main difference from the maximum likelihood criterion (equivalent to seeking a minimum of $-\sum_{i=1}^{n} \log P(y_i)$) comes from the term $S(P(y))$, which is a measure of the complexity of the specification of the probability distribution P(y). Including it in the criterion in effect penalizes more complex distributions. Properties of the minimal representation criterion were treated formally in [Segen, 1980]. It has been applied to discover patterns in a continuous signal and in a symbol sequence, and to such problems as selecting the number of clusters.

3.
Choosing Concept Descriptors

The problem of selecting a single descriptor for each concept can be stated as follows. Given is a training set T, consisting of a set of instances and a concept assignment for each of the instances. Also given is a space F of functions, which we call descriptor functions or descriptors, defined on the domain of instances. A descriptor can be any computable function with a probability distribution defined on its range of values. For each of the concepts we want to select a descriptor that is most