Identification with Probability One of Stochastic Deterministic Linear Languages

Colin de la Higuera 1 and Jose Oncina 2‡

1 EURISE, Université de Saint-Etienne, 23 rue du Docteur Paul Michelon, 42023 Saint-Etienne, France
cdlh@univ-st-etienne.fr, WWW home page: http://eurise.univ-st-etienne.fr/~cdlh

2 Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Ap.99. E-03080 Alicante, Spain
oncina@dlsi.ua.es, WWW home page: http://www.dlsi.es/~oncina

Abstract. Learning context-free grammars is generally considered a very hard task. This is even more the case when learning has to be done from positive examples only. In this context, one possibility is to learn stochastic context-free grammars, under the implicit assumption that the distribution of the examples is given by such an object. Nevertheless, this is still a hard task for which no algorithm is known. We use recent results to introduce a proper subclass of linear grammars, called deterministic linear grammars, for which we prove that a small canonical form can be found. The existence of such a form has proved to be a key condition for a learning algorithm to be possible. We propose an algorithm for this class of grammars and prove that it works in polynomial time and structurally converges to the target in the paradigm of identification in the limit with probability 1. Although this does not ensure that a polynomial-size sample suffices for learning, we argue that the criterion means that no added (hidden) bias is present.

1 Introduction

Context-free grammars are known to have a greater modeling capacity than regular grammars or finite state automata. Learning these grammars is also harder, but it is considered an important and challenging task. Yet without external help, such as knowledge of the structure of the strings [Sak92], only clever but limited heuristics have been proposed [LS00,NMW97].
‡ The author thanks the Generalitat Valenciana for partial support of this work through project CETIDIB/2002/173.

When no negative examples exist, or when the actual problem is that of building a language model, stochastic context-free grammars have been proposed. In a number of applications (computational biology [SBH+94] and speech recognition [WA02] are just two typical examples), it is speculated that success will