A BROAD-COVERAGE PARSER FOR KNOWLEDGE ACQUISITION FROM TECHNICAL TEXTS

Sylvain Delisle
Department of Computer Science, University of Ottawa
Ottawa, Ontario, Canada, K1N 6N5
sylvain@csi.uottawa.ca

Stan Szpakowicz*
Computer Science Department, University of the Witwatersrand
P O Wits, Johannesburg 2050, South Africa

1. INTRODUCTION

We consider an application of natural language processing (NLP) to the realistic task of acquiring knowledge from technical texts. The parser we describe cannot rely on rich a priori domain-specific knowledge; it must remain usable despite this scarcity of semantics. This paper presents the parser's syntactic component; we discuss its current state and how it should evolve in the near future.

1.1 The Parser as Part of a Text Processing System

The parser is a central component of an experimental text processing system under construction at our Department. This system, called KATE (Knowledge Acquisition from TExts), will semi-automatically process technical text and incrementally build a conceptual model of the domain. Our approach relies on two important assumptions: 1) the process of linear¹ knowledge-based text understanding can be converted into incremental knowledge acquisition; 2) the text is linguistically correct and describes the domain. The organization of KATE is presented in [Szpakowicz 90].

The parser, the language processing component of KATE, should accept most sentences found in the text. Without such a broad-coverage parser, automatic acquisition of knowledge from text simply cannot be realized. This is because, without a rich semantic model², syntax is the only support for meaning: the broader the parser's coverage, the better the representation of meaning. (This is not to say that syntax and semantics are the same. We only maintain that, in the absence of a detailed semantic model of the domain, syntax is the next best way of getting at the meaning.) At present, the

* On leave from the University of Ottawa.
¹ Linear means that sentences are processed sequentially, and the text is so written that sequential reading suffices to understand it without much "jumping around".
² It is essential for us not to assume any such a priori model. Indeed, the goal is to (semi-automatically) construct that model!