Natural Language Processing Based on Semantic Inferentialism for Extracting Crime Information from Text

Vládia Pinheiro
Departamento de Ciências da Computação
Federal University of Ceará (UFC)
Fortaleza, Ceará, Brazil
vladia@lia.ufc.br

Vasco Furtado¹, Tarcisio Pequeno², Douglas Nogueira³
Mestrado em Informática Aplicada
University of Fortaleza (UNIFOR)¹,²,³ and ETICE¹
Fortaleza, Ceará, Brazil
vasco@unifor.br¹, tarcisio@unifor.br²

Abstract: This article describes an architecture for Information Extraction systems on the web, based on Natural Language Processing (NLP) and especially geared toward the exploration of information about crime. The main feature of the architecture is its NLP module, which is based on the Semantic Inferential Model. We demonstrate the feasibility of the architecture by implementing it to provide input for WikiCrimes, a collaborative web-based system for registering crimes.

Keywords: Natural Language Processing, Information Extraction, Crime Information

I. INTRODUCTION

Textual accounts are among the main sources of information used by public safety analysts and managers. These reports range from descriptions of crimes that have occurred to profiles of persons monitored by intelligence agencies. They bring out the characteristics, peculiarities and inter-relationships of events and persons, and they allow analysts to perceive trends in order to produce knowledge and effective actions to improve public safety. However, reading a vast amount of information in natural language is a difficult and time-consuming task. Therefore, a Natural Language Processing (NLP) system that assists in understanding these texts is of great value. In most of the approaches to NLP used today, the apprehension of meaning is restricted to explicit textual information.
In this paper, we describe a framework for information extraction that is based on a new model for the expression and semantic analysis of texts in natural language: the Semantic Inferential Model, or SIM [1]. This model aims to build an NLP system with an added layer for understanding texts, which supports reasoning on both the implicit and the explicit content of the text. For example, based on the description of a crime, new knowledge can be inferred, such as probable locations of crimes, use of weapons and violence, gang participation, the profile of the criminal, etc. Often this knowledge is not explicit in the text, and it can only be perceived if we have a network of premises and conclusions of the sentences used in the text.

In addition to allowing richer inferences that are of greater interest for police intelligence work, the model gives analysts a tool that automates the search for relevant text on crimes within a vast universe of sources. Our proposal rests on the argument that the ability to investigate a larger number of information sources, coupled with improved semantic analysis of those sources, will raise the quality of the actions undertaken for public safety.

In this paper, we propose an architecture for Information Extraction (IE) systems that explore texts for information useful to public safety. In particular, we exemplify the use of the architecture to explore descriptions of criminal occurrences that provide input for a collaborative web-based system for registering crimes, called WikiCrimes [2] (www.wikicrimes.org). A software program called WikiCrimes Information Extractor (WikiCrimes IE) was developed under the proposed architecture. In addition to the innovative NLP module, the architecture prescribes the use of user-oriented programming tools that enable improved interaction with, and manipulation of, natural-language content available on the web.
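
The idea of deriving implicit knowledge from a network of premises and conclusions, described earlier in this introduction, can be illustrated with a minimal sketch. This is not the SIM itself: the rules, the fact strings, and the function names `infer` and `implicit_facts` are hypothetical, and simple forward chaining over hand-written rules stands in for the model's actual reasoning, which this section does not detail.

```python
# Hypothetical rules: each pairs a set of premises with one conclusion.
# They are illustrative only and are not taken from the SIM knowledge bases.
RULES = [
    (frozenset({"victim was shot"}), "a firearm was used"),
    (frozenset({"a firearm was used"}), "the crime involved violence"),
    (frozenset({"several men fled on motorcycles"}),
     "more than one offender took part"),
    (frozenset({"more than one offender took part", "a firearm was used"}),
     "possible gang participation"),
]


def infer(explicit_facts, rules=RULES):
    """Return all facts derivable from the explicit ones (forward chaining)."""
    known = set(explicit_facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule when all of its premises are already known.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def implicit_facts(explicit_facts, rules=RULES):
    """Return only the facts that were inferred, not stated in the text."""
    return infer(explicit_facts, rules) - set(explicit_facts)


if __name__ == "__main__":
    stated = {"victim was shot", "several men fled on motorcycles"}
    for fact in sorted(implicit_facts(stated)):
        print(fact)
```

Separating the stated facts from the inferred ones mirrors the distinction the text draws between explicit and implicit content: the inferred set is the knowledge a reader restricted to the explicit text would miss.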
Finally, we discuss the assessment of the WikiCrimes IE system, related works, and future efforts.

II. EXTRACTION OF INFORMATION ON PUBLIC SAFETY

The aim of Information Extraction (IE) systems is to automatically find and extract the relevant information in a document or collection of documents containing text in natural language, and to structure this information according to output patterns (for example, in a database or as natural-language text) in order to facilitate its manipulation and analysis [3].

The architecture for IE systems proposed in this article covers all modules of Grishman's architecture, proposed in [3], and adds modules aimed at improving interaction with users so that information extraction can be done interactively. Figure 1 shows the architecture, which comprises four main modules: Source, User-Oriented Programming Tool, Syntactic Parser and the SIM Framework.

The User-Oriented Programming Tool allows the development of commands to capture information, usually contained on web pages, thus facilitating its manipulation. This tool is proposed in the architecture due to a

978-1-4244-6446-3/10/$26.00 © 2010 IEEE 19 ISI 2010, May 23-26, 2010, Vancouver, BC, Canada