Natural Language Processing Based on Semantic Inferentialism for Extracting Crime Information from Text
Vládia Pinheiro
Departamento de Ciências da Computação
Federal University of Ceará (UFC)
Fortaleza, Ceará, Brazil
vladia@lia.ufc.br
Vasco Furtado¹, Tarcisio Pequeno², Douglas Nogueira³
Mestrado em Informática Aplicada
University of Fortaleza (UNIFOR)¹,²,³ and ETICE¹
Fortaleza, Ceará, Brazil
vasco@unifor.br¹, tarcisio@unifor.br²
Abstract— This article describes an architecture for Information
Extraction systems on the web, based on Natural Language
Processing (NLP) and especially geared toward the exploration of
information about crime. The main feature of the architecture is
its NLP module, which is based on the Semantic Inferential
Model. We demonstrate the feasibility of the architecture by
implementing it to provide input for WikiCrimes, a
collaborative web-based system for registering crimes.
Keywords: Natural Language Processing, Information Extraction, Crime
Information
I. INTRODUCTION
Among the main sources of information used by public
safety analysts and managers are textual accounts. These
reports range from descriptions of crimes that have occurred, to
information on profiles of persons monitored by intelligence
agencies. These reports bring out the characteristics,
peculiarities and inter-relationships of the events/persons and
allow one to perceive trends in order to produce knowledge and
effective actions to improve public safety. However, reading
the vast amount of information in natural language is a difficult
and time-consuming task. Therefore, a Natural Language
Processing (NLP) system that assists in understanding
these texts is of great value.
In most of the approaches to NLP used today, apprehension
of meaning is restricted to explicit textual information. In this
paper, we describe a framework for information extraction that
is based on a new model for the expression and semantic
analysis of texts in natural language: the Semantic Inferential
Model, or SIM [1]. This model aims to build an NLP system
with an added layer for understanding texts, which supports
reasoning over both the implicit and explicit content of the text. For
example, based on the description of a crime, new knowledge
can be inferred, such as: probable locations of crimes, use of
weapons and violence, gang participation, profile of the
criminal, etc. Often this knowledge is not explicit in the text,
and can only be perceived if we have a network of premises
and conclusions of the sentences used in the text. Besides
enabling richer inferences of greater interest to police
intelligence work, this approach gives analysts a tool that
automates the search for relevant texts on crimes
within a vast universe of documents.
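As a rough illustration of what a "network of premises and conclusions" could look like in code, the following Python sketch attaches premises and conclusions to a few concepts and collects the implicit knowledge they license when a concept appears in a text. It is not the SIM implementation described in [1]: the class `InferentialContent`, the table `TOY_BASE`, the function `infer_implicit`, and every entry in the table are hypothetical and hand-written for this example, and concept detection is reduced to naive substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferentialContent:
    """Hypothetical record: what licenses the use of a concept (premises)
    and what follows from its use (conclusions)."""
    premises: frozenset = frozenset()
    conclusions: frozenset = frozenset()


# Toy, hand-written base. The entries are illustrative assumptions,
# not content taken from the Semantic Inferential Model.
TOY_BASE = {
    "armed robbery": InferentialContent(
        premises=frozenset({"offender has a weapon"}),
        conclusions=frozenset(
            {"use of weapon", "use of violence", "property was taken"}
        ),
    ),
    "gang": InferentialContent(
        conclusions=frozenset({"more than one offender"}),
    ),
}


def infer_implicit(text, base=TOY_BASE):
    """Return the premises and conclusions of every concept whose name
    occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    inferred = set()
    for concept, content in base.items():
        if concept in lowered:
            inferred |= content.premises | content.conclusions
    return inferred


if __name__ == "__main__":
    print(infer_implicit("Armed robbery reported near the market."))
```

None of the returned statements appears literally in the input sentence; they come only from the premises and conclusions attached to the matched concept, which is the kind of implicit content the paragraph above refers to.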
Our proposal rests on the argument that the ability to
investigate a larger number of information sources, coupled
with improved semantic analysis of those sources,
will raise the quality of the actions undertaken for public
safety. In this paper, we propose an architecture for
Information Extraction (IE) systems that mine texts for
information useful to public safety. In particular, we exemplify
the use of the architecture to explore descriptions of criminal
occurrences that provide input for a collaborative web-based
system of registering crimes, called WikiCrimes [2]
(www.wikicrimes.org). A software program called WikiCrimes
Information Extractor (WikiCrimesIE) was developed under the
proposed architecture. The architecture prescribes – in addition
to the innovative NLP module – the use of user-oriented
programming tools that enable improved interaction and
manipulation of natural-language content available on the web.
Finally, we discuss the evaluation of the WikiCrimesIE system,
related work, and future efforts.
II. EXTRACTION OF INFORMATION ON PUBLIC SAFETY
The aim of Information Extraction (IE) systems is to
automatically find and extract the relevant information in a
document or collection of documents containing natural-language
text, and to structure this information into an output format
(a database or natural-language text, for example) in order
to facilitate its manipulation and analysis [3]. The
architecture for IE systems proposed in this article covers all
modules of Grishman’s architecture, proposed in [3], and adds
modules aimed at improving interaction with users so that
information extraction can be done interactively. Figure 1
shows the architecture, which comprises four main modules:
Source, User-Oriented Programming Tool, Syntactic Parser,
and the SIM Framework.
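The flow through the four modules can be sketched in Python as a simple chain of functions. This is only an illustration under strong assumptions: the function names (`capture_text`, `parse_sentences`, `analyze`, `run_pipeline`) and the data passed between stages are hypothetical, and each stage is a trivial stand-in (tag stripping, regular-expression tokenization, keyword lookup) rather than the actual tool, parser, or SIM Framework used in the system.

```python
import re
from typing import Dict, Iterable, List


def capture_text(page: str) -> str:
    """Stand-in for the user-oriented programming tool:
    pull plain text out of a web page by stripping tags."""
    no_tags = re.sub(r"<[^>]+>", " ", page)
    return re.sub(r"\s+", " ", no_tags).strip()


def parse_sentences(text: str) -> List[List[str]]:
    """Stand-in for the syntactic parser:
    split into sentences, then into lowercase word tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [re.findall(r"\w+", s.lower()) for s in sentences]


def analyze(sentences: List[List[str]]) -> List[Dict]:
    """Stand-in for the SIM Framework: a plain keyword lookup,
    NOT the semantic inference the paper describes."""
    crime_words = {"robbery", "theft", "homicide"}  # assumed vocabulary
    records = []
    for tokens in sentences:
        found = sorted(crime_words.intersection(tokens))
        if found:
            records.append({"crime_types": found, "tokens": tokens})
    return records


def run_pipeline(pages: Iterable[str]) -> List[Dict]:
    """Source -> capture -> parse -> analyze, one page at a time."""
    results: List[Dict] = []
    for page in pages:
        results.extend(analyze(parse_sentences(capture_text(page))))
    return results


if __name__ == "__main__":
    source = ["<p>A robbery occurred downtown. The weather was fine.</p>"]
    for record in run_pipeline(source):
        print(record)
```

Keeping each module behind its own function mirrors the separation in the architecture: any stand-in could be swapped for a real component without touching the other stages, provided it accepts and returns the same shapes of data.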
The User-Oriented Programming Tool allows users to
develop commands that capture information, usually
contained in web pages, and so makes that information
easier to manipulate. This tool is proposed in the architecture due to a
978-1-4244-6446-3/10/$26.00 © 2010 IEEE 19 ISI 2010, May 23-26, 2010, Vancouver, BC, Canada