A Case Study in Partial Parsing Unstructured Text
Chia-Chu Chiang*, John Talburt, Ningning Wu, Elizabeth Pierce, Chris Heien, Ebony Gulley,
and JaMia Moore
*Department of Computer Science
Department of Information Science
University of Arkansas at Little Rock
2801 South University Avenue
Little Rock, Arkansas 72203-1099, USA
E-mail: {cxchiang|jrtalburt}@ualr.edu
Abstract
This paper presents a parsing method for extracting entities
from open source documents. A web page of interest is first
downloaded to a text file. The method then applies a set of
patterns to the text file to extract entity fragments of
interest. The patterns are currently designed specifically
for obituary announcements. The extracted entities must then
be identified before they are loaded into a database, and an
entity resolution process is presented to determine their
actual identities. A case study illustrates the method and
reports its results. Although the results show that the
method is neither technically effective nor promising, they
help clarify how well or poorly a quick parsing technique
extracts entities of interest from obituaries on the web.
More effective techniques should be considered to improve
the extraction results.
Keywords: Data mining, Information extraction, Parsing,
Pattern matching, Text mining.
1. Introduction
Every day, many news articles and announcements are added to
the Internet. Such information can be very useful to
businesses. However, it is often presented in an unstructured
format that is not suitable for automatic extraction by
computer. Hence the growing interest in text mining, which
focuses on mining data from textual sources.
The automated task of text extraction usually begins with
downloading web pages into text files, followed by removing
HTML tags from the text, scanning the text into tokens, and
sentential parsing. The success of this task depends heavily
on how accurately the entities of interest can be extracted.
One typical method of finding entities is to parse the text
partially (shallow parsing) without attempting a complete
sentence parse (deep parsing).
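The tag removal and token scanning steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the page has already been downloaded into a string, and the names TextOnly, strip_tags, and tokenize are hypothetical. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text content of a page and ignores the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only text that is outside script and style elements.
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Split text into words, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+|[^\w\s]", text)


page = ("<html><body><h1>Obituary</h1>"
        "<p>John Smith, 82, of Little Rock died.</p>"
        "<script>var x=1;</script></body></html>")
text = strip_tags(page)
tokens = tokenize(text)
```

The token list produced here is the input that a shallow parser or a set of patterns would then scan for entities.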
In this paper, we present a text parsing method that locates
entities using a pattern-matching approach. The rationale for
this approach is our hypothesis that the attributes of
certain entities make it possible to determine the roles of
those entities in the text with reasonable accuracy.
Unfortunately, due to the ambiguity of natural language, this
approach may fail to find patterns or may match patterns
mistakenly in some circumstances. In addition, we would like
to answer the following questions. How well or poorly does
this method extract the entities of interest in the
particular domain of obituaries? Should we look for another,
more effective method to improve the extraction results, or
just revise the method to provide satisfactory results for
our customers? A case study is presented to illustrate the
method and answer these questions. This paper also describes
the problems and challenges encountered with the method in
the process of extracting entities from obituaries on the
web.
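To make the pattern-matching idea concrete, the sketch below applies one regular expression to obituary sentences. The pattern is an assumption made for illustration only (a name, an age, and a city in the form "Name, age, of City"); it is not one of the patterns used in the study, and DECEASED and extract_deceased are hypothetical names.

```python
import re

# Hypothetical pattern: "<Name>, <age>, of <City>".
# A name is a capitalized word, an optional middle initial,
# and one or more further capitalized words.
DECEASED = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z]\.?)?(?: [A-Z][a-z]+)+), "
    r"(?P<age>\d{1,3}), of "
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def extract_deceased(text):
    """Return the first name/age/city match as a dict, or None if no match."""
    match = DECEASED.search(text)
    return match.groupdict() if match else None


# Intended match: the attributes identify the role of the entity.
hit = extract_deceased("Mary A. Johnson, 82, of Little Rock, died Monday.")

# Missed match: the same facts in a different word order.
miss = extract_deceased("Johnson, Mary A., age 82, passed away Monday.")

# Mistaken match: a survivor is reported as if he were the deceased.
wrong = extract_deceased(
    "He is survived by his brother Robert Johnson, 79, of Conway.")
```

The second and third calls show the two failure modes mentioned above: a pattern that does not fire on a valid obituary, and a pattern that fires on an entity playing a different role.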
2. Related work
The first question raised when performing text mining is
which parsing method to apply: deep parsing or shallow
parsing. Deep (full) parsing has been reported to be very
slow and very error prone [1]. Shallow parsing is good enough
for information extraction, since it focuses on important
phrases and skips irrelevant parts.
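As a rough illustration of shallow parsing, the sketch below groups part-of-speech-tagged tokens into noun groups and skips everything else, without building a full sentence parse. It assumes the tokens have already been tagged; the tags in the example are assigned by hand, and the function name noun_groups is hypothetical.

```python
from itertools import groupby

# Tags that may appear inside a noun group (hand-assigned in this sketch).
NOUN_GROUP_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}


def noun_groups(tagged):
    """Return maximal runs of noun-group tags that contain at least one noun."""
    groups = []
    for in_group, run in groupby(tagged, key=lambda wt: wt[1] in NOUN_GROUP_TAGS):
        run = list(run)
        # Keep a run only if it actually contains a noun or proper noun.
        if in_group and any(tag in ("NOUN", "PROPN") for _, tag in run):
            groups.append(" ".join(word for word, _ in run))
    return groups


tagged = [
    ("The", "DET"), ("beloved", "ADJ"), ("teacher", "NOUN"),
    ("died", "VERB"), ("peacefully", "ADV"), ("in", "ADP"),
    ("Little", "PROPN"), ("Rock", "PROPN"), (".", "PUNCT"),
]
chunks = noun_groups(tagged)
```

Only the phrases likely to hold entities are kept; the verb, adverb, preposition, and punctuation are passed over, which is the trade-off that makes shallow parsing fast.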
FASTUS (Finite State Automaton Text Understanding System) is
a system for extracting information from free text in natural
languages, including English and Japanese [2, 3]. FASTUS
performs deep parsing on a text. First, the system recognizes
names and phrases, within which groups of nouns, verbs, and
prepositions are further recognized.
Complex noun groups and verb groups are then con-
Fifth International Conference on Information Technology: New Generations
978-0-7695-3099-4/08 $25.00 © 2008 IEEE
DOI 10.1109/ITNG.2008.68
447