A Case Study in Partial Parsing Unstructured Text

Chia-Chu Chiang*, John Talburt, Ningning Wu, Elizaberth Pierce, Chris Heien, Ebony Gulley, and JaMia Moore

*Department of Computer Science
Department of Information Science
University of Arkansas at Little Rock
2801 South University Avenue
Little Rock, Arkansas 72203-1099, USA
E-mail: {cxchiang|jrtalburt}@ualr.edu

Abstract

This paper presents a parsing method for extracting entities from open source documents. A web page of interest is first downloaded to a text file. The method then applies a set of patterns to the text file to extract entity fragments of interest. The patterns are currently designed specifically for obituary announcements. Once the entities are extracted, the next step is to identify them before they are loaded into a database. An entity resolution process is presented to determine the actual identities. A case study illustrates the method, and its results are presented.

Although the results show that the method is neither technically effective nor promising, they do help us understand how well or poorly a quick parsing technique extracts entities of interest from obituaries on the web. More effective techniques should be considered to improve the extraction results.

Keywords: Data mining, Information extraction, Parsing, Pattern match, Text mining.

1. Introduction

Every day, many news articles and announcements are added to the Internet. Such information can be very useful for business. However, it is often represented in an unstructured format that is not suitable for automatic extraction by computer. Hence the growing interest in text mining, which focuses on mining data from textual sources. The automated task of text extraction usually begins with downloading web pages into text files, followed by removing HTML tags from the text, scanning the text into tokens, and sentential parsing.
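The first steps of this pipeline (tag removal and token scanning) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names `strip_tags` and `tokenize`, the token pattern, and the sample page content are all assumptions made for illustration, and the download step is replaced by an inline string.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and drops the markup around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def tokenize(text):
    """Split text into word tokens (keeping internal ' and -) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)


# Hypothetical page content standing in for a downloaded obituary page.
SAMPLE_HTML = (
    "<html><body><h1>Obituaries</h1>"
    "<p>Jane Doe, 82, of Little Rock, died Monday.</p></body></html>"
)

plain_text = strip_tags(SAMPLE_HTML)
tokens = tokenize(plain_text)
print(plain_text)
print(tokens)
```

This sketch keeps every text node, including the contents of script or style elements, so a real page would likely need additional filtering before the tokens are passed to a parser.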
The success of this task depends heavily on how accurately the entities of interest can be extracted. One typical method of finding entities is to parse the text partially (shallow parsing) without attempting a complete sentence parse (deep parsing).

In this paper, we present a text parsing method that locates entities using a pattern match approach. The rationale for this approach rests on our hypothesis that the attributes of certain entities make it possible to determine the roles of those entities in the text with reasonable accuracy. Unfortunately, because natural language is ambiguous, this approach may fail to find patterns or may match patterns mistakenly in some circumstances. In addition, we would like to answer the following questions. How well or poorly does this method extract the entities of interest in the particular domain of obituaries? Should we look for another, more effective method to improve the extraction results, or just revise the method to provide satisfactory results for our customers? A case study is presented to illustrate the method and answer these questions. This paper also describes the problems and challenges encountered with the method in the process of extracting entities from obituaries on the web.

2. Related work

The first question raised when performing text mining is which parsing method should be applied: deep parsing or shallow parsing. It has been reported that deep (full) parsing is very slow and very error prone [1]. Shallow parsing, which focuses on the phrases important for information extraction and skips irrelevant parts, is good enough. FASTUS (Finite State Automaton Text Understanding System) is a system for extracting information from free text in natural languages including English and Japanese [2, 3]. FASTUS performs deep parsing on a text.
First, the system recognizes names and phrases in which the groups of nouns, verbs, and prepositions are further recognized. Complex noun groups and verb groups are then con-

Fifth International Conference on Information Technology: New Generations 978-0-7695-3099-4/08 $25.00 © 2008 IEEE DOI 10.1109/ITNG.2008.68 447