Automated Information Extraction from Empirical Software Engineering Literature: Is that possible?

Daniela Cruzes 1,2, Manoel Mendonça 1
1 NUPERC/UNIFACS, Salvador, BA, Brazil
2 FEEC/UNICAMP, Campinas, SP, Brazil
{daniela, mgmn}@unifacs.br

Victor Basili
Dept. of Computer Science, University of Maryland, College Park, MD, 20742, USA
basili@cs.umd.edu

Forrest Shull
Fraunhofer Center - Maryland, 4321 Hartwick Road, College Park, MD, 20740, USA
fshull@fc-md.umd.edu

Mario Jino
FEEC/UNICAMP, Caixa Postal 6101, 13083-970 Campinas (SP), Brazil
jino@dca.fee.unicamp.br

Abstract

The number of scientific publications is constantly increasing, and the number of results published on Empirical Software Engineering is growing even faster. Some software engineering publishers have begun to collaborate with research groups to make repositories of software engineering empirical data available. However, these initiatives are limited by data ownership and privacy issues. As a result, many researchers in the area have adopted systematic reviews as a means to extract empirical evidence from published material. Systematic reviews, however, are labor intensive and costly. In this paper, we argue that the use of information extraction tools can support systematic reviews and significantly speed up the creation of repositories of SE empirical evidence.

1. Introduction

The number of scientific publications is continuously increasing, and the number of journals reporting results from Empirical Software Engineering is also growing. In this scenario, it is important to have approaches for executing secondary studies, i.e., studies that draw conclusions from the evidence collected in previous studies. The Systematic Review [3] is quickly becoming the approach of choice for integrating evidence from the Software Engineering literature.
The systematic review process requires that a researcher identify a comprehensive collection of articles, extract information from those articles, verify the accuracy of the extracted facts, and analyze those facts using either qualitative or quantitative techniques. Although a systematic review accurately captures evidence, the process is costly, taking several months from conception to publication [2] and many hours of effort [7]. It is therefore unquestionable that the area would profit from tools and methods that help to locate, organize, and summarize information for systematic reviews, as well as to synthesize it into usable knowledge [4]. The question one should ask is: can such tools be built?

This paper investigates the use of Text Mining to accomplish some of these tasks. In particular, it focuses on the use of information extraction techniques to locate and organize information in documents for systematic reviews. Text Mining (TM) is about looking for patterns in natural language text [14]. It recognizes that complete understanding of natural language text is not attainable and instead focuses on extracting small pieces of information from text with high reliability.

Information Extraction (IE) is a technique used to detect relevant information in large documents and present it in a structured format; it analyzes the text and locates specific pieces of information in it [10]. IE is one of the most prominent techniques currently used in TM and is a natural starting point for analyzing unstructured text. In particular, by combining Natural Language Processing (NLP) tools, lexical resources, and semantic constraints, it can provide effective modules for mining documents from various domains [10]. Peshkin and Pfeffer [11] define IE as the task of filling template information from previously unseen text that belongs to a pre-defined domain.
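This template-filling view of IE can be illustrated with a minimal sketch. The template fields, regex patterns, and sample abstract below are purely hypothetical stand-ins for the NLP machinery a real IE system would use; they are not taken from any tool cited in this paper.

```python
import re

# Hypothetical extraction template for empirical SE abstracts.
# Each field maps to a pattern whose first capture group is the value.
TEMPLATE_PATTERNS = {
    "subjects": re.compile(r"(\d+)\s+(?:subjects|participants|students)"),
    "study_type": re.compile(
        r"\b(case study|controlled experiment|survey)\b", re.IGNORECASE),
}

def extract(text):
    """Fill one template record from previously unseen text."""
    record = {}
    for field, pattern in TEMPLATE_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record

abstract = ("We ran a controlled experiment with 24 students "
            "to compare two inspection techniques.")
print(extract(abstract))
# → {'subjects': '24', 'study_type': 'controlled experiment'}
```

Each filled record has the same structure, so the results can be loaded directly into a database for later analysis; a realistic system would of course replace the regular expressions with NLP components and semantic constraints, as discussed above.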
The goal of IE is to extract from documents salient facts about pre-specified types of events, entities, or relationships. These facts can then be entered automatically into a database, which may be used for further processing. Although this approach has been used for systematic reviews in other fields [4], empirical software