LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Alexandre D. Alves, Horacio H. Yanasse
National Institute for Space Research (INPE)
alexandre.alves@inpe.br, horacio@lac.inpe.br

Nei Y. Soma
Aeronautic Institute of Technology (ITA)
soma@ita.br

Abstract

The Lattes CV system, a curricular information system maintained by CNPq, is the core of the Lattes Platform. This system is undoubtedly the major source of information on Brazilian researchers. This paper describes "LattesMiner", a multilingual domain-specific language for automatic information extraction from Lattes curricula. It is composed of a set of classes written in Java that allow developers to implement their own applications with a high level of abstraction and expressive power. LattesMiner can extract Lattes Platform data for any individual researcher or group of researchers, identified by name or by ID number. The extracted data can be analyzed and used, for instance, to identify academic social networks, regional competences, and the profiles of groups in different areas of research. We illustrate its use with a case study.

Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features

General Terms Domain-Specific Language, Lattes Platform

Keywords Domain-Specific Language, Information Extraction, Academic Social Network

1. Introduction

The Lattes Platform (LP) is an information system deployed by CNPq (National Council for Scientific and Technological Development) to manage information on science, technology and innovation related to researchers and institutions in Brazil [6]. This platform is undoubtedly the major source of information available on Brazilian researchers, as acknowledged in a recent article published in Nature [13]. The article cites the LP as an example of a high-quality database.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPLASH'11 Workshops, October 23–24, 2011, Portland, Oregon, USA. Copyright © 2011 ACM 978-1-4503-1183-0/11/10...$10.00

The LP is maintained by the Brazilian Government and includes information systems, databases and Web portals. The Lattes CV system, a curricular information system, is the main component of the platform. Currently, the Lattes CV system stores around 2,000,000 curricula of researchers, lecturers, students and professionals from diverse areas of knowledge who work in science, technology and innovation.

The Lattes curriculum (Lattes CV) is a document created by CNPq with the objective of standardizing and centralizing academic, professional and personal information about the Brazilian scientific community. With the Lattes CV system, this information can be consulted at any time via the Web.

The data in each curriculum are filled in by the professional him/herself, and they have been used by agencies in the country to evaluate researchers, projects, graduate programs, etc. Hence, the data are continuously updated by the researchers. Furthermore, the scientific community itself monitors the quality and correctness of the information displayed in the system, since resource allocation is based on comparing the curricula of the professionals. Therefore, this system has great potential as a source for high-quality information extraction.

In recent years, many studies have used data extracted from the LP on researchers from different areas of knowledge.
Some of these works analyzed the profile of the Productivity Research Scholarship fellows in areas such as Public Health [4][22], Dentistry [23][5], Medicine [16][14][19] and Chemistry [21]. Other information was also considered, such as the gender and region of the researchers [3] or the statistical correlation between the productivity of researchers and their proficiency in written English [26]. Master's dissertations [7], doctoral theses [17] and many other works also analyze data extracted from the LP. A common problem in these works is that the curricula and the information extracted had to be obtained manually.

This paper describes "LattesMiner", an internal multilingual DSL (Domain-Specific Language) for automatic information extraction from Lattes curricula. Observe that, despite being public and accessible via the Web (http://lattes.cnpq.br/), the access to