A Computational Model to Textual Extraction And Construction of Social And Complex Networks

Patrícia Freitas Braga
Programa de Modelagem Computacional, SENAI Cimatec, Salvador, BA, Brazil
patyfb04@gmail.com

Hernane Borges de Barros Pereira
Programa de Modelagem Computacional, SENAI Cimatec, Salvador, BA, Brazil
Departamento de Ciências Exatas, UEFS, Feira de Santana, BA, Brazil
hbbpereira@gmail.com

Marcelo A. Moret
Programa de Modelagem Computacional, SENAI Cimatec, Salvador, BA, Brazil
Departamento de Física, UEFS, Feira de Santana, BA, Brazil
mamoret@gmail.com

Abstract - This work presents a computational model for extracting specific data from a textual repository in order to build social and complex networks. The structures of these networks are implicit in the texts. The paper describes the modeling process, which involves text mining by regular expressions and the construction of networks. To validate the model, an experimental procedure was applied to build scientific collaboration networks in the context of postgraduate programs.

Keywords - Text Mining; Regular Expressions; Social and Complex Networks; Scientific Collaboration Networks

I. INTRODUCTION

Social and complex networks have topological characteristics that allow their dynamics to be understood. Network behavior can reflect aspects such as structural composition, weaker links, central points, structural vulnerabilities, capacity for expansion, the presence of clusters, and many other features that characterize complex networks. In this research, the objective is to build scientific collaboration networks from the productions of postgraduate programs. The information needed to compose these network structures was available in digital texts; for this reason, text mining and complex networks were the two main pillars in the development of this model.
The data to be extracted from the textual structure are very specific, and for this reason regular expressions were chosen as the text mining technique. There are some related works on text mining and complex networks (e.g. [5]). However, most of this existing research relies on semantic structures to build complex networks. Semantic structures are based on existing relationships among words. In this research, the network structures are implicit in the texts and are not semantic; the relations are built according to a logical meaning.

The development of this model is motivated by an academic need to measure researchers' co-authorship relations in postgraduate programs. One significant work on the same subject is [8], which focused on scientific collaboration networks. However, it used publication databases from a variety of fields to obtain information about publications, authors, and their collaboration networks. By evaluating the results of the proposed model, it was possible to analyze the scientific productivity of researchers and higher education institutions and to understand the dynamics of scientific collaboration networks, among other aspects.

II. TEXT MINING AND REGULAR EXPRESSIONS

According to [2], text mining consists of extracting regularities and recognizing patterns or tendencies in large volumes of text. Text mining works with unstructured information that depends on natural language. The general absence of a well-defined structure in the texts makes it necessary to pre-process them in order to normalize the data. This process is called Natural Language Processing (NLP). In simplified form, in the text mining process the text is processed by NLP, which reduces its dimensionality, and is then submitted to a statistical analyzer that indexes the most frequent terms.
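The simplified pipeline described above (normalization, dimensionality reduction, indexing of frequent terms) can be illustrated with a minimal Python sketch. It is not the procedure of [2] nor the model proposed in this paper; the function names, the stop-word list, and the sample sentence are all assumptions introduced only for illustration, and stop-word removal stands in here for dimensionality reduction in general.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real pipeline would use
# a larger, language-specific list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def normalize(text):
    """Lowercase the text and split it into tokens made only of letters."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented characters
    # (e.g. in Portuguese names) stay inside their tokens.
    return re.findall(r"[^\W\d_]+", text.lower())

def reduce_dimensionality(tokens):
    """Drop stop words, one simple way to reduce the number of distinct terms."""
    return [t for t in tokens if t not in STOP_WORDS]

def index_frequent_terms(text, top_n=3):
    """Return the top_n most frequent remaining terms with their counts."""
    tokens = reduce_dimensionality(normalize(text))
    return Counter(tokens).most_common(top_n)

# Hypothetical sample text, not taken from the paper's repository.
sample = ("Networks of authors and networks of papers: "
          "the networks are implicit in the papers.")
print(index_frequent_terms(sample, top_n=2))
# prints [('networks', 3), ('papers', 2)]
```

A frequency index of this kind suits unstructured text in which the relevant terms are not known in advance.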
978-1-4577-1133-6/11/$26.00 © 2011 IEEE

In this research, regular expressions were used as the text mining procedure, based on pattern recognition in texts. Regular expressions rely on the existence of formal patterns in texts, such as word formatting and character arrangement, among other aspects. The problem is that not every text has well defined