Business Process Mining by Means of Statistical Languages Model

Dafne Rosso Pelayo and Raúl A. Trejo Ramírez
Instituto Tecnológico de Estudios Superiores de Monterrey
drosso@pemex.gob.mx, raul.trejo@itesm.mx

Abstract

The goal of this research is to provide an alternative for evaluating and tracking business processes, based on the analysis of the unstructured information that such processes generate within the areas of an organization. In this article we introduce a method to determine the probability that a business process occurs within the enterprise’s text documents. The proposed method introduces the statistical language model (SLM) [1] as a new technique in the area of business process mining [2]. To achieve this objective, the following are considered: the probability that a subprocess or part of a process appears in a text paragraph; the probability that this text belongs to a business process; the language model of the set of processes; and the set of performed activities, which is reconstructed according to the processes that gave rise to the analyzed documents.

1. Introduction

Business process mining is a technique that uses workflows recorded in the logs of enterprise applications [3, 13, 15] to reconstruct business processes. From the earliest work on business process mining [4] to the present day, new heuristic techniques based on intelligent computation have been developed, involving genetic algorithms, data mining algorithms, and neural networks, in addition to traditional statistical techniques. A summary of these techniques is given in [2]. For example, [3], [10], and [11] reconstruct a business process model by modeling the job workflow, based on an analysis of the event logs and the period of time in which the events occur. These analyses, however, draw mostly on the (structured) logs of enterprise systems such as SAP, PeopleSoft, or CRM systems.
The business processes of interest to us follow the Process Classification Framework (PCF), a high-level, neutral enterprise model that reflects the activities an enterprise undertakes to satisfy its business and organizational objectives [14]. The alternative we propose is a novel technique for business process mining. One of the main motivations for this research is that, owing to the nature of the processes, the unstructured information they generate is typically very generic, vague, and complex in structure. Moreover, in many cases business processes are not completely automated, because they involve activities, expert analysis, and decision making that are not feasible to structure. As [2] indicates, more formal research on business processes is required, and it is important to look for solutions that make it possible to analyze this type of information and extract knowledge from it.

In this research we focus on the text documents generated as a result of business process execution, rather than starting from the analysis of workflow logs. The documents analyzed belong to a domain of widely dispersed texts; that is, the texts come from different areas, and the information about the items within each document is highly dispersed. To reconstruct the original processes, a statistical language model is used to classify documents. SLMs have been applied in information retrieval, within heuristic techniques for document classification [1, 9]; in this work, we use an SLM to classify documents according to process events or activities [15]. This makes it possible to establish a method of text clustering for the documents that operates by means of a probabilistic precision measure.

2008 Seventh Mexican International Conference on Artificial Intelligence. 978-0-7695-3441-1/08 $25.00 © 2008 IEEE. DOI 10.1109/MICAI.2008.49
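To make the classification idea in the introduction concrete, the following is a minimal sketch of how a paragraph could be scored against one language model per business process. It is an illustration, not the model used in this paper: the unigram assumption, the add-one (Laplace) smoothing, the process labels ("procurement", "hiring"), the toy training texts, and the names `tokenize`, `UnigramModel`, and `classify` are all assumptions introduced here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,;:()[]").lower() for w in text.split())
    return [t for t in tokens if t]


class UnigramModel:
    """Unigram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, texts, vocabulary):
        self.counts = Counter(tok for t in texts for tok in tokenize(t))
        self.total = sum(self.counts.values())
        # One extra slot is reserved for words never seen in training.
        self.vocab_size = len(vocabulary) + 1

    def log_prob(self, text):
        """Return log P(text | process) as a sum of smoothed word log-probabilities."""
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[tok] + 1) / denom) for tok in tokenize(text))


def classify(paragraph, models, priors):
    """Return P(process | paragraph) for each process via Bayes' rule."""
    scores = {
        name: math.log(priors[name]) + model.log_prob(paragraph)
        for name, model in models.items()
    }
    # Log-sum-exp normalization keeps the computation numerically stable.
    top = max(scores.values())
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {name: math.exp(s - log_norm) for name, s in scores.items()}


# Hypothetical training texts, one small collection per process.
training = {
    "procurement": [
        "purchase order approved for supplier invoice",
        "supplier contract and purchase request",
    ],
    "hiring": [
        "candidate interview scheduled by recruiter",
        "job offer sent to candidate",
    ],
}

vocabulary = {tok for texts in training.values() for t in texts for tok in tokenize(t)}
models = {name: UnigramModel(texts, vocabulary) for name, texts in training.items()}
priors = {name: 1 / len(training) for name in training}  # uniform prior, assumed

posteriors = classify("The supplier sent the invoice for the purchase order.", models, priors)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

In this sketch each process is represented only by word frequencies, so a paragraph is assigned to the process whose vocabulary it most resembles. A richer model (for example, higher-order n-grams or a different smoothing scheme) could be substituted without changing the structure of `classify`.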