IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.6, June 2007 45 Manuscript received June 5, 2007 Manuscript revised June 20, 2007

Web Usage Mining Using Self Organized Maps

Paola Britos, Damián Martinelli, Hernán Merlino, Ramón García-Martínez
PhD Computer Science Program, National University of La Plata. Software & Knowledge Engineering Center, Buenos Aires Institute of Technology. Intelligent Systems Laboratory, University of Buenos Aires. Buenos Aires, Argentina

Summary
This paper describes the use of Self Organized Maps (SOM), a kind of artificial neural network, in the Web Usage Mining process to detect user patterns. It details the transformations needed to convert the data stored in Web server log files into a suitable input for a Self Organized Map.

Key words:
Web mining, Self Organized Maps, user patterns, Web server log files.

1. Introduction

Data mining is the set of techniques and tools applied to the non-trivial process of extracting and presenting implicit, previously unknown, and potentially useful knowledge from large data sets, with the goal of describing models and detecting trends and patterns in an automated way [Felgaer, 2004; Piatetski-Shapiro et al., 1991; Chen et al., 1996; Mannila, 1997]. Web Mining is the set of data mining techniques applied to the Web [Cernuzzi-Molas, 2004], and Web Usage Mining is the process of applying those techniques to discover usage patterns of Web pages [Srivastava et al., 2000; Kosala-Blockeel, 2000]. Web Usage Mining uses the data stored in the log files of the Web server as its primary resource; in these files the Web server registers every access by users to each resource on the server [Batista-Silva, 2002; Borges-Levene, 2000; Chen & Chau, 2004]. Through Web Usage Mining we can: (a) understand user access patterns; (b) personalize the site for its users;
(c) tune up the server; (d) modify the site according to the access preferences of its users; and (e) build business rules. From this we can obtain new clients, better marketing campaigns, and a more efficient site [Abraham & Ramos, 2003].

The detection of user habits has three steps: (a) preprocessing; (b) detecting common patterns; and (c) analyzing the patterns [Mobasher et al., 1999]. Preprocessing converts the data stored in log files by applying data cleaning, data abstraction, user identification, and session identification techniques. In the second step, detecting common patterns, we can use statistical techniques, association rules, clustering, classification, etc. In the last step, pattern analysis, Web Usage Mining tries to understand the patterns detected in the previous step; the most common technique is data visualization, applying filters, zooming, etc. [Keim, 2002; Ankerst, 2001].

Artificial neural networks (ANN) try to simulate the processing performed by the human brain. An ANN can abstract from data, work with incomplete or erroneous data, acquire and adapt knowledge, and operate in real time [Grosser, 2004; Daza, 2003]. An ANN is built from simple processing units called neurons. These units are interconnected, and each neuron has its own activation threshold; learning in an ANN is achieved by adjusting the activation threshold of each neuron [Roy, 2000; Abidi, 1996; García Martínez et al., 2003].

2. Problem description

2.1 Logs processing

A critical step in the identification of user habits in web sites is the cleaning and transformation of the Web server log files, together with the identification of user sessions [Mobasher, 1999].

2.2 Log files cleaning

Cleaning Web server log files involves several steps [Huysmans et al., 2003; Mobasher, 1999; Pierrakos et al., 2001; Lalani, 2003; Kerkhofs, 2001; Eirinaki & Vazirgiannis, 2003].
When a user requests a page, the request is added to the log file; but if the page contains images, these are also added to the log file. The same happens for any other resource embedded in the page, for example JavaScript files, Flash animations, videos, etc. In most cases these resources are not relevant to the detection of user habits, so it is convenient to remove such records from the log file; to do this we only need to search for records by file extension. As a short list, we can consider removing entries with the extensions jpg, jpeg, gif, js, css, swf, avi, mov, etc. In some cases it is also appropriate to filter pages inserted into other pages through frames; similarly, it is common for pages to be generated dynamically. HTTP status codes are also used to filter records in the log files: apart from the success code 200, the most common HTTP error codes are 404 (resource not found), 403 (access denied), and 500 (internal server error). For a complete list we can consult RFC 2616 [RFC 2616].
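The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work; it assumes the log is in the Common Log Format, and the extension list and status-code filter are the examples given in the text.

```python
import re

# One Common Log Format record:
# host ident user [timestamp] "METHOD /path HTTP/x.y" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Extensions of embedded resources (images, scripts, stylesheets, videos)
# that are irrelevant for habit detection
SKIP_EXTENSIONS = {'.jpg', '.jpeg', '.gif', '.js', '.css', '.swf', '.avi', '.mov'}

def clean_log(lines):
    """Keep only successful (200) page requests, dropping embedded resources
    and records with error status codes (404, 403, 500, ...)."""
    kept = []
    for line in lines:
        m = CLF_PATTERN.match(line)
        if m is None:
            continue  # malformed record
        if m.group('status') != '200':
            continue  # drop error responses
        path = m.group('path').split('?', 1)[0].lower()
        if any(path.endswith(ext) for ext in SKIP_EXTENSIONS):
            continue  # drop embedded resources
        kept.append((m.group('host'), m.group('time'), path))
    return kept

sample = [
    '10.0.0.1 - - [10/Jun/2007:10:00:01 -0300] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Jun/2007:10:00:02 -0300] "GET /logo.gif HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Jun/2007:10:00:05 -0300] "GET /missing.html HTTP/1.1" 404 120',
]
print(clean_log(sample))  # only the /index.html request survives
```

Keeping the host and timestamp of each surviving record is deliberate: they are the fields that the later user-identification and session-identification steps rely on.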