International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 8, August 2012)

Extraction of Business Rules from Web Logs to Improve Web Usage Mining

Sawan Bhawsar 1, Kshitij Pathak 2, Sourabh Mariya 3, Sunil Parihar 4
1,2,4 Mahakal Institute of Technology, Ujjain
3 Jawaharlal Institute of Technology, Khargone

Abstract— The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data. Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. IE systems attempt to pull information out of documents by filling predefined templates. The extracted information typically consists of entities and the relations between them: who did what to whom, where, when, and how. In our case the input is a set of server web log files, and the extracted information is the set of web sessions performed by users. This extracted information can then be used to improve web usage mining tasks such as next-page prediction (predicting the next page a user will access), web personalization, fraud detection, and user profiling, and to better understand user browsing behaviour. A broad goal of IE is to allow computation to be performed on previously unstructured data, in our case the server web log; a more specific goal is to allow logical reasoning to draw inferences from the content of the input data. Structured data is semantically well-defined data from a chosen target domain.

Keywords— Information extraction, business rules, web usage mining, preprocessing of web log file.

I. INTRODUCTION

The explosive growth and popularity of the World Wide Web has resulted in a huge amount of information sources on the Internet.
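The abstract describes the raw input as server web log files and the extraction target as the web sessions performed by users. A minimal sketch of this preprocessing step, assuming Apache Common Log Format input and a hypothetical 30-minute inactivity timeout for splitting sessions (the paper does not fix these details, so both are illustrative assumptions):

```python
import re
from datetime import datetime, timedelta

# Assumed input: one Apache Common Log Format line per request.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+)'
)

def parse_line(line):
    """Turn one raw log line into a structured record (dict), or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group records into per-IP sessions, opening a new session when the
    gap since that IP's previous request exceeds `timeout`."""
    current = {}   # ip -> currently open session (list of records)
    sessions = []
    for rec in sorted(records, key=lambda r: r["time"]):
        open_session = current.get(rec["ip"])
        if open_session is None or rec["time"] - open_session[-1]["time"] > timeout:
            current[rec["ip"]] = [rec]
            sessions.append(current[rec["ip"]])
        else:
            open_session.append(rec)
    return sessions

lines = [
    '10.0.0.1 - - [10/Aug/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200',
    '10.0.0.1 - - [10/Aug/2012:10:05:00 +0000] "GET /products.html HTTP/1.1" 200',
    '10.0.0.1 - - [10/Aug/2012:11:00:00 +0000] "GET /index.html HTTP/1.1" 200',
]
records = [r for r in (parse_line(l) for l in lines) if r]
print(len(sessionize(records)))  # 2 sessions: the 55-minute gap exceeds the timeout
```

The sessions produced this way are the structured records on which the usage-mining tasks listed above (next-page prediction, personalization, profiling) operate.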
However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and searching. Sophisticated Web mining applications, such as comparison-shopping robots, require expensive maintenance to deal with different data formats. To automate the translation of input pages into structured data, much effort has been devoted to the area of information extraction (IE). Unlike information retrieval (IR), which concerns how to identify relevant documents in a document collection, IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and searching tools.

Formally, an IE task is defined by its input and its extraction target. The input can be unstructured documents, such as free text written in natural language, or the semi-structured documents that are pervasive on the Web, such as tables or itemized and enumerated lists. The extraction target of an IE task can be a relation of k-tuples (where k is the number of attributes in a record), or it can be a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. An IE task is further complicated when various permutations of attributes or typographical errors occur in the input documents.

Programs that perform the task of IE are referred to as extractors or wrappers. A wrapper was originally defined as a component of an information integration system that aims to provide a single uniform query interface for accessing multiple information sources. In such a system, a wrapper is generally a program that "wraps" an information source (e.g., a database server or a Web server) so that the information integration system can access that source without changing its core query-answering mechanism.
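The notion of an extraction target as a relation of k-tuples, where an attribute may have zero or multiple instantiations per record, can be sketched as follows. The itemized HTML snippet, tag layout, and attribute names are hypothetical, chosen only to illustrate missing and repeated attributes:

```python
import re

# Hypothetical semi-structured input: an itemized list in which "price"
# may be missing and "feature" may appear multiple times per record.
snippet = """
<li><b>Widget A</b> <i>$9.99</i> <span>red</span> <span>small</span></li>
<li><b>Widget B</b> <span>blue</span></li>
"""

def extract_records(text):
    """Extract a relation of (name, price, features) tuples from an itemized list."""
    records = []
    for item in re.findall(r"<li>(.*?)</li>", text, re.S):
        name = re.search(r"<b>(.*?)</b>", item)
        price = re.search(r"<i>(.*?)</i>", item)            # zero-or-one instantiation
        features = re.findall(r"<span>(.*?)</span>", item)  # zero-or-many instantiations
        records.append((
            name.group(1) if name else None,
            price.group(1) if price else None,
            features,
        ))
    return records

print(extract_records(snippet))
# [('Widget A', '$9.99', ['red', 'small']), ('Widget B', None, ['blue'])]
```

Here k = 3 (name, price, features); the second record shows a missing attribute, and the feature attribute shows multiple instantiations.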
In the case where the information source is a Web server, a wrapper must query the Web server to collect the resulting pages via the HTTP protocol, perform information extraction on the contents of the HTML documents, and finally integrate the results with other data sources. Among these three procedures, information extraction has received the most attention, and some authors use the term wrapper to denote extractor programs; we therefore use the terms extractor and wrapper interchangeably. Wrapper induction (WI), or information extraction, systems are software tools designed to generate wrappers. A wrapper usually performs a pattern-matching procedure (e.g., a form of finite-state machine) that relies on a set of extraction rules. Tailoring a WI system to a new requirement is a task whose scale varies with the text type, domain, and scenario. To maximize reusability and minimize maintenance cost, the design of trainable WI systems has been an important topic in the research fields of message understanding, machine learning, data mining, etc.
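A wrapper driven by a set of extraction rules can be sketched minimally in the style of delimiter-based wrapper induction, where each rule is a (left, right) delimiter pair for one attribute. The rules here are hand-written rather than induced, and the page layout and attribute names are hypothetical:

```python
import re

# Each extraction rule is a (left, right) delimiter pair for one attribute;
# a WI system would learn such rules from labeled example pages.
RULES = {
    "title":  ("<h2>", "</h2>"),
    "author": ("by <em>", "</em>"),
}

def wrap(page, rules):
    """Apply each (left, right) rule: extract the text between the delimiters."""
    record = {}
    for attr, (left, right) in rules.items():
        m = re.search(re.escape(left) + r"(.*?)" + re.escape(right), page, re.S)
        record[attr] = m.group(1) if m else None
    return record

page = "<h2>Web Usage Mining</h2> <p>by <em>A. Author</em></p>"
print(wrap(page, RULES))
# {'title': 'Web Usage Mining', 'author': 'A. Author'}
```

Maintenance cost arises exactly here: when the source changes its layout, the delimiter rules break, which is why trainable WI systems that re-learn the rules are attractive.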