Stages of Knowledge Discovery in Web Sites

Matthew Smith and Christophe Giraud-Carrier
Brigham Young University, Department of Computer Science, Provo, UT 84602, USA

Abstract

E-commerce greatly facilitates and enhances the level of interaction between a retailer and its customers, thus offering the potential for smarter marketing through thoughtful site design and analytics. This paper presents the three stages of knowledge discovery that move e-retailers from on-line catalog providers to finely-tuned, customer-centric service providers.

Keywords: E-commerce Intelligence, Web Mining, Clickstream Analysis, Personalization.

1. Introduction

As the Internet continues its expansion, most retailers have successfully added the Web to their other, more traditional distribution channels. For too many companies, unfortunately, the story ends there. That is, the Web channel is just that, another distribution channel, often consisting of little more than an on-line catalog tied to a secure electronic point of sale. Although valuable in its own right, such use of the Web falls far short of the unique possibilities it offers for intelligent marketing. Indeed, with nearly everything traceable and measurable, the Web is a marketer’s dream come true.

Physical (i.e., brick-and-mortar) stores are static and generally customer-blind:
• The store layout and contents are the same for all customers, and changes are often costly.
• Visits leave very little useful trace other than, perhaps, sales data, generally limited to what was bought, when it was bought, and by what method of payment.

On the other hand, on-line stores are dynamic and customer-aware:
• Layout and contents are easily modified and can even be tailored to individual visitors.
• Every visit automatically generates a trail of information on the customer’s experience (e.g., how long she stayed, what she looked at, whether she bought or not), and possibly on the customer herself.
This work is supported in part by the BYU Bookstore and Overstock.com.

Unfortunately, due to lack of knowledge, Web capabilities and data are too often under-exploited in e-commerce. The full benefit of the emerging and growing Web channel belongs to those who leverage the rich information it provides [7, 10]. To that end, this paper describes three stages of knowledge discovery in Web sites: (1) clickstream analysis, (2) advanced Web mining, and (3) personalization. Moving through these stages requires increasing sophistication, but also produces increasing return on investment.

2. Clickstream Analysis

The readiest source of data to mine on a Web site is Web usage data. Web servers are generally configured to store such data, also known as clickstream data, in Web server log files. If not, they can be set up to do so easily and quickly. Web server logs grow as visitors interact with the Web site. For each visitor, the log contains basic identifying information (e.g., originating IP address) and time-stamped entries for all visited pages, from the time the visitor first reaches the site to the time she leaves it.

The amount of data found in Web server logs is enormous and clearly defies direct human interpretation. However, log analysis tools (e.g., AWStats [2], The Webalizer [3], SiteCatalyst [15], ClickTracks [6], NetTracker [17]) can be used to summarize it, allowing marketers to gain some knowledge about their e-commerce activities. In particular, the following questions can be answered:
• Where do most visitors come from?
• How many visitors come from a direct link or bookmark vs. a search engine vs. a partner Web site (if any)?
• What search engines (e.g., Google, Yahoo, MSN) and search terms are most frequently used?
• How long do visitors stay on the Web site?
• How many pages do they visit on average?
• Which pages are most popular?
• When do they leave the site?
• Which pages do visitors commonly leave the Web site from?

Hence, the result of analyzing Web server logs allows reporting on unique visitors, hits, visit duration,