Modeling Complex Clickstream Data by Stochastic Models: Theory and Methods

Choudur Lakshminarayan
Hewlett Packard Enterprises
choudur.lakshminarayan@hpe.com

Ram Kosuru
Hewlett Packard Enterprises
ram.kosuru@hpe.com

Meichun Hsu
Hewlett Packard Enterprises
meichun.hsu@hpe.com

ABSTRACT
As the website is a primary customer touch-point, millions are spent to gather web data about customer visits. Sadly, the trove of data and the corresponding analytics have not lived up to their promise. Current marketing practice relies on ambiguous summary statistics or small-sample usability studies. Idiosyncratic browsing and low conversion (browser-to-buyer) rates make modeling hard. In this paper, we model browsing patterns (sequences of clicks) via Markov chain theory to predict users' propensity to buy within a session. We focus on model complexity, imputation of missing values, data augmentation, and other attendant issues that impact performance. The paper addresses the following aspects: (1) determine the appropriate order of the Markov chain (assess the influence of prior history on prediction), (2) impute missing transitions by exploiting the inherent link structure in the page sequences, (3) predict the likelihood of a purchase based on variable-length page sequences, and (4) augment the training set of buyers (which is typically very small: 2%) by viewing the page transitions as a graph and exploiting its link structure to improve performance. This cocktail of solutions addresses important issues in practical digital marketing. Extensive analysis of data from a large commercial website shows that Markov chain based classifiers are useful predictors of user intent.

Keywords
Click Streams, Markov Chains, Link Analysis, Imputation, Prediction

1. INTRODUCTION
Recently, the Big Data movement of injecting analytics into business processes has been driving the urgency to harness clickstream data for online analytics.
It has been reported that the Chief Marketing Officer (CMO) may have a higher budget than the Chief Information Officer (CIO) and, more particularly, that the online analytics market is projected to be approximately 4 billion USD with a compounded annual growth rate (CAGR) of 22.5% by 2017 (source: Frost and Sullivan). In this context, the cutting edge is moving to digital marketing that utilizes data generated by customer interactions with websites. In spite of the steady growth of Internet commerce and investment in website infrastructure, the average global conversion rate (turning a visitor into a buyer) is approximately 2-3%, the abandonment rate (browsers dropping off websites) is 75%, and for every dollar invested in converting a customer, 92 cents is spent on acquiring the customer (source: Gartner/marketing). Therefore, improving the conversion rate by a small fraction translates to billions in revenue.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW'16 Companion, April 11–15, 2016, Montréal, Québec, Canada. ACM 978-1-4503-4144-8/16/04 http://dx.doi.org/10.1145/2872518.2891070.

Marketers have elaborate infrastructures to collect, store, manage, and analyze vast amounts of customer data via web content management, reporting, multivariate testing, and rudimentary analytics. However, the ability to describe browser actions on websites is confined to counting discrete web events such as the number of visits, the number of pages clicked in a session, session duration, average page duration, abandonment, conversion, etc., followed by rudimentary predictive analytics. This paper is premised on the idea that a commercial website is a coherent network of related pages connected by a set of hyperlinks.
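As a rough illustration of this network view, the sketch below builds a directed page graph from session click sequences and row-normalizes the observed transition counts. The page names, the sample sessions, and the helper functions `transition_counts` and `transition_probabilities` are hypothetical and serve only as an example; they are not taken from the data or code used in this paper.

```python
from collections import defaultdict

def transition_counts(sessions):
    """Count how often each page-to-page transition occurs across sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for pages in sessions:
        # Pair each page with its successor within the same session.
        for src, dst in zip(pages, pages[1:]):
            counts[src][dst] += 1
    return counts

def transition_probabilities(counts):
    """Row-normalize the counts into empirical transition frequencies."""
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs

# Hypothetical sessions: each one is an ordered list of visited pages.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "home"],
    ["home", "search", "product", "cart"],
]

counts = transition_counts(sessions)
probs = transition_probabilities(counts)

for src, dsts in probs.items():
    print(src, dsts)
```

Each edge of the resulting graph is a transition seen in at least one session, and the normalized weights are the empirical frequencies with which visitors on a page moved to each linked page.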
A visitor's session-level sequence of clicks is viewed as navigation through a set of relevant pages (augmenting her/his knowledge about the products and services offered by the vendor) that eventually leads to a purchasing decision. It is therefore assumed that web navigation is an appropriate abstraction of browser intent and can be used to predict the likelihood of a buying event. The presumed inter-relationships between pages in visitor sessions lend themselves to modeling by a class of stochastic processes known as Markov chains (MC). Markov chains have been studied extensively in web usage mining [1, 2, 3, 4]. Unlike other efforts in the literature, however, we apply Markov chains to predict the likelihood of conversion based on variable-length page sequences within a session, while providing new techniques to enrich the Markov graph structures.

Web sessions evolve as visitors traverse the website by successively clicking on links on pages. Thus, a session is a finite sequence of inter-linked pages, and arrival at a given page is conditioned on being on a prior page. Markov chains postulate that the joint probability of a sequence of pages can be decomposed into a product of conditional probabilities. Therefore, a Markov chain is specified by conditional independence and order. The notion of conditional independence assumes that the clicks in a sequence are not independent events: given the most recent clicks, the next click is independent of the earlier history. The number of past clicks that influence the next click is the order of the chain. Treating the website