Economic Prediction using Heterogeneous Data Streams from the World Wide Web

Abby Levenberg 1, Edwin Simpson 2, Stephen Roberts 1,2, and Georg Gottlob 1,3

1 Oxford-Man Institute of Quantitative Finance, University of Oxford, abby.levenberg@oxford-man.ox.ac.uk
2 Machine Learning Research Group, Department of Engineering Science, University of Oxford, sjrob@robots.ox.ac.uk, edwin@robots.ox.ac.uk
3 Department of Computer Science, University of Oxford, georg.gottlob@cs.ox.ac.uk

Abstract. Learning to predict financial and economic variables of interest is a hard problem with a large body of literature devoted to it. Recently, a significant amount of work has used sources of text from the Web (such as Twitter or Google Trends) to predict financial and economic variables. Much of this work has relied on some form of superficial sentiment analysis to represent the text. In this work we present a novel approach to predicting economic variables using multiple heterogeneous streams of Web data. We can incorporate different data types into our model, such as time series and text, by first treating each data stream as a separate source with its own features and predictive distribution. For the text data streams we use a novel approach to prediction that uses a sentiment composition model to generate features. We then use a Bayesian classifier combination model to combine the independent “weak” predictions into a single prediction of the Nonfarm Payroll index, a primary economic indicator. Our results show that using a classifier combination model over multiple streams can achieve very high predictive accuracy.

Keywords: heterogeneous data streams, economic prediction, classifier combination, text sentiment

1 Introduction

There is a vast amount of data available on the Internet from a huge number of distinct online sources and the rate of its output is increasing daily.
Currently there is significant interest in both industrial and academic research that aims to utilize such big data provided by the WWW to make predictions and gain insights into various aspects of daily life. Recently, much work has used textual WWW data to make financial predictions, attempting to find correlations between the data and various leading economic and financial indicators such as the stock market or employment rates. Structured extraction of and learning from these online sources of data is a useful and challenging problem that spans the machine learning, information extraction, and quantitative finance research communities.

In this work we forecast the trend of the United States Nonfarm Payrolls (NFP), a monthly economic index that measures employment growth (or decay) and is considered an important indicator of the welfare of the U.S. economy. 4 The NFP index is part of the Current Employment Statistics Survey, a comprehensive report released by the United States Department of Labor, Bureau of Labor Statistics, on the state of the national labor market. Released on the first Friday of each month, the index is given as the change in (nonfarm) employment compared to the prior month. Besides indicating the state of the economy, the NFP is an index that “moves the market” upon its release [17], with the market reacting positively to an increase in the index and negatively to a decline.

4 http://research.stlouisfed.org/fred2/series/PAYNSA?cid=32305

The Proceedings of ECML/PKDD 2013 Workshop Scalable Decision Making: Uncertainty, Imperfection, Deliberation (SCALE), September 23, 2013, Prague, Czech Republic

It is of interest to anyone with a stake in the market, such as banks, hedge funds, prop traders, etc., to try