Economic Prediction using Heterogeneous Data Streams from the World Wide Web

Abby Levenberg 1, Edwin Simpson 2, Stephen Roberts 1,2, and Georg Gottlob 1,3

1 Oxford-Man Institute of Quantitative Finance, University of Oxford, abby.levenberg@oxford-man.ox.ac.uk
2 Machine Learning Research Group, Department of Engineering Science, University of Oxford, sjrob@robots.ox.ac.uk, edwin@robots.ox.ac.uk
3 Department of Computer Science, University of Oxford, georg.gottlob@cs.ox.ac.uk

Abstract. Learning to predict financial and economic variables of interest is a hard problem with a large body of literature devoted to it. Of late there has been a significant amount of work on using sources of text from the Web (such as Twitter or Google Trends) to predict financial and economic variables. Much of this work has relied on some form or other of superficial sentiment analysis to represent the text. In this work we present a novel approach to predicting economic variables using multiple heterogeneous streams of Web data. We can incorporate different data types into our model, such as time series and text, by first treating each data stream as a separate source with its own features and predictive distribution. For the text data streams we use a novel approach to prediction using a sentiment composition model to generate features. We then use a Bayesian classifier combination model to combine the independent “weak” predictions into a single prediction of the Nonfarm Payroll index, a primary economic indicator. Our results show that using a classifier combination model over multiple streams can achieve very high predictive accuracy.

Keywords: heterogeneous data streams, economic prediction, classifier combination, text sentiment

1 Introduction

There is a vast amount of data available on the Internet from a huge number of distinct online sources and the rate of its output is increasing daily.
Currently there is significant interest in both industrial and academic research that aims to utilize such big data provided by the WWW to make predictions and gain insights into various aspects of daily life. Of late there has been a lot of work using textual WWW data to make predictions of a financial nature, attempting to find correlations between the data and various leading economic and financial indicators such as the stock market or employment rates. Structured extraction of and learning from these online sources of data is a useful and challenging problem that spans the machine learning, information extraction, and quantitative finance research communities.

In this work we forecast the trend of the United States Nonfarm Payrolls (NFP), a monthly economic index that measures employment growth (or decay) and is considered an important indicator of the welfare of the U.S. economy (see http://research.stlouisfed.org/fred2/series/PAYNSA?cid=32305). The NFP index is part of the Current Employment Statistics Survey, a comprehensive report released by the United States Department of Labor, Bureau of Labor Statistics, on the state of the national labor market. Released on the first Friday of each month, the index is given as the change in (nonfarm) employment compared to the prior month. Besides indicating the state of the economy, the NFP is an index that “moves the market” upon its release [17], with the market reacting positively to an increase in the index and negatively to a decline. It is of interest to anyone with a stake in the market, such as banks, hedge funds, prop traders, etc., to try

The Proceedings of ECML/PKDD 2013 Workshop Scalable Decision Making: Uncertainty, Imperfection, Deliberation (SCALE), September 23, 2013, Prague, Czech Republic