Automated news reading: Stock price prediction based on financial news using context-capturing features

Michael Hagenau, Michael Liebmann, Dirk Neumann

University of Freiburg, Platz der Alten Synagoge, 79085 Freiburg, Germany

Article history: Received 8 May 2012. Received in revised form 25 November 2012. Accepted 4 February 2013. Available online 20 February 2013.

Keywords: Text mining; Financial news; Stock price prediction; Decision support

Abstract

We examine whether stock price prediction based on textual information in financial news can be improved as previous approaches only yield prediction accuracies close to guessing probability. Accordingly, we enhance existing text mining methods by using more expressive features to represent text and by employing market feedback as part of our feature selection process. We show that a robust feature selection allows lifting classification accuracies significantly above previous approaches when combined with complex feature types. This is because our approach allows selecting semantically relevant features and thus, reduces the problem of over-fitting when applying a machine learning approach. We also demonstrate that our approach is highly profitable for trading in practice. The methodology can be transferred to any other application area providing textual information and corresponding effect data.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

When analysts, investors and institutional traders evaluate current stock prices, news plays an important role in the valuation process. In fact, news carries information about the firm's fundamentals and qualitative information influencing expectations of market participants. From a theoretical point of view, an efficient valuation of a firm should reflect the present value of the firm's expected future cash flows. The expectations on the firm's development are crucially dependent on the information set that is available to investors.
The information set consists of news that contains qualitative as well as quantitative information from various sources, e.g., corporate disclosures, third party news articles and analyst reports. If financial news conveys novel information leading to adjusted expectations about either the firm's cash flows or investors' discount rates, it affects stock returns [4,18]. In the news, not only financial figures have a significant impact on stock prices, but also the qualitative textual components [27] when they contain new information [14,29].

Due to improved information intermediation, the amount of available information has increased dramatically over the last decades. Since it is getting increasingly difficult for investors to follow and consider all available information, automated classification of the most important information becomes more relevant.

Research in automated classification of textual financial news is, however, in its infancy. Despite numerous attempts and application areas (cf. [15]), prediction accuracies for the direction of stock prices following the release of corporate financial news rarely exceeded 58% (see Table 1), an accuracy level hardly above random guessing probability (50%), leaving room for substantial improvements.

Automated classification of textual news comprises text mining, which translates unstructured information into a machine readable format and mostly uses machine learning techniques for classification. While suitable machine learning techniques for text classification are well established [8,12], the development of suitable text representations is still part of ongoing research [24]. Essentially, text representation techniques refer to the way text is handled. One prominent example is the bag-of-words model, which regards the text as a compilation of unordered single words. In such a case, the feature type "single words" constitutes the text representation. More complex feature types refer to word combinations.
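The difference between the two feature types just described can be illustrated with a short sketch. It is not taken from the paper: the function names and the two example headlines are hypothetical, and adjacent word pairs (2-grams) are only one possible kind of word combination, chosen here for brevity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def single_word_features(text):
    """Bag-of-words: count each word; word order is discarded."""
    return Counter(tokenize(text))


def word_pair_features(text):
    """Count adjacent word pairs (2-grams), which keep local word order."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Two made-up headlines with the same words but opposite meaning.
headline_a = "profit rose while debt fell"
headline_b = "debt rose while profit fell"

# The single-word representation cannot tell the headlines apart.
print(single_word_features(headline_a) == single_word_features(headline_b))

# The word-pair representation differs, e.g. ("profit", "rose") vs ("debt", "rose").
print(word_pair_features(headline_a) == word_pair_features(headline_b))
```

Under these assumptions, the first comparison is true and the second is false, which shows why a representation built on word combinations can capture context that unordered single words lose.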
Clearly, not all words are needed to reflect a given text; text mining is concerned with the search for the most relevant features to represent the text.

Existing literature on financial text mining typically relies on very simple textual representations, such as the aforementioned bag-of-words model. Further, the list of words used for text representation is either created on the basis of dictionaries [17,28] or retrieved from the message corpus based on actual occurrences of the words. Despite well-researched approaches to select the most relevant words or word combinations based on exogenous feedback [8], existing work often relies on frequency-based statistics of the message corpus, such as the information retrieval measure TF-IDF [19] or, even simpler, the minimum occurrence of a word combination [24]. Bearing in mind that these approaches used in financial text mining are very simple and do not employ state-of-the-art methods, we expect potential for improvement with respect to two areas: First, we need to explore more complex

Decision Support Systems 55 (2013) 685–697
Corresponding author. Tel.: +49 761 203 2395; fax: +49 761 203 2416.
E-mail addresses: michael.hagenau@is.uni-freiburg.de (M. Hagenau), michael.liebmann@is.uni-freiburg.de (M. Liebmann), dirk.neumann@is.uni-freiburg.de (D. Neumann).
0167-9236/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.dss.2013.02.006