Automated news reading: Stock price prediction based on financial news using context-capturing features

Michael Hagenau ⁎, Michael Liebmann, Dirk Neumann

University of Freiburg, Platz der Alten Synagoge, 79085 Freiburg, Germany

Article history: Received 8 May 2012; Received in revised form 25 November 2012; Accepted 4 February 2013; Available online 20 February 2013

Keywords: Text mining; Financial news; Stock price prediction; Decision support

Abstract

We examine whether stock price prediction based on textual information in financial news can be improved as previous approaches only yield prediction accuracies close to guessing probability. Accordingly, we enhance existing text mining methods by using more expressive features to represent text and by employing market feedback as part of our feature selection process. We show that a robust feature selection allows lifting classification accuracies significantly above previous approaches when combined with complex feature types. This is because our approach allows selecting semantically relevant features and thus, reduces the problem of over-fitting when applying a machine learning approach. We also demonstrate that our approach is highly profitable for trading in practice. The methodology can be transferred to any other application area providing textual information and corresponding effect data.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

When analysts, investors and institutional traders evaluate current stock prices, news plays an important role in the valuation process. In fact, news carries information about the firm's fundamentals and qualitative information influencing expectations of market participants. From a theoretical point of view, an efficient valuation of a firm should reflect the present value of the firm's expected future cash flows. The expectations on the firm's development are crucially dependent on the information set that is available to investors.
The information set consists of news that contains qualitative as well as quantitative information from various sources, e.g., corporate disclosures, third party news articles and analyst reports. If financial news conveys novel information leading to adjusted expectations about either a firm's cash flows or investors' discount rates, it affects stock returns [4,18]. In the news, not only do financial figures have a significant impact on stock prices; the qualitative textual components also affect stock prices [27] when they contain new information [14,29].

Due to improved information intermediation, the amount of available information has increased dramatically over the last decades. Since it is getting increasingly difficult for investors to follow and consider all available information, automated classification of the most important information becomes more relevant.

Research in automated classification of textual financial news is, however, in its infancy. Despite numerous attempts and application areas (cf. [15]), prediction accuracies for the direction of stock prices following the release of corporate financial news rarely exceeded 58% (see Table 1), an accuracy level hardly above the random guessing probability (50%), which leaves room for substantial improvement.

Automated classification of textual news comprises text mining, which translates unstructured information into a machine-readable format and mostly uses machine learning techniques for classification. While suitable machine learning techniques for text classification are well established [8,12], the development of suitable text representations is still part of ongoing research [24]. Essentially, text representation techniques refer to the way text is turned into features for classification. One prominent example is the bag-of-words model, which regards the text as a compilation of unordered single words. In such a case, the feature type ‘single words’ constitutes the text representation.
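To make the bag-of-words representation described above concrete, the following minimal Python sketch maps a text to unordered single-word counts. It is an illustration only and not the authors' implementation: the function name `bag_of_words`, the lower-casing, the letters-only tokenization and the two example headlines are all assumptions introduced here.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a text as unordered single-word counts.

    Hypothetical helper for illustration: tokens are runs of ASCII
    letters after lower-casing, so numbers and punctuation are dropped.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two invented headlines that use the same words in a different order.
headline_a = bag_of_words("Profit rises while debt falls")
headline_b = bag_of_words("Debt rises while profit falls")

# Word order is discarded, so both headlines get the same representation.
print(headline_a == headline_b)
```

Because word order is discarded, the two headlines receive identical representations although they describe opposite developments. This shows how much context a representation based only on single words can lose.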
More complex feature types refer to word combinations. Clearly, not all words are needed to reflect a given text; text mining is concerned with the search for the most relevant features to represent the text.

Existing literature on financial text mining typically relies on very simple textual representations, such as the aforementioned bag-of-words model. Further, the list of words used for text representation is either created on the basis of dictionaries [17,28] or retrieved from the message corpus based on actual occurrences of the words. Despite well-researched approaches to select the most relevant words or word combinations based on exogenous feedback [8], existing work often relies on frequency-based statistics of the message corpus, such as the information retrieval measure TF-IDF [19] or, even simpler, the minimum occurrence of a word combination [24]. Given that the approaches used in financial text mining are very simple and do not employ state-of-the-art methods, we expect potential for improvement with respect to two areas: First, we need to explore more complex

Decision Support Systems 55 (2013) 685–697

⁎ Corresponding author. Tel.: +49 761 203 2395; fax: +49 761 203 2416.
E-mail addresses: michael.hagenau@is.uni-freiburg.de (M. Hagenau), michael.liebmann@is.uni-freiburg.de (M. Liebmann), dirk.neumann@is.uni-freiburg.de (D. Neumann).
http://dx.doi.org/10.1016/j.dss.2013.02.006