Automated news reading: Stock price prediction based on financial
news using context-capturing features
Michael Hagenau ⁎, Michael Liebmann, Dirk Neumann
University of Freiburg, Platz der Alten Synagoge, 79085 Freiburg, Germany
Article info
Article history:
Received 8 May 2012
Received in revised form 25 November 2012
Accepted 4 February 2013
Available online 20 February 2013
Keywords:
Text mining
Financial news
Stock price prediction
Decision support
Abstract
We examine whether stock price prediction based on textual information in financial news can be improved, as
previous approaches only yield prediction accuracies close to guessing probability. Accordingly, we enhance
existing text mining methods by using more expressive features to represent text and by employing market
feedback as part of our feature selection process. We show that robust feature selection, when combined with
complex feature types, lifts classification accuracies significantly above those of previous approaches. This is
because our approach allows selecting semantically relevant features and thus reduces the problem of over-fitting
when applying a machine learning approach. We also demonstrate that our approach is highly profitable for trading
in practice. The methodology can be transferred to any other application area providing textual information and
corresponding effect data.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
When analysts, investors and institutional traders evaluate current
stock prices, news plays an important role in the valuation process. In
fact, news carries information about the firm's fundamentals and qualita-
tive information influencing expectations of market participants. From a
theoretical point of view, an efficient valuation of a firm should reflect
the present value of the firm's expected future cash flows. The expecta-
tions on the firm's development are crucially dependent on the informa-
tion set that is available to investors. The information set consists of news
that contains qualitative as well as quantitative information from various
sources, e.g., corporate disclosures, third party news articles and analyst
reports. If financial news conveys novel information leading to adjusted
expectations about either the firm's cash flows or investors' discount rates,
it affects stock returns [4,18]. Not only the financial figures in the news
have a significant impact on stock prices, but also the qualitative textual
components [27], provided they contain new information [14,29].
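For reference, the valuation principle invoked above is commonly written as the standard discounted cash flow identity. The symbols $V_0$ (firm value today), $CF_t$ (cash flow in period $t$), $r$ (investors' discount rate) and $\Omega_0$ (the information set available to investors today) are notation assumed here for illustration and do not appear in the original text.

```latex
V_0 = \sum_{t=1}^{\infty} \frac{\mathbb{E}\left[ CF_t \mid \Omega_0 \right]}{(1 + r)^{t}}
```

In this notation, news that changes $\Omega_0$ can move the price through the expected cash flows in the numerator or through the discount rate in the denominator.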
Due to improved information intermediation, the amount of available
information has increased dramatically over the last decades. Since it is
getting increasingly difficult for investors to follow and consider all
available information, automated classification of the most important
information becomes more relevant.
Research in automated classification of textual financial news is, however,
in its infancy. Despite numerous attempts and application areas
(cf. [15]), prediction accuracies for the direction of stock prices following
the release of corporate financial news have rarely exceeded 58% (see
Table 1). This accuracy level is hardly above the random guessing probability
(50%) and leaves room for substantial improvement.
Automated classification of textual news comprises text mining, which
translates unstructured information into a machine-readable format,
and mostly uses machine learning techniques for classification.
While suitable machine learning techniques for text classification are
well established [8,12], the development of suitable text representa-
tions is still part of ongoing research [24]. Essentially, text representa-
tion techniques refer to the way text is handled. One prominent
example is the bag-of-words model, which regards the text as a compi-
lation of unordered single words. In such a case, the feature type ‘single
words’ constitutes the text representation. More complex feature types
refer to word combinations. Clearly, not all words are needed to reflect a
given text; text mining is concerned with the search for the most rele-
vant features to represent the text.
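The difference between the two feature types described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regular-expression tokenizer, the choice of adjacent word pairs as the "word combination", and the two example headlines are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    A deliberately simple tokenizer, assumed here for illustration.
    """
    return re.findall(r"[a-z]+", text.lower())


def single_words(tokens):
    """Bag-of-words representation: unordered counts of single words."""
    return Counter(tokens)


def word_pairs(tokens):
    """One possible word-combination feature type: counts of adjacent pairs."""
    return Counter(zip(tokens, tokens[1:]))


# Two hypothetical headlines with the same words in a different order.
headline_a = "Profit rises, loss falls"
headline_b = "Loss rises, profit falls"

tokens_a = tokenize(headline_a)
tokens_b = tokenize(headline_b)

# Single words cannot tell the two headlines apart ...
same_bag = single_words(tokens_a) == single_words(tokens_b)

# ... whereas adjacent word pairs keep some of the local context.
same_pairs = word_pairs(tokens_a) == word_pairs(tokens_b)

print("identical single-word features:", same_bag)
print("identical word-pair features:", same_pairs)
```

The two headlines carry opposite messages, yet their single-word features coincide. This is the kind of context loss that more complex feature types are meant to reduce.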
Existing literature on financial text mining typically relies on very simple
textual representations, such as the aforementioned bag-of-words
model. Further, the list of words used for text representation is either
created on the basis of dictionaries [17,28] or retrieved from the message
corpus based on actual occurrences of the words. Despite well-researched
approaches to select the most relevant words or word combinations
based on exogenous feedback [8], existing work often relies on
frequency-based statistics of the message corpus, such as the information
retrieval measure TF-IDF [19] or, even simpler, the minimum
occurrence of a word combination [24]. Given that the
approaches used in financial text mining are very simple and do not
employ state-of-the-art methods, we expect potential for improvement
in two areas: First, we need to explore more complex
Decision Support Systems 55 (2013) 685–697
⁎ Corresponding author. Tel.: +49 761 203 2395; fax: +49 761 203 2416.
E-mail addresses: michael.hagenau@is.uni-freiburg.de (M. Hagenau),
michael.liebmann@is.uni-freiburg.de (M. Liebmann), dirk.neumann@is.uni-freiburg.de
(D. Neumann).
0167-9236/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.dss.2013.02.006