Challenges for Recommender Systems Evaluation
Francesco Ricci
David Massimo
Antonella De Angeli
fricci@unibz.it
davmassimo@unibz.it
antonella.deangeli@unibz.it
Free University of Bozen-Bolzano
Bolzano, Italy
ABSTRACT
Many businesses and web portals adopt Recommender Systems
(RSs) to help their users to tame information overload and make
better choices. Despite the fact that RSs should support user decision
making, academic researchers, when evaluating the effectiveness
of a RS, largely adopt offline methods rather than live user studies.
We discuss the relationships between these evaluation methods
by considering a tourism RS case study. We then suggest future
directions to be taken by HCI and RS research to better assess the
user’s value of RSs.
CCS CONCEPTS
· Information systems → Evaluation of retrieval results; Recommender
systems; · Human-centered computing → HCI
design and evaluation methods.
KEYWORDS
Recommender Systems, evaluation methods
ACM Reference Format:
Francesco Ricci, David Massimo, and Antonella De Angeli. 2021. Challenges
for Recommender Systems Evaluation. In CHItaly 2021: 14th Biannual Conference
of the Italian SIGCHI Chapter (CHItaly '21), July 11–13, 2021, Bolzano,
Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3464385.
3464733
1 INTRODUCTION
Recommender Systems (RSs) are software tools aiming at supporting
human decision-making, especially when choices are made
over large product or service catalogues [29]. RSs are used in
several online platforms, for media streaming suggestions (Netflix,
Spotify), or in Location-Based Social Networks for restaurant
recommendation (Foursquare).
RSs are evaluated by means of two methods: offline and online
experiments [5, 6, 11, 15]. Offline experiments focus on the core
recommendation algorithm and simulate the (online) interactive
process where the RS generates recommendations that a user may
appreciate or not. In order to perform this simulation it is necessary
to know how the user would actually react to the recommendations.
This is obtained by employing existing rating or choice data sets
and splitting them into two parts: a train set and a test set [8, 12]. The train set is
used to train the recommendation algorithm and the test set is used
to simulate the user reactions to the generated recommendations:
if the user's evaluation of a recommended item is present in the
test set, then it is used to decide whether the recommendation is
correct or not.
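This offline protocol can be sketched in a few lines. The following is a minimal illustration with invented interaction data and a simple popularity baseline standing in for the recommendation algorithm (any specific RS would replace the `recommend` function); it is not the evaluation pipeline of any particular system.

```python
# Minimal sketch of an offline RS evaluation: held-out interactions
# simulate user reactions, and a recommendation counts as correct
# if it appears among the user's held-out items (precision@k).
import random
from collections import Counter

random.seed(0)

# Toy interaction log: (user, item) pairs standing in for ratings/choices.
interactions = [(u, i) for u in range(20) for i in random.sample(range(50), 8)]

# Split the data set into train and test parts.
random.shuffle(interactions)
split = int(0.8 * len(interactions))
train, test = interactions[:split], interactions[split:]

# "Train": here, simply rank items by popularity in the train set.
popularity = Counter(item for _, item in train)

def recommend(user, k=5):
    # A real RS would personalize; popularity is a common baseline.
    return [item for item, _ in popularity.most_common(k)]

# "Test": the user's held-out interactions decide correctness.
held_out = {}
for user, item in test:
    held_out.setdefault(user, set()).add(item)

precisions = [len(set(recommend(u, 5)) & items) / 5
              for u, items in held_out.items()]
print(f"mean precision@5 = {sum(precisions) / len(precisions):.3f}")
```

Note that this setup only scores recommendations whose user reaction happens to be in the test set, which is exactly the limitation of offline studies discussed below.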
Conversely, in online studies, real users are requested to evaluate
the recommendations [7, 13, 18]. These experiments are also called
"user studies" and their focus is not only the evaluation of the "precision"
of the recommendation algorithm but also the assessment
of a range of complementary properties of the whole user/system
interaction [15, 21]: the perceived recommendation quality and the
system effectiveness.
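In practice, such user-centric properties are typically measured with post-interaction questionnaires (e.g., Likert-scale items per construct). A minimal sketch of how the resulting responses could be summarized, with invented construct names and data:

```python
# Hypothetical user-study questionnaire data: each participant rates
# statements (e.g., "the recommendations were relevant") on a 1-5
# Likert scale; responses are aggregated per measured construct.
from statistics import mean, stdev

# Toy responses: one dict per participant (invented data).
responses = [
    {"perceived_quality": 4, "effectiveness": 3},
    {"perceived_quality": 5, "effectiveness": 4},
    {"perceived_quality": 3, "effectiveness": 4},
    {"perceived_quality": 4, "effectiveness": 2},
]

# Report mean and standard deviation for each construct.
for construct in ("perceived_quality", "effectiveness"):
    scores = [r[construct] for r in responses]
    print(f"{construct}: mean={mean(scores):.2f} sd={stdev(scores):.2f}")
```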
Despite the user-centric nature of RSs, offline studies are more
popular than user studies, which are more complex and time consuming
to set up and conduct. In fact, a user study requires a fully
working system, which heavily impacts the time required to
set up the experiment. Moreover, users willing to participate in
the study need to be identified via recruiting campaigns, e.g., typically
by asking colleagues and students or by sending invitations via
mailing lists. As a matter of fact, a large number of participants is
difficult to find. Moreover, collecting reliable recommendation
evaluations is not always possible: often the true experience associated
with the consumption of the recommended item cannot be
adequately simulated in a user study. Even music, which is rather
easy to play for the user to evaluate, may produce an unreliable
evaluation if it is not listened to in the user's typical listening
context [17, 19, 32]. It is even more difficult to evaluate a tourism RS,
which is normally used in the pre-trip planning phase, or during a
trip, to get up-to-date information about what to visit next [14, 33].
In both cases, the actual visit to a recommended Point of Interest
(POI) can only be "illustrated" to the user; the user must rely on
the provided information to decide whether the recommendation
is relevant or not. As a matter of fact, the real visit to the POI may
produce a totally different experience and evaluation.
Researchers often ignore these issues, and rarely conduct both
offline and online studies for the same RS. The assumption is that
either of the two may still bring useful knowledge about the effect of
the RS on the target users. Conversely, in this paper, we leverage
the knowledge that we have acquired in performing combined