Challenges for Recommender Systems Evaluation

Francesco Ricci (fricci@unibz.it), David Massimo (davmassimo@unibz.it), Antonella De Angeli (antonella.deangeli@unibz.it)
Free University of Bozen-Bolzano, Bolzano, Italy

ABSTRACT
Many businesses and web portals adopt Recommender Systems (RSs) to help their users tame information overload and make better choices. Although RSs should support user decision making, academic researchers evaluating the effectiveness of a RS largely adopt offline methods rather than live user studies. We discuss the relationships between these evaluation methods by considering a tourism RS case study. We then suggest future directions for HCI and RS research to better assess the value of RSs for users.

CCS CONCEPTS
• Information systems → Evaluation of retrieval results; Recommender systems; • Human-centered computing → HCI design and evaluation methods.

KEYWORDS
Recommender Systems, evaluation methods

ACM Reference Format:
Francesco Ricci, David Massimo, and Antonella De Angeli. 2021. Challenges for Recommender Systems Evaluation. In CHItaly 2021: 14th Biannual Conference of the Italian SIGCHI Chapter (CHItaly '21), July 11–13, 2021, Bolzano, Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3464385.3464733

1 INTRODUCTION
Recommender Systems (RSs) are software tools aiming at supporting human decision-making, especially when choices are made over large product or service catalogues [29]. RSs are used in several online platforms, for media streaming suggestions (Netflix, Spotify), or in Location-Based Social Networks for restaurant recommendation (Foursquare). RSs are evaluated by means of two methods: offline and online experiments [5, 6, 11, 15].
Offline experiments focus on the core recommendation algorithm and simulate the (online) interactive process where the RS generates recommendations that a user may or may not appreciate. In order to perform this simulation it is necessary to know how the user would actually react to the recommendations. This is obtained by employing existing data sets of ratings or choices and splitting them into two parts: train and test [8, 12]. The train set is used to train the recommendation algorithm and the test set is used to simulate the user reactions to the generated recommendations: if the user's evaluation of a recommended item is present in the test set, then it is used to decide whether the recommendation is correct or not. Conversely, in online studies, real users are requested to evaluate the recommendations [7, 13, 18]. These experiments are also called "user studies" and their focus is not only the evaluation of the "precision" of the recommendation algorithm but also the assessment of a range of complementary properties of the whole user/system interaction [15, 21]: the perceived recommendation quality and the system effectiveness.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CHItaly '21, July 11–13, 2021, Bolzano, Italy
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8977-8/21/06 . . . $15.00
https://doi.org/10.1145/3464385.3464733
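The offline protocol described above (hold out part of an existing interaction data set, recommend using only the training portion, and count a recommendation as correct when the user's held-out reaction confirms it) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy interaction log, the popularity-based recommender, and all names are hypothetical, not the system evaluated in the paper.

```python
import random

# Hypothetical toy interaction log of (user, item) pairs; in practice
# these come from an existing ratings or choices data set.
interactions = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
    ("u2", "i1"), ("u2", "i3"), ("u2", "i4"),
    ("u3", "i2"), ("u3", "i3"), ("u3", "i5"),
]

# Split the data into train and test parts.
random.seed(42)
random.shuffle(interactions)
cut = int(0.7 * len(interactions))
train, test = interactions[:cut], interactions[cut:]

def recommend(k=2):
    """A trivial 'most popular' recommender fitted on the train split only."""
    counts = {}
    for _, item in train:
        counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:k]

def precision_at_k(k=2):
    """Offline precision@k: a recommendation counts as correct only if
    the user's reaction to that item is recorded in the test set."""
    recs = set(recommend(k))
    hits, total = 0, 0
    for user in {u for u, _ in test}:
        held_out = {i for u, i in test if u == user}
        hits += len(held_out & recs)
        total += k
    return hits / total if total else 0.0
```

Note that this simulation only rewards recommendations whose evaluations happen to appear in the test set, which is exactly the limitation the paper contrasts with online user studies.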
Despite the user-centric nature of RSs, offline studies are more popular than user studies, which are more complex and time-consuming to set up and conduct. In fact, a user study entails a fully working system, which heavily impacts the time required to set up the experiment. Moreover, users willing to participate in the study need to be identified via recruiting campaigns, e.g., typically by asking colleagues and students or by sending invitations via mailing lists. As a matter of fact, a large number of participants is difficult to find. Moreover, collecting reliable recommendation evaluations is not always possible: often the true experience associated with the consumption of the recommended item cannot be adequately simulated in a user study. Even music, which is relatively easy to play for the user to evaluate, may produce an unreliable evaluation if not listened to in the user's typical listening context [17, 19, 32]. It is even more difficult to evaluate a tourism RS, which is normally used in the pre-trip planning phase, or during a travel, to get up-to-date information about what to visit next [14, 33]. In both cases, the actual visit to a recommended Point of Interest (POI) can only be "illustrated" to the user; the user must rely on the provided information to decide whether the recommendation is relevant or not. As a matter of fact, the real visit to the POI may produce a totally different experience and evaluation. Researchers often ignore these issues, and rarely conduct both offline and online studies for the same RS. The assumption is that either of the two may still bring useful knowledge about the effect of the RS on the target users. Conversely, in this paper, we leverage the knowledge that we have acquired in performing combined