Tag Recommendation for Large-Scale Ontology-Based Information Systems

Roman Prokofyev¹, Alexey Boyarsky²˒³˒⁴, Oleg Ruchayskiy⁵, Karl Aberer², Gianluca Demartini¹, and Philippe Cudré-Mauroux¹

¹ eXascale Infolab, University of Fribourg, Switzerland
{firstname.lastname}@unifr.ch
² Ecole Polytechnique Fédérale de Lausanne, Switzerland
{firstname.lastname}@epfl.ch
³ Instituut-Lorentz for Theoretical Physics, U. Leiden, The Netherlands
⁴ Bogolyubov Institute for Theoretical Physics, Kiev, Ukraine
⁵ CERN TH-Division, PH-TH, Geneva, Switzerland
oleg.ruchayskiy@cern.ch

Abstract. We tackle the problem of improving the relevance of automatically selected tags in large-scale ontology-based information systems. Contrary to traditional settings where tags can be chosen arbitrarily, we focus on the problem of recommending tags (e.g., concepts) directly from a collaborative, user-driven ontology. We compare the effectiveness of a series of approaches to select the best tags, ranging from traditional IR techniques such as TF/IDF weighting to novel techniques based on ontological distances and latent Dirichlet allocation. All our experiments are run against a real corpus of tags and documents extracted from the ScienceWise portal, which is connected to ArXiv.org and is currently used by a growing number of researchers. The datasets for the experiments are made available online for reproducibility purposes.

1 Introduction

The nature of scientific research is drastically changing. Fewer and fewer scientific advances are carried out by small groups working in their laboratories in isolation. In today’s data-driven sciences (be it biology, physics, complex systems or economics), progress is increasingly achieved by scientists who have heterogeneous expertise, work in parallel, and hold a very contextualized, local view of their problems and results.
We expect that this will result in a fundamental phase transition in how scientific results are obtained, represented, used, communicated and attributed. In contrast to the classical view of how science is performed, important discoveries will in the future not only be the result of exceptional individual efforts and talents, but also an emergent property of a complex community-based socio-technical system. This has fundamental implications for how we perceive the role of technical systems, and in particular of information processing infrastructures, in scientific work: they are no longer a subordinate instrument that facilitates the daily work of highly gifted individuals, but become an essential tool and enabler of scientific progress, and eventually might be the instrument within which scientific discoveries are made, represented and brought to use.