Human-Centered Implicit Tagging: Overview and Perspectives

Mohammad Soleymani, Department of Computing, Imperial College London, London SW7 2AZ, UK, m.soleymani@imperial.ac.uk
Maja Pantic, Department of Computing, Imperial College London; Faculty of EEMCS, University of Twente, the Netherlands, m.pantic@imperial.ac.uk

Abstract—Tags are an effective form of metadata which help users to locate and browse multimedia content of interest. Tags can be generated by users (user-generated explicit tags), automatically from the content (content-based tags), or assigned automatically based on nonverbal behavioral reactions of users to multimedia content (implicit human-centered tags). This paper discusses the definition and applications of implicit human-centered tagging. Implicit tagging is an effortless process by which content is tagged based on users' spontaneous reactions. It is a novel and growing research topic that is attracting increasing attention with the growing availability of built-in sensors. This paper discusses the state of the art in this novel field of research and provides an overview of publicly available relevant databases and annotation tools. We finally discuss in detail challenges and opportunities in the field.

Index Terms—tagging, implicit tagging, emotion recognition, multimedia indexing.

I. INTRODUCTION

Information management systems use tags as an effective form of metadata to support users in finding and re-finding multimedia content of interest. Tags can come in different forms, including semantic tags and geotags [1]. In contrast to classic tagging schemes, where users' direct input is mandatory, Implicit Human-Centered Tagging (IHCT) was proposed [2] to gather tags and annotations without any effort from users.
The main idea behind IHCT is that nonverbal behaviors displayed when interacting with multimedia data (e.g., facial expressions, head nods, eye gaze, physiological responses) provide information useful for improving the tag sets associated with the data. The resulting tags are called "implicit" since there is no need for users' direct input, as reactions to multimedia are displayed spontaneously. Currently, social media websites encourage users to tag multimedia content. However, the users' intent when tagging multimedia content does not always match information retrieval goals. A large portion of user-defined tags are either motivated by the goal of increasing the popularity and reputation of a user in an online community or based on individual judgments and goals [2]. For example, a user might tag content to increase the popularity and visibility of himself or his content. In contrast to standard "explicit" tagging, implicit tagging does not prompt users for tags while they listen to or watch multimedia content. Moreover, if implicit tagging is done reliably, the resulting tags carry less irrelevant and inaccurate information than "explicit" tags. Tags obtained through IHCT are expected to be more robust than tags associated with the data explicitly, at least in terms of generality (they make sense to everybody) and statistical reliability (all tags will be sufficiently represented). A scheme of implicit tagging versus explicit tagging is shown in Fig. 1.

Fig. 1. Implicit tagging vs. explicit tagging scenarios. The analysis of bodily reactions to multimedia content replaces the direct interaction between the user and the computer. Therefore, users do not have to put any effort into tagging the content.
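The statistical-reliability argument rests on aggregating many users' spontaneous reactions into a single implicit tag. A minimal sketch of such aggregation, assuming per-user reaction labels (e.g., produced by a facial expression recognizer) are already available; the function name, labels, and threshold below are illustrative and not taken from the paper:

```python
from collections import Counter

def aggregate_implicit_tags(reactions, min_support=0.5):
    """Aggregate per-user spontaneous reaction labels into one implicit tag.

    reactions: list of emotion labels inferred from users' nonverbal
    behavior (e.g., "funny", "scary"); the labels are illustrative.
    Returns (label, support) for the majority label, or None when no
    label reaches the min_support fraction of users.
    """
    if not reactions:
        return None
    counts = Counter(reactions)
    label, votes = counts.most_common(1)[0]
    support = votes / len(reactions)
    return (label, support) if support >= min_support else None

# Example: five users' inferred reactions to one video clip
print(aggregate_implicit_tags(["funny", "funny", "funny", "neutral", "funny"]))
# → ('funny', 0.8)
```

Requiring a minimum support fraction is one simple way to reflect the generality requirement: a tag is only emitted when it makes sense to a sufficient share of viewers.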
The users' behavior and spontaneous reactions to multimedia data can provide useful information for multimedia indexing in the following scenarios: (i) direct assessment of tags: users' spontaneous reactions are translated into emotional keywords, e.g., funny, disgusting, scary [3], [4], [5], [6]; (ii) assessing the correctness of explicit tags or topic relevance, e.g., agreement or disagreement over a displayed tag or the relevance of a retrieved result [7], [8], [9], [10]; (iii) user profiling: a user's personal preferences can be detected based on her reactions to retrieved data and used for re-ranking the results; (iv) content summarization: highlight detection is also possible using implicit feedback from users [11], [12].

Multimedia indexing has focused on generating characterizations of content in terms of events, objects, etc. The judgment relies on cognitive processing combined with general world knowledge and is considered objective due to its reproducibility by users with a wide variety of backgrounds. Parallel to this approach to indexing, an alternative has emerged that also takes affective aspects into account. Here,