Multimed Tools Appl
DOI 10.1007/s11042-016-4315-0

Efficient audio-driven multimedia indexing through similarity-based speech/music discrimination

Nikolaos Tsipas¹ · Lazaros Vrysis¹ · Charalampos Dimoulas¹ · George Papanikolaou¹

Received: 31 July 2016 / Revised: 13 December 2016 / Accepted: 26 December 2016
© Springer Science+Business Media New York 2017

Abstract In this paper, an audio-driven algorithm for the detection of speech and music events in multimedia content is introduced. The proposed approach is based on the hypothesis that short-time frame-level discrimination performance can be enhanced by identifying transition points between longer, semantically homogeneous segments of audio. In this context, a two-step segmentation approach is employed in order to initially identify transition points between the homogeneous regions and subsequently classify the derived segments using a supervised binary classifier. The transition point detection mechanism is based on the analysis and composition of multiple self-similarity matrices, generated using different audio feature sets. The implemented technique aims at discriminating events focusing on transition point detection with high temporal resolution, a target that is also reflected in the adopted assessment methodology. Thereafter, multimedia indexing can be efficiently deployed (for both audio and video sequences), incorporating the processes of high resolution temporal segmentation and semantic annotation extraction. The system is evaluated against three publicly available datasets and experimental results are presented in comparison with existing implementations. The proposed algorithm is provided as an open source software package in order to support reproducible research and encourage collaboration in the field.
Nikolaos Tsipas
nitsipas@auth.gr

Lazaros Vrysis
lvrysis@auth.gr

Charalampos Dimoulas
babis@eng.auth.gr

George Papanikolaou
pap@eng.auth.gr

¹ Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
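
The transition point detection step outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' released implementation, and several choices are assumptions made here for illustration: cosine similarity as the frame similarity measure, plain averaging as the way of composing the per-feature-set self-similarity matrices, a checkerboard-kernel novelty curve (a standard technique from audio segmentation, swapped in because the abstract does not name the analysis method), and the hypothetical parameters `half_width` and `rel_threshold`. The subsequent classification of each segment with a supervised binary classifier is left out.

```python
import numpy as np


def self_similarity(features):
    """Cosine self-similarity matrix of a (n_frames, n_dims) feature array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against all-zero frames
    return unit @ unit.T


def novelty_curve(ssm, half_width):
    """Slide a checkerboard kernel along the main diagonal of the matrix."""
    sign = np.concatenate([-np.ones(half_width), np.ones(half_width)])
    kernel = np.outer(sign, sign)  # +1 on within-side blocks, -1 across sides
    padded = np.pad(ssm, half_width, mode="constant")
    size = 2 * half_width
    novelty = np.empty(ssm.shape[0])
    for i in range(ssm.shape[0]):
        # The window covers original frames i - half_width .. i + half_width - 1.
        novelty[i] = np.sum(padded[i:i + size, i:i + size] * kernel)
    return novelty


def pick_peaks(novelty, half_width, rel_threshold=0.5):
    """Local maxima above rel_threshold * max, skipping zero-padded borders."""
    threshold = rel_threshold * novelty.max()
    peaks = []
    for i in range(half_width, len(novelty) - half_width):
        neighbourhood = novelty[i - half_width:i + half_width + 1]
        if novelty[i] >= threshold and novelty[i] == neighbourhood.max():
            peaks.append(i)
    return peaks


def detect_transitions(feature_sets, half_width=8, rel_threshold=0.5):
    """Average one self-similarity matrix per feature set, then pick peaks.

    Averaging is an assumption of this sketch, not the paper's stated rule.
    """
    combined = np.mean([self_similarity(f) for f in feature_sets], axis=0)
    novelty = novelty_curve(combined, half_width)
    return pick_peaks(novelty, half_width, rel_threshold)


def to_segments(transitions, n_frames):
    """Turn transition frame indices into (start, end) frame ranges."""
    bounds = [0, *transitions, n_frames]
    return list(zip(bounds[:-1], bounds[1:]))


# Synthetic example: two feature sets that both change character at frame 50.
set_a = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])
set_b = np.vstack([np.tile([1.0, 0.0, 0.0], (50, 1)),
                   np.tile([0.0, 0.0, 1.0], (50, 1))])
transitions = detect_transitions([set_a, set_b])
print(transitions)                    # [50]
print(to_segments(transitions, 100))  # [(0, 50), (50, 100)]
```

In this sketch each returned index is the first frame of a new homogeneous region, so the resulting segments could be passed one at a time to whatever speech/music classifier is available. Real audio features are noisy, so the threshold and kernel size would need tuning on actual data.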