Multi-Modal Distance Metric Learning: A Bayesian Non-parametric Approach

Behnam Babagholami-Mohamadabadi, Seyed Mahdi Roostaiyan, Ali Zarghami, and Mahdieh Soleymani Baghshah

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
babagholami@alum.sharif.edu

Abstract. In many real-world applications (e.g. social media applications), data usually consists of diverse input modalities that originate from various heterogeneous sources. Learning a similarity measure for such data is of great importance for a vast number of applications such as classification, clustering, and retrieval. Defining an appropriate distance metric between data points with multiple modalities is a key challenge that has a great impact on the performance of many multimedia applications. Existing approaches for multi-modal distance metric learning offer only point estimates of the distance matrix and/or latent features, and can therefore be unreliable when the number of training examples is small. In this paper, we present a novel Bayesian framework for learning distance functions on multi-modal data through the Beta process, by which we embed data of different modalities into a single latent space. Moreover, using the flexible Beta process model, we can infer the dimensionality of the hidden space from the training data itself. We also develop a novel Variational Bayes (VB) algorithm that computes the posterior distribution of the parameters while imposing the constraints (similarity/dissimilarity constraints) directly on the posterior distribution. We apply our framework to text/image data and present empirical results on retrieval and classification to demonstrate the effectiveness of the proposed model.

Keywords: Metric learning · Multi-modal data · Beta process · Variational inference · Gibbs sampling

1 Introduction

Recently, multi-modal data has grown explosively thanks to the ubiquity of social media (e.g. Facebook, Flickr, YouTube, iTunes, etc.).
In such data, information comes through multiple input channels (images contain tags and captions, videos are associated with audio signals and/or user comments). Hence,

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-16199-0_5) contains supplementary material, which is available to authorized users.

© Springer International Publishing Switzerland 2015
L. Agapito et al. (Eds.): ECCV 2014 Workshops, Part III, LNCS 8927, pp. 63–77, 2015. DOI: 10.1007/978-3-319-16199-0_5
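As background for the Beta process prior mentioned in the abstract, the following is a minimal sketch of its standard finite Beta-Bernoulli approximation, showing how a non-parametric feature prior lets the effective latent dimensionality be inferred from data: even with a large truncation level K, only a small number of latent features end up being used. All names and parameter values here are illustrative, not taken from the paper's model.

```python
import random

def sample_beta_bernoulli(n_items, K, alpha, rng):
    """Finite (truncation level K) approximation to the Beta process.

    pi_k ~ Beta(alpha/K, 1) is the usage probability of latent feature k;
    z[n][k] ~ Bernoulli(pi_k) indicates whether item n uses feature k.
    """
    pis = [rng.betavariate(alpha / K, 1.0) for _ in range(K)]
    Z = [[1 if rng.random() < p else 0 for p in pis] for _ in range(n_items)]
    return pis, Z

rng = random.Random(0)
K, n_items = 50, 100
pis, Z = sample_beta_bernoulli(n_items=n_items, K=K, alpha=5.0, rng=rng)

# A feature is "active" if at least one item uses it. As K grows, the number
# of active features stays finite, so the model effectively chooses its own
# latent dimensionality rather than using all K columns.
active = sum(1 for k in range(K) if any(Z[n][k] for n in range(n_items)))
```

In the paper's setting, the binary matrix Z would select which latent dictionary atoms each data point uses, and posterior inference (VB or Gibbs) would replace this forward sampling.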