International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 1, February 2021, pp. 879~891
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i1.pp879-891
Journal homepage: http://ijece.iaescore.com

Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks

Petcharat Panyapanuwat, Suwatchai Kamonsantiroj, Luepol Pipanmaekaporn
Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Thailand

Article Info

Article history:
Received Jan 1, 2020
Revised Jun 8, 2020
Accepted Aug 18, 2020

Keywords:
Content-based audio retrieval
Deep learning
Deep neural networks
Similarity-preserving hash
Unsupervised learning

ABSTRACT

Due to its efficiency in storage and search speed, binary hashing has become an attractive approach for searching large audio databases. However, most existing hashing-based methods follow a data-independent scheme, in which random linear projections or arithmetic expressions are used to construct the hash functions. Hence, the binary codes do not preserve similarity, which may degrade search performance. In this paper, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. Unlike data-independent hashing methods, we develop a deep network that learns compact binary codes through multiple hierarchical layers of nonlinear and linear transformations such that the similarity between samples is preserved. Independence and balance properties are included in the objective function and optimized to improve the codes. Experimental results on the Extended Ballroom dataset, comprising 3,000 musical excerpts in 8 genres, show that the proposed method significantly outperforms a state-of-the-art data-independent method in both effectiveness and efficiency.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Petcharat Panyapanuwat,
Department of Computer and Information Science,
King Mongkut’s University of Technology North Bangkok,
Bangkok, Thailand.
Email: panyapetch@hotmail.com

1. INTRODUCTION

With the rapid growth of digital audio databases, novel retrieval strategies have received great attention. Early retrieval approaches use textual metadata describing the content of the music audio (e.g., artist name, song title, album name, genre, or release year). When such descriptions are not available, a content-based retrieval strategy that utilizes the perceptual aspects of the audio is required [1]. Content-based audio retrieval is generally performed in two steps: first, features are extracted from the audio file; then, these features are used to build indexes for searching. The two main issues in searching a large database are search speed and storage efficiency. A particularly attractive approach for handling these problems is binary hashing, where the high-dimensional features are encoded into compact binary codes.

Several hashing methods have been proposed in the literature. They can be divided into two categories: data-independent methods and data-dependent methods. Methods in the data-independent category [2-7] use random linear projections or arithmetic expressions to construct hash functions. Because they involve no training process, they are robust to data variation. However, such methods require long hash codes to achieve high precision, which increases the storage cost and degrades the search efficiency [8]. Methods in the data-dependent category, also called learning-to-hash methods, aim to learn from available training data a set of hash functions that yield compact codes with satisfactory search performance [9]. Existing data-dependent methods can be classified into unsupervised, supervised, and semi-supervised