Classification of Bird Sound Using High- and Low-Complexity Convolutional Neural Networks

Aymen Saad 1*, Javed Ahmed 2, Ahmed Elaraby 3

1 Department of Information Technology, Technical College of Management, Kufa, Al Furat Alawsat Technical University, Kufa 54003, Iraq
2 Center of Excellence for Robotics, AI and Blockchain (CRAIB), Computer Science Department, Sukkur IBA University, Airport Road, Sukkur 65200, Pakistan
3 Department of Computer Science, Faculty of Computers and Information, South Valley University, Qena 83523, Egypt

Corresponding Author Email: aymen.abdalameer@atu.edu.iq
https://doi.org/10.18280/ts.390119

Received: 16 November 2021
Accepted: 23 January 2022

ABSTRACT

Birds are a reflection of environmental health, as pollution and climate change affect biodiversity. Experts in ecology and machine learning stand to benefit the most from large-scale monitoring of biodiversity. Today, convolutional neural networks (CNNs) are the preferred choice for species recognition, as their performance has consistently surpassed that of humans. However, CNNs are disadvantaged by their high computational complexity and their need for vast amounts of training data. This paper compares the performance versus the complexity of two widely used CNNs, namely ResNet-50 and MobileNetV1. ResNet-50 is a high-complexity CNN, while MobileNetV1 is a low-complexity CNN targeted at mobile applications. We used spectrogram images of Brazilian bird sounds as inputs to both networks. These birds were chosen because of the abundance of their samples in the Xeno-canto bird sound repository. Short-Time Fourier Transform (STFT) and Mel Frequency Cepstral Coefficient (MFCC) algorithms are used to extract the spectrogram images. To validate the precision of the classifiers, 1,000 spectrogram images of each of ten bird species are produced and fed into both classifiers.
The findings indicate that the accuracy of MobileNetV1 (85.73%) is close to that of ResNet-50 (90.56%) when MFCC features are used.

Keywords: convolutional neural network, spectrogram, bird sound classification, ResNet, MobileNet

1. INTRODUCTION

Birds are particularly useful ecological markers as they reflect changes in their environment. Studies on the diversity of birds are therefore indispensable [1]. Autonomous recorders are used in bioacoustic monitoring to collect large amounts of audio data from fauna vocalisations [2]. Domain experts can identify birds manually, but with larger volumes of data the process is tedious and time-consuming. Hence a more realistic approach is machine learning [3-6]. Several bird identification challenges, such as BirdCLEF [2, 7, 8], have been held to evaluate bird sound classifiers. From 2016 onwards, convolutional neural networks (CNNs) have consistently outperformed other classifiers at classifying bird sounds in BirdCLEF [7]. CNN architectures such as Inception-v3 [9] and ResNet [10] perform classification tasks through the ability of the deep layers of a neural network model to extract high-level features from input images. They are benchmarked on the 1000-class ImageNet dataset [11]. To classify with a CNN, bird sounds are first converted to spectrogram images. However, CNNs are noted for their high computational complexity, making them unsuitable for applications with restricted power budgets. Hence, simpler architectures are continuously being explored for applications where the highest accuracy is not required. The ResNet-50 architecture is a typical state-of-the-art CNN with a depth of 50 layers, 25.6 million parameters and a parameter size of 96 MB [11]. In contrast, MobileNetV1 is 28 layers deep, with 4.2 million parameters and a parameter size of 16.9 MB [12]. It is the original member of the MobileNet family, which is targeted at embedded applications [13].
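As a rough back-of-envelope illustration (not from the paper), the quoted parameter counts translate directly into weight-storage footprints, which is the main reason MobileNetV1 suits constrained devices. The `param_size_mb` helper below is purely illustrative:

```python
# Back-of-envelope comparison of the two models' memory footprints, using the
# parameter counts quoted above (25.6 M for ResNet-50, 4.2 M for MobileNetV1).

def param_size_mb(n_params, bytes_per_param=4):
    """Memory needed for float32 weights, in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6

resnet50_mb = param_size_mb(25.6e6)   # ~102 MB of float32 weights
mobilenet_mb = param_size_mb(4.2e6)   # ~17 MB of float32 weights
print(f"ResNet-50: {resnet50_mb:.1f} MB, MobileNetV1: {mobilenet_mb:.1f} MB, "
      f"ratio ~{25.6 / 4.2:.1f}x")
```

The roughly 6x gap in weight storage is consistent with the published figures (96 MB vs. 16.9 MB, which also reflect container overhead and the exact parameter counts).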
With this simplicity comes a slight loss in accuracy, which is investigated in this paper. Since the performance of both architectures has been measured on ImageNet, each can classify images into up to 1000 object categories. CNNs need vast amounts of data to train their network parameters. For bird sounds, training data is plentiful for the more common species; for rarer species, data augmentation is regularly performed to create synthetic samples. In our experiments, both networks were fed with 10,000 sound samples of Brazilian birds. These birds were selected because of the abundance of samples in the Xeno-canto repository. Each audio clip is resampled to 16 kHz and segmented into 1-second samples. Each sample containing the bird call signal is then expanded into three samples. Using the Short-Time Fourier Transform (STFT) and Mel Frequency Cepstral Coefficient (MFCC) algorithms, the spectrogram representation of each sample is obtained, and all images are resized to 224×224 using MATLAB 2019b. We hypothesize that MobileNetV1 will achieve classification accuracy close to that of ResNet-50 while benefitting from significantly lower computational cost, given the disparity in the number of arithmetic operations of the two CNN models.

Traitement du Signal Vol. 39, No. 1, February 2022, pp. 187-193
Journal homepage: http://iieta.org/journals/ts
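The preprocessing pipeline described above (resample to 16 kHz, cut into 1-second segments, compute a magnitude spectrogram, resize to 224×224) can be sketched as follows. This is a minimal NumPy-only illustration, not the authors' MATLAB implementation: the linear-interpolation resampler, the STFT window/hop sizes, and the nearest-neighbour resize are all assumptions chosen for brevity.

```python
import numpy as np

SR = 16_000          # target sampling rate (Hz), as stated in the paper
SEG_LEN = SR         # 1-second segments

def resample(audio, orig_sr, target_sr=SR):
    """Naive linear-interpolation resampler (illustrative only)."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    t_out = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(t_out, np.arange(len(audio)), audio)

def segment(audio, seg_len=SEG_LEN):
    """Split audio into non-overlapping 1-second segments, dropping the tail."""
    n_segs = len(audio) // seg_len
    return audio[: n_segs * seg_len].reshape(n_segs, seg_len)

def stft_spectrogram(seg, n_fft=512, hop=128):
    """Magnitude STFT spectrogram of one segment (Hann window)."""
    win = np.hanning(n_fft)
    frames = [seg[i : i + n_fft] * win
              for i in range(0, len(seg) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def resize_224(img):
    """Nearest-neighbour resize to 224x224 (stand-in for MATLAB's imresize)."""
    rows = np.linspace(0, img.shape[0] - 1, 224).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, 224).round().astype(int)
    return img[np.ix_(rows, cols)]

# Toy 3-second clip at 44.1 kHz standing in for a Xeno-canto recording
clip = np.random.default_rng(0).standard_normal(3 * 44_100)
segs = segment(resample(clip, 44_100))
spec = resize_224(stft_spectrogram(segs[0]))
print(segs.shape, spec.shape)   # (3, 16000) (224, 224)
```

The MFCC variant would additionally apply a mel filterbank, a log, and a discrete cosine transform to each STFT frame before resizing; either representation yields a fixed-size 224×224 image suitable as CNN input.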