Dealing with uncertainty in anomalous audio event detection using fuzzy modeling

Zied Mnasri 1,2, Stefano Rovetta 1, Francesco Masulli 1, and Alberto Cabri 1

1 DIBRIS, Università degli studi di Genova, Genoa, Italy
2 Electrical Engineering Dept., ENIT, University Tunis El Manar, Tunis, Tunisia
zied.mnasri@enit.utm.tn, {stefano.rovetta,francesco.masulli}@unige.it, alberto.cabri@dibris.unige.it

Abstract. Surveillance systems are becoming increasingly multimodal. The availability of audio motivates the method for anomalous audio event detection (anomalous AED) in road traffic surveillance proposed in this paper. The method combines anomaly detection techniques, namely reconstruction deep autoencoders and fuzzy membership functions. A baseline deep autoencoder computes the reconstruction error of each audio segment. Comparing this error to a preset threshold provides a primary estimate of outlierness. To account for the uncertainty associated with this decision-making step, a fuzzy membership function composed of an optimistic/upper component and a pessimistic/lower component is used. Evaluation results obtained after defuzzification show that, with careful parameter setting, the proposed membership function improves the performance of the baseline autoencoder for anomaly detection, and yields results that are better than, or at least comparable to, those of other state-of-the-art anomaly detection methods such as one-class SVM.

Keywords: Audio event detection, uncertainty, anomaly detection, deep autoencoder, fuzzy membership

1 Introduction

Designing an audio surveillance system depends first on the type of surveillance task, i.e. classification of all detected events, or detection of anomalous/outlier events only. For audio event classification, several techniques originally developed for speech/speaker recognition may be useful, such as generative models (HMM and GMM) and discriminative models (SVM and neural networks).
In the latter case, several anomaly/outlier detection techniques have been applied to audio data, with varying degrees of efficiency. These methods can be classified into metric-based (e.g. KL-divergence distance), reconstruction-based (e.g. autoencoders), and domain-based (e.g. one-class SVM) approaches. Several feature representations have also been proposed, either using hand-crafted low-level descriptors (LLD) computed in both the temporal and spectral domains, or using feature embedding through auto-regressive tools, e.g. autoencoders, or feature fusion [1].
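To make the reconstruction-based approach concrete, the following minimal sketch scores an audio segment by a toy autoencoder's reconstruction error and then softens the crisp threshold decision with a two-component (optimistic/pessimistic) fuzzy membership, in the spirit of the method summarized above. All weights, thresholds, segment features, and the sigmoid membership shapes are hypothetical stand-ins, not the paper's actual model.

```python
# Illustrative sketch only: the paper's deep autoencoder and the exact shape of
# its membership function are not given here; everything below is a hypothetical
# stand-in used to show the idea of fuzzifying a reconstruction-error threshold.
import math

def reconstruction_error(x, w_enc, w_dec):
    """Mean squared error of a toy autoencoder x -> h -> x_hat with a single
    hidden unit (a stand-in for a trained deep autoencoder)."""
    h = math.tanh(sum(wi * xi for wi, xi in zip(w_enc, x)))  # encoder
    x_hat = [wo * h for wo in w_dec]                         # decoder
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat)) / len(x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def outlier_membership(err, threshold, slack=1.0):
    """Graded outlierness in [0, 1] around a crisp threshold:
    - pessimistic/lower component: rises only once err clearly exceeds it,
    - optimistic/upper component: already rises slightly below it.
    Averaging the two components is one simple defuzzification choice."""
    lower = sigmoid((err - (threshold + slack)) / slack)
    upper = sigmoid((err - (threshold - slack)) / slack)
    return 0.5 * (lower + upper)

# Hypothetical weights and feature vectors of two audio segments.
w_enc = [0.5, -0.3, 0.8, 0.1]
w_dec = [0.4, -0.2, 0.7, 0.2]
quiet_segment = [0.05, -0.02, 0.04, 0.01]  # stands in for normal traffic audio
loud_segment = [4.0, -3.5, 5.0, 2.5]       # stands in for an anomalous event

e_quiet = reconstruction_error(quiet_segment, w_enc, w_dec)
e_loud = reconstruction_error(loud_segment, w_enc, w_dec)
print(e_quiet, outlier_membership(e_quiet, threshold=2.0))
print(e_loud, outlier_membership(e_loud, threshold=2.0))
```

Because the two sigmoids are symmetric around the threshold, their average equals exactly 0.5 when the error sits on the threshold itself, which is precisely the region where a crisp decision is least trustworthy; segments far on either side still receive near-crisp memberships close to 0 or 1.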