A COMPUTATIONALLY CONSTRAINED OPTIMIZATION FRAMEWORK FOR IMPLEMENTATION AND TUNING OF SPEECH ENHANCEMENT SYSTEMS

Daniele Giacobello, Jason Wung, Ramin Pichevar, Joshua Atkins

Beats Electronics, LLC, 8600 Hayden Place, Culver City, CA 90232
{Daniele.Giacobello, Jason.Wung, Ramin.Pichevar, Josh.Atkins}@beatsbydre.com

ABSTRACT

In this work, we propose an optimization framework for tuning the parameters of a speech enhancement system to maximize its performance while satisfying the computational complexity constraint imposed by a target platform. Some parameters allow for enabling or disabling certain algorithmic components of the system, effectively guiding the implementation effort. The speech enhancement system is deployed in a speech recognition front-end and in a full-duplex telephony system. The optimization variables are the parameters of the system, and the performance is measured using phone accuracy rate and mean opinion score, respectively. The problem is then a nonlinear program of combinatorial nature, which is solved efficiently using a genetic algorithm. The results show improvement in performance over common tuning and implementation strategies.

1. INTRODUCTION

Speech enhancement (SE) algorithms are fundamental to most speech-centric applications due to a plethora of acoustical disturbances that degrade the captured speech signals [1]. The research and development effort in designing SE systems aims at integrating different algorithms and maximizing the performance using objective measures [2]. When the SE systems are used in full-duplex speech communications, the objective is to maximize the perceptual quality using the mean opinion score (MOS) [3], which can be calculated using automated techniques that mimic the human hearing process [4].
The current ITU-T standardized model is the Perceptual Objective Listening Quality Assessment (POLQA) [5], which produces reliable scores for evaluating SE algorithms and overcomes several limitations of its predecessor, the Perceptual Evaluation of Speech Quality (PESQ) [6]. When SE systems are used as a pre-processor for automatic speech recognition (ASR), the objective of the algorithmic design is to maximize the speech recognition accuracy [7]. While model-domain enhancement methods have been shown to better account for the mismatch between the training condition and the application scenario [8], methods relying on fixed acoustic models based on hidden Markov models (HMMs) are still the most common for limited-vocabulary recognition on embedded systems [9]. Therefore, these methods rely heavily on the SE algorithms to enhance the speech signals before feature extraction to match the training condition of the ASR [10]. Accurate ways to assess ASR reliability are still a matter of debate since they are heavily application- and context-dependent [11]. However, for embedded systems, the phone accuracy rate (PAR), or at a higher semantic level the word accuracy rate (WAR), is generally appropriate as a performance measure for the ASR.

During development and prototyping, a commercially viable SE system must take into account the constraints of the target platform [12]. For audio-related applications, field-programmable gate arrays (FPGAs) [13] and dedicated digital signal processors (DSPs) are the most common choices since they generally have lower cost, lower latency, and lower energy consumption [14]. However, meeting the computational budget of the target hardware, commonly measured in terms of million cycles per second (MCPS), is generally a non-negotiable condition [15].
The computational complexity of an algorithm is calculated by counting the number of basic mathematical operations, e.g., multiplications, additions, or multiply-accumulations (MACs), as well as the usage of pre-defined, highly optimized subroutines already embedded in the processor, e.g., fast Fourier transforms (FFTs) [16].

The objective of maximizing the perceptual quality or the speech recognition accuracy often conflicts with the computational constraints imposed by the target platform. While profiling each component of a SE system during development is a good practice to avoid overly complex solutions, the tuning of the system is often done at an advanced stage of the development and may influence the computational complexity dramatically. Furthermore, the optimization often relies on measures that are easier to handle mathematically, e.g., the mean-squared error (MSE) or the log-spectral distortion (LSD) [2], but may not relate well to the actual goal of the system, i.e., maximizing the perceptual quality or the speech recognition accuracy.

In our recent works [17, 18], we formalized the tuning of a SE system for full-duplex communications by casting it as an optimization problem, where the objective function was a perceptual objective measure and the optimization variables were the parameters of the system. The work was then extended to the optimization of an ASR front-end [19], where the objective function was the back-end recognizer accuracy. Similar ideas were used in [20] and in [21] to tune the parameters of a noise reduction system and the parameters of an ASR back-end, respectively.

In previous works, however, the optimization problem was unconstrained. Thus, any solution maximizing the perceptual objective quality or the recognition accuracy could be accepted, regardless of its computational cost. In this work, a nonlinear penalty function accounting for the computational complexity is introduced in the optimization framework.
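As an illustration of how a complexity penalty can enter the objective, the following is a minimal Python sketch and not the formulation used in this work. The quadratic form of the penalty, the per-block cycle counts, the frame rate, and all names (`Block`, `estimate_mcps`, `penalized_score`) are assumptions chosen for the example. It estimates the MCPS of the enabled blocks from hypothetical per-frame cycle counts and subtracts a penalty from the quality score only when the budget is exceeded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One algorithmic component; all figures used here are hypothetical."""
    name: str
    cycles_per_frame: float  # e.g. derived from MAC and FFT counts


def estimate_mcps(blocks, enabled, frames_per_second):
    """Million cycles per second consumed by the enabled blocks."""
    cycles = sum(b.cycles_per_frame for b, on in zip(blocks, enabled) if on)
    return cycles * frames_per_second / 1e6


def penalized_score(quality, mcps, budget_mcps, weight=1.0):
    """Quality (e.g. MOS or PAR) minus a quadratic penalty on budget excess.

    The quadratic penalty is an assumed form for illustration only.
    """
    excess = max(0.0, mcps - budget_mcps)
    return quality - weight * excess ** 2


# Hypothetical system with three blocks and 100 frames per second.
blocks = [
    Block("echo_canceler", 200_000.0),
    Block("noise_suppressor", 150_000.0),
    Block("postfilter", 50_000.0),
]
full = estimate_mcps(blocks, (True, True, True), frames_per_second=100)
reduced = estimate_mcps(blocks, (True, True, False), frames_per_second=100)

# With a 36 MCPS budget, the full system is penalized and the reduced one is not.
score_full = penalized_score(3.5, full, budget_mcps=36.0)
score_reduced = penalized_score(3.4, reduced, budget_mcps=36.0)
```

In this toy setting, a search procedure such as a genetic algorithm would prefer the reduced configuration, since the small loss in quality is outweighed by the penalty incurred by exceeding the budget.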
The system to be optimized comprises several algorithmic blocks. Two large databases of conversational speech, derived from the TIMIT database [22] and covering a wide range of scenarios, are used for training and testing. The system is then optimized for either full-duplex communications or an ASR front-end, with the computational complexity constraint specified in terms of MCPS.

2. SPEECH ENHANCEMENT ALGORITHM

Let y[n] be the near-end microphone signal, which consists of the near-end speech s[n] and noise v[n] mixed with the acoustic echo