Abstract: Audio quality in the Internet can be strongly affected by network conditions. As a consequence, many techniques to evaluate it have been developed. In particular, in 2001 the ITU-T adopted a technique called Perceptual Evaluation of Speech Quality (PESQ) to automatically measure speech quality. PESQ is a well-known and widely used procedure that in general provides an accurate evaluation of perceptual quality by comparing the original and received voice sequences. An inherent limitation of PESQ is thus that it requires the original signal (called the reference) to make its evaluation. This precludes the use of PESQ for assessing the perceived quality in real time, as the reference is in general not available. In this paper, we describe a procedure for estimating the PESQ output using only measurements of the network state and properties of the communication system, without any use of the reference. It is based on statistical learning techniques. Specifically, we rely on recent ideas for learning with a specific type of neural network known as the Echo State Network (ESN), a member of the class of Reservoir Computing systems. These tools have proven to be very efficient and robust in many learning tasks. The experimental results show the good accuracy of the resulting procedure and its capability to provide estimates of speech quality in a real-time context. This allows our measuring modules to be embedded in future Internet applications or services based on voice transmission, for instance for control purposes.

Index Terms: Quality assessment, speech quality, echo state networks, reservoir computing.

I. INTRODUCTION

Measuring the quality of a voice signal transmitted over the Internet is an important topic today, and one of the main available tools for this purpose is the Perceptual Evaluation of Speech Quality (PESQ) method, accepted in 2001 as the ITU-T objective speech quality measurement standard P.862 [1].
Network conditions vary over time and, in many contexts, several different factors lead to losses, which in turn degrade the quality perceived by the users. PESQ analyzes this quality by comparing the received signal with the original speech sequence. For this reason, we say that it is a “full reference” technique, the reference being the original signal. Researchers in many areas use PESQ, and the tool has been widely adopted in commercial measurement products. Recently, the ITU started to update its voice assessment recommendations by promoting the new P.863 standard, Perceptual Objective Listening Quality Assessment (POLQA) [2], but PESQ remains the main tool for voice quality assessment and will probably remain so in the coming years. This is because it provides reasonably good correlation with the scores given by humans to VoIP applications. Observe that since PESQ requires the original signal, we cannot use it in real-time conditions.

Manuscript received March 1, 2013; revised April 15, 2013. This work was supported in part by the European Celtic Project “QuEEN”. S. Basterrech is with the University of Rennes 2, Rennes, France (e-mail: Sebastian.Basterrechtiscordio@etudiant.uhb.fr). G. Rubino is with the National Institute for Research in Computer Science and Control (INRIA Rennes Bretagne Atlantique), Rennes, France (e-mail: Gerardo.Rubino@inria.fr).

In [3] the authors present a method for approximating the values given by PESQ without any need for the original sequence. The idea is to estimate PESQ scores using a Feedforward Neural Network model, based on data concerning the packet loss process caused by the network. These neural models are used because they are simple to manipulate and lead to good results, even though another type of learning tool, Recurrent Neural Networks, exists.
The latter are, in general, very powerful at learning non-linear mappings (which is the case in assessing perceptual quality) using sequential training algorithms. However, their use has been limited mainly due to the inefficiency of their learning algorithms [4], [5], which suffer from slow convergence rates and low robustness, thus limiting in particular their applicability in real-time contexts.

Recently, a new computational neural model has been proposed under the name of Reservoir Computing (RC). It offers a solution to the previously mentioned drawbacks of recurrent architectures while introducing no significant disadvantages. The first two proposed RC models were Liquid State Machines (LSMs) [6] and Echo State Networks (ESNs) [7], published almost simultaneously. Both types of models have been successfully applied to many problems, achieving very good results in temporal and non-temporal learning tasks [5], [8], [9].

In this paper, we study the problem of estimating PESQ scores in a context where the reference signal is not available, using the ESN model. The main idea is to capture the relation between certain network parameters that affect the perceived quality and their corresponding PESQ scores. The ESN tool is known for its modeling accuracy, parsimony and efficiency in the learning process [5]. Another notable property is that the resulting tool is simple to extend or update. Its extensibility and parsimony can be useful when new data becomes available while the system is already in operation. Our approach offers a new method for VoIP quality assessment in the context of Internet applications or services, which is able to provide accurate assessments in real time. To illustrate the performance of our model, we present some numerical results, and we also add a couple of comparisons with other basic statistical learning techniques. The paper is organized as follows.
Real-Time Estimation of Speech Quality through the Internet Using Echo State Networks
Sebastián Basterrech and Gerardo Rubino
Journal of Advances in Computer Network, Vol. 1, No. 3, September 2013. DOI: 10.7763/JACN.2013.V1.37

In Section II, we begin