IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-27, NO. 6, NOVEMBER 1981

Rate-Distortion Speech Coding with a Minimum Discrimination Information Distortion Measure

ROBERT M. GRAY, FELLOW, IEEE, AUGUSTINE H. GRAY, JR., SENIOR MEMBER, IEEE, GUILLERMO REBOLLEDO, STUDENT MEMBER, IEEE, AND JOHN E. SHORE, SENIOR MEMBER, IEEE

Abstract-An information theory approach to the theory and practice of linear predictive coded (LPC) speech compression systems is developed. It is shown that a traditional LPC system can be viewed as a minimum distortion or nearest-neighbor system where the distortion measure is a minimum discrimination information between a speech process model and an observed frame of actual speech. This distortion measure is used in an algorithm for computer-aided design of block source codes subject to a fidelity criterion to obtain a 750-bits/s speech compression system that resembles an LPC system but has a much lower rate, a larger memory requirement, and requires no on-line LPC analysis. Quantitative and informal subjective comparisons are made among our system and LPC systems.

I. INTRODUCTION

SPEECH COMPRESSION systems can generally be classified as either waveform coders or speech coders. Waveform coders attempt to communicate an approximation to the original sampled speech waveform. Examples are traditional pulse-code modulation (PCM) and differential pulse-code modulation (DPCM) systems, tree coding techniques, and adaptive versions of these systems. Waveform coders typically require from one bit per sample (6500 bits/s for the sampling rate considered here) for adaptive tree coding to 10 bits per sample for PCM. Speech coders (or voice coders or vocoders) attempt to track the underlying process producing the waveform rather than the waveform itself.
For a frame or vector of sampled speech, the encoder first performs a system identification procedure and selects a speech production model from an allowed class for the observed frame. Then a digital approximation to the model is formed and communicated to the receiver, where the reproduction speech is synthesized using the communicated model. At the present time, the best quality low-rate speech coding systems of reasonable complexity are those based on linear predictive coding (LPC) techniques [32], [36], [22], [23], [2], [31], [39]. These systems typically transmit from 2400 to 6000 bits/s.

In an LPC system the speech production or synthesis model is a finite-order autoregressive (all pole) filter driven by either white noise (for unvoiced sounds) or a periodic pulse train (for voiced sounds). Such a model can be described by a random process \(\{y_n\}_{n=0}^{\infty}\) defined by the difference equation

\[
\begin{aligned}
y_n &= -\sum_{j=1}^{M} a_j y_{n-j} + \sigma\left(\beta^{1/2}\epsilon_n + (1-\beta)^{1/2} s_n\right), & n &\ge 0, \\
y_i &= u_i, & i &= -M, \ldots, -1,
\end{aligned}
\tag{1}
\]

where the \(u_i\) are the initial conditions for the filter (which are usually determined by the previous frame), \(\sigma\) is the filter gain, and \(\beta^{1/2}\epsilon_n + (1-\beta)^{1/2} s_n\) is the driving process, where \(\epsilon_n\) is a zero-mean unit variance sequence of independent random variables and \(s_n\) is a periodic pulse train with period \(\tau\), phase \(\phi\), and zero time-average mean and unit time-average energy. The parameter \(\beta \in [0, 1]\) provides a convenient mix of voiced (\(\beta = 0\)) and unvoiced (\(\beta = 1\)) sounds [32, p. 247]. We assume for convenience that \(\beta > 0\), but it can be very small for voiced sounds and is not known a priori.

Intuitively, the process \(\{y_n\}\) is produced by driving an autoregressive filter described by the z-transform \(\sigma/A(z)\),

\[
A(z) = 1 + \sum_{k=1}^{M} a_k z^{-k},
\]

by a zero-mean unit power process. We shall refer to \(\sigma/A(z)\), or simply \(\sigma/A\), as the model filter and \(1/A(z)\) or \(1/A\) as the normalized model filter. We assume that the \(a_i\) are chosen so that the filter is stable, and hence \(\{y_n\}\) is asymptotically stationary. The complete model is specified by the parameters \(M, \sigma, a_1, \ldots, a_M, \tau, \beta, \phi\), the initial conditions \(\{u_i\}\), and a marginal probability density function for \(\epsilon_n\). It has been found experimentally that good quality can be obtained with a variety of densities for \(\epsilon_n\), including a simple uniform distribution or a Gaussian distribution. We adopt the

Manuscript received March 4, 1980; revised December 12, 1980. This work was supported in part by the Air Force Office of Scientific Research under Contract F 49620-79-0058, in part by the Consejo Nacional de Ciencia y Tecnologia de Mexico, and in part by the Naval Research Laboratory under Contract NRL N00039-79-C-0450.
R. M. Gray is with the Information Systems Laboratory, Stanford University, Stanford, CA 94305.
A. H. Gray, Jr., is with Signal Technology, Inc., Santa Barbara, CA.
G. Rebolledo was with the Information Systems Laboratory, Stanford University, Stanford, CA. He is now with Time and Space Processing, Inc., Santa Clara, CA.
J. E. Shore is with the Naval Research Laboratory, Washington, DC 20375.

0018-9448/81/1100-0708$00.75 © 1981 IEEE
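As a rough illustration of the synthesis model in (1), the following Python sketch generates one frame by running the autoregressive recursion on a mixed noise and pulse-train excitation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `pulse_train` and `synthesize_frame` are hypothetical, the noise is taken to be Gaussian (one of the densities the text reports as working well), the period and phase are assumed to be integer sample counts, the pulse shape (one tall sample per period, small negative constant elsewhere) is just one convenient choice with zero time-average mean and unit time-average energy, and the initial conditions default to zero.

```python
import math
import random


def pulse_train(period, phase, n_samples):
    """Periodic pulse train with zero time-average mean and unit
    time-average energy over each full period.

    Assumed shape (not taken from the paper): one tall sample per
    period, all other samples a small negative constant.
    """
    if period < 2:
        raise ValueError("period must be at least 2 samples")
    high = math.sqrt(period - 1)
    low = -1.0 / math.sqrt(period - 1)
    return [high if (n - phase) % period == 0 else low
            for n in range(n_samples)]


def synthesize_frame(a, sigma, beta, period, phase, n_samples,
                     u=None, rng=None):
    """Generate y_0, ..., y_{N-1} from the difference equation (1).

    a      : predictor coefficients a_1, ..., a_M
    sigma  : filter gain
    beta   : mix parameter in [0, 1] (0 = voiced, 1 = unvoiced)
    u      : initial conditions y_{-M}, ..., y_{-1} (assumed zero if omitted)
    """
    M = len(a)
    if rng is None:
        rng = random.Random()
    full = [0.0] * M if u is None else list(u)
    if len(full) != M:
        raise ValueError("need exactly M initial conditions")
    s = pulse_train(period, phase, n_samples)
    for n in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # zero mean, unit variance (Gaussian assumed)
        drive = sigma * (math.sqrt(beta) * eps
                         + math.sqrt(1.0 - beta) * s[n])
        # full[-j] holds y_{n-j}, so this is -sum_j a_j * y_{n-j}
        prediction = -sum(a[j - 1] * full[-j] for j in range(1, M + 1))
        full.append(prediction + drive)
    return full[M:]


if __name__ == "__main__":
    # Mostly voiced frame from a stable second-order model (illustrative values).
    frame = synthesize_frame(a=[-1.2, 0.6], sigma=1.0, beta=0.05,
                             period=50, phase=0, n_samples=200,
                             rng=random.Random(0))
    print(len(frame), frame[:3])
```

Setting `beta` near 0 makes the pulse train dominate (voiced), while `beta = 1` leaves only the white-noise term (unvoiced). The caller is responsible for choosing coefficients `a` that give a stable filter, as the text assumes.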