A Bottom-Up Algorithm for Estimating Time-Varying Delays in Coded Speech

Stephen D. Voran
Institute for Telecommunication Sciences
325 Broadway
Boulder, Colorado 80305, USA
svoran@its.bldrdoc.gov
303-497-3839 Voice
303-497-5969 Fax

ABSTRACT

In packetized speech transmission, end-to-end delay can vary, even over short timescales. Estimating the resulting speech delay histories is critical to diagnostic and quality estimation efforts. We present a new bottom-up algorithm for estimating time-varying speech delays. The bottom-up approach is well-suited to real-time implementation. The algorithm works with very low-rate codecs as well as the higher-rate codecs that are more common in VoIP applications. We describe the new algorithm in some detail and provide descriptions of the databases and techniques used to develop and test the new algorithm.

Keywords: speech delay estimation, speech quality estimation, temporal discontinuity, VoIP

1. INTRODUCTION

The packetized transmission of telephone bandwidth speech is gaining prominence in the telecommunications industry. A significant driver of this trend is Voice over Internet Protocol (VoIP) services. In circuit-switched speech transmission, end-to-end transmission delay is nearly always constant, but in packetized speech transmission, end-to-end delay can vary, even over short timescales. The mechanisms that cause this delay variation are well-documented. See [1] and [2] for examples.

There are two main motivations for estimating the delay history of a packetized speech transmission. First, an estimated delay history is required before one can make meaningful input-output based objective estimates of perceived speech quality. (Output-only based estimates of speech quality can enjoy great immunity to delay issues.) Second, knowing the delay history can help to guide the design and optimization of the entire speech transmission system, including the jitter buffer playout algorithm at the receiver.
The delay estimation problem can be described as follows. Given a pair of vectors of speech samples x (system input) and y (system output), for each sample in y find the offset (in samples) to the corresponding sample in x. In most practical applications the result is a partition of y into multiple intervals of samples with a single offset value for each interval. The sample rate associated with x and y can be used to convert the offsets in samples to relative delays. The timing relationship between the recording of x and the recording of y can be used to convert these relative delays to absolute delays.

One solution is described in [3] and was first mentioned in [4]. This can be described as a top-down solution since it starts with a single estimate for all of the samples in y and then uses a set of rules to recursively subdivide y into smaller and smaller segments, with the goal of terminating when each of the final segments has a single constant delay.

In this paper we propose a bottom-up solution. Here a fixed length delay-estimation window is swept across y, resulting in a series of delay estimates at regular intervals across y. These estimates are then processed by a median filter which again uses a fixed window that is swept across y. When a real-time implementation is desired, data-flow issues make a bottom-up solution more practical than a top-down solution.

We are aware of emerging systems that employ packetized transmission of very low-rate encoded speech. One scenario involves Internet-based interconnection of land-mobile radio systems that use very low-rate speech codecs. This can result in received speech that has both significant codec distortions and non-constant delay. Reference [5] indicates that the solution given in [4] (at least as realized in conjunction with the quality estimation algorithm given in [5]) has not been verified as applicable to this scenario.
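
The bottom-up idea described above (a fixed-length window swept across y, followed by a median filter over the resulting estimates) can be illustrated with a minimal sketch. This is not the estimator developed in this paper: the per-window estimate here is a plain normalized waveform cross-correlation, chosen only for brevity, and the names `window_delay`, `bottom_up_delays`, and `offset_to_ms`, the window, hop, search-range, and median-filter lengths, the 8000 samples/s rate, and the synthetic noise signal are all assumptions made for illustration.

```python
import math
import random
from statistics import median_low


def window_delay(x, y, start, win, max_lag):
    """Return the lag d in [0, max_lag] that maximizes the normalized
    correlation between y[start:start+win] and x[start-d:start-d+win]."""
    seg_y = y[start:start + win]
    energy_y = sum(b * b for b in seg_y)
    best_lag, best_score = 0, -math.inf
    for lag in range(max_lag + 1):
        x0 = start - lag
        if x0 < 0 or x0 + win > len(x):
            continue  # candidate window falls outside x
        seg_x = x[x0:x0 + win]
        den = math.sqrt(sum(a * a for a in seg_x) * energy_y)
        if den == 0.0:
            continue  # silent segment, correlation undefined
        score = sum(a * b for a, b in zip(seg_x, seg_y)) / den
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bottom_up_delays(x, y, win=64, hop=32, max_lag=40, med_len=5):
    """Sweep a fixed window across y, then median-filter the estimates.
    All parameter values are illustrative assumptions."""
    starts = list(range(max_lag, len(y) - win + 1, hop))
    raw = [window_delay(x, y, s, win, max_lag) for s in starts]
    half = med_len // 2
    smoothed = [median_low(raw[max(0, i - half):i + half + 1])
                for i in range(len(raw))]
    return starts, raw, smoothed


def offset_to_ms(offset, fs=8000):
    """Convert an offset in samples to a relative delay in milliseconds
    (fs = 8000 samples/s is an assumed sample rate)."""
    return 1000.0 * offset / fs


# Synthetic test signal (assumption): white noise stands in for speech,
# delayed by 10 samples in the first half and 25 samples in the second.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
y = [0.0] * len(x)
for n in range(len(x)):
    d = 10 if n < 1000 else 25
    if n - d >= 0:
        y[n] = x[n - d]

starts, raw, smoothed = bottom_up_delays(x, y)
```

In this sketch `smoothed` holds one offset per window position, so runs of equal values correspond to the intervals of constant delay described above. Each window depends only on nearby samples of x and y, which is the data-flow property that makes the bottom-up approach attractive for real-time use.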
The algorithm described in this paper was developed to provide reliable delay estimates for very low-rate speech codecs, as well as the more typical VoIP scenario. The algorithm described here follows from the delay estimation technique that is described in [6] and [7]. In particular, it first uses low resolution techniques to search wide ranges of possible delays, then uses high resolution techniques to search narrower ranges of possible delays, and finally uses a set of rules to combine these results only when advantageous. Matching low resolution with wide searches and high resolution with narrow searches is an inherently efficient approach. The three-stage approach also allows for the fact that some systems have a well-defined delay down to the speech sample level, while others simply do not.

In the next section we describe the speech database that was used to develop the algorithm. Then we address the issue of how to realistically measure the performance of a delay estimation algorithm. Next we provide a description