RERANKING FOR SENTENCE BOUNDARY DETECTION IN CONVERSATIONAL SPEECH

Brian Roark,a Yang Liu,b Mary Harper,c Robin Stewart,d Matthew Lease,e Matthew Snover,f Izhak Shafran,g Bonnie Dorr,f John Hale,h Anna Krasnyanskaya,i and Lisa Yung g

a OGI/OHSU; b UT, Dallas; c Purdue; d Williams; e Brown; f U. of Maryland; g Johns Hopkins; h Michigan State; i UCLA

ABSTRACT

We present a reranking approach to sentence-like unit (SU) boundary detection, one of the EARS metadata extraction tasks. Techniques for generating relatively small n-best lists with high oracle accuracy are presented. For each candidate, features are derived from a range of information sources, including the output of a number of parsers. Our approach yields significant improvements over the best-performing system from the NIST RT-04F community evaluation.¹

1. INTRODUCTION

Automatic speech recognition (ASR) system quality is typically measured in terms of the accuracy of the word sequence. However, automated speech processing applications may benefit from (or sometimes even require) system output that is richer than an undelimited sequence of recognized words. For example, sentence breaks and disfluency annotations are critical for legibility [1], as well as for downstream processing algorithms with complexity that is polynomial in the length of the string, such as parsing. One aspect of the DARPA EARS program² was to focus on structural metadata extraction (MDE) [2], including a range of disfluency annotations and sentence-like unit (SU) boundary detection.

This paper specifically addresses the task of SU boundary detection. Previous approaches to this task have used finite-state sequence modeling approaches, including Hidden Markov Models (HMM) [3] and Conditional Random Fields (CRF) [4]. While these approaches have yielded good results, the characteristics of this task make it especially challenging for Markov models.
Average SU length for conversational telephone speech is around 7 words; hence, most of the time the previous states will correspond to non-boundary positions, providing relatively impoverished state sequence information. Thus, in [4], a Maximum Entropy (MaxEnt) model that did not use state sequence information was able to outperform an HMM by including additional rich information. Our approach is to rely upon a baseline model [5] to produce n-best lists of possible segmentations, and to extract disambiguating features over entire candidate segmentations, with no Markov assumption. This paper presents an effective n-best candidate extraction algorithm, along with a detailed investigation of the utility of a range of features for improving SU boundary detection.

In the next section we provide background on the SU detection task, baseline models, and the general reranking approach. We then present our n-best extraction algorithm and the features we investigated, followed by empirical results under a variety of conditions.

¹ http://www.nist.gov/speech/tests/rt/rt2004/fall/
² http://www.darpa.mil/ipto/programs/ears/

2. BACKGROUND

In this section, we provide background on the baseline SU detection models and our reranking approach.

2.1. MDE tasks and baseline models

There are four tasks for structural MDE in EARS in the most recent evaluations: SU detection, speech repair detection, self-interruption point (IP) detection, and filler detection. Evaluation is conducted using human reference transcriptions (REF) and ASR output, the latter to assess the impact of recognition errors. Two corpora with different speaking styles were used in EARS: conversational telephone speech (CTS) and broadcast news. Performance is generally measured as the number of errors (insertions, deletions, and substitutions when the subtype of the events is considered) per reference event (e.g., SU boundaries, speech repairs), which we will refer to as NIST error.
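As an informal illustration (not the official scoring tools), the NIST error for SU detection, ignoring subtype substitutions, can be computed from the reference and hypothesized boundary positions as:

```python
def nist_su_error(ref_boundaries, hyp_boundaries):
    """NIST error for SU detection: insertions plus deletions
    per reference boundary. Substitutions arise only when SU
    subtypes are scored, which this sketch ignores."""
    ref = set(ref_boundaries)  # interword positions with a reference SU boundary
    hyp = set(hyp_boundaries)  # positions where the system hypothesized a boundary
    insertions = len(hyp - ref)
    deletions = len(ref - hyp)
    return (insertions + deletions) / len(ref)

# One missed and one spurious boundary against 4 reference boundaries:
# nist_su_error({3, 7, 12, 18}, {3, 7, 15, 18}) -> 0.5
```

Note that NIST error can exceed 100%, since insertions are counted against the number of reference events rather than the number of system events.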
We also report F-measure accuracy³ for SU detection.

We now briefly summarize the ICSI/SRI/UW MDE system [5], which is the baseline for the current research. The MDE tasks can be seen as classification tasks that determine whether an interword boundary is an event boundary (e.g., SU or IP) or not. To detect metadata events, multiple knowledge sources are utilized, including prosodic and textual information. Typically, at each interword boundary, prosodic features are extracted to reflect pause length, duration of words and phones, pitch contours, and energy contours. These prosodic features are modeled by a decision tree classifier, which generates a posterior probability of an event given the feature set associated with a boundary. Textual cues are captured by contextual information of words, their corresponding classes, or higher-level syntactic information.

Three different Markov modeling approaches serve as baselines for the MDE tasks: HMM, MaxEnt, and CRF. In all cases, there is a hidden event (E) at each word, representing the segmentation decision following the word. There are also features (F) corresponding to the observed input, e.g., the words and prosodic features. The HMM is a second-order Markov model, the CRF a first-order model, and the MaxEnt a model of order 0. Both the MaxEnt and CRF models are trained using conditional likelihood objectives, whereas the HMM is trained as a generative model. The CRF model has been shown to outperform the MaxEnt model, which outperforms the HMM [4]. Baseline results will be presented in Section 4.

³ If c, s, and r are the number of correct, system, and reference SU boundaries, respectively, then F = 2c/(s + r).
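The reranking idea at the heart of this paper, rescoring entire candidate segmentations with features computed over the whole candidate, can be sketched generically as below. The feature function and weights here are illustrative placeholders only, not the features or learner used in this work:

```python
def rerank(nbest, feature_fns, weights, baseline_weight=1.0):
    """Select the candidate segmentation with the highest combined
    score: a scaled baseline model score plus a weighted sum of
    global features. Each candidate is a (segmentation, baseline_score)
    pair; because features see the entire segmentation, no Markov
    assumption is imposed."""
    def score(candidate):
        seg, baseline_score = candidate
        return baseline_weight * baseline_score + sum(
            w * f(seg) for f, w in zip(feature_fns, weights))
    return max(nbest, key=score)[0]

# Hypothetical use: two candidate boundary sequences (1 = SU boundary
# after that word) with baseline log scores, and one global feature
# that simply counts hypothesized boundaries.
nbest = [([0, 1, 0, 1], -2.0), ([0, 0, 0, 1], -2.5)]
count_boundaries = lambda seg: sum(seg)
best = rerank(nbest, [count_boundaries], [-0.1])
```

In this toy example the first candidate wins (-2.0 - 0.2 = -2.2 versus -2.5 - 0.1 = -2.6); with different weights, a global feature like this could overturn the baseline ranking.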