ACTA ACUSTICA UNITED WITH ACUSTICA
Vol. 90 (2004) 1084 – 1095

From Audio-Only to Audio and Video Text-to-Speech

Eric Cosatto, Hans Peter Graf, Jörn Ostermann
eric.cosatto, hpgraf, joern.ostermann @ieee.org

Juergen Schroeter
AT&T Labs – Research, Room D163, 180 Park Ave, Florham Park, NJ 07932, USA.
jsh@research.att.com

Summary
Assessing the quality of Text-to-Speech (TTS) systems is a complex problem because of the many modules that address different subtasks during synthesis. Adding face synthesis – the animation of a “talking head” and its rendering to video – to a TTS system makes evaluation even more difficult. In the case of talking heads, research toward evaluating such systems is today still in its infancy. This paper reports on progress made with the AT&T sample-based Visual TTS (VTTS) system. Our system combines unit-selection synthesis (now well known from audio TTS) with a moderate-size recorded database of video segments that are modified and concatenated to render the desired output. Given the high quality the system achieves, we feel for the first time that we are close to passing the Turing test; that is, we are almost able to synthesize “talking heads” that look like recordings of real people. We demonstrate this point in applications, either over the web (client/server) or in stand-alone form in a kiosk setting. Several steps are necessary to assure a very high-quality sample-based VTTS system. First, highly accurate image analysis tools are important for creating the necessary video-clip databases. The problem is compounded by the fact that facial videos cannot be stored whole, due to unfavorable combinatorics: for a given synthetic sequence, it is very unlikely that any single recorded face video clip contains the correct mouth sequence, the appropriate eye sequence, and also a suitable “background” face, given what we want to synthesize.
Consequently, separate parts of a synthetic face need to be accessible independently of each other at synthesis time. Therefore, image analysis tools semi-automatically extract (i.e., cut) the desired facial features out of recorded video, normalize the apparent position of the camera (the “pose”, i.e., the angle and distance between face and lens), and index and store the images in disjoint databases. Second, fast search techniques (“unit selection”) extract the most appropriate sequences of facial building blocks at runtime. These include background face images that convey the desired head movements and serve as canvases for painting (projecting) other content-bearing parts of the face, such as the mouth and eyes. In a final step, the resulting composite face image is rendered on a graphics screen for display. The higher the quality of a (V)TTS system, the more important it is to carefully evaluate all algorithmic choices. Naturally, subjective testing, although time-consuming and expensive, has to be the ultimate measure. However, we used objective measures for quality assessment during the development phase of our system. For example, we found that the accuracy and timeliness of lip closures and protrusions, turning points (where a speaker’s mouth changes direction from opening to closing), and the overall smoothness of the articulation are very critical for achieving high quality. We also found that “visual prosody”, the movement of the head in synchrony with the stress pattern of the spoken sentence, is important for a natural look.

PACS no. 43.72.Kb, 43.72.Ja, 43.71.Gv

Received 8 November 2003, accepted 14 April 2004.

1. Introduction
At the start of the new millennium, telecommunications has fully embraced Internet-Protocol (IP) networks, supporting multiple media such as voice, video,
documents, database accesses, etc. Going forward, more and more devices, from telephones to PDAs and PCs, will enable communication over IP networks in multiple modalities, including “video” in addition to the traditional “voice” communication. Increasingly, human-to-human communication will be complemented by communication between humans and machines for such applications as e-commerce, customer care, and information delivery services [1]. The “speech circle” depicted in Figure 1 illustrates the general concepts and the different modules used in natural lan-

Currently at NEC Laboratories, Princeton, New Jersey.
Now with Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, University of Hannover, Germany.

© S. Hirzel Verlag EAA
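To make the “unit selection” idea from the Summary concrete, the following is a minimal illustrative sketch (not the actual AT&T VTTS implementation; all names and cost functions are hypothetical). It performs a Viterbi-style search that picks one candidate video unit per target viseme, minimizing the sum of a target cost (how well a unit matches the desired viseme) and a concatenation cost (how smoothly adjacent units join).

```python
# Illustrative sketch only -- not the AT&T VTTS code. A Viterbi search
# over candidate video units: one unit is chosen per target slot so that
# the total of target costs and concatenation costs is minimal.

def unit_select(targets, candidates, target_cost, concat_cost):
    """Return the lowest-cost sequence of units, one per target."""
    cols = [candidates[t] for t in targets]  # candidate units per slot
    # best[i][j] = (cumulative cost, backpointer) for unit j at slot i
    best = [[(target_cost(targets[0], u), None) for u in cols[0]]]
    for i in range(1, len(cols)):
        row = []
        for u in cols[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u), k)
                for k, prev in enumerate(cols[i - 1]))
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Backtrack from the cheapest final unit.
    j = min(range(len(cols[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(cols) - 1, -1, -1):
        path.append(cols[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


# Toy usage: units are (viseme, frame_index) pairs; consecutive frames
# from the recording concatenate cheaply, so the search prefers them.
if __name__ == "__main__":
    candidates = {"a": [("a", 0), ("a", 5)], "b": [("b", 1), ("b", 9)]}
    tc = lambda t, u: 0.0 if u[0] == t else 1.0       # viseme mismatch
    cc = lambda p, u: 0.1 * abs(u[1] - p[1] - 1)      # frame-continuity
    print(unit_select(["a", "b"], candidates, tc, cc))
    # -> [('a', 0), ('b', 1)]
```

In a real system the costs would of course operate on image features (lip closure, protrusion, pose) rather than toy frame indices, and the candidate lists would come from the indexed, pose-normalized databases described above.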