Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System

Tim Capes*, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn Hunt, Jiangchuan Li, Matthias Neeracher, Kishore Prahallad, Tuomo Raitio, Ramya Rasipuram, Greg Townsend, Becci Williamson, David Winarsky, Zhizheng Wu, Hepeng Zhang

Apple Inc., USA

Abstract

This paper describes Apple's hybrid unit selection speech synthesis system, which provides the voices for Siri with the requirements of naturalness, personality, and expressivity. It has been deployed to hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, Mac) via iOS and macOS in multiple languages. The system follows the classical unit selection framework while using deep learning techniques to boost performance. In particular, deep and recurrent mixture density networks are used to predict the target and concatenation reference distributions for the respective costs during unit selection. In this paper, we present an overview of the run-time TTS engine and the voice building process. We also describe various techniques that enable on-device capability, such as preselection optimization, caching for low latency, and unit pruning for low footprint, as well as techniques that improve the naturalness and expressivity of the voice, such as the use of long units.

Index Terms: Speech synthesis, unit selection, hybrid, recurrent mixture density network, on-device

1. Introduction

Text-to-speech (TTS) synthesis is an essential component of a voice-based intelligent personal assistant (e.g. Siri) and other voice-interactive applications (e.g. navigation). The goal of a TTS system is to produce highly intelligible, expressive, and natural-sounding synthetic speech that is indistinguishable from human speech.

There are two mainstream techniques for industry production development, namely waveform concatenation (i.e.
unit selection) and statistical parametric speech synthesis (SPSS) [1]. Given a sequence of text input, unit selection directly assembles waveform segments to produce synthetic speech, while SPSS predicts synthetic speech from trained acoustic models. Unit selection typically produces more natural-sounding speech than SPSS, provided the database used has sufficient high-quality audio material.

Unit selection synthesis in its current form has its origin in [2]. Unit sizes are most often chosen to be half-phones or diphones, or sometimes demi-syllables [3]. Normally, minimal signal processing is done in such a system. The system performs best when the database is large and the audio quality is good. For commercial systems, professional voice actors are often selected as the voice talent(s).

Most recently, much work has centered on using a statistical model to predict acoustic and prosodic parameters for synthesis and then using these predictions to set the costs in a unit selection system; this is known as hybrid unit selection [4]. A variety of techniques have been studied for hybrid unit selection, including recent works that use deep learning techniques [5, 6, 7]. Our system follows this direction, employing deep learning techniques to implement both concatenation and target costs to improve unit selection.

In this paper, we present an overview and implementation details of the Apple Siri unit selection speech synthesis system, which has been deployed to hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, Mac) through iOS and macOS.

*Authors listed in alphabetical order by last name.
In particular, we make the following contributions: first, we use deep and recurrent mixture density networks (MDNs) to predict target and concatenation distributions and jointly implement the target and concatenation costs in a probabilistic way; second, we introduce multiple optimizations (e.g., long units, preselection, and unit pruning) for naturalness, low latency, and low footprint; and third, we describe our language-neutral voice building process.

2. Hybrid Unit Selection System Overview

Our on-device TTS system follows the typical unit selection framework: a front-end produces linguistic features, preselection keeps latency low, a statistical model implements the concatenation and target costs for a Viterbi search that finds the optimum unit sequence, and waveform concatenation generates the final synthesized waveform. As a hybrid system, it benefits from a unified deep learning-based acoustic model that predicts acoustic and prosodic feature distributions and implements the concatenation and target costs. We chose this framework because it can produce higher quality than statistical parametric speech synthesis with acceptable footprint and latency. Additionally, as an on-device system, it allows synthesis without an internet connection. In this section, we briefly introduce each module and some additional optimizations for low latency and quality.

2.1. Front-end

The first step in synthesis is to process the input text and generate a phonetic transcription. The main goal of the front-end (also known as text processing) is to generate a phonetic transcription of the raw input text alongside several linguistic features (punctuation, syllabification, accentuation) in order to guide the prosody prediction and unit selection steps to produce intelligible and natural speech.
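To make the probabilistic cost idea above concrete, the sketch below scores a candidate unit by the negative log-likelihood of its acoustic/prosodic features under a model-predicted distribution. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the function names are ours, and a single diagonal Gaussian stands in for the mixture densities an MDN actually outputs.

```python
import math

def gaussian_nll(x, mean, var):
    """Negative log-likelihood of one feature value under a predicted Gaussian."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def target_cost(unit_features, pred_means, pred_vars):
    """Target cost of a candidate unit: summed NLL of its features
    (e.g. F0, duration) under the network-predicted distribution.
    Units whose features are likely under the prediction get low cost."""
    return sum(gaussian_nll(x, m, v)
               for x, m, v in zip(unit_features, pred_means, pred_vars))
```

A concatenation cost can be built the same way, scoring the feature discontinuity at a candidate join against a predicted reference distribution, so both costs live on a common probabilistic scale.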
To supplement the natural features of the input text, the front-end can also incorporate explicit annotations placed in the text stream to provide hints about pacing, prosody, and discourse domain, many of which are known for specific types of Siri responses. We are currently using a tradi-

Copyright 2017 ISCA. INTERSPEECH 2017, August 20–24, 2017, Stockholm, Sweden. http://dx.doi.org/10.21437/Interspeech.2017-1798
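The Viterbi search mentioned in the system overview, which picks the optimum unit sequence from the preselected candidates, can be sketched as a standard dynamic program. This is a simplified illustration; the function and variable names are ours, not the system's.

```python
def viterbi_unit_selection(candidates, target_cost, concat_cost):
    """Find the minimum-cost path through the unit lattice.

    candidates[t] holds the preselected units for target position t;
    target_cost(t, u) and concat_cost(prev, u) return non-negative costs.
    """
    # Each entry is (cumulative cost, unit path so far).
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for u in candidates[t]:
            # Cheapest way to reach unit u from any previous candidate.
            cost, path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0])
            new_best.append((cost + target_cost(t, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])[1]
```

The search is O(T * K^2) for T target positions and K candidates per position, which is why preselection (trimming K) matters for on-device latency.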