Phonated Speech Reconstruction Using Twin Mapping Models

Hamid R. Sharifzadeh*, Amir HajiRassouliha*, Ian V. McLoughlin†, Iman T. Ardekani*, Jacqueline E. Allen‡
* Signal Processing Lab, Unitec Institute of Technology, Auckland, New Zealand
  Email: {hsharifzadeh, ahajirassouliha, iardekani}@unitec.ac.nz
† School of Computing, The University of Kent, Kent, United Kingdom
  Email: i.v.mcloughlin@kent.ac.uk
‡ Department of Otolaryngology, North Shore Hospital, Auckland, New Zealand
  Email: jeallen@voiceandswallow.co.nz

Abstract—Computational speech reconstruction algorithms have the ultimate aim of returning natural-sounding speech to aphonic and dysphonic individuals. These algorithms can also be used by unimpaired speakers for communicating sensitive or private information. When the glottis loses function due to disease or surgery, aphonic and dysphonic patients retain the power of vocal tract modulation to some degree, but they are unable to produce anything more than hoarse whispers without prosthetic aid. While whispering can be seen as a natural and secondary aspect of speech communication for most people, it becomes the primary mechanism of communication for those who have impaired voice production mechanisms, such as laryngectomees. In this paper, by considering the current limitations of speech reconstruction methods, a novel algorithm for converting whispers to normal speech is proposed and the efficiency of the algorithm is discussed. The proposed algorithm relies upon twin mapping models and makes use of artificially generated whispers (called whisperised speech) to regenerate natural phonated speech from whispers. Through a training-based approach, the mapping models exploit whisperised speech to overcome the frame-to-frame time alignment problem in the speech reconstruction process.

I. INTRODUCTION

The human voice is the most magnificent instrument for communication, capable of expressing deep emotions, conveying oral history through generations, or of starting a war. However, those who suffer from aphonia (no voice) or dysphonia (voice disorders) are unable to make use of this critical form of communication. They are typically unable to project anything more than hoarse whispers [1].

Whispered speech is useful for quiet and private communication in daily life [2], [3], [4]. Unimpaired speakers occasionally use whispers to communicate in public locations such as libraries and cinemas, or during lectures and meetings. But whispered speech becomes the primary communicative mechanism for many people experiencing voice box difficulties [5], [6]. There is no definitive estimate of the global population suffering some form of voice problem, but information from a number of studies [7], [8], [9] suggests that one third of the population have impaired voice production at some point in their lives (temporarily), and further that the number of new patients with significant, long-lasting voice problems (e.g. laryngectomees) is around 35,000 annually in OECD countries.

Patients reduced to whispering have generally lost their pitch generation mechanism [1] through physiological blocking of vocal cord vibrations or, in pathological cases, blocking through disease or exclusion by an operation. Typical prostheses for voice-impaired patients (esophageal speech [10], tracheoesophageal puncture (TEP) [11], and electrolarynx devices [12]) allow patients to regain limited speaking ability but do not generate natural-sounding speech; at best their sound is monotonous or robotised [13], [14], [15], [16]. Additional drawbacks of traditional prostheses are difficulty of use and the risk of infection from surgical insertion [17], [18].
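Training-based reconstruction, as adopted in this paper, requires each whispered frame to be paired with a corresponding phonated frame, which raises the frame-to-frame time alignment problem noted in the abstract. The sketch below is not the paper's twin-mapping algorithm; it is a generic dynamic time warping (DTW) illustration, assuming hypothetical toy per-frame feature vectors, of how two differently timed feature sequences can be aligned frame by frame.

```python
# Illustrative only (not the paper's method): aligning whispered and
# phonated feature-frame sequences with classic dynamic time warping.
# Feature vectors here are hypothetical 2-D toys; real systems would
# use spectral features such as MFCCs.

def dtw_align(whisper_frames, phonated_frames):
    """Return a list of (i, j) index pairs aligning the two sequences."""
    n, m = len(whisper_frames), len(phonated_frames)

    def dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    INF = float("inf")
    # cost[i][j] = minimal accumulated distance aligning the first
    # i whispered frames with the first j phonated frames.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(whisper_frames[i - 1], phonated_frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a whisper frame
                                 cost[i][j - 1],      # skip a phonated frame
                                 cost[i - 1][j - 1])  # match both frames

    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return path[::-1]

if __name__ == "__main__":
    # Toy example: the phonated sequence is a slowed copy of the whisper.
    w = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
    p = [(0.0, 1.0), (0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (2.0, 3.0)]
    print(dtw_align(w, p))
```

The twin-mapping idea in this paper sidesteps exactly this step: because whisperised speech is generated artificially from phonated speech, the training pairs are aligned by construction rather than by warping.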
Thus, within a speech processing framework, recent computational reconstruction methods (particularly whisper-to-phonated-speech conversion) aim to regenerate natural-sounding speech for aphonic and dysphonic individuals. Furthermore, compared with traditional prostheses, these methods are non-invasive and non-surgical.

Several methods are available for converting whispers to normal speech [19], [20], [21], [22], [23]. The driving idea of all these methods is the assumption that whispers are missing some acoustic and spectral features compared with normal speech; hence, the problem of converting whispers to normal speech is formalised as a reconstruction problem [4], [24]. Through this approach, these methods aim to add or enhance the missing or modified features and thereby increase the similarity of the whispered signal to normal speech.

In general, these reconstruction methods can be classified into two major groups: training-based and non-training-based. Machine learning algorithms are the basis of training-based methods (whispers are mapped to the corresponding normal speech), while non-training methods rely upon whisper enhancement and pitch regeneration.

These reconstruction methods (whether training-based or non-training) have different disadvantages, such as problems in converting continuous speech (due to the use of phoneme switching) [20], or being computationally expensive (due to the use of highly overlapped frames for spectral enhancement, or of a jump Markov linear system for pitch and voicing parameters) [19],