Phonetic Segmentation of Emotional Speech with HMM-based Methods

Iosif Mporas, Todor Ganchev, Nikos Fakotakis
Artificial Intelligence Group, Wire Communications Laboratory
Dept. of Electrical and Computer Engineering, University of Patras, Greece
{imporas, tganchev, fakotaki}@upatras.gr

Abstract

In the present work we address the problem of phonetic segmentation of emotional speech. Investigating various traditional and recent HMM-based methods for speech segmentation, which we adapted to the specifics of emotional speech, we demonstrate that the HMM-based method with hybrid embedded-isolated training offers superior segmentation accuracy compared to the other HMM-based models used so far. The increased precision of the segmentation is a consequence of the iterative training process employed in the hybrid-training method, which refines the model parameters and the estimated phonetic boundaries by taking advantage of the estimates made at previous iterations. Furthermore, we demonstrate the benefits of using purposely-built models for each target category of emotional speech, when compared to the case of one common model built solely from neutral speech. This advantage in segmentation accuracy justifies the effort of creating and employing a purposely-built segmentation model per emotional category, since it significantly improves the overall segmentation accuracy.

Keywords: Phonetic segmentation, hidden Markov models, emotional speech.

1. Introduction

In recent years, automated systems supporting voice or multimodal human-machine interaction [1], such as voice portals, call centers, e-banking, info kiosks, web services and applications, have come into extensive use. Due to the widespread adoption of this technology and the demand for convenient and efficient interaction, the design and development of natural and user-friendly speech interfaces have become of primary importance.
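The iterative refinement described in the abstract, where model parameters and boundary estimates improve each other across training passes, can be illustrated in much simplified form as alternating Viterbi forced alignment with model re-estimation. The sketch below is not the authors' implementation: it assumes 1-D features and single-Gaussian, unit-variance phone models purely for illustration.

```python
import numpy as np

def forced_align(loglik, phone_seq):
    """Viterbi forced alignment.

    loglik[t, p] : log-likelihood of frame t under phone model p
    phone_seq    : known left-to-right phone transcription
    Returns, for each frame, the index into phone_seq it is assigned to.
    """
    T, S = loglik.shape[0], len(phone_seq)
    NEG = -1e30
    delta = np.full((T, S), NEG)        # best partial-path score
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0, 0] = loglik[0, phone_seq[0]]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]                        # remain in phone s
            move = delta[t - 1, s - 1] if s > 0 else NEG  # advance from s-1
            back[t, s] = s - 1 if move > stay else s
            delta[t, s] = max(stay, move) + loglik[t, phone_seq[s]]
    path = np.zeros(T, dtype=int)
    path[-1] = S - 1                    # alignment must end in the last phone
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def refine(feats, phone_seq, n_iter=5):
    """Alternate forced alignment and model re-estimation, so each pass
    refines both the phone models and the estimated phone boundaries."""
    T, S = len(feats), len(phone_seq)
    n_phones = max(phone_seq) + 1
    path = np.minimum(np.arange(T) * S // T, S - 1)  # uniform initial split
    for _ in range(n_iter):
        phone_of_frame = np.asarray(phone_seq)[path]
        # re-estimate each phone's mean from its currently aligned frames
        means = np.array([feats[phone_of_frame == p].mean()
                          if np.any(phone_of_frame == p) else 0.0
                          for p in range(n_phones)])
        # unit-variance Gaussian log-likelihoods (up to a constant)
        loglik = -0.5 * (feats[:, None] - means[None, :]) ** 2
        path = forced_align(loglik, phone_seq)
    return path

# Toy 1-D "utterance": three steady segments of 20, 40, and 30 frames
feats = np.concatenate([np.zeros(20), np.full(40, 5.0), np.full(30, 10.0)])
path = refine(feats, phone_seq=[0, 1, 2])
# the recovered boundaries land at frames 20 and 60
```

In a full system the single-Gaussian models would be replaced by multi-state HMMs over cepstral features, but the alternation shown here, align, re-estimate, re-align, is the mechanism by which the boundary estimates from one iteration feed the next.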
In general, humans feel that communication with other humans is more natural because the extra information conveyed in their non-verbal expressions can be recognized, processed, and reflected [2]. During a human-to-human interaction, there are two channels transmitting in parallel, one conveying