Zied Mnasri, Fatouma Boukadida & Noureddine Ellouze
Signal Processing: An International Journal (SPIJ), Volume (4): Issue (6) 352

F0 Contour Modeling for Arabic Text-to-Speech Synthesis Using Fujisaki Parameters and Neural Networks

Zied Mnasri
zied.mnasri@gmail.com
Ecole Nationale d’Ingénieurs de Tunis
Electrical Engineering Department
Signal, Image and Pattern Recognition Research Unit
University Tunis El Manar
Tunis, 1002, Tunisia

Fatouma Boukadida
fatoumaboukadida@yahoo.fr
Institut Supérieur des Technologies Médicales
Electrical Engineering Department
University Tunis El Manar
Tunis, 1002, Tunisia

Noureddine Ellouze
noureddine.ellouze@enit.rnu.tn
Ecole Nationale d’Ingénieurs de Tunis
Electrical Engineering Department
Signal, Image and Pattern Recognition Research Unit
University Tunis El Manar
Tunis, 1002, Tunisia

Abstract

Speech synthesis quality depends on its naturalness and intelligibility. These abstract concepts are the concern of phonology. In terms of phonetics, they are conveyed by prosodic components, mainly the fundamental frequency (F0) contour. F0 contour modeling is performed either by setting rules or by investigating databases, with or without parameters, and following either a time-sequential path or a parallel, superpositional scheme. In this study, we chose to model the F0 contour for Arabic using Fujisaki parameters, which are trained by neural networks. A statistical evaluation was carried out to measure the accuracy of the predicted parameters and the closeness of the synthesized F0 contour to the natural one. Findings concerning the adoption of Fujisaki parameters for Arabic F0 contour modeling in text-to-speech synthesis are discussed.

Keywords: F0 Contour, Arabic TTS, Fujisaki Parameters, Neural Networks, Phrase Command, Accent Command.

1. INTRODUCTION

Text-to-speech (TTS) systems have improved considerably through a variety of techniques. However, naturalness remains a troublesome aspect that still needs attention.
In fact, naturalness is a very broad concept; it may relate to the speech synthesizer, which is required to produce an acoustic signal that matches the natural waveform as closely as possible, or to the listeners, who react perceptually to the sound they hear [1].