INTERFACE: a Matlab© Tool for Building Animated
MPEG4 Talking Heads from Motion-Captured Data
Graziano Tisato, Piero Cosi, Carlo Drioli, Fabio Tesser
Istituto di Scienze e Tecnologie della Cognizione
Sezione di Padova “Fonetica e Dialettologia”
Via G. Anghinoni, 10 – 35121 Padova (ITALY)
+39 049 8274413
[tisato, cosi, drioli, tesser]@pd.istc.cnr.it
ABSTRACT
INTERFACE is an integrated software environment, designed and
implemented in Matlab©, for building emotive/expressive talking
heads from motion-captured data. INTERFACE simplifies and
automates many of the operations needed for that purpose. A set of
processing tools, focused mainly on dynamic articulatory data
extracted by an automatic optotracking 3D movement analyzer, was
implemented in order to build up the animation engine, which is
based on the Cohen-Massaro coarticulation model, and to create the
WAV and FAP files needed for the animation. LUCIA, our animated
MPEG-4 talking face, can copy a real human by reproducing the
movements of markers positioned on the speaker's face and recorded
by an optoelectronic device, or can be driven directly by emotional
XML-tagged input text, thus realizing true audio-visual
emotive/expressive synthesis. LUCIA's voice is based on an Italian
version of the FESTIVAL-MBROLA packages, modified for expressive
synthesis by means of an appropriate APML/VSML tagged language.
Categories and Subject Descriptors
I.6.7 [SIMULATION AND MODELING]: Simulation Support
Systems
General Terms
Algorithms, Design, Human Factors, Standardization
Keywords
Talking Head, Facial Animation, Motion Capture, MPEG4.
1. INTRODUCTION
The transmission of emotions in speech communication is a topic
that has recently received considerable attention. Automatic
speech recognition (ASR) and multimodal or audio-visual (AV)
speech synthesis are examples of fields in which the processing of
emotions can have a great impact and can improve the
effectiveness and naturalness of man-machine interaction. In our
TTS (text-to-speech) framework, AV speech synthesis, that is, the
automatic generation of voice and facial animation from arbitrary
text, is based on parametric descriptions of both the acoustic and
visual speech modalities. The visual speech synthesis uses 3D
polygon models, which are parametrically articulated and deformed,
while the acoustic speech synthesis uses an Italian version of the
FESTIVAL diphone TTS synthesizer [1], now modified with
emotive/expressive capabilities.
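The overall flow of this framework can be sketched in Matlab as
follows; every function name here is a hypothetical placeholder for
illustration (only audiowrite is a standard Matlab function), not
the system's actual API:

    % Sketch of the AV synthesis pipeline described above
    % (hypothetical wrapper names; only audiowrite is standard Matlab).
    text = 'Buongiorno';                          % arbitrary input text
    [wav, fs, phones] = synthesize_speech(text);  % hypothetical FESTIVAL/MBROLA
                                                  % wrapper: audio plus phoneme timings
    faps = phones_to_faps(phones);                % hypothetical mapping from phoneme
                                                  % timings to frame-by-frame FAP values
    audiowrite('utterance.wav', wav, fs);         % audio track for the animation
    write_fap('utterance.fap', faps, 25);         % hypothetical FAP writer, 25 frames/s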
2. INTERFACE
INTERFACE, whose block diagram is given in Figure 1, is an
integrated software designed and implemented in Matlab© in
order to simplify and automates many of the operation needed for
building-up a talking head from motion-captured data.
INTERFACE was focused mainly on articulatory data collected by
ELITE, a fully automatic movement analyzer for 3D kinematics data
acquisition [2], but can easily be adapted to other motion-captured
data. ELITE provides 3D coordinate reconstruction, starting from 2D
perspective projections, by means of a stereophotogrammetric
procedure which allows free positioning of the TV cameras. The 3D
coordinates are then used to create our articulatory lip model and
to drive our talking face directly, copying human facial movements.
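To illustrate the underlying principle (a minimal sketch assuming
two calibrated cameras with known 3x4 projection matrices; ELITE's
actual stereophotogrammetric procedure is more elaborate), a
marker's 3D position can be recovered from its two 2D image
projections by linear triangulation:

    % triangulate_marker.m -- linear (DLT) triangulation of one marker.
    % P1, P2: 3x4 camera projection matrices; x1, x2: 2D image coordinates
    % of the same marker in the two views. Illustrative sketch only.
    function X = triangulate_marker(P1, x1, P2, x2)
        A = [x1(1)*P1(3,:) - P1(1,:);
             x1(2)*P1(3,:) - P1(2,:);
             x2(1)*P2(3,:) - P2(1,:);
             x2(2)*P2(3,:) - P2(2,:)];
        [~, ~, V] = svd(A);           % least-squares solution of A*X = 0
        X = V(1:3, end) / V(4, end);  % dehomogenize to get 3D coordinates
    end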
INTERFACE was created mainly to develop LUCIA [3], our MPEG-4 [4]
compatible graphic facial animation engine (FAE). In MPEG-4, FDPs
(Facial Definition Parameters) define the shape of the model, while
FAPs (Facial Animation Parameters) define the facial actions [5].
In our case, the model uses a pseudo-muscular approach, in which
muscle contractions are obtained through the deformation of the
polygonal mesh around feature points that correspond to skin muscle
attachments. A facial action sequence is generated by deforming the
face model, starting from its neutral state, according to the
specified FAP values, each indicating the magnitude of the
corresponding action at the corresponding time instant.
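A minimal sketch of this pseudo-muscular deformation is given
below; the linear distance falloff, the influence radius, and the
function name apply_fap are illustrative assumptions rather than
LUCIA's actual muscle model:

    % apply_fap.m -- displace mesh vertices around one feature point.
    % V: Nx3 neutral-face vertices; featPt: 1x3 feature point position;
    % dir: 1x3 unit displacement direction; fapVal: FAP magnitude;
    % radius: extent of the muscle's influence. Illustrative sketch only.
    function V = apply_fap(V, featPt, dir, fapVal, radius)
        d = sqrt(sum((V - featPt).^2, 2));  % distance of each vertex
                                            % from the feature point
        w = max(0, 1 - d/radius);           % linear falloff inside radius
        V = V + (fapVal * w) .* dir;        % weighted displacement along dir
    end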
For a complete description of all the features and characteristics
of INTERFACE, a fully detailed PDF manual is in preparation and is
available at the official LUCIA web site:
http://www.pd.istc.cnr.it/LUCIA/Docs/InterFace-AISV2004.pdf
INTERFACE handles four types of input data, from which the
corresponding MPEG-4 compliant FAP-stream can be created:
(A) Articulatory data, represented by the marker trajectories
captured by ELITE; these data are processed by four programs:
• “Track”, which defines the pattern used for
acquisition and implements a new 3D trajectory
reconstruction procedure;