Combining Speech and Handwriting Modalities for
Mathematical Expression Recognition
Sofiane Medjkoune, Harold Mouchère, Simon Petitrenaud, and Christian Viard-Gaudin
Abstract—In this paper, we open new perspectives for mathe-
matical expression recognition by introducing an original bimodal
system. Since handwritten mathematical expression recognition is
a very challenging task prone to many ambiguities, we use speech
as an additional modality to circumvent limitations that are in-
herent to the written form. A typical use case is a classroom lecture in which the teacher both writes and reads aloud mathematical expressions to ease their interpretation. In addition to state-of-the-art solutions for recognizing handwriting and speech, we introduce a multilayer architecture for merging the two modalities. Specifically, the Dempster–Shafer theory
is used to process the information at the symbol level. This bimodal
system is evaluated on real bimodal data, the HAMEX dataset.
Large improvements are observed when speech and handwriting
are combined, compared with the handwriting modality alone.
Index Terms—Belief functions, handwriting, information align-
ment, mathematical expression (ME), multilevel data fusion,
speech.
I. INTRODUCTION
Nowadays, most electronic devices are audio enabled,
integrate digital surfaces for pen-based input and possess
sufficient computing power for multimedia applications. This
technological progress pushes forward the frontiers of human–
computer interaction [1]–[4]. This is particularly true because multimodality enables disambiguation during human–
human interaction [5], [6]. For instance, during a lecture, a
teacher generally exploits both speech and handwriting modali-
ties to explain a certain phenomenon as clearly as possible. Both
modalities have complementary properties, and combining them clearly benefits communication.
In the literature, some works dealing with the combination of
speech and handwriting modalities already exist for various ap-
plications [7], [8]. For instance, in [9], the combination of speech
and handwriting is exploited for user authentication. Kaiser [4]
proposed SHACER, a Speech and HAndwriting reCognizER
used for meeting recording recognition purposes. In [10], information coming from speech and handwriting is fused to label the elements of a whiteboard chart. The multimodal interaction exploiting speech and handwriting in the classroom context is studied in [11].
Fig. 1. (a) Handwritten and (b) audio signals of MEs.
Nevertheless, to the best of our knowledge, no existing work deals with the automatic recognition of graphical languages, such as diagrams, architectural plans, chemical formulas, or mathematical expressions (MEs), using both handwriting and speech. In this paper, we ad-
dress the combination of speech and handwriting streams in the
particular case of ME recognition. This study considers online
handwriting using touch-screens, whiteboards, and electronic
pens [cf., Fig. 1(a)], and a speech signal containing the corre-
sponding information. The same ME is assumed to be available
within the constraints of each modality. Recognizing MEs is a
very difficult problem for several reasons. First of all, mathe-
matical language is composed of a large set of symbols; several
hundred symbols are required and many of them look similar.
Second, the mathematical language is not a one-dimensional
(1-D) language. Indeed, the spatial relationships between ele-
ments play an important role in the meaning of the expression.
Extracting the layout from the audio signal can be even more difficult, since the verbal expression of spatial relationships can be nontrivial: for instance, the spoken phrase "one plus x over two" may denote either 1 + x/2 or (1 + x)/2. As Fig. 2 shows, speech- and handwriting-based systems do not face the same difficulties, and an error committed by one of the systems may be corrected by the other. Therefore, improved performance can be expected from a system that combines speech and handwriting for mathematical notation recognition.
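To fix ideas about the symbol-level fusion mentioned in the abstract, recall Dempster's rule of combination, which merges the evidence delivered by two sources. The following is a minimal sketch; the notation (m_h for the handwriting recognizer, m_s for the speech recognizer, frame Ω of candidate symbols) and the numerical masses are purely illustrative and are not results from our system. For every nonempty subset A of Ω,
\[
(m_h \oplus m_s)(A) \;=\; \frac{1}{1-K} \sum_{B \cap C = A} m_h(B)\, m_s(C),
\qquad
K \;=\; \sum_{B \cap C = \emptyset} m_h(B)\, m_s(C),
\]
where K measures the conflict between the two sources. For example, if the handwriting recognizer hesitates between "x" and the multiplication sign, m_h({x}) = m_h({×}) = 0.5, while the speech recognizer clearly supports "x", m_s({x}) = 0.8 and m_s(Ω) = 0.2, then the combined mass of {x} is (0.5 × 0.8 + 0.5 × 0.2)/(1 − 0.4) ≈ 0.83, and the ambiguity is resolved in favor of "x".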
The main contribution of this work is a multilayer architecture that combines speech and handwriting at different levels to recognize MEs. We detail various fusion strategies and their contributions at the symbol, relational, and interpretation levels. Experimental results
show the effectiveness of this global system on the original
HAMEX (for Handwritten and Audio dataset of Mathematical
EXpressions) dataset. The rest of this paper is divided into eight
sections. Section II provides a description of the difficulties in