This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS

Combining Speech and Handwriting Modalities for Mathematical Expression Recognition

Sofiane Medjkoune, Harold Mouchère, Simon Petitrenaud, and Christian Viard-Gaudin

Abstract—In this paper, we open new perspectives for mathematical expression recognition by introducing an original bimodal system. Since handwritten mathematical expression recognition is a very challenging task prone to many ambiguities, we use speech as an additional modality to circumvent limitations that are inherent to the written form. A use-case scenario corresponds to lectures given in classrooms, where the teacher would write and read aloud any mathematical expression to allow a better interpretation. In addition to state-of-the-art solutions for recognizing handwriting and speech, we introduce a multilayer architecture for the merger of the modalities. Specifically, the Dempster–Shafer theory is used to process the information at the symbol level. This bimodal system is evaluated on real bimodal data, the HAMEX dataset. Large improvements are observed when speech and handwriting are combined, compared with the single handwriting modality.

Index Terms—Belief functions, handwriting, information alignment, mathematical expression (ME), multilevel data fusion, speech.

I. INTRODUCTION

NOWADAYS, most electronic devices are audio enabled, integrate digital surfaces for pen-based input, and possess sufficient computing power for multimedia applications. This technological progress pushes forward the frontiers of human–computer interaction [1]–[4]. This is particularly true given that multimodality enables disambiguation during human–human interaction [5], [6].
For instance, during a lecture, a teacher generally exploits both the speech and handwriting modalities to explain a phenomenon as clearly as possible. The two modalities have complementary properties, and their combination is clearly beneficial for communication.

In the literature, some works dealing with the combination of the speech and handwriting modalities already exist for various applications [7], [8]. For instance, in [9], the combination of speech and handwriting is exploited for user authentication. Kaiser [4] proposed SHACER, a Speech and HAndwriting reCognizER used for meeting-recording recognition purposes. In [10], information coming from speech and handwriting is fused to label elements of a whiteboard chart. The multimodal interaction exploiting speech and handwriting in the classroom context is studied in [11]. Nevertheless, no existing work deals with automatic recognition, using both handwriting and speech, of graphical languages such as diagrams, architectural plans, chemical formulas, or mathematical expressions (MEs).

Manuscript received January 15, 2016; revised July 18, 2016 and October 4, 2016; accepted November 30, 2016. This paper was recommended by Guest Editor G. Pirlo.

S. Medjkoune is with the Department of Automatic and Computer Science, Ecole des Mines de Douai, 59500 Douai, France (e-mail: smedjkoune@gmail.com).

H. Mouchère and C. Viard-Gaudin are with the UBL, Université de Nantes, IRCCyN UMR CNRS 6597, 44035 Nantes, France (e-mail: harold.mouchere@univ-nantes.fr; christian.viard-gaudin@univ-nantes.fr).

S. Petitrenaud is with the UBL, Université du Maine, LIUM-EA 4023, 72085 Le Mans, France (e-mail: simon.petit-renaud@lium.univ-lemans.fr).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/THMS.2017.2647850

Fig. 1. (a) Handwritten and (b) audio signals of MEs.
In this paper, we address the combination of speech and handwriting streams in the particular case of ME recognition. This study considers online handwriting captured with touch screens, whiteboards, and electronic pens [cf. Fig. 1(a)], together with a speech signal containing the corresponding information. The same ME is assumed to be available within the constraints of each modality. Recognizing MEs is a very difficult problem for several reasons. First of all, the mathematical language is composed of a large set of symbols; several hundred symbols are required, and many of them look similar. Second, the mathematical language is not a one-dimensional (1-D) language. Indeed, the spatial relationships between elements play an important role in the meaning of the expression. The extraction of the layout can be even more difficult from the audio signal, since the verbal expression of spatial relationships can be nontrivial. As Fig. 2 shows, speech- and handwriting-based systems do not have to face the same difficulties, so an error committed by one of the systems may be corrected by the other. Therefore, improved performance can be expected from a combined speech–handwriting system for mathematical notation recognition.

The main outcome of this work is a multilayer architecture that allows mixing speech and handwriting at different levels to recognize MEs. We detail various strategies, and their contributions, for the fusion of the modalities at the symbol, relational, and interpretation levels. Experimental results show the effectiveness of this global system on the original HAMEX (Handwritten and Audio dataset of Mathematical EXpressions) dataset.

The rest of this paper is divided into eight sections. Section II provides a description of the difficulties in

2168-2291 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
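To make the symbol-level fusion concrete, the sketch below applies Dempster's rule of combination to two mass functions over a small frame of symbol hypotheses, one mass function standing in for a handwriting recognizer and one for a speech recognizer. This is a minimal illustration of the belief-function machinery, not the paper's actual implementation: the frame of discernment, the mass values, and the function name `dempster_combine` are all illustrative assumptions.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function is a dict mapping a focal element
    (a frozenset of hypotheses) to its mass; masses sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    # Renormalize by the non-conflicting mass (1 - K)
    return {a: m / (1.0 - conflict) for a, m in combined.items()}

# Hypothetical ambiguity: handwriting confuses the letter 'x' with
# the multiplication sign, while speech hesitates between "ex" and
# "plus"; mass on the whole frame models each recognizer's doubt.
theta = frozenset({'x', 'times', 'plus'})
m_hw = {frozenset({'x'}): 0.5, frozenset({'times'}): 0.3, theta: 0.2}
m_sp = {frozenset({'x'}): 0.6, frozenset({'plus'}): 0.1, theta: 0.3}

m = dempster_combine(m_hw, m_sp)
# The two sources reinforce 'x': its combined mass rises to about 0.77.
```

In this toy example the agreement of the two modalities on 'x' outweighs their individual ambiguities, which is exactly the disambiguation effect the bimodal system aims for.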