www.editada.org
International Journal of Combinatorial Optimization Problems and
Informatics, 13(2), May-Aug 2022, 65–75. ISSN: 2007-1558
© Editorial Académica Dragón Azteca S. de R.L. de C.V. (EDITADA.ORG), All rights reserved.
English mispronunciation detection module using a Transformer network integrated
into a chatbot
Marcos E. Martinez-Quezada, J. Patricia Sánchez-Solís, Gilberto Rivera, Rogelio Florencia*, Francisco
López-Orozco
División Multidisciplinaria de Ciudad Universitaria / Universidad Autónoma de Ciudad Juárez, Chih 32500,
Mexico
*Correspondence: rogelio.florencia@uacj.mx
Abstract. Today, companies need up-to-date information
to be more competitive in the business world. There are
applications based on speech recognition that allow access
to data stored in databases. However, the proper functioning
of these applications depends on good pronunciation, a skill
that most people do not have.
In this paper, the architecture of an English
mispronunciation detection module integrated into a
chatbot is proposed. It allows users to submit audio of the
phrases whose pronunciation they want to evaluate.
The output is the list of mispronounced words, thus helping
the user to practice their English pronunciation. The
proposed architecture consists of an Automatic Speech
Recognizer (ASR) model based on a Transformer network
that converts the audio signal to text and an algorithm for
string alignment that identifies mispronounced words using
the Levenshtein distance. The Transformer network was
trained using the LibriSpeech and L2-ARCTIC datasets. The
module was evaluated using the Accuracy metric,
reaching 90%, and the Character Error Rate metric,
reaching 9.5%. Additionally, its performance was
evaluated with a group of real users, showing promising
results.
Keywords: Mispronunciation detection, Automatic speech
recognition, Transformer network.
Article Info
Received: August 31, 2021
Accepted: October 23, 2021
1 Introduction
Business Intelligence refers to the tools and strategies used in the processing, analysis, and visualization of data to support
decision-making in companies [1]. Access to up-to-date information in real time, whether stored on company servers or in the
cloud, could allow decision-makers to carry out business operations with greater certainty and obtain favorable dividends for
their companies. In this sense, applications based on speech recognition are intended to answer queries that users express in
natural language [2]. However, two issues arise. First, for these applications to perform well, the users' pronunciation is a
key element, and good pronunciation is a skill that most people do not have. Second, most of these applications have been
developed for English, which is the predominant language in this globalized world.
Pronunciation is often the most difficult skill to develop when learning a second language. Interaction with other people is key
to developing speaking skills. However, a learning partner is not always available, which may delay the improvement of this
skill [3]. Several tools, such as websites and apps, can help language learners. Chatbots are one such tool and have been well
received in second language learning [4]. Additionally, Automatic Speech Recognition (ASR) systems are often used for
mispronunciation detection [5]. Considering that frequent interaction with a chatbot could allow users to improve their
pronunciation skills, integrating chatbots with ASR systems could be useful for placing emphasis on pronunciation practice.
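
As a rough illustration of the string-alignment idea summarized in the abstract, the sketch below aligns the target phrase with the ASR transcript at the word level using the Levenshtein distance and reports target words that were substituted or omitted. It also computes a Character Error Rate as the character-level edit distance divided by the reference length. This is a minimal sketch under stated assumptions, not the module's actual implementation: the function names (align, mispronounced_words, cer) are hypothetical, the ASR output is assumed to be already available as plain text, and the lowercasing and whitespace tokenization are illustrative choices.

```python
from typing import List, Sequence, Tuple

# (operation, reference item, hypothesis item); hypothetical structure.
Op = Tuple[str, str, str]


def align(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, List[Op]]:
    """Levenshtein distance between two sequences plus one optimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion (item missing in hyp)
                dist[i][j - 1] + 1,         # insertion (extra item in hyp)
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Trace back from the bottom-right cell to recover the operations.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                name = "match" if cost == 0 else "substitute"
                ops.append((name, ref[i - 1], hyp[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


def mispronounced_words(target: str, recognized: str) -> List[str]:
    """Target words that the transcript replaced or dropped (assumed criterion)."""
    _, ops = align(target.lower().split(), recognized.lower().split())
    return [ref for op, ref, _ in ops if op in ("substitute", "delete")]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    distance, _ = align(reference, hypothesis)
    return distance / len(reference)


if __name__ == "__main__":
    # Hypothetical target phrase and ASR transcript.
    print(mispronounced_words("the quick brown fox", "the quick brawn fox"))
    print(cer("brown", "brawn"))
```

Words inserted by the recognizer are ignored here because they do not correspond to any word in the target phrase; a different policy could be adopted if extra words should also be reported.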