A study on the techniques for speech-to-speech translation
Shrishti Sandeep Gupta¹, Vaishali Ramakant Shirodkar²
¹Student, Information Technology Department, Goa College of Engineering, Farmagudi – Goa.
²Assistant Professor, Information Technology Department, Goa College of Engineering, Farmagudi – Goa.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - As globalization advances, language barriers continue to obstruct free communication. Speech translation aims to convert speech in one language into speech or text in another language. This technology helps overcome communication barriers between people who speak different languages and can open access to digital content across languages. Central applications of automatic speech translation include translating material such as presentations, lectures, and broadcast news. However, direct speech-to-speech translation is a very complicated task that involves recognizing and automatically translating speech in real time. This paper outlines some of the work done in this field.
Key Words: Speech-to-Speech, Translation, Transformer,
Encoder, Decoder, Attention.
1. INTRODUCTION
Speech translation refers to the task of converting a spoken utterance in a source language into speech or text in the target language. These systems are typically categorized into cascade and end-to-end systems. The traditional speech translation system follows a step-by-step process that can be broken down into three components:
● automatic speech recognition (ASR)
● text-to-text machine translation (MT)
● text-to-speech (TTS) synthesis.
The ASR, MT, and TTS systems are trained and tuned independently. ASR transcribes the speech into text in the source language; MT then transforms this text into the corresponding text in the target language. Finally, TTS converts the target-language text into speech utterances.
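The three-stage cascade above can be sketched as a simple composition of functions. The `asr`, `mt`, and `tts` functions below are toy stand-ins (assumptions for illustration, not real components); the point is only the pipeline structure, in which each stage is built and tuned independently.

```python
def asr(audio_frames):
    """Toy ASR: pretend each input 'frame' decodes to one source-language word."""
    frame_to_word = {0: "hello", 1: "world"}
    return " ".join(frame_to_word[f] for f in audio_frames)

def mt(source_text):
    """Toy MT: word-by-word dictionary lookup into the target language."""
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(w, w) for w in source_text.split())

def tts(target_text):
    """Toy TTS: map each target word to a synthetic 'waveform' label."""
    return [f"wav:{w}" for w in target_text.split()]

def cascade_s2st(audio_frames):
    # Stage 1: source speech -> source text
    source_text = asr(audio_frames)
    # Stage 2: source text -> target text
    target_text = mt(source_text)
    # Stage 3: target text -> target speech
    return tts(target_text)

print(cascade_s2st([0, 1]))  # ['wav:hola', 'wav:mundo']
```

Note that errors propagate forward through the stages: a misrecognized word from ASR is translated and synthesized as-is, which is one motivation for the direct models discussed next.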
Researchers are now exploring direct speech-to-speech translation (S2ST) models that translate speech without relying on text generation as an intermediate step. Direct S2ST involves fewer decoding steps, so such systems have lower computational cost and lower inference latency.
2. ANALYSIS OF VARIOUS METHODS USED FOR
SPEECH TO SPEECH TRANSLATION
2.1 Cascaded speech translation model
This paper [1] implements a speech-to-speech translation robot in the domain of medical care that helps English-speaking patients describe their symptoms to Korean doctors or nurses. The system consists of three main parts: speech recognition, English-Korean translation, and Korean speech generation. English-Korean translation in this system is rule-based and consists of five main modules: tokenization, part-of-speech tagging, sentence-component grouping, Korean grammar application, and word-by-word translation.
It utilizes CMU Sphinx-4, an open-source Java speech recognition library, as the speech recognition tool. Once recognition succeeds, the transcribed text is passed to the translation system. The translation algorithm then divides a sentence into basic sentence components, such as subject, verb, object, and prepositional phrase, and rearranges the parsed components by applying the syntactic rules of Korean.
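The reordering step can be illustrated with a minimal sketch (not the paper's implementation): English follows subject-verb-object (SVO) order while Korean follows subject-object-verb (SOV), so the grouped components are rearranged before word-by-word translation. The component labels and example sentence below are assumptions for illustration.

```python
def reorder_svo_to_sov(components):
    """components: list of (role, phrase) pairs in English (SVO) order.
    Returns them rearranged into a Korean-style (SOV) order, with the
    verb moved to the end of the sentence."""
    order = {"subject": 0, "prep_phrase": 1, "object": 2, "verb": 3}
    return sorted(components, key=lambda c: order.get(c[0], 99))

# Hypothetical parse of a patient's sentence, already grouped into components.
parsed = [
    ("subject", "I"),
    ("verb", "feel"),
    ("object", "a sharp pain"),
    ("prep_phrase", "in my chest"),
]
print(reorder_svo_to_sov(parsed))
```

After reordering, each component's words would be translated individually by the word-by-word translation module.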
As a last step, the DARwIn-OP robot speaks the translated sentence in Korean. As no appropriate Korean TTS program could be applied to this system, pre-recorded MP3 files were used: each word is matched to a Korean voice recording by looking it up in a hash table.
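The playback lookup is essentially a hash-table mapping from words to recordings, which in Python reduces to a dict. The words and filenames below are made-up placeholders, not the system's actual vocabulary.

```python
# Hash table mapping each Korean word to a pre-recorded MP3 file.
recordings = {
    "머리": "head.mp3",
    "아프다": "hurts.mp3",
}

def lookup_recordings(words):
    """Return the recording filename for each word; None if no recording exists."""
    return [recordings.get(w) for w in words]

print(lookup_recordings(["머리", "아프다"]))  # ['head.mp3', 'hurts.mp3']
```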
2.2 Listen, Attend, and Spell (LAS) model
Chiu et al. [2] present the Listen, Attend, and Spell (LAS) model for direct speech-to-speech translation. The LAS model is a single neural network that includes an attention-based encoder-decoder and consists of three modules. The encoder takes the input features, x, and maps them to a higher-level feature representation, h^enc. The output of the encoder is passed to an attender, which determines which encoder features in h^enc should be attended to in order to predict the next output symbol, y_i. The output of the attention module is passed to the decoder, which takes the attention context c_i, generated by the attender, and an embedding of the previous prediction, y_{i-1}, to produce a probability distribution P(y_i | y_{i-1}, ..., y_0, x) over the next output unit, given the previous units {y_{i-1}, ..., y_0} and the input, x.
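The attender step above can be sketched numerically. The snippet below uses simple dot-product attention (an assumption for illustration; the exact scoring function in [2] may differ): the decoder state acts as a query against the encoder features h^enc, and the softmax-weighted average of those features gives the context vector c_i.

```python
import numpy as np

def attend(h_enc, decoder_state):
    """h_enc: (T, d) encoder feature matrix; decoder_state: (d,) query vector.
    Returns the attention weights over time steps and the context vector c_i."""
    scores = h_enc @ decoder_state             # (T,) alignment scores
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ h_enc                  # (d,) context vector c_i
    return weights, context

# Toy encoder output with T=3 time steps and d=2 features (made-up numbers).
h_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
query = np.array([2.0, 0.0])  # hypothetical decoder state at step i

weights, c_i = attend(h_enc, query)
print(weights.round(3), c_i.round(3))
```

The context c_i is then fed to the decoder together with the embedding of y_{i-1} to produce the distribution over the next output symbol.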
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072