A study on the techniques for speech-to-speech translation
Shrishti Sandeep Gupta¹, Vaishali Ramakant Shirodkar²
¹Student, Information Technology Department, Goa College of Engineering, Farmagudi – Goa.
²Assistant Professor, Information Technology Department, Goa College of Engineering, Farmagudi – Goa.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - As globalization advances, language barriers continue to obstruct free communication. Speech translation aims to convert speech in one language into speech or text in another language. This technology helps overcome communication barriers between people who speak different languages and can open access to digital content across languages. Central applications of automatic speech translation include translating material such as presentations, lectures, and broadcast news. However, direct speech-to-speech translation is a very complicated task that involves recognizing and automatically translating speech in real time. This paper outlines some of the work done in this field.
Key Words: Speech-to-Speech, Translation, Transformer,
Encoder, Decoder, Attention.
1. INTRODUCTION
Speech translation refers to the task of converting a spoken utterance in a source language into speech or text in the target language. These systems are typically categorized into cascade and end-to-end systems. The traditional speech translation system follows a step-by-step process that can be broken down into three components:
● automatic speech recognition (ASR)
● text-to-text machine translation (MT)
● text-to-speech (TTS) synthesis.
The ASR, MT, and TTS systems are trained and tuned independently. ASR transcribes the speech into text in the source language; MT then transforms this text into the corresponding text in the target language. Finally, TTS converts the target-language text into speech utterances.
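The three-stage cascade above can be sketched as a simple composition of functions. The `asr`, `mt`, and `tts` functions below are toy stand-ins (assumptions for illustration, not real components); the point is only the pipeline structure, in which each stage is built and tuned independently.

```python
def asr(audio_frames):
    """Toy ASR: pretend each input 'frame' decodes to one source-language word."""
    frame_to_word = {0: "hello", 1: "world"}
    return " ".join(frame_to_word[f] for f in audio_frames)

def mt(source_text):
    """Toy MT: word-by-word dictionary lookup into the target language."""
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(w, w) for w in source_text.split())

def tts(target_text):
    """Toy TTS: map each target word to a synthetic 'waveform' label."""
    return [f"wav:{w}" for w in target_text.split()]

def cascade_s2st(audio_frames):
    # Stage 1: source speech -> source text
    source_text = asr(audio_frames)
    # Stage 2: source text -> target text
    target_text = mt(source_text)
    # Stage 3: target text -> target speech
    return tts(target_text)

print(cascade_s2st([0, 1]))  # ['wav:hola', 'wav:mundo']
```

Note that errors propagate forward through the stages: a misrecognized word from ASR is translated and synthesized as-is, which is one motivation for the direct models discussed next.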
Researchers are now exploring direct speech-to-speech translation (S2ST) models that translate speech without relying on text generation as an intermediate step. Direct S2ST involves fewer decoding steps, so such systems have lower computational cost and lower inference latency.
2. ANALYSIS OF VARIOUS METHODS USED FOR
SPEECH TO SPEECH TRANSLATION
2.1 Cascaded speech translation model
This paper [1] implements a speech-to-speech translation robot in the domain of medical care that helps English-speaking patients describe their symptoms to Korean doctors or nurses. The system consists of three main parts: speech recognition, English-Korean translation, and Korean speech generation. English-Korean translation in this system is rule-based and consists of five main modules: tokenization, part-of-speech tagging, sentence-component grouping, Korean grammar application, and word-by-word translation.
It utilizes CMU Sphinx-4, an open-source Java speech recognition library, as the speech recognition tool. Once recognition succeeds, the transcribed text is passed to the translation system. The translation algorithm then divides a sentence into basic sentence components, such as subject, verb, object, and prepositional phrase, and rearranges the parsed components by applying the syntactic rules of Korean.
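The reordering step can be illustrated with a minimal sketch (not the paper's implementation): English follows subject-verb-object (SVO) order while Korean follows subject-object-verb (SOV), so the grouped components are rearranged before word-by-word translation. The component labels and example sentence below are assumptions for illustration.

```python
def reorder_svo_to_sov(components):
    """components: list of (role, phrase) pairs in English (SVO) order.
    Returns them rearranged into a Korean-style (SOV) order, with the
    verb moved to the end of the sentence."""
    order = {"subject": 0, "prep_phrase": 1, "object": 2, "verb": 3}
    return sorted(components, key=lambda c: order.get(c[0], 99))

# Hypothetical parse of a patient's sentence, already grouped into components.
parsed = [
    ("subject", "I"),
    ("verb", "feel"),
    ("object", "a sharp pain"),
    ("prep_phrase", "in my chest"),
]
print(reorder_svo_to_sov(parsed))
```

After reordering, each component's words would be translated individually by the word-by-word translation module.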
As a last step, the DARwIn-OP robot speaks the translated sentence in Korean. As no appropriate Korean TTS program could be applied to this system, pre-recorded MP3 files were used: each word is matched to a Korean voice recording by looking it up in a hash table.
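The playback lookup is essentially a hash-table mapping from words to recordings, which in Python reduces to a dict. The words and filenames below are made-up placeholders, not the system's actual vocabulary.

```python
# Hash table mapping each Korean word to a pre-recorded MP3 file.
recordings = {
    "머리": "head.mp3",
    "아프다": "hurts.mp3",
}

def lookup_recordings(words):
    """Return the recording filename for each word; None if no recording exists."""
    return [recordings.get(w) for w in words]

print(lookup_recordings(["머리", "아프다"]))  # ['head.mp3', 'hurts.mp3']
```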
2.2 Listen, Attend, and Spell (LAS) model
Chiu et al. [2] present the Listen, Attend, and Spell (LAS) model for direct speech-to-speech translation. The LAS model is a single neural network that includes an attention-based encoder-decoder and consists of three modules. The encoder takes the input features, x, and maps them to a higher-level feature representation, h^enc. The output of the encoder is passed to an attender, which determines which encoder features in h^enc should be attended to in order to predict the next output symbol, y_i. The output of the attention module is passed to the decoder, which takes the attention context c_i, generated by the attender, and an embedding of the previous prediction, y_{i-1}, to produce a probability distribution P(y_i | y_{i-1}, ..., y_0, x) over the next output unit, given the previous units {y_{i-1}, ..., y_0} and the input, x.
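The attender step above can be sketched numerically. The snippet below uses simple dot-product attention (an assumption for illustration; the exact scoring function in [2] may differ): the decoder state acts as a query against the encoder features h^enc, and the softmax-weighted average of those features gives the context vector c_i.

```python
import numpy as np

def attend(h_enc, decoder_state):
    """h_enc: (T, d) encoder feature matrix; decoder_state: (d,) query vector.
    Returns the attention weights over time steps and the context vector c_i."""
    scores = h_enc @ decoder_state             # (T,) alignment scores
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ h_enc                  # (d,) context vector c_i
    return weights, context

# Toy encoder output with T=3 time steps and d=2 features (made-up numbers).
h_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
query = np.array([2.0, 0.0])  # hypothetical decoder state at step i

weights, c_i = attend(h_enc, query)
print(weights.round(3), c_i.round(3))
```

The context c_i is then fed to the decoder together with the embedding of y_{i-1} to produce the distribution over the next output symbol.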
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072