Computer Speech and Language 82 (2023) 101524
Available online 12 May 2023
0885-2308/© 2023 Elsevier Ltd. All rights reserved.
English–Assamese neural machine translation using prior alignment
and pre-trained language model
Sahinur Rahman Laskar (a), Bishwaraj Paul (a), Pankaj Dadure (b), Riyanka Manna (c), Partha Pakray (a,∗), Sivaji Bandyopadhyay (a)
(a) Department of Computer Science and Engineering, National Institute of Technology, Silchar, 788010, Assam, India
(b) School of Computer Science, University of Petroleum and Energy Studies, Dehradun, 248007, Uttarakhand, India
(c) Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, West Bengal, India
ARTICLE INFO
Keywords: Low-resource, NMT, English–Assamese, Alignment, Language model
ABSTRACT
In a multilingual country like India, automatic natural language translation plays a key role in connecting communities of people who speak different languages. Many researchers have explored and improved the translation process for high-resource languages such as English and German, achieving state-of-the-art results. However, the unavailability of adequate data remains the prime obstacle to automatic translation of low-resource north-eastern Indian languages such as Mizo, Khasi, and Assamese. Although the recent past has witnessed a surge of automatic translation systems for low-resource languages, their low evaluation scores indicate considerable scope for improvement. Neural machine translation has significantly improved translation quality in recent years, largely owing to the availability of huge amounts of data. Consequently, neural machine translation remains underexplored for low-resource languages, where such data is unavailable. In this work, we address the low-resource English–Assamese pair using transformer-based neural machine translation that leverages prior alignment and a pre-trained language model. To extract alignment information from source–target sentence pairs, we use a pre-trained multilingual contextual-embedding-based alignment technique. We also build a transformer-based language model from monolingual target-side sentences. With both prior alignment and the pre-trained language model, the transformer-based neural machine translation model shows improvement, and we achieve state-of-the-art results for both English-to-Assamese and Assamese-to-English translation.
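The abstract describes extracting word alignments from source–target pairs via pre-trained multilingual contextual embeddings. The paper's exact alignment tool is not specified in this excerpt; a common realization of this idea scores every source–target token pair by the cosine similarity of their contextual embeddings and keeps the mutually most-similar pairs. The sketch below illustrates that mutual-argmax step in pure Python, with small hand-picked vectors standing in for real mBERT-style embeddings (the vectors and sentence lengths are hypothetical, chosen only to make the example self-contained):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mutual_argmax_alignment(src_emb, tgt_emb):
    """Return alignment links (i, j) where source token i and target
    token j are each other's most similar token under cosine similarity."""
    sim = [[cosine(s, t) for t in tgt_emb] for s in src_emb]
    links = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)
        # Keep the link only if i is also the best source token for j.
        if max(range(len(sim)), key=lambda k: sim[k][j]) == i:
            links.append((i, j))
    return links

# Toy 2-d embeddings standing in for contextual-model outputs
# (hypothetical values; real embeddings would be e.g. 768-dimensional).
src = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.9]]
tgt = [[0.1, 1.0], [1.0, 0.0], [0.8, 1.0]]
print(mutual_argmax_alignment(src, tgt))  # → [(0, 1), (1, 0), (2, 2)]
```

In a real pipeline, the links produced this way for each training pair would be supplied to the NMT model as the prior alignment signal; the mutual-argmax filter trades recall for precision, which is usually the safer side for a supervision signal.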
1. Introduction
Machine translation (MT) is a popular task in natural language processing (NLP) and has come into the limelight over the last few decades. MT aims to translate text automatically from one natural language to another. MT approaches fall into two broad categories: rule-based and corpus-based. The rule-based approach uses grammatical and linguistic rules for a particular language pair to generate target translations (Saini and Sahula, 2015). In contrast, the corpus-based approaches, namely example-based machine translation (Somers, 1999), statistical machine translation (SMT) (Koehn et al., 2003), and neural machine translation (NMT) (Bahdanau et al., 2015; Luong et al., 2015), eliminate the need to construct linguistic rules and to rely on linguistic experts.
∗ Corresponding author.
E-mail address: partha@cse.nits.ac.in (P. Pakray).
https://doi.org/10.1016/j.csl.2023.101524
Received 5 July 2022; Received in revised form 6 March 2023; Accepted 3 May 2023