Transformation of Visual Information into Bangla
Textual Representation
Nafisa Nawer, Md. Shakiful Islam Khan, MD. Mustakin Alam, Md Humaion Kabir Mehedi, and
Annajiat Alim Rasel
Department of Computer Science and Engineering
Brac University
66 Mohakhali, Dhaka - 1212, Bangladesh
{nafisa.nawer, md.shakiful.islam.khan, md.mustakin.alam, humaion.kabir.mehedi}@g.bracu.ac.bd
annajiat@gmail.com
Abstract—In the past several years, interest in research on
generating human-like descriptions of scenarios by detecting and
analyzing their components has increased tremendously.
Even though a significant amount of research has gone
into automating the conversion of visual information
into written representation, low-resource languages such as
Bangla remain underexplored due to a lack of standard
datasets. To address
this issue, we introduce a new dataset named “Biboron”,
for which we manually collected Bangla descriptions of images
taken from the widely available Flickr30k dataset; the descriptions
were then post-processed and examined for quality assurance.
“Biboron” contains 1,58,915 distinct sentences describing 31,783
images, which reflects the versatile nature of the dataset.
Furthermore, we present two models for the automated
extraction of visual information from images and
its representation in Bangla. The first model includes Local Attention,
whilst the second model is based on Multi-Head Attention with
Transformers. The image feature extractor of both models uses
VGG16, while a CuDNN-backed bidirectional LSTM is used
in the decoder network. The BLEU scores indicate that the second
model outperforms the first one in generating
relevant textual representations from images, achieving
BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.78, 0.53, 0.37, and
0.21, respectively.
Index Terms—Bangla, Local Attention, Multi-Head Attention,
VGG16, CuDNNLSTM, BLEU score.
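To make the reported BLEU-1 to BLEU-4 figures concrete, the following is a minimal sketch of how a cumulative BLEU-n score can be computed for one generated caption against several reference captions. It assumes uniform n-gram weights, no smoothing, and the standard brevity penalty; the exact BLEU configuration behind the reported scores is not stated in this section, so the function names (ngrams, modified_precision, bleu) and the example tokens are illustrative assumptions only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against all references."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, references, max_n):
    """Cumulative BLEU-max_n with uniform weights and no smoothing (assumption)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    c = len(candidate)
    # Reference length closest to the candidate length (ties: shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical tokenized captions, used only for illustration.
candidate = "a dog runs on the grass".split()
references = ["a dog runs across the grass".split(),
              "the dog is running on grass".split()]
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(candidate, references, n):.2f}")
```

Because each higher-order score multiplies in a stricter n-gram precision, cumulative scores typically fall from BLEU-1 to BLEU-4, which matches the downward trend of the values reported in the abstract.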
I. INTRODUCTION
Automatically describing the events displayed
in an image has arguably been one of the most persistently
challenging problems in image comprehension over
the past few years. One of the underlying goals
of computational visual tasks is to replicate the remarkable
capability of the human brain to recognize, process and
understand visual data with exceptional speed and precision.
Extracting the visual information in images to
describe what they portray lies at the intersection of Artificial Intelligence,
Computer Vision and Natural Language Processing, since the
task depends upon object identification, the inference of their
relationships, and the generation of a relevant interpretation
as a sequence of words. In addition, it entails comprehending
both the syntactic and semantic significance of the
visuals.
Fig. 1: A glimpse of the proposed dataset containing Bangla
descriptions of images. (For those who do not speak Bangla,
translations into English have been included.)
Transforming the visual information of images into textual form
is a necessity for applications such as sentence-based image
search, since describing visuals is more difficult for machines
than for humans. Apart from that, it has a substantial
number of uses in the field of Human-Computer Interface,
for instance, in search engines that retrieve information through
concept-based image indexing and show pertinent search
results to the user. Additionally, automated description
generation helps visually impaired individuals comprehend
their environment when the descriptions are converted into audio.
Due to the availability of large-scale image-sentence pair
datasets such as Flickr8k [1], Flickr30k [2], and MS COCO
[3], all in the English language, the automatic generation of image
descriptions has been a rather active topic of research. In addition,
several multilingual datasets are publicly available
for other languages; for example, the IAPR TC-12 dataset
consists of 20,000 images, each of which is associated with
descriptions in English, German and Spanish. Yet,
Bangla [4], being the fifth largest language in the world with
979-8-3503-3286-5/23/$31.00 ©2023 IEEE