Transformation of Visual Information into Bangla Textual Representation

Nafisa Nawer, Md. Shakiful Islam Khan, MD. Mustakin Alam, Md Humaion Kabir Mehedi, and Annajiat Alim Rasel
Department of Computer Science and Engineering, Brac University
66 Mohakhali, Dhaka - 1212, Bangladesh
{nafisa.nawer, md.shakiful.islam.khan, md.mustakin.alam, humaion.kabir.mehedi}@g.bracu.ac.bd, annajiat@gmail.com

Abstract—In the past several years, research interest in generating human-like descriptions of scenes by detecting and analyzing their components has increased tremendously. Even though a significant amount of research has gone into automating the conversion of visual information into written representation, low-resource languages such as Bangla remain largely neglected due to a lack of standard datasets. To address this issue, we introduce a new dataset named “Biboron”, for which we manually collected Bangla descriptions of images taken from the widely available Flickr30k dataset; the descriptions were then post-processed and examined for quality assurance. “Biboron” contains 158,915 distinct sentences describing 31,783 images, which reflects the versatile nature of the dataset. Furthermore, we present two models for automatically extracting visual information from images and representing it in Bangla. The first model uses Local Attention, whilst the second is based on Multi-Head Attention with Transformers. The image feature extractor of both models uses VGG16, while a bidirectional LSTM backed by CuDNN is used in the decoder network. The BLEU scores suggest that the second model outperforms the first in generating relevant textual representations from images, achieving BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores of 0.78, 0.53, 0.37 and 0.21, respectively.

Index Terms—Bangla, Local Attention, Multi-Head Attention, VGG16, CuDNNLSTM, BLEU score.
I. INTRODUCTION

The ability to automatically describe the events displayed in an image has arguably been one of the most persistently challenging aspects of image comprehension in recent years. One of the underlying goals of computational visual tasks is to replicate the remarkable capability of the human brain to recognize, process and understand visual data with exceptional speed and precision. Extracting the visual information in images to describe what they portray combines Artificial Intelligence, Computer Vision and Natural Language Processing, since the task depends on object identification, inference of the relationships among objects, and the creation of a relevant interpretation as a string of words. In addition, it must entail comprehending both the syntactic and semantic significance of the visuals.

Fig. 1: A glimpse of the proposed dataset containing Bangla descriptions of images. (For those who do not speak Bangla, translations into English have been included.)

Transforming the visual information of images into textual form is necessary for applications such as sentence-based image search, since describing visuals is more difficult for machines than for humans. Apart from that, it has many uses in the field of Human-Computer Interface, for instance in search engines, which retrieve information through concept-based image indexing and show pertinent search results to the user. Additionally, automated description generation aids visually challenged individuals in comprehending their environment by translating the descriptions into audio. Owing to the availability of large-scale image-sentence pair datasets such as Flickr8k [1], Flickr30k [2] and MS COCO [3], all in English, the automatic creation of image descriptions has been a rather active topic of research.
In addition, several multilingual datasets are publicly available for other languages; for example, the IAPR TC-12 dataset consists of 20,000 images, each of which is associated with descriptions in English, German and Spanish. Yet Bangla [4], being the fifth largest language in the world with
979-8-3503-3286-5/23/$31.00 ©2023 IEEE