Bulletin of Electrical Engineering and Informatics
Vol. 14, No. 4, August 2025, pp. 2861-2870
ISSN: 2302-9285, DOI: 10.11591/eei.v14i4.8885

Exploring deep learning approaches for image captioning to mimic human understanding

Maheen Islam 1, Mahedi Hassan Ratul 1, Rezaul Haque 1, Sazzad Hossain Rony 1, Azharul Huq Asif 1, Tanni Mittra 1, Md Miskat Hossain 1, Mahamudul Hasan 2
1 Department of Computer Science and Engineering, Faculty of Science and Engineering, East West University, Dhaka, Bangladesh
2 Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, United States

Article Info

Article history:
Received Jun 21, 2024
Revised Feb 9, 2025
Accepted Mar 9, 2025

Keywords:
Caption generation
Context dataset
Deep learning
Image captioning
Image encoding
Microsoft common objects in context

ABSTRACT

Image captioning has emerged as a vital research area in computer vision, aiming to enhance how humans interact with visual content. While progress has been made, challenges such as improving caption diversity and accuracy remain. This study proposes transfer learning models and RNN algorithms trained on the Microsoft common objects in context (MS COCO) dataset to improve image captioning quality. The models combine image and text features, pairing ResNet50, VGG16, and InceptionV3 encoders with LSTM and BiLSTM decoders. Performance is measured using the BLEU, ROUGE, and METEOR metrics under both greedy and beam search decoding. The InceptionV3+BiLSTM model outperformed the others, achieving a BLEU score above 60%, a METEOR score of 28.6%, and a ROUGE score of 57.2%. This research contributes a simple yet effective image captioning model that provides accurate descriptions with human-like understanding. An error analysis was conducted to guide further improvement, and ongoing research aimed at enhancing the diversity, fluency, and accuracy of generated captions is discussed, with significant implications for improving the accessibility and searchability of visual media and for informing future work in this area.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Mahamudul Hasan
Department of Computer Science and Engineering, University of Minnesota Twin Cities
200 Union St SE, Minneapolis, 55455, Minnesota, United States
Email: munna09bd@gmail.com

1. INTRODUCTION

Deep learning has revolutionized object detection, with methods such as convolutional neural networks (CNNs) and region-based CNNs, including Fast R-CNN, Faster R-CNN, and YOLO, becoming the primary tools [1]. Transfer learning has further reduced training data requirements by leveraging models pre-trained on large datasets [2]. Image captioning generates textual descriptions of images and complements object detection, offering a more comprehensive understanding of image content. Applications include making visual content accessible to visually impaired people, improving image searchability, enhancing content understanding, and enabling automated image tagging [3], [4].

Recent studies in image captioning have focused on improving caption accuracy, fluency, and diversity by incorporating attention mechanisms and additional information. Attention-based models, such as those in [5], focus on critical image regions, resulting in improved performance on datasets like COCO. For example, [5] achieved a BLEU-4 score of 50.4 using a hard-attention model. Similarly, [6] introduced SCA-CNN, which applies spatial and channel-wise attention.
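To make the encoder-decoder design concrete, the sketch below shows a minimal merge-style model in the spirit of the InceptionV3+BiLSTM combination described in the abstract. It is not the authors' implementation; it assumes Keras/TensorFlow, and vocab_size, max_len, and embed_dim are illustrative placeholders rather than values from the paper.

```python
# A minimal sketch of a merge-style CNN encoder + BiLSTM decoder,
# in the spirit of the InceptionV3+BiLSTM model described above.
# vocab_size, max_len, and embed_dim are hypothetical placeholders.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

vocab_size, max_len, embed_dim = 10000, 34, 256  # illustrative values

# Image encoder: ImageNet-pretrained InceptionV3, frozen (transfer learning),
# global-average-pooled to one 2048-d feature vector per image.
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False

img_in = layers.Input(shape=(299, 299, 3))  # assumes preprocessed input
img_feat = layers.Dense(embed_dim, activation="relu")(cnn(img_in))

# Text decoder: the partial caption so far, embedded and run through a BiLSTM.
txt_in = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(txt_in)
txt_feat = layers.Bidirectional(layers.LSTM(128))(emb)  # 2*128 matches embed_dim

# Merge the two modalities and predict the next word of the caption.
merged = layers.add([img_feat, txt_feat])
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Freezing the convolutional weights is what makes this a transfer learning setup: only the embedding, BiLSTM, and dense layers are trained on the captioning data, which is why far less training data is needed than when training a CNN from scratch.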
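The abstract also reports BLEU, METEOR, and ROUGE scores under greedy decoding. As a hedged illustration only, the following sketch generates a caption greedily from a trained model of the kind above and scores outputs with NLTK's corpus BLEU; model, tokenizer, and photo are hypothetical stand-ins (photo is assumed to already carry a batch dimension).

```python
# Illustrative greedy decoding loop and BLEU scoring; `model`, `tokenizer`,
# and `photo` are hypothetical stand-ins for a trained captioning pipeline.
import numpy as np
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, photo, max_len):
    """Generate a caption word by word, always taking the argmax token."""
    text = "startseq"  # assumed start-of-caption marker in the vocabulary
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":  # assumed end marker
            break
        text += " " + word
    return text.replace("startseq", "").strip()

# BLEU compares each generated caption against its reference captions.
references = [[["a", "dog", "runs", "on", "grass"]]]  # refs for one image
candidates = [["a", "dog", "is", "running", "on", "the", "grass"]]
print(corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))  # BLEU-2
```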
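Beam search, the other decoding strategy evaluated, trades speed for quality by keeping the k highest-scoring partial captions at each step instead of a single greedy hypothesis. A simplified sketch under the same assumptions follows; a complete version would also retire hypotheses once they emit the end token.

```python
# Hedged sketch of beam search decoding over the same hypothetical model:
# keep the k best partial captions at every step, summing log-probabilities
# for numerical stability.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_caption(model, tokenizer, photo, max_len, k=3):
    start = tokenizer.texts_to_sequences(["startseq"])[0]
    beams = [(start, 0.0)]  # (token ids, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            padded = pad_sequences([seq], maxlen=max_len)
            probs = model.predict([photo, padded], verbose=0)[0]
            for idx in np.argsort(probs)[-k:]:  # top-k next tokens
                candidates.append((seq + [int(idx)], score + np.log(probs[idx])))
        beams = sorted(candidates, key=lambda c: c[1])[-k:]  # keep best k
    words = [tokenizer.index_word.get(i, "") for i in beams[-1][0]]
    return " ".join(w for w in words if w not in ("startseq", "endseq"))
```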