Caption Generator using CNN and LSTM

Siddhant Patil, Atharva Gentyal, Kashif Khan, Umar Khan
Department of Artificial Intelligence and Data Science, Narsee Monjee Institute of Management Studies, Mumbai, India
{siddhant.patil154, atharva.gentyal155, umar.khan40, kashif.khan009}@nmims.in

Dr. Sakshi Indolia, Prof. Aditya Kasar
Assistant Professor, Narsee Monjee Institute of Management Studies, Mumbai, India
{sakshi.indolia, aditya.kasar}@nmims.in

Abstract—This paper introduces a deep learning-based Image Caption Generator that combines the strengths of convolutional neural networks (CNN) and long short-term memory (LSTM) networks to produce contextually relevant, descriptive captions for images. The model uses a pre-trained CNN, such as VGG16, to extract high-level visual features, which are fed to an LSTM-based language model that generates natural language descriptions. By coupling these two models, our system bridges the gap between computer vision and natural language processing. The model is trained and evaluated on a standard benchmark dataset, Flickr8k, to ensure robustness and generalization. Experimental results show promising caption accuracy, with generated captions capturing object relationships, scene information, and image semantics. We also address key challenges in automatic image captioning, such as handling variation in image content, generating grammatically correct sentences, and improving caption diversity. To boost performance, we investigate techniques including transfer learning, attention-based methods, and hyperparameter tuning. We evaluate our model with standard metrics including BLEU, METEOR, and CIDEr to measure caption relevance and fluency. The results advance vision-language modeling, with applications in assistive technology for the visually impaired, content-based image retrieval, and human-computer interaction.
I. INTRODUCTION

Automatic image captioning is a complex task at the intersection of natural language processing and computer vision. It involves generating meaningful textual descriptions for images, allowing computers to interpret and communicate what is in an image using natural language. This capability has significant implications for a range of applications, from assistive technologies for visually impaired individuals to enhancing human-computer interaction and improving content-based image retrieval [8] [16] [2].

The challenge of image captioning lies in its need for multimodal understanding: machines must not only identify objects, scenes, and actions within an image but also grasp their relationships, context, and connections. In addition, crafting textual descriptions requires a strong command of linguistic structure, so that the resulting sentences are grammatically correct, semantically rich, and contextually appropriate [14] [10] [4].

In recent years, deep learning techniques have driven rapid progress in automatic image captioning. One of the best-known approaches combines Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM) networks, and it has proved highly effective [17] [9] [3]. CNNs excel at extracting features from images, making them well suited to processing visual information. LSTMs, a specific type of recurrent neural network (RNN), excel at handling sequences, allowing them to generate coherent, relevant captions from the features extracted from the images [1] [20] [5].

This research explores the use of the VGG16 model as a feature extractor, combined with an LSTM-based language model, to generate descriptive captions for images. The study focuses on the Flickr8k dataset, aiming to enhance the accuracy, contextual relevance, and fluency of the generated captions through deep learning techniques [7] [6] [15].
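The inference loop of such a pipeline can be sketched in a few lines. The sketch below is a simplified illustration, not the paper's implementation: the vocabulary is a hypothetical toy example, and `predict_next_word` is a scripted stand-in for the trained LSTM decoder (which in the real system would be conditioned on a 4096-dimensional VGG16 feature vector). What it shows is the greedy decoding idiom itself: start from a `startseq` token, repeatedly pick the most probable next word, and stop at `endseq` or a length limit.

```python
import numpy as np

# Hypothetical toy vocabulary; a real system builds this from the
# Flickr8k caption corpus.
vocab = ["<pad>", "startseq", "endseq", "a", "dog", "plays", "in", "the", "park"]
word_to_id = {w: i for i, w in enumerate(vocab)}
id_to_word = {i: w for w, i in word_to_id.items()}

def predict_next_word(image_features, partial_caption_ids):
    """Stand-in for the trained CNN-LSTM decoder: given an image feature
    vector and the caption so far, return a probability distribution over
    the vocabulary. Here a fixed word sequence is returned so the greedy
    loop below is runnable end to end; the real model would be an LSTM
    conditioned on VGG16 features."""
    scripted = ["a", "dog", "plays", "in", "the", "park", "endseq"]
    step = len(partial_caption_ids) - 1  # exclude the startseq token
    probs = np.zeros(len(vocab))
    probs[word_to_id[scripted[min(step, len(scripted) - 1)]]] = 1.0
    return probs

def generate_caption(image_features, max_len=10):
    """Greedy decoding: repeatedly append the most probable next word
    until the end token or the length limit is reached."""
    ids = [word_to_id["startseq"]]
    for _ in range(max_len):
        probs = predict_next_word(image_features, ids)
        next_id = int(np.argmax(probs))
        ids.append(next_id)
        if id_to_word[next_id] == "endseq":
            break
    # strip the start/end tokens from the final caption
    words = [id_to_word[i] for i in ids[1:]]
    if words and words[-1] == "endseq":
        words = words[:-1]
    return " ".join(words)

features = np.zeros(4096)  # placeholder for a VGG16 fc2 feature vector
print(generate_caption(features))  # -> "a dog plays in the park"
```

In practice the same loop is used with the trained model; beam search is a common drop-in replacement for the greedy `argmax` step when more diverse candidate captions are wanted.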
We tackle several fundamental challenges, such as understanding complex image semantics, reducing repetitive captions, and enhancing sentence diversity. This work contributes to the ongoing advances in vision-language modeling [13] [18] [11] [21] [19] [12].

The major contributions of our work are:

1) Improved Caption Accuracy: Traditional image captioning struggles with accuracy due to its reliance on basic object detection. The CNN-LSTM model improves captions by extracting deep visual features with CNNs, learning sequence dependencies with LSTMs, and reducing errors through large-scale training. This results in more contextually rich descriptions [8] [16] [2].

2) Enhanced Contextual Understanding: Unlike traditional captioning methods that identify individual objects, the CNN-LSTM model captures spatial relationships and global context. Instead of simply listing objects such as "Dog, ball", it generates descriptive, human-like captions such as "A dog is playing with a ball in the park", making captions more context-sensitive and meaningful [14] [10] [4].

3) Scalability and Adaptability: The CNN-LSTM model is highly scalable and adaptable across various domains. It can be trained on diverse datasets such as Flickr8k and MS COCO for general captioning and customized for specific applications, including medical imaging for di-