DI725 Project Phase III: Parameter-Efficient Fine-Tuning of PaliGemma for Image Captioning

İbrahim Ethem Deveci
Cognitive Sciences, METU Informatics Institute, Ankara, Turkey
ethem.deveci@metu.edu.tr

Abstract—This document presents the final report for the Transformers and Attention-Based Deep Networks course, centered on improving the image captioning capabilities of the PaliGemma vision-language model (VLM) through QLoRA-based fine-tuning on the RISC dataset. We report the outcomes of Phase III, which builds upon the baseline and Phase II results. In this phase, we applied two distinct hyperparameter configurations, each trained on the full training and validation sets. To systematically evaluate the fine-tuned models, we conducted a series of 12 experiments on the complete test set: two models tested across three prompt types and two inference configurations. This phase specifically investigates the efficacy of low-rank adaptation techniques in enhancing model performance under constrained fine-tuning conditions.

GitHub: GitHub repository of the project phase III.

Index Terms—parameter-efficient fine-tuning, quantized LoRA, image captioning, PaliGemma

I. INTRODUCTION

The main question this project aimed to answer was whether parameter-efficient fine-tuning methods, when paired with evaluation criteria that reflect the functional demands of image captioning, can enhance the performance of VLMs such as PaliGemma beyond baseline levels while maintaining minimal computational overhead. Given the large scale of PaliGemma, full fine-tuning is computationally expensive in practical environments. The project therefore focused on parameter-efficient fine-tuning strategies [1], [2] as a scalable alternative for adapting large VLMs to domain-specific objectives.
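To make the parameter-efficiency argument concrete, the following is a minimal numeric sketch of the low-rank update that underlies LoRA (and its quantized variant, QLoRA): the pretrained weight W stays frozen, and only the factors B and A of a rank-r update are trained. The dimensions and scaling factor below are illustrative assumptions, not the configuration used in this project.

```python
import numpy as np

# Sketch of the LoRA update: W' = W + (alpha / r) * B @ A,
# where only A (r x k) and B (d x r) are trainable.
d, k, r, alpha = 1024, 1024, 8, 16  # illustrative sizes, not the project's config

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection (zero-init)

# With B initialized to zero, the adapted weight equals W at the start
# of training, so fine-tuning begins from the pretrained behavior.
W_adapted = W + (alpha / r) * (B @ A)

full_params = d * k            # parameters of a full fine-tune of this layer
lora_params = r * (d + k)      # trainable parameters under LoRA
print(f"trainable fraction: {lora_params / full_params:.4f}")  # → 0.0156
```

At rank 8 for a 1024×1024 projection, only about 1.6% of the layer's parameters are trained, which is the property that makes fine-tuning PaliGemma-scale models feasible on constrained hardware.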
In Phase II, Quantized Low-Rank Adaptation (QLoRA) [2] was introduced as a parameter-efficient fine-tuning method applied exclusively to the language model component, with the vision tower and multi-modal projector parameters frozen. This approach yielded significant improvements, demonstrating QLoRA's potential for adapting large-scale vision-language models under hardware constraints.

Building upon these results, Phase III explores the efficacy of QLoRA fine-tuning under two distinct hyperparameter configurations. Each configuration was trained on the full training and validation sets to maximize data utilization. To evaluate the fine-tuned models comprehensively, we conducted twelve experiments, testing two models across three prompt types and two inference configurations on the complete test set.

II. DATASET

A. RISC Dataset Overview

The RISC dataset consists of 44,521 remote sensing images (satellite imagery), each with a fixed resolution of 224 × 224 pixels. Every image is annotated with five distinct captions, for a total of 222,605 captions describing the visual content. The dataset is split into training (35,614 images), validation (4,453 images), and test (4,454 images) sets.

B. Exploratory Data Analysis

Exploratory data analysis revealed no missing image files. Caption lengths vary significantly, ranging from 4 to 50 tokens, with a mean of approximately 11.10 tokens per caption. In terms of character length, captions range from 20 to 268 characters, with an average of 61.74.

However, the dataset presents several quality issues that must be addressed. Notably, 14,632 captions are exact duplicates, which introduces redundancy. More critically, inconsistencies and contradictions exist among captions corresponding to the same image, including mismatches in object counts and syntactically ill-formed phrases. Such discrepancies can hinder model training and obscure evaluation validity.
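The quality checks above can be sketched as two simple passes over the caption data: counting exact duplicate captions, and scoring the mutual consistency of each image's caption set. The snippet below uses toy data and a crude token-overlap (Jaccard) score as a hypothetical stand-in for the BLEU-4/METEOR/cosine metrics used in the actual analysis.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for the RISC caption dictionary: image id -> five captions.
captions_per_image = {
    "img_0001": ["many planes are parked at the airport"] * 2
                + ["two planes near a terminal",
                   "an airport with several aircraft",
                   "green trees beside a runway"],
}

# (1) Count exact duplicate captions across the dataset.
all_captions = [c for caps in captions_per_image.values() for c in caps]
duplicates = sum(n - 1 for n in Counter(all_captions).values() if n > 1)

# (2) Score intra-image consistency by mean pairwise token overlap
#     (Jaccard); a crude proxy for the BLEU/METEOR/cosine analysis.
def overlap(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

for image_id, caps in captions_per_image.items():
    pairs = list(combinations(caps, 2))
    mean_sim = sum(overlap(a, b) for a, b in pairs) / len(pairs)
    print(image_id, "duplicates:", duplicates, "mean overlap:", round(mean_sim, 2))
```

A low mean pairwise score for an image flags a caption set whose members describe different content, which is exactly the inconsistency that motivates the filtering step in the next subsection.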
A similarity analysis of the captions assigned to each image reveals considerable inconsistency, with a mean BLEU-4 score of 0.21, METEOR of 0.40, and cosine similarity of 0.33 across caption sets. These low scores indicate substantial divergence among the five captions per image, reinforcing the need for filtering strategies during preprocessing to enhance dataset coherence and training efficacy.

C. Caption Selection and Alignment

To optimize the training process and improve the alignment between visual and textual representations, Phase III employed the same approach as Phase II. From the five available captions per image, a single caption was selected based on its semantic similarity to the corresponding visual content. This selection used CLIPScore [3], which measures the cosine similarity between image and text embeddings produced by the ViT-B/32 model. For each image, the caption with the highest CLIPScore was retained. In instances where