DI725 Project Phase III: Parameter-Efficient Fine-Tuning of PaliGemma for Image Captioning

İbrahim Ethem Deveci
Cognitive Sciences, METU Informatics Institute, Ankara, Turkey
ethem.deveci@metu.edu.tr

Abstract—This document presents the final report for the Transformers and Attention-Based Deep Networks course, centered on improving the image captioning capabilities of the PaliGemma vision-language model (VLM) through QLoRA-based fine-tuning on the RISC dataset. We report the outcomes of Phase III, which builds upon the baseline and Phase II results. In this phase, we applied two distinct hyperparameter configurations, each trained on the full training and validation sets. To systematically evaluate the fine-tuned models, we conducted a series of 12 experiments on the complete test set: two models tested across three prompt types and two inference configurations. This phase specifically investigates the efficacy of low-rank adaptation techniques in enhancing model performance under constrained fine-tuning conditions.

GitHub: GitHub repository of the project phase III.

Index Terms—parameter-efficient fine-tuning, quantized LoRA, image captioning, PaliGemma

I. INTRODUCTION

The main question this project aimed to answer was whether parameter-efficient fine-tuning methods, when paired with evaluation criteria that reflect the functional demands of image captioning, can enhance the performance of VLMs such as PaliGemma beyond baseline levels while maintaining minimal computational overhead. Given the large scale of PaliGemma, full fine-tuning is computationally expensive in practical environments. The project therefore focused on parameter-efficient fine-tuning strategies [1], [2] as a scalable alternative for adapting large VLMs to domain-specific objectives.
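To make the parameter-efficiency argument concrete, the following is a minimal numeric sketch of the low-rank update that underlies LoRA (and its quantized variant, QLoRA): the pretrained weight W stays frozen, and only the factors B and A of a rank-r update are trained. The dimensions and scaling factor below are illustrative assumptions, not the configuration used in this project.

```python
import numpy as np

# Sketch of the LoRA update: W' = W + (alpha / r) * B @ A,
# where only A (r x k) and B (d x r) are trainable.
d, k, r, alpha = 1024, 1024, 8, 16  # illustrative sizes, not the project's config

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection (zero-init)

# With B initialized to zero, the adapted weight equals W at the start
# of training, so fine-tuning begins from the pretrained behavior.
W_adapted = W + (alpha / r) * (B @ A)

full_params = d * k            # parameters of a full fine-tune of this layer
lora_params = r * (d + k)      # trainable parameters under LoRA
print(f"trainable fraction: {lora_params / full_params:.4f}")  # → 0.0156
```

At rank 8 for a 1024×1024 projection, only about 1.6% of the layer's parameters are trained, which is the property that makes fine-tuning PaliGemma-scale models feasible on constrained hardware.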
In Phase II, Quantized Low-Rank Adaptation (QLoRA) [2] was introduced as a parameter-efficient fine-tuning method applied exclusively to the language model component, with the vision tower and multi-modal projector parameters frozen. This approach yielded significant improvements, demonstrating QLoRA's potential for adapting large-scale vision-language models under hardware constraints.

Building upon these results, Phase III explores the efficacy of QLoRA fine-tuning under two distinct hyperparameter configurations. Each configuration was trained on the full training and validation sets to maximize data utilization. To evaluate the fine-tuned models comprehensively, we conducted twelve experiments, testing two models across three prompt types and two inference configurations on the complete test set.

II. DATASET

A. RISC Dataset Overview

The RISC dataset consists of 44,521 remote sensing images (satellite imagery), each with a fixed resolution of 224 × 224 pixels. Every image is annotated with five distinct captions, for a total of 222,605 captions describing the visual content. The dataset is split into training (35,614 images), validation (4,453 images), and test (4,454 images) sets.

B. Exploratory Data Analysis

Exploratory data analysis revealed no missing image files. Caption lengths vary significantly, ranging from 4 to 50 tokens, with a mean of approximately 11.10 tokens per caption. In terms of character length, captions range from 20 to 268 characters, with an average of 61.74.

However, the dataset presents several quality issues that must be addressed. Notably, 14,632 captions are exact duplicates, which introduces redundancy. More critically, inconsistencies and contradictions exist among captions corresponding to the same image, including mismatches in object counts and syntactically ill-formed phrases. Such discrepancies can hinder model training and obscure evaluation validity.
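The quality checks above can be sketched as two simple passes over the caption data: counting exact duplicate captions, and scoring the mutual consistency of each image's caption set. The snippet below uses toy data and a crude token-overlap (Jaccard) score as a hypothetical stand-in for the BLEU-4/METEOR/cosine metrics used in the actual analysis.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for the RISC caption dictionary: image id -> five captions.
captions_per_image = {
    "img_0001": ["many planes are parked at the airport"] * 2
                + ["two planes near a terminal",
                   "an airport with several aircraft",
                   "green trees beside a runway"],
}

# (1) Count exact duplicate captions across the dataset.
all_captions = [c for caps in captions_per_image.values() for c in caps]
duplicates = sum(n - 1 for n in Counter(all_captions).values() if n > 1)

# (2) Score intra-image consistency by mean pairwise token overlap
#     (Jaccard); a crude proxy for the BLEU/METEOR/cosine analysis.
def overlap(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

for image_id, caps in captions_per_image.items():
    pairs = list(combinations(caps, 2))
    mean_sim = sum(overlap(a, b) for a, b in pairs) / len(pairs)
    print(image_id, "duplicates:", duplicates, "mean overlap:", round(mean_sim, 2))
```

A low mean pairwise score for an image flags a caption set whose members describe different content, which is exactly the inconsistency that motivates the filtering step in the next subsection.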
A similarity analysis of the captions assigned to each image reveals considerable inconsistency, with a mean BLEU-4 score of 0.21, METEOR of 0.40, and cosine similarity of 0.33 across caption sets. These low scores indicate substantial divergence among the five captions per image, reinforcing the need for filtering strategies during preprocessing to enhance dataset coherence and training efficacy.

C. Caption Selection and Alignment

To optimize the training process and improve the alignment between visual and textual representations, Phase III employed the same approach as Phase II. From the five available captions per image, a single caption was selected based on its semantic similarity to the corresponding visual content. This selection used CLIPScore [3], which measures the cosine similarity between image and text embeddings produced by the ViT-B/32 model. For each image, the caption with the highest CLIPScore was retained. In instances where