PUC Chile team at Caption Prediction: ResNet visual encoding and caption classification with Parametric ReLU
Vicente Castro, Pablo Pino, Denis Parra and Hans Lobel
Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860, Macul, 7820244, Chile
Abstract
This article describes the PUC Chile team's participation in the Caption Prediction task of the ImageCLEFmedical 2021 challenge, a task our team won. We first show how a very simple approach based on statistical analysis of the captions, without using the images at all, yields a competitive baseline score. We then describe how we improved on this preliminary submission by encoding the medical images with a ResNet CNN, pre-trained on ImageNet and later fine-tuned on the challenge dataset. This visual encoding serves as the input to a multi-label classification approach for caption prediction. We describe our final approach in detail and conclude by discussing some ideas for future work.
Keywords
Image Captioning, Medical Artificial Intelligence, Deep Learning, Perceptual Similarity, Convolutional Neural Networks
1. Introduction
ImageCLEF [1] is an initiative that aims to advance the field of image retrieval (IR) and to improve the evaluation of technologies for the annotation, indexing and retrieval of visual data. It is organized as a set of challenges and pays particular attention to recent changes in the IR field. These changes have given rise to tasks that combine different types of data, such as text, images and other features, moving towards multi-modality.
ImageCLEF has been held annually since 2003, and since its second edition (2004) some of its tasks, such as medical image retrieval, have involved medical images. Since then, the ImageCLEFmedical group of tasks [2] has added further tasks involving medical images, including the medical image captioning task, which has run since 2017. This task consists of two subtasks: concept detection and caption prediction. Although the data used has changed across recent editions of the challenge, the goal of the task remains the same: to help physicians reduce the burden of manually translating visual medical information (such as radiology images) into textual descriptions. In particular, the caption prediction subtask of the ImageCLEFmedical 2021 challenge aims to support clinicians in their responsibility to provide clinical diagnoses by composing a coherent caption that describes a medical image as a whole.
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
vvcastro@uc.cl (V. Castro)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073