Cats, not CAT scans: a study of dataset similarity in transfer learning for 2D medical image classification

Irma van den Brandt*, Floris Fok*, Bas Mulders*, Joaquin Vanschoren*, Veronika Cheplygina†
* Eindhoven University of Technology, The Netherlands
† IT University of Copenhagen, Denmark

Abstract—Transfer learning is a commonly used strategy for medical image classification, especially via pretraining on source data and fine-tuning on target data. There is currently no consensus on how to choose appropriate source data, and the literature contains both evidence favoring large natural image datasets such as ImageNet and evidence favoring more specialized medical datasets. In this paper we perform a systematic study with nine source datasets with natural or medical images, and three target medical datasets, all with 2D images. We find that ImageNet is the source leading to the highest performances, but also that larger datasets are not necessarily better. We also study different definitions of data similarity. We show that common intuitions about similarity may be inaccurate, and therefore not sufficient to predict an appropriate source a priori. Finally, we discuss several steps needed for further research in this field, especially with regard to other types of medical images (for example 3D). Our experiments and pretrained models are available via https://www.github.com/vcheplygina/cats-scans

I. INTRODUCTION

In medical image classification, labeled data is often scarce, inviting the use of techniques such as transfer learning [26, 9, 52, 36, 28], where the goal is to reuse information across datasets. When training a neural network, transfer can be achieved by first training on a larger source dataset (for example, natural images such as cats) and then further fine-tuning on a smaller target dataset (such as a dataset of computed tomography (CT or CAT) scans).
This allows the network to reuse features it learned on the source data, thus lowering the amount of target data needed. A popular source dataset is ImageNet [15], although it has been debated whether this is the best strategy for medical target data. For example, Raghu et al. [32] argue that the dataset properties (i.e. number of classes, level of granularity of the classes, size) of ImageNet and a medical target dataset may be too different to allow effective feature reuse. More suitable features may be learned from datasets more similar to the medical target dataset [43, 45]. Some studies have indeed shown that the effectiveness of feature reuse decreases with increasing differences between the source and target task [50, 42].

In a previous study we reviewed papers which compared medical to non-medical source datasets for medical target data [8]. Out of 12 papers, three concluded that a medical source was best, three concluded that a non-medical source was best, and two did not find differences. The others did not provide a definite conclusion, but did provide other valuable insights. In general, the papers seemed to agree that source datasets need to be "large enough" and "similar enough", but exact definitions of these properties were not given. Since each paper used a different set of datasets, it was not possible to extract further conclusions from the study.

Given a new target problem, we would like to identify the best course of action without trying all possible source datasets. The goal of this study is therefore two-fold:
• Investigate the relationship between transfer learning performance and properties such as dataset size or origin of the images, and
• Investigate whether a dataset similarity measure, based on a meta-representation of such properties, can be used to predict which source dataset is the most appropriate for a particular target dataset.

We perform transfer learning experiments with nine datasets.
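One way to make the idea of a meta-representation concrete is the following minimal sketch, which describes each dataset by a handful of properties and ranks candidate sources by cosine similarity to the target. The chosen features, the (approximate) dataset sizes, and the cosine measure itself are all illustrative assumptions on our part, not the similarity definitions evaluated in this paper.

```python
import math

# Hypothetical meta-representation of each dataset as a small feature vector:
# [log10(number of images), log10(number of classes), medical origin (0/1)].
# Sizes are rough, publicly quoted figures; the representation is illustrative.
META = {
    "ImageNet": [math.log10(1_200_000), math.log10(1000), 0.0],
    "CheXpert": [math.log10(220_000), math.log10(14), 1.0],
    "ISIC2018": [math.log10(10_000), math.log10(7), 1.0],
}

def cosine_similarity(a, b):
    """Cosine similarity between two meta-feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_sources(target, candidates):
    """Order candidate source datasets by decreasing similarity to the target."""
    return sorted(candidates,
                  key=lambda s: cosine_similarity(META[s], META[target]),
                  reverse=True)
```

Under this toy measure, a medical target such as a skin lesion dataset would rank a medical source above ImageNet — which is exactly the kind of intuition whose predictive value the experiments in this paper put to the test.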
We show that the natural image dataset ImageNet, which is the largest, leads to the best performances. However, even small datasets can be valuable for transfer learning. We also examine two definitions of dataset similarity, and how they relate to transfer learning performance. Both definitions show very weak to weak correlations with performance. However, the most similar datasets appear to have the least effect on performance, which contrasts with earlier findings.

II. RELATED WORK

A. How can we do transfer learning?

Transfer learning [30] relies on the idea of transferring information from related, but not identical, learning problems. In supervised classification scenarios we normally assume that the training and test data are from the same domain D = (X, p(X)) and task T = (Y, f(·)), where X and Y are the feature and label spaces, p(X) is the distribution of the feature vectors, and f is the mapping from features to labels. In transfer learning scenarios, we assume that we are dealing with different domains D_S ≠ D_T and/or different tasks T_S ≠ T_T.

Transfer learning can be achieved via different strategies, of which a popular one is to pretrain on the source data, and then use the pretrained network either for extracting off-the-shelf features from the target data, or for further fine-tuning on the target data. In the pretraining/fine-tuning scenario, both the domain and the task can be different. Still, transfer learning can be beneficial, and various studies have looked at how to do this successfully [42, 19, 21, 37]. Some general findings

arXiv:2107.05940v1 [cs.CV] 13 Jul 2021
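As a concrete illustration of the formal definitions above, the following minimal sketch encodes a domain D = (X, p(X)) and a task T = (Y, f(·)) as simple records and checks the transfer condition D_S ≠ D_T and/or T_S ≠ T_T. The class names and example values are our own, chosen only to make the definitions tangible; they are not an API from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    feature_space: str  # X, e.g. "224x224 RGB images"
    distribution: str   # p(X), e.g. "natural photographs"

@dataclass(frozen=True)
class Task:
    label_space: str    # Y, e.g. "1000 object classes"
    mapping: str        # f, a description of the labeling function

def is_transfer_setting(d_s: Domain, t_s: Task, d_t: Domain, t_t: Task) -> bool:
    """Transfer learning applies when D_S != D_T and/or T_S != T_T."""
    return d_s != d_t or t_s != t_t
```

For example, pretraining on ImageNet (natural photographs, object classes) and fine-tuning on CT scans (grayscale medical slices, diagnostic labels) differs in both the domain and the task, which is the pretraining/fine-tuning scenario described above.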