FedDropoutAvg: Generalizable federated learning for histopathology image classification

Gozde N. Gunesli*, Mohsin Bilal, Shan E Ahmed Raza, and Nasir M. Rajpoot, Member, IEEE

Abstract—Federated learning (FL) enables collaborative learning of a deep learning model without sharing the data of participating sites. FL in medical image analysis tasks is relatively new and open for enhancements. In this study, we propose FedDropoutAvg, a new federated learning approach for training a generalizable model. The proposed method takes advantage of randomness, both in client selection and in the federated averaging process. We compare FedDropoutAvg to several algorithms in an FL scenario for a real-world multi-site histopathology image classification task. We show that with FedDropoutAvg, the final model can achieve performance better than other FL approaches and closer to that of a classical deep learning model that requires all data to be shared for centralized training. We test the trained models on a large dataset consisting of 1.2 million image tiles from 21 different centers. To evaluate the generalization ability of the proposed approach, we use held-out test sets from centers whose data was used in the federated training, as well as unseen data from other independent centers whose data was not. We show that the proposed approach is more generalizable than other state-of-the-art federated training approaches. To the best of our knowledge, ours is the first study to use a randomized client and local model parameter selection procedure in a federated setting for a medical image analysis task.

Index Terms—Federated Learning, Model Aggregation, Convolutional Neural Networks, Computational Pathology

I. INTRODUCTION

In recent years, deep learning methods have shown success in many different tasks, including those in computational pathology [1]. A major drawback of these approaches is the need for large amounts of data to train the networks.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Asterisk indicates corresponding author.
*G. N. Gunesli is with the Department of Computer Science, University of Warwick, UK (e-mail: Gozde.Gunesli-Noyan@warwick.ac.uk).
M. Bilal is with the Department of Computer Science, University of Warwick, UK (e-mail: Mohsin.Bilal@warwick.ac.uk).
Shan E Ahmed Raza is with the Department of Computer Science, University of Warwick, UK (e-mail: Shan.Raza@warwick.ac.uk).
N. M. Rajpoot is with the Department of Computer Science, University of Warwick, UK (e-mail: N.M.Rajpoot@warwick.ac.uk).

This drawback is even more obstructive in the medical field, as medical data are difficult to access and their sharing may be subject to legal and ethical limitations. Federated learning (FL) [2], [3] makes it possible to overcome these challenges. While traditional deep learning approaches require all data to be co-located in a central server where model training takes place, in an FL paradigm each of multiple decentralized centers holding local data trains the model on its own servers. By enabling deep learning models to be trained collaboratively without exchanging the datasets, FL offers a solution to data ownership and governance issues [4].

Existing FL methods comprise several rounds of local training and federated aggregation steps. In each round of the federated training process, each data-holder trains a model for some number of epochs on its local dataset. The local data-holders then send their trained models to a central server for model aggregation. The aggregated model is sent back to the data-holders for further training rounds. Model aggregation is an important step for the overall performance. The most common method of model aggregation in existing FL studies is Federated Averaging (FedAvg) [3].
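As a minimal sketch of the round-based procedure described above — broadcast of the global model, local training at each data-holder, upload to a central server, and aggregation — the loop below uses plain NumPy arrays as stand-in model parameters. The function names and the toy local update are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def local_train(params, local_target, lr=0.5):
    """Stand-in for a data-holder's local training step.

    A real client would run SGD on its private dataset for a few
    epochs; here each parameter is simply nudged toward a local
    target value so the round structure can be demonstrated.
    """
    return [p - lr * (p - local_target) for p in params]

def aggregate(client_models, client_sizes):
    """Server-side FedAvg-style aggregation: average each parameter
    across clients, weighted by local dataset size."""
    weights = np.asarray(client_sizes, dtype=float) / sum(client_sizes)
    n_layers = len(client_models[0])
    return [sum(w * m[i] for w, m in zip(weights, client_models))
            for i in range(n_layers)]

def federated_training(global_params, clients, n_rounds=10):
    """Repeated rounds: broadcast -> local training -> upload -> aggregate."""
    for _ in range(n_rounds):
        # Each client starts from the current global model and trains locally.
        client_models = [local_train(list(global_params), target)
                         for target, _ in clients]
        # The server combines the uploaded models into a new global model.
        global_params = aggregate(client_models,
                                  [size for _, size in clients])
    return global_params
```

In this toy setup, two clients pulling toward targets 1.0 and 3.0 with dataset sizes 100 and 300 drive the single global parameter toward the size-weighted mean 2.5, mirroring how FedAvg's weighting favors larger data-holders.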
FedAvg computes a weighted average of the local model parameters to obtain a global model at each round. The weights in this case are determined by the number of training samples of each local data-holder. Li et al. [5] argued that local models are often substantially different from global models because of the heterogeneous and imbalanced nature of the datasets. Therefore, they proposed FedProx, which adds a proximal term to the loss function to restrict the effects of local training and prevent divergence from the global model parameters.

The study of FL in the area of medical image analysis is relatively new. Existing FL approaches in the medical imaging domain have focused on specific tasks including: analysis of brain imaging data [6], [7], [8], [9], [10], [11], CT hemorrhage segmentation [12], breast density classification in mammography data [13], pancreas segmentation in abdominal CT images [14], and classification of histopathology images [15], [16]. Most of the FL studies in medical imaging have employed the FedAvg method for model aggregation [6], [7], [8], [9], [10], [11], [13], [14], [15]. Andreux et al. [16] proposed an enhancement for the aggregation of batch normalization (BN) layers. Remedios et al. [12] incorporated momentum into the gradient averaging method.

Medical image datasets can be very heterogeneous and unbalanced. Besides being unbalanced in terms of the number of samples, datasets can differ by a large margin in data quality and sample diversity. In this case, the approach of FedAvg and FedProx of weighting the contribution of each local model by its data size may have limitations. Since we cannot know beforehand which private local dataset(s) will generalize better to the test set of another center, measuring the contributions of individual centers and accurately weighting them is not feasible.
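To make the FedProx modification concrete, the sketch below adds the proximal penalty (mu/2)·||w − w_global||² to a client's task loss, penalizing local parameters for drifting away from the current global model. The function name and the default value of mu are assumptions for illustration, not taken from [5].

```python
import numpy as np

def fedprox_local_loss(task_loss, local_params, global_params, mu=0.01):
    """Local objective with a FedProx-style proximal term.

    task_loss: the client's ordinary training loss (e.g. cross-entropy),
    given here as a float. local_params / global_params are matching
    lists of parameter arrays. The added term (mu/2) * ||w - w_global||^2
    discourages divergence from the global model; mu is a tunable
    hyperparameter (the default here is an illustrative assumption).
    """
    drift = sum(float(np.sum((w - wg) ** 2))
                for w, wg in zip(local_params, global_params))
    return task_loss + 0.5 * mu * drift
```

With mu = 0 this reduces to the plain local loss, recovering FedAvg-style local training; larger mu ties the local update more tightly to the global parameters.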
While local validation set performances could be used to increase the weight of under-performing centers during training, these under-performing centers could have low data quality and may not be worth

arXiv:2111.13230v1 [cs.CV] 25 Nov 2021