David C. Wyld et al. (Eds): AIMLA, DBDM, CCNET - 2021
pp. 21-33, 2021. CS & IT - CSCP 2021    DOI: 10.5121/csit.2021.111302

DIVIDE-AND-CONQUER FEDERATED LEARNING UNDER DATA HETEROGENEITY

Pravin Chandran, Raghavendra Bhat, Avinash Chakravarthy and Srikanth Chandar

Intel Technology India Pvt. Ltd, Bengaluru, India

ABSTRACT

Federated Learning allows models to be trained on data stored in distributed devices without centralizing the training data, thereby preserving data privacy. The ability to handle data heterogeneity (non-identical and independent distribution, or non-IID) is a key enabler for the wider deployment of Federated Learning. In this paper, we propose a novel Divide-and-Conquer training methodology that enables the use of the popular FedAvg aggregation algorithm by overcoming its acknowledged limitations in non-IID environments. We propose a novel use of a cosine-distance based weight-divergence metric to determine the exact point at which a deep learning network can be divided into class-agnostic initial layers and class-specific deep layers for Divide-and-Conquer training. We show that the methodology achieves trained-model accuracy at par with (and in certain cases exceeding) state-of-the-art algorithms such as FedProx and FedMA. We also show that this methodology leads to compute and/or bandwidth optimizations under certain documented conditions.

KEYWORDS

Federated Learning, Divide and Conquer, Weight divergence.

1. INTRODUCTION

Federated Learning has been proposed as a new learning paradigm to overcome the privacy regulations and communication overheads associated with centralized training [1,2]. In Federated Learning, a central server shares a global model with participating client devices, and the model is trained on the local datasets available at each client. The local dataset is never shared with the server; instead, local updates to the global model are shared with the server.
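The communication round just described can be sketched as a FedAvg-style weighted average of client models. This is an illustrative sketch, not the paper's implementation; the helper names (`fedavg_round`, `local_train`) are hypothetical, and model weights are represented as lists of NumPy arrays for simplicity.

```python
import numpy as np

def fedavg_round(global_weights, client_datasets, local_train):
    """One communication round: broadcast the global model,
    train locally at each client, then aggregate the updates.

    global_weights: list of np.ndarray (one entry per layer)
    client_datasets: list of per-client local datasets
    local_train: function(weights, data) -> locally trained weights
    """
    updates, sizes = [], []
    for data in client_datasets:
        # Each client starts from a copy of the current global model
        # and trains only on its own local data (data never leaves the client).
        local = local_train([w.copy() for w in global_weights], data)
        updates.append(local)
        sizes.append(len(data))
    total = sum(sizes)
    # FedAvg aggregation: average client weights, layer by layer,
    # weighted by the size of each client's local dataset.
    return [
        sum((n / total) * upd[i] for upd, n in zip(updates, sizes))
        for i in range(len(global_weights))
    ]
```

In practice this round is repeated until the convergence criteria are met; the limitation addressed by this paper is that such plain averaging degrades when the client datasets are non-IID.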
The server combines the local updates from the participating clients using an optimization (or aggregation) algorithm and creates a new version of the global model. This process is repeated for the required number of communication rounds until the desired convergence criteria are achieved. Federated Learning differs significantly from traditional learning approaches in terms of optimization in a distributed setting, privacy-preserving learning, and communication latency during the learning process [3].

Optimization in a distributed setting differs from the traditional learning approach due to statistical and systems heterogeneity [1]. The statistical heterogeneity manifests itself in the form of a non-independent and identical distribution (non-IID) of training data across participating clients. The non-IID condition arises for a host of reasons that are specific to the local environment and usage patterns at each client. Causes of skewed data distribution have been surveyed extensively, and it has been shown that any real-world-scale deployment of Federated Learning must address the challenges around non-IID data. A good example specific to the medical domain can be found in [4]. Several approaches have been