Communication Patterns in Distributed Deep Learning

Amir Farhat, Manya Ghobadi
Massachusetts Institute of Technology
amirf@mit.edu, ghobadi@csail.mit.edu

ABSTRACT

Machine learning has been increasingly deployed in the cloud to take advantage of massive scaling as a means of reducing the time-to-accuracy of training. To this end, different machine learning training distribution frameworks are put to use, with Horovod from Uber emerging as a popular choice. To squeeze as much performance as possible from the distribution framework, it is important to maximally overlap computation and communication while maintaining high GPU utilization, thereby reducing the duration of each training iteration. As a first step in this direction, this project sets out to study the communication component of training. We train Deep Neural Network (DNN) models of various sizes on sixteen GPUs on the Google Cloud Compute Engine platform and record both the data the workers exchange and the timing of each training iteration. Our two main observations are: (i) the amount of data exchanged between workers at each training iteration is proportional to the model size; and (ii) the duration of training is not fully determined by the model size; it also depends on the compute hardware, communication bandwidth, and batch size. On their own, these findings do not offer a complete enough picture for improving the TTA of models, but they can do so in combination with information about computation.

1 INTRODUCTION

Machine learning models are instrumental in solving complex, non-traditional problems such as image processing, controlling autonomous vehicles, natural language processing, and more. The power of such techniques, specifically deep learning, has inspired the development of increasingly complex and large models.
To ensure their effectiveness, machine learning engineers put these models through several rounds of training, fine-tuning, and testing before deployment. As models increase in size, engineers have begun distributing training across multiple servers in order to leverage the parallelism of training tasks. Specifically, through the training environment, engineers coordinate each server to train on a subset of the data that is disjoint from the subsets that the other servers handle. In this fashion, an N-fold increase in the number of servers participating in the training process will ideally lead to an N-fold speedup in training performance. In particular, a crucial metric used for evaluating training is the time-to-accuracy (TTA), a measure of the time required to train a given model until it achieves a specific accuracy [1]. The machine learning community has thus adopted distributed machine learning frameworks in an effort to improve the speed of training [2].

TensorFlow, a widely used machine learning training framework, comes equipped with a method to distribute training across multiple machines, but it is hard to use in a distributed fashion and can be slow [2, 3]. Users were consequently motivated to develop different machine learning distribution frameworks like Horovod, a flexible open-source platform for distributing training from Uber. Horovod includes a profiler named Horovod Timeline, but its information is limited in that it only displays computation time [2, 3]. A full analysis of training must measure not only computation (on each server and on its GPUs) but also communication (between servers and between GPUs). When there is little or no overlap between the computation and communication stages during training, such profiling would identify opportunities to increase the overlap between communication and computation as a means of reducing the overall TTA of a model.
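The data-parallel scheme described above can be sketched in a few lines of plain Python. This is an illustrative toy, not Horovod's implementation: each of N workers holds a disjoint shard of the data, computes a "gradient" locally (here, just the shard mean as a stand-in for a backward pass), and the per-worker results are averaged, which is the role an all-reduce plays in practice. All function names here are our own.

```python
# Toy sketch of data-parallel training (illustrative; not Horovod code).

def shard(data, num_workers):
    """Split data into num_workers disjoint, near-equal subsets."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(shard_data):
    """Stand-in for a local backward pass: the mean of the shard."""
    return sum(shard_data) / len(shard_data)

def all_reduce_average(values):
    """Average the per-worker gradients -- what an all-reduce computes."""
    return sum(values) / len(values)

data = list(range(16))           # toy "dataset"
shards = shard(data, 4)          # 4 workers, disjoint subsets
grads = [local_gradient(s) for s in shards]
avg = all_reduce_average(grads)  # equals the gradient over the full data
```

Because the shards are disjoint and cover the full dataset, the averaged result matches what a single worker would compute over all the data, which is why the scheme can in principle scale N-fold without changing the training outcome.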
In addition to overlap, this method, when combined with knowledge of the inner workings of the distribution framework at hand, exposes potential bottlenecks in the computation and communication stages. These can be used to optimize the training process for an even better TTA. As a first step toward this larger goal, this project focuses on the communication patterns during distributed training. Specifically, we make use of five different models of varying size (highlighted in Table 1) for training with the Horovod distributed training framework. We design a measurement tool to measure inter-server communication. In particular, we capture all-to-all server communication over the network using tcpdump while training is running [4]. Later, we post-process the raw capture to extract information about the communication. Our observations are as follows. First, we confirm empirically that Horovod establishes a ring topology between its workers according to the order they appear in the execution command. Second, we observe that inter-server TCP flows come
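The two observations above can be illustrated with a short sketch. The ring is derived purely from worker order, and the standard bandwidth-optimal ring all-reduce has each worker send 2(N-1)/N times the model size per reduction, which is consistent with traffic being proportional to model size. The helper names below are ours, and the formula describes the textbook ring all-reduce rather than a measurement of Horovod itself.

```python
# Sketch: ring neighbors from worker order, and per-worker traffic of a
# textbook ring all-reduce (illustrative helpers; not Horovod's API).

def ring_neighbors(workers):
    """Map each worker to its successor in the ring, following list order."""
    n = len(workers)
    return {workers[i]: workers[(i + 1) % n] for i in range(n)}

def ring_allreduce_bytes_per_worker(model_bytes, num_workers):
    """Bytes each worker sends per all-reduce: 2 * (N - 1) / N * model size."""
    return 2 * (num_workers - 1) / num_workers * model_bytes

ring = ring_neighbors(["w0", "w1", "w2", "w3"])
sent = ring_allreduce_bytes_per_worker(100e6, 16)  # 100 MB model, 16 workers
```

Note that for large N the factor 2(N-1)/N approaches 2, so per-worker traffic stays roughly constant as workers are added and scales linearly only with model size.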