Challenges and Pitfalls of Reproducing Machine Learning Artifacts

Cheng Li 1, Abdul Dakkak 1, Jinjun Xiong 2, and Wen-mei Hwu 3
cli99@illinois.edu, dakkak@illinois.edu, jinjun@us.ibm.com, w-hwu@illinois.edu
1 Department of Computer Science, University of Illinois, Urbana-Champaign
2 IBM Thomas J. Watson Research Center, Yorktown Heights, NY
3 Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign

Abstract—An increasingly complex and diverse collection of Machine Learning (ML) models as well as hardware/software stacks, collectively referred to as "ML artifacts", is being proposed, leading to a diverse ML landscape. These ML innovations have outpaced researchers' ability to analyze, study, and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures for ML evaluation. The current practice of sharing ML artifacts is through repositories where artifact authors post ad-hoc code and some documentation. The authors often fail to reveal critical information needed for others to reproduce their results. One often fails to reproduce artifact authors' claims, let alone adapt the model to one's own use. This article discusses common challenges and pitfalls of reproducing ML artifacts, which can serve as a guideline for ML researchers when sharing or reproducing artifacts.

I. INTRODUCTION

An increasingly complex and diverse collection of ML models as well as hardware/software stacks is being proposed, leading to a diverse ML landscape. [2] shows that the number of ML arXiv papers published has outpaced Moore's law. These ML innovations have outpaced researchers' ability to analyze, study, and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures for ML evaluation. The current practice of sharing ML artifacts is through repositories such as GitHub where model authors post ad-hoc code and some documentation.
The authors often fail to reveal critical information needed for others to reproduce their results. Some authors also release a Dockerfile. However, Docker only guarantees the software stack; it neither helps model users examine or modify the artifact to adapt it to other environments, nor provides a consistent methodology or API to perform the evaluation. In short, one often fails to reproduce artifact authors' claims, let alone adapt the model to one's own use.

This paper shows how reproducibility is an issue for ML evaluation, motivated by outlining some common pitfalls model users encounter when attempting to replicate model authors' claims. These pitfalls also inform model authors of the minimal types of information they must reveal for others to reproduce their claims. To facilitate the adoption of ML innovations, ML evaluation must be reproducible and a better way of sharing ML artifacts is needed. We propose a specification of model evaluation and an efficient system that consumes the specification to perform ML evaluation while maintaining reproducibility. Please refer to MLModelScope [1] for more details.

[Fig. 1: ResNet v1 50 using TensorFlow 1.13 on GPU systems with varying batch sizes]

II. FACTORS THAT AFFECT MODEL EVALUATION

Many SW/HW configurations must work in unison within an ML workflow to replicate a model author's claims. In the process of developing MLModelScope, we identified a few less common pitfalls; below we show how they arise and provide a suggested solution. Within MLModelScope, all these pitfalls are handled by the platform's design and the model manifest specification.

A. Hardware

Different hardware architectures can result in varying performance and accuracy, since the ML libraries across architectures could either be different or have different implementations.

Pitfall 1: Only looking at partial hardware, not the whole system. E.g., assuming inference on a Volta GPU must be faster than on a Pascal GPU.
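As an illustration of what such a model manifest might capture, the sketch below lists the kinds of fields needed to pin down an evaluation. The field names and values here are hypothetical stand-ins, not MLModelScope's actual schema:

```python
# Hypothetical model manifest for a reproducible evaluation.
# Field names and values are illustrative only, not MLModelScope's schema.
manifest = {
    "model": {"name": "ResNet_v1_50", "version": "1.0.0"},
    "framework": {"name": "TensorFlow", "version": "1.13"},
    "preprocessing": {"resize": [224, 224], "mean": [123.68, 116.78, 103.94]},
    "weights": {"url": "https://example.com/resnet50.pb", "sha256": "..."},
    "hardware": {"gpu": "V100", "interconnect": "NVLink"},
}

# A minimal completeness check: an evaluation is only reproducible if
# every one of these aspects is specified.
REQUIRED = {"model", "framework", "preprocessing", "weights", "hardware"}

def is_complete(m):
    """Return True if the manifest names every required aspect."""
    return REQUIRED.issubset(m)

print(is_complete(manifest))  # True
```

The point of pinning framework versions, preprocessing parameters, and weight checksums in one document is that each of these, if left unstated, is a common source of silent accuracy or performance divergence.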
Figure 1 compares inference performance across systems; Volta (V100) is faster than Pascal (P100) in this case. One often assumes this to always be true. However, looking only at the GPU or CPU compute sections when comparing performance is a common pitfall. Figure 2 shows that a Pascal system can outperform a Volta system because of a faster CPU-GPU interconnect. One should therefore consider the entire system and its end-to-end latency under different workload scenarios when reporting system performance results.

arXiv:1904.12437v1 [cs.LG] 29 Apr 2019
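The measurement advice above can be sketched as follows. The stage durations are simulated stand-ins (no real GPU work is performed, and the numbers are not measurements of any actual system), but the structure carries over: timing only the compute kernel understates the latency a user actually observes.

```python
import time

# Simulated pipeline stages. Durations are illustrative stand-ins,
# not real measurements of any GPU system.
def preprocess():  time.sleep(0.02)   # image decode + resize on the CPU
def h2d_copy():    time.sleep(0.03)   # host-to-device transfer over the interconnect
def compute():     time.sleep(0.05)   # GPU kernel execution
def d2h_copy():    time.sleep(0.01)   # device-to-host transfer of results

# Compute-only timing: the common pitfall.
t0 = time.perf_counter()
compute()
compute_only = time.perf_counter() - t0

# End-to-end timing: the latency the user actually observes.
t0 = time.perf_counter()
preprocess(); h2d_copy(); compute(); d2h_copy()
end_to_end = time.perf_counter() - t0

print(f"compute-only: {compute_only:.3f}s  end-to-end: {end_to_end:.3f}s")
# A GPU with a faster compute() stage can still lose end-to-end if its
# system has a slower interconnect (a larger h2d_copy() stage).
```

This is exactly the Figure 2 scenario: ranking systems by the compute stage alone can invert the end-to-end ranking when the interconnect dominates.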