ML4Chem: A Machine Learning Package for Chemistry and Materials Science

Muammar El Khatib∗ and Wibe A de Jong
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
(Dated: March 6, 2020)

ML4Chem is an open-source machine learning library for chemistry and materials science. It provides an extendable platform to develop and deploy machine learning models and pipelines and is targeted at both non-expert and expert users. ML4Chem follows user-experience design and offers the tools needed to go from data preparation to inference. Here we introduce its atomistic module for the implementation, deployment, and reproducibility of atom-centered models. This module is composed of six core building blocks: data, featurization, models, model optimization, inference, and visualization. We present their functionality and ease of use with demonstrations utilizing neural networks and kernel ridge regression algorithms.

I. INTRODUCTION

In the last decade, machine learning (ML) has developed rapidly thanks to large amounts of available data and advances in computational hardware, e.g., faster and cheaper central processing units (CPUs), graphics processing units (GPUs), and, more recently, tensor processing units (TPUs). Algorithmic improvements in computing the gradient in weight space of feedforward neural networks with respect to a loss function[1] significantly reduced the computational time of training deep neural networks. As a result, companies like Google and Facebook introduced the most useful deep learning platforms currently available: TensorFlow[2] and PyTorch[3]. These frameworks positively impacted and advanced ML research because they helped democratize and simplify access to ML technologies for a larger audience.
∗ melkhatibr@lbl.gov

In the field of physical chemistry and materials science, ML models are being standardized and applied to tasks such as the acceleration of atomistic simulations[4–8], prediction of the electronic Hamiltonian with generative models[9, 10], extraction of continuous latent representations for the generation of molecules[11], and even the prediction of the scent of small organic molecules[12]. It is also becoming the norm to release software alongside publications that apply ML models, both to validate their results and to alleviate the “reproducibility crisis in artificial intelligence and machine learning”[13, 14]. Nevertheless, this indirectly fragments the software ecosystem because each software implementation a) requires specific data structures and b) is likely to suffer from a lack of continuous support. There are already packages that democratize ML in chemistry. For example, DeepChem[15] has played a critical role in providing users with a helpful platform of ML algorithms and featurizers for drug discovery, quantum chemistry, materials science, and biology. More recently, ChemML has been introduced as a machine learning and informatics program suite for the analysis, mining, and modeling of chemical and materials data[16]. What differentiates ML4Chem is that it focuses on easing the implementation of new functionality, extraction of intermediate quantities, interfacing with