Unified real-time environmental-epidemiological data for multiscale modeling of the COVID-19 pandemic

Hamada S. Badr 1*, Benjamin F. Zaitchik 2, Gaige H. Kerr 3, Nhat-Lan H. Nguyen 4, Yen-Ting Chen 4, Patrick Hinson 4, Josh M. Colston 4, Margaret N. Kosek 4, Ensheng Dong 1, Hongru Du 1, Maximilian Marshall 1, Kristen Nixon 1, Arash Mohegh 3, Daniel L. Goldberg 3, Susan C. Anenberg 3, and Lauren M. Gardner 1

1 Department of Civil and Systems Engineering, Johns Hopkins University, Baltimore, MD 21218
2 Department of Earth and Planetary Sciences, Johns Hopkins University, Baltimore, MD 21218
3 Department of Environmental and Occupational Health, Milken Institute School of Public Health, George Washington University, Washington, DC 20052
4 Division of Infectious Diseases and International Health, University of Virginia School of Medicine, Charlottesville, VA 22903

* Corresponding author at: JHU, 3400 N. Charles Street, Latrobe 5C, Baltimore, MD 21218, USA. E-mail address: badr@jhu.edu (Hamada S. Badr).

Key Words: COVID-19; SARS-CoV-2; Coronavirus; Pandemic; Infectious Diseases; Epidemiology; Hydrometeorology; Air Quality; Machine Learning.

Abstract

An impressive number of COVID-19 data catalogs exist. None, however, are optimized for data science applications: inconsistent naming and data conventions, uneven quality control, and lack of alignment between disease data and potential predictors pose barriers to robust modeling and analysis. To address this gap, we generated a unified dataset that integrates data from numerous leading COVID-19 epidemiological and environmental sources and applies quality checks to them. We use a globally consistent hierarchy of administrative units to facilitate analysis within and across countries.
The dataset applies this unified hierarchy to align COVID-19 case data with a number of other data types relevant to understanding and predicting COVID-19 risk, including hydrometeorological data, air quality, information on COVID-19 control policies, and key demographic characteristics.

medRxiv preprint doi: https://doi.org/10.1101/2021.05.05.21256712; this version posted May 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Background & Summary

The ongoing COVID-19 pandemic has caused widespread illness, loss of life, and societal upheaval across the globe. As the public health crisis continues, there is both an urgent need and a unique opportunity to track and characterize the spread of the virus and the sensitivity of disease transmission to demographic, geographic, socio-political, seasonal, and environmental factors, including the influence of climate and air quality conditions. The global research and data science communities have responded to this challenge with a wide array of efforts to collect, catalog, and disseminate data on COVID-19 case numbers, hospitalizations, mortality, and other indicators of COVID-19 incidence and burden [1-12]. Some of these efforts have attempted to integrate data at regional to global scales, including inventories at the finest geographic scale available. While these databases