COVID-19 Likelihood Meter: a machine learning approach to COVID-19 screening for Indonesian health workers

Running Head: A data-driven COVID-19 screening tool for the Indonesian health workers

Author List: Shreyash Sonthalia 1, Muhammad Aji Muharrom 1, Levana L. Sani 1, Olivia Herlinda 2, Adrianna Bella 2, Dimitri Swasthika 2, Panji Hadisoemarto 3, Diah Saminarsih 4, Nurul Luntungan 2, Astrid Irwanto 1,5, Akmal Taher 6, Joseph L. Greenstein 7

Affiliations:
1 Nalagenetics Pte Ltd, Singapore, Singapore
2 Center for Indonesia’s Strategic Development Initiatives (CISDI), Jakarta, Indonesia
3 Department of Public Health, Faculty of Medicine, Padjajaran University, Bandung, Indonesia
4 World Health Organization, Geneva, Switzerland
5 Department of Pharmacy, Faculty of Science, National University of Singapore, Singapore
6 Department of Urology, Cipto Mangunkusumo Hospital, Universitas Indonesia, Jakarta, Indonesia
7 Institute for Computational Medicine, The Johns Hopkins University, Baltimore, United States

*Dr. Greenstein’s participation in this study was as an unpaid consultant for Nalagenetics. All opinions expressed and implied in this manuscript do not represent or reflect the views of the Johns Hopkins University or the Johns Hopkins Health System.

Abstract Word Count: 384
Text Word Count: 3475

ABSTRACT

The COVID-19 pandemic poses a heightened risk to health workers, especially in low- and middle-income countries such as Indonesia. Because mass RT-PCR testing of health workers is difficult to implement, high-performing and cost-effective methodologies must be developed to help identify COVID-19 positive health workers and protect the spearhead of the battle against the pandemic. This study aimed to investigate the application of machine learning classifiers to predict the risk of COVID-19 positivity (by RT-PCR) using data obtained from a survey specific to health workers.
Machine learning tools can enhance COVID-19 screening capacity in high-risk populations such as health workers in environments where cost is a barrier to accessibility of adequate testing and screening supplies. We built two sets of COVID-19 Likelihood Meter (CLM) models: one trained on data from a broad population of health workers in Jakarta and Semarang (full model) and tested on the same, and one trained on health workers from Jakarta only (Jakarta model) and tested on an independent population of Semarang health workers. The area under the receiver-operating-characteristic curve (AUC), average precision (AP), and the Brier score (BS) were used to assess model performance. Shapley additive explanations (SHAP) were used to analyze feature importance. The final dataset for the study included 3979 health workers. For the full model, the random forest was selected as the algorithm of choice. It achieved cross-validation mean AUC of 0.818 ± 0.022 and AP of 0.449 ± 0.028 and was high performing during testing with AUC and AP of 0.831 and 0.428

medRxiv preprint doi: https://doi.org/10.1101/2021.10.15.21265021; this version posted October 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.