Elastic Net to Forecast COVID-19 Cases

Tim K Johnsen
Applied Data Science
San Jose State University
San Jose, CA, USA
tim.johnsen@sjsu.edu

Jerry Z Gao
Computer Engineering
San Jose State University
San Jose, CA, USA
jerry.gao@sjsu.edu

Abstract— Forecasting novel daily cases of COVID-19 is crucial for medical, political, and other officials who handle day-to-day COVID-19-related logistics. Current machine learning approaches, though robust in accuracy, can be black boxes, specific to one region, and/or hard to apply for users with little knowledge of machine learning and programming. These limitations undermine otherwise robust machine learning methods and keep them from being used to their full potential. Thus, the presented Elastic Net COVID-19 Forecaster, or EN-CoF for short, is designed to provide an intuitive, generic, and easy-to-apply forecaster. EN-CoF is a multi-linear regressor trained on time series data to forecast the number of novel daily COVID-19 cases. EN-CoF maintains high accuracy, on par with more complex models such as ARIMA and Bi-LSTM, while gaining the advantages of transparency, generalization, and accessibility.

Keywords— COVID-19, Elastic Net, Machine Learning, Artificial Intelligence, Time Series, Forecast

I. INTRODUCTION

The 2019 novel coronavirus (COVID-19) was first observed and studied in China [1], and has since turned into a global pandemic. Daily cases are hard to forecast because there is large uncertainty in confirmed cases; thus “predictions using more complex models may not be more reliable compared to using a simpler model” [2]. Susceptible-Exposed-Infectious-Removed (SEIR) models have been used in [2] and [3] to predict how policies will affect infection rates. Artificial Intelligence and other models can forecast far into the future, but “with sizable associated uncertainty” [4]. A more realistic approach is to forecast into the near future, using a region’s recent record of novel daily COVID-19 cases (i.e.,
time series data). Reference [5] used time series data to forecast daily cases using Long Short-Term Memory (LSTM) network [6] and Autoregressive Integrated Moving Average (ARIMA) [7] models. The LSTM and ARIMA approaches were used to make 5-day forecasts for four countries: US, Italy, Spain, and Germany. Other ARIMA approaches have been used to forecast cases for specific regions [8-13]. ARIMA has been shown to be a useful tool for forecasting into the near future. However, ARIMA must be refit for each region. ARIMA-based models are typically used for their ability to learn seasonal trends, which COVID-19 has not been in circulation long enough to develop.

Most recently, Recurrent Neural Networks (RNN) were studied, and it was shown that Bi-LSTM [14] can achieve slightly greater accuracy than LSTM, Gated Recurrent Units (GRU) [15], support vector regression [16], and ARIMA models when applied to 10 countries [17]. Though robust in accuracy, RNN models do little to explain how their predictions are made, a property commonly referred to as “explainability”. Though neural network methods such as Grad-CAM [18] help, much work is still needed to improve explainability. Neural networks also typically require domain knowledge in machine learning and programming to apply in the field, which makes them harder to access.

Other models have been developed that use more novel approaches. Reference [19] trained an ensemble of multiple machine learning algorithms on time series data to forecast 1, 3, and 6 days into the future for ten Brazilian regions. Another ensemble was used to forecast daily cases in Hungary [20]. Reference [21] used internet searches, news alerts, and mechanistic models to create forecasts for 32 Chinese provinces. Reference [22] used mobile phone-based surveys to focus on specific towns under quarantine. A review of some recent AI applications can be found in [23].
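The near-term setup these studies share, in which a window of a region's recent daily case counts is used to predict the next few days, can be sketched as follows. This is a minimal illustration and not code from any cited work: the function name `make_windows`, the 7-day window, the 5-day horizon, and the toy case counts are all assumptions chosen for the example.

```python
from typing import List, Tuple


def make_windows(
    cases: List[float], n_lags: int, horizon: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split a daily-case series into (input window, future targets) pairs.

    n_lags and horizon are illustrative parameters, not values from the paper.
    """
    if n_lags < 1 or horizon < 1:
        raise ValueError("n_lags and horizon must be positive")
    inputs, targets = [], []
    # Slide one day at a time; stop when a full horizon no longer fits.
    for start in range(len(cases) - n_lags - horizon + 1):
        split = start + n_lags
        inputs.append(cases[start:split])              # the n_lags most recent days
        targets.append(cases[split:split + horizon])   # the next `horizon` days
    return inputs, targets


# Toy series of daily new cases (made-up numbers, for illustration only).
daily_cases = [3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
X, Y = make_windows(daily_cases, n_lags=7, horizon=5)
```

Window and target pairs of this kind are a common training input for supervised forecasters; the 5-day horizon here simply mirrors the forecast length reported in [5].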
These more novel approaches are intriguing and helpful; however, they are hard to deploy because the models are difficult to understand and their data is difficult to access, especially for those untrained in artificial intelligence.

Current models have been applied to specific regions, and even though they may give robust results, they are neither easily explainable nor easy to deploy; thus they lack generality, explainability, and accessibility. The presented Elastic Net COVID-19 Forecaster (EN-CoF) aims to fill these gaps. EN-CoF is intuitive: it makes forecasts by taking a linear combination of a region’s time series data, and the learned weights follow an intuitive trend. EN-CoF is generic: it can be applied to any region, because it is trained on aggregations of time series data from multiple regions. EN-CoF is easy to deploy: it requires no programming or AI knowledge, as the only things needed to deploy EN-CoF are the learned static weights and the region’s time series data. EN-CoF is robust: it performs with similar accuracy to more sophisticated models, such as ARIMA and LSTM. EN-CoF was evaluated on 151 countries, the largest number of countries evaluated to date.

II. METHODS

All models were trained and evaluated using Python and the scikit-learn [24], statsmodels [25], and Keras [26] libraries. All code, data, results, and figures can be found on GitHub [27]. Data was collected from the European Centre for Disease Prevention and Control (ECDC). Day 1 is the first day recorded in that country

© IEEE 2021. This article is free to access and download, along with rights for full text and data mining, re-use and analysis.