Enrollment Rate Prediction in Clinical Trials based on CDF Sketching and Tensor Factorization tools

Magda Amiridi *, Department of ECE, University of Virginia, ma7bx@virginia.edu
Cheng Qian, ACOE, IQVIA, alextoqc@gmail.com
Nicholas D. Sidiropoulos *, Department of ECE, University of Virginia, nikos@virginia.edu
Lucas M. Glass, ACOE, IQVIA, lmglass@us.imshealth.com

Abstract—Patient enrollment is critical to the success of a clinical trial. In practice, before launching a trial, one of the top priorities is to predict the enrollment rate for different countries, so that one can select clinical sites from the countries with the highest enrollment rates to accelerate patient recruitment. However, estimating the enrollment rate from limited trial information remains a challenge. To deal with this problem, we adopt a very recent tensor factorization approach that aims to approximate the joint Cumulative Distribution Function (CDF) of trial enrollment data. We can always sketch a multivariate CDF in terms of a multidimensional empirical cumulative probability array, i.e., a finite grid-sampled CDF tensor, and introduce a low-rank parametrization by a Canonical Polyadic Decomposition (CPD) model. The proposed model is unassuming of the structure of the data and identifiable under mild conditions by virtue of the uniqueness of CPD. At the same time, it affords both efficient sample likelihood estimation and closed-form inference. Such a model can be leveraged for reliable enrollment estimation, delivering probability estimates of a specific trial meeting an expected enrollment rate or of its enrollment rate falling within a certain interval, as well as country recommendation.
Experimental results demonstrate that the proposed method improves on the best baselines in the enrollment rate prediction task by up to 12.2% in mean squared error on a real country-level trial dataset, while also offering direct means of quantifying uncertainty in the predictions based on the fitted model. The improved performance and versatility highlight the societal and financial benefits of the proposed approach, which could be transformational in modern healthcare.

* Supported in part by NSF IIS-1704074.

I. INTRODUCTION

Clinical trials are of vital importance to global healthcare. They are a necessary step in drug development to ensure the safety and efficacy of new therapies and vaccines. One of the main reasons clinical trials are terminated is insufficient patient enrollment. Accurate enrollment rate prediction, given relevant information (trial predictors) such as phase, country, eligibility criteria, patient segment, and indication, is a key determinant of optimized clinical trial planning, as it serves to minimize the overall cost and possible delays in the approval of an effective treatment. The overall cost incurred by trials with low enrollment rates can be minimized by not initiating trials that are highly unlikely to meet pre-specified recruitment targets. Thus, the ability to predict, before a trial is initiated, the probability that it will meet its enrollment goals is highly beneficial for pharmaceutical, biotech, and medical device companies. Additionally, the expected enrollment rate of a given trial may be maximized by careful location selection (i.e., by primarily focusing on countries likely to meet enrollment targets). Both of these aspects can be addressed if an efficient model of the data-generating distribution of clinical trials is available.
The need for probabilistic analysis in predicting the enrollment rate of clinical trials using a limited number of samples motivates the proposed method. We explore a data-driven, machine learning approach for effectively modeling the multivariate distribution of trial predictors/features (trial phase, location, primary indication, therapeutic area) and enrollment rate based on multivariate cumulative distribution functions (CDFs) and the Canonical Polyadic (tensor-rank) Decomposition (CPD), also known as the CANDECOMP/PARAFAC decomposition [1], [2], [3]. This paper builds upon the framework and methods developed in [4] to solve an important practical problem of high societal impact, using a unique clinical trial dataset that has a long historical record of country-level enrollment rates. We leverage the results presented in [4], which show that one can always sketch a multivariate CDF in terms of a multidimensional empirical cumulative probability array, i.e., a finite grid-sampled CDF tensor, and that every multivariate grid-sampled CDF can be thought of as generated by a latent variable Naive Bayes model with a bounded number of hidden states, obtained through a CPD in which the number of hidden states corresponds to the tensor rank. To jointly model both discrete (e.g., phase, country) and continuous variables (e.g., enrollment rate), we sketch the trial-related multivariate distribution in terms of a grid-sampled CDF tensor F, where each mode represents the (finite) levels/cut-offs of the CDF for an individual trial feature (e.g., phase I, II, III, IV for the trial phase feature and USA, Germany, etc. for the location feature), and each element of that tensor can be easily estimated via sample averaging.
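To make the two objects described above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes NumPy is available, that discrete features have been integer-coded in some fixed order (an assumption made here for illustration; nominal features such as country have no natural order), and that the cut-off grids are chosen by the user. The function names `empirical_cdf_tensor` and `cpd_cdf_tensor`, and the toy values, are hypothetical. The first function estimates the grid-sampled CDF tensor by sample averaging; the second evaluates a rank-R CPD in its latent-variable Naive Bayes form, with one weight per hidden state and one grid-sampled conditional CDF per variable and hidden state.

```python
from functools import reduce

import numpy as np


def empirical_cdf_tensor(samples, grids):
    """Grid-sampled empirical CDF via sample averaging.

    samples: (M, N) array-like, one row per trial, one column per variable.
             Discrete features are assumed to be integer-coded (hypothetical
             encoding, e.g. phase I..IV -> 1..4).
    grids:   list of N increasing 1-D cut-off sequences, one per variable.
    Returns F with F[i1,...,iN] = fraction of rows x such that
    x[n] <= grids[n][i_n] for every variable n.
    """
    samples = np.asarray(samples, dtype=float)
    cutoffs = [np.asarray(g, dtype=float) for g in grids]
    F = np.zeros(tuple(len(g) for g in cutoffs))
    for row in samples:
        # Per-variable indicator vectors 1[x_n <= cut-off].
        indicators = [(row[n] <= g).astype(float) for n, g in enumerate(cutoffs)]
        # Their outer product is this sample's contribution to the tensor.
        F += reduce(np.multiply.outer, indicators)
    return F / len(samples)


def cpd_cdf_tensor(weights, factors):
    """Rank-R CPD model of a CDF tensor (latent-variable Naive Bayes form).

    weights: length-R prior over hidden states (nonnegative, sums to 1).
    factors: list of N arrays; factors[n][:, r] is the grid-sampled
             conditional CDF of variable n given hidden state r.
    Returns F[i1,...,iN] = sum_r weights[r] * prod_n factors[n][i_n, r].
    """
    weights = np.asarray(weights, dtype=float)
    mats = [np.asarray(A, dtype=float) for A in factors]
    F = np.zeros(tuple(A.shape[0] for A in mats))
    for r, w in enumerate(weights):
        # Rank-one term: weight times outer product of the r-th columns.
        F += w * reduce(np.multiply.outer, [A[:, r] for A in mats])
    return F


# Toy usage: columns are (integer-coded phase, enrollment rate); values are made up.
toy_samples = [[1, 0.5], [2, 1.5], [2, 0.2], [3, 2.5]]
toy_grids = [[1, 2, 3], [1.0, 2.0, 3.0]]
F_hat = empirical_cdf_tensor(toy_samples, toy_grids)
print(F_hat)
```

Because the tensor stores cumulative probabilities, the probability that a variable falls between two adjacent cut-offs is obtained by differencing entries along the corresponding mode; for the toy data, `F_hat[1, 1] - F_hat[1, 0]` is the fraction of trials with phase code at most 2 and enrollment rate in (1.0, 2.0]. Fitting the weights and factors to an empirical tensor is not shown here.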
By approximating F with a rank-R CPD parameterization, one obtains a “universal” approach that is unassuming of the structure of the data and strikes an excellent trade-off between model generality and identifiability, the latter holding under mild conditions by virtue of the uniqueness of CPD. The resulting probabilistic model also allows easy likelihood estimation, computation of conditional distributions, and closed-form inference.

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10096026

The