Enrollment Rate Prediction in Clinical Trials based
on CDF Sketching and Tensor Factorization tools
Magda Amiridi*
Department of ECE
University of Virginia
ma7bx@virginia.edu
Cheng Qian
ACOE, IQVIA
alextoqc@gmail.com
Nicholas D. Sidiropoulos*
Department of ECE
University of Virginia
nikos@virginia.edu
Lucas M. Glass
ACOE, IQVIA
lmglass@us.imshealth.com
Abstract—Patient enrollment is critical to the success of a
clinical trial. In practice, before launching a trial, one of the top
priorities is to predict the enrollment rate for different countries,
so that one can select clinical sites from the countries with
the highest enrollment rates to accelerate patient recruitment.
However, based on the limited trial information, estimating the
enrollment rate is still a challenge. To deal with this problem, we
adopt a very recent tensor factorization approach that aims to
approximate the joint Cumulative Distribution Function (CDF) of
trial enrollment data. We can always sketch a multivariate CDF
in terms of a multidimensional empirical cumulative probability
array, i.e., a finite grid-sampled CDF tensor, and introduce a
low-rank parametrization by a Canonical Polyadic Decomposition
(CPD) model. The proposed model is unassuming of the structure
of the data and identifiable under mild conditions by virtue
of the uniqueness of CPD. At the same time, it affords both
efficient sample likelihood estimation and closed-form inference.
Such a model can be leveraged for reliable enrollment estimation,
delivering probability estimates of a specific trial meeting
an expected enrollment rate or of its enrollment rate falling
within a certain interval, as well as country recommendation.
Experimental results demonstrate an improved performance of
the proposed method in the enrollment rate prediction task over
the best baselines by up to 12.2% in mean squared error on a
real country-level trial dataset, while also offering direct means
of quantifying uncertainty in the predictions based on the fitted
model. The improved performance and versatility highlight the
societal and financial benefits of the proposed approach, which
could be transformational in modern healthcare.
I. INTRODUCTION
Clinical trials are of vital importance to global healthcare.
They are a necessary step in drug development to ensure the
safety and efficacy of new therapies and vaccines. One of the
main reasons clinical trials are terminated
is insufficient patient enrollment. Accurate enrollment rate
prediction, given relevant information (trial predictors) such as
phase, country, eligibility criteria, patient segment, and indication,
is key to optimized clinical trial planning, as it helps
minimize the overall cost and possible delays
in the approval of an effective treatment. Minimizing
the overall cost incurred by trials with low enrollment rates
can be achieved by avoiding initiating trials that are most
unlikely to meet pre-specified recruitment targets. Thus, the
ability to predict the probability of proposed trials meeting
enrollment goals prior to initiating the trial is highly beneficial
for pharmaceutical, biotech, and medical device companies.
Additionally, maximizing the expected enrollment rate of a
given trial may be accomplished by careful location selection
(i.e., by primarily focusing on countries likely to meet
enrollment targets). Both of these aspects can be addressed
if an efficient model of the data-generating distribution of
clinical trials is available.
* Supported in part by NSF IIS-1704074.
The need for probabilistic analysis in predicting the enrollment
rate of clinical trials from a limited number of samples
motivates the proposed method. We explore a data-driven,
machine learning approach for effectively modeling the mul-
tivariate distribution of trial predictors/features (trial phase,
location, primary indication, therapeutic area) and enrollment
rate based on multivariate cumulative distribution functions
(CDFs) and the Canonical Polyadic (tensor-rank) decompo-
sition or CANDECOMP/PARAFAC decomposition [1], [2],
[3]. This paper builds upon the framework and the methods
developed in [4] to solve an important practical problem of
high societal impact, using a unique clinical trial dataset that
has a long historical record of country level enrollment rates.
We leverage the results presented in [4], which show that
one can always sketch a multivariate CDF in terms of a
multidimensional empirical cumulative probability array, i.e.,
a finite grid-sampled CDF tensor, and that every multivariate
grid-sampled CDF can be thought of as being generated by a
latent variable Naive Bayes model, obtained through CPD, whose
bounded number of hidden states corresponds to the tensor rank.
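As a sketch of the model described above (the symbols below are notational assumptions introduced here for illustration, not taken from the text), a latent variable Naive Bayes model with R hidden states writes the joint CDF of N variables as a mixture of products of conditional univariate CDFs. Sampling it on a finite grid of cut-offs then gives a tensor that admits a rank-R CPD:

```latex
% Joint CDF under a latent variable H with R states (Naive Bayes structure)
F_{X_1,\dots,X_N}(x_1,\dots,x_N)
  = \sum_{h=1}^{R} p_H(h) \prod_{n=1}^{N} F_{X_n \mid H}(x_n \mid h),
\qquad p_H(h) \ge 0, \quad \sum_{h=1}^{R} p_H(h) = 1 .

% Grid-sampled version: evaluating at cut-offs x_n^{(i_n)} yields a rank-R CPD
\underline{F}(i_1,\dots,i_N)
  = \sum_{h=1}^{R} p_H(h) \prod_{n=1}^{N} A_n(i_n, h),
\qquad A_n(i_n, h) := F_{X_n \mid H}\big(x_n^{(i_n)} \mid h\big).
```

Under this reading, each column of a factor matrix A_n is a univariate conditional CDF sampled on the grid, so it is nondecreasing and takes values in [0, 1], while the weights p_H(h) form a probability mass function over the hidden states.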
To jointly model both discrete (e.g., phase, country) and
continuous variables (e.g., enrollment rate), we will sketch
the trial-related multivariate distribution in terms of a
grid-sampled CDF tensor F, where each mode represents the
(finite) levels/cut-offs of the CDF for an individual trial
feature (e.g., phase I, II, III, IV for the trial phase feature and
USA, Germany, etc. for the location feature) and each element
of that tensor can be easily estimated via sample averaging.
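The sample-averaging step can be illustrated with a minimal sketch, under stated assumptions: the function name `empirical_cdf_tensor`, the integer coding of the phase feature, and the toy samples are all hypothetical and are not drawn from the dataset used in this work. Each entry of the tensor is the fraction of samples that fall at or below the corresponding cut-off in every mode.

```python
import numpy as np

def empirical_cdf_tensor(samples, grids):
    """Estimate a grid-sampled CDF tensor by sample averaging.

    samples: (M, N) array, one row per trial, one column per feature.
    grids:   list of N strictly increasing 1-D arrays of cut-offs.
    Returns F with F[i1,...,iN] = fraction of samples such that
    samples[:, n] <= grids[n][i_n] holds for every feature n.
    """
    samples = np.asarray(samples, dtype=float)
    M, N = samples.shape
    shape = tuple(len(g) for g in grids)
    # For each feature, index of the first cut-off that is >= the sample value.
    idx = [np.searchsorted(g, samples[:, n], side="left")
           for n, g in enumerate(grids)]
    # Samples above the largest cut-off in some mode fall in no grid cell.
    keep = np.all([i < s for i, s in zip(idx, shape)], axis=0)
    counts = np.zeros(shape)
    np.add.at(counts, tuple(i[keep] for i in idx), 1.0)
    # Cumulative sums along every mode turn the histogram into a CDF.
    F = counts
    for axis in range(N):
        F = np.cumsum(F, axis=axis)
    return F / M

# Hypothetical toy data: column 0 = phase coded as 1, 2, 3;
# column 1 = enrollment rate (arbitrary units).
samples = np.array([[1, 0.5], [2, 1.5], [2, 0.2], [3, 2.5]])
grids = [np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])]
F = empirical_cdf_tensor(samples, grids)
print(F[1, 1])  # fraction of trials with phase <= 2 and rate <= 2.0
```

For a purely categorical feature such as country, the ordering of the levels in this sketch is an arbitrary coding choice; the cumulative sums only require that the same ordering be used consistently.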
By introducing a reconstructed approximation of F, using
a rank-R parameterization, one can get a “universal”
approach, which is unassuming of the structure of the data,
striking an excellent trade-off between model generality and
identifiability under mild conditions by virtue of the uniqueness
of CPD. The resulting model also yields a probabilistic
method which allows easy likelihood estimation, computation
of conditional distributions, and closed-form inference. The
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10096026