Statistical Challenges Facing Early Outbreak
Detection in Biosurveillance
Galit SHMUELI
Department of Decision,
Operations & Information Technologies and
The Center for Health Information
and Decision Systems
Robert H. Smith School of Business
University of Maryland
College Park, MD 20742
(gshmueli@rhsmith.umd.edu)
Howard BURKOM
The Johns Hopkins University
Applied Physics Laboratory
Laurel, MD 20723
(Howard.Burkom@jhuapl.edu)
Modern biosurveillance is the monitoring of a wide range of prediagnostic and diagnostic data for the
purpose of enhancing the ability of the public health infrastructure to detect, investigate, and respond to
disease outbreaks. Statistical control charts have been a central tool in classic disease surveillance and
also have migrated into modern biosurveillance; however, the new types of data monitored, the processes
underlying the time series derived from these data, and the application context all deviate from the in-
dustrial setting for which these tools were originally designed. Assumptions of normality, independence,
and stationarity are typically violated in syndromic time series. Target values of process parameters are
time-dependent and hard to define, and data labeling is ambiguous in the sense that outbreak periods are
not clearly defined or known. Additional challenges include multiplicity in several dimensions, perfor-
mance evaluation, and practical system usage and requirements. Our focus is mainly on the monitoring
of time series to provide early alerts of anomalies to stimulate investigation of potential outbreaks, with
a brief summary of methods to detect significant spatial and spatiotemporal case clusters. We discuss the
statistical challenges in monitoring modern biosurveillance data, describe the current state of monitoring
in the field, and survey the most recent biosurveillance literature.
KEY WORDS: Anomaly detection; Control chart; Disease outbreak; Statistical process control; Syn-
dromic data.
1. INTRODUCTION
Biosurveillance is the practice of monitoring data to detect,
investigate, and respond to disease outbreaks. Traditional bio-
surveillance has focused on the collection and monitoring of
diagnostic medical and public health data retrospectively to de-
termine the existence of disease outbreaks. Examples of tra-
ditional data are cause-specific mortality rates and daily or
weekly counts of selected laboratory results. Although such
data are the most direct indicators of the current burden of a dis-
ease of interest, in most situations they are collected, delivered,
and analyzed days, weeks, or even months after the outbreak.
By the time this information reaches decision makers, it may
be too late to mount public health interventions that might avoid or
ameliorate early cases or to react in other ways, such as stockpiling
and dispensing vaccine and medication.
Disease surveillance research in the late 1990s shifted to-
ward biosurveillance systems that would provide early detec-
tion of diseases resulting either from bioterrorist attacks or
from “natural” causes, such as the avian flu. This shift meant
monitoring information sources not previously used at time
scales shortened from weeks or months to days or hours. Mod-
ern biosurveillance uses less specific aggregated healthcare-
seeking behavior data (also called syndromic data) from op-
portunistic sources in search of earlier outbreak signals. Syn-
dromic data are derived from prediagnostic information, such as
over-the-counter (OTC) and pharmacy medication sales, calls
to nurse hotlines, school absence records, searches on med-
ical Web sites, and complaints of individuals entering hospi-
tal emergency departments. None of these data directly mea-
sure the number of cases of any specific disease, but it is as-
sumed that they contain an outbreak signal earlier than that
of traditional sources, because they contain measurable effects
of care-seeking behavior before patients experience acute or
disease-specific symptoms. The underlying assumption is that
data collected from this early care-seeking behavior, such as
purchasing OTC remedies, will contain a sufficiently strong and
early signal of the outbreak when aggregated across the moni-
tored population. The various data sources fall along a contin-
uum according to both diagnostic specificity and likely detec-
tion timeliness. Under the assumption that people tend to self-
treat and self-medicate before rushing to the hospital, we would
expect Web searching and the purchasing of OTC remedies to
precede calls to nurse hotlines and ambulance dispatches, which would
in turn precede emergency department visits. Still, this entire continuum
is assumed to occur before actual clinical diagnoses can
be made (after hospitalization and/or laboratory tests). In addi-
tion to monitoring syndromic data, there have been efforts to
monitor other types of data associated with disease risk factors,
such as air and water quality measurements. All of these evi-
© 2010 American Statistical Association and
the American Society for Quality
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
DOI 10.1198/TECH.2010.06134