Statistical Challenges Facing Early Outbreak Detection in Biosurveillance

Galit SHMUELI
Department of Decision, Operations & Information Technologies and The Center for Health Information and Decision Systems
Robert H. Smith School of Business
University of Maryland
College Park, MD 20742
(gshmueli@rhsmith.umd.edu)

Howard BURKOM
The Johns Hopkins University Applied Physics Laboratory
Laurel, MD 20723
(Howard.Burkom@jhuapl.edu)

Modern biosurveillance is the monitoring of a wide range of prediagnostic and diagnostic data for the purpose of enhancing the ability of the public health infrastructure to detect, investigate, and respond to disease outbreaks. Statistical control charts have been a central tool in classic disease surveillance and also have migrated into modern biosurveillance; however, the new types of data monitored, the processes underlying the time series derived from these data, and the application context all deviate from the industrial setting for which these tools were originally designed. Assumptions of normality, independence, and stationarity are typically violated in syndromic time series. Target values of process parameters are time-dependent and hard to define, and data labeling is ambiguous in the sense that outbreak periods are not clearly defined or known. Additional challenges include multiplicity in several dimensions, performance evaluation, and practical system usage and requirements. Our focus is mainly on the monitoring of time series to provide early alerts of anomalies to stimulate investigation of potential outbreaks, with a brief summary of methods to detect significant spatial and spatiotemporal case clusters. We discuss the statistical challenges in monitoring modern biosurveillance data, describe the current state of monitoring in the field, and survey the most recent biosurveillance literature.

KEY WORDS: Anomaly detection; Control chart; Disease outbreak; Statistical process control; Syndromic data.

1. INTRODUCTION

Biosurveillance is the practice of monitoring data to detect, investigate, and respond to disease outbreaks. Traditional biosurveillance has focused on the collection and monitoring of diagnostic medical and public health data retrospectively to determine the existence of disease outbreaks. Examples of traditional data are cause-specific mortality rates and daily or weekly counts of selected laboratory results. Although such data are the most direct indicators of the current burden of a disease of interest, in most situations they are collected, delivered, and analyzed days, weeks, or even months after the outbreak. By the time this information reaches decision makers, it may be too late for public health interventions that might avoid or ameliorate early cases or to react in other ways, such as stockpiling and dispensing vaccine and medication.

Disease surveillance research in the late 1990s shifted toward biosurveillance systems that would provide early detection of diseases resulting either from bioterrorist attacks or from “natural” causes, such as the avian flu. This shift meant monitoring information sources not previously used at time scales shortened from weeks or months to days or hours. Modern biosurveillance uses less specific aggregated healthcare-seeking behavior data (also called syndromic data) from opportunistic sources in search of earlier outbreak signals. Syndromic data are derived from prediagnostic information, such as over-the-counter (OTC) and pharmacy medication sales, calls to nurse hotlines, school absence records, searches on medical Web sites, and complaints of individuals entering hospital emergency departments.
None of these data directly measure the number of cases of any specific disease, but it is assumed that they contain an outbreak signal earlier than that of traditional sources, because they contain measurable effects of care-seeking behavior before patients experience acute or disease-specific symptoms. The underlying assumption is that data collected from this early care-seeking behavior, such as purchasing OTC remedies, will contain a sufficiently strong and early signal of the outbreak when aggregated across the monitored population. The various data sources fall along a continuum according to both diagnostic specificity and likely detection timeliness. Under the assumption that people tend to self-treat and self-medicate before rushing to the hospital, we would expect Web searching and the purchasing of OTC remedies to precede calls to nurse hotlines and ambulance dispatches, which would in turn be followed by emergency department visits. Still, this entire continuum is assumed to occur before actual clinical diagnoses can be made (after hospitalization and/or laboratory tests). In addition to monitoring syndromic data, there have been efforts to monitor other types of data associated with disease risk factors, such as air and water quality measurements.

© 2010 American Statistical Association and the American Society for Quality. TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1. DOI 10.1198/TECH.2010.06134

All of these evi-