Web search activity data accurately predict population chronic disease risk in the USA

Thin Nguyen,1 Truyen Tran,1 Wei Luo,1 Sunil Gupta,1 Santu Rana,1 Dinh Phung,1 Melanie Nichols,2 Lynne Millar,2 Svetha Venkatesh,1 Steve Allender2

▸ Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/jech-2014-204523).

1 Centre for Pattern Recognition and Data Analytics, School of Information Technology, Deakin University, Geelong, Victoria, Australia
2 World Health Organization Collaborating Centre for Obesity Prevention, Deakin University, Geelong, Victoria, Australia

Correspondence to Dr Steven Allender, World Health Organization Collaborating Centre D1.101, Deakin University, Waterfront Campus, 1 Gheringhap Street, Geelong, VIC 3220, Australia; steven.allender@deakin.edu.au

Received 29 August 2014
Revised 12 December 2014
Accepted 26 January 2015

To cite: Nguyen T, Tran T, Luo W, et al. J Epidemiol Community Health Published Online First: [please include Day Month Year] doi:10.1136/jech-2014-204523

ABSTRACT

Background The WHO framework for non-communicable disease (NCD) describes risks and outcomes comprising the majority of the global burden of disease. These factors are complex and interact at biological, behavioural, environmental and policy levels, presenting challenges for population monitoring and intervention evaluation. This paper explores the utility of machine learning methods applied to population-level web search activity behaviour as a proxy for chronic disease risk factors.

Methods Web activity output for each element of the WHO’s Causes of NCD framework was used as a basis for identifying relevant web search activity from 2004 to 2013 for the USA. Multiple linear regression models with regularisation were used to generate predictive algorithms, mapping web search activity to Centers for Disease Control and Prevention (CDC) measured risk factor/disease prevalence.
Predictions for subsequent target years not included in the model derivation were tested against CDC data from population surveys using Pearson correlation and Spearman’s r.

Results For 2011 and 2012, predicted prevalence was very strongly correlated with measured risk data ranging from fruits and vegetables consumed (r=0.81; 95% CI 0.68 to 0.89) to alcohol consumption (r=0.96; 95% CI 0.93 to 0.98). Mean difference between predicted and measured differences by State ranged from 0.03 to 2.16. Spearman’s r for state-wise predicted versus measured prevalence varied from 0.82 to 0.93.

Conclusions The high predictive validity of web search activity for NCD risk has potential to provide real-time information on population risk during policy implementation and other population-level NCD prevention efforts.

INTRODUCTION AND BACKGROUND

Non-communicable diseases (NCD) such as heart disease, stroke and cancer account for the major burden of disease globally. 1 The WHO provides a framework for understanding risks for these leading NCDs, 2 which includes intermediate risk (raised blood sugar, raised blood pressure, dyslipidaemia, overweight and obesity, and abnormal lung function), behavioural and environmental factors (unhealthy diet, physical inactivity, tobacco and alcohol use, air pollution, age and heredity) and underlying determinants (globalisation, urbanisation, population ageing and social determinants). This short list begins to hint at the complexity of the interaction of these risk factors, and, more recently, large scale mapping exercises such as the Foresight map of the obesity system 3 have begun to unpick the underlying complexity driving these diseases.

The complexity of NCD risk provides challenges for the design and intervention of large-scale population prevention efforts, not least of which is the provision of data to understand the effectiveness of interventions in real time.
Most interventions propose a logic model targeting one or several of the elements of the WHO NCD Framework. Many have difficulty in evaluating these effects due to the cost and difficulty of collecting these data and the limited utility of the time lags inherent in current data collection methods for measuring population risk. 4

There may be lessons from other fields that can provide insight into the problems of inadequate population data to understand trends in NCD risk and outcomes. Notable among these is the population monitoring of behavioural trends in other fields using online web access data as a proxy for actual population behaviour. 5 Such an approach involves collecting trend data from online searches related to behaviours and assessing their utility in predicting population behaviour trends tested against measured data.

The use of web access data using ‘Google Trends’ has proved to have some utility in predicting short-term values of a range of economic indicators, 5 notably automobile sales, unemployment claims and consumer confidence. 6 The approach has also been used to predict consumer behaviours around purchasing of movies, video games and songs. 6

Recent applications in public health include comparisons of measured epidemic data related to an outbreak of Salmonella enterica in the USA in 2009 with a search volume of associated conditions (eg, ‘diarrhea’, ‘peanut butter’, ‘food poisoning’, etc). 7 The authors demonstrated strong associations between the epidemic curve and case incidence and point to the potential for using search terms in surveillance of other such outbreaks. 8 Others have demonstrated strong correlations between Google search trends and measured data for outbreaks of influenza (r=0.9), gastroenteritis and chickenpox. 9 For influenza, one study was able to accurately estimate influenza activity for each state of the USA with a lag of 1 day.
10 11 Work in Australia and the European Union 12 13 has had mixed success, reporting correlations between 0.39 and 0.94 between sentinel site physician-based case ascertainment and Google Flu Trends. During a recent pandemic, Valdivia et al 13 found a strong correlation between Google Flu Trends and sentinel network systems in several European countries. Additional applications include the use of Web search logs to detect adverse events of drugs. 14

Nguyen T, et al. J Epidemiol Community Health 2015;0:1–7. doi:10.1136/jech-2014-204523

Research report

JECH Online First, published on March 24, 2015 as 10.1136/jech-2014-204523

Copyright Article author (or their employer) 2015. Produced by BMJ Publishing Group Ltd under licence.
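The abstract describes fitting regularised linear regression models to web search activity and testing held-out predictions against measured prevalence using Pearson correlation and Spearman’s r. The following is a minimal sketch of that kind of workflow, not the authors’ implementation. It assumes a ridge (L2) penalty, since the text says only "regularisation", and it uses synthetic data: the number of states (51), the number of search-term features (8), the penalty strength `alpha` and all function names are hypothetical. The 95% CI uses the Fisher z-transformation, which is one common choice and is likewise an assumption here.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman_r(a, b):
    """Spearman's r as the Pearson correlation of ranks (assumes no ties)."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return pearson_r(rank(a), rank(b))

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z-transformation."""
    z = np.arctanh(r)
    half_width = z_crit / np.sqrt(n - 3)
    return float(np.tanh(z - half_width)), float(np.tanh(z + half_width))

# Synthetic stand-in data (hypothetical): rows are states, columns are
# search-volume features; y is a prevalence-like outcome.
rng = np.random.default_rng(0)
n_states, n_terms = 51, 8
true_w = rng.normal(size=n_terms)
X_train = rng.normal(size=(n_states, n_terms))
y_train = X_train @ true_w + 25.0 + rng.normal(scale=0.5, size=n_states)
X_test = rng.normal(size=(n_states, n_terms))   # a later, held-out year
y_test = X_test @ true_w + 25.0 + rng.normal(scale=0.5, size=n_states)

# Fit on the derivation data, then predict the held-out year.
w, b = fit_ridge(X_train, y_train, alpha=1.0)
y_pred = X_test @ w + b

# Compare predicted and "measured" values state by state.
r = pearson_r(y_pred, y_test)
ci_low, ci_high = fisher_ci(r, n_states)
rho = spearman_r(y_pred, y_test)
mean_abs_diff = float(np.mean(np.abs(y_pred - y_test)))

print(f"Pearson r={r:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Spearman r={rho:.2f}, mean absolute difference={mean_abs_diff:.2f}")
```

In a real analysis the penalty strength would normally be chosen by cross-validation on the derivation years, and the synthetic arrays would be replaced by state-level search volumes and survey prevalence estimates.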