Modeling the Impact of Lifestyle on Health at Scale

Adam Sadilek
Dept. of Computer Science
University of Rochester
Rochester, NY, USA
sadilek@cs.rochester.edu

Henry Kautz
Dept. of Computer Science
University of Rochester
Rochester, NY, USA
kautz@cs.rochester.edu

ABSTRACT

Research in computational epidemiology to date has concentrated on estimating summary statistics of populations and simulated scenarios of disease outbreaks. Detailed studies have been limited to small domains, as scaling the methods involved poses considerable challenges. By contrast, we model the associations of a large collection of social and environmental factors with the health of particular individuals. Instead of relying on surveys, we apply scalable machine learning techniques to noisy data mined from online social media and infer the health state of any given person in an automated way. We show that the learned patterns can be subsequently leveraged in descriptive as well as predictive fine-grained models of human health. Using a unified statistical model, we quantify the impact of social status, exposure to pollution, interpersonal interactions, and other important lifestyle factors on one’s health. Our model explains more than 54% of the variance in people’s health (as estimated from their online communication), and predicts the future health status of individuals with 91% accuracy. Our methods complement traditional studies in life sciences, as they enable us to perform large-scale and timely measurement, inference, and prediction of previously elusive factors that affect our everyday lives.

Categories and Subject Descriptors
H.1.m [Information Systems]: Miscellaneous

General Terms
Algorithms, Experimentation, Human Factors

Keywords
Online social networks, machine learning, computational epidemiology, ubiquitous computing, geo-temporal modeling

WSDM’13, February 4–8, 2013, Rome, Italy. Copyright 2013 ACM 978-1-4503-1869-3/13/02 ...$15.00.

Figure 1: Visualization of the health and location of a sample of Twitter users in New York City. Sick people are colored red, whereas healthy individuals are green. Major pollution sources are highlighted in purple, and ZIP code boundaries are shown with white outlines. This paper explores to what extent online social media can be used to quantify and predict the impact of a large collection of environmental and lifestyle factors on our health. Our web application is available at http://fount.in.

1. INTRODUCTION

How does a new factory affect the health of residents in the city? How does your social status impact your health? Do visits to gyms decrease your susceptibility to communicable diseases? How about visits to bars, or riding the subway? Such questions are traditionally difficult and costly to answer at a population scale. Existing methods resort to surveys of individuals and medical providers, which require an extensive amount of human effort to complete, cost large amounts of money, and sample only a small fraction of people in a population.
By contrast, we apply machine learning techniques to Twitter data and automatically estimate the health state of any individual on the basis of his or her online communication. Throughout the text, we refer to