Aggregation of Location Attributes for Prediction of Infection Risk

Slobodan Vucetic and Hao Sun
Center for Information Science and Technology, 303 Wachman Hall, Temple University, 1805 N. Broad St., Philadelphia, PA 19122

Abstract. In this study we propose an algorithm for predicting infection risk that aggregates locations into types and uses them as prediction attributes. The algorithm is tested on a specific instance of EpiSims simulated data for Portland, OR. The results indicate that location aggregation is a very promising approach that can result in high prediction accuracy.

1. Introduction

Despite numerous advances in medicine, the risks associated with occurrences of well-known, modified, or novel pandemic diseases, such as H5N1 avian influenza in Southeast Asia [1], are among the largest threats facing the human race. Traditionally, key pandemic response elements include (i) surveillance, investigation, and protective health measures, (ii) vaccines and antiviral drugs, and (iii) health care and emergency response [1]. Several of these response actions directly motivate research and development in data mining. An open question is how modern data collection techniques and data mining could be integrated to better understand the spread of a new infection.

The recently developed EpiSims [2] simulation tool provides an excellent environment for developing and testing data mining techniques for pandemic response. The EpiSims team recently published a detailed simulated data set [3] corresponding to a particular instance of an infection outbreak in Portland, OR. The data consist of 5 data tables with detailed information about roughly 1.6 million people and 240 thousand locations in Portland, including people's movements, activities, and social contacts, as well as the infection spread for a particular simulation instance.
While highly detailed simulated data are useful for understanding the properties of various infections, an important question is what type of information we can expect in real-life situations and how such information can be used in response to an infection outbreak. This question motivated our study. Our assumption was that, using current technology while considering privacy issues, it could be possible to collect highly valuable information for disease response. For example, by using cell phone records, it could be possible to track people's movements quite accurately. In addition, by performing detailed surveys, information about the type of activities occurring at every location could be obtained. Our hypothesis is that, given this information, data mining can be very helpful in predicting which people are most at risk shortly after the outbreak of an infection.

To constrain the scope of our study, we concentrate on diseases that are transmitted only by human contact. In this case, co-location is a necessary condition for an infection to spread. The goal of our study was to explore whether infection risk could be predicted from people movement and location type data. Our approach is based on the assumption that the type of location is an important determinant of infection risk. For example, being in a big shopping mall at the same time as an infected person bears a smaller risk than being with that person in a coffee shop or an apartment. However, the properties of each location with respect to a specific disease are not known in advance. Additionally, in the initial stages of an infection, only a small fraction of locations has been visited by infected people. A successful prediction algorithm should be able to generalize from this limited number of locations. The proposed approach is based on aggregating locations into specific types and using these types as prediction attributes.
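To make the idea of location types as prediction attributes concrete, the following is a minimal sketch and not the algorithm evaluated in this paper. It assumes a hypothetical visit log of (person, location, hour) tuples, a hypothetical mapping from each location to a type label, and a set of people already known to be infected. For every non-infected person it counts co-locations with infected people, aggregated by location type. All names and the data schema are illustrative assumptions.

```python
from collections import defaultdict


def exposure_features(visits, location_type, infected):
    """Count co-locations with infected people per person and location type.

    visits: iterable of (person_id, location_id, hour) tuples (hypothetical schema)
    location_type: dict mapping location_id -> type label (hypothetical)
    infected: set of person_ids known to be infected
    """
    # Group the people present at each (location, hour) slot.
    present = defaultdict(set)
    for person, location, hour in visits:
        present[(location, hour)].add(person)

    # For every slot that contains at least one infected person, credit each
    # non-infected person there with one exposure per infected person,
    # recorded under the type of the location rather than the location itself.
    features = defaultdict(lambda: defaultdict(int))
    for (location, hour), people in present.items():
        sick = people & infected
        if not sick:
            continue
        kind = location_type[location]
        for person in people - infected:
            features[person][kind] += len(sick)
    return {person: dict(counts) for person, counts in features.items()}


# Toy data, invented purely for illustration.
visits = [
    ("a", "cafe1", 9), ("b", "cafe1", 9),
    ("c", "mall1", 9),
    ("a", "mall1", 10), ("c", "mall1", 10),
    ("d", "home1", 20),
]
location_type = {"cafe1": "coffee shop", "mall1": "shopping mall", "home1": "apartment"}
features = exposure_features(visits, location_type, infected={"a"})
```

Because exposures are recorded per location type rather than per individual location, a model trained on such attributes could in principle generalize to locations that no infected person has visited yet, which is the motivation given above.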
In this paper, we propose an automatic procedure for location aggregation that is based on the nature of the activities occurring at any given location.

2. Data Sets and Data Preprocessing

2.1. Data Sets

The original data [3] consist of 5 data tables that are the result of one simulation by the EpiSims model:

People (PortlandProtoPopulation). This table consists of basic information about 1.6 million inhabitants of Portland.

Locations (PortlandProtoLocations). Contains spatial coordinates of about 240 thousand locations in Portland.

Activities (PortlandActivities). Provides information