Malaria surveillance with multiple data sources using Gaussian process models

Martin Mubangizi*, Ricardo Andrade-Pacheco†, Michael Smith*, John A. Quinn*‡ and Neil Lawrence†
* Makerere University, Kampala, Uganda {mmubangizi,msmith,jquinn}@cit.ac.ug
† University of Sheffield, UK {acq11ra,N.Lawrence}@sheffield.ac.uk
‡ UN Global Pulse, Kampala, Uganda

Abstract: A statistical framework for monitoring the health of a population should ideally be able to combine data from a wide variety of sources, such as remote sensing, telecoms, and official health records, in a principled manner. Gaussian process regression is commonly used to visualise disease incidence by interpolating values across a map; in this article, we show how it can be extended to deal with many different types of information by introducing a flexible covariance structure across data sources. Combining many data sources in a single model provides a number of practical advantages, such as the ability to automatically determine the importance of each data source through likelihood optimisation, and to deal with missing values. We show the basic idea with an application of malaria density modeling across Uganda using administrative records and remote sensing vegetation index data, and then go on to describe further extensions such as the incorporation of human mobility data extracted from mobile phone call detail records (CDRs).

I. INTRODUCTION

Malaria remains endemic across much of the world, in spite of mitigation measures by both governments and international agencies. Health department intervention is now principally response-driven: at the times and locations with the greatest malaria infection rates, the provision of treatment needs to match the number of cases without stock-outs or staff shortages. Hence, planning stock and staff deployment depends on accurate and timely information regarding the distribution of malaria cases.

In Uganda, the Ministry of Health receives weekly counts of reported malaria cases from all districts. However, this data is compromised by non-reporting at both the district and health center levels [1]; the cases that are reported are often based on unverified diagnoses, and there are various other sources of measurement error.

In order to resolve ambiguity about how the disease burden is distributed, models can be constructed which relate infection levels across time and space, or which incorporate covariates that provide extra information. These covariates may be environmental (rainfall levels, temperature, vegetation strength) or social (population density, migration/movement patterns, demographics), for example. In this regard, the normalised difference vegetation index (NDVI), which is widely used to estimate vegetation density [2], turns out to be a good proxy for rainfall [3] and has proved useful in identifying suitable habitats for mosquito breeding [4].

Any attempt to use remote sensing data, such as NDVI, for carrying out inference on administrative records will face the problem of mixing two data sources with differing space and time resolutions. For example, while health management information system (HMIS) data is reported weekly and aggregated at the district level [f], NDVI is provided at a much higher resolution on a grid and is reported every 5 days.

Gaussian process regression is commonly used in epidemiology to interpolate disease counts across space. In this paper, we explain how it can be extended to a coregionalised form in order to incorporate information from covariates. By specifying a covariance structure relating a number of inputs and outputs, it is possible to combine several different types of data in a single, principled framework. We illustrate this model using weekly counts of malaria incidence by district in Uganda, and show that for certain regions, the incorporation of environmental remote sensing data can significantly improve the estimates of the infection rate compared to baseline models.
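
To make the idea of a coregionalised covariance concrete, the following is a minimal NumPy sketch of one common construction, the intrinsic coregionalisation model, in which the covariance between output i at input x and output j at input x' is B[i, j] * k(x, x'). It is an illustration under stated assumptions, not the implementation used in this paper: the toy time series, the rank-one matrix W, the vector kappa, the lengthscale, and the noise level are all hypothetical values chosen only so that the example runs. It shows how a weekly series and a 5-day series can share one covariance matrix even though they are observed at different times.

```python
import numpy as np


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def icm_kernel(x1, out1, x2, out2, B, lengthscale=1.0):
    """Coregionalised covariance: B[out1, out2] * k(x1, x2), elementwise."""
    return B[np.ix_(out1, out2)] * rbf(x1, x2, lengthscale)


def icm_posterior_mean(x, out, y, x_star, out_star, B,
                       lengthscale=1.0, noise=1e-2):
    """GP posterior mean at (x_star, out_star) given stacked observations."""
    K = icm_kernel(x, out, x, out, B, lengthscale) + noise * np.eye(len(x))
    K_star = icm_kernel(x_star, out_star, x, out, B, lengthscale)
    return K_star @ np.linalg.solve(K, y)


def toy_signal(t):
    """Hypothetical shared seasonal signal (illustration only)."""
    return np.sin(2.0 * np.pi * t / 60.0)


# Hypothetical toy data: output 0 plays the role of weekly case counts,
# output 1 the role of a vegetation index observed every 5 days.
t_cases = np.arange(0.0, 70.0, 7.0)
t_ndvi = np.arange(0.0, 70.0, 5.0)
y_cases = toy_signal(t_cases)
y_ndvi = 0.8 * toy_signal(t_ndvi)

# Stack both sources; `out` records which output each observation belongs to.
x = np.concatenate([t_cases, t_ndvi])
out = np.concatenate([np.zeros(len(t_cases), dtype=int),
                      np.ones(len(t_ndvi), dtype=int)])
y = np.concatenate([y_cases, y_ndvi])

# Coregionalisation matrix B = W W^T + diag(kappa); values are assumptions.
W = np.array([[1.0], [0.8]])
kappa = np.array([0.05, 0.05])
B = W @ W.T + np.diag(kappa)

# Predict output 0 at days that fall between its weekly observations.
t_star = np.array([3.0, 10.0, 33.0])
out_star = np.zeros(len(t_star), dtype=int)
mean_cases = icm_posterior_mean(x, out, y, t_star, out_star, B,
                                lengthscale=10.0)
```

In this sketch the off-diagonal entry of B controls how strongly the denser series informs predictions for the sparser one; in a fitted model that entry would be learned by likelihood optimisation rather than fixed by hand.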
We then describe how social data can be incorporated, in particular information about movements of the population derived from mobile phone call detail records.

This paper is organised as follows. Section II discusses some of the related work; Section III presents the data used and introduces the model framework. Application of the model to environmental covariates is discussed in Section IV, and Section V discusses the use of mobility data. We conclude, with suggestions for future work, in Section VI.

II. RELATED WORK

The use of data from multiple sources to enhance disease modeling has been an active research area [5], [6], and the challenges that this research has faced are described in [7]. This need has also led to a search for new data sources that may provide signals of changes in disease rates, including absenteeism [8], sales of over-the-counter health products [9], emergency call centers [10], and automatic malaria diagnosis results [11]. Research focused on using multiple data sources, such as [6], acknowledges the need for data from multiple sources in biosurveillance. BioPHusion [5], for instance, is a framework that can use real-time data from several sources for awareness and timely response.

It is widely understood that determining the geographical distribution of a disease is vital in its control [12] and in estimating the cost of that control [13]. To this end, considerable effort has gone into producing risk maps of diseases at different

[f] HMIS data might be available at smaller aggregation levels; however, the information available to the authors was aggregated at the district level.

1st