Enabling Dynamic Linkage of Linguistic Census Data at Statistics Canada

Arnaud Casteigts, Marie-Hélène Chomienne, Louise Bouchard, Guy-Vincent Jourdan
University of Ottawa, Canada
{acasteig,mh.chomienne,louise.bouchard,gjourdan}@uottawa.ca

Context

Research in population health studies the impact of various factors (determinants) on health, with the long-term objective of yielding better policies, programs, and services. Researchers working on Official Language Minority Communities (OLMCs) focus specifically on determinants related to speaking a minority language, such as English in Quebec or French in the rest of Canada. Investigations of this type require the ability to associate health data with linguistic information. Unfortunately, the largest health databases in Ontario, held at the Institute for Clinical Evaluative Sciences (ICES), do not contain linguistic variables. High-quality language variables do exist, however, at Statistics Canada (SC) through the 2006 Census.

Purpose

We are interested in enabling a form of linkage between ICES health data and SC linguistic data that could be automated and yet proven safe. To this end, we conjecture that most OLMC-related questions can be reformulated as a counting problem over given samples of patients (e.g., counting how many are Francophones), which reduces the complexity of a query to its essential minimum. Two solutions are proposed based on this principle. The first assumes a particular dataflow that preserves privacy by means of a tripartite interaction; the second discards the need for such an assumption by adding a random perturbation to the answer, which makes the collection of residual information almost impossible (we characterize the worst-case leakage precisely).
Based on these results, we argue that a safe exposure of linguistic data is possible and, beyond that, that similar techniques could be used to enrich provincial health databases with many other census variables.

Solution 1: Tripartite interaction

This solution consists of a circular workflow between the three entities involved: OLMC researchers, ICES, and SC. The workflow is initiated by the researcher, who submits health criteria to ICES. A representative sample of individuals matching these criteria is then generated and sent to SC, which performs the count query. The result of the query is finally returned to the researcher.

[Figure: circular workflow between Researchers, ICES, and Stats Can. (1) Health criteria (e.g., angioplasty) go from the researchers to ICES; (2) a representative sample of individuals goes from ICES to Stats Can.; (3) the answer (% Francophones) goes from Stats Can. to the researchers. Answer + normalization over the total ratio of Francophones in Ontario ⇒ final answer.]

Example: what is the average angioplasty rate among Francophones (vs. Anglophones)? Researchers submit the criteria to ICES and get an answer from SC. By normalizing this answer over the global ratio of Francophones in Ontario, one can answer the initial question.

Privacy in this mechanism stems from the fact that i) researchers do not know the sample details, ii) SC does not know the health criteria that were used to generate that sample, and iii) ICES does not know the final answer. Therefore, none of the entities can learn something it is not supposed to know.

Drawback: This scheme assumes that no additional exchange occurs between the entities. In particular, the distinction between ICES and the researchers may be debatable from the point of view of SC, which generally considers anything from outside as one single entity.

[Figure: the outside world sends a sample to Stats Can., which returns the number of Francophones.]

Here, a malicious use of count queries, if unsupervised, could make it possible to identify the language of a given individual (say, Madame x).
Consider the following attack: make a query with a sample s1 that does not contain x, then make a second query with the same sample plus Madame x (sample s2).

[Figure: sample s1, and sample s2 consisting of s1 plus x.]

Obviously, if answer(s2) > answer(s1), then Madame x is Francophone. A malicious adversary could actually build more complex constructs with similar effects.

Solution 2: Noised count queries

One way to prevent (or strongly limit) the collection of residual information over several queries is to perturb the answers, i.e., to draw a random number, positive or negative, and add it to the exact value.

[Figure: the outside world sends a sample to Stats Can., which returns the number of Francophones plus a random perturbation; plot of the probability of each answer around the exact value.]

It is intuitively clear that no residual information can be collected with certainty, since the same answer could be induced by different counts. The most an adversary can learn is probabilities (one would say beliefs in this case) that given individuals are Francophones or Anglophones. The amount of this belief leakage depends on how different the answer probabilities are among several samples (neighbor samples such as s1 and s2 represent a worst-case scenario). This difference depends in turn on the shape and magnitude of the noise (which we assume to be Laplacian).

[Figure: left, answer probabilities for s1 and s2 at a given answer; right, equation relating the risk parameter, the type of query, the maximal number of queries, and the noise magnitude.]

Recent works in the field of private data analysis (e.g., [1]) made it possible to understand the exact tradeoff between leakage and utility in a worst-case scenario (right picture above). Following these results and others from the same co-authors, we characterized the particular tradeoff at play in our case, i.e., queries counting the number of Francophones in the samples. Given a desired maximal belief about one's language (typically chosen by the database holder) and a maximal perturbation that OLMC researchers can tolerate for each answer, the tradeoff tells us how many queries can be made.
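
The differencing attack and the effect of Laplace noise can be illustrated with a minimal Python sketch. All names (`exact_count`, `noised_count`, `max_queries`, the toy identifiers) are hypothetical and chosen for illustration only. The bound exp(k/b) used below is the standard worst-case likelihood ratio for k counting queries under Laplace noise of scale b, following [1]; it is an assumption standing in for the tradeoff equation of this work, which may be parameterized differently.

```python
import math
import random


def exact_count(sample, francophones):
    """Exact count query: number of Francophones in the sample."""
    return sum(1 for person in sample if person in francophones)


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale): the difference of two independent
    exponentials with mean `scale` follows that distribution."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noised_count(sample, francophones, scale, rng):
    """Count query perturbed by Laplace noise of the given scale."""
    return exact_count(sample, francophones) + laplace_noise(scale, rng)


def laplace_pdf(answer, true_count, scale):
    """Probability density of observing `answer` given the exact count."""
    return math.exp(-abs(answer - true_count) / scale) / (2.0 * scale)


def max_likelihood_ratio(scale, num_queries):
    """Worst-case ratio between the answer probabilities of two neighbor
    samples (counts differing by 1) over `num_queries` queries.
    Assumption: standard Laplace-mechanism bound from [1]."""
    return math.exp(num_queries / scale)


def max_queries(scale, ratio_bound):
    """Largest number of queries keeping the worst-case ratio under
    `ratio_bound` (hypothetical risk parameter, must be >= 1)."""
    return math.floor(scale * math.log(ratio_bound))


# Toy data (hypothetical): "x" plays the role of Madame x.
francophones = {"a", "c", "x"}
s1 = ["a", "b", "c", "d"]
s2 = s1 + ["x"]

# Exact queries: the difference reveals that x is Francophone.
leak = exact_count(s2, francophones) - exact_count(s1, francophones)

# Noised queries: the difference is dominated by the perturbation,
# so a single pair of answers no longer identifies x with certainty.
rng = random.Random(0)
noisy_diff = (noised_count(s2, francophones, 10.0, rng)
              - noised_count(s1, francophones, 10.0, rng))
```

Under these assumptions, a larger noise scale allows more queries for the same risk bound, at the cost of a larger perturbation on each answer, which is the tension the tradeoff describes.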
[Figure: Laplace noise densities Lap(5), Lap(10), Lap(20), and Lap(30), plotted over perturbation values from -100 to 100.]

Conclusion

Both mechanisms enable the linkage of provincial health data with federal census data. Within the scope of our concern, they make it possible to supplement Ontario health data with linguistic variables. However, we believe this approach is more general and could apply to variables other than language and to provinces other than Ontario. The road to realization may be long, but this initial work demonstrates that technical solutions do exist and deserve to be explored.

Funding

RRASFO (Réseau de Recherche Appliquée sur la Santé des Francophones de l'Ontario). IRHM (Institut de Recherche de l'Hôpital Montfort).

Reference

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. Theory of Cryptography (TCC'06), pages 265–284, 2006.