Secure outsourcing of DNA sequences comparisons in a Grid environment

RACHEL AKIMANA, OLIVIER MARKOWITCH and YVES ROGGEMAN
Département d'Informatique
Université Libre de Bruxelles
Bd. du Triomphe – CP212, 1050 Bruxelles
BELGIUM

Abstract: Computing and data Grids are widely distributed computing systems typically used to solve scientific or technical problems that require large amounts of computing power and/or storage resources. To be truly attractive, Grids must provide secure environments (in terms of confidentiality, data integrity, entity identification, etc.). In this paper, we consider the confidentiality aspects of Grid applications related to string matching. We take as an example the area of genetic biology and, more precisely, the search for DNA similarities. Since DNA sequence comparisons are computationally intensive and sensitive, we propose a model for searching DNA similarities in a public DNA database on the Grid. The model is related to the private approximate string matching problem, in which neither the inputs nor the outputs of the comparisons are revealed. We analyze the performance of our proposed DNA disguising method by taking into account how the edit distances between the client's queries and their corresponding disguises are distributed along the DNA sequences. To relieve the client's load in the initially proposed model, we also propose an extension in which the client's load is executed by a third, untrusted server.

Key-Words: Grid systems, Secure outsourcing, Secure approximate matching

1 Introduction

Computing and data Grids are widely distributed computing systems usually used to solve scientific or technical problems that require a large amount of computing power and/or storage resources.
Since many different users share a Grid's resources, the risk of eavesdropping on data and information that are stored or processed on Grid resources, or that travel over the Grid's network, cannot be disregarded.

Large amounts of data are stored on Grid resources, and some of them may be related to private individual information (e.g. medical data, biological data, genetic data, etc.). In this case, confidentiality issues and the protection of users' privacy must be studied carefully. Moreover, confidentiality mechanisms for sensitive data have to be adapted to Grid specificities. For example, we have to take into account the fact that data may be stored or processed on a remote and possibly untrusted Grid node. In this work, the word data is taken in a broad sense that includes data resulting from simulations and experiments that are organized in databases on the Grid, as well as the executable code of jobs to be processed on the Grid. We will show that existing solutions for confidentiality in a Grid system, such as SSL, ensure the confidentiality of data during their transport phase but do not guarantee the confidentiality of sensitive computations during their execution.

We focus on the confidentiality aspects of genetic applications on the Grid and, more precisely, on the search for DNA similarities in DNA sequences stored in Grid databases. Such databases may be used in the elucidation of crimes, the establishment of DNA similarities for paternity tests, the determination of genetic diseases, etc. DNA sequence comparisons are expensive computations, since one DNA sequence may contain thousands to millions of nucleotides. Such comparisons therefore need powerful computing resources, and Grids are an appropriate environment for them.
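To make the cost of such comparisons concrete, the following is a minimal sketch of an edit distance computation between two DNA strings. It assumes the unit-cost Levenshtein distance (insertions, deletions and substitutions each cost 1) computed by the classical dynamic programming recurrence; the function name `edit_distance` is illustrative only, and the distance measure and algorithm actually used in the model may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between strings a and b.

    Assumption: insertions, deletions and substitutions all cost 1.
    Only two rows of the dynamic programming table are kept in memory.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))

    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGGT"))
```

For two sequences of n and m nucleotides, this recurrence fills (n + 1)(m + 1) table cells, so comparing two sequences of a million nucleotides each requires on the order of 10^12 cell updates.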
A remote DNA sequence comparison mechanism may be a sensitive computation, in the sense that we may have to ensure that the DNA sequences are not subjected to unauthorized tests whose outcome could have serious consequences [3] (such as jeopardizing an individual's insurability or employability).

On the basis of these security and computing power requirements, we propose a disguise model for searching DNA similarities in a public DNA database on the Grid in such a way that neither the inputs nor the outputs of the comparisons are revealed to the computing node. This work is related to problems of Private Information Matching with a public database [7], where a client searches for similarities to a given item in a public