JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 6, Numbers 3/4, 1999
Mary Ann Liebert, Inc.
Pp. 361–368

Optimal Reconstruction of a Sequence from its Probes

ALAN M. FRIEZE,1 FRANCO P. PREPARATA,2 and ELI UPFAL3

ABSTRACT

An important combinatorial problem, motivated by DNA sequencing in molecular biology, is the reconstruction of a sequence over a small finite alphabet from the collection of its probes (the sequence spectrum), obtained by sliding a fixed sampling pattern over the sequence. Such construction is required for Sequencing-by-Hybridization (SBH), a novel DNA sequencing technique based on an array (SBH chip) of short nucleotide sequences (probes). Once the sequence spectrum is biochemically obtained, a combinatorial method is used to reconstruct the DNA sequence from its spectrum. Since technology limits the number of probes on the SBH chip, a challenging combinatorial question is the design of a smallest set of probes that can sequence an arbitrary DNA string of a given length. We present in this work a novel probe design, crucially based on the use of universal bases [bases that bind to any nucleotide (Loakes and Brown, 1994)] that drastically improves the performance of the SBH process and asymptotically approaches the information-theoretic bound up to a constant factor. Furthermore, the sequencing algorithm we propose is substantially simpler than the Eulerian path method used in previous solutions of this problem.

Key words: DNA sequencing, sequencing by hybridization, gapped probes, probabilistic analysis.

1. INTRODUCTION

The reconstruction of a sequence over a finite alphabet from the set of its subsequences, sampled according to a fixed pattern, is a challenging combinatorial problem, which has received considerable attention in recent years. A pattern can be defined as a binary sequence beginning and ending with a 1, which can be used as a “template” to sample a given sequence, called the target sequence.
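As an illustration of how a pattern samples a target sequence, the following minimal Python sketch slides a binary pattern over every position of complete overlap and keeps the symbols aligned with its 1-symbols. The function name `spectrum`, the encoding of the pattern as a string of "0" and "1" characters, and the choice to return a set (so that a repeated probe is recorded only once) are assumptions made for this illustration, not notation taken from the paper.

```python
def spectrum(target: str, pattern: str) -> set:
    """Return the set of probes obtained by sliding `pattern` over `target`.

    `pattern` is a string over {"0", "1"} that begins and ends with "1".
    Both names are hypothetical, chosen only for this sketch.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a nonempty binary string")
    if pattern[0] != "1" or pattern[-1] != "1":
        raise ValueError("pattern must begin and end with a 1")

    # Offsets of the 1-symbols: the positions that are actually sampled.
    sampled = [i for i, bit in enumerate(pattern) if bit == "1"]

    probes = set()
    # Visit every position where the pattern lies entirely inside the target.
    for start in range(len(target) - len(pattern) + 1):
        probes.add("".join(target[start + i] for i in sampled))
    return probes


# The pattern 101 samples the first and third symbol of each window.
print(sorted(spectrum("ACGTAC", "101")))  # ['AG', 'CT', 'GA', 'TC']
```

With the all-ones pattern 111 the same function returns the ordinary contiguous substrings of length 3, while any 0-symbol in the pattern marks a position that is skipped. A target shorter than the pattern has no position of complete overlap, so its spectrum in this sketch is empty.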
Specifically, the samples (probes) are obtained by sliding the pattern over all positions of complete overlap with the target sequence, and generating from each position the subsequence corresponding to the 1-symbols of the pattern. The resulting collection of probes is called the spectrum of the sequence, and the reconstruction task consists of deciding whether there is a unique sequence consistent with a spectrum and, if so, constructing it. Although interesting on a purely information-theoretic level, the motivation for this problem comes from molecular biology, specifically from the sequencing of DNA. In recent times a radically new technique,

1 Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh PA 15213, USA. af1p@andrew.cmu.edu. Supported in part by NSF grant CCR-9530974.
2 Computer Science Department, Brown University, Box 1910, Providence, RI 02912-1910, USA. franco@cs.brown.edu.
3 Computer Science Department, Brown University, Box 1910, Providence, RI 02912-1910, USA. eli@cs.brown.edu.