Hamming Distance as a Concept in DNA Molecular Recognition
Mina Mohammadi-Kambs,* Kathrin Hölz, Mark M. Somoza, and Albrecht Ott
Biological Experimental Physics, Saarland University, Campus B2.1, 66123 Saarbrücken, Germany
Institute of Inorganic Chemistry, Faculty of Chemistry, University of Vienna, Althanstraße 14 (UZA II), 1090 Vienna, Austria
Supporting Information

ABSTRACT: DNA microarrays constitute an in vitro example system of a highly crowded molecular recognition environment. Although they are widely used in biology, some of the basic mechanisms of DNA hybridization remain poorly understood. On a microarray, cross-hybridization arises from similarities between sequences and may introduce errors during the transmission of information. We experimentally determine an appropriate distance, called the minimum Hamming distance, by which the sequences of a set must differ. By applying an algorithm based on a graph-theoretical method, we find large orthogonal sets of sequences that are sufficiently different not to exhibit any cross-hybridization. To create such a set, we first derive an analytical solution for the number of sequences of a given length that include at least four guanines in a row and eliminate them from the list of candidate sequences. We experimentally confirm the orthogonality of the largest possible set, which has a size of 23 for a sequence length of 7. We anticipate our work to be a starting point toward the study of signal propagation in highly competitive environments, besides its obvious application in DNA high-throughput experiments.

INTRODUCTION
Molecular recognition in the crowded environment of DNA microarrays plays an important role in processing information. Recognition often requires the discrimination of one specific molecule among many similar, competing molecules. In 1894, Emil Fischer proposed the lock and key model to describe the recognition of an enzyme and a substrate.
1 According to this model, the substrate possesses the perfect size and shape to fit the active site of its complement. However, in crowded environments, binding between noncomplementary molecules may occur and introduce errors. For DNA, specific binding of two single strands, that is, the formation of a stable double helix, occurs only if the bases A and T as well as C and G pair along the sequence.

DNA microarrays are a widely used platform that, besides many applications in medicine and biology, enables the study of the fundamentals of DNA hybridization. 2-10 These microarrays consist of single-stranded DNA oligonucleotides immobilized on a surface (probes). If these probes are exposed to a bulk mixture of fluorescently labeled target sequences, only complementary targets are expected to hybridize. However, hybridization of a probe to a noncomplementary target still occurs, albeit with a lower binding affinity than for the corresponding perfectly matching sequence. Therefore, similarities among probes can lead to a significant amount of nonspecific cross-hybridization. On a DNA microarray with complex target mixtures, imperfect recognition introduces noise and makes results difficult to interpret.

The kinetics of hybridization in the presence of competitors and the importance of cross-hybridization for the quantitative interpretation of microarray data have been intensely studied, 11-13 especially for the purpose of single nucleotide polymorphism detection and the accurate assessment of gene expression levels. 14-17 One strategy to avoid cross-hybridization is to construct sets of probes with minimized pairwise competition so that they do not cross-hybridize. Such probes are often referred to as orthogonal. Previous theoretical research 18-24 developed different strategies to find sets of orthogonal sequences.
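The base-pairing rule described above (A with T, C with G) can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `is_perfect_match` are hypothetical and do not come from the paper, and the sketch assumes the standard convention that both strands are written 5' to 3' and pair in antiparallel orientation. It only tests for a perfect match and says nothing about the binding affinity of mismatched duplexes.

```python
# Watson-Crick pairing partners for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs perfectly with `seq`.

    Assumption: both strands are written 5' to 3', so the partner
    strand is read in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_perfect_match(probe: str, target: str) -> bool:
    """True if every base of `target` pairs with `probe` (no mismatches)."""
    return target == reverse_complement(probe)


if __name__ == "__main__":
    probe = "ACGTACG"  # illustrative 7-mer, not taken from the paper
    print(reverse_complement(probe))
    print(is_perfect_match(probe, reverse_complement(probe)))
```

Any target that fails this check is noncomplementary in the sense used above, although it may still hybridize to the probe with lower affinity.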
The most intuitive approach to deciding which sequences cross-hybridize is based on the free energy difference between perfectly matched and mismatched hybridization. 25 However, free energy estimates have led to poor predictions of hybridization intensities on microarrays. 26 In this work, we apply a well-known local search algorithm and implement graph-theoretical methods to find such orthogonal sets. Following the concept of Hamming distance from coding theory, we consider that two sequences do not cross-hybridize if they differ by at least a certain number of bases. This threshold is called the minimum Hamming distance d. 27 We determine a suitable d experimentally. One of the fundamental problems in coding theory is finding the maximum size of a code, where a code is a set of codewords of length L with minimum Hamming distance d. 28 In analogy, here, we

Received: January 14, 2017. Accepted: March 17, 2017. Published: April 5, 2017.
ACS Omega 2017, 2, 1302-1308. DOI: 10.1021/acsomega.7b00053. http://pubs.acs.org/journal/acsodf
© 2017 American Chemical Society. This is an open access article published under a Creative Commons Attribution (CC-BY) License, which permits unrestricted use, distribution and reproduction in any medium, provided the author and source are cited.
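The Hamming distance criterion introduced above can be made concrete with a short Python sketch. The helper names (`hamming_distance`, `min_pairwise_distance`, `satisfies_min_distance`) and the example sequences are hypothetical and chosen only for illustration; this is not the authors' search algorithm, only a check of whether a given set of equal-length sequences meets a minimum distance d.

```python
from itertools import combinations
from typing import Sequence


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(seqs: Sequence[str]) -> int:
    """Smallest Hamming distance over all pairs in the set (needs >= 2 sequences)."""
    return min(hamming_distance(a, b) for a, b in combinations(seqs, 2))


def satisfies_min_distance(seqs: Sequence[str], d: int) -> bool:
    """True if every pair of sequences differs in at least d positions."""
    return min_pairwise_distance(seqs) >= d


if __name__ == "__main__":
    # Illustrative 7-mers, not taken from the paper.
    probes = ["ACGTACG", "ACGTTCG", "TGCAACG"]
    print(min_pairwise_distance(probes))
    print(satisfies_min_distance(probes, 2))
```

In this picture, a set that passes the check for the experimentally chosen d plays the role of a code of length L with minimum Hamming distance d. Finding the largest such set is the harder search problem that the text refers to.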