Evaluation of an algorithm of tagging SNPs selection by linkage disequilibrium

Nelson L.S. Tang a,*, Paul D.P. Pharoah b, Suk Ling Ma a, Douglas F. Easton c

a Department of Chemical Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, Hong Kong
b Department of Oncology, Strangeways Research Laboratories, Cambridge, UK
c Cancer Research U.K. Genetic Epidemiology Unit, Strangeways Research Laboratories, Cambridge, UK

Received 27 July 2005; received in revised form 30 October 2005; accepted 25 November 2005
Available online 19 January 2006

Abstract

Background: Single nucleotide polymorphisms (SNPs) are the most abundant kind of genetic polymorphism in the human genome. They are important both in genetic research and in genetic testing in clinical settings, for example in pharmacogenetics. To improve efficiency, tagging SNPs (tagSNPs) are selected in genes of interest to represent other correlated SNPs that are in linkage disequilibrium (LD) with the tagSNPs. Various algorithms have been proposed to identify a subset of SNPs as tagSNPs. Most tagSNP selection algorithms are haplotype-based, in which the spatial relationship between SNPs is considered. Recently, a more efficient cluster-based algorithm has been proposed, which clusters SNPs solely by an LD parameter such as r². Here, we evaluated the sample distribution of r² and its effect on cluster-based tagSNP selection.

Design and methods: Genotype data from 198 individuals within a 500-kb region on 5q31 were used to evaluate the sample distribution of r² and its effect on cluster-based tagSNP selection.

Results: The degree of variation of LD was found to depend on the LD structure of genes.

Conclusion: Because a cluster-based tagSNP selection algorithm does not take into account the spatial position of SNPs, a more stringent r² threshold is required to achieve more reliable tagSNP selection.

© 2005 The Canadian Society of Clinical Chemists.
All rights reserved.

Keywords: Tagging; SNPs; Genetic; LD metric; Haplotype; Linkage disequilibrium

Clinical Biochemistry 39 (2006) 240–243
* Corresponding author. Fax: +852 26322320/+852 26365090.
E-mail address: nelsontang@cuhk.edu.hk (N.L.S. Tang).
0009-9120/$ - see front matter
doi:10.1016/j.clinbiochem.2005.11.014

Introduction

Genetic factors play a strong role in susceptibility to common diseases. Under the common variants–common diseases hypothesis, genetic predisposition to common diseases such as diabetes and schizophrenia is due to genetic variants that are prevalent in the general population [1,2]. Disease phenotypes are the end results of the action of multiple disease-predisposing genetic variants, each of which confers a moderate increase in risk. In the past, familial linkage study was the only feasible genetic mapping approach because only sparse genetic markers were available across the genome. However, it is very difficult, if not impossible, to dissect the genetics of common diseases by linkage analysis.

After the completion of the Human Genome Project in 2003 [3,4] and, more recently, Phase 1 of the HapMap project [5], abundant genetic polymorphisms, mostly single nucleotide polymorphisms (SNPs), are now readily available for analysis. This also enables their clinical application in the diagnosis and risk prediction of disease. In addition, a similar approach makes possible the genetic study of inter-individual differences in drug response in the field of pharmacogenetics [6,7].

A genetic association study is used to identify the genetic loci responsible for susceptibility to common diseases. Determining a representative subset of SNPs will enhance the efficiency of genotyping in a genetic association study. Furthermore, after the disease-causing genes are identified, such a set of representative SNPs is also needed for clinical application. The set of SNPs that could be used to represent the