Research paper

Analysis of the genetic structure of the Malay population: Ancestry-informative marker SNPs in the Malay of Peninsular Malaysia

Padillah Yahya a, Sarina Sulong b, Azian Harun c, Hatin Wan Isa b, Nur-Shafawati Ab Rajab b, Pongsakorn Wangkumhang d, Alisa Wilantho d, Chumpol Ngamphiw d, Sissades Tongsima d, Zilfalil Alwi a,*

a Department of Paediatric, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, 16150 Kelantan, Malaysia
b Human Genome Centre, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, 16150 Kelantan, Malaysia
c Department of Medical Microbiology and Parasitology, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, 16150 Kelantan, Malaysia
d National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Pathum Thani 12120, Thailand

ARTICLE INFO

Article history:
Received 3 January 2017
Received in revised form 23 June 2017
Accepted 10 July 2017
Available online 14 July 2017

Keywords:
SNP
Ancestry-informative marker
Genetics structure
Malay
Population
Malaysia

ABSTRACT

Malay, the main ethnic group in Peninsular Malaysia, is represented by various sub-ethnic groups such as Melayu Banjar, Melayu Bugis, Melayu Champa, Melayu Java, Melayu Kedah, Melayu Kelantan, Melayu Minang and Melayu Patani. Using data retrieved from the MyHVP (Malaysian Human Variome Project) database, a total of 135 individuals from these sub-ethnic groups were profiled using the Affymetrix GeneChip Mapping Xba 50-K single nucleotide polymorphism (SNP) array to identify SNPs that were ancestry-informative markers (AIMs) for Malays of Peninsular Malaysia. Prior to selecting the AIMs, the genetic structure of Malays was explored with reference to 11 other populations obtained from the Pan-Asian SNP Consortium database using principal component analysis (PCA) and ADMIXTURE.
Iterative pruning principal component analysis (ipPCA) was further used to identify sub-groups of Malays. Subsequently, we constructed an AIMs panel for Malays using the informativeness for assignment (I_n) of genetic markers, and the K-nearest neighbor classifier (KNN) was used to train the classification models. A model of 250 SNPs ranked by I_n correctly classified Malay individuals with an accuracy of up to 90%. The identified panel of SNPs could be utilized as a panel of AIMs to ascertain the specific ancestry of Malays, which may be useful in disease association studies, biomedical research or forensic investigation purposes.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Ancestry-informative markers (AIMs) are a set of genetic biomarkers whose frequencies clearly differ among populations. AIMs can be used to infer ancestry for an individual, which can be advantageous for several applications, including forensics, genetic ancestry testing and stratification correction in genome-wide association studies (GWAS) [1–8]. In GWAS, AIMs can be used to identify subpopulations prior to association mapping to minimize spurious results [9]. Markers that have markedly different allele frequencies among groups of individuals are generally good candidates for an AIM panel [10].

The majority of AIMs research has concentrated on haploid markers, mainly mitochondrial DNA and Y-chromosomal DNA polymorphisms, particularly at the continental level [11–14]. However, these maternal and paternal informative markers are limited in the genetic information they carry, especially for detecting recent admixtures [15]. With the advent of parallel genotyping platforms, considerable AIMs research has been performed using autosomal DNA single nucleotide polymorphisms (SNPs) [4,8,16–20]. Several techniques have been popularly used to identify AIMs from SNP genotyping data.
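To make the idea of scoring markers by their allele-frequency differences concrete, the following is a minimal sketch of how a per-SNP informativeness for assignment (I_n) could be computed and used to rank markers. It assumes the commonly stated form of I_n as the mutual information between population label and allele, with equal weight given to each of K populations. The function names, SNP identifiers and frequencies are hypothetical and are not taken from this study's data or pipeline.

```python
import math


def informativeness_for_assignment(pop_allele_freqs):
    """Return I_n for one marker.

    pop_allele_freqs: list of K lists; each inner list holds the allele
    frequencies of the marker in one population (summing to 1).
    Assumed formula: I_n = sum_j ( -p_j ln p_j + sum_i (p_ij / K) ln p_ij ),
    where p_j is the mean frequency of allele j over the K populations
    and 0 * ln 0 is taken as 0.
    """
    k = len(pop_allele_freqs)
    n_alleles = len(pop_allele_freqs[0])
    total = 0.0
    for j in range(n_alleles):
        mean_p = sum(pop[j] for pop in pop_allele_freqs) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        for pop in pop_allele_freqs:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total


def rank_snps(snp_freqs, top_n):
    """Rank biallelic SNPs by I_n, highest first.

    snp_freqs: dict mapping a SNP id to a list with the frequency of one
    allele in each population (hypothetical input layout).
    """
    scores = {
        snp: informativeness_for_assignment([[p, 1.0 - p] for p in freqs])
        for snp, freqs in snp_freqs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical allele frequencies in two populations.
    example = {"snp_a": [0.5, 0.5], "snp_b": [0.9, 0.1], "snp_c": [0.6, 0.4]}
    print(rank_snps(example, top_n=2))
```

Under this formula a SNP with identical frequencies in all populations scores zero, and a SNP fixed for different alleles in two populations reaches the maximum of ln 2, so ranking by the score favours markers with large frequency differences.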
One popular technique is the use of the fixation index (F_st), which reflects the difference between any two populations as estimated from SNP allele frequencies [8,16,19,21]. PCAIM is a PCA-based AIM selection algorithm that chooses SNPs that contribute more (i.e., are more highly correlated) to the resulting principal components [22]. Another technique to filter AIMs uses a measurement called informativeness for assignment (I_n) [23] to infer ancestry. This approach utilizes information theory principles to determine the amount of information that a particular variant, for example a microsatellite or SNP, contributes to inferring an individual's ancestry.

* Corresponding author. E-mail address: zilfalil@gmail.com (Z. Alwi).
http://dx.doi.org/10.1016/j.fsigen.2017.07.005
1872-4973/© 2017 Elsevier B.V. All rights reserved.
Forensic Science International: Genetics 30 (2017) 152–159
journal homepage: www.elsevier.com/locate/fsig

The I_n can help determine the amount of