Research paper
Analysis of the genetic structure of the Malay population:
Ancestry-informative marker SNPs in the Malay of Peninsular Malaysia
Padillah Yahyaᵃ, Sarina Sulongᵇ, Azian Harunᶜ, Hatin Wan Isaᵇ, Nur-Shafawati Ab Rajabᵇ,
Pongsakorn Wangkumhangᵈ, Alisa Wilanthoᵈ, Chumpol Ngamphiwᵈ,
Sissades Tongsimaᵈ, Zilfalil Alwiᵃ,*
ᵃ Department of Paediatric, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, 16150 Kelantan, Malaysia
ᵇ Human Genome Centre, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, 16150 Kelantan, Malaysia
ᶜ Department of Medical Microbiology and Parasitology, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, 16150 Kelantan, Malaysia
ᵈ National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Pathum Thani 12120, Thailand
ARTICLE INFO
Article history:
Received 3 January 2017
Received in revised form 23 June 2017
Accepted 10 July 2017
Available online 14 July 2017
Keywords:
SNP
Ancestry-informative marker
Genetics structure
Malay
Population
Malaysia
ABSTRACT
The Malays, the main ethnic group in Peninsular Malaysia, are represented by various sub-ethnic groups such as
Melayu Banjar, Melayu Bugis, Melayu Champa, Melayu Java, Melayu Kedah, Melayu Kelantan, Melayu Minang
and Melayu Patani. Using data retrieved from the MyHVP (Malaysian Human Variome Project) database, a
total of 135 individuals from these sub-ethnic groups were profiled using the Affymetrix GeneChip
Mapping Xba 50-K single nucleotide polymorphism (SNP) array to identify SNPs that were
ancestry-informative markers (AIMs) for Malays of Peninsular Malaysia. Prior to selecting the AIMs, the genetic
structure of Malays was explored with reference to 11 other populations obtained from the Pan-Asian
SNP Consortium database using principal component analysis (PCA) and ADMIXTURE. Iterative pruning
principal component analysis (ipPCA) was further used to identify sub-groups of Malays. Subsequently,
we constructed an AIMs panel for Malays using the informativeness for assignment (I_n) of genetic
markers, and the K-nearest neighbor (KNN) classifier was used to train the classification models. A model
of 250 SNPs ranked by I_n correctly classified Malay individuals with an accuracy of up to 90%. The
identified panel of SNPs could be utilized as a panel of AIMs to ascertain the specific ancestry of Malays,
which may be useful in disease association studies, biomedical research or forensic investigations.
© 2017 Elsevier B.V. All rights reserved.
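The ranking of SNPs by informativeness for assignment (I_n) described in the abstract can be illustrated with a minimal sketch. It assumes the standard information-theoretic definition, I_n = Σ_j [ −p_j ln p_j + (1/K) Σ_i p_ij ln p_ij ], where p_ij is the frequency of allele j in population i, p_j is its mean over K equally weighted populations, and 0·ln 0 is taken as 0. The function names, SNP identifiers and allele frequencies below are hypothetical and are not taken from the study.

```python
import math


def informativeness(freqs_by_pop):
    """I_n for one marker.

    freqs_by_pop[i][j] is the frequency of allele j in population i.
    Populations are assumed to be equally weighted (an assumption of this sketch).
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across populations.
        p_j = sum(pop[j] for pop in freqs_by_pop) / k
        if p_j > 0:
            total -= p_j * math.log(p_j)
        # Average within-population term; zero frequencies contribute nothing.
        for pop in freqs_by_pop:
            if pop[j] > 0:
                total += pop[j] * math.log(pop[j]) / k
    return total


def rank_snps(snp_freqs, top_n):
    """Return the top_n SNP identifiers with the highest I_n."""
    scores = {snp: informativeness(f) for snp, f in snp_freqs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical allele frequencies for three SNPs in two populations.
example = {
    "snp_a": [[0.5, 0.5], [0.5, 0.5]],  # identical frequencies: uninformative
    "snp_b": [[0.9, 0.1], [0.1, 0.9]],  # strongly differentiated
    "snp_c": [[0.6, 0.4], [0.4, 0.6]],  # weakly differentiated
}
top_two = rank_snps(example, top_n=2)
```

Under these assumptions a marker fixed for different alleles in two populations reaches the maximum of ln 2, and a marker with identical frequencies scores zero. A panel such as the 250-SNP model would correspond to a larger `top_n`, followed by a separate classification step.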
1. Introduction
Ancestry-informative markers (AIMs) are a set of genetic
biomarkers whose frequencies clearly differ among populations.
AIMs can be used to infer ancestry for an individual, which can be
advantageous for several applications, including forensics, genetic
ancestry testing and stratification correction in genome-wide
association studies (GWAS) [1–8]. In GWAS, AIMs can be used to
identify subpopulations prior to association mapping to minimize
spurious results [9]. Markers whose allele frequencies differ
markedly among groups of individuals are generally good
candidates for an AIM panel [10]. The majority of AIMs research
has concentrated on haploid markers, mainly mitochondrial DNA and
Y-chromosomal DNA polymorphisms, particularly at the
continental level [11–14]. However, these maternally and paternally
inherited markers carry limited genetic information,
especially for detecting recent admixture [15]. With the advent
of parallel genotyping platforms, considerable AIMs research has
been performed using autosomal DNA single nucleotide
polymorphisms (SNPs) [4,8,16–20].
Several techniques are commonly used to identify AIMs
from SNP genotyping data. One widely used technique is the
fixation index (F_st), which reflects the “difference” between any two
populations as estimated from SNP allele frequencies [8,16,19,21].
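As a rough illustration of how F_st scores a single marker, the sketch below computes a heterozygosity-based F_st, (H_T − H_S)/H_T, for one biallelic SNP from its allele frequency in two populations. It assumes equal population weights and applies no sample-size correction, so it may differ from the estimators used in the cited studies; the function name and example frequencies are illustrative only.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Heterozygosity-based F_st for one biallelic SNP in two populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Populations are assumed to be equally weighted (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity in the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # The SNP is monomorphic overall, so it carries no differentiation signal.
        return 0.0
    # Mean expected heterozygosity within the two populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical frequencies: a strongly differentiated SNP and an uninformative one.
fst_high = fst_biallelic(0.1, 0.9)
fst_none = fst_biallelic(0.3, 0.3)
```

With these inputs the differentiated SNP scores 0.64 and the SNP with equal frequencies scores 0, which is why markers with high F_st are natural AIM candidates.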
PCAIM is a PCA-based AIM selection algorithm that chooses SNPs
that contribute more (i.e., are more highly correlated) to the
resulting principal components [22]. Another technique to filter
AIMs uses a measurement called informativeness for assignment
(I_n) [23] to infer ancestry. This approach utilizes information
theory principles to determine the amount of information that a
particular variant, for example a microsatellite or SNP, contributes
to an individual's ancestry. The I_n can help determine the amount of
* Corresponding author.
E-mail address: zilfalil@gmail.com (Z. Alwi).
http://dx.doi.org/10.1016/j.fsigen.2017.07.005
1872-4973/© 2017 Elsevier B.V. All rights reserved.
Forensic Science International: Genetics 30 (2017) 152–159