The 4th Joint Symposium on Computational Intelligence (JSCI 4), 2 February 2018, Bangkok, Thailand

SNP selection for Porcine breed classification by a hybrid information gain and genetic algorithm

Wanthanee Rathasamuth *, Kitsuchart Pasupa †, and Sissades Tongsima ‡

* Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
† National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
Email: * rathasamuth.wan@gmail.com, † kitsuchart@it.kmitl.ac.th, ‡ sissades@biotec.or.th

Abstract—A single nucleotide polymorphism (SNP) is a variation in a DNA sequence that is linked to a particular trait of an organism. Good SNP selection can help obtain a porcine breed that grows fast and gives a high yield. SNP selection can be done with a computerized feature selection method and a classification technique. At present, an effective classification model can handle only a small number of features efficiently; too large a number may cause over-fitting in the classification. The SNPs, or features, therefore need to be reduced to an optimum number for an effective porcine SNP analysis. This paper proposes an approach that reduces the number of features in porcine SNP analysis to an optimum number by a hybrid of Information Gain (IG) and genetic algorithm (GA) techniques. A performance test demonstrated that this approach was able to select a minimum number of features (1.51% of the total number of features) that provided an average classification accuracy of 94.02%, compared with 95.28% when all features were used.

Index Terms—Feature selection, Bioinformatics, Machine learning, Single Nucleotide Polymorphisms.

I. INTRODUCTION

China may have been the first country to selectively breed wild pigs, starting 5,000 years ago. Pig breeding in Thailand was heavily influenced by Chinese immigrants to the country.
Today, pigs are an economically important animal in Thailand, so selecting the right breed for a given geographical location is a very important issue. The diverse physical traits of pigs result from differences in DNA base sequences, which are called single nucleotide polymorphisms (SNPs). A thorough porcine SNP analysis can determine the SNPs that provide good growth and reproduction. The difficulty is that a single organism has millions of SNPs, so manual SNP analysis by an expert is out of the question, not to mention the large amount of other resources it would need. A good way to address this issue today is bioinformatics, an integration of computer science, biology, mathematics, and engineering. Machine learning [1] has been applied to genomics, proteomics, microarrays, and systems biology for the classification of genes. Several classification techniques for bioinformatics are presented in [1], such as support vector machines (SVM), decision trees, neural networks, Bayesian classifiers, and nearest neighbors. These classification techniques cannot effectively handle too large a number of features, which can cause the commonly encountered over-fitting problem (high accuracy on the training dataset but low accuracy on the testing dataset). Reducing the features to an optimum subset can therefore make a classification attempt successful. Papers that deal with this issue include [2], a review of feature selection applied to bioinformatics. It reports three types of feature selection: filter methods, such as the t-test and information gain (IG); wrapper methods, such as genetic algorithms (GA) and other nature-inspired algorithms; and embedded methods, such as random forests and decision trees. A review of several nature-inspired algorithms used to perform feature selection is presented in [3]. Its main concern was how to increase selection efficiency and reduce prediction error.
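
As a rough illustration of the filter idea named above, the information gain of a single SNP with respect to the breed labels could be computed as in the sketch below. The 0/1/2 genotype coding, the toy data, and the function names `entropy` and `information_gain` are assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG = H(labels) - sum over values v of p(v) * H(labels | feature == v)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four samples from two breeds, genotypes coded 0/1/2.
breeds = ["A", "A", "B", "B"]
snps = {
    "snp_informative": [0, 0, 2, 2],    # genotype separates the two breeds
    "snp_uninformative": [0, 2, 0, 2],  # genotype is independent of breed
}

# Score every SNP on its own, then rank from most to least informative.
scores = {name: information_gain(values, breeds) for name, values in snps.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A filter step of this kind scores each feature independently and keeps only the top-ranked ones, which is why it needs no classifier in the loop.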
The most common problems reported in [3] were too large a number of features and too small a training dataset. These are some of the challenges in doing a classification analysis.

The conceptual frameworks of feature selection by filter, wrapper, and embedded methods are quite different. Filter methods are simple and efficient: they select features that have high index values, independently of the classification method. Wrapper methods rely on the classification method to select an optimal subset of features that provides high classification accuracy. Each round of feature evaluation includes a classification step, so a large number of features results in a long computation time. Embedded methods are not very different from wrapper methods but include a feature reduction step that reduces their computation time. Because wrapper and embedded methods rely on a classification step to select an optimal subset of features, the selected features can facilitate better learning of the training dataset, but the resulting predictions may suffer from over-fitting. Filter methods, on the other hand, tend to over-fit less.

This paper proposes a hybrid feature selection technique that combines IG with GA (IG+GA) for the purpose of selecting the best porcine SNPs for the classification of pig breeds. The next section describes the datasets, followed by the experimental framework and its results.

II. DATASETS

The dataset used in this study consists of SNP data from 677 pig samples of 22 breeds: 356 samples from the dataset of porcine colonization of the Americas [4] and 321 samples from the dataset of the Project of Porcine Breed Improvement by Selection according to Whole Genome SNP Data supported
978-616-455-375-0 ©2018 JSCI