International Journal of Business and Society, Vol. 18 S4, 2017, 845-853

IMPROVED ENSPART FOR DNA MOTIF PREDICTION

Allen Chieng-Hoon Choong
Universiti Malaysia Sarawak

Nung-Kion Lee
Universiti Malaysia Sarawak

Chih-How Bong
Universiti Malaysia Sarawak

Norshafrina Omar
Universiti Malaysia Sarawak

ABSTRACT

In our previous work we proposed ENSPART, an ensemble method for DNA motif discovery that partitions the input dataset into several equal-size subsets, each of which is processed by several distinct tools for candidate motif prediction. The candidate motifs obtained from the different data subsets are merged to obtain the final motifs. Nevertheless, the original ENSPART has two limitations: (1) the same background sequences are used to calculate the Receiver Operating Characteristic (ROC) of motifs obtained from different datasets, which causes bias because different datasets might have different background distributions; (2) it does not consider the duplication of a motif and its reverse complement, which leaves many redundant motifs in the result set that require filtering. In this work, we extend the original ENSPART to solve these two issues. For the first issue, we employ background sequences that are based on the distribution of bases in the input sequences. For the second issue, we employ a "triple" merging strategy to reduce redundant motifs. The evaluation results indicate that the two improvements yield better AUC values than the original implementation.

Keywords: DNA Motif Discovery; Machine Learning; Ensemble.

1. INTRODUCTION

ENSPART (Lee, Choong, & Omar, 2016) is an ensemble approach that uses seven motif discovery tools for motif prediction. It is designed to tackle large-scale ChIP datasets and to discover the primary motifs in a DNA dataset enriched with motifs. The idea of ENSPART is to partition a large-scale ChIP dataset into small subsets and to use an ensemble of motif discovery tools for motif prediction in each subset.
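The partitioning idea, together with the two issues named in the abstract, can be illustrated with a minimal Python sketch. It is not the ENSPART implementation: the function names are hypothetical, the background generator assumes a simple zero-order model (each base drawn independently according to its frequency in the input), and the reverse-complement handling only picks one representative per strand pair rather than reproducing the "triple" merging strategy.

```python
import random
from collections import Counter

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def partition(sequences, subset_size):
    """Split sequences into consecutive subsets of subset_size.

    The last subset is smaller when the input does not divide evenly.
    """
    return [sequences[i:i + subset_size]
            for i in range(0, len(sequences), subset_size)]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(motif):
    """Choose one representative for a motif and its reverse complement.

    Assumption: the lexicographically smaller string is used, so both
    strands of the same site map to the same key.
    """
    return min(motif, reverse_complement(motif))


def base_frequencies(sequences):
    """Relative frequency of each base across all input sequences."""
    counts = Counter(b for s in sequences for b in s if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def make_background(sequences, rng):
    """Generate one random sequence per input sequence, of equal length.

    Assumption: a zero-order model, i.e. bases are sampled independently
    with the empirical base frequencies of the input.
    """
    freqs = base_frequencies(sequences)
    bases, weights = zip(*freqs.items())
    return ["".join(rng.choices(bases, weights=weights, k=len(s)))
            for s in sequences]
```

A typical use would be to call `partition` on the input, run the tools on each subset, key the predicted motifs by `canonical` so that strand duplicates collapse, and evaluate each subset against `make_background(subset, random.Random(seed))` rather than against one shared background set.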
The assumption is that the binding sites of a primary transcription factor protein are abundant in each partitioned subset and can thus be predicted by the motif discovery tools independently. Furthermore, utilizing many tools for prediction increases the chances of obtaining true motifs. The tools are run on each partitioned dataset for motif discovery, and the motifs predicted by the individual tools are merged to produce the final motifs.

Corresponding author: Nung-Kion Lee, Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia. Email: nklee@unimas.my

An