Fuzzy Filtering Systems for Performing Environment Improvement of Computational DNA Motif Discovery

Dianhui Wang and Sarwar Tapan

Abstract—DNA datasets exhibit a considerably low signal-to-noise ratio, which prevents computational motif discovery tools from achieving satisfactory performance. Reducing the search space and increasing the signal-to-noise ratio by means of filtering can therefore provide these tools with better performing environments. This paper proposes unsupervised fuzzy filtering systems that aim to remove a large portion of the k-mers that are less relevant to potential motif instances in terms of location overlaps in the given sequences. The Relative Model Mismatch Score (RMMS), a new quantitative metric for measuring the quality of motif models, is employed to facilitate the proposed filtering. A modified version of the fuzzy c-means clustering algorithm with an initialization strategy is adopted to group k-mers, while the complement of the fuzzified RMMS is used to rank k-mers for data filtering. Experimental results on eight real DNA datasets show that the proposed filtering systems can remove approximately (85 ± 5)% of the data samples while maintaining a high retention rate of relevant k-mers. As a data pre-processing component, this filtering will thus improve the performing environments of motif discovery tools, since the filtered datasets have a much smaller cardinality and a higher signal-to-noise ratio than the original datasets.

I. INTRODUCTION

Transcription Factor Binding Sites (TFBSs) are short DNA segments (usually no more than 30 bp) that are believed to interact with Transcription Factors (TFs) to regulate gene expression [1]. A collection of TFBSs from a set of co-regulated genes is termed a DNA motif. Discovering novel motifs of unknown TFs, or motif instances associated with known TFs, in DNA sequences is crucial to understanding gene regulatory networks [2].
Due to the huge volume of data, wet-laboratory experiments for finding DNA motifs are costly and time-consuming. Hence, computational approaches have been adopted by the community as supportive tools. Tools addressing the computational discovery of DNA motifs include [3], [4], [5], [6], [7], [8] and [16]. Motif patterns carry specific biological information due to evolutionary constraints and/or other biological phenomena. Motifs therefore show a conservation property, that is, a degree of sequence pattern similarity among their instances. Computational motif discovery tools usually exploit this conservation property to distinguish a potential motif model.

Dianhui Wang and Sarwar Tapan are with the Department of Computer Science and Engineering, La Trobe University, Melbourne, Victoria 3086, Australia. E-mail: dh.wang@latrobe.edu.au.

However, DNA datasets usually contain a considerably large number of data samples, and biological data mining tasks become challenging because of the very low signal-to-noise ratio. This strong imbalance prevents computational motif discovery tools from achieving satisfactory performance. In this regard, reference [9] compared the performance of five prominent tools on a number of prokaryotic promoter datasets, and it was reported that the performance of those tools deteriorated rapidly when they were applied to large-scale datasets [10]. It is therefore believed that reducing the search space and increasing the signal-to-noise ratio by means of filtering can provide computational motif discovery tools with better performing environments. The objective of this work is to develop unsupervised fuzzy filtering systems that reduce the size of datasets by removing as many irrelevant sequence portions as possible while minimizing the loss of possible true binding site locations.
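To make the notions of k-mers and signal-to-noise ratio concrete, the following minimal Python sketch extracts k-mers with a sliding window and counts how many of them overlap an annotated binding site. It is an illustration under stated assumptions, not the procedure used in this paper: the function names, the "any shared position" overlap rule, and the ratio of overlapping to non-overlapping k-mers are all hypothetical choices made for the example.

```python
def extract_kmers(sequences, k):
    """Slide a window of width k over each sequence.

    Returns a list of (sequence index, start position, k-mer) tuples.
    """
    kmers = []
    for seq_idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            kmers.append((seq_idx, start, seq[start:start + k]))
    return kmers


def overlaps(start, k, site_start, site_len):
    """True if window [start, start + k) shares at least one position
    with the site [site_start, site_start + site_len).

    Assumption: any shared position counts as an overlap.
    """
    return start < site_start + site_len and site_start < start + k


def signal_to_noise(kmers, k, sites):
    """Ratio of k-mers that overlap an annotated site to those that do not.

    sites: list of (sequence index, site start, site length) tuples.
    Assumption: this ratio is only one plausible reading of
    "signal-to-noise ratio" for a k-mer dataset.
    """
    signal = sum(
        1
        for seq_idx, start, _ in kmers
        if any(
            s_idx == seq_idx and overlaps(start, k, s_start, s_len)
            for s_idx, s_start, s_len in sites
        )
    )
    noise = len(kmers) - signal
    return signal / noise if noise else float("inf")


# Toy data: two 20 bp sequences, one annotated 4 bp site in the first.
sequences = ["ACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTT"]
sites = [(0, 8, 4)]
kmers = extract_kmers(sequences, k=4)
ratio = signal_to_noise(kmers, k=4, sites=sites)
```

On this toy input, 7 of the 34 extracted k-mers overlap the site, so the ratio is 7/27. A filter that discards mostly non-overlapping k-mers raises this ratio while shrinking the dataset, which is the effect the filtering systems aim for.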
The framework developed in this paper does not require any prior knowledge associated with the datasets during the filtering process. The proposed filtering systems therefore have good potential for computational DNA motif discovery on large-scale datasets, where prior knowledge is often unavailable.

In this paper, we employ an extension of the previously reported Model Mismatch Score (MMS) [11] that reflects the associative rareness of a motif model in addition to its conservation property. This extension is termed the Relative Model Mismatch Score (RMMS), and its fuzzified version is employed to rank k-mers in the filtering process. The ranking of k-mers depends on motif models that can be generated by data clustering algorithms. For this purpose, a modified version of the fuzzy c-means clustering technique [14] is employed to cluster the k-mers. This work also employs a heuristic-driven seed concept to facilitate a favourable initialization of the fuzzy cluster centers on k-mer datasets.

The remainder of this paper is organized as follows: Section II describes some preliminaries, including a binary matrix encoding method for DNA data and a modified version of the fuzzy c-means clustering algorithm [14]. Section III describes the Relative Model Mismatch Score (RMMS) measure and its fuzzified expression. Section IV details the proposed fuzzy filtering systems. Section V reports some promising results with comparisons and discussions, and Section VI concludes this work.

WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23, 2010, CCIB, Barcelona, Spain. FUZZ-IEEE. 978-1-4244-8126-2/10/$26.00 © 2010 IEEE
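As a rough illustration of the two preliminaries named in the outline above, the Python sketch below encodes a k-mer as a binary matrix and computes fuzzy memberships of a k-mer in a set of cluster centers. Both parts are assumptions made for the example: the encoding is a plain 4 x k one-hot matrix (rows A, C, G, T), which may differ from the encoding defined in Section II, and the membership rule is the standard fuzzy c-means formula, not the modified version adopted in this paper. All function names are hypothetical.

```python
BASES = "ACGT"


def encode_kmer(kmer):
    """Encode a k-mer as a 4 x k binary matrix with rows A, C, G, T.

    Assumption: a one-hot layout; the paper's own encoding may differ.
    """
    return [[1 if base == b else 0 for base in kmer] for b in BASES]


def sq_distance(m1, m2):
    """Squared Euclidean distance between two equally sized matrices."""
    return sum(
        (a - b) ** 2
        for row1, row2 in zip(m1, m2)
        for a, b in zip(row1, row2)
    )


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    With squared distances D_i, u_i is proportional to (1 / D_i) ** (1 / (m - 1)).
    m > 1 is the fuzzifier. This is the textbook rule, not the modified
    variant used in the paper.
    """
    d = [sq_distance(point, c) for c in centers]
    # A point that coincides with one or more centers gets its full
    # membership split evenly among those centers.
    zeros = [i for i, di in enumerate(d) if di == 0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d))]
    exponent = 1.0 / (m - 1.0)
    inverse = [(1.0 / di) ** exponent for di in d]
    total = sum(inverse)
    return [v / total for v in inverse]


# Toy example: two centers taken directly from encoded k-mers.
centers = [encode_kmer("ACGT"), encode_kmer("TTTT")]
u = fcm_memberships(encode_kmer("ACGA"), centers)
```

Under this encoding, each mismatched position contributes 2 to the squared distance, so "ACGA" lies at distance 2 from "ACGT" and 8 from "TTTT", giving memberships of 0.8 and 0.2 for m = 2. In a full clustering run the centers would be real-valued weighted averages of the encoded k-mers, not binary matrices.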