Fuzzy Filtering Systems for Performing Environment Improvement of Computational DNA Motif Discovery

Dianhui Wang and Sarwar Tapan

Abstract: DNA datasets exhibit a very low signal-to-noise ratio, which prevents computational motif discovery tools from achieving satisfactory performance. Reducing the search space and increasing the signal-to-noise ratio by means of filtering can therefore give these tools a better performing environment. This paper proposes unsupervised fuzzy filtering systems that aim to remove a large portion of the k-mers that are less relevant to potential motif instances in terms of location overlap in the given sequences. The Relative Model Mismatch Score (RMMS), a new quantitative metric for measuring the quality of motif models, is employed to support the proposed filtering. A modified version of the fuzzy c-means clustering algorithm with an initialization strategy is adopted to group k-mers, while a complement of the fuzzified RMMS is used to rank k-mers for data filtering. Experimental results on eight real DNA datasets show that the proposed filtering systems can remove approximately (85 ± 5)% of data samples while maintaining a high retention rate of relevant k-mers. Used as a data pre-processing component, this filtering will thus improve the performing environment of motif discovery tools, since the filtered datasets have much smaller cardinality and a higher signal-to-noise ratio than the original datasets.

I. INTRODUCTION

Transcription Factor Binding Sites (TFBSs) are short DNA segments (usually ≤ 30 bp) that are believed to interact with Transcription Factors (TFs) to regulate gene expression [1]. A collection of TFBSs from a set of co-regulated genes is termed a DNA motif. Discovering novel motifs of unknown TFs, or motif instances associated with known TFs, in DNA sequences is crucial to understanding gene regulatory networks [2].
Due to the huge volume of data, wet-laboratory experiments for finding DNA motifs are costly and time-consuming. Hence, computational approaches have been adopted by the community as supporting tools. Tools addressing the computational discovery of DNA motifs include [3], [4], [5], [6], [7], [8] and [16]. Motif patterns retain specific biological information due to evolutionary constraints and/or other biological phenomena. Motifs therefore exhibit a conservation property, that is, some degree of sequence pattern similarity among their instances. Computational motif discovery tools usually exploit this conservation property to distinguish a potential motif model.

Dianhui Wang and Sarwar Tapan are with the Department of Computer Science and Engineering, La Trobe University, Melbourne, Victoria 3086, Australia. E-mail: dh.wang@latrobe.edu.au.

However, DNA datasets usually contain a very large number of data samples, and biological data mining tasks become challenging because of the very low signal-to-noise ratio. This severe imbalance prevents computational motif discovery tools from achieving satisfactory performance. In this regard, reference [9] compared the performance of five prominent tools on a number of prokaryotic promoter datasets, and it was reported that the performance of those tools deteriorated rapidly when they were applied to large-scale datasets [10]. It is thus believed that reducing the search space and increasing the signal-to-noise ratio by means of filtering can give computational motif discovery tools a better performing environment. The objective of this work is to develop unsupervised fuzzy filtering systems that reduce the size of datasets by removing as many irrelevant sequence portions as possible while minimizing the loss of possible true binding site locations.
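To make the scale of the imbalance concrete, the following minimal sketch enumerates k-mers with a sliding window and computes a simple signal-to-noise ratio. It is an illustration only, not the authors' implementation: the function names, the toy sequences, the planted site TTGACA, and the definition of the ratio as binding-site k-mers divided by background k-mers are all assumptions made for this example.

```python
from typing import Iterator, List, Tuple


def extract_kmers(sequence: str, k: int) -> Iterator[Tuple[int, str]]:
    """Yield (start position, k-mer) for every length-k window of a sequence."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


def signal_to_noise(sequences: List[str], k: int, num_sites: int) -> float:
    """Assumed definition: binding-site k-mers divided by background k-mers."""
    total = sum(max(len(s) - k + 1, 0) for s in sequences)
    background = total - num_sites
    if background <= 0:
        raise ValueError("dataset must contain at least one background k-mer")
    return num_sites / background


# Hypothetical toy dataset: three 20 bp sequences, each with one planted
# site "TTGACA". Real promoter datasets are far larger.
toy_sequences = [
    "ACGTACGTTTGACAACGTAC",
    "GGCATTGACATTCCGATAGC",
    "TTAGCCGATTGACAGGCATA",
]
k = 6
kmers = [kmer for seq in toy_sequences for kmer in extract_kmers(seq, k)]
ratio = signal_to_noise(toy_sequences, k, num_sites=3)
print(len(kmers), round(ratio, 4))
```

Even in this tiny example only 3 of the 45 candidate k-mers are binding sites. The number of background k-mers grows with sequence length while the number of sites stays small, which is why removing irrelevant k-mers before motif discovery can raise the ratio substantially.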
The framework developed in this paper does not require any prior knowledge associated with the datasets during the filtering process. The proposed filtering systems therefore have good potential for computational DNA motif discovery on large-scale datasets, where prior knowledge is often unavailable. In this paper, we employ an extension of the previously reported Model Mismatch Score (MMS) [11] that reflects the associative rareness measure of a motif model in addition to its conservation property. This extension is termed the Relative Model Mismatch Score (RMMS), and its fuzzified version is employed to rank k-mers in the filtering process. The ranking of k-mers depends on motif models, which can be generated by data clustering algorithms. For this purpose, a modified version of the fuzzy c-means clustering technique [14] is employed to cluster the k-mers. This work also employs a heuristic-driven seed concept to support a favourable initialization of the fuzzy cluster centers on k-mer datasets.

The remainder of this paper is organized as follows. Section II describes some preliminaries, including a binary matrix encoding method for DNA data and a modified version of the fuzzy c-means clustering algorithm [14]. Section III describes the Relative Model Mismatch Score (RMMS) measure and its fuzzified expression. Section IV details the proposed fuzzy filtering systems. Section V reports some promising results with comparisons and discussions, and Section VI concludes this work.

WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23, 2010, CCIB, Barcelona, Spain. FUZZ-IEEE. 978-1-4244-8126-2/10/$26.00 © 2010 IEEE