Abstract—In biological sequence research, the positional weight matrix (PWM) is often used to search for putative transcription factor binding sites. A set of experimentally verified oligonucleotides known to be functional motifs is collected and aligned, and the frequency of each nucleotide (A, C, G, or T) at each column of the alignment is recorded in the matrix. Once a PWM is constructed, it can be used to search a nucleotide sequence for subsequences that may perform the same function. The match between a subsequence and a PWM is usually described by a score function, which measures the closeness of the subsequence to the PWM relative to a given background. However, the score function usually depends on motif length, so there is no universally applicable threshold. In this paper, we propose an alternative scoring index (G) that varies from zero, where the subsequence is not much different from the background, to one, where the subsequence best fits the PWM. We also propose a measure for evaluating the statistical expectation at each G index. We investigated the PWMs from TRANSFAC and found that the statistical expectation is significantly (p<0.0001) correlated with both the length of the PWM and the threshold G value. We applied this method to two PWMs (GCN4_C and ROX1_Q6) of yeast transcription factor binding sites and two PWMs (HIC1-02, HIC1_03) of the human tumor suppressor (HIC-1) binding sites from the TRANSFAC database. Finally, our method compares favorably with the widely used Match method. The results indicate that our method is more flexible and can provide better confidence.

Index Terms—Positional Weight Matrix, Threshold, Statistical expectation, Goodness-of-fit, Sequence motif.

I. INTRODUCTION

Sequence motifs are short, functional patterns in biological sequences and are often used to characterize the interaction between DNA and a protein, such as the binding site of a transcription factor (TF).
Many TFs are able to bind to a DNA subsequence with alternative nucleotides at one or more positions in a motif. A set of experimentally verified oligonucleotide sequences known to be bound by a TF is collected and aligned. The frequency of each nucleotide (A, C, G, or T) at each column of the alignment is recorded in a matrix, called a positional weight matrix (PWM, see e.g. [1]). Once a PWM is constructed, it can be used to search for putative sites that may be bound by the corresponding TF. The match between a subsequence and a PWM is usually described by a score function. A subsequence is considered a putative transcription factor binding site (TFBS) when its score passes a given threshold. The PWM has been a popular means of modeling TFBSs in a promoter sequence.

This work was supported in part by Genomics and Health Initiative at National Research Council Canada. This is National Research Council publication NRC XXXXX. Both authors are with the Institute for Information Technology, National Research Council Canada, 1200 Montreal Road, Ottawa, Ontario, Canada K1A 0R6 (YP: corresponding author: 613-993-8556; fax: 613-952-0215; e-mail: youlian.pan@nrc.ca. SP: e-mail: sieu.phan@nrc.ca).

Over the past two decades, many computational approaches have been developed to discover conserved motifs, with a certain degree of success. The computational motif discovery process can be divided into two categories: supervised prediction of known motifs and unsupervised de novo motif discovery [2]. The supervised methods for known motif prediction include Match [3], P-Match [4], MatInspector [5], and GAPWM [6]. In unsupervised de novo motif discovery, novel motifs are found by identifying overrepresented oligonucleotides in the input sequence dataset. The conserved motifs are iteratively evolved through various optimization algorithms, such as those discussed in [2].
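As an illustration of the PWM construction and threshold-based search described above, the following is a minimal sketch and not the scoring method proposed in this paper. It assumes a log-odds score against a uniform background, which is one common choice of score function, and a pseudocount of 1 to avoid zero frequencies. The function names (`build_pwm`, `log_odds_score`, `scan`), the aligned sites, the target sequence, and the threshold value are all hypothetical and chosen only for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Return column-wise nucleotide frequencies for equal-length aligned sites.

    The pseudocount is an assumption of this sketch, not part of the source text.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("aligned sites must all have the same length")
    pwm = []
    for col in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm


def log_odds_score(pwm, subseq, background=None):
    """Sum per-position log2(frequency / background); positions are treated as independent."""
    background = background or {base: 0.25 for base in BASES}
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(pwm, subseq)
    )


def scan(pwm, sequence, threshold, background=None):
    """Return (start, window, score) for every window whose score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds_score(pwm, window, background)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned sites and target sequence, made up for illustration only.
sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTCT"]
pwm = build_pwm(sites)
hits = scan(pwm, "CCTGACTCAGG", threshold=5.0)
```

Because the score is a sum over positions, its attainable range grows with the width of the PWM, so a threshold such as the `5.0` used here has no meaning that carries over to a matrix of a different length.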
Popular de novo methods include expectation maximization, as implemented in MEME [7]-[8], and a combination of expectation maximization with stochastic sampling, as implemented in the Gibbs sampling family, such as CONSENSUS [9], AlignACE [10], motifSampler [11], and BioProspector [12].

As a result of research in various laboratories around the world over the past few decades, many PWMs have become available in public databases, such as TRANSFAC [13] and JASPAR [14]. These PWMs are extensively used to search for putative motif instances, and the PWM-based methods are reviewed in [2], [15], [16]. The PWM-based methods commonly assume that the positions in a motif are mutually independent. A score function compares a motif instance with the PWM and calculates the similarity at each base regardless of the content of the neighboring bases.

The main challenge in PWM-based motif prediction is defining an objective score function and determining a threshold score. The score functions usually depend on PWM parameters such as length and information content. Therefore, a threshold score that legitimately qualifies a functional motif is very hard to select without subjectivity. The score of a motif instance is usually the sum of the scores of its individual bases, so it depends on the length of the motif and on the PWM model. To date, there is no universally applicable threshold for PWM-based methods, and this has been a major drawback of these methods.

Several research groups have attempted to solve the problem. For example, Match [3] takes the minimum and maximum scores and scales them between 0.00 and 1.00, both for the entire PWM space and for the five consecutive nucleotides whose maximum score is the best in any region of the PWM space. Hertzberg et al.
[17] introduced a probability measure to scan the input sequence for a position with maximum score and then calculate the

Threshold for Positional Weight Matrix. Youlian Pan and Sieu Phan. Engineering Letters, 16:4, EL_16_4_06 (Advance online publication: 20 November 2008)