Estimating the number of coding mutations in genotypic and phenotypic driven N-ethyl-N-nitrosourea (ENU) screens: revisited

David A. Keays, Taane G. Clark, Thomas G. Campbell, John Broxholme, William Valdar

Psychiatric Genetics Laboratory, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, United Kingdom

Received: 18 May 2006 / Accepted: 11 December 2006

Abstract

We recently described methods for estimating the number of N-ethyl-N-nitrosourea (ENU)-induced coding mutations in phenotypic and genotypic screens. In this article we revisit these methods, clarifying their application. In particular, we focus on the difference between unconditional and conditional probabilities. We also introduce a website to assist investigators in the application of these equations (http://www.well.ox.ac.uk/enuMutRat).

Introduction

We recently described methods for estimating the number of N-ethyl-N-nitrosourea (ENU)-induced coding mutations in phenotypic and genotypic screens (Keays et al. 2006). We have been alerted to a potential ambiguity in the application of these methods, so in this article we seek to clarify them.

This ambiguity is seen in the following example. Suppose an investigator has just identified a mouse mutant with a heritable phenotype of interest and has chosen at random a 5-Mb region of genomic DNA containing 115 kb of coding DNA in which to look for the mutant allele. As before, if we assume that the number of coding mutations ($K$) in a specific length of DNA ($n$) follows a Poisson distribution with a known fixed mutation rate ($\lambda$), then the probability of $k$ coding mutations is

$$P(K = k; n) = \frac{(n\lambda)^{k}}{k!}\, e^{-n\lambda}, \tag{1}$$

and the probability of more than $k$ coding mutations is

$$P(K > k; n) = 1 - \sum_{i=0}^{k} P(K = i; n), \tag{2}$$

where $k = 0, 1, \ldots$. Suppose then that the known mutation rate is 1 in 1.82 Mb (Quwailid et al. 2004).
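Equations (1) and (2) can be evaluated directly. The following is a minimal sketch in Python, not the implementation behind the website mentioned in the abstract. The function names `poisson_pmf` and `poisson_tail` and the constants `RATE_PER_MB` and `N_CODING_MB` are illustrative assumptions. The sketch also assumes that the length $n$ and the rate $\lambda$ are expressed in the same unit (here, Mb).

```python
import math


def poisson_pmf(k: int, n: float, rate: float) -> float:
    """Equation (1): P(K = k; n) for a Poisson count with mean n * rate."""
    mean = n * rate
    return mean ** k / math.factorial(k) * math.exp(-mean)


def poisson_tail(k: int, n: float, rate: float) -> float:
    """Equation (2): P(K > k; n) = 1 - sum over i = 0..k of P(K = i; n)."""
    return 1.0 - sum(poisson_pmf(i, n, rate) for i in range(k + 1))


# Values taken from the example in the text; the names are hypothetical.
RATE_PER_MB = 1 / 1.82  # one coding mutation per 1.82 Mb
N_CODING_MB = 0.115     # 115 kb of coding DNA, expressed in Mb

if __name__ == "__main__":
    for k in range(2):
        p = poisson_pmf(k, N_CODING_MB, RATE_PER_MB)
        print(f"P(K = {k}; n_c) = {p:.3f}")
    tail = poisson_tail(1, N_CODING_MB, RATE_PER_MB)
    print(f"P(K > 1; n_c) = {tail:.3f}")
```

Working in megabases keeps the product $n\lambda$ dimensionless; any other unit works equally well provided both quantities use it.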
The probability that there are no coding mutations in the $n_c = 115$-kb region is $P(K = 0; n_c) \approx 0.939$, the probability of a single mutation is $P(K = 1; n_c) \approx 0.059$, and the probability that there are two or more is $P(K > 1; n_c) = 1 - P(K = 0; n_c) - P(K = 1; n_c) \approx 0.002$.

Now suppose the investigator sequences half of the candidate coding region and finds a mutation. The probability of there being one or more further coding mutations in the unsequenced portion, and thus two or more in the coding region overall, is $P(K > 0; n_c/2) \approx 0.031$. More generally, if the investigator sequences a segment of length $n_1$ from a coding region of length $n$ and finds $m$ coding mutations therein, the probability of finding more than $j$ mutations in the remaining region of length $n_2 = n - n_1$ is given by the conditional probability

$$P(K > m + j; n \mid m) = P(K > j; n_2), \tag{3}$$

which is independent of the value of $m$.

Correspondence to: D. A. Keays; E-mail: david.keays@physiol.ox.ac.uk. DOI: 10.1007/s00335-006-0065-z. Volume 18, 123-124 (2007). © Springer Science+Business Media, Inc. 2007

At first this might seem inconsistent. In the example, before partial sequencing the probability of two or more coding mutations was $\approx 0.002$, whereas afterward the probability has risen to $\approx 0.031$. However, this rise in probability is familiar in other contexts. For example, the probability of two coin tosses producing two heads is 0.25, but if one coin has already been tossed to produce one head then the (conditional) probability of getting a second head