Estimating trace-suspect match probabilities for singleton Y-STR haplotypes using coalescent theory

Mikkel Meyer Andersen a,*, Amke Caliebe b,1, Arne Jochens b,2, Sascha Willuweit c,3, Michael Krawczak b,4

a Department of Mathematical Sciences, Aalborg University, Fredrik Bajers Vej 7G, 9220 Aalborg East, Denmark
b Institute of Medical Informatics and Statistics, Christian-Albrechts University, UK SH Campus Kiel, Arnold-Heller-Strasse 3, 24105 Kiel, Germany
c Institute of Legal Medicine, Charité – Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany

* Corresponding author. Tel.: +45 99408866.
1 Tel.: +49 431 597 3199.
2 Tel.: +49 431 597 3192.
3 Tel.: +49 30 450 525074.
4 Tel.: +49 431 597 3200.
E-mail addresses: mikl@math.aau.dk (M.M. Andersen), caliebe@medinfo.uni-kiel.de (A. Caliebe), jochens@medinfo.uni-kiel.de (A. Jochens), sascha.willuweit@charite.de (S. Willuweit), krawczak@medinfo.uni-kiel.de (M. Krawczak).

Forensic Science International: Genetics 7 (2013) 264–271
1872-4973/$ – see front matter © 2012 Elsevier Ireland Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.fsigen.2012.11.004

ARTICLE INFO

Article history:
Received 20 June 2012
Received in revised form 7 November 2012
Accepted 24 November 2012

Keywords: Y-STR; Singleton haplotype; Lineage marker; Match probability; Likelihood ratio; Coalescent theory

ABSTRACT

Estimation of match probabilities for singleton haplotypes of lineage markers, i.e. for haplotypes observed only once in a reference database augmented by a suspect profile, is an important problem in forensic genetics. We compared the performance of four estimators of singleton match probabilities for Y-STRs, namely the count estimate, both with and without Brenner's so-called 'kappa correction', the surveying estimate, and a previously proposed, but rarely used, coalescent-based approach implemented in the BATWING software. Extensive simulation with BATWING of the underlying population history, haplotype evolution and subsequent database sampling revealed that the coalescent-based approach is characterized by lower bias and lower mean squared error than the uncorrected count estimator and the surveying estimator. Moreover, in contrast to the two count estimators, both the surveying and the coalescent-based approach exhibited a good correlation between the estimated and true match probabilities. However, although its overall performance is thus better than that of any other recognized method, the coalescent-based estimator is still computation-intense on the verge of general impracticability. Its application in forensic practice therefore will have to be limited to small reference databases, or to isolated cases of particular interest, until more powerful algorithms for coalescent simulation have become available.

© 2012 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

In forensic genetics, it is often necessary to compare the plausibility of two case-relevant hypotheses on the basis of some genetic data, and the most consistent (and therefore generally recommended) way of doing so is by means of the likelihood ratio [1]. Calculating the likelihood ratio in forensic case work is usually tantamount to quantifying the match probability between two genetic profiles under different assumptions about their degree of relatedness. One particularly important match probability in this context is the probability that a certain individual (e.g. the donor of a trace found at a crime scene) has the same DNA profile as another individual (usually a suspect) drawn randomly from the same population. Methods to estimate this so-called 'trace-suspect' match probability are well established for autosomal STRs [2], with most of them assuming statistical independence between the markers included in the profile.

Lineage markers, such as Y-chromosomal short tandem repeats (Y-STRs) or mtDNA polymorphisms, have several advantages over autosomal markers [3,4], for example, when solving cases of sexual assault [5]. However, due to the lack of recombination and, therefore, lack of statistical independence, the calculation of match probabilities is more challenging for lineage than for autosomal markers [6]. In particular, when considering Y-STR haplotypes comprising up to 17 loci [7], the proportion of cases involving singletons, defined as haplotypes observed only once in a reference database augmented by the suspect profile, may become so large that use of traditional count estimates of the corresponding match probabilities becomes unsatisfactory.

To detail the inference problem arising with singleton haplotypes, let us assume that a reference database of size n is given, and that a trace and suspect carry a new haplotype not yet observed in the database. Initially, the count estimator 1/(n + 1) was used to derive match probabilities in such cases. However, this estimator is rather conservative because it is limited from below by