The Application of Statistical Methods to Cognate Docking: A Path Forward?

Paul C. D. Hawkins,* Brian P. Kelley, and Gregory L. Warren
OpenEye Scientific Software, 9 Bisbee Court, Suite D, Santa Fe, New Mexico 87508, United States

Supporting Information available.

ABSTRACT: Cognate docking has been used as a test of pose prediction quality in docking engines for decades. In this paper, we report a statistically rigorous analysis of cognate docking performance using tools in the OpenEye docking suite. We address a number of critically important aspects of the cognate docking problem that are often handled poorly: data set quality, methods of comparison of the predicted pose to the experimental pose, and data analysis. The focus of the paper lies in the third problem: extracting maximally predictive knowledge from comparison data. To this end, we present a multistage protocol for data analysis that, by combining classical null-hypothesis significance testing with effect size estimation, provides crucial information about quantitative differences in performance between methods as well as the probability of finding such differences in future experiments. We suggest that developers of software and users of software have different levels of interest in different parts of this protocol, with users being primarily interested in effect size estimation while developers may be most interested in statistical significance. This protocol is completely general and therefore will provide the basis for method comparisons of many different kinds.

INTRODUCTION

Cognate or self-docking is a widely reported method for assessing the quality of docking engines.1-16 In these experiments, a ligand is extracted from a protein complex and, using a particular tool, is posed and scored back into its own binding site. The quality of the tool is then estimated by some measure of deviation between the predicted pose and the experimental pose. The underlying assumption for the relevance of such studies to the quotidian use of docking engines, where many different ligands are posed and scored in a single binding site (commonly known as cross-docking), is that successful performance in self-docking is a necessary but not sufficient condition for successful performance in cross-docking. This is a prediction that is rarely, if ever, tested; however, testing this prediction in a rigorous way, while valuable, is beyond the scope of this paper.

We ourselves have recently reported on self-docking experiments with our docking tools.8,17 It may then be germane to ask, "Why yet another self-docking study?" Our goals for this study were twofold. First, we wished to answer the question "Have the changes made to our docking tools improved their performance?" with some confidence in the magnitude of the differences between older and newer versions, which required us to develop a new, more statistically rigorous method for the analysis of pose prediction data. In comparisons of different types of docking methodology, the utility of self-docking experiments is concrete and immediate: if performed carefully and analyzed correctly, these experiments can provide clear-cut answers concerning the differences between methods. Second, we wished to use the pose prediction data to exemplify our general approach to method comparison, which is much more informative and rigorous than approaches currently in wide use in the community.
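As a concrete illustration of the pose comparison step, the following is a minimal sketch of the most commonly reported deviation measure, heavy-atom RMSD. It assumes the two poses share an identical atom ordering and ignores molecular graph symmetry (a symmetry-corrected RMSD, as used in careful docking evaluations, requires matching equivalent atoms first); the function and names are our illustrative choices, not the comparison methods evaluated in this paper.

```python
# Minimal sketch of heavy-atom RMSD between a predicted and an experimental
# pose. Assumes identical atom ordering in the two poses and no symmetry
# correction.
import numpy as np

def pose_rmsd(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """predicted, experimental: (N, 3) arrays of heavy-atom coordinates."""
    if predicted.shape != experimental.shape:
        raise ValueError("poses must have the same number of atoms")
    diff = predicted - experimental
    return float(np.sqrt((diff * diff).sum() / len(predicted)))

# A pose is conventionally counted a "success" if its RMSD falls below a
# cutoff, commonly 2.0 Å:
# success = pose_rmsd(pred_xyz, expt_xyz) <= 2.0
```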
We should note at this juncture that while we focus here exclusively on cognate docking performance, virtual screening experiments are another entirely legitimate method of evaluating docking engines. However, the levels of performance of the same tool in the two different areas can often be quite different.6 Therefore, optimization of a scoring function for cognate docking performance will not necessarily produce corresponding improvements in virtual screening performance. The process we outline and use in this paper can be applied as-is to virtual screening data; we have used pose prediction data simply to exemplify the process and to provide concrete outputs. Since there are different types of information to be gathered from a comparison that are of use to different interest groups, our process consists of multiple stages, each examining different aspects of the comparison. In our view, developers of tools and users of tools have different, but equally legitimate, goals when analyzing comparison data. We suggest that the minimum size of an improvement of interest to users is substantially greater than the minimum size of an improvement of interest to developers. In software development, large improvements in the function of a tool usually result from the accumulation of a number of small changes, so determining with confidence that a small change is a genuine improvement is of particular value to developers.
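To make the shape of such a multistage analysis concrete, below is a minimal sketch of a paired comparison of two versions of a docking tool run on the same structures. The assumed details are ours, not the paper's protocol: per-structure RMSDs from each version, a 2.0 Å success cutoff, an exact McNemar test on the paired success/failure calls as the significance stage, and the paired difference in success rates with a normal-approximation 95% confidence interval as the effect size stage.

```python
# Minimal sketch of a two-stage method comparison on paired pose prediction
# results. Stage 1: null-hypothesis significance test (exact McNemar test on
# the discordant pairs). Stage 2: effect size (paired difference in success
# rates with an approximate 95% confidence interval). The cutoff and names
# are illustrative assumptions.
import numpy as np
from scipy.stats import binom

def compare_versions(rmsd_old, rmsd_new, cutoff=2.0):
    old = np.asarray(rmsd_old) <= cutoff   # per-structure success, old version
    new = np.asarray(rmsd_new) <= cutoff   # per-structure success, new version
    n = len(old)
    b = int(np.sum(new & ~old))            # new succeeds where old failed
    c = int(np.sum(old & ~new))            # old succeeds where new failed

    # Stage 1: exact McNemar test; under the null, the b discordant "wins"
    # among the b + c discordant pairs follow Binomial(b + c, 0.5).
    m = b + c
    p_value = min(1.0, 2.0 * binom.cdf(min(b, c), m, 0.5)) if m > 0 else 1.0

    # Stage 2: effect size, i.e. the paired difference in success rates,
    # with a Wald (normal-approximation) 95% confidence interval.
    d = (b - c) / n
    se = np.sqrt(b + c - (b - c) ** 2 / n) / n
    return p_value, d, (d - 1.96 * se, d + 1.96 * se)
```

On the division of interests suggested above, a developer would look first at the p-value to decide whether a change has any detectable effect at all, while a user would focus on the success-rate difference and its interval to judge whether the improvement is large enough to matter in practice.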