The Application of Statistical Methods to Cognate Docking: A Path Forward?

Paul C. D. Hawkins,* Brian P. Kelley, and Gregory L. Warren
OpenEye Scientific Software, 9 Bisbee Court, Suite D, Santa Fe, New Mexico 87508, United States

Supporting Information available.

ABSTRACT: Cognate docking has been used as a test of pose prediction quality in docking engines for decades. In this paper, we report a statistically rigorous analysis of cognate docking performance using tools in the OpenEye docking suite. We address a number of critically important aspects of the cognate docking problem that are often handled poorly: data set quality, methods of comparison of the predicted pose to the experimental pose, and data analysis. The focus of the paper lies in the third problem: extracting maximally predictive knowledge from comparison data. To this end, we present a multistage protocol for data analysis that, by combining classical null-hypothesis significance testing with effect size estimation, provides crucial information about quantitative differences in performance between methods as well as the probability of finding such differences in future experiments. We suggest that developers of software and users of software have different levels of interest in different parts of this protocol, with users being primarily interested in effect size estimation while developers may be most interested in statistical significance. This protocol is completely general and therefore will provide the basis for method comparisons of many different kinds.

INTRODUCTION

Cognate or self-docking is a widely reported method for assessing the quality of docking engines.1-16 In these experiments, a ligand is extracted from a protein complex and, using a particular tool, is posed and scored back into its own binding site. The quality of the tool is then estimated by some measure of deviation between the predicted pose and the experimental pose. The underlying assumption for the relevance of such studies to the quotidian use of docking engines, where many different ligands are posed and scored in a single binding site (commonly known as cross-docking), is that successful performance in self-docking is a necessary but not sufficient condition for successful performance in cross-docking. This is a prediction that is rarely, if ever, tested; however, testing this prediction in a rigorous way, while valuable, is beyond the scope of this paper.

We ourselves have recently reported on self-docking experiments with our docking tools.8,17 It may then be germane to ask, "Why yet another self-docking study?" Our goals for this study were twofold. First, we wished to answer the question "Have the changes made to our docking tools improved their performance?" with some confidence in the magnitude of the differences between older and newer versions, which required us to develop a new, more statistically rigorous method for the analysis of pose prediction data. In comparisons of different types of docking methodology, the utility of self-docking experiments is concrete and immediate: if performed carefully and analyzed correctly, these experiments can provide clear-cut answers concerning the differences between methods. Second, we wished to use the pose prediction data to exemplify our general approach to method comparison, which is much more informative and rigorous than approaches currently in wide use in the community.
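As a concrete illustration of the pose comparison step, the following is a minimal sketch of the most commonly reported deviation measure, heavy-atom RMSD. It assumes the two poses share an identical atom ordering and ignores molecular graph symmetry (a symmetry-corrected RMSD, as used in careful docking evaluations, requires matching equivalent atoms first); the function and names are our illustrative choices, not the comparison methods evaluated in this paper.

```python
# Minimal sketch of heavy-atom RMSD between a predicted and an experimental
# pose. Assumes identical atom ordering in the two poses and no symmetry
# correction.
import numpy as np

def pose_rmsd(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """predicted, experimental: (N, 3) arrays of heavy-atom coordinates."""
    if predicted.shape != experimental.shape:
        raise ValueError("poses must have the same number of atoms")
    diff = predicted - experimental
    return float(np.sqrt((diff * diff).sum() / len(predicted)))

# A pose is conventionally counted a "success" if its RMSD falls below a
# cutoff, commonly 2.0 Å:
# success = pose_rmsd(pred_xyz, expt_xyz) <= 2.0
```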
We should note at this juncture that while we focus here exclusively on cognate docking performance, virtual screening experiments are another entirely legitimate method of evaluating docking engines. However, the levels of performance of the same tool in the two different areas can often be quite different.6 Therefore, optimization of a scoring function for cognate docking performance will not necessarily produce corresponding improvements in virtual screening performance. The process we outline and use in this paper can be applied as-is to virtual screening data; we have used pose prediction data simply to exemplify the process and to provide concrete outputs. Since there are different types of information to be gathered from a comparison that are of use to different interest groups, our process consists of multiple stages, each examining different aspects of the comparison. In our view, developers of tools and users of tools have different, but equally legitimate, goals when analyzing comparison data. We suggest that the minimum size of an improvement of interest to users is substantially greater than the minimum size of an improvement of interest to developers. In software development, large improvements in the function of a tool usually result from the accumulation of a number of small changes, so determining with confidence that a small change is a genuine improvement is of particular value to developers.
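To make the shape of such a multistage analysis concrete, below is a minimal sketch of a paired comparison of two versions of a docking tool run on the same structures. The assumed details are ours, not the paper's protocol: per-structure RMSDs from each version, a 2.0 Å success cutoff, an exact McNemar test on the paired success/failure calls as the significance stage, and the paired difference in success rates with a normal-approximation 95% confidence interval as the effect size stage.

```python
# Minimal sketch of a two-stage method comparison on paired pose prediction
# results. Stage 1: null-hypothesis significance test (exact McNemar test on
# the discordant pairs). Stage 2: effect size (paired difference in success
# rates with an approximate 95% confidence interval). The cutoff and names
# are illustrative assumptions.
import numpy as np
from scipy.stats import binom

def compare_versions(rmsd_old, rmsd_new, cutoff=2.0):
    old = np.asarray(rmsd_old) <= cutoff   # per-structure success, old version
    new = np.asarray(rmsd_new) <= cutoff   # per-structure success, new version
    n = len(old)
    b = int(np.sum(new & ~old))            # new succeeds where old failed
    c = int(np.sum(old & ~new))            # old succeeds where new failed

    # Stage 1: exact McNemar test; under the null, the b discordant "wins"
    # among the b + c discordant pairs follow Binomial(b + c, 0.5).
    m = b + c
    p_value = min(1.0, 2.0 * binom.cdf(min(b, c), m, 0.5)) if m > 0 else 1.0

    # Stage 2: effect size, i.e. the paired difference in success rates,
    # with a Wald (normal-approximation) 95% confidence interval.
    d = (b - c) / n
    se = np.sqrt(b + c - (b - c) ** 2 / n) / n
    return p_value, d, (d - 1.96 * se, d + 1.96 * se)
```

On the division of interests suggested above, a developer would look first at the p-value to decide whether a change has any detectable effect at all, while a user would focus on the success-rate difference and its interval to judge whether the improvement is large enough to matter in practice.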