The Application of Statistical Methods to Cognate Docking: A Path
Forward?
Paul C. D. Hawkins,* Brian P. Kelley, and Gregory L. Warren
OpenEye Scientific Software, 9 Bisbee Court, Suite D, Santa Fe, New Mexico 87508, United States
ABSTRACT: Cognate docking has been used as a test for
pose prediction quality in docking engines for decades. In this
paper, we report a statistically rigorous analysis of cognate
docking performance using tools in the OpenEye docking
suite. We address a number of critically important aspects of
the cognate docking problem that are often handled poorly:
data set quality, methods of comparison of the predicted pose
to the experimental pose, and data analysis. The focus of the
paper lies in the third problem, extracting maximally predictive
knowledge from comparison data. To this end, we present a
multistage protocol for data analysis that, by combining
classical null-hypothesis significance testing with effect size
estimation, provides crucial information about quantitative
differences in performance between methods as well as the probability of finding such differences in future experiments. We
suggest that developers of software and users of software have different levels of interest in different parts of this protocol, with
users being primarily interested in effect size estimation while developers may be most interested in statistical significance. This
protocol is completely general and therefore will provide the basis for method comparisons of many different kinds.
■ INTRODUCTION
Cognate or self-docking is a widely reported method for
assessing the quality of docking engines.1-16 In these
experiments, a ligand is extracted from a protein complex
and, using a particular tool, is posed and scored back into its
own binding site. The quality of the tool is then estimated by
some measure of deviation between the predicted pose and the
experimental pose. The underlying assumption for the
relevance of such studies to the quotidian use of docking
engines, where many different ligands are posed and scored in
a single binding site (commonly known as cross-docking), is
that successful performance in self-docking is a necessary but
not sufficient condition for successful performance in cross-
docking. This is a prediction that is rarely, if ever, tested;
however, testing this prediction in a rigorous way, while
valuable, is beyond the scope of this paper.
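The deviation between poses is almost universally reported as the heavy-atom RMSD, computed in the fixed frame of the protein (no superposition of the ligand is performed, since the binding site anchors the coordinate system). As a minimal sketch of that core calculation (the function name and array convention are ours, not from this study), assuming the two poses arrive as matched N × 3 coordinate arrays with atom correspondence already resolved:

```python
import numpy as np

def pose_rmsd(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """Heavy-atom RMSD between two poses of the same ligand.

    Both arrays hold (N, 3) coordinates in the protein's frame of
    reference; no superposition is performed, because cognate docking
    compares poses within a fixed binding site.
    """
    if predicted.shape != experimental.shape:
        raise ValueError("poses must have matching atom counts")
    diff = predicted - experimental
    return float(np.sqrt((diff * diff).sum(axis=1).mean()))
```

A production implementation must additionally enumerate symmetry-equivalent atom mappings (a flipped phenyl ring or carboxylate, for example) and report the minimum RMSD over them; standard cheminformatics toolkits provide this automorphism handling.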
We ourselves have recently reported on self-docking
experiments with our docking tools.8,17 It may then be germane
to ask “why yet another self-docking study?” Our goals for this
study were twofold. First, we wished to answer the question
“have the changes made to our docking tools improved their
performance?” with some confidence in the magnitude of the
differences between older and newer versions, which required
us to develop a new, more statistically rigorous method for the
analysis of pose prediction data. In comparisons of different
types of docking methodology, the utility of self-docking experiments
is concrete and immediate: if performed carefully and analyzed
correctly, these experiments can provide clear-cut answers
concerning the differences between methods. Second, we
wished to use the pose prediction data to exemplify our general
approach to method comparison, which is much more
informative and rigorous than approaches currently in wide
use in the community. We should note at this juncture that
while we focus here exclusively on cognate docking
performance, virtual screening experiments are another entirely
legitimate method of evaluating docking engines. However,
the levels of performance of the same tool in the two different
areas can often be quite different.6 Therefore, optimization of a
scoring function for cognate docking performance will not
necessarily produce corresponding improvements in virtual
screening performance. The process we outline and use in this
paper can be applied as-is to virtual screening data; we have
used pose prediction data simply to exemplify the process and
to provide concrete outputs.
Since there are different types of information to be gathered
from a comparison that are of use to different interest groups,
our process consists of multiple stages, each examining different
aspects of the comparison. In our view, developers of tools and
users of tools have different, but equally legitimate, goals
when analyzing comparison data. We suggest that the minimum
size of an improvement of interest to users is substantially
greater than the minimum size of an improvement of interest to
developers. In software development, large improvements in
the function of a tool usually result from the accumulation of a
number of small changes, so determining with confidence that a
small change is a genuine improvement is of particular value to developers.
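To make the two stages of such an analysis concrete, the sketch below is illustrative only: the choice of an exact McNemar test for the significance stage and a paired bootstrap interval for the effect size stage is our assumption of one reasonable pairing of techniques, not the protocol of this paper, and the data are synthetic placeholders rather than results from this study. It compares per-complex success/failure outcomes (success defined, say, as RMSD ≤ 2.0 Å) for an older and a newer version of a tool run on the same set of complexes.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# Synthetic placeholder data: per-complex success (RMSD <= 2.0 A) for the
# same 500 complexes docked with an old and a new version of a tool.
old = rng.random(500) < 0.70
new = rng.random(500) < 0.76

# Stage 1: statistical significance via an exact McNemar test, which
# considers only the discordant pairs (complexes where the versions differ).
b = int(np.sum(old & ~new))            # old succeeded, new failed
c = int(np.sum(~old & new))            # new succeeded, old failed
p_value = min(1.0, 2.0 * binom.cdf(min(b, c), b + c, 0.5))

# Stage 2: effect size as the difference in success rates, with a paired
# bootstrap 95% confidence interval (resampling complexes, not outcomes).
delta = new.mean() - old.mean()
idx = rng.integers(0, len(old), size=(10_000, len(old)))
boot = new[idx].mean(axis=1) - old[idx].mean(axis=1)
lo95, hi95 = np.percentile(boot, [2.5, 97.5])

print(f"McNemar p = {p_value:.4f}")
print(f"success-rate gain = {delta:+.3f}, 95% CI [{lo95:+.3f}, {hi95:+.3f}]")
```

Because both versions are evaluated on the same complexes, the McNemar test conditions on the discordant pairs and is considerably more sensitive than treating the two success rates as independent samples; the bootstrap interval then reports how large the improvement is, which is the quantity of primary interest to users.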