RESEARCH PAPER
International Journal of Recent Trends in Engineering, Vol 1, No. 1, May 2009
Novel Method for Improving the Exact Matching
of the Molecular Graphs
C S Chowdary and Pinaki Mitra
Department of Computer Science and Engineering
IIT Guwahati, Guwahati, India-781039
Email: {c.chowdary, pinaki}@iitg.ernet.in
Abstract— One way of determining the chemical and biological
reactivity of a newly found compound is to search a
database for structurally similar molecules. Graph theory
concepts are widely used for molecular matching. Molecular
matching is of two kinds: complete matching and
partial matching (such as searching for functional groups). In
this paper we propose an efficient way of pruning large
molecular databases in several stages. Coarse filtering,
which uses bit-string manipulation, histogram filtering and
dimensionality reduction, first prunes some or most of the
database molecules. Fine filtering then applies a more
expensive graph isomorphism algorithm to the remaining
molecules in order to find the exact match for the query
molecule. In this way the search time can be reduced
significantly.
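The two-stage idea can be illustrated with a minimal Python sketch. It is not the implementation proposed in this paper: the molecule representation, the helper names (make_molecule, coarse_filter, exact_search) and the use of atom and bond histograms as the only coarse filter are assumptions made for illustration, and the exact test is a brute-force search that is practical only for very small graphs.

```python
from collections import Counter
from itertools import permutations

def make_molecule(atoms, bond_list):
    """Hypothetical representation: (list of element symbols,
    dict mapping frozenset({i, j}) -> bond order)."""
    return (list(atoms), {frozenset((i, j)): order for i, j, order in bond_list})

def histogram(mol):
    """Counts of atom labels and of bond orders; a cheap graph invariant."""
    atoms, bonds = mol
    return Counter(atoms), Counter(bonds.values())

def coarse_filter(query, database):
    """Keep only molecules whose histograms equal those of the query."""
    q = histogram(query)
    return [m for m in database if histogram(m) == q]

def is_isomorphic(a, b):
    """Brute-force labeled graph isomorphism (exponential time)."""
    atoms_a, bonds_a = a
    atoms_b, bonds_b = b
    n = len(atoms_a)
    if n != len(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    for perm in permutations(range(n)):
        # The vertex mapping must preserve atom labels.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        # The mapped bonds must coincide with the bonds of b, labels included.
        mapped = {frozenset(perm[i] for i in bond): order
                  for bond, order in bonds_a.items()}
        if mapped == bonds_b:
            return True
    return False

def exact_search(query, database):
    """Coarse filtering first, then the expensive exact test on survivors."""
    return [m for m in coarse_filter(query, database) if is_isomorphic(query, m)]

# Heavy-atom skeletons only (hydrogens omitted for brevity).
ethanol = make_molecule(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
dimethyl_ether = make_molecule(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
acetaldehyde = make_molecule(["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])
ethanol_relabeled = make_molecule(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])

database = [dimethyl_ether, acetaldehyde, ethanol_relabeled]
survivors = coarse_filter(ethanol, database)   # acetaldehyde is pruned here
matches = exact_search(ethanol, database)      # only ethanol_relabeled remains
```

In this toy database the histogram filter removes acetaldehyde without any isomorphism test, while dimethyl ether passes the coarse stage and is rejected only by the exact test. This shows why the cheap filters can prune the database but cannot replace the final isomorphism check.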
I. INTRODUCTION
In chemical graph theory and in mathematical
chemistry, a molecular graph or chemical graph is a
representation of the structural formula of a chemical
compound in terms of graph theory. A chemical graph is
a labeled graph whose vertices correspond to the atoms of
the compound and edges correspond to chemical bonds.
Its vertices are labeled with the kinds of the
corresponding atoms and edges are labeled with the types
of bonds. Searching for similar compounds is a
fundamental task in many applications of biology such as
drug discovery. As similar molecules tend to have similar
biological properties, the search for molecular similarity
plays an important role in chemistry and in biology, e.g.,
in protein-ligand docking, the prediction of biological
activity, reaction site modeling, and the interpretation of
molecular spectra. The assumption that molecules
which are similar in structure and shape exhibit
similar biological activity in the same environment is
generally valid. This property is often described by saying
that molecules exhibit neighborhood behavior.
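As an illustration of the labeled-graph view described above, the following Python sketch stores atoms as labeled vertices and bonds as labeled edges. The class name MolecularGraph, its methods, and the string bond labels are hypothetical choices for this example, not a representation prescribed by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class MolecularGraph:
    atoms: dict = field(default_factory=dict)  # vertex id -> element symbol
    bonds: dict = field(default_factory=dict)  # frozenset({u, v}) -> bond type

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, u, v, bond_type):
        if u == v:
            raise ValueError("a bond must join two distinct atoms")
        if u not in self.atoms or v not in self.atoms:
            raise KeyError("both endpoints must be added as atoms first")
        # Bonds are undirected, so the key is an unordered pair.
        self.bonds[frozenset((u, v))] = bond_type

    def neighbors(self, u):
        """Return sorted (neighbor id, bond type) pairs for atom u."""
        result = []
        for bond, bond_type in self.bonds.items():
            if u in bond:
                (v,) = bond - {u}
                result.append((v, bond_type))
        return sorted(result)

# Formaldehyde, H2C=O: one carbon, one oxygen, two hydrogens.
formaldehyde = MolecularGraph()
for idx, element in enumerate(["C", "O", "H", "H"]):
    formaldehyde.add_atom(idx, element)
formaldehyde.add_bond(0, 1, "double")
formaldehyde.add_bond(0, 2, "single")
formaldehyde.add_bond(0, 3, "single")

carbon_neighbors = formaldehyde.neighbors(0)
```

Here carbon_neighbors lists the oxygen with a double bond and the two hydrogens with single bonds, which is exactly the information carried by the vertex and edge labels of the chemical graph.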
Many new chemical compounds are currently being
invented, and it is important to find their properties. When a
new compound is found, it is easy to find its geometric
structure. But finding the chemical and biological reactivity
of a new compound through various experimental
studies (such as testing the X-ray crystallographic structure
of the compound) is both costly and time consuming.
One way to overcome this problem is to estimate the
behavior of the new compound by exploiting the
neighborhood behavior between compounds. That is,
for the new compound one has to find structurally
similar compounds whose properties are
known. A large database of compounds with
known properties is currently available. So, the
problem now boils down to retrieving from the database
the compounds that are structurally similar to the given compound.
Similarity finding can be done in many ways, such as
descriptor-based similarity methods and methods that apply
graph theory concepts to molecular structures. For
applying graph theory concepts, the molecular structure is
seen as a graph where atoms are the vertices and bonds
among them are the edges. A complete review of
descriptor-based similarity searching methods and their
disadvantages can be found in [1]. One of
the best-known techniques for measuring the similarity
between molecules based on the structure description is
fingerprint-based comparison. In this approach, a
molecule is represented as a bit-string, each bit indicating
the presence or absence of an atom or a predefined
molecular substructure known as a key descriptor or
fingerprint [2]. The similarity between two molecules is then
determined by comparing their corresponding bit-strings.
Also, the combination of numerical vector methods and
fingerprint methods has been used as a mathematical
extension of bit-comparison methods [3], [4]. Although
these methods are simple and easy to implement, they
depend on the selected key descriptors and do not
guarantee accurate results.
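A small Python sketch may make the fingerprint idea concrete. The key list KEYS, the feature sets assigned to each molecule, and the choice of the Tanimoto coefficient as the bit-string comparison are assumptions for illustration. The text above does not fix a particular descriptor set or similarity measure.

```python
# Hypothetical key descriptors: bit i is set when feature KEYS[i] is present.
KEYS = ["C", "N", "O", "S", "Cl", "C=O", "O-H", "N-H"]

def fingerprint(features):
    """Encode a set of feature names as an integer used as a bit-string."""
    bits = 0
    for i, key in enumerate(KEYS):
        if key in features:
            bits |= 1 << i
    return bits

def tanimoto(a, b):
    """Shared set bits divided by bits set in either fingerprint."""
    both = bin(a & b).count("1")
    either = bin(a | b).count("1")
    return both / either if either else 1.0

# Feature sets written by hand for this example.
ethanol = fingerprint({"C", "O", "O-H"})
methanol = fingerprint({"C", "O", "O-H"})
acetic_acid = fingerprint({"C", "O", "C=O", "O-H"})
methylamine = fingerprint({"C", "N", "N-H"})

sim_acid = tanimoto(ethanol, acetic_acid)    # 3 shared bits of 4 -> 0.75
sim_amine = tanimoto(ethanol, methylamine)   # 1 shared bit of 5 -> 0.2
sim_same = tanimoto(ethanol, methanol)       # identical bit-strings -> 1.0
```

With these keys, ethanol and methanol receive identical bit-strings although they are different molecules. This illustrates how the result depends on the selected key descriptors and why a fingerprint match alone does not establish an exact structural match.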
Many methods have been proposed for applying graph
theory concepts to molecular matching. This kind of
molecular matching is primarily of two types: complete
graph matching and partial graph matching. The complete
matching of molecules using graph theory concepts can be
considered as a graph isomorphism problem, whereas
partial matching can be considered as a sub-graph
isomorphism problem. Sub-graph isomorphism is an
NP-complete problem [5]. For the graph isomorphism
problem, however, it is still not known whether it is
NP-complete or not. But for special and
restricted graphs such as planar, bounded valence, and
bounded color graphs, graph isomorphism has been proved to
have a polynomial time solution. Many methods have been
proposed for partial molecular matching using sub-graph
isomorphism techniques. One of the best survey
articles on maximum common sub-graph searching
algorithms which are applicable to chemical structure
© 2009 ACADEMY PUBLISHER