RESEARCH PAPER
International Journal of Recent Trends in Engineering, Vol 1, No. 1, May 2009 254

Novel Method for Improving the Exact Matching of the Molecular Graphs

C S Chowdary and Pinaki Mitra
Department of Computer Science and Engineering
IIT Guwahati, Guwahati, India-781039
Email: {c.chowdary, pinaki}@iitg.ernet.in

Abstract: One way of determining the chemical and biological reactivity of a newly found compound is to search a database for structurally similar molecules. Graph theory concepts are used for this kind of molecular matching. Molecular matching is of two kinds: complete matching and partial matching (such as searching for functional groups). In this paper we propose an efficient way of pruning large molecular databases in stages. A coarse filtering stage uses bit-string manipulation, histogram filtering, and dimensionality reduction to prune some or most of the database molecules. A fine filtering stage then applies a more expensive graph isomorphism algorithm to the remaining molecules in order to find the exact match for the query molecule in the pruned database. In this way the search time can be reduced significantly.

I. INTRODUCTION

In chemical graph theory and in mathematical chemistry, a molecular graph or chemical graph is a representation of the structural formula of a chemical compound in terms of graph theory. A chemical graph is a labeled graph whose vertices correspond to the atoms of the compound and whose edges correspond to chemical bonds. Its vertices are labeled with the kinds of the corresponding atoms, and its edges are labeled with the types of bonds. Searching for similar compounds is a fundamental task in many applications of biology, such as drug discovery.
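The labeled-graph representation described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `MolecularGraph`, `build`, and `may_match` are hypothetical, hydrogens are omitted for brevity, and the histogram comparison is only one assumed form of coarse check (equal counts of atom labels and bond labels are a necessary, but not sufficient, condition for two molecular graphs to be isomorphic).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MolecularGraph:
    """Labeled graph: vertices carry atom kinds, edges carry bond types."""
    atoms: dict = field(default_factory=dict)  # vertex id -> atom symbol
    bonds: dict = field(default_factory=dict)  # frozenset({u, v}) -> bond type

    def add_atom(self, vertex, symbol):
        self.atoms[vertex] = symbol

    def add_bond(self, u, v, bond_type="single"):
        if u not in self.atoms or v not in self.atoms:
            raise KeyError("both endpoints must be added as atoms first")
        self.bonds[frozenset((u, v))] = bond_type

    def atom_histogram(self):
        """Count of each atom label, e.g. Counter({'C': 2, 'O': 1})."""
        return Counter(self.atoms.values())

    def bond_histogram(self):
        """Count of each bond label."""
        return Counter(self.bonds.values())


def build(symbols, bonds):
    """Hypothetical helper: build a graph from atom symbols and (u, v, type) bonds."""
    g = MolecularGraph()
    for vertex, symbol in enumerate(symbols):
        g.add_atom(vertex, symbol)
    for u, v, bond_type in bonds:
        g.add_bond(u, v, bond_type)
    return g


def may_match(a, b):
    """Assumed coarse check: isomorphic graphs must have equal label histograms."""
    return (a.atom_histogram() == b.atom_histogram()
            and a.bond_histogram() == b.bond_histogram())


# Heavy-atom skeletons only (hydrogens omitted).
ethanol = build(["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")])
acetaldehyde = build(["C", "C", "O"], [(0, 1, "single"), (1, 2, "double")])
dimethyl_ether = build(["C", "O", "C"], [(0, 1, "single"), (1, 2, "single")])

pruned = may_match(ethanol, acetaldehyde)      # False: bond labels differ
survives = may_match(ethanol, dimethyl_ether)  # True, yet the graphs differ
```

The second comparison shows why such a check can only prune candidates: ethanol and dimethyl ether have identical label counts but different connectivity, so a pair that survives still needs an exact isomorphism test.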
As similar molecules tend to have similar biological properties, the search for molecular similarity plays an important role in chemistry and in biology, e.g., in protein-ligand docking, the prediction of biological activity, reaction site modeling, and the interpretation of molecular spectra. The assumption that molecules which are similar in structure and shape exhibit similar biological activity in the same environment is generally valid. This property is often described by saying that molecules exhibit neighborhood behavior.

Many new chemical compounds are currently being discovered, and it is important to find their properties. When a new compound is found, it is easy to find its geometric structure. However, finding the chemical and biological reactivity of a new compound through experimental studies (like testing the X-ray crystallographic structure of the compound) is both costly and time-consuming. One way to overcome this problem is to estimate the behavior of the new compound from the neighborhood behavior between compounds. That is, for the new compound one has to find structurally similar compounds whose properties are known. A large database of compounds with known properties is currently available, so the problem boils down to retrieving from the database the compounds that are structurally similar to the given compound.

Similarity finding can be done in many ways, such as descriptor-based similarity methods or applying graph theory concepts to molecular structures. For applying graph theory concepts, a molecular structure is seen as a graph in which atoms are the vertices and the bonds among them are the edges. A complete review of descriptor-based similarity searching methods and their disadvantages can be found in [1]. One of the best-known techniques for measuring the similarity between molecules based on the structure description is fingerprint-based comparison.
In this approach, a molecule is represented as a bit-string, each bit indicating the presence or absence of an atom or of a predefined molecular substructure, known as a key descriptor or fingerprint [2]. The similarity between two molecules is then determined by comparing their corresponding bit-strings. The combination of numerical vector methods and fingerprint methods has also been used as a mathematical extension of bit-comparison methods [3], [4]. Although these methods are simple and easy to implement, they depend on the selected key descriptors and do not guarantee accurate results.

Many methods have been proposed for applying graph theory concepts to molecular matching. This type of molecular matching is primarily of two types: complete graph matching and partial graph matching. The complete matching of molecules can be considered as a complete graph isomorphism problem, whereas partial matching can be considered as a sub-graph isomorphism problem. Sub-graph isomorphism is an NP-complete problem [5]. For the complete graph isomorphism problem, however, it is still not known whether it is NP-complete. For special and restricted graphs, such as planar, bounded-valence, and bounded-color graphs, complete isomorphism has been proved to have a polynomial-time solution.

Many methods have been proposed for partial molecular matching using the sub-graph isomorphism technique. One of the best survey articles on maximum common sub-graph searching algorithms which are applicable for chemical structure

© 2009 ACADEMY PUBLISHER