A Content Spotting System For Line Drawing Graphic Document Images
Muhammad Muzzamil Luqman*†, Thierry Brouard*, Jean-Yves Ramel* and Josep Lladós*
Laboratoire d'Informatique, Université François Rabelais de Tours, 37200 France
Computer Vision Center, Universitat Autónoma de Barcelona, 08193 Spain
Email: {brouard, ramel}@univ-tours.fr, {mluqman, josep}@cvc.uab.es

Abstract—We present a content spotting system for line drawing graphic document images. The proposed system is sufficiently domain independent and takes keyword-based information retrieval for graphic documents one step forward, to Query By Example (QBE) and focused retrieval. In the offline learning mode, we vectorize the documents in the repository, represent them by attributed relational graphs, extract regions of interest (ROIs) from them, convert each ROI to a fuzzy structural signature, cluster similar signatures to form ROI classes, and build an index for the repository. In the online querying mode, a Bayesian network classifier recognizes the ROIs in the query image, and the corresponding documents are fetched by looking them up in the repository index. Experimental results are presented for synthetic images of architectural and electronic documents.

Keywords-content spotting; graphic document retrieval; query by example; fuzzy structural signature

I. INTRODUCTION AND RELATED WORKS

Over the last few years, the graphic document research community has seen a gradual shift of attention from the hard problems of symbol recognition, segmentation and localization to the relatively softer problem of symbol spotting. An important reason behind this is the growing size of document repositories and the increasing demand from users for an efficient browsing mechanism for graphic content. The format of these documents mostly restricts users to keyword-based searching and indexing mechanisms.
Thus, a very interesting research topic is to investigate mechanisms for indexing the graphic content of these documents, in order to offer users the advantages of Query By Example (QBE) and focused retrieval.

The research surveys by Chhabra [1], Llados et al. [2], Cordella & Vento [3] and Tombre et al. [4] provide a detailed, state-of-the-art review of the work done in the field of symbol recognition over the last two decades. Graphic documents are generally handled by structural methods of pattern recognition, which rely on symbolic representations. Graphs, in one form or another, have remained a popular choice for most symbol recognition and segmentation methods because they adapt naturally to the content of these documents, but they carry the drawback of computational inefficiency. On the other hand, new developments in statistical pattern recognition offer highly efficient mathematical tools for learning and classification. Fonseca et al. [5] have presented a detailed review of content-based retrieval of technical drawings. Notable recent works on symbol spotting include a region string based method [6], a method based on graph representations and vectorial signatures [7], a chain point dendrogram based approach [8] and a shape context descriptor based approach [9]. The recent PhD dissertations of Rusinol [10] and Nguyen [11] are good contributions to the literature on symbol spotting.

We are more interested in investigating graph-based representations for symbol spotting and have selected a method from Qureshi et al. [12] for our work. This system follows a graph-based structural approach. First, it vectorizes the image into a set of quadrilateral primitives, extracts topological and geometric features, and represents the image content by an attributed relational graph (ARG).
The nodes of the graph are the quadrilateral primitives, and the arcs are the relationships between these primitives. Each node carries a relative length as its attribute, and each arc carries a relative angle and a relation type. In the second step, the system looks for potential ROIs corresponding to symbols. It detects parts of the ARG that may correspond to symbols, i.e. symbol seeds. Scores corresponding to the probability of being part of a symbol are computed for all edges and nodes of the ARG. They are based on features such as the lengths of segments, perpendicular and parallel angular relations, the degrees of nodes, etc. The symbol seeds are detected during a score propagation process. This process seeks and analyzes the different shortest paths and loops between nodes in the ARG. To obtain the symbol seeds, the scores of all the nodes belonging to a detected path are homogenized, i.e. the maximum score is propagated to all the nodes in the path until convergence. Finally, the method employs a greedy algorithm for sub-graph matching.

The system achieves good localization and spotting results. Its results have also been evaluated by Delalandre et al. [13], who concluded that this method offers high-confidence detection results without any multiple detections but lacks precision in its localization results. We argue that a content spotting and document retrieval system should above all offer a high recall rate, in which case low precision becomes tolerable. However, the underlying sub-graph matching algorithm prevents this method from scaling to huge document repositories.
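
The ARG representation and the score homogenization step described above can be illustrated with a minimal Python sketch. It is not the implementation of [12]: the names `Arg` and `homogenize`, the relation label `"L-junction"`, the numeric scores and the example paths are all hypothetical, and the initial scoring and the detection of shortest paths and loops are assumed to have been carried out beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class Arg:
    """Attributed relational graph whose nodes are quadrilateral primitives."""
    # node id -> relative length of the primitive
    rel_length: dict = field(default_factory=dict)
    # unordered node pair -> (relative angle, relation type)
    arcs: dict = field(default_factory=dict)

    def add_node(self, node, rel_length):
        self.rel_length[node] = rel_length

    def add_arc(self, u, v, rel_angle, relation):
        self.arcs[frozenset((u, v))] = (rel_angle, relation)


def homogenize(scores, paths):
    """Propagate the maximum score along each path until no score changes."""
    scores = dict(scores)  # work on a copy, leave the input untouched
    changed = True
    while changed:
        changed = False
        for path in paths:
            top = max(scores[n] for n in path)
            for n in path:
                if scores[n] < top:
                    scores[n] = top
                    changed = True
    return scores


# Hypothetical two-primitive graph (attribute values are made up).
arg = Arg()
arg.add_node("q1", 1.0)
arg.add_node("q2", 0.5)
arg.add_arc("q1", "q2", 90.0, "L-junction")

# Hypothetical initial scores and detected paths (assumed inputs).
initial = {"a": 0.2, "b": 0.9, "c": 0.4, "d": 0.1, "e": 0.3}
paths = [["c", "d"], ["a", "b", "c"]]
result = homogenize(initial, paths)
# "d" only reaches 0.9 on the second pass, after "c" has been raised;
# "e" lies on no path and keeps its score.
print(result)
```

In this sketch the loop repeats until a full pass changes nothing, which is one simple reading of "until convergence"; nodes that end up sharing a high score would then be the candidates for a symbol seed.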
2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.835 3408