COMPUTER-AIDED DIAGNOSIS OF MAMMOGRAPHIC MASSES USING VOCABULARY TREE-BASED IMAGE RETRIEVAL

Menglin Jiang¹, Shaoting Zhang², Jingjing Liu¹, Tian Shen³, Dimitris N. Metaxas¹

¹ Department of Computer Science, Rutgers University, Piscataway, NJ, USA
² Department of Computer Science, UNC Charlotte, Charlotte, NC, USA
³ Hwatech Medical Info-Tech Co., Xi'an, China

ABSTRACT

Computer-aided diagnosis of masses in mammograms is important to the prevention of breast cancer. Many approaches tackle this problem through content-based image retrieval (CBIR) techniques. However, most of them lack scalability in the retrieval stage, and their diagnostic accuracy is therefore restricted. To overcome this drawback, we propose a scalable method for retrieval and diagnosis of mammographic masses. Specifically, for a query mammographic region of interest (ROI), SIFT descriptors are extracted and searched in a vocabulary tree, which stores all the quantized descriptors of previously diagnosed mammographic ROIs. In addition, to fully exploit the discriminative power of SIFT descriptors, contextual information in the vocabulary tree is employed to refine the weights of tree nodes. The retrieved ROIs are then used to determine whether the query ROI contains a mass. This method has excellent scalability due to the low memory and time cost of the vocabulary tree. Retrieval precision and diagnostic accuracy are evaluated on 5005 ROIs extracted from the Digital Database for Screening Mammography (DDSM), and the results demonstrate the efficacy of our approach.

Index Terms— Mammographic masses, computer-aided diagnosis (CAD), content-based image retrieval (CBIR)

1. INTRODUCTION

For years, breast cancer has remained the leading cause of cancer-related death among women. Nevertheless, early diagnosis could improve the chances of recovery dramatically.
Currently, among all the imaging techniques for breast examination, mammography is the most effective and the only widely accepted method. Many computer-aided diagnosis (CAD) methods have been proposed to facilitate the detection of masses in mammograms, which are an important indicator of breast cancer. Most of these approaches consist of two steps, namely detection of suspicious regions and classification of these regions as mass or normal tissue [3, 4, 8, 11].

As an alternative solution, some CAD methods utilize content-based image retrieval (CBIR) techniques. Specifically, they compare the current case with previously diagnosed cases stored in a reference database, and return the most relevant cases along with the likelihood of a mass in the current case. Compared with classification-based approaches, these methods can provide more clinical evidence to assist the diagnosis, and have therefore attracted increasing attention. For example, template matching based on mutual information was utilized to retrieve mammographic regions of interest (ROIs), and similarity scores between the query ROI and its best matches were used to determine whether it contained a mass [13]. This approach was further studied using more similarity measures (such as normalized mutual information) [12]. Features related to shape, edge sharpness and texture were adopted to search for mammographic ROIs with similar masses [1]. For the same purpose, 14 image features and a k-nearest neighbor (k-NN) algorithm were applied in [18]. This method was improved by removing poorly effective ROIs from the reference database [9].

These methods have demonstrated the value of CBIR techniques in retrieval and analysis of mammographic masses. However, they did not consider scalability and were tested on at most 3200 mammographic ROIs. This drawback limited the diagnostic accuracy, since the larger a reference database is, the more likely it is that relevant cases are found and a correct decision is made [9].
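The retrieval-based diagnosis scheme shared by the methods surveyed above (compare a query ROI with previously diagnosed ROIs, return the most relevant ones, and derive a mass likelihood from them) can be illustrated with a minimal sketch. It does not reproduce any of the cited methods: the feature vectors, the use of cosine similarity, the similarity-weighted vote, and all function names are assumptions made here for illustration only.

```python
import math
from typing import List, Sequence, Tuple

# A database entry is (feature_vector, is_mass); both are hypothetical here.
Entry = Tuple[Sequence[float], bool]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], database: List[Entry], k: int) -> List[Tuple[float, bool]]:
    """Return the k most similar database entries as (similarity, is_mass)."""
    scored = [(cosine_similarity(query, feats), is_mass) for feats, is_mass in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def mass_likelihood(neighbors: List[Tuple[float, bool]]) -> float:
    """Similarity-weighted fraction of retrieved cases that contain a mass."""
    total = sum(max(sim, 0.0) for sim, _ in neighbors)
    if total == 0.0:
        return 0.5  # no evidence either way
    mass = sum(max(sim, 0.0) for sim, is_mass in neighbors if is_mass)
    return mass / total


# Toy example: two "mass" cases and two "normal" cases with 2-D features.
database: List[Entry] = [
    ([1.0, 0.0], True),
    ([0.9, 0.1], True),
    ([0.0, 1.0], False),
    ([0.1, 0.9], False),
]
neighbors = retrieve([1.0, 0.05], database, k=3)
likelihood = mass_likelihood(neighbors)
```

Note that this brute-force search scores every database entry for every query, so its cost grows linearly with the size of the reference database, which is the scalability limitation discussed above.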
In this paper, we propose to solve the above problem through a scalable image retrieval framework. Specifically, SIFT descriptors extracted from database ROIs are quantized and indexed in a vocabulary tree. To enhance the discriminative power of SIFT descriptors, statistical information about neighbor nodes in the tree is utilized to refine the weights of tree nodes following [15]. Given a query ROI, i.e., a region asserted to be a mass by another CAD system, SIFT descriptors are extracted and searched in the tree to find similar database ROIs. These ROIs, along with their similarities to the query ROI, are used to determine whether the query contains a mass or not. Such a method can retrieve from millions of images efficiently due to its low memory and computational cost.

2. PROPOSED APPROACH

In this section, we first introduce the mammographic ROI retrieval framework based on vocabulary tree, then present our