Journal of Scientific & Industrial Research
Vol. 71, September 2012, pp. 594-600

A study of digital mammograms by using clustering algorithms

S M Halawani¹, M Alhaddad² and A Ahmad¹*

Department of Information Technology, Faculty of Computing and Information Technology, King Abdul Aziz University, ¹Rabigh, ²Jeddah, Saudi Arabia

Received 12 April 2012; revised 05 August 2012; accepted 06 August 2012

*Author for correspondence
E-mail: amirahmad01@gmail.com

This study applies clustering algorithms to digital mammograms. Probabilistic clustering algorithms performed better than the hierarchical clustering algorithm. Clustering results are competitive with classification results, indicating that clustering algorithms can be used as an important tool to study digital mammograms. Probabilistic clustering algorithms can also be used by radiologists to improve their prediction accuracy.

Keywords: Classification, Clustering, K-means clustering, Mammograms, Mixed datasets, Probabilistic

Introduction

Breast cancer is the most common form of cancer amongst women [1]. Early and accurate detection of breast cancer leads to longer patient survival [2]. Machine learning techniques are being used to improve diagnostic capability for breast cancer [3-7]. Various classification techniques (decision trees [8], support vector machines [9], fuzzy-genetic algorithms [10-12], etc.) have been used to study breast cancer datasets.

Mammography is a screening technique to detect breast cancer at an early stage [13,14]. However, radiologists do not always give correct results. Generally, a high false-positive rate [a benign (B) tumor is predicted as a malignant (M) tumor] is associated with these predictions. Hence, various computer-aided diagnostic tools have been suggested to help radiologists. Various data mining techniques (feature selection techniques, classification techniques and clustering techniques) have been used to study digital mammograms [15-17].
A breast cancer dataset [18], which has 4 attributes [3 features related to the breast imaging reporting and data system (BI-RADS) and 1 age feature], has been used extensively to study various data mining techniques. A low number indicates B, whereas a high number indicates M. Various classification algorithms have been used to study this dataset [19,20]. However, only Polat [21] used a few clustering techniques to improve classification results. This study analyzes digital mammogram data with various clustering algorithms to understand the applicability of these algorithms to breast cancer detection.

Proposed Methodology

Digital mammogram data collected at the University of Erlangen-Nuremberg between 2003 and 2006 were studied with various clustering algorithms to assess their applicability to breast cancer detection. Experiments were carried out with various clustering algorithms and decision tree ensembles.

Clustering

Clustering involves partitioning a set of data points into non-overlapping groups, or clusters, such that points in a cluster are more similar to one another than to points in other clusters [22]. In general, clustering algorithms are classified into two categories [22,23]: hard clustering algorithms and fuzzy clustering algorithms. In hard clustering, each data point belongs to one and only one cluster, whereas in fuzzy clustering, each data point has a degree of membership in every cluster.

EM (Expectation-Maximization) Algorithm

This study used the EM algorithm [22], an iterative procedure that estimates the parameters of a multivariate probability density function in the form of a Gaussian mixture distribution with a specified number of mixture components. Every iteration of the EM algorithm includes two steps: i) Expectation-step or E-step, where the probability that each sample belongs to each mixture component is computed by using the currently available mixture parameter estimates; and ii)