International Journal of Computer Applications (0975 – 8887)
Volume 10 – No. 6, November 2010

Data Clustering Method for Discovering Clusters in Spatial Cancer Databases

Ritu Chauhan, Jamia Hamdard, Hamdard University, New Delhi
Harleen Kaur, Jamia Hamdard, Hamdard University, New Delhi
M. Afshar Alam, Jamia Hamdard, Hamdard University, New Delhi

ABSTRACT
The vast amount of hidden data in huge databases has created tremendous interest in the field of data mining. This paper discusses data analytical tools and data mining techniques for analyzing medical data as well as spatial data. Spatial data mining includes the discovery of interesting and useful patterns from spatial databases by grouping objects into clusters. This study focuses on discrete and continuous spatial medical databases, to which clustering techniques are applied to form efficient clusters. Clusters of arbitrary shape are formed if the data is continuous in nature. Furthermore, this application investigates data mining techniques such as classical clustering and hierarchical clustering on the spatial data set to generate efficient clusters. The experimental results reveal certain facts that cannot be retrieved superficially from the raw data.

General Terms
Data mining, k-means, Clustering Algorithms

Keywords
Data Mining, Clustering, K-means, Hierarchical agglomerative clustering (HAC), SEER.

1. INTRODUCTION
Recently, many commercial data mining clustering techniques have been developed, and their usage is increasing tremendously to achieve the desired goals. Researchers are working to develop fast and efficient algorithms for the abstraction of spatial medical data sets. Cancer has become one of the leading causes of death in India. An analysis of the most recent data has shown that over 7 lakh new cases of cancer and 3 lakh deaths due to cancer occur annually in India [1].
Furthermore, cancer is a preventable disease if it is identified at an early stage. There are various sites of cancer, such as oral, stomach, liver, lungs, kidney, cervix, prostate, testis, bladder and many others. Clinical data has grown enormously over the past decades, so proper data analysis techniques are required for more sophisticated data exploration. In this study, we use different data mining techniques to make effective use of clinical data. The objective of this paper is to explore several data mining techniques on clinical and spatial data sets.

Data mining, also known as knowledge discovery in large databases, is the process of extracting hidden relevant patterns, information and regularities from large databases. It is an emerging field that is currently used in marketing, surveillance, fraud detection, human-factor-related issues, medical pattern detection and scientific discovery. Data mining techniques include pattern recognition, clustering, association and classification. The proposed work focuses on challenges related to clustering on medical spatial datasets. Clustering is the unsupervised classification of patterns into clusters [2]. Fast algorithms for clustering large data sets have recently been developed, such as DBSCAN, CLARANS, BIRCH and STING [3], [4], [5], [6]. Several facts have been gathered during the series of experiments.

This paper is organized as follows. Section 2 discusses related work on clustering algorithms. Section 3 presents the experimental analysis on spatial medical datasets. Conclusions are presented in the last section.

2. Clustering Algorithms
The user community has placed a lot of emphasis on developing fast algorithms for clustering large data sets [13]. Clustering is a technique by which similar objects are grouped together.
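
As a minimal illustration of what grouping similar objects means in practice, the sketch below implements the K-means iteration, one of the partitioning methods considered in this paper, in plain Python. The function name `kmeans`, its parameters, the use of Euclidean distance, and the choice of the first k points as initial centers are assumptions made for illustration only; they are not taken from the experiments reported here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, max_iter=100):
    """Group points into k clusters (illustrative sketch, not the paper's code)."""
    # Assumption: the first k points act as the initial centers, standing in
    # for the arbitrary choice of centers that a real run would make.
    centers = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its current members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:  # stop once the means no longer change
            break
        centers = new_centers
    return labels, centers


# Hypothetical two-dimensional sample: two well-separated groups of points.
sample = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1)]
sample_labels, sample_centers = kmeans(sample, k=2)
```

On this hypothetical sample the points near (1, 1) receive one label and the points near (8, 8) the other. Note that k has to be supplied by the caller, and that a different choice of initial centers may lead to a different grouping.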
Clustering algorithms can be classified into several categories, such as partitioning-based clustering, hierarchical algorithms, density-based clustering and grid-based clustering. Nowadays, huge amounts of data are gathered from remote sensing, medical data, geographic information systems, the environment, etc., so every day we are left with an enormous amount of data that requires proper analysis. Data mining is one of the emerging fields used for proper decision making and for utilizing these resources in the analysis of data. Several studies have focused on medical decision making [14] [15]. Data clustering techniques that have been used extensively are:

2.1 Partitioning Based Clustering
The K-means algorithm is a classical clustering method that is used to group large datasets into clusters [8][16]. It is an unsupervised classification method that seeks optimal clusters. The algorithm is usually considered a partitioning clustering method, and it works as follows. It arbitrarily chooses the initial cluster centers; each object is then assigned to the cluster whose center it is most similar to. The cluster means are then recomputed, and these steps are repeated until there is no change. The disadvantages of the K-means method are that the number of clusters must be specified in advance and that it cannot generate clusters of different shapes. Given the above disadvantages, the silhouette value, also known as the silhouette width, gives a sort of compactness