International Journal on Recent and Innovation Trends in Computing and Communication
ISSN: 2321-8169, Volume 2, Issue 10, pp. 3161–3166
IJRITCC | October 2014, Available @ http://www.ijritcc.org

Efficient K-Mean Clustering Algorithm for Large Datasets using Data Mining Standard Score Normalization

Sudesh Kumar, Computer Science and Engineering, BRCM CET Bahal, Bhiwani (Haryana), India, ksudesh@brcm.edu
Nancy, Computer Science and Engineering, BRCM CET Bahal, Bhiwani (Haryana), India, Nancypubreja09@gmail.com

Abstract—This paper introduces clustering and data mining techniques. Data mining is useful for extracting useful information from a large database/dataset. To extract this information efficiently, data mining normalization techniques can be used: Min-Max, Z-Score, and Decimal Scaling normalization. With normalization, mining makes searching the data easier. This paper proposes an efficient K-Mean clustering algorithm that generates clusters in less time. Cluster analysis seeks to identify homogeneous groups of objects based on the values of their attributes. The Z-Score normalization technique has been used together with the clustering concept. A dataset with a large number of records has been generated and used to analyze the results. The existing algorithm has been analyzed with the WEKA tool, and the proposed algorithm has been implemented in C#.NET. The results have been analyzed by generating timing comparison graphs, and the proposed work shows efficiency in terms of time and computation.

Keywords—Normalization, Data Mining, Clustering, Modified K-Mean, Centroids

I.
INTRODUCTION

Data mining technology gives the user the ability to extract meaningful patterns from large databases. After data mining has been applied, data analysis is needed to examine the results and assess mining efficiency. Data analysis (DA) is an efficient method of analyzing large sets of data in a variety of fields, for internal, external, and forensic audits. Most DA engagements involve working on existing data extracted by the IT departments of the audit client. Preparing the data for analysis can be a time-intensive task. Data mining (DM) is defined as the process of automatically searching large volumes of data for patterns such as association rules. It is a generic term used to describe a variety of tasks involving the analysis of data. Data analysis is an analytical and problem-solving process that identifies and interprets relationships among variables. It is used primarily to analyze data based on predefined relationships, while DM, as it pertains to computer science, is used to identify new relationships in an otherwise bland dataset. More often than not, DA is regarded as the knowledge needed to operate one of the DA tools, e.g., Microsoft Excel. Like auditing, DA needs a specific mindset as opposed to merely the capability to use a given tool. It requires an analytical and problem-solving mindset with the ability to identify and interpret the relationships among the variables. Successfully solving a DA problem requires a deep understanding of the definition and application of the various elements of DA. To analyze the results properly, data transformation is a core concept of data mining.

Preprocessing: The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
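The effect of measurement units described above can be illustrated with a small sketch (the attribute names and values below are illustrative, not drawn from the paper's dataset): the same two points compared under different units give very different distances, which is exactly why distance-based methods need normalization.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two people described by (height, weight).
# With height in metres and weight in kilograms, the weight
# difference (5) dwarfs the height difference (0.3):
p1_m_kg = (1.60, 60.0)
p2_m_kg = (1.90, 65.0)
print(euclidean(p1_m_kg, p2_m_kg))

# The same two people, but height expressed in centimetres:
# now the height difference (30) dominates instead, so the
# "nearest neighbour" structure of a dataset can change with units.
p1_cm_kg = (160.0, 60.0)
p2_cm_kg = (190.0, 65.0)
print(euclidean(p1_cm_kg, p2_cm_kg))
```

Normalizing both attributes to a common range removes this dependence on the unit chosen.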
In general, expressing an attribute in smaller units leads to a larger range for that attribute, and thus tends to give the attribute greater effect or "weight." To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data so that it falls within a smaller, common range such as [-1, 1] or [0.0, 1.0]. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. If the neural network back-propagation algorithm is used for classification mining, normalizing the input values of each attribute measured in the training tuples helps speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when no prior knowledge of the data is given [1]. There are many methods for data normalization. The normalization techniques used in data mining are elaborated as follows:

a. Min-max normalization
This performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v_i of A to v'_i in the range [new_min_A, new_max_A] by computing:

v'_i = ((v_i − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside the original data range of A. For example, suppose the minimum and maximum values of the attribute income are $12,000 and $98,000, respectively, and income is to be mapped to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716.

b.
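The min-max formula and the income example above can be sketched in a few lines of Python (the function name is ours, chosen for illustration):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    if not (min_a <= v <= max_a):
        # The out-of-bounds case mentioned in the text: a future value
        # outside the original range of A cannot be mapped safely.
        raise ValueError("out-of-bounds: v lies outside the original range of A")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The paper's income example: $73,600 in [$12,000, $98,000] -> [0.0, 1.0]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

Note that the endpoints of the original range always map exactly to the endpoints of the new range, which is how min-max normalization preserves the relationships among the original values.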
Standard Score Normalization (Z-Score)
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value v_i of A is normalized to v'_i by computing:

v'_i = (v_i − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation, respectively, of attribute A.
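A minimal sketch of the z-score formula above, applied to a whole attribute column (the income values are illustrative, and the population standard deviation is assumed):

```python
import statistics

def z_score_normalize(values):
    """Zero-mean normalization: (v - mean) / std for every value of the attribute."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation of A
    return [(v - mean) / std for v in values]

# An illustrative income attribute, normalized so that it has
# mean 0 and standard deviation 1 regardless of its original range.
incomes = [12_000, 30_000, 54_000, 73_600, 98_000]
normalized = z_score_normalize(incomes)
print([round(z, 2) for z in normalized])
```

Because the result has zero mean and unit standard deviation, z-score normalization is useful when the actual minimum and maximum of A are unknown or when outliers would distort a min-max mapping.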