International Journal on Advances in Computing and Communication Technologies, Volume 2, Issue 1, 2013

Application of CURE Data Clustering Algorithm to Batangas State University Student Database

Nguyen Thi Linh
Department of Information Technology
ICT University – Thai Nguyen University
Thai Nguyen, Vietnam

Christopher Chua
Department of Informatics and Computing Sciences
Batangas State University
Batangas City, Philippines

Abstract: Clustering is regarded as one of the most complex, best-known, and most studied problems in data mining theory. Data clustering is the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Increasing enrolment at Batangas State University (BatStateU) has enlarged the student database, which can be mined to discover patterns. The extracted patterns can be converted into understandable information that is useful to the organization. A popular data clustering algorithm known as Clustering Using Representatives (CURE) was implemented in the C# programming language to cluster the student database of Batangas State University.

Keywords: CURE algorithm, data clustering, data mining

I. INTRODUCTION

Data mining is one of the main steps in the process of knowledge discovery. It is considered a complex process in which intelligent methods are applied in order to extract data patterns [1]. It integrates techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial or temporal data analysis. Research on data mining methods remains a central subject for researchers and scientists.
Given the vast and diverse range of information resources, a single general method for data mining is not feasible, because each kind of information resource or database has its own appropriate mining methods. The main objective of researchers is therefore to find effective data mining methods for each case.

One of the most complex, best-known, and most studied problems in data mining theory is clustering. The term refers to the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters [2]. As noted by Ma and Wu [3], dissimilarities are assessed based on the attribute values describing the objects, usually by means of distance measures.

CURE is an agglomerative algorithm of the hierarchical family that builds clusters gradually. It identifies clusters by using c representative points, which are created by choosing well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction α [4]. The parameter α can also be used to control the shapes of clusters. A smaller value of α contracts the scattered points very little and thus favors elongated clusters. With larger values of α, on the other hand, the scattered points are located closer to the mean, and clusters tend to be more compact [4]. In each iteration, the two clusters with the closest pair of representative points are merged, until the desired number of clusters is reached. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers.

This paper pursues the following objectives:

1. Characterize the student database of BatStateU.
2. Develop a database clustering application using the CURE algorithm.
3. Utilize the developed application to cluster the student database of BatStateU.
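The shrinking step and the merge criterion of CURE described earlier in this section can be written compactly. The notation below (p, m_u, r, Rep) is introduced here for illustration only and does not appear in the paper, and the Euclidean norm is an assumed choice of distance measure.

```latex
% p: a well-scattered point of cluster u;  m_u: mean of cluster u;  0 <= alpha <= 1
r = p + \alpha\,(m_u - p)

% alpha = 0 leaves p unchanged; alpha = 1 collapses it onto the mean.
% Example: p = (4, 0), m_u = (0, 0), alpha = 0.25  gives  r = (3, 0).

% Distance between clusters u and v, taken over their representative sets
d(u, v) = \min_{p \in \mathrm{Rep}(u),\; q \in \mathrm{Rep}(v)} \lVert p - q \rVert
```

Read this way, a small α keeps the representatives near the cluster boundary, which is why elongated shapes are preserved, while a large α pulls them toward the mean and makes the clusters behave more like centroid-based ones.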
II. METHODOLOGY

This paper used the constructive research method to develop a data clustering application. The constructive research method deals with building an artifact (practical, theoretical, or both) that solves a domain-specific problem in order to create knowledge about how the problem can be solved (or understood, explained, or modeled) in principle [5].

The C# object-oriented programming language was used to design the interface and to implement the CURE algorithm and the functions of the application. SQL Server 2005 was used as a tool for pre-processing data, designing data tables, and implementing connections, queries, and stored procedures to ensure the interaction between the user and the application, as well as between the application and the database system.
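The core CURE procedure that such an application has to implement can be sketched as follows. This is a minimal illustration in Python rather than the C# used for the application, and it is not the authors' code. The function names (`representatives`, `cluster_distance`, `cure`), the default values of `c` and `alpha`, the Euclidean distance, and the farthest-point heuristic for choosing well-scattered points are all assumptions made for the example. The sketch also leaves out the random sampling, partitioning, and outlier handling of the published algorithm [4].

```python
import math
from itertools import combinations


def mean(points):
    """Component-wise mean (centroid) of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def representatives(points, c, alpha):
    """Pick up to c well-scattered points, then shrink them toward the mean."""
    centre = mean(points)
    candidates = list(dict.fromkeys(points))  # distinct points, order kept
    # Farthest-point heuristic (an assumption): start with the point farthest
    # from the mean, then keep adding the point farthest from those chosen.
    chosen = [max(candidates, key=lambda p: math.dist(p, centre))]
    while len(chosen) < min(c, len(candidates)):
        chosen.append(max(
            (p for p in candidates if p not in chosen),
            key=lambda p: min(math.dist(p, q) for q in chosen),
        ))
    # Shrink each scattered point toward the mean by the fraction alpha.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centre))
            for p in chosen]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: closest pair of representatives."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)


def cure(points, k, c=4, alpha=0.3):
    """Start from singleton clusters and merge until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[tuple(p)] for p in points]
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        # Find the two clusters with the closest pair of representatives.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(reps[ij[0]], reps[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        # combinations() guarantees i < j, so removing j keeps index i valid.
        del clusters[j], reps[j]
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha)
    return clusters


# Hypothetical two-dimensional data, not taken from the student database.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = cure(data, k=2)
```

The pairwise search in each iteration keeps the sketch short but scales poorly. An implementation intended for a database of realistic size would need a more efficient nearest-cluster lookup.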