International Journal of Scientific & Engineering Research, Volume 6, Issue 3, March-2015, 641
ISSN 2229-5518
IJSER © 2015 http://www.ijser.org

General method for data indexing using clustering methods

Karwan Jacksi, Sobhan Badiozamany

Abstract— Indexing plays a key role in data retrieval and search. New indexing techniques are proposed frequently to improve search performance. Some data clustering methods have previously been used for data indexing in data warehouses. In this paper, we discuss general concepts of data indexing and clustering methods that are based on representatives. We then present a general scheme for indexing using clustering methods. There are two main processing schemes in databases: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). The proposed method is specific to stationary data, as in OLAP. Given this general indexing scheme, different clustering methods are compared. We study three representative-based clustering methods: standard K-Means, Self Organizing Map (SOM) and Growing Neural Gas (GNG). Our study shows that, in this context, GNG outperforms K-Means and SOM.

Index Terms— Clustering Algorithms, K-Means, Self Organizing Map (SOM), Growing Neural Gas (GNG), Database Indexing.

1 INTRODUCTION

We review general database indexing concepts in section 1.1. We then cover spatial indexing concepts and methods in section 1.2. Clustering methods that are based on representatives are discussed in section 1.3.

1.1 Database indexing general concepts

Database indexes are supplementary access structures that speed up record lookups. Indexes provide a secondary access path to data files, meaning that they do not alter the placement of records in the main data file. An index can be put on any data field (attribute), and there can be more than one index per data file. An index can also be defined on a combination of attributes.
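
The properties above (a secondary access path that leaves the data file untouched, several indexes per file, and an index over a combination of attributes) can be illustrated with a minimal Python sketch. The record layout, the attribute names and the helper functions `build_index` and `lookup` are hypothetical and chosen only for illustration; list positions stand in for physical record addresses, and a dictionary stands in for a real index structure such as a B+-tree.

```python
from collections import defaultdict

# Hypothetical data file: the list position stands in for a physical address.
data_file = [
    {"id": 7, "city": "Zakho", "year": 2014},
    {"id": 3, "city": "Uppsala", "year": 2015},
    {"id": 9, "city": "Zakho", "year": 2015},
]


def build_index(records, *attributes):
    """Map each key (a tuple of attribute values) to the positions holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        key = tuple(record[a] for a in attributes)
        index[key].append(position)
    return index


def lookup(index, records, *key):
    """Find the positions in the index, then fetch the records from the data file."""
    return [records[p] for p in index.get(tuple(key), [])]


# Two independent indexes over the same, unmodified data file:
# one on a single attribute, one on a combination of attributes.
city_index = build_index(data_file, "city")
city_year_index = build_index(data_file, "city", "year")

zakho_ids = [r["id"] for r in lookup(city_index, data_file, "Zakho")]
zakho_2015_ids = [r["id"] for r in lookup(city_year_index, data_file, "Zakho", 2015)]
```

Building either index only reads the data file, so the records keep their original order; each index is an additional structure that can be added or dropped independently of the others.
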
Index files usually have two fields, <Key, pointer>, where the key is the value of the indexing attribute and the pointer is the physical address of the records having that value in their index field. With an index as an extra access path, a search is done in two steps: first the index structure is searched for the key, then the pointer in the index entry is followed to reach the actual record in the data file. The most common indexing methods are B+-trees and hashing indexes. B+-trees are the most common index structure in relational Database Management Systems (DBMSs) (1).

1.2 Spatial indexing methods

Many scientific applications produce multidimensional spatial datasets that are very large, both in the number of dimensions and in the number of records. Because conventional database indexing techniques are unable to index spatial datasets, considerable effort has gone into developing indexing techniques specific to them. Spatial indexing methods can be grouped into two sets: space partitioning methods and data partitioning methods.

Space partitioning methods that are based on KD-trees (2) have been shown to perform well for point data. In space partitioning, we start by inserting elements into the tree. When a node overflows, it is split at a single position along a single dimension. Data partitioning methods, based on R-trees, split the space using rectangular bounding boxes. The positions of the bounding boxes are stored in the index structure (3).

1.3 Clustering General Concepts

Clustering divides data into groups (clusters) based on the similarity between data points. The aim of grouping is either to divide data into meaningful groups or to serve as a preprocessing step, for instance to summarize data. When the intention is to find meaningful structure in the data, the clustering algorithm generates so-called natural clusters (4).
There is a variety of clustering algorithms in the literature; here we focus on clustering methods that are based on having one or more representatives for each cluster. More specifically, we focus on K-means, Self Organizing Map (SOM) and Growing Neural Gas (GNG).

1.3.1 K-Means

The K-means algorithm is one of the simplest clustering methods (4). K initial centroids (or codebooks) are chosen, where K, the number of clusters, is a parameter of the algorithm. Each data point is then assigned to the closest centroid. The centroid of each cluster is then updated to the mean (average) of all data points in the cluster. The assignment and update steps are repeated until convergence.

K-means is described in more detail in the following algorithm (4).

Select K initial centroids
Repeat
  o Form K clusters by assigning each data point

————————————————
Karwan Jacksi is currently pursuing a PhD degree program in Computer Science at the University of Zakho, Iraq and Eastern Mediterranean University, Cyprus. Tel: +90-533-852-8257. Email: Karwan.Jacksi@uoz.ac
Sobhan Badiozamany is currently pursuing a PhD degree program in Computer Science at Uppsala University, Sweden. Tel: +4670-4094664. Email: sobhan.badiozaman@gmail.com