International Journal of Scientific & Engineering Research, Volume 6, Issue 3, March-2015 641
ISSN 2229-5518
IJSER © 2015
http://www.ijser.org
General method for data indexing using
clustering methods
Karwan Jacksi, Sobhan Badiozamany
Abstract— Indexing data plays a key role in data retrieval and search. New indexing techniques are proposed frequently to improve
search performance. Some data clustering methods have previously been used for data indexing in data warehouses. In this paper, we discuss
general concepts of data indexing and clustering methods that are based on representatives. We then present a general theme for
indexing using clustering methods. There are two main processing schemes in databases: Online Transaction Processing (OLTP) and
Online Analytical Processing (OLAP). The proposed method is specific to stationary data, such as the data in OLAP systems. Given this general indexing theme,
we compare different clustering methods. We study three representative-based clustering methods: standard K-Means, Self
Organizing Map (SOM), and Growing Neural Gas (GNG). Our study shows that, in this context, GNG outperforms K-Means and SOM.
Index Terms— Clustering Algorithms, K-Means, Self Organizing Map (SOM), Growing Neural Gas (GNG), Database Indexing.
—————————— ——————————
1 INTRODUCTION
We review general database indexing concepts in Section 1.1.
We then cover spatial indexing concepts and methods in
Section 1.2. Clustering methods that are based on
representatives are discussed in Section 1.3.
1.1 Database indexing general concepts
Database indexes are supplementary access structures that
make searches faster when looking up records. Indexes provide
a secondary access path to data files, meaning that they do
not alter the placement of records in the main data file. An
index can be defined on any data field (attribute) or on a
combination of attributes, and a single data file can have
more than one index. Index files usually have two fields,
<Key, pointer>, where the key is the value of the indexing
attribute and the pointer is the physical address of the
records that have that value in their index field.
With indexes as an extra access path, a search is done in two
steps: first, the index structure is searched for the key;
then, the pointer in the matching index entry is followed to
reach the actual record in the data file.
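The two-step lookup can be illustrated with a minimal Python sketch. It is not the implementation of any particular DBMS: the records, the `build_index` and `lookup` helpers, and the use of a list position as a stand-in for a physical address are all assumptions made for illustration. A sorted list searched with binary search stands in for the index structure.

```python
from bisect import bisect_left

# Hypothetical data file; the list position stands in for a physical address.
data_file = [
    {"id": 17, "city": "Zakho"},
    {"id": 4, "city": "Uppsala"},
    {"id": 23, "city": "Zakho"},
    {"id": 8, "city": "Famagusta"},
]

def build_index(records, attribute):
    """Return a sorted list of <key, pointer> entries for one attribute."""
    return sorted((rec[attribute], pos) for pos, rec in enumerate(records))

def lookup(index, records, key):
    """Step 1: search the index for the key. Step 2: follow the pointers."""
    i = bisect_left(index, (key, -1))  # first entry whose key is >= key
    matches = []
    while i < len(index) and index[i][0] == key:
        matches.append(records[index[i][1]])  # pointer -> actual record
        i += 1
    return matches

# Two independent indexes on the same data file; the file itself is untouched.
city_index = build_index(data_file, "city")
id_index = build_index(data_file, "id")
zakho_records = lookup(city_index, data_file, "Zakho")
```

Both indexes are built without reordering `data_file`, which mirrors the point that a secondary access path does not alter the placement of records.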
The most common indexing methods are B+-trees and hash
indexes. Of the two, B+-trees are the most widely used
structure for building indexes in relational Database
Management Systems (DBMSs) (1).
1.2 Spatial indexing methods
Many scientific applications produce multidimensional spatial
datasets that are very large, both in the number of dimensions
and in the number of records. Because conventional database
indexing techniques are unable to index spatial datasets,
considerable effort has gone into developing indexing
techniques specific to spatial data.
Spatial indexing methods can be grouped into two sets:
space partitioning methods and data partitioning methods.
Space partitioning methods that are based on KD-trees (2)
have been shown to perform well for point data. In space
partitioning, we start by inserting elements into the tree.
When a node overflows, it is split using a single dimension
and a single position in that dimension. Data partitioning
methods, based on R-trees, split the space using rectangular
bounding boxes. The positions of the bounding boxes are stored
in the index structure (3).
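The difference between the two families can be sketched in a few lines of Python. This is a simplified illustration, not the KD-tree of (2) or the R-tree of (3): the leaf capacity, the rule of cycling through dimensions by depth, and the choice of the median as split position are assumptions, as are the names `Node` and `bounding_box`.

```python
CAPACITY = 2  # hypothetical maximum number of points per leaf

class Node:
    """Space partitioning: a leaf that splits on one dimension when it overflows."""

    def __init__(self, depth=0):
        self.depth = depth
        self.points = []        # points held while the node is a leaf
        self.split_dim = None   # set once the node has been split
        self.split_pos = None
        self.left = None
        self.right = None

    def insert(self, point):
        if self.split_dim is not None:  # internal node: route the point down
            child = self.left if point[self.split_dim] < self.split_pos else self.right
            child.insert(point)
            return
        self.points.append(point)
        if len(self.points) > CAPACITY:  # overflow
            self._split()

    def _split(self):
        dim = self.depth % len(self.points[0])  # assumed: cycle through dimensions
        values = sorted(p[dim] for p in self.points)
        pos = values[len(values) // 2]          # assumed: split at the median
        left = [p for p in self.points if p[dim] < pos]
        if not left:  # all values equal in this dimension: stay an oversized leaf
            return
        right = [p for p in self.points if p[dim] >= pos]
        self.split_dim, self.split_pos = dim, pos
        self.left, self.right = Node(self.depth + 1), Node(self.depth + 1)
        self.left.points, self.right.points = left, right
        self.points = []

def bounding_box(points):
    """Data partitioning: the rectangle (mins, maxs) that encloses the points."""
    mins = tuple(min(c) for c in zip(*points))
    maxs = tuple(max(c) for c in zip(*points))
    return mins, maxs

root = Node()
for p in [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1)]:
    root.insert(p)
```

In the space partitioning case the index stores a split dimension and position per node; in the data partitioning case it would store the rectangle returned by `bounding_box` for each group of points.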
1.3 Clustering General Concepts
Clustering divides data into groups (clusters) based on the
similarity between data points. The aim of grouping is either
to divide data into meaningful groups or to serve as a
preprocessing step, for instance to summarize data. When the
intent is to discover meaningful structure in the data, the
clustering algorithm generates so-called natural clusters (4).
A variety of clustering algorithms exist in the literature;
here we focus on clustering methods that are based on having
one or more representatives for each cluster. More specifically,
we focus on K-means, Self Organizing Map (SOM), and Growing
Neural Gas (GNG).
1.3.1 K-Means
The K-means algorithm is one of the simplest clustering methods
(4). K initial centroids (or codebooks) are chosen, where K,
the number of clusters, is a parameter of the algorithm. Each
data point is then assigned to the closest centroid. The
centroid of each cluster is then updated to the mean (average)
of all data points in the cluster. The assignment and update
steps are repeated until convergence.
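The assignment and update steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the initial centroids are passed in by the caller, closeness is measured by squared Euclidean distance, an empty cluster keeps its previous centroid, and convergence means the centroids no longer change. The names `kmeans`, `sq_dist`, and `max_iter` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    """Return (centroids, clusters) for the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: centroids did not move
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional data with K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
```

Passing the initial centroids explicitly keeps the sketch deterministic; in practice they are often drawn at random from the data, and the result can depend on that choice.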
K-means is described in more detail in the following
algorithm (4).
• Select K initial centroids
• Repeat
o Form K clusters by assigning each data point
————————————————
• Karwan Jacksi is currently pursuing a PhD degree in Computer
Science at the University of Zakho, Iraq, and Eastern Mediterranean University,
Cyprus. Tel: +90-533-852-8257. Email: Karwan.Jacksi@uoz.ac
• Sobhan Badiozamany is currently pursuing a PhD degree in Computer
Science at Uppsala University, Sweden. Tel: +4670-4094664.
Email: sobhan.badiozaman@gmail.com