Clustering in Data Mining: A Review
Amandeep Kaur¹ and Tarun Dhiman²
¹ Geeta Institute of Management and Technology, Kurukshetra, India
Email: meamansaini99@gmail.com
² Geeta Institute of Management and Technology, Kurukshetra, India
Email: tarun.dhiman@gmail.com
Abstract—Document categorization, also called document classification, sorts useful
documents and classifies them by content. It is a machine learning approach within
Natural Language Processing (NLP). The goal is to assign one or more classes or
categories to a document, which makes the document easier to sort and manage. This paper
reviews the concept of document mining, its architecture, its fields, clustering, and the
types of clustering.
Index Terms— Document categorization, Document classification, Text clustering, Data
mining, Clustering.
I. INTRODUCTION TO DATA MINING
Data mining refers to extracting knowledge from large amounts of data. It is like mining coal and finding
a diamond in the process. Data mining is sometimes called knowledge mining from data. It is the
computational process of discovering patterns in large data sets, using methods at the intersection of
artificial intelligence, database systems, statistics, and machine learning. The main aim of the data mining
process is to extract information from a data set and transform it into an understandable structure.
A. Architecture of Data Mining
1. Data Sources
The main data sources in the data mining process are the World Wide Web (WWW), data warehouses (DW),
databases (DB), text files, and so on. The World Wide Web is a large source of data. Successful data mining
relies on historical data. Organizations usually store their data in databases or data warehouses, and a
data warehouse contains one or more databases.
2. Data Cleaning, Integration and Selection
Cleaning, integration and selection are carried out before the data is passed to the DB or DW server. Raw
data is often incomplete and unreliable, so it cannot be used directly for data mining. First, the data is
cleaned and integrated; then only the useful data is selected and sent to the server.
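The three preparation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the field names (id, city, amount) and the helper functions clean, integrate and select are assumptions made for the example and do not come from any particular data mining system.

```python
# Hypothetical records from two sources; all field names are illustrative assumptions.
def clean(records):
    """Drop records with missing values and normalise string fields."""
    cleaned = []
    for rec in records:
        if any(v is None or v == "" for v in rec.values()):
            continue  # incomplete record: not reliable enough to mine
        cleaned.append({k: v.strip().lower() if isinstance(v, str) else v
                        for k, v in rec.items()})
    return cleaned

def integrate(*sources):
    """Merge several sources, keeping one record per id (the first one seen wins)."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["id"], rec)
    return list(merged.values())

def select(records, fields):
    """Keep only the fields that are useful for the mining task."""
    return [{f: rec[f] for f in fields} for rec in records]

web_logs = [
    {"id": 1, "city": " Kurukshetra ", "amount": 120.0},
    {"id": 2, "city": "", "amount": 80.0},               # missing city
]
warehouse = [
    {"id": 1, "city": "kurukshetra", "amount": 120.0},   # duplicate of id 1
    {"id": 3, "city": "Delhi", "amount": None},          # missing amount
    {"id": 4, "city": "Delhi", "amount": 60.0},
]

# Clean each source, integrate them, then select the useful fields.
prepared = select(integrate(clean(web_logs), clean(warehouse)), ["id", "amount"])
```

Here the incomplete records (ids 2 and 3) are discarded during cleaning, the duplicate of id 1 is resolved during integration, and only the selected fields would be sent on to the server.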
3. Database or Data Warehouse Server
The fully prepared data is held and processed by the DB or DW server. The server is responsible for
retrieving the appropriate data in response to the user's data mining request.
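As a rough illustration of the server's role, the sketch below uses an in-memory SQLite database as a stand-in for the DB or DW server. The table name sales, its columns, and the function fetch_for_request are hypothetical names chosen for the example; a real warehouse would have its own schema and query interface.

```python
import sqlite3

# An in-memory SQLite database stands in for the DB or DW server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (id, city, amount) VALUES (?, ?, ?)",
    [(1, "kurukshetra", 120.0), (2, "delhi", 60.0), (3, "delhi", 95.0)],
)

def fetch_for_request(connection, city):
    """Return only the rows relevant to a (hypothetical) mining request."""
    cursor = connection.execute(
        "SELECT id, amount FROM sales WHERE city = ? ORDER BY id", (city,)
    )
    return cursor.fetchall()

# The mining request asks for one city; the server returns only matching rows.
rows = fetch_for_request(conn, "delhi")
```

The parameterized query keeps the request separate from the SQL text, so the server returns only the rows that match the request rather than the whole table.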
4. Data Mining Engine
The data mining engine is connected to the knowledge base (KB). The knowledge base sends information to
the data mining engine, which performs tasks such as association, classification, characterization, clustering,
Grenze ID: 02.IETET.2016.5. 34
© Grenze Scientific Society, 2016
Proc. of Int. Conf. on Emerging Trends in Engineering & Technology, IETET