Clustering in Data Mining: A Review

Amandeep Kaur 1 and Tarun Dhiman 2
1 Geeta Institute of Management and Technology, Kurukshetra, India
Email: meamansaini99@gmail.com
2 Geeta Institute of Management and Technology, Kurukshetra, India
Email: tarun.dhiman@gmail.com

Abstract—Document categorization, also called document classification, sorts useful documents and classifies them by content. It is a machine learning approach within Natural Language Processing (NLP). The goal is to assign one or more classes or categories to a document, which makes documents easier to sort and manage. This paper reviews the concept of document mining, its architecture, its fields, clustering, and the types of clustering.

Index Terms—Document categorization, Document classification, Text clustering, Data mining, Clustering.

I. INTRODUCTION OF DATA MINING

Data mining refers to extracting knowledge from large amounts of data. It is much like mining coal and finding a diamond in the process. Data mining is sometimes called knowledge mining from data. It is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, database systems, statistics, and machine learning. The main aim of the data mining process is to extract information from a data set and transform it into an understandable structure.

A. Architecture of Data Mining

1. Data Sources
The World Wide Web (WWW), data warehouses (DW), databases (DB), text files, etc. are the main data sources in the data mining process. The World Wide Web is a major source of data. Successful data mining relies on historical data. Organizations usually use data warehouses or databases, and a data warehouse contains one or more databases.

2. Data Cleaning, Integration and Selection
Cleaning, integration and selection are carried out before the data is passed to the DB or DW server.
The raw data is incomplete and unreliable, so it cannot be used directly in data mining processes. First, cleaning and integration are carried out, and then only the useful data is selected and sent to the server.

3. Database or Data Warehouse Server
The fully prepared data is processed by the DB or DW server. The server is therefore responsible for retrieving the appropriate data based on the user's data mining request.

4. Data Mining Engine
The data mining engine is connected to the knowledge base (KB). The knowledge base sends information to the data mining engine, which performs tasks such as association, classification, characterization, clustering,

Grenze ID: 02.IETET.2016.5. 34 © Grenze Scientific Society, 2016
Proc. of Int. Conf. on Emerging Trends in Engineering & Technology, IETET