IJRET: International Journal of Research in Engineering and Technology, eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 04 Issue: 10 | Oct-2015, Available @ http://www.ijret.org

CONCEPT BASED CATEGORIZATION OF DOCUMENTS FOR SEARCH ENGINES

Soumen Swarnakar 1, Sangita Karmakar 2
1 Assistant Professor, Department of Information Technology, Netaji Subhash Engineering College, W.B., India
2 B.Tech Student, Department of Information Technology, Netaji Subhash Engineering College, W.B., India

Abstract
Nowadays, information retrieval is a challenging task for search engines. This paper discusses text categorization, the process of classifying documents according to predefined knowledge. Documents that share a concept are grouped together, and documents with different concepts form other groups, according to the similarity of their context. This grouping technique is called document categorization: related documents fall into the same category and unrelated documents into other categories. We concentrate on document context and carry out categorization according to it. We propose link-based document categorization according to the document context of a particular concept. In this way we can retrieve the proper information about a document and identify its main concept and sub-concepts from the percentage weights of the domains in the document. Using the percentages of the different concepts of different domains, together with document indexing, categorization can be improved for the information retrieval process of a search engine.

Keywords: Context address, Mixture category, Pure category, Concept dictionary, Domain.

1. INTRODUCTION

Text document concept analysis is a part of information extraction. Document categorization is carried out according to the concept dictionary or according to the sequence in which documents arrive. Automatic text categorization has many practical applications, including indexing for document retrieval, automatic metadata extraction, word sense disambiguation by detecting the topics a document covers, and organizing and maintaining large catalogues of Web resources. It is also used in automatic document organization, topic extraction, and information retrieval or filtering.

A fuzzy based approach for multilabel text categorization and similar document retrieval has been suggested by Rubiya P U et al. (2015). Ontology based document clustering has been proposed by Soumen Swarnakar (2012), and a new approach to concept based document clustering has been proposed by Soumen Swarnakar et al. (2015). Correlated concept based dynamic document clustering algorithms for newsgroups and scientific literature have been suggested by Jayaraj Jayabharathy and Selvadurai Kanmani (2014). A new term weighting scheme for clustering dynamic data streams has been proposed by Joel W. Reed et al. (2006).

Traditional search engines output a list of results ranked according to their relevance to the query. Our proposed approach categorizes documents automatically according to concept, which can improve search engine performance.

2. PROPOSED WORK

This paper focuses on the concept of documents. First, all text documents are converted to lower case. Next, preprocessing removes articles, prepositions and conjunctions. Finally, stemming is applied to the remaining words; for example, if a document contains the word "injured", stemming reduces it to "injure".
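
The preprocessing steps described above (lower-casing, removal of articles, prepositions and conjunctions, then stemming) can be sketched as follows. This is only an illustrative sketch: the stop-word lists, the function names `toy_stem` and `preprocess`, and the suffix rules are assumptions and do not come from the paper. The toy stemmer is a crude stand-in chosen so that "injured" becomes "injure" as in the example; a real system would more likely use an established stemmer.

```python
import re

# Assumed, deliberately small word lists for the three word classes
# named in the text; a real system would use fuller lists.
ARTICLES = {"a", "an", "the"}
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from"}
CONJUNCTIONS = {"and", "or", "but", "nor", "so", "yet"}
STOP_WORDS = ARTICLES | PREPOSITIONS | CONJUNCTIONS


def toy_stem(word):
    """Crude suffix stripping (hypothetical rules, not a real stemmer)."""
    if len(word) > 4 and word.endswith("ed"):
        return word[:-1]          # injured -> injure
    if len(word) > 5 and word.endswith("ing"):
        return word[:-3]          # playing -> play
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]          # doctors -> doctor
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The player was Injured in the match and the doctors arrived."))
# ['player', 'was', 'injure', 'match', 'doctor', 'arrive']
```

Note that "was" survives because only the three word classes named in the text are removed.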
Next, an occurrence matrix is computed according to the concept dictionary and the synonyms dictionary. From the occurrence matrix, the percentages of the different concepts of different domains are calculated for each document, and the concept of the document is decided according to these percentages. After that, documents are indexed according to their concept-based category.

2.1 TERMINOLOGIES USED

2.1.1 Concept address

A concept address is a specific location within a domain, determined by the specific keywords found in a document. The concept address of a document gives the specific address within the particular domain in which the document resides. For document numbers i = 0, 1, 2, 3, ..., n, co_i denotes the concept of document i, and W1, W2 and W3 are the occurrences of concept 1, concept 2 and concept 3 respectively:

co_i W1, co_i W2, co_i W3

In this way the concept address of any new document can be constructed.
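
To make the occurrence matrix, the concept percentages and the concept address more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the concept dictionary and synonyms dictionary contents, the function names, and the reading of the concept address as the tuple (W1, W2, W3) of per-concept occurrence counts. It is not the authors' implementation.

```python
# Hypothetical concept dictionary: concept -> stemmed keywords.
CONCEPT_DICTIONARY = {
    "sports": {"match", "player", "goal"},
    "medicine": {"injure", "doctor", "hospital"},
    "politics": {"election", "minister", "vote"},
}
# Hypothetical synonyms dictionary: synonym -> keyword in the concept dictionary.
SYNONYMS = {"game": "match", "physician": "doctor", "poll": "election"}

KEYWORD_TO_CONCEPT = {
    kw: concept for concept, kws in CONCEPT_DICTIONARY.items() for kw in kws
}


def occurrence_row(tokens):
    """One row of the occurrence matrix: keyword hits per concept."""
    row = {concept: 0 for concept in CONCEPT_DICTIONARY}
    for token in tokens:
        keyword = SYNONYMS.get(token, token)   # map synonyms first
        concept = KEYWORD_TO_CONCEPT.get(keyword)
        if concept is not None:
            row[concept] += 1
    return row


def concept_percentages(row):
    """Share of each concept among all matched keywords, in percent."""
    total = sum(row.values())
    if total == 0:
        return {concept: 0.0 for concept in row}
    return {concept: 100.0 * n / total for concept, n in row.items()}


def rank_concepts(percentages):
    """Concepts with a non-zero share, highest first (main concept first)."""
    ranked = sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)
    return [(c, p) for c, p in ranked if p > 0]


def concept_address(row):
    """Assumed reading of the concept address: the tuple (W1, W2, W3)."""
    return tuple(row[concept] for concept in CONCEPT_DICTIONARY)


# Preprocessed tokens for two made-up documents, keyed by document number i.
documents = {
    0: ["player", "injure", "match", "game", "doctor"],
    1: ["election", "vote", "minister", "poll"],
}
for i, tokens in documents.items():
    row = occurrence_row(tokens)
    print(i, concept_address(row), rank_concepts(concept_percentages(row)))
```

For document 0 the row is sports 3, medicine 2, politics 0, so under these assumptions the main concept is sports (60%) with medicine (40%) as a sub-concept.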