International Journal of Multimedia and Ubiquitous Engineering
Vol.9, No.3 (2014), pp.219-234
http://dx.doi.org/10.14257/ijmue.2014.9.3.21
ISSN: 1975-0080 IJMUE
Copyright ⓒ 2014 SERSC

A Centroid and Relationship based Clustering for Organizing Research Papers

Damien Hanyurwimfura 1,3, Liao Bo 1, Dennis Njagi 2 and Jean Paul Dukuzumuremyi 2
1 College of Information Science and Engineering, Hunan University, China
2 College of Information Science and Engineering, Central South University, China
3 College of Science and Technology, University of Rwanda, Kigali, Rwanda
hadamfr@yahoo.fr

Abstract

Finding research papers on a particular topic of study is one of the most time-consuming activities for many people, including students, professors and researchers. People doing research have to search, read and analyze multiple research papers, e-books and other documents, determine what they contain, and discover knowledge from them. Much of the available material is unstructured text spanning many pages, which takes a long time to process, search, read and analyze. Organizing research papers by their respective subjects or topics can facilitate the search process. We propose a new method for research paper organization and retrieval that is suited to closely related research papers and intertwined research topics. With our centroid and relationship based clustering approach, research papers are arranged and grouped within the most probable research topics or subjects. To determine topic membership, the proposed approach considers relationships such as common terms in the paper title, in the keywords, in the referenced titles, and in the top frequent sentences. To address the high dimensionality associated with text documents, only the most important information in each paper is considered, and multi-word and frequently occurring phrases are used as the features in the clustering process. The experiments conducted show that our approach is effective.
Keywords: centroid selection, research papers, paper clustering, paper relationship, multi-word features, important information

1. Introduction

A considerable amount of research is conducted every day by many people (researchers, students, professors, etc.). Finding information about a specific topic consumes an appreciable amount of time because of the high volume of available resources. Given the large number of studies and the increasing amount of information on the web, it becomes very important to organize this information into meaningful clusters referenced by distinct categories. Faced with such large data sets, it is very difficult to find the desired information quickly and accurately. Therefore, automated processing methods are needed so that such large amounts of information can be accessed quickly and accurately.

Text mining is also known as intelligent text analysis, text data mining, or knowledge discovery in text [2]. Typical text mining tasks include information extraction and text clustering [7], text classification, information retrieval, and text summarization. Researchers have shown great interest in mining data in structured formats, where they assume that the