Enhanced clustering of biomedical documents using ensemble non-negative matrix factorization

Xiaodi Huang a,d, Xiaodong Zheng b,c, Wei Yuan b,c, Fei Wang b,c, Shanfeng Zhu b,c,d,⇑

a School of Computing and Mathematics, Charles Sturt University, Albury, NSW 2640, Australia
b The School of Computer Science, Fudan University, Shanghai 200433, China
c Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China
d State Key Lab of Software Engineering, Wuhan University, Wuhan 430072, China

Article info

Article history:
Received 18 July 2009
Received in revised form 5 December 2010
Accepted 16 January 2011
Available online 31 January 2011

Keywords:
Biomedical document clustering
Non-negative matrix factorization
Ensemble clustering

Abstract

Searching and mining biomedical literature databases are common ways for biomedical researchers to generate scientific hypotheses. Clustering can help researchers form hypotheses by letting them seek valuable information in grouped documents effectively. Although a large number of clustering algorithms are available, this paper attempts to answer the question of which algorithm is best suited to accurately clustering biomedical documents. Non-negative matrix factorization (NMF) has been widely applied to clustering general text documents. However, the clustering results are sensitive to the initial values of the parameters of NMF. To overcome this drawback, we present ensemble NMF for clustering biomedical documents. The performance of ensemble NMF was evaluated on numerous datasets generated from the TREC Genomics track dataset. On most datasets, the experimental results demonstrate that ensemble NMF significantly outperforms the classical clustering algorithms of bisecting K-means and hierarchical clustering. We compared four different methods for constructing an ensemble NMF.
For clustering biomedical documents, this research is the first to compare ensemble NMF with typical classical clustering algorithms, and it validates ensemble NMF constructed from different graph-based ensemble algorithms. This is also the first work on ensemble NMF with Hybrid Bipartite Graph Formulation for clustering biomedical documents.

© 2011 Elsevier Inc. All rights reserved.

⇑ Corresponding author at: The School of Computer Science, Fudan University, Shanghai 200433, China. Tel.: +86 21 65643786.
E-mail address: zhushanfeng@gmail.com (S. Zhu).

Information Sciences 181 (2011) 2293–2302
doi:10.1016/j.ins.2011.01.029
Journal homepage: www.elsevier.com/locate/ins

1. Introduction

MEDLINE is the US National Library of Medicine's premier biomedical literature database [30]. Indexing 18 million biomedical documents, MEDLINE has accumulated scientific findings in the biomedical field for more than 40 years. Biomedical researchers regard MEDLINE as the main source for generating scientific hypotheses and discovering new knowledge [15]. With thousands of new citations being added to MEDLINE each day, researchers cannot browse all the relevant literature in the database. To alleviate this problem, similar biomedical documents are grouped using document clustering techniques [5,33]. In this way, the major findings reported in the literature can be digested more easily.

In general, a clustering algorithm needs to address two underlying issues: how elements are grouped and what criteria govern such groupings. According to the way elements are grouped, clustering algorithms are categorized into two types: partitional (flat) clustering and hierarchical clustering [16]. Elements in a partitional clustering are grouped into a number of flat clusters without examining their explicit relationships. Hierarchical clustering, however, produces a hierarchy of clusters in which different numbers of clusters can be obtained by examining groups at different