DHC: A Distributed Hierarchical Clustering Algorithm for Large Datasets*

Wei Zhang, Gongxuan Zhang§, Xiaohui Chen, Yueqi Liu, Xiumin Zhou and Junlong Zhou

Computer Science and Engineering, Nanjing University of Science and Technology, No. 200 Xiaolingwei Road, Nanjing 210094, P. R. China
Computer Science and Technology, Huaiyin Normal University, No. 111 Changjiangxi Road, Huai'an 223300, P. R. China
§gongxuan@njust.edu.cn

Received 30 January 2018
Accepted 31 May 2018
Published 4 July 2018

Hierarchical clustering is a classical method that provides a hierarchical representation of data for analysis. In practical applications, however, it is difficult to apply to massive datasets because of its high computational complexity. To overcome this challenge, this paper presents a novel hierarchical clustering algorithm based on distributed storage and computation, which has a lower time complexity than the standard hierarchical clustering algorithms. The proposed approach is suitable for hierarchical clustering on massive datasets and has the following advantages. First, the algorithm can store massive datasets that exceed main memory by using distributed storage nodes. Second, the algorithm can efficiently perform nearest neighbor search in parallel by using distributed computation at each node. Extensive experiments are carried out to validate the effectiveness of the DHC algorithm. Experimental results demonstrate that the algorithm is 10 times faster than the standard hierarchical clustering algorithm, making it an effective and flexible distributed hierarchical clustering algorithm for massive datasets.

Keywords: Hierarchical clustering; nearest neighbor search; distributed storage and computation; data mining.

1. Introduction

Hierarchical clustering has been used in many fields, especially in machine learning and data mining.
*This paper was recommended by Regional Editor Tongquan Wei.
§Corresponding author.

Journal of Circuits, Systems, and Computers, Vol. 28, No. 4 (2019) 1950065 (26 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218126619500658

The basic approaches to hierarchical clustering, agglomerative and