Intelligent Data Analysis 24 (2020) 1207–1233
DOI 10.3233/IDA-194843
IOS Press

W-Com2Vec: A topic-driven meta-path-based intra-community embedding for content-based heterogeneous information network

Phu Pham and Phuc Do ∗
University of Information Technology, VNU-HCM, Vietnam

∗ Corresponding author: Phuc Do, University of Information Technology, VNU-HCM, Vietnam. E-mail: phucdo@uit.edu.vn.

1088-467X/20/$35.00 © 2020 – IOS Press and the authors. All rights reserved

Abstract. Heterogeneous information networks (HINs) are becoming popular across multiple applications in the form of complex, large-scale networked data such as social networks, bibliographic networks, and biological networks. Recently, information network embedding (INE) has attracted tremendous interest from researchers due to its effectiveness in information network analysis and mining tasks. In recent views of INE, community is considered the mesoscopic structure of a network, which can be combined with the traditional approach of preserving node proximities (the microscopic structure) to improve the quality of the network’s representation. Most contemporary INE models, such as HIN2Vec, Metapath2Vec, and HINE, mainly concentrate on preserving the microscopic network structure and ignore the mesoscopic (intra-community) structure of the HIN. In this paper, we introduce a novel approach to topic-driven meta-path-based embedding, namely W-Com2Vec (Weighted intra-community to vector). The proposed W-Com2Vec model captures richer semantics in node representations by applying meta-path-based community awareness, node proximity preservation, and topic similarity evaluation simultaneously during network embedding. We present comprehensive empirical studies of the proposed W-Com2Vec model on several real-world HINs. Experimental results show that W-Com2Vec outperforms recent state-of-the-art INE models on primitive network analysis and mining tasks.

Keywords: Meta-path, community detection, content-based HIN, network embedding

1. Introduction

Recently, INE [1–3] has become one of the most popular research areas in the field of information network analysis and mining. Multiple INE models have been proposed and have attracted considerable attention from researchers and organizations due to their effectiveness in preserving the complex structure of real-world HINs as well as their high performance in a wide range of applications. Recent INE work has shifted toward the development of effective and scalable representation learning mechanisms that can be applied to highly complex, large-scale networks such as social networks (Facebook, Twitter, etc.). Most previously proposed INE models [1,4–8] aim to embed the network’s nodes into a latent low-dimensional vector space by preserving the nodes’ attributes and their order proximity. Contemporary network embedding models such as DeepWalk [9], LINE [10] and Node2Vec [11] learn