Scalable Graph Embedding Learning On A Single GPU

Azita Nouri*, Philip E. Davis, Pradeep Subedi, Manish Parashar

*School of Computer Science, Rutgers University, NJ, USA. azita.nouri@rutgers.edu

Scientific Computing Imaging Institute, University of Utah, Salt Lake City, UT, USA. philip.davis@sci.utah.edu, pradeep.subedi@utah.edu, parashar@sci.utah.edu

Abstract—Graph embedding techniques have attracted growing interest because they convert graph data into a continuous, low-dimensional space. Effective graph analytics gives users a deeper understanding of what is behind the data and can thus benefit a variety of machine learning tasks. At the scale of current real-world applications, most graph analytics methods suffer high computation and space costs. Existing methods and systems can process networks with thousands to a few million nodes; however, scaling to networks with billions of nodes remains a challenge. The complexity of training a graph embedding system motivates the use of existing accelerators. In this paper, we introduce a hybrid CPU-GPU framework that addresses the challenges of learning embeddings of large-scale graphs. The performance of our method is compared qualitatively and quantitatively with existing embedding systems on common benchmarks. We also show that our system can scale training to datasets an order of magnitude larger than a single machine's total memory capacity. The effectiveness of the learned embeddings is evaluated on multiple downstream applications. The experimental results demonstrate the effectiveness of the learned embeddings in terms of performance and accuracy.

I. INTRODUCTION

In many real-world applications, graphs (a.k.a. networks) have been widely used to represent interactions between entities.
Graph representations allow researchers to understand the structure of data efficiently and systematically, even though such data are often high-dimensional (e.g., social networks [1], biology networks [2]). Due to the complexity of the data collected by various platforms and services, learning continuous low-dimensional vector representations of graphs has attracted significant research interest. Moreover, as graphs grow dynamically, high-dimensional data becomes unsuitable for many machine learning approaches, which require low-dimensional vector representations for their computation. Among various approaches, graph embedding has attracted particular attention for unsupervised learning of node representations in a smaller space. Figure 1 illustrates the process of graph embedding, in which low-dimensional vectors are learned by training on samples obtained from the graph. These d-dimensional embeddings can later be used as input to many graph analytics methods (some of which are listed in Fig. 1). Representing nodes (originally in n dimensions) in a low-dimensional space (d << n) enables us to apply common machine learning techniques to find the hidden properties of the graph more efficiently, such as link prediction [3], community detection [4], node classification [5], clustering, and graph visualization [6], [7]. It is therefore critical to have high-quality node representations for these downstream machine learning tasks to perform graph analytics accurately. Learning a graph embedding model can be a resource-intensive process. First, training such a model requires a massive amount of computation [8], especially when the representation vectors have many dimensions and the graph contains many nodes and edges to be trained.
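As a concrete illustration of how such d-dimensional embeddings feed a downstream task (this is a generic sketch, not the framework proposed in this paper; the embedding matrix and candidate nodes below are hypothetical), link prediction can score a candidate edge with a simple dot product between node vectors:

```python
import numpy as np

# Hypothetical learned embeddings: n nodes, each mapped to a
# d-dimensional vector with d << n.
rng = np.random.default_rng(0)
n, d = 1000, 64
emb = rng.standard_normal((n, d)).astype(np.float32)

def edge_score(u: int, v: int) -> float:
    """Score a candidate edge (u, v) by the dot product of the two
    node embeddings; a higher score suggests a more likely link."""
    return float(emb[u] @ emb[v])

# Toy link prediction: rank hypothetical candidate neighbors of node 0.
candidates = [10, 20, 30]
ranked = sorted(candidates, key=lambda v: edge_score(0, v), reverse=True)
```

The same embedding matrix can serve as the feature input for node classification or clustering, which is what makes the quality of the learned vectors so consequential for all downstream tasks.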
The training of graph embeddings essentially consists of many dot-product operations between vectors, so higher dimensions and greater numbers of edges in larger graphs impose a vast amount of computation in the training phase. Various graph embedding methods have been suggested in the literature; however, these approaches rarely scale to large graphs. For example, DeepWalk [9], node2vec [10], and LINE [11] require hours of CPU training even for small- and medium-scale graphs. Although these approaches can be parallelized, a parallel CPU implementation of DeepWalk takes over two hours on 26 CPU cores for a graph with 2 million vertices and 5 million edges. Traditional CPU-based methods are resource-constrained: training a graph with billions of edges into vectors of size 100 using these methods is either infeasible or, in the best case, takes hours of training, if not days. Therefore, with the rapidly growing size of graphs, it is essential for embedding methods to support the training of such large graphs in a reasonable time [8]. Many studies try to train large graphs using limited resources [12], [8]. In addition, with the growing availability of GPUs and other accelerators, new techniques have been developed to utilize these resources to speed up the training phase of learning algorithms [13]. Thus, using the power of GPUs in a hybrid system is a popular way to achieve speedups for the computational part of embedding training. However, another challenge of these computations is related to memory requirements, mainly because a large graph must be stored in memory along with its representation vectors. For large graphs, the model parameters cannot be stored in main memory; e.g., a graph with 100 million nodes represented in 256 dimensions requires around 400GB of RAM, which is beyond the capacity available to ordinary users.
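The 400GB figure above can be reproduced with back-of-the-envelope arithmetic. As a sketch (the defaults below are assumptions, not necessarily the paper's exact setup): word2vec-style training keeps two parameter matrices (input and output embeddings), and at 8 bytes per parameter the total comes out to roughly 400GB:

```python
def embedding_bytes(num_nodes: int, dim: int,
                    bytes_per_param: int = 8,
                    num_matrices: int = 2) -> int:
    """Rough parameter-memory estimate for an embedding model.
    Assumptions (not taken from the paper): skip-gram-style training
    keeps two parameter matrices (input and output embeddings), and
    bytes_per_param=8 corresponds to double precision."""
    return num_nodes * dim * bytes_per_param * num_matrices

# 100 million nodes, 256 dimensions:
gb = embedding_bytes(100_000_000, 256) / 1e9
print(f"{gb:.0f} GB")  # about 410 GB, consistent with the ~400GB in the text
```

Optimizer state (e.g., momentum or Adam moments) would add further multiples of this footprint, which only strengthens the argument that the parameters of billion-node models cannot reside in a single machine's main memory.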
Considering the increasing size of large graphs, we cannot fit the embedding vectors into CPU memory, while in the case of utilizing the GPU,

arXiv:2110.06991v1 [cs.LG] 13 Oct 2021