GraNNDis / If you’ve been worried about training on large graphs, this architecture can practically ease that worry

Jeong ii tae
6 min read · Feb 18, 2024


GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs on Large Clusters

[Content]

Paper Link

  • Training on large graph data inevitably runs into GPU engineering challenges. Even with the best multi-GPU setup, a ‘good’ training outcome is not guaranteed. The paper we introduce today, GraNNDis, shows how message passing actually proceeds when training Graph Neural Networks (GNNs) on multi-GPU servers and how resources can be used optimally across servers.
  • Anyone who has set up a computing environment for deep learning models, including cloud options like Colab, has agonized over how much VRAM the model’s parameters will need. The moment you confidently load your data onto the GPU, you may be greeted by the dreaded Out of Memory (OOM) error. Workarounds such as shrinking the batch size are common, which is why people joke that setting up the experiment sometimes takes more effort than the experiment itself.
  • As data sizes grow, engineering capability becomes increasingly critical. Real-world graph data mostly follows a power-law distribution, which makes it especially prone to OOM: when features and intermediate results are loaded into VRAM, low-degree nodes contribute little work, while a few high-degree nodes pull in so many neighbors that they can exceed VRAM capacity on their own.
  • To prevent OOM in GNN training, GraNNDis proposes three ideas: shared preloading, expansion-aware sampling, and cooperative batching. These mitigate OOM by improving communication efficiency between GPU servers, sampling across servers, and batching in a way that supports training on the whole graph, and the paper demonstrates superior performance through experiments.
  • Before looking at the experiments, it’s essential to define what ‘good’ training means here. The paper benchmarks how many nodes and edges each server’s GPUs compute and how much time the GPUs spend communicating. Simply put, the oversized data is distributed across the servers’ GPUs, computed in parallel, and the time needed to integrate those partial computations is measured; a minimal timing sketch follows below.
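
To make that benchmark concrete, here is a minimal timing sketch of my own (not code from the paper), assuming a data-parallel setup where torch.distributed is already initialized, each GPU holds a partition of the graph, and `batch` is a PyG-style object with `x`, `edge_index`, `y`, and `train_mask`. It times the local forward/backward separately from the gradient all-reduce, so compute time and communication time can be compared.

```python
import time

import torch
import torch.distributed as dist
import torch.nn.functional as F


def timed_step(model, batch, optimizer):
    """Roughly separate per-GPU compute time from gradient-sync communication time.

    Assumes torch.distributed is already initialized and `model` is a plain
    (non-DDP) GNN living on this GPU, so the gradient all-reduce can be
    triggered and timed manually.
    """
    torch.cuda.synchronize()
    t0 = time.perf_counter()

    # Local computation: forward + backward on this GPU's partition.
    out = model(batch.x, batch.edge_index)
    loss = F.cross_entropy(out[batch.train_mask], batch.y[batch.train_mask])
    loss.backward()
    torch.cuda.synchronize()
    t_compute = time.perf_counter() - t0

    # Communication: average gradients across all workers.
    t1 = time.perf_counter()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    torch.cuda.synchronize()
    t_comm = time.perf_counter() - t1

    optimizer.step()
    optimizer.zero_grad()
    return t_compute, t_comm
```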

Main module 1 — shared preloading

  • Communication between GPUs within the same server and communication across different servers naturally have very different costs. If the whole graph can be loaded on the GPUs of a single server, training is faster; spreading it over the GPUs of multiple servers means extra trips to other servers to gather remote information, which lengthens processing time.
  • To address this, GraNNDis uses shared preloading. Unlike the traditional approach of simply splitting the graph evenly across servers, it pre-assigns and preloads, at the server level, the nodes each layer will need. When an operation would otherwise have to reach outside the server, it uses the features of the preloaded nodes instead; a toy sketch of the idea follows below.
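
The sketch below is a toy rendering of that idea under my reading of the paper (the helper name and data layout are hypothetical, not GraNNDis code): a server identifies the remote 1-hop neighbors of the nodes it owns and fetches their features once, in bulk, so its GPUs can later read them locally instead of repeatedly crossing the slower inter-server link.

```python
import torch


def shared_preload(server_nodes, edge_index, features):
    """Toy illustration of shared preloading (hypothetical, not the paper's code).

    server_nodes : 1-D LongTensor of node ids owned by this server
    edge_index   : [2, E] LongTensor of the full graph
    features     : [N, F] feature matrix living on (possibly remote) storage
    """
    owned = torch.zeros(features.size(0), dtype=torch.bool)
    owned[server_nodes] = True

    src, dst = edge_index
    # Remote source nodes whose edges point into nodes owned by this server.
    halo = src[owned[dst] & ~owned[src]].unique()

    cache_ids = torch.cat([server_nodes, halo])
    cache = features[cache_ids]          # single bulk fetch over the inter-server link
    local_index = {int(n): i for i, n in enumerate(cache_ids)}
    return cache, local_index

# GPUs on the same server then share `cache` (e.g. via fast intra-server links
# or host memory) instead of each issuing remote fetches at every layer.
```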

Main module 2 — expansion-aware sampling

  • Before discussing the method itself, it’s worth briefly reviewing how sampling has traditionally been done in Graph Neural Networks (GNNs), since that background makes the paper easier to follow. Traditional sampling methods fall into two major types: layer-wise and subgraph-wise. Layer-wise sampling draws up to n neighbors for each layer, while subgraph-wise sampling extracts a subgraph connected to selected nodes.
  • The advantage of layer-wise sampling is that it gives arbitrary control over how many neighbors take part in message passing at each layer. Subgraph-wise sampling, on the other hand, efficiently captures the structural characteristics of the original graph through the sampled subgraphs. In terms familiar to many, within the torch_geometric module layer-wise sampling corresponds to NeighborLoader and subgraph-wise sampling to GraphSAINTRandomWalkSampler; a short usage sketch follows this list.
  • Returning to the main topic: even with shared preloading, the number of nodes to sample grows as the layers deepen, and because most real-world graph data follows a power-law distribution, preloading eventually hits a limit. Expansion-aware sampling is used to mitigate this.
  • The concept is straightforward. Unlike traditional sampling methods, expansion-aware sampling pays attention to the “expansion” itself by introducing two parameters, max-hop and fan-out: how many hops away from a given node to go, and the maximum number of neighbors to load per node. These tunable limits give shared preloading a clear criterion for deciding which nodes to include and which to exclude.
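
For reference, here is what the two sampler families mentioned above look like in torch_geometric; the dataset and parameter values are only illustrative, and I am assuming a recent PyG release where both loaders are exposed under torch_geometric.loader.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader, GraphSAINTRandomWalkSampler

data = Planetoid(root="/tmp/Cora", name="Cora")[0]

# Layer-wise sampling: cap how many neighbors are drawn at each of the
# two layers (at most 10, then 5 per node).
layer_wise = NeighborLoader(
    data,
    num_neighbors=[10, 5],
    batch_size=128,
    input_nodes=data.train_mask,
)

# Subgraph-wise sampling: each mini-batch is a random-walk subgraph that
# preserves local structure around the sampled root nodes.
subgraph_wise = GraphSAINTRandomWalkSampler(
    data,
    batch_size=256,
    walk_length=2,
    num_steps=8,
)

for batch in layer_wise:   # each `batch` is itself a small Data object
    print(batch)
    break
```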
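
And here is a toy rendering of expansion-aware sampling as I understand it: a bounded breadth-first expansion with max-hop and fan-out caps, so that high-degree nodes in a power-law graph cannot blow up the preloaded neighborhood. The function below is a hypothetical sketch, not the paper’s implementation.

```python
import random
from collections import defaultdict


def expansion_aware_sample(adj, seeds, max_hop=2, fanout=5, rng=random):
    """Toy expansion-aware sampling (hypothetical sketch, not the paper's code).

    adj     : dict[node, list[node]] adjacency lists
    seeds   : iterable of seed node ids (e.g. a server's target nodes)
    max_hop : how far the sampled neighborhood may expand from the seeds
    fanout  : maximum number of neighbors drawn per visited node
    """
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(max_hop):
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            if len(neighbors) > fanout:          # cap high-degree (power-law) nodes
                neighbors = rng.sample(neighbors, fanout)
            for nb in neighbors:
                if nb not in visited:
                    visited.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return visited   # nodes to preload for these seeds

# Example: a small graph where node 0 is a hub with 100 neighbors.
adj = defaultdict(list, {0: list(range(1, 101)), 1: [0, 2], 2: [1]})
print(len(expansion_aware_sample(adj, seeds=[1], max_hop=2, fanout=5)))
```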

Main module 3 — cooperative batching

  • So far, we have discussed how shared preloading and expansion-aware sampling can be approached for full-graph training, that is, for training the entire graph. Cooperative batching, on the other hand, addresses how nodes are fetched for each GPU.
  • Traditional methods first designate target nodes and divide them, then fetch based on those divided nodes — a method known as split-then-fetch. Cooperative batching employs a fetch-then-split approach, where the nodes to be utilized by the server are pre-fetched for each batch before deciding which GPU will train them.
  • The conventional method assigns each target node to different GPUs for training, requiring the fetching of all related neighbors for each node. However, cooperative batching, by distributing common nodes across batches, allows GPUs to share the computational load. This makes it more efficient in terms of memory utilization, as it reduces the need for each GPU to fetch and store all neighbors independently.
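
To make the split-then-fetch versus fetch-then-split contrast tangible, here is a toy sketch of my own using plain Python sets (not the paper’s code): in the conventional scheme each GPU fetches the neighbors of its own targets, so shared neighbors are duplicated, whereas in the cooperative scheme the server fetches the union once and only the target nodes are split across its GPUs.

```python
def split_then_fetch(targets, adj, num_gpus):
    """Conventional scheme: split targets first, then each GPU fetches its own
    neighbors, so nodes shared across GPUs are fetched and stored repeatedly."""
    shards = [targets[i::num_gpus] for i in range(num_gpus)]
    fetched = [set(shard) | {nb for t in shard for nb in adj[t]} for shard in shards]
    return fetched  # per-GPU node sets, with overlapping neighbors duplicated


def fetch_then_split(targets, adj, num_gpus):
    """Cooperative-batching-style scheme (toy sketch): the server fetches the
    union of targets and their neighbors once, then splits only the targets;
    shared neighbors are stored a single time and reused by all GPUs."""
    shared_pool = set(targets) | {nb for t in targets for nb in adj[t]}
    shards = [targets[i::num_gpus] for i in range(num_gpus)]
    return shared_pool, shards


# Tiny example: two target nodes that share most of their neighbors.
adj = {0: [2, 3, 4], 1: [2, 3, 5]}
per_gpu = split_then_fetch([0, 1], adj, num_gpus=2)
pool, shards = fetch_then_split([0, 1], adj, num_gpus=2)
print(sum(len(s) for s in per_gpu), "node entries fetched vs", len(pool), "in the shared pool")
```

On this tiny example the duplicated scheme materializes eight node entries while the shared pool holds only six; on power-law graphs, where hub neighbors are shared by many targets, the gap grows much larger.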

Experiment

  • The experiments use PyG- and DGL-based frameworks as baselines. It’s worth noting that the DGL-based baselines have limitations: PipeGCN does not support mini-batch training and DistDGL does not support full-graph training, whereas GraNNDis is applicable to both scenarios.
  • Among all the intriguing experimental results, I found Figure 11 the most interesting. It measures memory efficiency and the point at which Out of Memory (OOM) occurs. With 56 layers and 256-dimensional features, the sheer number of parameters is overwhelming; normally the thought of training under such conditions wouldn’t even cross our minds, so being able to observe those results vicariously through the paper was fascinating.
  • One might also expect accuracy to suffer, since the different sampling scheme changes how parameters are updated. However, the experiments show that training runs 1.23 to 1.95 times faster without sacrificing accuracy.

Conclusion

  • In today’s era, where Large Language Models (LLMs) are a hot topic, there is much discussion about how to train effectively on large datasets with the resources available. Graph data is distinctive among data types because of its strong dependency on topology: how much information a node learns depends on that topology, which ties directly to performance and is considered a critical element.
  • However, there are situations where this topology is not efficiently utilized. Due to the rich-get-richer nature of topology, some GPU memories may be underutilized with a small amount of information loaded, while others may be overloaded with a large amount of information, leading to Out of Memory (OOM) issues during training.
  • To prevent this, strategies often involve reducing the batch size or the sample size to minimize the amount of information per batch. However, this may result in missing out on meaningful information.
  • The paper addresses this issue by introducing the GraNNDis architecture, which features shared preloading, expansion-aware sampling, and cooperative batching. This is a valuable paper for those who have been focused solely on message-passing, as it broadens the perspective by studying the process of how parameters are computed on GPUs.

[Contact Info]

Linkedin — https://www.linkedin.com/in/ii-tae-jeong/

Email — jeongiitae6@gmail.com


Jeong ii tae

Linkedin : jeongiitae / I’m a graph and network data enthusiast. I’m always thinking about how graph data can be useful in the real world.