Graphs exist everywhere. Be it financial transactions, social connections, academic performance, molecular interactions, or advertising marketing statistics – just about everything tangible can be expressed in graphical form. And these graphs can be used to delineate relationships and interdependencies between different objects.
But not everybody loves graphs!
Pie charts, Columns, Bubbles, Scatter graphs, and Venn diagrams – these continue to be the bane of many businesses. And this is exceedingly perplexing since these are simple, straightforward 2D representations – Euclidean data, in technical terms.
However, there are some other smoldering questions related to data?
What about non-Euclidean data? Data with complex relationships and interdependencies mapped between different objects (nodes)? Data exceeding far beyond the constraints of 3D representation? Can this data also be represented in graphical form? This is something that perplexes not only humans (including mathematics researchers) but even advanced computing systems.
Graphical Analysis and Shortcomings of Existing ML/DL Models
At the simplest level, a graph has two fundamental components – a set of two objects (nodes) and the relationship mapped between them (graph edge). The scenario is manifolds complicated when multiple subjects are introduced in the picture, each affecting the other, introducing interdependencies and weaving a massively convoluted spiderweb of relationships.
For example, the relationship between different members (objects) of a family can be easily depicted in graphical form, including mapping how the actions and emotions of one member affects the others’ actions and emotions. But when the actions, emotions and interdependencies of extended family, distant cousins, friends and colleagues are also introduced in the matrix along with their own families and so on, the lattice formed begins to become indecipherable.
The humongous data sets that emerge from such graphical representations are immensely complex, so much so that even advanced ML algorithms begin to dither and fail at the prospect of making sense of this data. This is primarily because of two reasons –
- Machine Learning/ Deep Learning models (ML/DL) are trained to identify patterns and detect/ predict anomalies. This has multifarious applications in the fields of personalized product recommendation, cybersecurity and unauthorized access prevention, satellite mapping, etc. These models fail where no discernible patterns exist.
- Furthermore, ML/DL models can map one-to-one relationships which renders them excellent for image recognition and classification, traffic prediction, etc. But they are generally not suitably well-trained to map one-to-many and many-to-one relationships, especially where millions of objects might be concerned.
Introduction to Graph Neural Networks (GNNs)
GNNs are a special class of Artificial Neural Networks (ANNs). To ameliorate the shortcomings of conventional ML/DL models to coherently decipher monumental graph structures, GNNs have emerged as an indispensable tool for processing seemingly arbitrary data that can be represented as graphs. They have been deployed for unsupervised, semi-supervised, self-supervised and reinforcement learning (RL), and have found applications across the spectrum in the following fields –
- Node Classification – Semi-supervised ML model which is trained to determine the labels of unlabeled individual nodes by inspecting the labels of their neighbors, e.g., detecting fraudulent financial transactions by examining extensive transaction history.
- Graph Classification – Similar to (1) but applied to an entire graph, e.g., categorizing text documents depending on their content and context for Natural Language Processing (NLP) AI models.
- Graph Visualization – Depiction of extensive information in visual form to highlight structures and detect anomalies/ outliers.
- Link Prediction – Numerous uses ranging from friend recommendations on social media, spam mail identification, and epidemic prediction. An overarching use case involves crime enforcement predicting an individual’s criminal associations.
- Graph Clustering – Similarity-based clustering of dense data in the form of graphs, based either on edge weight or edge distances. Currently being used in many image identification and visual cognition applications.
Developing GNNs for inference entails –
- the conversion of humongous amounts of input data into mini-batches for training, and
- undertaking neighborhood sampling in these mini-batches in a distributed multi-GPU environment.
This incurs intense computational load since the position of each data point (graph node) also depends on the position of its neighbors (neighborhood aggregation) and this can quickly lead to a prohibitively large neighborhood even in mini-batches. This increases dramatically when considering staggering datasets with millions of nodes and billions of relationships between them!
Apart from the computational cost, the representation of so many individual data points in this massive neighborhood also consumes substantial memory, often even exceeding the GPU’s memory capacity and requiring the transfer of the entirety of the corresponding subgraph and feature vectors to the GPUs themselves.
There are, thus, two major bottlenecks inhibiting GNN performance – mini-batch preparation and massive-scale CPU-to-GPU data transfer. It was demonstrated that throughout the course of GNN training, startlingly only ~28% of the time is actually spent on GPU training, the balance being exhausted in mini-batch preparation and data transfer to the GPU!
Both bottlenecks, therefore, need to be efficiently surmounted to achieve substantial GPU utilization across a multi-GPU environment. In fact, sampling thoroughput needs to be improved by at least three times in order to keep a single GPU busy and hide sampling latency. The speedup needs to be even higher when multiple GPUs are simultaneously deployed on a workload.
The SALIENT Approach
To ameliorate the abovementioned hardware constraints, researchers from MIT-IBM Watson AI Lab have developed a specialized system christened “SAmpling, sLIcing, and data movemENT (SALIENT)” for distributed data parallel GNN training and inference using neighborhood sampling.
This advanced system incorporates the following features to achieve incredibly high performance –
- Optimized neighborhood sampling
- Efficient parallel batch preparation
- CPU-to-GPU data transfer optimizations that hide latency and saturate data bus bandwidth, and
- Seamless compatibility with PyTorch’s Distributed Data Parallel training module (DDP) to scale across multiple GPUs and machines.
GPU saturation using SALIENT. X-axis indicates elapsed time. Note that CPU cores, being more in number, are being employed to undertake data sampling and prepare mini-batches in parallel (green cells) in order to saturate the GPUs. Schematic courtesy Tim Kaler et al.
Across benchmarks, SALIENT uses targeted optimizations to maximize GPU utilization and workload efficiency. It is demonstrated to undertake neighborhood sampling 2.5x faster vis-a-vis Pytorch Geometric (PyG) implementation.
It further increases GPU utilization by deploying separate GPU streams for data transfer and GPU training computations and synchronizing those overlapping streams to ensure a training iteration begins after the necessary data is transferred. Since batch preparation and transfer have already been optimized to consume less time than GPU training computations, SALIENT naturally succeeds in eliminating latency outside GPU training computations.
The trick to achieving these extraordinary speed-ups is keeping all of the hardware (CPUs, GPUs and NCCL data links) busy at all times. The CPU is continuously involved in graph sampling and mini-batch preparation, extracting necessary and relevant information from nodes and their neighbors, while the more critical GPU resources is kept occupied with ML model training or conducting inference. Notably, SALIENT enhances efficiency manifolds without requiring disruptive changes to user-facing APIs.
Multi-GPU scaling –
Employing a single GPU, SALIENT achieves 3-3.4x times speedup across datasets by exploiting its improved pipelined design to optimize data sampling, mini-batch preparation, and data transfer stages. The overall per-epoch runtime nearly coincides with the GPU compute time for training.
Scaling to a multi-GPU environment enables the system to achieve even higher parallel speedup by leveraging the increased computational density. Processing larger batch sizes and undertaking larger neighborhood sampling effectively camouflages synchronization and data transfer overheads. With 16 GPUs, the speedup ranges from 4.45-8.05x times.
This is appropriately illustrated by comparing GNN training performance on the same workload (ogbn-papers100M) with/ out SALIENT optimizations in mono/ multi-GPU environments –
As demonstrated above, SALIENT marks a significant milestone in improving GNN architecture, especially after taking into consideration its potential to incorporate other established methodologies resolving unrelated bottlenecks.
Scenarios where billions of transactions must be efficiently and quickly processed by GPU-enabled ML/DL algorithms will only increase over time as datasets keep multiplying. Technologies utilizing these datasets, whether in the realm of crypto/ finance or medicine or quantum computing, continue to grow by leaps and bounds, and achieving superior GPU utilization must remain a technological and environmental priority.
The complete paper is available here.