Micron 9400 NVMe SSDs explore big accelerator memory using NVIDIA technology

Dataset training sizes continue to grow beyond billions of parameters. While some models can fit in system memory completely, larger models cannot. In this situation, data loaders need to access models located on flash storage through various methods. One such method is a memory mapped file stored on SSDs. This allows the data loader to access the file as if it were in memory, but the overhead of the CPU and software stack drastically reduces the performance of the training system. This is where Big accelerator Memory (BaM)* and the GPU-Initiated Direct Storage (GIDS)* data loader come in.

What are BaM and GIDS?

BaM is a system architecture that utilizes the low latency, extremely high throughput, large density, and endurance of SSDs. BaM’s goal is to provide efficient abstractions that enable GPU threads to make fine-grained accesses to datasets on SSDs and achieve much higher performance than solutions requiring CPU(s) to provide storage requests to serve GPUs. BaM acceleration uses a custom storage driver that is designed specifically to enable the inherent parallelism of GPUs to access storage devices directly. BaM is different from NVIDIA Magnum IO™ GPUDirect^® Storage (GDS), as BaM doesn’t rely on the CPU to prepare the communication from GPU to SSD.

Micron had done previous work with NVIDIA GDS as noted below:

The GIDS dataloader is built on the BaM subsystem to address memory capacity requirements for GPU-accelerated Graph Neural Network (GNN) training while also masking storage latency. GIDS does this by storing the feature data of the graph on the SSD, since this data is typically the largest part of the total graph dataset for large-scale graphs. The graph structure data, which is typically much smaller compared to the feature data, is pinned into system memory to enable rapid GPU graph sampling. Lastly, the GIDS dataloader allocates a software-defined cache on the GPU memory for recently accessed nodes in order to reduce storage accesses.

Graph neural network training using GIDS

To show the benefits of BaM and GIDS, we performed GNN training using the Illinois Graph Benchmark (IGB) heterogeneous full dataset. This dataset is 2.28TB large and would not fit into most platforms’ system memory. We timed the training for 100 iterations using a single NVIDIA A100 80GB Tensor Core GPU and varied the number of SSDs to provide a broad range of results, as seen in Figure 1 and Table 1.

Graph showing GNN training times, with graph sampling, feature aggregation, and training, across different SSD counts

Figure 1: GIDS Training Time for IGB-Heterogenous Full Dataset - 100 Iterations

	GIDS (4 SSDs)	GIDS (2 SSDs)	GIDS (1 SSD)	DGL Memory Map Abstraction
Sampling	4.75	4.93	4.08	4.65
Feature Aggregation	8.57	15.9	31.6	1,130
Training	1.98	1.97	1.87	2.13
End-to-End	15.3	22.8	37.6	1,143

Table 1: GIDS Training Time for IGB-Heterogenous Full Dataset - 100 Iterations

The first part of the training is graph sampling done by the GPU and by accessing the graph structure data within system memory (seen in blue). This value varies little across the different test configurations because the structure stored in system memory does not change between these tests.

Another part is the actual training time (seen at the far right in green). This part is highly dependent on the GPU, and we can see that this does not change much between the multiple test configurations as expected.

The most important section, where we see the largest difference, is feature aggregation (shown in gold). As the feature data is stored on the Micron 9400 SSDs for this system, we see that scaling from 1 to 4 Micron 9400 SSDs drastically improves (reduces) the feature aggregation processing time. Feature aggregation improves by 3.68x as we scale from 1 SSD to 4 SSDs.

We also included a baseline calculation, which uses a memory map abstraction and the Deep Graph Library (DGL) data loader to access the feature data. Because this method of accessing the feature data requires the use of the CPU software stack instead of direct access by the GPU, we can see how inefficient the CPU software stack is at keeping the GPU saturated during training. The feature abstraction improvement versus baseline is 35.76x for 1 Micron 9400 NVMe SSD using GIDS and 131.87x on 4 Micron 9400 NVMe SSDs. Another view of this data can be seen in Figure 2 and Table 2, which shows the effective bandwidth and IOPs during these tests.

Figure 2: Effective Bandwidth and IOPS of GIDS Training vs Baseline

	DGL Memory Map	GIDS (1 SSD)	GIDS (2 SSDs)	GIDS (4 SSDs)
Effective Bandwidth (GB/s)	0.194	6.9	13.8	25.6
Achieved IOPs (M/s)	0.049	1.7	3.4	6.3

Table 2: Effective Bandwidth and IOPS of GIDS Training vs Baseline

As datasets continue to grow, we can see the need for a shift in paradigm in order to train these models in a reasonable amount of time and to take advantage of the improvements provided by leading GPUs. BaM and GIDS are a great starting point, and we look forward to working with more of these types of systems in the future.

Test System

Component	Details
Server	Supermicro® AS 4124GS-TNR
CPU	2x AMD EPYC™ 7702 (64 Core)
Memory	1 TB Micron DDR4-3200
GPU	NVIDIA A100 80GB Memory Clock: 1512 MHz SM Clock: 1410 MHz
SSDs	4x Micron 9400 MAX 6.4TB
OS	Ubuntu 20.04, Kernel 5.4.0
NVIDIA Driver	535.113.01

Reference Links

Big Accelerator Memory Paper & GitHub
GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture (arxiv.org)
GitHub - ZaidQureshi/bam

GPU Initiated Direct Storage Paper and GitHub
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses (arxiv.org)
GitHub - jeongminpark417/GIDS
GitHub - IllinoisGraphBenchmark/IGB-Datasets: Largest real-world open-source graph dataset - Worked done under IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards and in collaboration with NVIDIA Research.

*Note: NVIDIA Big Accelerator Memory (BaM) and NVIDIA GPU Initiated Direct Storage (GIDS) dataloader are prototype projects from NVIDIA Research and are not intended for general release.

MTS, Systems Performance Engineer

John Mazzie

John is a Member of the Technical Staff in the Data Center Workload Engineering group in Austin, TX. He graduated in 2008 from West Virginia University with his MSEE with an emphasis in wireless communications. John has worked for Dell on their storage MD3 Series of storage arrays on both the development and sustaining side. John joined Micron in 2016 where he has worked on Cassandra, MongoDB, and Ceph, and other advanced storage workloads.

Products overview

Search for, filter and download Micron data sheets

Market & Industries overview

AI data center

Partners overview

Learn about and enroll in Micron's Technology Enablement Program (TEP)

Sales & Support overview

Contact Micron's sales support

About overview

Investor Relations overview

Visit Micron's Investor Relations site

Recent Search

Micron 9400 NVMe SSDs explore big accelerator memory

What are BaM and GIDS?

Graph neural network training using GIDS

Test System

Reference Links

John Mazzie