

NUMA Configuration on AMD Rome Processors and NVMe Performance on Windows Servers

By Dilim Nwobu - 2020-02-18

Now that the AMD EPYC™ 7002 series processors are available (code-named “Rome,” they are the second generation of EPYC server processors and are built on the Zen 2 architecture), data centers are investigating the hybrid multi-die architecture for on-premises workloads. Micron leaders were present at the EPYC launch last fall, and our storage and memory helped set some of the new marketing records.

With Rome processors offering up to 64 cores and 128 threads in a single socket and support for PCIe Gen 4, our Micron storage team has started testing on platforms based on this innovative CPU architecture. We recently procured servers outfitted with AMD 7702P processors, and I was able to do some preliminary testing with Microsoft® Windows® Server 2019. My focus was to test Micron’s new 7300 PRO NVMe™ SSD performance at a system level, scaling the drive count from one to eight NVMe SSDs. Launched in 2019, our budget-friendly Micron 7300 SSD extends the benefits of the NVMe flash storage protocol — speed, capacity and low latency — to more workloads and more applications.

Test System

  • Dell PowerEdge R7515
  • 1x AMD 7702P (64 cores)
  • 8x 1.92TB Micron 7300 NVMe SSD storage
  • 512GB RAM
  • Windows Server 2019 Datacenter (ver. 1809)
  • Baseline BIOS Settings:
    • Hyperthreading: Disabled
    • System Profile: Performance

Note that I disabled hyperthreading to simplify my testing and analysis, since DiskSpd needs execution threads to be affinitized. Windows places at most 64 logical processors in a processor group, so with a 64-core processor and hyperthreading off, only a single processor group is created. That made it easy to know which processors in the group were associated with which non-uniform memory access (NUMA) node. Enabling hyperthreading would have exposed 128 logical processors and therefore at least two processor groups, making the association of NUMA nodes to processor groups more complex. More details on Windows processor groups can be found here.
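The processor-group reasoning can be checked with simple arithmetic. The sketch below assumes only the Windows limit of 64 logical processors per group; the function name is made up for illustration, and the result is the minimum number of groups, since Windows may split processors differently depending on the NUMA layout.

```python
import math

# Windows limit on logical processors in one processor group.
MAX_LOGICAL_PROCESSORS_PER_GROUP = 64


def min_processor_groups(cores: int, threads_per_core: int) -> int:
    """Return the minimum number of processor groups Windows needs.

    Hypothetical helper for illustration; it is not part of DiskSpd or Windows.
    """
    logical_processors = cores * threads_per_core
    return math.ceil(logical_processors / MAX_LOGICAL_PROCESSORS_PER_GROUP)


if __name__ == "__main__":
    # 64-core EPYC 7702P with hyperthreading disabled, then enabled.
    print("HT off:", min_processor_groups(64, 1), "group(s)")
    print("HT on: ", min_processor_groups(64, 2), "group(s)")
```

With hyperthreading off, all 64 cores fit in group 0, which is why every DiskSpd affinity list in this post starts with group 0.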

Initial Drive Scaling

My focus is Microsoft technologies, namely SQL Server and Azure Stack HCI performance. For this reason, I decided to use DiskSpd, an open-source, Microsoft-developed storage performance tool similar to FIO on Linux.

Once the OS was deployed, appropriate drivers installed and updates applied, I ran a few DiskSpd tests based on Microsoft recommendations outlined in this SNIA presentation. Below is an example of the DiskSpd command executed:

diskspd.exe `-ag0,0,1,2,3 -t4 -b4k -d240 -o128 -r4k -Rtext -Su -W30 -D -L "#1"

The command above runs a random read workload with 4K blocks (-b4k) aligned on 4K boundaries (-r4k), using four threads (-t4) affinitized to cores 0 through 3 of processor group 0 (-ag0,0,1,2,3), 128 outstanding I/Os per thread (-o128), software caching disabled (-Su), a 30-second warmup (-W30), and a duration of four minutes (-d240) against a single drive ("#1" is the physical drive number). A unique instance of DiskSpd was created per drive. The output is text (-Rtext) and includes latency (-L) and IOPS statistics (-D). The complete DiskSpd command-line documentation can be found here.
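Since one DiskSpd instance runs per drive, the per-drive command lines can be generated rather than typed by hand. The following is a minimal sketch, not the script used for these tests: the helper names are hypothetical, and the core assignment (each drive gets the next four consecutive cores in group 0) and the physical drive numbers are assumptions, since the post only shows the command for drive 1.

```python
from typing import List, Sequence


def diskspd_command(drive: int, cores: Sequence[int], group: int = 0) -> str:
    """Build one DiskSpd command line with the flags shown in the post."""
    affinity = ",".join(str(core) for core in cores)
    return (
        f"diskspd.exe `-ag{group},{affinity} -t{len(cores)} -b4k -d240 -o128 "
        f'-r4k -Rtext -Su -W30 -D -L "#{drive}"'
    )


def commands_for_drives(drives: Sequence[int], threads_per_drive: int = 4) -> List[str]:
    """Give each drive its own block of consecutive cores (an assumption)."""
    commands = []
    for index, drive in enumerate(drives):
        first_core = index * threads_per_drive
        cores = range(first_core, first_core + threads_per_drive)
        commands.append(diskspd_command(drive, cores))
    return commands


if __name__ == "__main__":
    # Assumed physical drive numbers 1..4; check the real numbers on the host.
    for command in commands_for_drives([1, 2, 3, 4]):
        print(command)
```

Each printed line would be launched as its own DiskSpd process so that all drives are exercised at the same time.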

Figure 1: Scaling the Number of Drives on a Single NUMA Node Configuration

When executing the workload against a single drive, I was able to meet the 7300 SSD spec sheet performance (~520K IOPS). However, as I scaled the number of drives, per-drive performance decreased. This result was particularly odd considering that overall CPU usage was relatively low at less than 27% and no other common system bottlenecks were identifiable at four and eight drives. My analysis of the breakdown of CPU usage did show unusually high kernel activity on a per-core basis.

I also tried modifying the DiskSpd command by adding more execution cores and threads per drive.

diskspd.exe `-ag0,0,1,2,3,4,5,6 -t6 -b4k -d240 -o128 -r4k -Rtext -Su -W30 -D -L "#1"

This command lowered per-core utilization, but performance did not improve.

Increasing the Number of NUMA Nodes

From previous AMD materials I had read and our team’s previous experience with first-generation EPYC (Naples) processors, I wanted to see if changing NUMA settings would resolve this scaling issue. Both the Naples and Rome families of AMD processors allow the BIOS to partition the CPU into different NUMA domains through a setting called NUMA nodes per socket (NPS). NPS can be set to 1 (default), 2 or 4. In addition, a separate setting called CCX as NUMA domain treats each core cache complex (CCX) as its own NUMA domain and overrides the NPS setting. The 7702P has 16 CCXs of four cores each, so enabling this setting configures 16 NUMA nodes. I don’t want to get too deep in the weeds on the Rome architecture here, but these two blogs (1, 2) are a good place to start in understanding how these BIOS settings correspond to logical NUMA domains.
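To make the mapping from BIOS settings to NUMA nodes concrete, here is a small sketch for the 64-core 7702P. The node counts follow from the description above (NPS of 1, 2 or 4, or 16 nodes with CCX as NUMA domain). The function name is hypothetical, and the assumption that cores are numbered contiguously within each node is mine; the actual core-to-node mapping should be confirmed on the system.

```python
from typing import List

TOTAL_CORES = 64    # AMD EPYC 7702P, hyperthreading disabled
CORES_PER_CCX = 4   # the 7702P has 16 CCXs of four cores each


def numa_layout(nps: int = 1, ccx_as_numa: bool = False) -> List[List[int]]:
    """Return the assumed list of core IDs in each NUMA node.

    CCX as NUMA domain overrides NPS, as described in the post.
    Contiguous core numbering per node is an illustrative assumption.
    """
    if nps not in (1, 2, 4):
        raise ValueError("NPS must be 1, 2 or 4")
    nodes = TOTAL_CORES // CORES_PER_CCX if ccx_as_numa else nps
    cores_per_node = TOTAL_CORES // nodes
    return [
        list(range(node * cores_per_node, (node + 1) * cores_per_node))
        for node in range(nodes)
    ]


if __name__ == "__main__":
    for label, layout in [
        ("NPS1", numa_layout(1)),
        ("NPS2", numa_layout(2)),
        ("NPS4", numa_layout(4)),
        ("CCX as NUMA", numa_layout(ccx_as_numa=True)),
    ]:
        print(f"{label}: {len(layout)} node(s), {len(layout[0])} cores per node")
```

Under that assumption, a four-thread DiskSpd instance pinned to cores 0 through 3 stays inside a single NUMA node in every configuration, including the 16-node case.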

That said, I ran more DiskSpd tests while scaling the number of NUMA nodes from one to 16 with four and eight drives. Performance did improve, but it still did not scale as I had expected.

Figure 2: Scaling the Number of NUMA Nodes From One to 16 With Four Drives

When executing against four drives, the best per-drive performance came with four NUMA nodes, and performance didn’t change much beyond that. Just as with the single NUMA node testing, overall CPU utilization was relatively low at 23%, while utilization of the affinitized cores in use was around 92%. Within each of those cores, about 7% was user time and about 85% was kernel time. It was interesting to see a 44% performance gain from moving from two NUMA nodes to four.

Figure 3: Scaling the Number of NUMA Nodes With Eight Drives

When executing against eight drives, per-drive performance was lower than in the previous four-drive tests. Again, overall CPU usage was relatively low at 40%, and individual core utilization was about 80%, with about 7% of each core’s time spent in user space and approximately 74% in kernel space.
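The low overall CPU figures and the high per-core figures are consistent with each other once the number of busy cores is taken into account. The sketch below is my reading of the numbers, not an analysis from the test logs: it assumes four affinitized cores per drive (the -t4 command shown earlier), 64 cores in total, and that the remaining cores are essentially idle.

```python
def overall_cpu_percent(
    drives: int,
    per_core_percent: float,
    threads_per_drive: int = 4,
    total_cores: int = 64,
) -> float:
    """Estimate system-wide CPU usage from the busy, affinitized cores.

    Assumes each DiskSpd thread occupies its own core and idle cores add ~0%.
    """
    busy_cores = drives * threads_per_drive
    return busy_cores * per_core_percent / total_cores


if __name__ == "__main__":
    # Four drives: 16 busy cores at about 92% each.
    print("4 drives:", overall_cpu_percent(4, 92), "% overall")
    # Eight drives: 32 busy cores at about 80% each.
    print("8 drives:", overall_cpu_percent(8, 80), "% overall")
```

These estimates come out to 23% and 40%, matching the overall utilization reported for the four-drive and eight-drive runs. In other words, the cores actually doing I/O were close to saturated, mostly in kernel time, even though the system as a whole looked lightly loaded.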

Figure 4: Performance Across a Number of Drives and NUMA Configurations

Overall system performance increased when I changed the NUMA settings. However, the high kernel time on the affinitized cores suggests that something in the Windows storage stack is preventing DiskSpd from fully utilizing the drives at scale.

What Do These Results Mean?

My next major project will be an all-NVMe Azure Stack HCI Reference Architecture using the Micron 7300 SSD with AMD servers. Given my observations from this initial testing, I’ll be sure to keep an eye on how NUMA affects a hyperconverged virtualization solution. If you intend to deploy Windows on the new AMD Rome platform, be sure to see which NUMA configuration works best for your workload. I certainly will.

Contact Micron

Have questions about Micron storage testing or methodology? Leave a comment below or email us at ssd@micron.com.


Dilim Nwobu

Dilim Nwobu started his career at Dell Technologies, where he worked on storage, custom solutions and server platform development. At Micron, Dilim is a storage solution engineer who focuses on testing Micron products with Microsoft technologies, namely SQL Server and Azure Stack HCI.