Optimizing Your All-NVMe vSAN Cluster (Part 1)

By Collin Murphy - 2020-04-24

NVMe is the latest and greatest storage protocol for the cloud and data center as well as for laptops. High-performance enterprise environments are adopting NVMe solid-state drives (SSDs) to accelerate their applications, taking advantage of the low latency and high throughput of the NVMe architecture, which eliminates much of the overhead in the SAS/SATA protocols. NVMe lowers latency by reducing the CPU clock cycles required for an operation, giving that processing power back to your applications. In recent years, the price of NVMe devices has dropped drastically, bringing them more into the mainstream. For this reason, NVMe is becoming not only an acceleration technology but also a general-purpose storage technology. VMware® vSAN is one of the many solutions where NVMe is becoming a top choice in storage protocols.

NVMe is not new to vSAN’s cache tier, but it has only recently become popular to use for the capacity tier. Each revision of vSAN continues to increase its performance. At the same time, storage devices are increasing in capacity. This means you can deploy a high-performance solution using fewer drives than previously possible. The question then becomes, “How do you maximize performance while minimizing drive count?” Our study was designed to answer that question.

In vSAN, the main way to increase performance is by increasing the number of disk groups. Each disk group adds an additional cache device, reducing the stress on each cache drive in heavy write operations. Adding more disk groups also adds more capacity drives, boosting read performance since reads come from the capacity tier in all-flash configurations. But do you need to scale up to the maximum supported five disk groups and 35 capacity drives in an all-NVMe environment? There are certainly cost considerations to factor in. How many disk groups do you actually need? How many capacity drives? Let’s take a look …
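
To make the cost question concrete, here is a minimal Python sketch that counts the drives one host needs for a given layout. It assumes one cache device per disk group and at most seven capacity drives per disk group (which is how five disk groups reach the 35-drive maximum mentioned above). The function and constant names are illustrative and not part of any VMware tool.

```python
MAX_DISK_GROUPS = 5          # per host, from the limit quoted above
MAX_CAPACITY_PER_GROUP = 7   # 35 capacity drives / 5 disk groups


def drives_per_host(disk_groups: int, capacity_per_group: int) -> dict:
    """Count cache and capacity drives for one host (illustrative helper)."""
    if not 1 <= disk_groups <= MAX_DISK_GROUPS:
        raise ValueError("disk_groups must be between 1 and 5")
    if not 1 <= capacity_per_group <= MAX_CAPACITY_PER_GROUP:
        raise ValueError("capacity_per_group must be between 1 and 7")
    cache = disk_groups  # assumption: one cache device per disk group
    capacity = disk_groups * capacity_per_group
    return {"cache": cache, "capacity": capacity, "total": cache + capacity}


if __name__ == "__main__":
    # Smallest layout tested in this study versus the supported maximum.
    print(drives_per_host(1, 1))
    print(drives_per_host(5, 7))
```

The gap between the two layouts (2 drives per host versus 40) is the range the rest of this study tries to narrow.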

How was the test configured?

I conducted this study on a four‑node vSAN cluster, using Dell® R740xd servers, each with dual Intel® Xeon® Gold 6142 CPUs and 384GB of memory. Each server had three 25GbE network connections, with two dedicated to vSAN and one reserved for management and VMware vMotion. I gave vSAN two dedicated ports because of the high throughput potential of this configuration — in large block sequential read tests, some of the nodes were able to see more than 25Gbps, necessitating the second link for vSAN traffic to achieve maximum performance. I wanted to ensure that the networking was not a bottleneck for these tests.

The four-node vSAN cluster used in testing

The solid-state drives used for vSAN were all 3.84TB Micron® 7300 PRO in the U.2 form factor. While the Micron 7300 PRO is not ideal for the cache tier due to its large capacity, it suited the purpose of this study to see how performance changed with various drive configurations.

What methodology was used for testing?

For the first part of this study, I wanted to evaluate how storage performance scaled within each disk group based on the number of capacity drives. I created a single disk group on each host and swept from one to seven capacity drives in that disk group. I used HCIBench (hyperconverged infrastructure benchmark) for testing and ran the same tests for each configuration. Each test suite consisted of 4K random reads, 4K random writes, 128K sequential reads and 128K sequential writes. Each host got two VMs, and each VM had eight 100GiB virtual disks (VMDKs). Before each test, I preconditioned the VMDKs with a 128K 100% sequential write operation, run long enough to completely fill the disks. For each test case, I set the total outstanding I/O used (figured as the number of threads per disk x the number of data disks x the number of VMs) based on prior test results. The point of this study was not to find maximum performance but rather to see how performance changed by increasing the number of drives.
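
The outstanding I/O formula above is easy to sanity-check. The following Python sketch is only an illustration (the function name and parameters are mine, not HCIBench settings); it plugs in the layout described here (four hosts, two VMs per host, eight VMDKs per VM) along with the per-VMDK thread counts used in the four tests.

```python
def total_outstanding_io(threads_per_disk: int, disks_per_vm: int, total_vms: int) -> int:
    """Threads per disk x data disks per VM x number of VMs."""
    return threads_per_disk * disks_per_vm * total_vms


HOSTS = 4
VMS_PER_HOST = 2
DISKS_PER_VM = 8
TOTAL_VMS = HOSTS * VMS_PER_HOST  # 8 VMs across the cluster

# Threads per VMDK for each test, as stated in the results section.
THREADS = {
    "4K random write": 1,
    "128K sequential write": 1,
    "4K random read": 4,
    "128K sequential read": 2,
}

if __name__ == "__main__":
    for test, threads in THREADS.items():
        oio = total_outstanding_io(threads, DISKS_PER_VM, TOTAL_VMS)
        print(f"{test}: {oio} outstanding I/O")
```

This gives 64 outstanding I/O for both write tests, 256 for random reads and 128 for sequential reads, matching the totals quoted in the results.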

What were the results?

The first test consisted of 4K random writes. Writes are typically slower than reads because the drives characteristically write slower than they read and because of vSAN’s two-tier storage, where writes must first go to the cache tier and then get de-staged to the capacity tier. For these reasons, write tests typically use fewer threads than read tests do. For this test, I gave each VMDK one thread, for a total of 64 outstanding I/O across the cluster.

Graph 1: Comparing write latencies and IOPS across random writes

We can see from Graph 1 that increasing the number of capacity drives slightly improved 4K random write IOPS and lowered latency. Going from one to six capacity drives increased IOPS by 44% and lowered latency by 28%. Going to seven drives appeared to reduce performance slightly, though it was within the margin of error, so we can assume that it is effectively the same as for six drives.

Note that these results do not represent the maximum write performance. I deliberately used a lower thread count to ensure that the scaling was not bottlenecked by the single cache drive. Maximum write performance was well over 100K IOPS.

While 4K transactions are best for showing off IOPS, large block sequential operations will show the highest throughput. For that, I ran a 128K sequential write workload. Because of the large block size, the thread count was again only one thread per VMDK, resulting in 64 outstanding I/O.

Graph 2: Comparing write latencies and throughput on sequential writes

In Graph 2, we see that the number of capacity drives has a large effect on write performance. Going from one to seven capacity drives more than tripled the throughput, with 73% reduced latency. Note that this is not because any individual drive reached its maximum performance but because of increased parallelization. A single 3.84TB Micron 7300 PRO can write 128K sequential at 1.9GB/s, but an individual drive’s performance becomes less significant than the number of drives over which you can distribute the load. This is because vSAN adds additional layers (cache, checksum, replication, deduplication, etc.).
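
Because each test holds the number of outstanding I/Os fixed, throughput and average latency are tied together by Little's Law (outstanding I/O = IOPS x average latency). The Python sketch below is a rough consistency check on the write results, not part of the original test harness. It assumes the benchmark keeps the queue full for the whole run, and the function names are illustrative.

```python
def latency_reduction_for_speedup(speedup: float) -> float:
    """Fractional latency drop implied by a throughput speedup at fixed outstanding I/O."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_for_latency_reduction(reduction: float) -> float:
    """Throughput multiplier implied by a fractional latency drop at fixed outstanding I/O."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # 4K random writes: IOPS rose 44% going from one to six capacity drives.
    print(f"{latency_reduction_for_speedup(1.44):.1%} implied latency drop")
    # 128K sequential writes: latency fell 73% going from one to seven drives.
    print(f"{speedup_for_latency_reduction(0.73):.2f}x implied throughput gain")
```

A 44% IOPS gain implies a latency drop of about 31%, close to the 28% measured, and a 73% latency drop implies roughly 3.7 times the throughput, in line with "more than tripled."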

Reads are typically much faster than writes for a few reasons. One is that it is much easier for the drive to read data than to write it. Another is that reads are served directly from the capacity tier, so they do not have to go through the cache tier. Because reads can be served much more quickly than writes, we can use more threads per VMDK. In this case, we used four threads per VMDK for 4K random reads, for a total of 256 outstanding I/O.

Graph 3: Comparing read latencies and IOPS across random reads

In Graph 3, we see that increasing the number of capacity drives had essentially no effect on 4K random reads. The Micron 7300 SSD is performant enough that we can get full disk-group read throughput from a single device. Each Micron 7300 is capable of 520,000 random read IOPS, which is roughly double the approximately 268K IOPS that all four drives together delivered in this test (in the configuration with a single capacity drive per node). Remember, all reads come from the capacity tier, so the cache drive has no effect on the 100% read tests.

When switching to large block sequential operations, we would expect to benefit from adding more capacity drives since we should be able to come very close to the throughput limit of a single drive. Each drive is capable of 3GB/s of 128K sequential read operations. With four nodes, that means a theoretical maximum of 12GB/s of throughput potential (again, for the single capacity drive configuration). Here, two threads were used per VMDK for a total of 128 outstanding I/O.

Graph 4: Comparing read latencies and throughput across sequential reads

As shown in Graph 4, with a single capacity drive per node, I hit over 8GB/s, which was around 70% of the 12GB/s theoretical maximum. Going up to three drives showed no additional gain, but going up to five or more capacity drives gave a slight increase in throughput and decrease in latency, both around 12%.
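
The theoretical ceiling and the fraction achieved can be worked out in a few lines. This Python sketch uses the figures quoted above (3GB/s per drive, four nodes, one capacity drive per node). The measured value of 8.0GB/s is a floor, since the result was "over 8GB/s", and the function names are illustrative.

```python
def theoretical_read_gb_s(per_drive_gb_s: float, drives_per_node: int, nodes: int) -> float:
    """Aggregate sequential read ceiling if every capacity drive ran at its rated speed."""
    return per_drive_gb_s * drives_per_node * nodes


def fraction_of_theoretical(measured_gb_s: float, theoretical_gb_s: float) -> float:
    """Share of the theoretical ceiling that the cluster actually delivered."""
    return measured_gb_s / theoretical_gb_s


if __name__ == "__main__":
    ceiling = theoretical_read_gb_s(per_drive_gb_s=3.0, drives_per_node=1, nodes=4)
    measured_floor = 8.0  # assumption: lower bound for "over 8GB/s"
    print(f"Ceiling: {ceiling:.0f} GB/s")
    print(f"Achieved at least {fraction_of_theoretical(measured_floor, ceiling):.0%}")
```

With these inputs the ceiling is 12GB/s and the cluster reached at least 67% of it, consistent with the "around 70%" figure.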

Maximize performance, minimize drive count

Based on this study, we can conclude that adding more capacity drives (at least in this configuration) shows the greatest effect in write-intensive workloads, particularly large block writes. Reads show minimal difference, likely because the individual drive performance is so great. If you had an all-SATA configuration, reads would typically show more improvement with an increased number of drives. Because NVMe drives are generally more performant than SATA for writes as well, it is also likely that an all-SATA configuration would show an even more dramatic increase in write performance.

NVMe is becoming the new norm, and VMware seems on board with optimizing for it. There’s also the potential to create multiple vSAN devices by splitting a single SSD into multiple NVMe namespaces. In the near future, NVMe drives will become mainstream, especially as they become more affordable. Micron is committed to providing high-performance NVMe solutions while still maintaining SATA options for those not quite ready for the move. At the very least, NVMe should be strongly considered for the cache tier of your vSAN deployments.

Want more info?

Check in with @MicronTech and connect with us on LinkedIn to see when part 2 of this study goes live. We’ll discuss how vSAN on NVMe SSD performance increases with the number of disk groups and the best configuration for a fixed drive count.


Collin Murphy

Collin Murphy is a senior storage solutions engineer at Micron focused on characterizing Micron’s SSDs on VMware vSAN to develop all flash solutions and reference architectures.