NVMe™ is becoming mainstream, and much of the industry, including VMware for vSAN™, is pushing to adopt this technology as a higher-performance standard. The NVMe protocol, which was built specifically for solid state drives (SSDs), has numerous performance advantages over traditional SAS/SATA storage, with improved latency perhaps the most important. Because NVMe on the PCIe interface bypasses much of the stack overhead of the SAS/SATA protocols, it delivers lower latency, more IOPS, and higher throughput while also reducing CPU clock cycles, power consumption, and heat output.
Introducing the little-known NVMe namespaces
A feature of the NVMe protocol that not many people are familiar with is namespaces. Namespaces allow you to carve an SSD into logically separate smaller drives, like partitions. Unlike partitions, however, namespaces each have their own submission and completion queues, allowing for highly parallelized operations. All NVMe drives utilize namespaces, but most support only a single namespace and come pre-configured with the primary namespace taking up the entire drive capacity. With the Micron® 9300 MAX NVMe SSD, you can configure up to 32 separate namespaces in any sizes you want.
Because the smallest Micron 9300 drive offered is 3.84TB, I realized it doesn’t make sense to use an entire 9300 NVMe SSD as a vSAN cache device, considering only 600GB of it would be utilized. I then had the crazy idea of separating the SSD into smaller logical drives using namespaces and presenting each of them to the VMware ESXi hypervisor as a separate storage device.
“Why would you do this?” Scaling!
Imagine a small ROBO configuration utilizing minimal servers with very few drive slots. With traditional vSAN configurations, you need several drives to scale performance, since vSAN adds additional threads with each additional disk group. If you are limited to only a few slots, you are basically stuck with a single disk group, which limits your performance (as well as your capacity) since one slot must be reserved for the cache drive. The Micron 9300 SSD can go up to 15.36TB in a single U.2 drive, allowing for very high density, and—with namespaces—very high performance as well. Another use case could be in edge computing with minimal servers. This could even be used outside of vSAN, such as for scratch disks passed as RDMs to multiple VMs.
In preparing for this demo, I ran some tests on a 2-node configuration utilizing namespaces with the Micron 9300. For that configuration, we used two Dell R730xd servers, each with a single Micron 9300 SSD and dual 25GbE NICs. The purpose was to see how vSAN performance scaled with the number of disk groups, capacity drives per disk group, storage profiles, and so on. The biggest takeaway from that study was that increasing the number of disk groups dramatically improves performance, with almost three times the read performance with three disk groups as with one. The read and write scaling are shown below, using 128 outstanding IOs per host with HCIBench.
As shown by the graphs above, read and write performance both scale with disk groups. The returns beyond three disk groups were negligible, so we stuck with three. Performance scales better for reads than for writes; note that the write tests were run long enough to get into heavy de-staging, hence the lower write numbers. If this had been a test where the dataset fit 100% in the cache tier, the numbers would have been substantially higher.
Creating 4 vSAN nodes on 1 demo server
Building on this study, I wanted to show off this functionality at VMworld ’19 but didn’t want to ship two big, bulky servers just for my demo. Instead, I used a 2U 4-node Supermicro BigTwin and decided to build a full 4-node vSAN cluster as the demo. Each node had six drive slots, four of them supporting NVMe, though I only needed one. I put a 15.36TB Micron 9300 MAX into each node, installed the latest ESXi image with vSAN 6.7U3, and split each drive into 24 separate namespaces: three 600GB namespaces for cache drives and 21 namespaces of 594GB each for capacity drives. The configuration is shown below.
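As a quick sanity check on this layout (my own arithmetic, assuming GiB-based namespace sizes and 512-byte logical blocks), the 24 namespaces just fit within the drive’s capacity:

```shell
# Back-of-envelope check: 15.36 TB (decimal) is roughly 14305 GiB, so
# 3 x 600 GiB cache + 21 x 594 GiB capacity namespaces (14274 GiB) just fit.
drive_gib=$((15360 * 1000 * 1000 * 1000 / 1024 / 1024 / 1024))
layout_gib=$((3 * 600 + 21 * 594))
echo "drive capacity:  ~${drive_gib} GiB"
echo "namespace total:  ${layout_gib} GiB"
# A 600 GiB namespace expressed as a 512-byte block count:
blocks_per_gib=$((1024 * 1024 * 1024 / 512))
echo "600 GiB in 512-byte blocks: $((600 * blocks_per_gib))"
```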
Luckily, because this is a standard NVMe feature, we didn’t need any special tools to configure the namespaces—I just used the built-in `esxcli nvme` commands to create them. The commands below show how a namespace can be created and attached to the controller.
```shell
esxcli nvme device namespace create -A vmhba3 -c 1258291200 -p 0 -f 0 -m 0 -s 1258291200
esxcli nvme device namespace attach -A vmhba3 -c 1 -n 1
```
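Repeating that by hand 24 times per drive gets tedious, so it can be scripted. Below is a dry-run sketch (my own loop, not the exact commands used for the demo) that prints the invocations for all 24 namespaces, with IDs 1–3 as 600GiB cache and 4–24 as 594GiB capacity. The adapter name `vmhba3` and the flag values mirror the example above; drop the leading `echo` to actually run it on an ESXi host.

```shell
# Dry run: print the esxcli commands that would carve one 9300 into
# 24 namespaces. Sizes are 512-byte block counts.
adapter=vmhba3
for ns in $(seq 1 24); do
  if [ "$ns" -le 3 ]; then
    size=1258291200     # 600 GiB in 512-byte blocks (cache)
  else
    size=1245708288     # 594 GiB in 512-byte blocks (capacity)
  fi
  echo esxcli nvme device namespace create -A "$adapter" -c "$size" -p 0 -f 0 -m 0 -s "$size"
  echo esxcli nvme device namespace attach -A "$adapter" -c 1 -n "$ns"
done
```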
Once the namespaces were created, they showed up on the vSphere hosts as separate storage devices (not partitions). As far as the hosts knew, each namespace was a separate physical storage device. We were, in effect, tricking vSAN into thinking we gave it multiple physical drives, when in fact we were only using a single physical drive.
When the drive itself is the bottleneck, this is not at all useful; but when you’re using a drive that can deliver 850K IOPS and 3.5+GB/s of throughput, splitting it into smaller logical drives helps vSAN (or whatever application you’re using) scale performance through increased parallelization.
And the winner was… Performance!
Enough technical talk; let’s talk about performance. If I had told you, before you knew about NVMe namespaces and the Micron 9300 SSD, that we could get amazing performance from a single drive, you’d probably have called me crazy. Yet with this solution, we were able to drive more performance from a single drive than most solutions do with 20+ physical drives… In a much smaller form factor… With higher density… Using less power… Creating less heat… Reducing your TCO…
With this configuration, I was able to push 750K IOPS (4K random reads) and over 11.5GB/s (128K sequential reads)! The throughput was so high, in fact, that I had to add a third NIC to the solution so that vSAN could have two dedicated NICs. I was averaging over 21Gbps per node, at times seeing over 25Gbps on some nodes. That means I was driving more performance than is even possible with dual 10GbE NICs, even if you were to give them both to vSAN with nothing reserved for vMotion, management, or other network functions.
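As a rough cross-check (round numbers assumed, and treating the network load per node as the cluster throughput divided evenly), the per-node traffic implied by that cluster throughput lines up with the NIC utilization I saw:

```shell
# ~11.5 GB/s of sequential reads across 4 nodes implies roughly 23 Gbit/s
# per node, beyond what dual 10GbE NICs could carry even fully dedicated.
total_mb_s=11500                          # ~11.5 GB/s cluster-wide
per_node_mb_s=$((total_mb_s / 4))         # 2875 MB/s per node
per_node_gbit_s=$((per_node_mb_s * 8 / 1000))
echo "per-node throughput: ~${per_node_gbit_s} Gbit/s"
```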
Because the demo was running live in VMworld’s Solutions Exchange (not in a temperature-controlled lab environment), noise was a concern. I reduced the power settings to “Energy Efficient,” which slightly reduced performance, though the cluster still maintained 550K+ IOPS for four days. This was a straight 4K random-read test, which is why the throughput was only 2GB/s. Had I switched the block size to 128K, the throughput would have jumped to about 8GB/s while IOPS dropped to about 62K.
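The IOPS-versus-throughput tradeoff here is just arithmetic: throughput equals IOPS times block size. A quick sketch using the round numbers above:

```shell
# At a fixed device limit, throughput = IOPS x block size: ~550K IOPS of 4K
# reads is about 2 GB/s, while ~62K IOPS of 128K reads is about 8 GB/s.
iops_4k=550000
mb_4k=$((iops_4k * 4 / 1024))             # ~2148 MB/s, roughly 2 GB/s
iops_128k=62000
mb_128k=$((iops_128k * 128 / 1024))       # 7750 MB/s, roughly 8 GB/s
echo "4K:   ~${mb_4k} MB/s"
echo "128K: ~${mb_128k} MB/s"
```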
Note that each node ran two VMs, which is why eight separate numbers were being reported for each metric. Along with the eight load VMs, I was running a three-node Elasticsearch cluster with Kibana (using Metricbeat on each of the VMs to collect host metrics), as well as the vCenter Server Appliance and a Windows server for management. Even with 14 total VMs running in this cluster, these drives delivered this performance with little effort.
The demo gathered quite a bit of interest at VMworld. I gave a few presentations a day on this configuration and was even invited to join a podcast with the guys at Virtually Speaking (@virtspeaking) to talk about the future of vSAN, all-flash, namespaces, networking, endurance, cost, TCO, and more.
We have plans to test this further and come up with more use cases for it. Micron and VMware are in talks, and we hope to see this become a supported configuration in the near future. Don’t forget, people were surprised when Micron storage experts showed up to VMworld 2014 with an all-flash vSAN cluster before all-flash was a thing. People thought we were crazy then, too! 😉