Disclaimer: The technology tested here is not currently supported by VMware and may never be. This is not recommended for use in any production environment, especially where VMware support is required.
NVMe™ is becoming mainstream, and much of the industry is pushing to adopt this technology as a higher performance standard, including VMware for vSAN™. The NVMe protocol, which was built specifically for solid state drives (SSDs), has numerous performance advantages over traditional SAS/SATA storage, with improved latency being perhaps the most important. The NVMe protocol on the PCIe interface bypasses much of the stack overhead of SAS/SATA protocols, and thus is able to deliver lower latency, more IOPS, and higher throughput while also reducing CPU clock cycles, power consumption, heat output, etc.
Introducing the little-known NVMe namespaces
A feature of the NVMe protocol that not many people are familiar with is namespaces. Namespaces allow you to carve an SSD into logically separate smaller drives, like partitions. But, unlike partitions, namespaces each have their own submission and completion queues allowing for highly parallelized operations. All NVMe drives utilize namespaces, but most drives support only a single namespace and come pre-configured with the primary namespace taking up the entire drive capacity. With the Micron® 9300 MAX NVMe SSD, you can configure up to 32 separate namespaces in any size you want.
Because the smallest size Micron 9300 drive offered is 3.84TB, I realized it doesn’t make sense to use an entire 9300 NVMe SSD for a vSAN cache device, considering only 600GB would be utilized. I then had the crazy idea to separate the SSD into smaller logical drives using namespaces and to present each of them to the VMware ESXi hypervisor as separate storage devices.
“Why would you do this?” Scaling!
Imagine a small ROBO configuration utilizing minimal servers with very few drive slots. With traditional vSAN configurations, you need several drives to scale performance, since vSAN adds additional threads with each additional disk group. If you are limited to only a few slots, you are basically stuck with a single disk group, which limits your performance (as well as your capacity) since one slot must be reserved for the cache drive. The Micron 9300 SSD can go up to 15.36TB in a single U.2 drive, allowing for very high density, and—with namespaces—very high performance as well. Another use case could be in edge computing with minimal servers. This could even be used outside of vSAN, such as for scratch disks passed as RDMs to multiple VMs.
In preparing for this demo, I ran some tests using HCIBench on a 2-node configuration utilizing namespaces with the Micron 9300. I used two Dell R730xd servers, each with dual Intel Xeon 2690v4 processors, 256GB RAM, a single Micron 9300 SSD and dual 25GbE NICs. The purpose was to see how vSAN performance scaled with the number of disk groups, capacity drives per disk group, storage profiles, etc. To make sure the comparisons were fair, I used the same HCIBench configuration for each test, which consisted of four VMs per node, 8 VMDKs per VM (100GB each), and four threads per VMDK, working out to 128 outstanding IO. The biggest takeaway from that study was that increasing the number of disk groups dramatically improves performance, with almost three times the read performance at three disk groups as with a single disk group. The read and write scaling are shown below. These numbers were reported with the vSAN Default Storage Policy, and deduplication and compression disabled.
As shown by the graphs above, read and write performance both scale with disk groups. The returns for more than three disk groups were negligible, so I stuck with three disk groups. I noticed that the performance scales better with reads than writes. Note that the write tests were run long enough to get into heavy de-staging, hence the lower performance. If this were a test where the dataset fit 100% in the cache tier, the numbers would be substantially higher.
Creating 4 vSAN Nodes on 1 demo server
Building on this study, I wanted to show off this functionality at VMworld’19 but didn’t want to ship two big bulky servers just for my demo. Instead, I used a 2U 4-node Supermicro Big Twin (SYS-2029BT-HNC0R) and decided to do a full 4-node vSAN cluster as a demo. Each of the four nodes has dual Intel Xeon Gold 6142 processors and 384GB of RAM. They each have six drive slots with four of them supporting NVMe, though I only needed one. I put a 15.36TB Micron 9300 MAX into each node, installed the latest ESXi 6.7U3 image (build 14320388), connected them to a vCenter Server Appliance (6.7U3 build 184.108.40.206000), and split each drive into 24 separate namespaces: three 600GB namespaces for cache drives and 21 594GB namespaces for capacity drives. The configuration is shown below.
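As a quick sanity check on that layout, the sketch below adds up the 24 namespaces and compares the total with the drive's nominal capacity. It assumes the namespace sizes are binary gigabytes (GiB) and the 15.36TB drive capacity is decimal. Neither unit convention is stated in this post, and the usable capacity a real drive reports may differ slightly.

```python
# Sanity check: do 3 cache + 21 capacity namespaces fit on one 15.36TB drive?
# Assumptions (not from the post): namespace sizes are GiB, drive size is decimal TB.

GIB = 2**30

def layout_total_gib(layout):
    """Sum a list of (count, size_gib) pairs into a total size in GiB."""
    return sum(count * size_gib for count, size_gib in layout)

layout = [(3, 600), (21, 594)]          # cache and capacity namespaces
namespace_count = sum(count for count, _ in layout)
total_gib = layout_total_gib(layout)
drive_gib = 15.36e12 / GIB              # nominal drive capacity in GiB

print(f"{namespace_count} namespaces, {total_gib} GiB of about {drive_gib:.0f} GiB")
print("fits" if total_gib <= drive_gib else "does not fit")
```

Under these assumptions the layout uses 14,274 GiB, which leaves only a few tens of GiB unallocated on the drive.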
Luckily, because this is a standard NVMe feature, I didn’t need any special tools to configure the namespaces—I just used the built-in `esxcli nvme` commands to create them. The commands below show how a namespace can be created and attached to the controller.
`esxcli nvme device namespace create -A vmhba3 -c 1258291200 -p 0 -f 0 -m 0 -s 1258291200`
`esxcli nvme device namespace attach -A vmhba3 -c 1 -n 1`
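The block count passed to the create command can be derived from the desired namespace size. The sketch below is only an illustration: it assumes 512-byte logical blocks and sizes in GiB (the post does not state either), and the helper names `blocks_for_gib` and `create_command` are hypothetical. The generated string simply reuses the flags from the create command above, so check them against the esxcli help on your own host before running anything.

```python
# Derive the block count used in the namespace create command above.
# Assumption: 512-byte logical blocks and sizes in GiB (not stated in the post).

BLOCK_SIZE = 512
GIB = 2**30

def blocks_for_gib(size_gib, block_size=BLOCK_SIZE):
    """Number of logical blocks needed for a namespace of size_gib GiB."""
    return size_gib * GIB // block_size

def create_command(adapter, size_gib):
    """Build a create command string with the same flags as the example above."""
    blocks = blocks_for_gib(size_gib)
    return (f"esxcli nvme device namespace create -A {adapter} "
            f"-c {blocks} -p 0 -f 0 -m 0 -s {blocks}")

print(blocks_for_gib(600))            # 1258291200, matching the command above
print(create_command("vmhba3", 594))  # a 594 GiB capacity namespace
```

With these assumptions, 600 GiB works out to exactly the 1258291200 blocks used in the command above, which suggests the "600GB" cache namespaces were sized in binary units.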
Once the namespaces were created, they showed up in the vSphere hosts as separate storage devices (not partitions). As far as the hosts knew, the namespaces were each separate individual physical storage devices. I was, in effect, tricking vSAN into thinking that I gave it multiple physical drives, when I knew that I was in fact only using a single physical drive.
When your drive is the bottleneck, this is not at all useful, but when you’re using a drive that is able to deliver 850K IOPS and 3.5+GB/s throughput, splitting it into smaller logical drives helps vSAN (or whatever application you’re using) scale performance with increased parallelization. For more information on the power of the Micron 9300 high-performance SSD, visit the product page.
For the demo itself, I wanted a more flexible and configurable way to generate the load than HCIBench, so I created my own VMs and had full control over them. Each host got two VMs with 8 VMDKs per VM. This turned out to be the optimal number of VMs: enough to spread threads over all of the namespaces, yet few enough to keep management easy. I used CentOS 7.6 as the OS and gave each VM 8 vCPUs and 8GB of vRAM with a single network interface. Each VM had its 8 VMDKs attached as regular hard disks (not on NVMe controllers), as testing showed no significant performance difference between the two. I again used the vSAN Default Storage Policy with deduplication and compression disabled.
I created a separate VM to run Windows Server 2019 from which I used the Iometer tool to generate the load against the hosts. An Iometer worker was assigned to each of the VMDKs on each host with 16 and 1 threads (outstanding IO) for 4K random and 128K sequential workloads respectively. The optimal number of threads was determined from simple trial and error. The goal is to maximize IOPS for 4K random reads and maximize throughput (GB/s) in 128K sequential reads, while also maintaining a reasonable latency for each. There is a certain point where your performance will no longer increase, but your latency will, and that is where I stop adding threads.
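To put numbers on the load, the sketch below counts workers and total outstanding IO. It assumes one Iometer worker per VMDK across all four hosts, which is my reading of the description above. The function name `outstanding_io` is hypothetical.

```python
# Count load workers and total outstanding IO for a test configuration.
# Assumption: one worker per VMDK, each with a fixed number of threads.

def outstanding_io(hosts, vms_per_host, vmdks_per_vm, threads_per_vmdk):
    """Return (workers, total outstanding IO) with one worker per VMDK."""
    workers = hosts * vms_per_host * vmdks_per_vm
    return workers, workers * threads_per_vmdk

# Demo cluster: 4 hosts, 2 VMs per host, 8 VMDKs per VM.
random_4k = outstanding_io(4, 2, 8, 16)
sequential_128k = outstanding_io(4, 2, 8, 1)

# Earlier HCIBench study, per node: 4 VMs, 8 VMDKs per VM, 4 threads per VMDK.
hcibench_per_node = outstanding_io(1, 4, 8, 4)

print(random_4k)          # (64, 1024)
print(sequential_128k)    # (64, 64)
print(hcibench_per_node)  # (32, 128)
```

The last line reproduces the 128 outstanding IO quoted for the HCIBench study when the four VMs are counted per node.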
And the winner was… Performance!
Enough technical talk—let’s talk about performance. If I told you that you could get amazing performance with a single drive prior to your knowledge of NVMe, namespaces, and the Micron 9300 SSD, you’d probably call me crazy. However, with this solution, I was able to drive more performance from a single drive than most solutions do with 20+ physical drives… In a much smaller form factor… With higher density… And using less power… Creating less heat… Reducing your TCO…
With this configuration, I was able to push 750K IOPS (4K random reads) and over 11.5GB/s (128K sequential reads)! The throughput was so high, in fact, that I had to add a third NIC to the solution so that vSAN could have two dedicated NICs. I was hitting over 21Gbps average per node, seeing over 25Gbps at times with some nodes. That means I’m driving more performance than is even possible with dual 10GbE NICs, even if you were to give them both to vSAN with no reserve for vMotion, management, or other network functions.
Because the demo was running live in VMworld’s Solutions Exchange (not in a temperature-controlled lab environment), noise was a concern. I reduced the power settings to “Energy Efficient,” which slightly reduced performance, though the cluster still maintained 550K+ IOPS for four days. This was a straight 4K random-read test, which is why the throughput was only about 2GB/s. Had I switched the block size to 128K, you would have seen the throughput jump to about 8GB/s and IOPS drop to about 62K.
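The relationship between IOPS, block size, and throughput in that paragraph is simple multiplication, sketched below. It assumes block sizes are in KiB (1024 bytes) and throughput is in decimal GB/s. The function names are hypothetical.

```python
# Convert between IOPS and throughput for a fixed block size.
# Assumption: block sizes are KiB (1024 bytes), GB/s is decimal (10**9 bytes).

def throughput_gb_s(iops, block_kib):
    """Throughput in GB/s for a given IOPS rate and block size in KiB."""
    return iops * block_kib * 1024 / 1e9

def iops_for(gb_s, block_kib):
    """IOPS needed to sustain gb_s GB/s at a block size of block_kib KiB."""
    return gb_s * 1e9 / (block_kib * 1024)

print(round(throughput_gb_s(550_000, 4), 2))   # 2.25
print(round(throughput_gb_s(62_000, 128), 2))  # 8.13
print(round(iops_for(8, 128)))                 # 61035
```

So 550K IOPS at 4K is a little over 2GB/s, and about 8GB/s at 128K corresponds to roughly 61K to 62K IOPS, in line with the figures above.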
Note that each node ran two VMs, which is why eight separate numbers were reported for each metric. Along with the eight load VMs, I was running a three-node Elasticsearch cluster with Kibana (using Metricbeat running on each of the VMs to collect host metrics), as well as the vCenter Server Appliance, and a Windows server for management. Even with 14 total VMs running in this cluster, these drives delivered this performance with little effort.
The demo gathered quite a bit of interest at VMworld. I gave a few presentations a day on this configuration and was even invited to join a podcast with the guys at Virtually Speaking (@virtspeaking) to talk about the future of vSAN, all-flash, namespaces, networking, endurance, cost, TCO, and more.
We have plans to test this further and come up with more use cases for it. Micron and VMware are in talks and we hope to see it as a supported configuration in the near future. Don’t forget, people raised eyebrows when Micron storage experts showed up at VMworld 2014 with an all-flash vSAN cluster before that was a thing. People thought we were crazy then, too! 😉