Now that the team is back from VMworld, we wanted to share the configuration for our VSAN demo that drew so much attention. This blog is intended to give you a high-level view of how we created the demo and provide some ideas for your own VSAN explorations. We documented other information, such as node configurations, VSAN observer settings, and fio test workloads, but left it out here for brevity's sake. We'd love to hear comments and questions from anyone who wants to know more; post your comments below.
Our primary goal was to demonstrate best-in-class VSAN performance and show how that compared to a standard VSAN configured with SAS HDDs. One of the most interesting aspects of our configuration was that our M500 client SSDs were actually less expensive than the SAS 10K HDDs.
Our cluster consisted of 6 units of the following:
| Qty | Item | Details |
|---|---|---|
| 1 | Dell R620 config (see table below) | 10 drives, dual 10 GbE, dual E5-2697 v2 (12-core, 2.7 GHz) |
| 2 | Micron 1.4TB P420m PCIe SSD | Part# MTFDGAR1T4MAX-1AG1Z |
| 10 | Micron 960GB M500 | Part# MTFDDAK960MAV-1AE12ABYY |
| 24 | Micron 32GB PC3-14900 LRDIMM | Part# MT72JSZS4G72LZ-1G9E2A7 |
| 1 | Lexar 8GB USB drive | ESXi boot device |
To compare our all-SSD configuration to a standard HDD configuration, we used: Seagate Enterprise Performance 1.2TB 10K HDD v7 (Part# ST1200MM0017).
The table below outlines the purchases we made from Dell for the PowerEdge R620 server configuration. Irrelevant or user-optional items (like bezel, power cords, and warranty options) have been omitted:
| Qty | Option | SKU |
|---|---|---|
| 1 | PowerEdge R620: PowerEdge R620, Intel® Xeon® E-26XX Processors | R620IB |
| 1 | Chassis Configuration: Chassis with up to 10 Hard Drives and 3 PCIe Slots | 10H3P |
| 1 | Processor: Intel® Xeon® E5-2697 v2 2.70 GHz, 30M Cache, 8.0 GT/s QPI, Turbo, HT, 12C, 130W, Max Mem 1866 MHz | E52697V |
| 1 | Additional Processor: Intel® Xeon® E5-2697 v2 2.70 GHz, 30M Cache, 8.0 GT/s QPI, Turbo, HT, 12C, 130W | 2E52697 |
| 1 | RAID Configuration: RAID 0 for H710P/H710/H310 (1-10 HDDs) | R0H7H3 |
| 1 | RAID Controller: PERC H710 Integrated RAID Controller, 512MB NV Cache | R0H7H3 |
| 1 | Select Network Adapter: Intel Ethernet X540 DP 10Gb BT + I350 1Gb BT DP Network Daughter Card | X540DC |
| 1 | Power Supply: Dual, Hot-Plug, Redundant Power Supply (1+1), 1100W | RPS1100 |
| 1 | Power Management BIOS Settings: Power-Saving Dell Active Power Controller | DAPC |
The BIOS configuration we used for the VSAN hosts is no different than our standard BIOS settings for ESXi hosts. Because we were benchmarking and trying to measure optimal performance, we used the Performance profile along with the following settings:
| Setting | Value |
|---|---|
| Memory Operating Mode | Optimizer |
| Logical Processor (Hyper-Threading) | Enabled |
| I/OAT DMA Engine | Enabled |
| Memory Mapped I/O Above 4GB | Enabled |
| SR-IOV Global Enable | Enabled |
To optimize storage, we installed ESXi to a USB drive, which freed up a drive bay on the server for a VSAN storage drive. Scratch data was put on an NFS server as recommended for hosts with greater than 512GB of RAM.
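For reference, the scratch location can be pointed at an NFS datastore with a host advanced setting. A minimal sketch, assuming a mounted NFS datastore named `scratch_nfs` and a per-host directory (both names are placeholders for your environment):

```shell
# Point the host's scratch location at a directory on the NFS datastore
# (datastore name and directory are placeholders); takes effect after the next reboot
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/scratch_nfs/esx01
```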
An array of 10 SAS HDDs is capable of less than 5K IOPS in a random workload. Ten of our M500 drives can sustain 100X that performance so, if anything, it is even more important to get the controller configuration right with an all-SSD configuration. The best controller option from Dell, in our opinion, is the H710. (The H310 is reported to have poor performance stemming from an incredibly low queue depth.) The H710 is a RAID controller and lacks a pass-through mode; therefore, we have to create individual RAID 0 volumes—one per physical M500 disk. The RAID 0 volumes should be set with the minimum stripe size (64KB in this case), no read ahead, write-through cache, and with the disk write buffer disabled.
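The per-disk RAID 0 volumes described above can also be scripted from the ESXi shell when the LSI MegaCLI utility is installed (the H710 is an LSI-based controller). A sketch, assuming MegaCLI is present and that the drives sit in enclosure 32, slots 0-9 (placeholders for your system):

```shell
# Create one RAID 0 volume per physical M500 (enclosure:slot IDs are placeholders):
# 64KB stripe, write-through cache, no read ahead, on adapter 0
for slot in 0 1 2 3 4 5 6 7 8 9; do
  MegaCli -CfgLdAdd -r0 "[32:${slot}]" WT NORA -strpsz64 -a0
done
# Disable the disk write buffer on all logical drives
MegaCli -LDSetProp -DisDskCache -LAll -a0
```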
An HBA or controller with pass-through mode requires an extra step to make VSAN see the M500 as an HDD (that is, as a capacity-tier device rather than an SSD); you can find more details on that procedure in VMware's documentation. The claim rule looks like this:
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -M "Micron_M500_MTFD" -o disable_ssd
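After adding the rule, it must be applied to each device before VSAN sees the change. A sketch using standard esxcli subcommands; the `naa.` device ID below is an example placeholder, not a real identifier from our cluster:

```shell
# Confirm the new SATP rule is in place
esxcli storage nmp satp rule list | grep Micron
# Reclaim each M500 device so the rule takes effect (device ID is a placeholder)
esxcli storage core claiming reclaim -d naa.500a07510c9c0000
```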
This configuration uses the P420m, a PCIe SSD, as the SSD cache. Note that, being a PCIe device, the P420m does not connect to the storage controller.
The P420m has an “inbox” driver for ESXi 5.5. We recommend updating the driver and firmware. You can find support releases for the P420m on our website. Toward the bottom of the screen, you can find a link to the Linux/VMware driver support pack. Download the current version. As of this writing, the most recent version is B144.04.00.
First, from the workstation:
scp "B144.04.00_Linux_VMware/VMware Driver/mtip32xx-native-3.8.2-esxi55-cert.zip" root@esx_hostname:/
scp "B144.04.00_Linux_VMware/RealSSD Manager/VMWare/ESX5.5/rssdm" root@esx_hostname:/scratch/rssdm
scp "B144.04.00_Linux_VMware/Unified Image/B144.02.00.ubi" root@esx_hostname:/
Second, from the ESXi shell:
/scratch/rssdm -T /B144.02.00.ubi -n 0 -r
/scratch/rssdm -T /B144.02.00.ubi -n 1 -r
esxcli software vib install -d /mtip32xx-native-3.8.2-esxi55-cert.zip --no-sig-check
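After the firmware update and driver install, the host needs a reboot before the new driver loads; you can then confirm the package is present. A sketch (the grep pattern matches the driver name from the support pack):

```shell
# Verify the mtip32xx driver VIB is installed, then reboot the host
esxcli software vib list | grep mtip32xx
reboot
```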
The following diagram outlines the host servers and the network interconnects between them. Note that while our VSAN operates on a network configuration with distributed switches, the infrastructure hosts do not. We chose this because our infrastructure needs to support more than any single network configuration; in particular, VSAN, Login VSI, and VMmark. We have found that the VSAN setup works best with distributed switches, and we highly recommend them when using clusters of hosts. In this configuration we chose to give the infrastructure access to all network points, which is not required for normal use. We may need to do packet sniffing or run test operations over each network, and having access to every network from the infrastructure gives us flexibility in where we place the VMs that perform the testing.
Disk Group Details
The diagram below outlines the VSAN storage configuration from a single-host perspective. A disk group is the basic unit of storage configuration in VSAN and consists of one caching SSD and between one and seven "data" drives. The VSAN 2014 demo configuration has two 1.4TB P420m cache drives and ten 960GB M500 drives per host, organized into two disk groups, each consisting of one P420m and five M500 drives. The end result is 9.4TB of storage space and a 1.9TB read cache, for a 1:5 cache-to-data ratio. The remaining cache space, 840GB on each host, is used as a write buffer.
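The cache split follows VSAN 5.5's default allocation of 70% of flash capacity to read cache and 30% to write buffer. The arithmetic works out like this (a back-of-the-envelope sketch in decimal GB):

```shell
# Two 1.4TB P420m drives per host
CACHE_GB=$((2 * 1400))
# VSAN 5.5 default split: 70% read cache, 30% write buffer
READ_CACHE_GB=$((CACHE_GB * 70 / 100))    # 1960 GB, i.e. ~1.9TB
WRITE_BUFFER_GB=$((CACHE_GB * 30 / 100))  # 840 GB
echo "read cache: ${READ_CACHE_GB}GB, write buffer: ${WRITE_BUFFER_GB}GB"
```

With 9.6TB of raw M500 capacity (ten 960GB drives), the 1.9TB read cache gives roughly the 1:5 cache-to-data ratio quoted above.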
The illustration further shows how this disk group configuration is viewed by VSAN for a single host and illustrates the data flow into and out of a disk group. Notice that the block allocation unit on the data storage is 1MB. This means the I/O seen by the data storage is a 1MB random read/write workload. For an SSD, this is extremely good news for endurance. Most SSD endurance specifications are measured against 4K random workloads, and VMware's VSAN specifications are based on 8K random workloads; a 4K or 8K random workload is actually the hardest workload for an SSD in terms of endurance. A 1MB random read/write workload, by contrast, makes the drive's internal reclaim (garbage collection) operations far more efficient, which lowers overall write amplification: fewer writes are going on in the background.
Note also that the write buffer is the storage entity most affected by writes that vary in I/O size and rate. Our understanding of the read cache workload is not complete, but we hypothesize that the write workload to the read cache (on cache fill) is relatively consistent at 1MB, the same I/O size as the data storage. We also believe the read cache will see varying read sizes, because even though the allocation unit is 1MB, nothing stops the storage system from issuing smaller block reads and writes to the overall storage subsystem.
VMware’s VSAN software, coupled with this configuration, shows it is possible to build a supercharged virtualization solution that is scalable, easy to deploy, power efficient, dense, and self-contained. The performance differences between this all-SSD solution and the usual hybrid solution are striking when measured in I/O latency and end-application responsiveness. We measured surprising differences in VDI performance using Login VSI to benchmark a Horizon View configuration. We are also working through the process of vetting our results from synthetic benchmarks and sysbench MySQL testing. What we’ve seen looks pretty interesting, and we are excited to share what we’ve found. Let us know what workloads you’re interested in seeing and any questions you have about what we learned.