IT is moving fast. Very fast. So fast there isn’t enough time to keep up, forget getting ahead. If this sounds like your day, I’d like to help (with vSAN planning and testing anyway).
This post offers five tips that can help make your vSAN planning easier, keep your customers happier and get you to bed earlier.
Full disclosure: This blog references an HPE vSAN node we’re using for lab performance analysis and tuning. It does not describe a node being used in production.
1. Set goals
This may seem obvious, but it bears repeating: Before you start configuring your first node, select your first networking option, consider your test and deployment plans or buy your first SSD, you must set goals.
Decide what you want to do and why you want to do it first.
We’ve done a lot of work with vSAN and have released several Reference Architectures (RAs) from our Austin, Texas acceleration research lab. When we designed the HPE DL380 Gen10 lab nodes for a new, HPE-based vSAN RA, we had specific targets in mind:
- AF-6-like configuration with outstanding price/performance balance
- HPE dual-Xeon, rack-mount server
- Approachable price point
- Easy deployment (by using standard HPE storage components)
- Balanced components (right combination of cache and capacity SSDs, DRAM, processors and networking)
- Predictably high performance
OK, that’s more than a few (!) but we were interested in a very targeted design. Something we could use for lab research that would support a broad variety of workloads while still being within a reasonable budget.
2. Select a platform, storage, CPUs, memory and networking
Since we wanted to build an HPE configuration, we chose their popular, Intel-based dual-processor 2U rack mount general purpose platform — HPE DL380’s current Gen10 (we used 868703-B21). While HPE has several models that may also make great vSAN nodes, we felt that the flexibility, density and broad range of options offered with the DL380 Gen10 made it an exceptional choice.
With our goals and platform set, we moved on to selecting other elements of each node. In the interest of full disclosure, we were only thinking of an all-flash (AF) configuration (no surprise there).
As we started considering which SSDs to use and where to use them, we knew a balanced mix of SATA SSDs with different performance, endurance and capacity characteristics would give the broad deployment range and results we wanted. We also knew that since we were building on HPE platforms, it made sense to use HPE qualified SSDs.
We chose high write performance, mixed-use SSDs (a pair of HPE MK000960GWEZK, 960GB MU drives) for the cache tier and eight high capacity, read-intensive SSDs for the capacity tier (HPE VK000GWEZE, 1.92TB RI drives). And since we wanted to drive overall value at a more approachable price point, these are SATA SSDs. This 1:4 cache-to-capacity drive ratio works well for a broad range of applications and workloads (as we found in our prior research lab work).
CPUs, memory and networking
We completed the configuration with a pair of Intel Xeon Gold 6148 CPUs per node along with 12x 32GB DDR4 DIMMs and a dual-port 10GbE/25GbE Ethernet adapter (631FLR-SFP28).
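If you want to sanity-check the per-node totals before ordering parts, a few lines of Python will do it. This is only a sketch: the drive counts, drive sizes and DIMM counts come from the text above, while the variable and function names are illustrative and not from any tool.

```python
# Per-node figures taken from the configuration described above.
# All names here are illustrative (hypothetical), not from any vendor tool.
CACHE_DRIVES = 2          # 960GB mixed-use SATA SSDs
CACHE_DRIVE_GB = 960
CAPACITY_DRIVES = 8       # 1.92TB read-intensive SATA SSDs
CAPACITY_DRIVE_GB = 1920
DIMMS = 12
DIMM_GB = 32


def node_summary():
    """Return raw per-node totals (decimal GB for SSDs, GB for DRAM)."""
    return {
        "cache_gb": CACHE_DRIVES * CACHE_DRIVE_GB,
        "capacity_gb": CAPACITY_DRIVES * CAPACITY_DRIVE_GB,
        "dram_gb": DIMMS * DIMM_GB,
        "capacity_drives_per_cache_drive": CAPACITY_DRIVES // CACHE_DRIVES,
    }


if __name__ == "__main__":
    for key, value in node_summary().items():
        print(f"{key}: {value}")
```

These are raw numbers only. Usable capacity depends on the protection and space-efficiency settings you choose.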
Optimal tuning may be the most challenging goal of all. Since we have done quite a bit of work with vSAN and have released several Reference Architectures (RAs), we may be a bit ahead of the game here. Browse our posted RAs for more details and suggestions.
vSAN’s default tunings are configured to be safe for all users. When doing heavy write tests, a disk group can quickly run out of memory and run into memory congestion, decreasing performance. To overcome this, we followed VMware’s performance document to alter three advanced configuration parameters.
The table below shows the default value and the value we used (these values suited the lab work we are doing; your optimal values may differ):
Note: Even with these performance tunings, vSAN occasionally experiences various forms of congestion. Congestion appears to occur more often during runs with a high write percentage. This article on VMware’s website discusses congestion in detail.
4. Use Standard Benchmarks and Monitoring Tools, be Consistent
Benchmarking virtualization can be tough because of the many different system components that can affect results. Since we focused on vSAN’s storage components and their ability to deliver a large number of transactions at a low latency, we relied heavily on using synthetic benchmarking to gauge storage performance. If your focus is evaluating other components, another benchmark may apply.
To ensure that all storage is used, we distributed worker threads evenly amongst all nodes and all drives with four VMs on each node each with eight VMDKs (each either 6GB or 128GB, depending on whether it is a cache or capacity test).
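To get a feel for how those VMDK sizes translate into a working set, here is a minimal sketch. The VM count, VMDK count and VMDK sizes come from the paragraph above. The names are illustrative, and the node count is a parameter you fill in for your own cluster.

```python
# Layout from the text: four VMs per node, eight VMDKs per VM.
# VMDK size depends on the test type. Names are illustrative (hypothetical).
VMS_PER_NODE = 4
VMDKS_PER_VM = 8
VMDK_GB = {"cache": 6, "capacity": 128}


def working_set_gb(test, nodes=1):
    """Total VMDK footprint in GB for a 'cache' or 'capacity' test."""
    return nodes * VMS_PER_NODE * VMDKS_PER_VM * VMDK_GB[test]


if __name__ == "__main__":
    for test in VMDK_GB:
        print(f"{test} test: {working_set_gb(test)} GB per node")
```

Per node, that works out to 192GB for the cache test and 4,096GB for the capacity test, before any data protection overhead.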
Be sure to precondition with HCIBench using a 128K sequential write test that runs long enough to write the entire dataset twice. The second pass ensures that the SSDs’ over-provisioning area is written, so the VMDKs hold readable data for all LBAs instead of simply all zeros. This is particularly important for checksumming, because it ensures that the checksum is always calculated on non-zero data.
Additionally, we set object space reservation (OSR) to 100% for all tests except the density profile, and we left stripe width at the default value of 1. This ensures that data is spread physically across the entire usable space of each SSD, instead of potentially lying in only a subset of them.
When benchmarking storage, results must be consistent and repeatable, which means every test is run the same way, under the same conditions. Each test must start in the same state, which is why we selected the “clear read/write cache before testing” option in HCIBench.
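How long is "sufficiently long"? A rough estimate is simple arithmetic, sketched below. The function name and the throughput figure are assumptions for illustration only: plug in the sustained sequential write rate you actually observe. The estimate covers only the front-end dataset and ignores any extra back-end writes from data protection.

```python
def precondition_hours(dataset_gb, write_mb_per_s, passes=2):
    """Estimate hours needed to write the dataset `passes` times.

    dataset_gb     -- total VMDK footprint in decimal GB
    write_mb_per_s -- assumed sustained sequential write rate in MB/s
                      (hypothetical input; measure it on your own cluster)
    """
    if write_mb_per_s <= 0:
        raise ValueError("write_mb_per_s must be positive")
    seconds = passes * dataset_gb * 1000 / write_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # 4096 GB dataset at an assumed 1000 MB/s, written twice
    print(f"{precondition_hours(4096, 1000):.2f} hours")
```

Treat the result as a lower bound and confirm from the benchmark output that two full passes actually completed.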
We also allowed each test to reach steady state performance (see the SNIA’s Performance Test Specification v2.01 section 3.1 for the definition of steady state) before we measured performance. For all tests, the time to reach steady state was approximately two hours.
Starting each test from an identical state, preconditioning and measuring steady state performance are essential for AF and hybrid (HY) vSAN configurations.
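If you log IOPS at regular intervals, a steady state check can be automated. The sketch below is my simplified paraphrase of the SNIA-style criteria: over a window of five samples, the spread between the highest and lowest value stays within 20% of the average, and the drift of a best-fit line across the window stays within 10% of the average. The function name, defaults and sample values are assumptions. Consult the specification itself for the authoritative definition.

```python
def is_steady_state(samples, window=5, max_range=0.20, max_slope=0.10):
    """Return True if the last `window` samples look steady.

    A simplified check: data excursion (max - min) and the excursion of a
    least-squares line across the window are both compared to the window
    average. Thresholds are assumptions paraphrasing the SNIA criteria.
    """
    if len(samples) < window:
        return False
    y = samples[-window:]
    mean = sum(y) / window
    if mean <= 0:
        return False
    data_excursion = max(y) - min(y)
    # Least-squares slope with x = 0, 1, ..., window - 1
    x_mean = (window - 1) / 2
    sxx = sum((i - x_mean) ** 2 for i in range(window))
    sxy = sum((i - x_mean) * (v - mean) for i, v in enumerate(y))
    slope_excursion = abs(sxy / sxx) * (window - 1)
    return (data_excursion <= max_range * mean
            and slope_excursion <= max_slope * mean)


if __name__ == "__main__":
    print(is_steady_state([100, 101, 99, 100, 100]))   # flat series
    print(is_steady_state([100, 120, 140, 160, 180]))  # still climbing
```

A check like this is a convenience for long runs. It does not replace looking at the full time series.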
We tested with HCIBench.
For each configuration, we tested five different workload profiles, all generating 4K random read/write mixtures. Since read and write performance differ drastically, we ran a sweep across different read/write mixes — 0/100, 30/70, 50/50, 70/30, and 100/0. This allows inferring approximate performance based on the deployment’s specific read/write mixture goals.
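One simple way to do that inference is linear interpolation between the two nearest measured mixes, sketched below. The function name is illustrative, and the numbers in the example are placeholders and not our results: substitute your own measurements. Real behavior between points may not be linear, so treat the output as a rough estimate.

```python
def estimate_iops(read_pct, measured):
    """Linearly interpolate IOPS for a read percentage between measured mixes.

    measured -- dict mapping read percentage (0-100) to measured IOPS
    """
    points = sorted(measured.items())
    if not points[0][0] <= read_pct <= points[-1][0]:
        raise ValueError("read_pct is outside the measured range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= read_pct <= x1:
            return y0 + (y1 - y0) * (read_pct - x0) / (x1 - x0)
    return float(points[0][1])  # only one measured point


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only; these are not test results.
    placeholder = {0: 100_000, 30: 130_000, 50: 160_000,
                   70: 200_000, 100: 300_000}
    print(estimate_iops(60, placeholder))
```

The five keys mirror the read percentages in the sweep (0, 30, 50, 70 and 100).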
We also used a separate vSAN cluster for all infrastructure services, such as HCIBench, DNS and routing. An additional virtual network was created on a separate VLAN (115), and the HCIBench VM’s virtual NIC was assigned to this network to ensure that it could not send unwarranted traffic.
We tested two dataset sizes to show the difference in performance between a working set that fits 100% in the cache tier (“cache test”) and one that is too large to fit fully in cache (“capacity test”).
5. Evaluate Results with an Unbiased View
We found some interesting results during testing.
When we had a relatively small working set size — meaning it fits entirely (or mostly) in the cache tier — there was very little downside to utilizing RAID-5/6 with deduplication and compression, especially when the workload is mostly writes.
If increased usable capacity is your primary goal, the density profile can potentially give you much more usable capacity, thanks mostly to deduplication and compression. If your workload is highly compressible, using deduplication and compression is strongly recommended. If your dataset is not very compressible, you may be better off not using deduplication and compression, as we saw a small performance penalty with little or no capacity benefit.
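To see why compressibility matters so much, here is a back-of-the-envelope estimate. It assumes the commonly cited vSAN protection overheads (2x for RAID-1 with one failure tolerated, 1.33x for RAID-5, 1.5x for RAID-6) and takes the data reduction ratio as an input you supply. The function name and the example ratio are assumptions, and the sketch ignores slack space and metadata overhead.

```python
# Capacity consumed per unit of data stored, by protection scheme.
# Commonly cited vSAN overheads; treat them as assumptions for this sketch.
PROTECTION_OVERHEAD = {
    "RAID-1 (FTT=1)": 2.0,
    "RAID-5": 4 / 3,
    "RAID-6": 1.5,
}


def effective_capacity_tb(raw_tb, protection, reduction_ratio=1.0):
    """Estimate effective capacity in TB.

    reduction_ratio -- assumed dedup + compression ratio (1.0 = no savings);
                       it is entirely workload dependent.
    """
    return raw_tb / PROTECTION_OVERHEAD[protection] * reduction_ratio


if __name__ == "__main__":
    raw = 15.36  # per-node capacity tier: 8 x 1.92TB
    for scheme in PROTECTION_OVERHEAD:
        plain = effective_capacity_tb(raw, scheme)
        reduced = effective_capacity_tb(raw, scheme, reduction_ratio=2.0)
        print(f"{scheme}: {plain:.2f} TB, {reduced:.2f} TB at an assumed 2:1")
```

With a ratio near 1.0, the estimate shows no capacity gain to offset the performance cost, which matches what we saw with data that does not compress well.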
The density profile had some unexpected behavior on capacity tests with large amounts of reads. We saw that as congestion was introduced, IOPS decreased and latency increased. The process was also cyclical: congestion increases as the log space fills up and returns to zero once the log space frees. This congestion occurs when the internal log in the cache tier of vSAN’s Local Log Structured Object Manager (LSOM) runs out of space. This log is fixed in size, and when metadata is written faster than it can be purged, the log runs out of space. (Log congestion is introduced to slow down write transactions.)
We have several other Reference Architectures available for download including other vSAN configurations, Red Hat Ceph Storage and Excelero shared NVMe. I’m interested in how you deploy vSAN and Ceph Storage and if/how you are moving to shared NVMe. I’d also like to hear from those of you who have used our RAs.