In case you haven’t already heard, Micron recently released its heterogeneous-memory storage engine (HSE) to the open source community. Our design focuses on providing a solution that makes storage class memory (SCM) and SSDs more performant, with increased effective SSD lifespan through reduced write amplification, all while being deployed at massive scale. When compared to legacy storage engines, HSE often benefits workloads like Yahoo! Cloud Serving Benchmark (YCSB) many times over.
What is a “Heterogeneous Memory” Storage Engine (HSE)?
Why heterogeneous? Micron has an extensive portfolio of DRAM, SCM and SSDs that gives us the insight and expertise to build a storage engine that intelligently manages data placement across disparate memory and storage media types. Unlike traditional storage engines that were written for hard disk drives, HSE was designed from the ground up to exploit the high throughput and low latency of SCM and SSDs.
HSE uses the advantages of discrete media types to support two media classes for data storage: a “staging” media class and a “capacity” media class. The staging media class is typically configured to run on high-performance (IOPS and/or MB/s), low-latency, high-write-endurance media (for example, SCM or data center class SSDs with NVMe™). The capacity media class is typically configured to run on lower cost, lower write endurance media (like quad-level cell [QLC] SSDs). Data intended for hot, short-term access is allocated to the staging media class, while cold, long-lived data is allocated to the capacity media class. This enables HSE to achieve high throughput and low latency while also conserving write cycles on lower endurance media.
Configurable Durability Layer
The HSE durability layer is a user-configurable logical construct that resides on the staging media class. The durability layer provides user-definable data persistence in which the user specifies an upper bound on how many milliseconds of data may be lost in the event of a system failure, like power loss.
Data is initially ingested from DRAM into the durability layer. Storage is allocated from the faster staging media class to meet the low-latency, high-throughput requirements of the durability layer. Unlike a traditional write-ahead log (WAL), this durability layer avoids the “double write problem” common with classic journaling to significantly reduce write amplification.
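To make the idea of a bounded data-loss window concrete, here is a minimal Python sketch. It is not HSE's implementation or API: the class name `DurabilityBuffer`, the `max_loss_ms` parameter, the `persist` callback (standing in for a write to the staging media class) and the injectable clock are all hypothetical. The sketch buffers writes in memory and persists them once the oldest unpersisted write has waited as long as the configured bound.

```python
import time
from typing import Callable, List, Optional, Tuple

KV = Tuple[bytes, bytes]


class DurabilityBuffer:
    """Toy model of a bounded data-loss window (hypothetical, not the HSE API)."""

    def __init__(self, max_loss_ms: float,
                 persist: Callable[[List[KV]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_loss_s = max_loss_ms / 1000.0
        self.persist = persist            # stands in for a write to staging media
        self.clock = clock                # injectable so the bound can be tested
        self.pending: List[KV] = []       # writes that would be lost on power failure
        self.oldest: Optional[float] = None

    def put(self, key: bytes, value: bytes) -> None:
        """Buffer a write and flush if the loss bound has been reached."""
        if self.oldest is None:
            self.oldest = self.clock()
        self.pending.append((key, value))
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush once the oldest unpersisted write has waited max_loss_ms.

        A real engine would also drive this from a timer; here the caller does.
        """
        if self.oldest is not None and self.clock() - self.oldest >= self.max_loss_s:
            self.flush()

    def flush(self) -> None:
        """Persist everything buffered so far and reset the window."""
        if self.pending:
            self.persist(self.pending)
        self.pending = []
        self.oldest = None
```

A smaller `max_loss_ms` narrows the window of data at risk at the cost of more frequent, smaller writes to the staging device.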
As stored data ages, the data migrates through multiple layers of the system and is rewritten as part of garbage collection to optimize query performance (completion time). Here’s the high-level process:
- When new data needs to be stored, it is first written in the durability layer.
- As the data ages, it is rewritten to the capacity media class as a background maintenance operation.
- As new data arrives, that new data may render existing data obsolete (by updating or deleting records that were previously written). Maintenance operations periodically scan existing data to enable space reclamation. If a large part of the data is now invalid or obsolete, these operations reclaim space by rewriting just the data that is still valid, freeing up all the space the old data occupied (i.e., garbage collection). To service queries efficiently, valid data is also arranged so that it can be scanned easily.
- Valid data is reorganized into tiers for faster query processing. Key and value data are isolated into separate streams throughout this process, and keys are written to the staging media class to facilitate faster lookups. Eventually, older data at the bottom tier is written to the designated capacity media class devices.
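The space-reclamation step in the list above can be sketched in a few lines of Python. This is an illustration of the general technique (rewrite only the newest live record per key, then discard the old segments), not HSE's actual maintenance algorithm. The `compact` function, the `Record` layout and the use of `None` as a delete marker are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# (key, sequence number, value); a value of None marks a delete (hypothetical layout)
Record = Tuple[str, int, Optional[str]]


def compact(segments: List[List[Record]]) -> List[Record]:
    """Merge segments, keeping only the newest live record per key, sorted by key."""
    newest: Dict[str, Record] = {}
    for segment in segments:
        for key, seq, value in segment:
            current = newest.get(key)
            if current is None or seq > current[1]:
                newest[key] = (key, seq, value)
    # Drop delete markers: once older versions are gone, a deleted key needs no space.
    live = [rec for rec in newest.values() if rec[2] is not None]
    # Keys are unique at this point, so sorting orders the output by key for easy scans.
    return sorted(live)


old_segment = [("a", 1, "x"), ("b", 2, "y"), ("c", 3, "z")]
new_segment = [("a", 4, "x2"), ("b", 5, None)]   # update "a", delete "b"
print(compact([old_segment, new_segment]))
# [('a', 4, 'x2'), ('c', 3, 'z')]
```

Five stored records shrink to two after the rewrite, and the space held by both input segments can then be released.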
As queries are serviced and data is read from both media classes, indexes are page-cached into DRAM. An LRU (least recently used) algorithm dynamically ranks indexes to facilitate index tracking, pinning the hottest indexes (i.e., the most frequently accessed) in memory, assuming system DRAM is available.
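For readers unfamiliar with LRU caching, the following Python sketch shows the basic mechanism using the standard library's `OrderedDict`. It is a generic illustration rather than HSE's page-cache code. The `LRUIndexCache` class, its `capacity` (a count of pages standing in for available DRAM) and the `load` callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class LRUIndexCache:
    """Generic LRU cache for index pages (illustrative, not HSE's implementation)."""

    def __init__(self, capacity: int, load: Callable[[Hashable], bytes]) -> None:
        self.capacity = capacity          # how many pages fit in the DRAM budget
        self.load = load                  # reads a page from staging or capacity media
        self.pages: "OrderedDict[Hashable, bytes]" = OrderedDict()

    def get(self, page_id: Hashable) -> bytes:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # hit: mark as most recently used
            return self.pages[page_id]
        page = self.load(page_id)             # miss: fetch from media
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        return page
```

Pages that keep getting requested stay at the recently used end of the ordering and are never the eviction candidate, which is how frequently accessed indexes end up resident in memory.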
Media Class Performance
Our test setup used one Micron 9300 SSD with NVMe™ as the staging media class device and four Micron 5210 SATA QLC SSDs as the capacity media class devices. We used the Yahoo!® Cloud Serving Benchmark (YCSB) to compare operations per second and 99.9th-percentile tail latencies across two runs:
- First run: Four Micron 5210 QLC SSDs configured as capacity media class devices
- Second run: Four 5210 SSDs configured as capacity media class devices and one Micron 9300 SSD with NVMe as a staging media class device
We ran YCSB workloads A, B, C, D and F with the same thread counts for both configurations (see footnote 1). Table 1 summarizes the YCSB workload mixes, with application examples taken from the YCSB documentation. Tables 2 through 4 share other test details regarding hardware, software and benchmark configurations.
Table 1: Workloads
|YCSB Workload||I/O Operations
|Session store recording user-session activity
||User profile cache
|User status updates
|User database or recording user activity
1 Workload E was not tested because it is not universally supported.
Table 2: Hardware Details
|Server Platforms||Server Platform Intel® based (dual-socket)|
||2x Intel E5-2690 v4
||Staging class media: 1x Micron 9300 SSD with NVMe; capacity class media: 4x Micron 5210 7.68TB SATA SSDs|
|Capacity Class Media Configuration||LVM striped logical volume|
Table 3: Software Details
|Operating System||Red Hat Enterprise Linux 8.1
Table 4: Benchmark (see footnote 2)
|YCSB Benchmark Configuration|
|Dataset||2TB (2 billion 1,000-byte records)|
||2 billion per workload
2 Different configurations may show different results.
YCSB starts by loading the database. This is a 100% insert workload. Adding a 9300 to the mix reduces the time taken to load the 2TB database by a factor of four.
Figure 1 shows throughput for the load phase and run phase of the five YCSB workloads. For write-intensive workloads like Workload A (50% updates) and Workload F (50% read-modify-write), adding a Micron 9300 as the staging media class device increases the overall throughput 2.3 and 2.1 times, respectively. Workloads B and D (5% updates/inserts) show more modest improvements in throughput because 95% of the operations in these workloads are reads, served almost entirely from the 5210 SSDs that make up the capacity media class.
Figure 1: Throughput by YCSB Workload
Figure 2 shows the 99.9% read (tail) latencies. The read tail latencies for all workloads are considerably improved (2 to 3 times) after adding the 9300 (except for Workload C, which is 100% reads). Recall that newly arrived writes are first absorbed by the 9300 and gradually written in the background to the 5210s as the data ages. Key data (indexes) are written to the 9300, making lookups faster in the second configuration. A fraction of the reads are serviced by the 9300 instead of the 5210s (depending on the query distribution and age of the data being read).
Additionally, by reducing the number of writes to the 5210s, even the reads that are serviced by the 5210s suffer less contention from ongoing writes, so tail read latencies are lower. The insert/update latencies are not pictured as they are similar in the two configurations during the run phase.
Figure 2: Latencies by YCSB Workload
Finally, we measured the amount of data written to the 5210s in the course of executing each workload. Adding a 9300 as a staging media class device reduces the number of bytes written to the 5210s, preserving write cycles and extending the 5210s’ write lifespan. During the load (insert-only) phase, the number of bytes written to the 5210s is reduced by a factor of 2.4, as seen in Table 5.
Table 5: Write Reduction
|Configuration||4x 5210||9300 + 4x 5210
|GB written to 5210s (capacity media)
|GB written to 9300 (staging media)
Figure 3 shows the total number of gigabytes written during the run phase of the YCSB workloads. Note that this includes both user and background writes. Every workload except Workload C (100% read) shows at least a twofold reduction in the total number of bytes written to the 5210s when one 9300 is added to the configuration.
Figure 3: Reductions in Data Written
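To see why fewer bytes written translates into longer drive life, consider a back-of-the-envelope calculation in Python. Only the 2.4x load-phase reduction comes from our measurements. The rated endurance and the daily write volume below are hypothetical placeholders, not Micron 5210 specifications or measured values, and the model simply assumes the drive wears out once a fixed number of terabytes has been written.

```python
def endurance_years(rated_tbw: float, tb_written_per_day: float) -> float:
    """Years until the rated terabytes-written budget is used up at a steady write rate."""
    return rated_tbw / tb_written_per_day / 365.0


RATED_TBW = 1000.0          # hypothetical endurance rating, in TB written
BASELINE_TB_PER_DAY = 1.2   # hypothetical daily writes to capacity media without staging
WRITE_REDUCTION = 2.4       # load-phase reduction factor reported above

baseline = endurance_years(RATED_TBW, BASELINE_TB_PER_DAY)
with_staging = endurance_years(RATED_TBW, BASELINE_TB_PER_DAY / WRITE_REDUCTION)

print(f"baseline:     {baseline:.2f} years")
print(f"with staging: {with_staging:.2f} years")
print(f"improvement:  {with_staging / baseline:.1f}x")
```

Under this simple model the lifespan gain equals the write-reduction factor, whatever placeholder values are chosen for the rating and the daily write volume.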
As part of future work, we are looking to broaden specific aspects of the HSE API to enhance its use, like custom media class policies that give the application more control. For example, if the application creates a key-value store (KVS, the equivalent of a table in a relational database) that will be used only for indexing, it can specify that the particular KVS should use a staging media class to speed up lookups. If the size of the indexing KVS grows too large to be accommodated on the staging device, the application can specify a policy that uses staging media but falls back to capacity media. We may also introduce predefined media class policy templates and extend the HSE API to allow an application to use them based on its needs. Be sure to stay in touch for potential developments.
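As a rough illustration of what a "staging with fallback to capacity" policy could look like from the application's point of view, here is a small Python sketch. Nothing in it is part of the HSE API, which has not been extended in this way yet. `MediaClass`, `MediaClassPolicy`, `place` and the free-space bookkeeping are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class MediaClass(Enum):
    STAGING = "staging"
    CAPACITY = "capacity"


@dataclass(frozen=True)
class MediaClassPolicy:
    """Hypothetical per-KVS policy: a preferred media class and an optional fallback."""
    preferred: MediaClass
    fallback: Optional[MediaClass] = None


def place(policy: MediaClassPolicy, size: int, free: Dict[MediaClass, int]) -> MediaClass:
    """Pick a media class for `size` bytes, trying the preferred class first."""
    for media_class in (policy.preferred, policy.fallback):
        if media_class is not None and free[media_class] >= size:
            free[media_class] -= size     # account for the space just allocated
            return media_class
    raise RuntimeError("no media class in the policy has enough free space")


# An index-only KVS prefers staging media but may spill to capacity media.
index_policy = MediaClassPolicy(MediaClass.STAGING, MediaClass.CAPACITY)
free_bytes = {MediaClass.STAGING: 100, MediaClass.CAPACITY: 1000}
print(place(index_policy, 80, free_bytes))   # MediaClass.STAGING
print(place(index_policy, 80, free_bytes))   # MediaClass.CAPACITY
```

A policy without a fallback would instead fail the second allocation, which is the trade-off an application would choose when lookup latency matters more than capacity.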