NVMe™ over TCP proof of concept

Ryan Meredith | March 2020

Proof of concept: Lightbits Labs™ Apache Cassandra® performance using Micron 9300

Today’s data centers are consolidating their high-performance solid-state drives (SSDs) on the fast, scalable, and power-efficient NVMe™ protocol. When capacity demands required a pool of SSDs, NVMe over Fabrics (NVMe-oF™) was introduced to solve the problem of disaggregated storage by supporting a rack-scale remote pool of NVMe SSDs whose capacity could be flexibly assigned to specific applications. The challenge then became how to maximize the performance of, and investment in, NVMe by spreading that performance beyond a single server and across an entire data center.

As a contributor to the NVMe standard, Micron investigates and shares our experiences in the software-defined storage space, specifically in disaggregated storage solutions. One of the biggest challenges of deploying earlier NVMe-oF implementations has been the administration and configuration of the RDMA (remote direct memory access) fabrics.

I lead the engineering team at Micron’s Austin-based solutions lab, where my team gets to experiment with new technologies from leaders in the industry. We recently tested an NVMe over TCP (NVMe/TCP) solution from Lightbits Labs™. NVMe/TCP is a newer transport for NVMe-oF that avoids the RDMA complexity.

This was interesting for Micron because it is not trivial to enable RDMA on switches, a situation that makes network administration complex. Using standard TCP for the NVMe transport means that there are no special functions or settings required on the switch and that the solution is easier to deploy and use, as in Figure 1. The trade-off for this simplification is additional latency introduced by the TCP protocol stack versus the RDMA protocol (which uses native Ethernet).
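To make the deployment point concrete, here is a minimal sketch of the host-side step that attaches a remote NVMe/TCP volume using the standard nvme-cli tool. The helper name is hypothetical, and the address, port, and NQN are placeholders assumed for illustration; they are not the values used in this test.

```python
import shlex
from typing import List


def build_nvme_tcp_connect(traddr: str, nqn: str, trsvcid: int = 4420) -> List[str]:
    """Return the nvme-cli argv that connects a host to an NVMe/TCP subsystem.

    Nothing is executed here; the caller decides whether to run the command.
    """
    return [
        "nvme", "connect",
        "--transport", "tcp",        # plain TCP, so no RDMA fabric setup on the switch
        "--traddr", traddr,          # IP address of the storage server (placeholder)
        "--trsvcid", str(trsvcid),   # TCP port of the target (assumed: 4420)
        "--nqn", nqn,                # NVMe Qualified Name of the volume (placeholder)
    ]


# Placeholder address from the documentation range and an invented NQN.
cmd = build_nvme_tcp_connect("192.0.2.10", "nqn.2020-03.example:cassandra-vol1")
print(shlex.join(cmd))
```

Once connected, the remote volume appears to the host as an ordinary NVMe block device, which is why the application server needs no further changes.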

The success of any networked solution depends on how it compares to a similar configuration using local storage residing in the application server. Our goal for this test was to determine what overhead (if any) the Lightbits LightOS NVMe/TCP solution introduces into cloud workloads versus local NVMe devices.

graphical illustration of TCP protocol stack used for NVMe transport

Looking at Lightbits for NVMe/TCP

The Lightbits storage solution combines its LightOS® advanced storage software with the Lightfield™ Storage Acceleration PCIe add-in card, installed on standard x86 servers from a variety of OEMs. The solution is responsible for compression, offloading back-end operations, and managing a global flash translation layer, or “Global FTL.”

A Lightbits storage solution consists of a set of one or more nodes hosting NVMe SSDs for data storage along with advanced memory and optional nonvolatile dual in-line memory modules (NVDIMMs) for caching. That’s a lot of Micron hardware! Our goal was to test how Lightbits took full advantage of all that Micron goodness.

Local direct-attached PCIe storage is typically faster than networked storage, but how much can protocols affect latency? For this test, we compared the behavior of a common application using local NVMe drives versus that same application using volumes presented over TCP from a Lightbits LightOS storage server. In this instance, we tested performance and latency on Apache® Cassandra® using the Yahoo!® Cloud Serving Benchmark (YCSB).

Specifics of Our Test Configuration

Our proof of concept test used two test configurations. Configuration 1 included four stand-alone Cassandra servers using remote storage from Lightbits LightOS.

Lightbits LightOS organizes each server’s CPU complexes as separate “storage nodes.” A two-socket server can host one storage node per CPU complex. We tested with one storage node assigned to CPU socket 0.

The storage node hosted eight high-performance Micron 9300 3.84TB NVMe SSDs. We created four 4.9TB logical volumes from the Lightbits storage node and assigned one volume to each Cassandra database server as shown in Figure 2.

graphical illustration of Cassandra test configuration using LightBits LightOS storage server

Four load generation servers, connected to the network using 50Gb Ethernet adapters, ran the YCSB workload A.

Configuration 2 included four stand-alone Cassandra servers with local NVMe. Each Cassandra server hosted two Micron 9300 3.84TB NVMe SSDs to store data locally and connect to a 100Gb Ethernet network. Four load generation servers, connected to the network using 50GbE, ran the YCSB workload A.

Figure 3 below illustrates the test configuration.

graphical illustration of Cassandra test configuration using local NVMe

For both test configurations, the operations per second (OPS) results shown are the aggregate of the four Cassandra servers. All latency measurements documented were an average across the four Cassandra servers.
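The reporting rule above can be sketched in a few lines: OPS are summed across servers, while latency is averaged. The per-server numbers below are invented purely to exercise the function and are not measurements from this test. The sketch also assumes a plain, unweighted mean of the per-server averages.

```python
from statistics import mean
from typing import Dict, List


def summarize(per_server: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-server results into the figures reported for a configuration."""
    return {
        # Throughput is additive: total OPS is the sum over all servers.
        "total_ops": sum(s["ops"] for s in per_server),
        # Latency is not additive: report the mean of the per-server values.
        "avg_read_latency_ms": mean(s["read_latency_ms"] for s in per_server),
    }


# Hypothetical results for four Cassandra servers (illustrative values only).
sample = [
    {"ops": 60000.0, "read_latency_ms": 1.0},
    {"ops": 59000.0, "read_latency_ms": 2.0},
    {"ops": 61000.0, "read_latency_ms": 3.0},
    {"ops": 60000.0, "read_latency_ms": 2.0},
]
summary = summarize(sample)
print(summary)
```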

The servers used in the two tests were configured as follows:

Cassandra database servers (four)

  • Two Intel Xeon Platinum 8168 (24 cores @ 2.7GHz)
  • 384GB memory
  • 100Gbps Mellanox ConnectX-4
  • Datastax Cassandra v3.0.9
  • Two 3.84TB Micron 9300 Max NVMe drives
    • Configured as LVM-striped volume for local storage test
    • Used for direct-attached test comparison only

Lightbits storage server (one)

  • Two Intel Xeon Platinum 8168 (24 cores @ 2.7GHz)
  • 768GB memory (128GB NVDIMM, 640GB PC2666)
  • 100Gbps Mellanox ConnectX-5
  • Eight 3.84TB Micron 9300 Max NVMe drives
  • Lightbits version 1.2.3

Load generation servers (four)

  • Two Intel Xeon E5-2690v4 (14 cores @ 2.6GHz)
  • 256GB memory
  • 50Gbps Mellanox ConnectX-4
  • YCSB v0.16.0 with Cassandra compressible data support

We focused on the performance difference using the same aggregate number of SSDs for both local (four Cassandra servers with two SSDs each) and remote (one Lightbits LightOS node with eight SSDs) by measuring average and tail latencies at various fixed database transactional loads.
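As a quick check of the "same aggregate number of SSDs" claim, the sketch below models both layouts and compares drive counts and raw capacity. The class and field names are hypothetical. Capacities are in decimal terabytes and ignore formatting overhead, striping metadata, and compression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLayout:
    name: str
    servers: int            # servers that physically host the SSDs
    ssds_per_server: int
    ssd_tb: float = 3.84    # capacity per Micron 9300 SSD, as listed above

    @property
    def total_ssds(self) -> int:
        return self.servers * self.ssds_per_server

    @property
    def raw_tb(self) -> float:
        return self.total_ssds * self.ssd_tb


local = StorageLayout("local NVMe", servers=4, ssds_per_server=2)
remote = StorageLayout("Lightbits LightOS", servers=1, ssds_per_server=8)

for layout in (local, remote):
    print(f"{layout.name}: {layout.total_ssds} SSDs, {layout.raw_tb:.2f} TB raw")
```

Both layouts come out to eight drives, so any performance difference reflects where the drives sit and how they are accessed, not how many there are.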

All tests were run using YCSB Workload A, which is a 50% read and 50% update workload. We also changed the data distribution from “Zipfian” to “uniform” to get a larger dataset in use during testing. This increased the utilization of storage over data cached in memory. We ran a series of tests where YCSB throttled performance to a fixed number of OPS, and we measured average latency and 99.9% quality of service (QoS) latency. Each Cassandra database under test was 1.39TB in size.
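A run like the one described could be launched roughly as sketched below. The helper is hypothetical and only builds the command line. It assumes the stock `workloads/workloada` file and the `cassandra-cql` binding that ship with YCSB. The host name is a placeholder, and settings the article does not give (thread count, record count, operation count) are deliberately left out.

```python
from typing import List, Optional


def build_ycsb_run(hosts: str, target_ops: Optional[int] = None) -> List[str]:
    """Return a YCSB command line for Workload A with a uniform distribution."""
    cmd = [
        "bin/ycsb", "run", "cassandra-cql",
        "-P", "workloads/workloada",           # stock Workload A: 50% reads, 50% updates
        "-p", "requestdistribution=uniform",   # override the Zipfian default
        "-p", f"hosts={hosts}",                # Cassandra contact point (placeholder)
    ]
    if target_ops is not None:
        # -target asks YCSB to throttle the run to a fixed number of ops/sec.
        cmd += ["-target", str(target_ops)]
    return cmd


throttled = build_ycsb_run("cassandra-1.example", target_ops=60000)
unthrottled = build_ycsb_run("cassandra-1.example")
print(" ".join(throttled))
print(" ".join(unthrottled))
```

Leaving out `-target` gives the unthrottled case, where YCSB issues operations as fast as the database can complete them.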

For Cassandra with local NVMe tests, we used software compression, which typically introduces overhead and has a direct effect on performance. We used local server LVM (logical volume manager) striping for the local tests.

For the remote NVMe test, we disabled software compression in Cassandra and enabled compression using the Lightfield storage acceleration add-in card in the Lightbits storage server. We used disk striping functionality provided by the Lightbits software for the remote storage tests. Lightbits also supports redundant array of independent drives (RAID) and erasure codes for logical volumes. These features are something we may address in future testing.
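The compression difference between the two setups could be expressed on the Cassandra side roughly as follows. This is a sketch under stated assumptions: the article names neither the compressor nor the schema, so the code assumes Cassandra's default `LZ4Compressor` and YCSB's conventional `ycsb.usertable` table. The helper only builds the CQL statement and does not run it.

```python
def compression_cql(enabled: bool, keyspace: str = "ycsb", table: str = "usertable") -> str:
    """Build a CQL statement that turns Cassandra software compression on or off."""
    if enabled:
        # Assumed compressor: LZ4 is Cassandra's default, not confirmed by the article.
        options = "{'class': 'LZ4Compressor'}"
    else:
        options = "{'enabled': 'false'}"
    return f"ALTER TABLE {keyspace}.{table} WITH compression = {options};"


# Local NVMe test: Cassandra compresses in software.
local_stmt = compression_cql(enabled=True)
# Remote test: Cassandra compression off; the Lightfield card compresses instead.
remote_stmt = compression_cql(enabled=False)
print(local_stmt)
print(remote_stmt)
```

A change like this applies to newly written SSTables, so in practice it would be set before the dataset is loaded.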

Test Results Show Interesting Latency Behavior

As we scaled up the number of database operations per second, we saw a strong correlation between average read latencies on the Lightbits and local storage configurations. At a throttled limit of 60,000 OPS per server, the two configurations started to diverge: the local database could not match the Lightbits configuration in terms of latency (lines) or OPS (bars). An unthrottled YCSB test (the last dataset on the right) showed the Lightbits storage having a measurable advantage over the local storage configuration for both average read latency and total OPS.

graph showing average read latency for YCSB Workload A

Cloud workloads demand good QoS latency as well. The chart below shows the 99.9% QoS latency values with a similar result to average latency. The Lightbits configuration again showed better performance, with the no throttle QoS latency delta of 28.9 ms for local storage versus 40.7 ms for remote storage.

graph showing QoS (99.9%) read latency for YCSB workload A
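For readers less familiar with tail-latency metrics, the sketch below shows how an average and a 99.9% latency differ on the same samples. It uses a generic nearest-rank percentile and synthetic data. It is not YCSB's own measurement code, which may bucket or interpolate differently.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the smallest sample with at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies of 1..1000 ms, invented for illustration only.
latencies_ms = [float(i) for i in range(1, 1001)]
average_ms = sum(latencies_ms) / len(latencies_ms)
p999_ms = percentile_nearest_rank(latencies_ms, 99.9)
print(f"average: {average_ms} ms, 99.9%: {p999_ms} ms")
```

The average describes the typical request, while the 99.9% value describes the slowest one in a thousand, which is why the two are charted separately.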

Average update latency was interesting in that the average latency decreased for both tests as the load increased. This was an artifact of the way YCSB’s throttling mechanism works. The results for no throttling (bars on extreme right of chart below), on both the Lightbits and local NVMe configurations, showed nearly identical average update latency (42 ms versus 43 ms), with higher total OPS on the Lightbits configuration.

graph showing average update latency for YCSB workload A

Update tail latency (99.9%) showed a similar decrease in measured latency for both test configurations as the load increased. Again, this was an effect of the way YCSB measures update latencies when throttling performance. As the load increased to 60,000 OPS per server and higher, we once again saw a measurable advantage for the Lightbits solution in both 99.9% latency and total OPS.

graph showing QoS (99.9%) update latency for YCSB workload A

Summary of Lessons Learned

The Lightbits Labs NVMe/TCP solution reduced tail latency, and its overall performance was close to or better than local NVMe for Cassandra. Disaggregated storage makes a lot of sense for applications like Cassandra because it lets administrators assign an appropriate amount of capacity to the application while taking advantage of the massive performance of pooled NVMe.

Setup and configuration of the Lightbits storage server was straightforward, and we were running tests within a day. Our testing shows that Micron’s fastest NVMe SSDs can move to external storage and retain their performance advantage with a little help from Lightbits Labs.

Would You Like to Know More?

Micron found the Lightbits NVMe/TCP approach intriguing enough to become an investor. We’ve planned a Micron technical brief on these NVMe/TCP Cassandra results. Get a notification when it’s available and stay up to date by following us on Twitter @MicronTech and connecting with us on LinkedIn.

Director, Storage Solutions Architecture

Ryan Meredith

Ryan Meredith is director of Data Center Workload Engineering for Micron's Storage Business Unit, testing new technologies to help build Micron's thought leadership and awareness in fields like AI and NVMe-oF/TCP, along with all-flash software-defined storage technologies.