How fast can a Ceph implementation go?
As a Principal Storage Solutions Engineer with Micron, I am tasked with figuring that out. I was proud to announce Micron’s All-NVMe Ceph® reference architecture at the OpenStack Summit 2017 in Boston. It’s based on Red Hat® Ceph Storage 2.1 (Jewel 10.2) and it is crazy fast.
In case you missed it, here’s a diagram of our Reference Architecture:
Here are links to the reference architecture and the video of my session at the OpenStack Summit:
This reference architecture is built with the Micron 9100 MAX 2.4TB NVMe SSD, Red Hat Enterprise Linux 7.3 + Red Hat Ceph Storage 2.1, and the Supermicro SYS-1028U-TN10RT+.
Faster!?! Kraken + BlueStore Performance
Many conference attendees asked me about the performance difference when using BlueStore in Ceph instead of FileStore. Using BlueStore should help alleviate the write penalty inherent in Ceph and provide greater performance. Check out this slide deck from Sage Weil (the founder and chief architect of Ceph) on the problems with FileStore and how BlueStore addresses them.
I did some testing with the latest GA version of Ceph, Kraken (Ceph 11.2) using BlueStore on the same hardware from the reference architecture and got some impressive performance improvements.
- 39% higher 4KB random read performance
- 21% higher 4KB random write performance
- 25% lower 4MB object average read latency with higher throughput
- 2.3X higher 4MB object write throughput + 37% lower average latency
Giant Squid-Sized Caveat: BlueStore in Kraken is listed as stable but still flagged experimental and potentially data corrupting, so use it at your own risk and keep it far from production.
4KB Random Block Workload
I used FIO against the RBD driver on 10 load generation servers to push 4KB random read and write workloads. Tests were run on a 2x replicated pool holding 5TB of data (10TB on disk, accounting for replication).
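For reference, an FIO job using the RBD ioengine looks roughly like the file below. This is an illustrative sketch, not Micron's exact job file: the pool name (`rbd_bench`), image name (`image0`), queue depth, and runtime are all assumptions.

```ini
; fio-4k-randwrite.fio -- illustrative only; names and values are assumptions
[global]
ioengine=rbd          ; drive I/O through librbd, no kernel RBD mapping needed
clientname=admin      ; Ceph client id (uses /etc/ceph keyring)
pool=rbd_bench        ; assumed pool name
invalidate=0
direct=1
time_based=1
runtime=300

[4k-randwrite]
rbdname=image0        ; assumed pre-created RBD image
rw=randwrite          ; switch to randread for the read test
bs=4k
iodepth=32
numjobs=4
```

Each of the 10 load generation servers runs a job like this in parallel, and the per-client results are summed to get cluster-level IOPs.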
Kraken + BlueStore reaches 1.6 million 4KB random read IOPS at 6.3ms average latency, a 39% increase in read IOPS over Jewel. 4KB random writes hit 291K IOPS at 4.4ms average latency, a 21% increase.
CPU utilization is lower with Kraken + BlueStore, topping out above 85% on reads and above 70% on writes. There should be more performance to gain from further optimization and tuning as BlueStore reaches GA.
4MB Object Workload
I used RADOS Bench on 10 load generation servers to push 4MB object reads and writes. Tests were run on a 2x replicated pool holding 5TB of data (10TB on disk, accounting for replication).
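A minimal `rados bench` invocation looks like the commands below. This is a sketch, not the exact test harness: the pool name (`bench`), PG count, and runtime are assumptions, and each of the 10 load servers would run its own copy in parallel.

```shell
# Create a 2x replicated test pool (pool name and PG count are assumptions)
ceph osd pool create bench 8192 8192 replicated
ceph osd pool set bench size 2

# Write 4MB objects (the rados bench default size) for 300 seconds;
# keep the objects around so they can be read back afterward
rados bench -p bench 300 write --no-cleanup

# Sequentially read back the objects written above
rados bench -p bench 300 seq
```

`rados bench` reports bandwidth and latency per run; cluster-level throughput is the sum across the load generation servers.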
4MB object read throughput with RADOS Bench is close between the two releases because both tests are network limited on 50GbE networking. Kraken still achieves higher throughput (+900 MB/s) and 25% lower average latency.
4MB object write throughput is 2.3X higher than with Jewel. This large difference comes from BlueStore writing 2 copies of each object (plus a small amount of metadata) on a 2x replicated pool, versus FileStore writing 4 copies of every object because each replica is journaled first. Halving the backend writes creates a massive throughput increase along with a 37% reduction in average latency. With BlueStore, 4MB object writes are network limited.
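The back-of-the-envelope arithmetic behind that gap is simple. The per-replica write counts below are the commonly cited FileStore and BlueStore behaviors, not measured values:

```python
# Device writes per client write on a 2x replicated pool.
REPLICAS = 2

# FileStore writes every object twice per replica: once to the journal,
# then again to the XFS filesystem.
filestore_writes = REPLICAS * 2

# BlueStore writes the object data once per replica; metadata goes to
# RocksDB and is small enough to ignore for large-object throughput.
bluestore_writes = REPLICAS * 1

print(filestore_writes, bluestore_writes)   # 4 2
print(filestore_writes / bluestore_writes)  # 2.0
```

Halving the backend writes roughly doubles the available device bandwidth for client data, which is the bulk of the 2.3X; the remainder comes from BlueStore's shorter write path.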
Faster Ceph Implementations, Coming Soon!
A Ceph implementation is faster with BlueStore, full stop.
BlueStore addresses two major drawbacks in the Ceph stack: the write amplification penalty of journaling, and the overhead of storing data through the XFS filesystem. The performance improvements of BlueStore allow Ceph to take better advantage of high performance drives like the Micron 9100 MAX NVMe SSD.
BlueStore in its current state is encouraging. While it is still experimental, my test cluster was stable with Kraken 11.2. There is no reason to doubt that BlueStore will provide even higher performance in its final form. BlueStore will become the default in the next GA version of Ceph, Luminous, which should be released in late 2017.
The ceph.conf tuning settings for BlueStore and RocksDB I used in these tests are available here:
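The full tuned configuration is in the linked file, but enabling BlueStore on Kraken looked roughly like the fragment below. This is a sketch, not Micron's tuned values: the experimental-features gate was how pre-Luminous releases opted into BlueStore, and the RocksDB option string shown is a placeholder.

```ini
[global]
# Pre-Luminous releases require explicitly opting in to BlueStore
enable_experimental_unrecoverable_data_corrupting_features = bluestore

[osd]
osd_objectstore = bluestore
# RocksDB tuning is passed through as a single option string;
# these values are illustrative, not the settings from these tests
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4
```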
© 2017 Micron Technology, Inc. All rights reserved. All information is provided on an “AS IS” basis without warranties of any kind.