Ceph Performance on NVMe: BlueStore, Checksums, and Tuning
The latest version of Ceph recently hit GA status with a ton of new features. Ceph 12.2, Luminous, is the long-term stable release of Ceph and adds full GA support for BlueStore, full data checksums, improved out-of-the-box tuning, and a lot more. Check the release notes here.
I loaded up Luminous on my reference architecture hardware and tested with checksums enabled and disabled to see their impact on IOPS performance and latency. I also ran scaling tests from 1 to 100 FIO instances accessing my four-node, all-NVMe Ceph cluster deployment. This blog focuses on 4KB random read and write tests using the RADOS Block Device (RBD) interface with FIO.
Test Platform: IOPS Optimized Reference Architecture
I used the hardware from our Ceph Reference Architecture with a few software updates:
Ceph Storage Software Tuning
For those of you who have worked with Ceph storage, tuning the ceph.conf file is an art form. That makes one bullet point in the release notes stand out:
- Each OSD now adjusts its default configuration based on whether the backing device is an HDD or SSD. Manual tuning generally not required.
Based on this, I removed all the OSD-specific tuning from my config file and…
It works! The OSDs scaled to use most of the CPU on my storage nodes and performed well. There may be some NVMe-specific tuning left to uncover, but good out-of-the-box performance is great progress.
BlueStore is now the default storage engine and I ran my tests without any specific BlueStore tuning in the config file. NVMe SSDs were configured with 1 OSD per drive with data, WAL, and RocksDB on the same partition.
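For reference, the stripped-down config file ends up looking roughly like the sketch below. The FSID, monitor hosts, and network ranges are placeholders, not values from my cluster:

```ini
[global]
fsid = <cluster-fsid>
mon_host = <mon1>,<mon2>,<mon3>
public_network = <public-cidr>
cluster_network = <cluster-cidr>
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

# No [osd] tuning section: Luminous OSDs detect whether the backing
# device is an HDD or SSD and pick sensible defaults on their own.
```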
Checksum and Ceph
In previous releases of Ceph, checksum validation of objects was done by the deep scrub process on a weekly schedule. This enables fast writes, but creates a lag between when data is written and when it is validated. Ceph Kraken introduced in-line checksums using the crc32c algorithm. This improves data integrity and reduces both the impact and the necessity of the deep scrubbing process (which can be resource intensive). Luminous is the first long-term stable release of Ceph to support in-line checksums.
Test Methodology: FIO + RBD 4KB Random Workloads
I used 10 Supermicro 2028U servers with 40GbE NICs as load generation servers. I scaled FIO tests from one instance to 100 instances running across these 10 servers. FIO uses the RBD driver to access RBD images.
I created 100x 50GB RBD images on a 2x replicated pool; each FIO process accessed a unique RBD image.
I ran a sweep of 4KB read and write tests with checksums disabled on the pool, then deleted and re-created the pool with crc32c checksums enabled and re-ran the tests.
Each test was run for 5 minutes with a 100 second ramp-up time. RBD images were fully written before testing to eliminate the performance impact of thin-provisioning.
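A minimal FIO job file for one of these instances might look like the following sketch. The pool and image names are placeholders, and the queue depth is my assumption (it isn't stated above); the rbd ioengine talks to the cluster directly through librbd:

```ini
[global]
ioengine=rbd
clientname=admin
pool=rbd_bench        ; placeholder pool name
bs=4k
iodepth=32            ; assumed queue depth, not stated in the post
direct=1
time_based=1
runtime=300           ; 5-minute measurement window
ramp_time=100         ; 100-second ramp-up

[rand-4k]
rw=randwrite          ; or randread for the read sweep
rbdname=image01       ; each FIO process gets its own RBD image
```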
4KB Random Write Performance: BlueStore + CRC32C
4KB random write performance hits its high point at 70 FIO instances with 273k IOPS (no checksum) and 233k IOPS (crc32c checksum). Enabling crc32c checksums costs roughly 15% in IOPS.
The following chart shows 4KB random write IOPS, average latency (avg LAT, in ms), and 99.99th-percentile latency (99.99% LAT, in ms) as the tests reach maximum performance.
CPU utilization averaged 75%+ on 4KB random write tests. Running 2 OSDs per NVMe device may increase CPU utilization and performance; I will test this in the future.
4KB Random Read Performance: BlueStore + CRC32C
The addition of checksums has a negligible effect on 4KB random read performance. Peak performance of 1.5M IOPS is hit at 50-60 FIO instances.
The following chart shows 4KB random read IOPS, average latency (avg LAT, in ms), and 99.99th-percentile latency (99.99% LAT, in ms) as the tests reach maximum IOPS.
CPU utilization averaged 90%+ on 4KB random read tests. 4KB random read performance is CPU limited.
Ceph Luminous represents a great step forward for Ceph. The auto-tuning OSDs work as advertised, BlueStore performance is strong, and in-line checksums cost about 15% for 4KB random writes with no measurable impact on reads.
My next steps are to look at large-object throughput and the impact of erasure-coded RBD pools. The Intel Purley and AMD EPYC™ platforms have also been released, and the Micron 9200 series of NVMe™ SSDs is on the way.