After finishing up some testing of the Yahoo! Cloud Serving Benchmark (YCSB) and Apache™ Cassandra® on our new high-performance solid-state drive, the Micron® 9300 SSD, I had the opportunity to investigate using the Micron 9300 PRO NVMe™ SSDs as YARN cache in our test Hadoop cluster. We previously did a study on how adding NVMe drives can improve the performance of a hard disk drive (HDD) Hadoop cluster and got excellent results. These computational clusters store and analyze huge amounts of unstructured data in a distributed computing environment.
Previously: Hadoop cluster of HDDs with added Micron 9200 NVMe SSDs in the cache
This time around I had the chance to see if the Micron 9300 SSD, which uses the NVMe protocol, could improve the performance of a cluster that is already using SATA SSDs for its storage. The results showed that adding Micron 9300 PRO NVMe SSDs to a Hadoop cluster can further improve its performance and allow it to run more jobs more quickly.
The Hadoop cluster was made up of four HDFS DataNodes, each of which had 12 x Micron 5210 ION QLC SSDs and 1 x Micron 9300 PRO NVMe SSD. These DataNodes also hosted the YARN NodeManagers, which meant that the compute was very close to the storage. Another set of four nodes was used to host the other components required to have a running Apache Ambari-managed Hadoop cluster, including NameNodes, ResourceManagers, App Timeline Servers, NFS Gateways, etc.
Compute and Storage Nodes: (4 Nodes)
- Supermicro 2029U-TR25M
- 2 x Intel Xeon Gold 6142 Processors, 16 Cores @ 2.60GHz
- 384 GB RAM
- 1 x 3.84TB Micron 9300 PRO NVMe SSD
- 12 x 8TB Micron 5210 ION QLC SATA SSDs
- 100GbE Networking
Infrastructure Nodes: (4 Nodes)
- Supermicro 2028TP-HC0TR (4 Nodes in 2U)
- 2 x Intel Xeon E5-2660 v3 Processors, 10 Cores @ 2.60GHz
- 1 x Micron 5100 PRO 960GB SATA SSD for storage
Hortonworks Data Platform (HDP) 2.6.5 on Ambari 188.8.131.52:
- HDFS – 2.7.3
- YARN – 2.7.3
- MapReduce2 – 2.7.3
- Spark2 – 2.3.0
We ran the HiBench Micro Benchmark suite, which includes the Sort and TeraSort benchmarks, on the Spark2 framework.
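The exact settings used to place the YARN cache on the NVMe drive are not listed here. One common approach, shown below as a minimal sketch, is to point the NodeManager's `yarn.nodemanager.local-dirs` property (where YARN keeps localized resources and intermediate data) at a directory on the NVMe SSD. The mount point `/mnt/nvme0` and the helper `render_yarn_site` are hypothetical names used only for illustration; on an Ambari-managed cluster the value would normally be set through Ambari rather than by editing the file directly.

```python
import xml.etree.ElementTree as ET


def render_yarn_site(nvme_mount: str) -> str:
    """Build a yarn-site.xml fragment that puts NodeManager local dirs on NVMe.

    nvme_mount is an assumed mount point for the NVMe SSD, not a value
    taken from the test cluster.
    """
    config = ET.Element("configuration")
    prop = ET.SubElement(config, "property")
    # yarn.nodemanager.local-dirs is a standard YARN property.
    ET.SubElement(prop, "name").text = "yarn.nodemanager.local-dirs"
    # Hypothetical directory layout under the NVMe mount point.
    ET.SubElement(prop, "value").text = f"{nvme_mount}/hadoop/yarn/local"
    # Pretty-print when the interpreter supports it (Python 3.9+).
    if hasattr(ET, "indent"):
        ET.indent(config)
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    print(render_yarn_site("/mnt/nvme0"))
```

With a setting like this, HDFS blocks would stay on the SATA SSDs while shuffle and spill traffic from Spark jobs lands on the NVMe drive.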
Results and Analysis
When we compared the two configurations (Micron 5210 SATA SSDs alone versus the same SATA SSDs with a Micron 9300 PRO NVMe SSD added as YARN cache), we found that adding the NVMe SSDs delivered up to a 30% improvement in performance over the SATA-only configuration.
In the Sort test, the Micron 9300 NVMe SSD + Micron 5210 SATA SSD configuration sorted nearly 350 GB of random text data in about 500 seconds, compared with 730 seconds for the SATA SSD configuration. This corresponds to a throughput of 630 MB/s for the Micron 9300 NVMe SSD + Micron 5210 SATA SSD configuration and 450 MB/s for the Micron 5210 SSD configuration.
This was a nearly 30% reduction in completion time.
Another popular benchmark to run against a Hadoop cluster is TeraSort, which measures the time it takes to sort one terabyte of randomly distributed data. This benchmark took about 4,100 seconds on the Micron 9300 + Micron 5210 SSD configuration and nearly 5,000 seconds on the Micron 5210 SATA SSD configuration, for a throughput of 155 MB/s and 130 MB/s, respectively.
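To make the comparison concrete, the short sketch below works out the reduction in completion time and the gain in throughput from the approximate figures quoted above. The function and variable names are illustrative only, and because the inputs are the rounded values from the text, the outputs are approximate as well.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


def percent_gain(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is higher than `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Rounded figures quoted in the text: durations in seconds, throughput in MB/s.
RESULTS = {
    "Sort": {"sata_s": 730, "nvme_s": 500, "sata_mbps": 450, "nvme_mbps": 630},
    "TeraSort": {"sata_s": 5000, "nvme_s": 4100, "sata_mbps": 130, "nvme_mbps": 155},
}


def summarize(results: dict) -> dict:
    """Return {benchmark: (time reduction %, throughput gain %)}."""
    return {
        name: (
            percent_reduction(r["sata_s"], r["nvme_s"]),
            percent_gain(r["sata_mbps"], r["nvme_mbps"]),
        )
        for name, r in results.items()
    }


if __name__ == "__main__":
    for name, (time_cut, tput_gain) in summarize(RESULTS).items():
        print(f"{name}: {time_cut:.1f}% less time, {tput_gain:.1f}% more throughput")
```

On these rounded inputs, Sort comes out at roughly 31.5% less time and 40% more throughput, and TeraSort at 18% less time and about 19% more throughput.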
Move to Micron SSDs
One of the easiest ways to get more bang for the buck from your Hadoop cluster is to replace the HDDs with solid-state drives. Take it one step further by using the Micron 9300 PRO NVMe SSDs to improve the performance of those SSD-based clusters.
Micron SSDs work well with Hadoop, and we enjoy an ecosystem relationship with Cloudera (previously Hortonworks). The Micron team continues to test even more exciting machine learning benchmarks on our Hadoop cluster. The results of this testing will be shared in a Micron presentation at DataWorks Summit 2019 and will be available as a downloadable technical brief.
Visit www.micron.com/9300 to learn more.
Micron at DataWorks Summit 2019 – come chat with us!
Affordable All-Flash Big Data Clusters with QLC SSDs by Sujit Somandepalli
- Wed, 5/22 at 12:45 p.m. – preview on the exhibit floor
- Thu, 5/23 at 4:00 p.m. in full breakout session
Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse by Paul Gibeault & Peter Wicks
- Wed, 5/22 at 2:00 p.m. in full breakout session