Accelerating the Apache Hadoop 3.1-based Distribution Ecosystem with Flash Storage

By Tony Ansley - 2018-07-31

As more businesses depend on big data analytics, we see increasing numbers of open-source Apache™ Hadoop® platform deployments. Big data means lots of storage and, as a major solid-state drive (SSD) provider, Micron is very interested in the advantages that enterprise flash storage could bring to this environment. To date, most big data solutions have been built around legacy hard-disk drives (HDDs) because of their traditional cost advantage over higher-performance SSDs. The additional speed, efficiency and density of flash, along with its lower latency, were not perceived as valuable to big data analytics because most of the analysis was batch processed.

Recently, however, companies have come to see the value of real-time analytics, which shortens time to insight. Real-time analytic use cases require faster storage, in terms of both latency and I/O operations per second (IOPS), such as that provided by Non-Volatile Memory Express™ (NVMe™) SSDs and SATA SSDs.

Our alliance partner Hortonworks® is a leading Hadoop platform implementer for managing large data repositories and performing deep analytics that allow you to obtain actionable intelligence from that data. Hortonworks and Micron believe that SSDs can provide real value to data analytics infrastructure.

We have started a series of performance analyses using an Apache Hadoop 3.1-based distribution, specifically Hortonworks Data Platform (HDP® 3.0), with Micron SSDs in various roles. We co-presented our results with Hortonworks in a session at the DataWorks Summit 2018. In a pre-production environment, our Micron IT team found TCO benefits from having fewer nodes to set up and fewer nodes incurring software licensing costs.


Flash in Hadoop Cache

One of the biggest challenges that data scientists encounter in their quest for a faster time to answer is not CPU or GPU performance, but supplying those CPUs and GPUs with data fast enough to keep these expensive resources fully utilized. We in the IT world call this CPU idle time, and it is the bane of real-time big data analytics. This is where judicious, cost-effective additions of SSDs to existing big data deployments can help.

One of our goals for the performance analysis of HDP 3.0 with Micron SSDs was to reduce CPU idle time and thus shorten time to answer. We are excited to announce the initial results from the tests we conducted.

For these tests, Hortonworks gave us early access to its new HDP 3.0-based Hive™ database on HDFS/YARN solution, which was deployed on two separate 4-node clusters. The clusters were configured as shown in the table below. The primary difference between them was the addition of a single Micron 9200 MAX NVMe SSD to each node of one cluster as a YARN cache for the HDDs. We targeted a database-size-to-memory ratio of two to one: the total database size was 2TB and the aggregate cluster memory was 822GB after OS overhead.

[Table: configuration of the two 4-node test clusters]
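As a quick check on the sizing described above, the short Python sketch below computes the database-to-memory ratio from the two quoted figures. It assumes decimal units (1 TB = 1000 GB), which the article does not specify, and the function and variable names are illustrative only. Under that assumption, the quoted sizes work out to somewhat more than the two-to-one target.

```python
# Figures quoted in the article. Decimal units (1 TB = 1000 GB) are an
# assumption; the article does not say which convention was used.
DATABASE_SIZE_GB = 2 * 1000
CLUSTER_MEMORY_GB = 822
NODES_PER_CLUSTER = 4


def data_to_memory_ratio(database_gb: float, memory_gb: float) -> float:
    """Return how many times larger the dataset is than aggregate memory."""
    if memory_gb <= 0:
        raise ValueError("memory_gb must be positive")
    return database_gb / memory_gb


ratio = data_to_memory_ratio(DATABASE_SIZE_GB, CLUSTER_MEMORY_GB)
memory_per_node_gb = CLUSTER_MEMORY_GB / NODES_PER_CLUSTER

print(f"data-to-memory ratio: {ratio:.2f}:1")
print(f"usable memory per node: {memory_per_node_gb:.1f} GB")
```

A ratio above one means the working set cannot fit entirely in memory, so queries must spill to and read from local storage, which is where a faster cache device has room to help.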

Our test ran queries from the Transaction Processing Performance Council’s TPC-DS benchmark and measured the time to completion for each query on each cluster configuration. The HDD-only configuration served as the baseline for comparison with the NVMe-cached configuration. We were able to complete only 94 of the benchmark's 99 queries with enough confidence to publish the results, which may be due to our using an early beta version of the Hortonworks HDP 3.0 software. Overall, the SSD-cached configuration completed the 94 queries 1.72X faster than the HDD-only configuration.

The chart below illustrates the benefits for the six queries with the biggest improvements (shorter bars are better). In the interest of providing complete information, three of the 94 queries ran slower with the SSD-based YARN cache, the worst being 3.6 percent slower.


Six Queries from the TPC-DS Benchmark for 15K HDD vs NVMe SSD
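For readers who want to run the same kind of comparison on their own clusters, here is a minimal Python sketch of how an overall speedup and per-query regressions could be derived from paired timings. The article does not state whether its 1.72X figure is a ratio of total run times or an average of per-query ratios, so the total-time ratio used here is an assumption. The query names and timings are made up for illustration and are not the measured results.

```python
from typing import Dict, Tuple


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def percent_slower(baseline_s: float, candidate_s: float) -> float:
    """Return how much slower the candidate is, as a percentage of baseline."""
    return (candidate_s - baseline_s) / baseline_s * 100.0


def summarize(
    timings: Dict[str, Tuple[float, float]]
) -> Tuple[float, Dict[str, float]]:
    """Summarize {query: (hdd_seconds, ssd_seconds)} timings.

    Returns the overall speedup (total HDD time / total SSD time) and a
    mapping of the queries that regressed to their percent slowdown.
    """
    total_hdd = sum(hdd for hdd, _ in timings.values())
    total_ssd = sum(ssd for _, ssd in timings.values())
    regressions = {
        query: percent_slower(hdd, ssd)
        for query, (hdd, ssd) in timings.items()
        if ssd > hdd
    }
    return speedup(total_hdd, total_ssd), regressions


# Hypothetical timings in seconds, NOT the article's measurements.
example_timings = {
    "query_a": (120.0, 60.0),
    "query_b": (80.0, 40.0),
    "query_c": (50.0, 51.8),
}

overall, slower = summarize(example_timings)
print(f"overall speedup: {overall:.2f}X")
for query, pct in slower.items():
    print(f"{query} ran {pct:.1f} percent slower with the cache")
```

Reporting regressions alongside the headline speedup, as the article does, keeps a few large wins from hiding the queries that did not benefit.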

Two final observations should be made. First, our testing was performed using 15K RPM SAS HDDs. These drives are not typically the primary HDD type in big data solutions, which tend to use 7.2K RPM SATA HDDs. For this reason, we believe that in real-world environments the acceleration from introducing NVMe SSDs as a YARN cache would be even greater. Second, we are still early in the process of discovery. As we look at other components in the Apache Hadoop 3.1-based distribution ecosystem, we expect to find other candidates that may benefit from using SSDs in strategic roles within the storage system.

What is even more interesting is that the reduction in CPU idle time was the main contributor to the performance gains from using an SSD as a YARN cache. As the chart below shows, adding a single NVMe SSD reduced CPU wait to zero. Storage no longer keeps the CPUs waiting.


CPU Waiting Time for 15K Hard Disk Drive vs NVMe Solid State Drive
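The article does not say which tool produced the CPU wait measurements. One common way to observe a similar metric on a Linux node is to sample the aggregate `cpu` line of `/proc/stat` twice and compute the share of time spent in `iowait`. The Python sketch below assumes that this counter is a reasonable stand-in for the CPU wait shown in the chart. The helper names and sample values are made up for illustration.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of
# /proc/stat. Guest time is already included in "user", so it is left out.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line from /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def iowait_percent(before: str, after: str) -> float:
    """Return the percentage of CPU time spent waiting on I/O between samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in first}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0


# Made-up samples for illustration. On a real node, read the first line of
# /proc/stat at the start and at the end of the measurement interval.
sample_before = "cpu 1000 0 500 8000 500 0 0 0 0 0"
sample_after = "cpu 1600 0 700 8900 800 0 0 0 0 0"

wait_pct = iowait_percent(sample_before, sample_after)
print(f"iowait over the interval: {wait_pct:.1f}%")
```

Sampling this on each worker node while a query runs, with and without the cache, gives a rough view of how much time the CPUs spend stalled on storage.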

There is a clear advantage to introducing SSDs, even in a limited role such as YARN cache, into your big data solution. While HDDs will continue to be the primary storage technology for the foreseeable future, using SSDs in strategic roles within the solution can provide a cost-effective way to get quicker answers and allow you to make more timely decisions that can directly impact your business. We look forward to continuing to partner with Hortonworks to evaluate more roles that SSDs can play within the Hadoop ecosystem.

Amit Gattani

Tony Ansley

Tony has 34 years of experience as a technology leader in server architectures and storage technologies and their application in meeting customers’ business and technology requirements. He enjoys fast cars, travel, and spending time with family, not necessarily in that order.