
Make Hadoop 36% Faster with a Little Flash Memory

By Greg Kincade - 2018-06-04

Micron® Technology is proud to be both a customer of Hortonworks® and an enabler of improved performance on Hadoop®. As a user, Micron’s 13 global fabs deploy Apache Hive™ to help ensure our production processes improve the yield and quality of our memory and storage product lines, and to help us continually innovate on 3D NAND solid state drives and 3D XPoint™ technology. As an Ecosystem Enablement Program manager and developer of storage solutions, I’ve led several projects to test Micron fast storage with big data workloads. We get constant questions about Hadoop and HDFS, and the interest is well deserved: Zion Market Research expects the global Hadoop market to reach approximately USD 87.14 billion by 2022, growing at an estimated CAGR of 50% between 2017 and 2022.


Data managers often presume that certain storage devices fit best with certain workloads – but they have been known to do so without testing, without any real data to back up those assumptions. The typical assumption is that if a workload is highly sequential, HDDs will excel and SSDs will offer little value. That thinking may have been valid when SSDs were optimized only for very small random accesses, but today’s high-performance SSDs and memory have created a new world. We want to thank Hortonworks for the opportunity to share the results of one of our tests with Hadoop.

Add One High-Performance SSD to HDD-Based Nodes, See a 36% Faster Average Run

When IT teams need better Hadoop performance from their HDD-based nodes, they typically have two options:

  1. They can add more nodes to the existing cluster. This may help meet performance goals, but the incremental cost of the added nodes may be prohibitive.
  2. Alternatively, they can replace the current cluster nodes with new ones. Rebuilding with higher performance nodes may meet performance goals, but the cost is higher still.

Now there is a third choice that enables better performance and economics: add a high-performance Micron NVMe™ SSD to each existing HDD-based cluster node. Adding a single SSD (tested with a Micron 9100) to each node of an existing 10-node Hadoop cluster and making a slight change to YARN’s resource localization delivered a 36% average reduction in benchmark runtime (over 10 test runs), and it is far more economical than adding more nodes to achieve a similar improvement.
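In practice, that “slight change” amounts to pointing the NodeManager’s local directories at the SSD. The sketch below shows the idea in yarn-site.xml; the property name is standard, but the mount point and paths are illustrative assumptions, not our exact test settings.

```xml
<!-- Fragment of yarn-site.xml on each worker node (illustrative sketch;
     /mnt/nvme0 is an assumed SSD mount point, not the tested configuration) -->
<property>
  <!-- YARN localizes application resources and writes intermediate job data under these directories -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/nvme0/yarn/local</value>
</property>
```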

Reduce Benchmark Runtime. Keep Your Cluster Investment.

Each Hadoop distribution comes with a set of standardized, built-in benchmarks. These benchmarks enable consistent performance measurements across a broad range of technologies and deployments.

The first configuration deployed the SSD for the YARN cache and kept HDFS on the HDDs; the second used the SSD for both the YARN cache and HDFS. Both ran on the same hardware. When we used the NVMe SSD as the YARN cache, we saw a 36 percent reduction in benchmark completion time compared with the all-HDD baseline.
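In the SSD-as-YARN-cache configuration, HDFS block storage stays on the spinning disks. A sketch of the corresponding hdfs-site.xml entry is below; the property name is standard, while the HDD mount points are illustrative assumptions.

```xml
<!-- Fragment of hdfs-site.xml (illustrative sketch;
     /data/hdd1 and /data/hdd2 are assumed HDD mount points) -->
<property>
  <!-- DataNode block storage remains on the HDDs while the YARN cache moves to the SSD -->
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdd1/hdfs/data,/data/hdd2/hdfs/data</value>
</property>
```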

[Chart: benchmark completion time, all-HDD baseline vs. NVMe SSD as the YARN cache]

Add the SSD for Better Economics (Than Expanding the Cluster)

After we saw the above improvement, we also wanted to understand whether adding one high-performance SSD to the existing cluster nodes was more economical than expanding the cluster (adding more all-HDD nodes) to achieve a similar runtime reduction. So we expanded the all-HDD cluster until it approximated the benchmark completion time of our SSD YARN cache cluster (3,518 seconds average across 10 runs), then analyzed the cost.

We added two more all-HDD nodes (12 total) and repeated the tests. The 10-run average improved to a 4,353-second mean, but still fell short of the SSD YARN cache configuration. We added one more node (13 total) and ran the same tests again; this time the average came very close to the SSD configuration’s 3,518 seconds.

[Chart: benchmark runtime as all-HDD nodes are added, compared with the 10-node cluster using the SSD as YARN cache]

When we ran these tests, the manufacturer’s suggested retail price (MSRP) for the all-HDD cluster nodes as configured was just over $12,000 each, while one high-performance Micron 9100 SSD was just under $2,500. Reaching similar performance required three additional all-HDD nodes (more than $36,000 at MSRP), whereas adding one SSD to each of the 10 existing nodes cost less than $25,000. Adding the SSD to each node of our existing cluster was clearly the more economical path.

Summary

Adding a single NVMe SSD to each node of a 10-node, all-HDD cluster reduced a standard benchmark runtime by 36 percent and cost less than expanding the existing cluster to reach a similar runtime reduction. When IT teams need more from their enterprise deployments, they typically weigh several factors: acquisition and recycling costs, performance benefit, and deployment time, among others.

Distributed systems like Hadoop offer two main options for improving performance: add more of what you already have (cluster expansion) or replace what you have (decommission and build new). With Micron’s NVMe SSDs, there is another, more attractive option: add one SSD to each existing cluster node and make a small change to YARN resource localization. If you’d like, read the full technical brief by my colleague Doug Rollins.

How We Measured These Results

We used the standard benchmarks built into most Hadoop distributions, running the benchmark set 10 times and recording the mean completion time. The table below shows the benchmarks we used and their parameters.

[Table: benchmarks and parameters used in testing]
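To give a sense of what these built-in benchmarks look like, the commands below sketch a typical TeraGen/TeraSort run; the examples jar ships with Hadoop, but the row count and HDFS paths here are assumptions for illustration, not the parameters in the table above.

```bash
# Illustrative sketch: teragen/terasort are part of the standard Hadoop examples jar.
# The row count (100-byte rows) and HDFS paths are assumptions, not our test parameters.
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 10000000000 /benchmarks/terasort-input            # generate ~1 TB of input data
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort /benchmarks/terasort-input /benchmarks/terasort-output
```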

See you at DataWorks Summit 2018

We’re looking forward to the upcoming 3.x versions of Apache Hadoop HDFS and Apache Hive and to seeing how they’ll perform with Micron IT infrastructure. At the DataWorks Summit in San Jose in June, Micron and Hortonworks will show real-world results demonstrating the tangible benefits of Hadoop 3.x when combined with the latest in non-volatile storage and an updated IT infrastructure in well-designed platforms.

Another highlight: real-world Hive examples from our Micron fab group’s senior data architect.

Hope to see you there.


Greg Kincade

Greg Kincade is a Sr. Ecosystem Enablement Program Manager for the Micron Storage Business Unit. His primary focus is working cross-functionally with external collaborators and internal stakeholders to define differentiated solutions that help our customers dramatically increase the performance and efficiency of their storage infrastructure.
