
Micron Brings All-Flash Performance to Big Data with Newest Spark Reference Architecture

By Tony Ansley - 2019-07-10

Hello again! It has been a busy time here at Micron. If you have read some of our earlier blogs, you know that we have been diligently working on several projects that illustrate the value of SSDs for real-world solutions that are important to you. Today, we offer up a brand-new category of solutions that our customers have been requesting and that can greatly benefit from the introduction of flash: big-data analytics. Check out this Micron Accelerated Apache™ Hadoop® Analytics With Apache Spark™ Solution Brief to get straight into the technical details.

Over the past year, we have provided insights into the rationale behind flash for big data through our Micron Blog and through whitepapers and technical briefs. The more businesses embrace big-data analytics to drive their products and services, the more critical it becomes to their success. Even more critical is the shift to using these solutions to gain insights about their business and customers in real time. Any time I hear the word “real-time,” I mentally replace it with “super-fast.”

The challenge that businesses have (and probably you, if you’re an IT professional) is that most production big-data deployments are starting to show their age and cannot provide needed answers quickly enough, because these HDD-based clusters simply cannot keep up. When originally built, these clusters made perfect sense using HDDs; heck, even Micron built its initial big-data clusters using HDDs! HDDs offered large capacities and were relatively inexpensive. But, like you, we needed to be more responsive in our business, and our HDD-only production clusters weren’t cutting it. As discussed in the past (here, here and here), we learned that simply adding a single NVMe SSD to each cluster node generated immediate benefits. This also illustrated a great way to start transitioning your existing Hadoop clusters to all-flash Hadoop clusters. You can review the results of our testing here.

As we talked with various Hadoop vendors and customers, we also discovered a shift in the analytics tools used, from Hadoop MapReduce to Apache Spark, driven by Spark’s in-memory processing support. In some instances, this can improve performance dramatically. Some estimates show that the use of Spark is growing much faster than that of MapReduce, so we decided to build this RA (reference architecture) using Spark.

Let’s get to Micron’s new Spark-based big-data solution.

Graphic 1
Data analytics with Spark, HDFS and YARN architecture overview

1.1 Solution Overview

The cluster consists of four compute/data nodes hosting the Spark executor, YARN cache, and HDFS data, and four infrastructure nodes hosting the YARN ResourceManager and the HDFS NameNodes. On the compute/data nodes, we placed the YARN cache on the Micron 9300 PRO NVMe SSD, and for the primary data (HDFS) storage we used the Micron 5210 ION SATA SSD. All tests were run from a client server (see image above) that incorporated the modified HiBench code that integrated the Spark context.
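As a rough illustration of the storage split described above, the sketch below renders two Hadoop configuration fragments: one pointing the YARN NodeManager local directories at an NVMe mount and one pointing the HDFS DataNode data directories at SATA SSD mounts. It assumes that the “YARN cache” corresponds to the standard `yarn.nodemanager.local-dirs` property and that HDFS data lives under `dfs.datanode.data.dir`; the mount paths, the drive count, and the helper `hadoop_site_xml` are hypothetical and are not taken from the reference architecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical mount points: the RA does not publish these paths.
NVME_CACHE_DIRS = ["/mnt/nvme0/yarn/local"]            # assumed NVMe mount for YARN cache
SATA_DATA_DIRS = ["/mnt/sata0/hdfs/data",              # assumed SATA SSD mounts for HDFS;
                  "/mnt/sata1/hdfs/data"]              # the real drive count may differ


def hadoop_site_xml(properties):
    """Render a {name: value} dict as a Hadoop *-site.xml <configuration> string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# yarn-site.xml fragment: intermediate (shuffle/spill) data goes to NVMe.
yarn_site = hadoop_site_xml(
    {"yarn.nodemanager.local-dirs": ",".join(NVME_CACHE_DIRS)}
)

# hdfs-site.xml fragment: primary HDFS block storage goes to the SATA SSDs.
hdfs_site = hadoop_site_xml(
    {"dfs.datanode.data.dir": ",".join(SATA_DATA_DIRS)}
)

print(yarn_site)
print(hdfs_site)
```

Keeping the two directory lists separate makes it easy to see which device class serves which role; in the non-cached SSD configuration the local directories would presumably sit on the same drives as the HDFS data.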

The tables below provide information on the software and hardware configuration. On the hardware side, we purposefully obscure the server vendor. We believe that, while there might be minor differences in each vendor’s server designs, most vendors offer a server that meets our documented choices, and your performance using that vendor’s servers should not deviate considerably from our results. This approach allows you to use the server vendor of your choice. The RA documents the configuration further, so I encourage you to download the reference architecture.

Graphic 2
Hadoop Infrastructure and Data/Compute Nodes: Software

Graphic 3
Compute/Data Nodes Hardware Details

Graphic 4
Infrastructure Nodes Hardware Details

To test how well this RA functioned and what the benefits would be, we ran a subset of the HiBench suite that was modified to support Spark as an analytics engine in place of MapReduce. We are in the process of modifying more of the HiBench suite for Spark and will use these additional benchmarks to expand our result set in future RA releases. For this RA, we executed the Sort, TeraSort, and WordCount benchmarks. We started with a baseline configuration using commonly available 7.2K RPM SATA HDDs and then tested two SSD configurations:

  • Replace the HDDs with the same number and capacity of Micron 5210 ION SSDs, with no YARN cache.
  • Add a single Micron 9300 SSD, configured as YARN cache, to the previous Micron 5210 SSD configuration.

1.2 Performance Results

The data shows that an all-flash Spark+Hadoop solution generates significant performance gains. The biggest “bang for your buck” comes simply from moving to flash. The chart below illustrates the completion times for each of the tested benchmarks (shorter bars are better). The gray bar represents the baseline HDD configuration, the blue bar the Micron 5210 SSD configuration, and the green bar the Micron 5210 + Micron 9300 configuration. The performance improvements are stated above each benchmark bar-set. Note that adding the YARN cache increases the performance gains even more, in some instances by as much as an additional 26+ percent over the non-cached SSD configuration. While the performance advantage varies depending on the type of analytics, the results are clear: SSDs, including lower-cost, lower-performing SATA SSDs, drive faster analytics.

Graphic 5
Results comparison of three different Hadoop/Spark analytics configurations
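For readers who want to run the same kind of comparison on their own clusters, here is a minimal sketch of how improvement figures can be derived from benchmark completion times. The post does not state how the percentages on the chart were calculated, so the sketch assumes they mean the percent reduction in completion time; the function names are hypothetical, and the times are made-up placeholders, not measurements from the reference architecture.

```python
def percent_faster(baseline_s, new_s):
    """Percent reduction in completion time relative to the baseline run."""
    if baseline_s <= 0 or new_s <= 0:
        raise ValueError("completion times must be positive")
    return 100.0 * (baseline_s - new_s) / baseline_s


def speedup(baseline_s, new_s):
    """How many times faster the new run is than the baseline run."""
    if baseline_s <= 0 or new_s <= 0:
        raise ValueError("completion times must be positive")
    return baseline_s / new_s


# Placeholder completion times in seconds (NOT results from the RA).
times = {
    "hdd_baseline": 100.0,
    "sata_ssd": 50.0,
    "sata_ssd_plus_nvme_cache": 40.0,
}

# Each SSD configuration compared against the HDD baseline.
for config in ("sata_ssd", "sata_ssd_plus_nvme_cache"):
    pct = percent_faster(times["hdd_baseline"], times[config])
    print(f"{config}: {pct:.1f}% less time than HDD baseline")

# Additional gain from the YARN cache over the non-cached SSD configuration.
extra = percent_faster(times["sata_ssd"], times["sata_ssd_plus_nvme_cache"])
print(f"YARN cache adds a further {extra:.1f}% reduction")
```

Note that the two metrics are not interchangeable: halving the completion time is a 50 percent reduction but a 2x speedup, so it is worth stating which one a chart reports.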

1.3 Conclusion

If you have or are planning a big-data solution based on legacy HDD storage, I highly recommend that you consider an all-flash solution. High-performance SSDs such as the Micron 5210 ION are bringing lower cost to the discussion and should be considered for new deployments. Even when using the same type of interface, such as SATA, SSDs can dramatically reduce the time it takes to get critical insights from your data. By strategically adding higher-performance NVMe to the solution, you will realize even more performance. Micron’s reference architectures can show you the path to these high-performance big-data solutions. Dramatic big-data performance gains like those shown above are now within your grasp.

To learn more about our latest reference architecture focused on all-flash Hadoop, visit our Micron Accelerated Hadoop solution page. Visit our Micron 5210 ION SSD and Micron 9300 NVMe SSD product pages to find out more about what high-performance flash can do for your solutions.


Tony Ansley

Tony is a technology leader with 34 years of experience in server architectures and storage technologies and their application in meeting customers’ business and technology requirements. He enjoys fast cars, travel, and spending time with family, not necessarily in that order.
