Virtualized Hadoop Really Rocks

By Doug Rollins - 2016-03-07

“Big Data” – I see this mentioned more often than I can count.  Industry publications, online seminars, conferences, case studies, training, comparative analysis (“whose Big Data platform is best?”) – the list goes on.  And with good reason.  As more businesses have come to realize that there is (or at least can be) real value in Big Data, much of the focus has changed.

We’ve seen a transition – Big Data discussions have moved from “…gee, that’s interesting – I wonder what it is?” and “…we need a Big Data project, any project, but we need one!” to a far tighter focus.  As Big Data value has been shown to be real, the questions being asked are very, very different.

Most of the discussions now center on how to get the most from a Big Data project, not on whether to start one.

That got a few of us thinking – would flash help with Big Data?  How?  What different approaches might flash enable?

Winding Back The Clock...

Looking at past (but relatively recent) Big Data deployments, we realized that many of them had two common elements: lots of physical nodes and rotating storage.  Unfortunately, this approach requires ‘bigger’ (more nodes to grow) but does not deliver ‘better’ (processing is still batch-mode speed).  That may not be good enough.

What If…

Our Engineering Team got to working on this.  How might a different approach to Big Data help deliver results faster with smaller deployments?  What if instead of deploying on bare metal, we deployed on virtual machines?  And what if the VM hosts were equipped with Micron M500DC Enterprise SSDs?  If we did, how might the results be different from deploying on HDDs?

Being Fair...

To make a fair comparison, we used a shell script to fire off several built-in Hadoop benchmarks – Wordcount, Word aggregate histogram, Sort, Kmeans, and Terasort.  These ran in sequence, and we treated one complete sequence as a single benchmark run.  After recording the results, we deleted the data from that run, restored the systems to a fresh, out-of-the-box state, and launched the next run.  We repeated this 10 times with a brief pause between runs.  You can read about our complete testing here.
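The run loop described above can be sketched roughly as follows.  This is not the shell script we used; it is a minimal Python illustration of the same structure.  The `launch` and `reset` hooks are hypothetical placeholders for however a job is started and however the systems are restored, and the benchmark names are only labels taken from the list above.

```python
import time
from typing import Callable, List, Sequence

# Labels from the post; the actual launch commands are not shown there.
BENCHMARKS = ["Wordcount", "Word aggregate histogram", "Sort", "Kmeans", "Terasort"]


def time_one_run(benchmarks: Sequence[str], launch: Callable[[str], None]) -> float:
    """Run every benchmark in order and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for name in benchmarks:
        launch(name)  # hypothetical hook: start the job and block until it finishes
    return time.perf_counter() - start


def run_campaign(
    benchmarks: Sequence[str],
    launch: Callable[[str], None],
    reset: Callable[[], None],
    runs: int = 10,
    pause_s: float = 0.0,
) -> List[float]:
    """Repeat the full sequence `runs` times, resetting state after each run."""
    durations = []
    for i in range(runs):
        durations.append(time_one_run(benchmarks, launch))
        reset()  # hypothetical hook: delete run data, restore the clean state
        if i < runs - 1:
            time.sleep(pause_s)  # brief pause before the next run
    return durations


if __name__ == "__main__":
    # Stub hooks so the sketch runs without a cluster.
    results = run_campaign(BENCHMARKS, launch=lambda name: None, reset=lambda: None)
    print(f"{len(results)} runs recorded")
```

Timing the whole sequence rather than each benchmark matches the way one complete sequence counts as a single run; per-benchmark timings could be collected inside `time_one_run` if they were needed.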

The Results Are In

So…what happened?  Virtualized Hadoop with Micron M500DC SSDs was the clear winner.  How clear?  The M500DC SSDs (6 drives, RAID 10) completed the 10 test runs with a mean time of 530 seconds, while the 15K HDDs (also 6 drives, RAID 10) took just over 3372 seconds – approximately 6.4X as long as the SSD array.
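For anyone who wants to check the arithmetic, here is a small Python sketch using the two mean times quoted above.  The variable names are illustrative, and the HDD figure is treated as exactly 3372 seconds even though the measured value was slightly higher.

```python
ssd_mean_s = 530.0   # mean run time on the SSD array, from the post
hdd_mean_s = 3372.0  # "just over 3372" in the post; rounded down here

# How many times as long the HDD runs took.
ratio = hdd_mean_s / ssd_mean_s

# Fraction of run time eliminated by moving to the SSD array.
time_saved = 1.0 - ssd_mean_s / hdd_mean_s

print(f"HDD runs took {ratio:.1f}X as long as SSD runs")
print(f"SSD runs finished in {time_saved:.0%} less time")
```

The second figure is simply another way to read the same result: the share of run time that went away, rather than the multiple between the two arrays.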

So What?

Many of us tend to presume that certain storage devices fit best into certain workloads – but we do this without testing, without any real data to back up those assumptions.  We assume that because a workload is highly sequential, HDDs will excel and SSDs will offer no particular value.  We may completely ignore the possibilities that a different approach can bring.

If we stop doing that, we may just be (pleasantly) surprised!  Want to learn more?  Take a look at the Technical Marketing Brief, or if you have questions about the testing we ran, leave a comment below.  You can also connect with me on Twitter @GreyHairStorage and follow Micron Storage on Twitter @MicronStorage and on LinkedIn.

Doug Rollins

Doug Rollins is a principal technical marketing engineer for Micron's Storage Business Unit, with a focus on enterprise solid-state drives. He’s an inventor, author, public speaker and photographer. Follow Doug on Twitter: @GreyHairStorage.