This technical marketing brief shows how virtualizing Hadoop® (versus a more conventional bare metal Hadoop deployment) reduced the aggregate completion time for a set of standard benchmark tests by 26%. To help ensure that performance was not limited by storage, we equipped both the virtualized and bare metal deployments with Micron® M510DC SSDs.
Finding Relevant Answers in Big Data
Data is everywhere, and we generate more of it daily. The sources of this data and the decisions made as a result vary greatly — ranging from purchasing patterns that drive stock and inventory management, to traffic flow patterns that steer road maintenance priorities, to social media trends that influence advertising and product messaging. Big data and its analytics are all this and more.
The broader acceptance that big data holds value reflects both its maturity and the now-mainstream reliance on the guidance and answers it contains.
A continuing challenge we face with big data lies in our need to analyze this ever-growing data pool quickly enough so that the results are relevant. Having to wait for legacy deployments to get the necessary answers from a data set provides little value — we need better answers faster.
Virtualizing to Find Answers Faster
We found that one of the best ways to get answers faster from the Hadoop ecosystem is to increase the number of data nodes in the Hadoop cluster. This increases the number of simultaneous jobs running and decreases the time to complete any given task set (up to the point where cross-node operations such as shuffle/sort hit external bottlenecks like network throughput, which caps cluster performance).
As shown in Figure 1, we can increase the number of data nodes in a cluster in at least two different ways:
- Bare Metal (Physical) Hadoop: Run Hadoop directly on physical servers and add more servers to the cluster, with one Hadoop instance per data node.
- Virtualized Hadoop: Virtualize Hadoop using a hypervisor such as the Kernel-based Virtual Machine (KVM) to run multiple instances per server (in this case, four KVM guests per physical server, with one Hadoop instance per VM).
Figure 1: Physical and Virtual Hadoop Clusters
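As a rough illustration of the virtualized layout, the following sketch provisions four KVM guests on one physical server with `virt-install` (part of the libvirt tool set). The host name, memory, disk sizing, and bridge name are placeholder assumptions, not Micron's configuration; the `echo` makes this a dry run that only prints the commands it would execute.

```shell
#!/usr/bin/env sh
# Hypothetical sketch (not Micron's actual setup): create four KVM
# guests on one physical server, one Hadoop data node per VM.
# HOST, memory, disk size, and bridge are placeholder assumptions.
HOST=dn01
created=0
for i in 1 2 3 4; do
  # echo = dry run; remove it to actually create the guests
  echo virt-install \
    --name "${HOST}-vm${i}" \
    --memory 16384 --vcpus 4 \
    --disk "path=/var/lib/libvirt/images/${HOST}-vm${i}.qcow2,size=200" \
    --network bridge=br0 \
    --os-variant generic \
    --import --noautoconsole
  created=$((created + 1))
done
```

Each guest would then be joined to the Hadoop cluster as an independent data node.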
Creating Your Unfair Advantage
Fast Storage Is a Must
In another Micron technical marketing brief, “Micron M500DC SSD: Bringing Real Value to Virtualizing Hadoop Solutions in a Flash,” we showed that enterprise SSDs are an essential element of performance-focused Hadoop deployments, enabling built-in Hadoop benchmark tests to run significantly faster than on an array of enterprise HDDs. Based on this experience, we used our M510DC SSDs in both our bare metal and virtual deployments; using SSDs also helped ensure that measured differences were not due to slow HDD storage.
Virtualize for More Performance
Although we usually associate application virtualization with a performance penalty, that was not the case when we tested Hadoop. To optimize the number of containers per data node for our tested configuration, we ran four virtualized Hadoop instances per physical server, which allowed more processes to work on the map and reduce phases of most Hadoop jobs, improving performance.
When we ran Hadoop on the purely physical (identically configured, but non-virtualized) cluster, we saw instances where an entire data node was tied up completing a single container's map/reduce task, which crippled overall performance.
Verify Your Advantage
We tested two clusters: one physical and one virtualized. Both clusters used identical hardware (same servers, CPUs, memory and M510DC SSDs for storage). We used a shell script to run multiple built-in Hadoop example benchmark tests, deleting the created data after each test to ensure a consistent starting point. Each of these test sequences was considered a test run. We added a three-minute pause between each of 10 test runs to ensure successful Hadoop Distributed File System (HDFS) block deletion before the next test run began.
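The test harness described above can be sketched as a small shell loop. The jar path, the choice of example jobs (teragen/terasort), and the data sizes below are illustrative assumptions, not the actual Micron script; with `DRY_RUN=yes` the sketch only prints the commands it would run.

```shell
#!/usr/bin/env sh
# Hypothetical reconstruction of the benchmark harness: N test runs,
# HDFS cleanup after each run, and a pause before the next run.
EXAMPLES_JAR=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
RUNS=10
PAUSE=180                  # three-minute pause between test runs
DRY_RUN=yes                # set to "no" on a live cluster

run() {                    # print instead of execute while DRY_RUN=yes
  if [ "$DRY_RUN" = yes ]; then echo "$@"; else "$@"; fi
}

run_no=0
while [ "$run_no" -lt "$RUNS" ]; do
  run_no=$((run_no + 1))
  run hadoop jar "$EXAMPLES_JAR" teragen 1000000000 /bench/teragen
  run hadoop jar "$EXAMPLES_JAR" terasort /bench/teragen /bench/terasort
  run hdfs dfs -rm -r -skipTrash /bench    # consistent starting point
  [ "$DRY_RUN" = yes ] || sleep "$PAUSE"   # let HDFS finish block deletion
done
```

Running the identical script against both clusters keeps the comparison apples-to-apples: only the virtualization layer differs.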
To gauge the overall performance difference between the physical cluster and the virtualized cluster, we averaged the completion times of the 10 test runs for each cluster. The results are shown in Figure 2. As you can see, the virtualized Hadoop cluster completed the same set of benchmark tests 26% faster than the physical cluster.
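The aggregation step works like this minimal sketch: average each cluster's per-run completion times, then compute the relative improvement. The per-run times below are made-up placeholders (chosen so the arithmetic reproduces a 26% improvement), not the measured data.

```shell
#!/usr/bin/env sh
# Placeholder per-run completion times in seconds (NOT measured data).
physical="100 102 98 101 99 100 103 97 100 100"
virtual="74 75 73 74 74 75 73 74 74 74"

# Average a space-separated list of numbers with awk.
avg() { echo "$@" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.1f\n", s / NF }'; }

p=$(avg $physical)
v=$(avg $virtual)
echo "physical avg: ${p}s  virtual avg: ${v}s"
# Relative improvement of the virtualized cluster, in percent.
awk -v p="$p" -v v="$v" 'BEGIN { printf "improvement: %.0f%%\n", 100 * (p - v) / p }'
```

With these placeholder inputs the final line reports a 26% improvement, mirroring the shape of the measured result.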
Figure 2: Virtual Versus Physical Hadoop Performance – Aggregated Completion Time for 10 Test Runs (Less Time Is Better)
The Bottom Line
Big data is now a mainstream building block on which a growing number of decisions are based. These decisions often influence our daily lives (from stock and inventory management, to maintenance priorities, to advertising and messaging). The value of these influences can be very time-sensitive — the faster we can learn about them, the faster we can take action and the more opportunity we have to capitalize on them.
This technical marketing brief shows how virtualizing Hadoop can help drive better performance compared to a bare metal deployment. To isolate the effect of virtualization, we used the same hardware and the same fast Micron M510DC SSD storage in each cluster; the only difference was virtualization. Our testing showed that the virtualized deployment enabled more simultaneous processing, which helped reduce the overall average run time of a set of example benchmarks by 26% compared to the bare metal cluster’s completion time.
When your big data holds the insight you need, create your own unfair advantage by virtualizing Hadoop with fast storage like Micron’s M510DC SSD.