In my last IT Tech Tip blog, I discussed how raw performance is no way to measure value, in either the car industry or the IT industry, and I began to discuss the much-buzzed-about Hadoop and its use with solid state drives (SSDs). Keep in mind that I’m talking about Hadoop and its distributed file system (HDFS) as we know them today.
First, consider that the size of the “chunk” that HDFS reads is huge, and the reads tend to be sequential. So how do you get value from an SSD that is optimized for very small random accesses? Well, to be blunt, you won’t if you’re using HDFS today.
Some SSD Vendors Claim a Hadoop + SSDs Benefit
Some SSD vendors still seem to think Hadoop/HDFS is a great use for SSDs. One vendor, in particular, cites a benchmark that shows up to 27% improvement for specific Hadoop-related operations, 17% on other types, and no real differentiation on still others. Going by that 27% figure, we should be looking seriously at an all-SSD storage solution for our data nodes in today’s Hadoop/HDFS clusters. But should we? What’s the overall value that those SSDs are adding? And how should an SSD vendor respond to this claim?
Sure, 27% improvement in almost anything is worth noting, but after hearing this claim, I found myself scratching my bald head and thoughtfully stroking my graying beard as I asked a few questions (and I’ve yet to find the answers):
- How much more would an all-SSD data node solution cost (CAPEX)?
- Is the cost greater than 27% when compared to an all-HDD data node design? (Hint: it probably is.)
- Is a 27% reduction even meaningful in a platform where basic operations are not too time-critical?
- If an all-SSD data node solution saves me 27% in turnaround time, but costs me more than 27% in additional expense to buy, haven’t I effectively lost money?
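To make those questions concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the 27% figure means a 27% reduction in job time, and that value is measured as jobs completed per dollar of hardware. The function names and the 100% cost premium in the example are hypothetical placeholders, not real pricing or benchmark data.

```python
def break_even_cost_premium(time_reduction: float) -> float:
    """Extra cost (as a fraction) at which jobs-per-dollar is unchanged.

    time_reduction is the fractional drop in job time, e.g. 0.27 for 27%.
    Assumption: value = jobs completed per dollar of hardware spend.
    """
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    # A job that takes (1 - r) of the old time means 1 / (1 - r) times the throughput.
    return 1.0 / (1.0 - time_reduction) - 1.0


def jobs_per_dollar_ratio(time_reduction: float, cost_premium: float) -> float:
    """Jobs-per-dollar of the faster option relative to the baseline.

    A result below 1.0 means the faster option delivers less work per dollar.
    """
    speedup = 1.0 / (1.0 - time_reduction)
    return speedup / (1.0 + cost_premium)


if __name__ == "__main__":
    reduction = 0.27            # the vendor's best-case figure
    hypothetical_premium = 1.0  # placeholder: SSD node costs 100% more
    print(f"break-even premium: {break_even_cost_premium(reduction):.1%}")
    print(f"value ratio: {jobs_per_dollar_ratio(reduction, hypothetical_premium):.2f}")
```

Under this simple model, a 27% cut in job time pays for itself only if the all-SSD node costs no more than roughly 37% extra. Any premium well beyond that means fewer jobs per dollar, even in the vendor's best case.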
Don’t SSDs Make Everything Better?
…even my morning coffee and bagel? The answer is in the fundamentals of the current (with strong emphasis on ‘current’) Hadoop/HDFS design. It seems that HDFS as we know and love it today is deliberately designed to work around the inherent limitations (features?) of “rotating rust” (AKA, hard drives). It does this by accessing data in large sequential chunks. But SSDs excel at small random accesses. In fact, with small, random accesses, SSDs eclipse HDD performance. SSDs are as fast as ramen noodles, instant cocoa, and microwave popcorn—all rolled into one! When it comes to large, sequential accesses, SSDs usually match HDD performance, and do so at a much higher acquisition cost per gigabyte (a fair unit of comparison because performance is very similar).
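A rough way to see why large sequential chunks hide the hard drive’s weakness is a toy model in which every request pays a fixed access latency and then streams data at a fixed bandwidth. The sketch below is illustrative only: the latency and bandwidth numbers are assumed placeholders (not measurements), the two devices are deliberately given equal bandwidth to mirror the claim above, and the 64 MiB chunk size is just an example of a “large” request.

```python
import math


def read_time_s(total_bytes: int, request_bytes: int,
                latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Toy model: each request pays latency_s once, then data streams at full bandwidth."""
    requests = math.ceil(total_bytes / request_bytes)
    return requests * latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed, illustrative device parameters (not benchmark data).
HDD = {"latency_s": 10e-3, "bandwidth_bytes_per_s": 150e6}
SSD = {"latency_s": 0.1e-3, "bandwidth_bytes_per_s": 150e6}

TOTAL = 1 << 30          # read 1 GiB in total
LARGE_CHUNK = 64 << 20   # 64 MiB requests, standing in for HDFS-style chunks
SMALL_CHUNK = 4 << 10    # 4 KiB requests, standing in for small random I/O


def hdd_to_ssd_ratio(request_bytes: int) -> float:
    """How many times longer the modeled HDD takes than the modeled SSD."""
    return (read_time_s(TOTAL, request_bytes, **HDD)
            / read_time_s(TOTAL, request_bytes, **SSD))


if __name__ == "__main__":
    print(f"large chunks: HDD takes {hdd_to_ssd_ratio(LARGE_CHUNK):.2f}x as long")
    print(f"small chunks: HDD takes {hdd_to_ssd_ratio(SMALL_CHUNK):.0f}x as long")
```

With these assumed numbers, the two devices are nearly indistinguishable on large chunks, because the per-request latency is paid only a handful of times. On small requests the latency term dominates and the SSD pulls far ahead.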
The Hadoop + SSDs Opportunity
Designers have a real opportunity to re-architect applications to take advantage of SSDs and exploit their incredibly fast random access, small-transfer performance, and strikingly low latency. But as of now, SSD-optimized products occupy an incredibly small minority of the application space, so SSD-optimized performance remains an opportunity, not today’s reality (not yet, anyway).
Changing the application base and its fundamental operations will be no trivial task. It may require anything from rewriting or retuning specific software modules to completely rethinking an application’s fundamentals. However, I’m confident that we’ll get there someday.
Unfortunately, that day is not today. For now, let’s focus our SSD spend on applications that can really take advantage of SSDs and offer measurable business value today. Hadoop/HDFS combined with SSDs can work, but the current value is questionable.
Have your own opinion on Hadoop and/or SSDs? Please leave me a comment below, and let’s discuss.