Earlier this month I delivered the keynote address at the LexisNexis HPCC Engineering Summit, entitled “Custom Is the New Commodity.” It’s exciting to be asked to attend the internal conference of a world-class company like LexisNexis. As one of the first data aggregators on the planet, they’ve developed truly novel approaches to handling large-scale analytics, and I certainly learned a lot being a fly on the wall as they planned the future.
When confronted with big data—before big data was fashionable—the HPCC group developed the ECL programming language and the associated infrastructure to support it using what we in the HPC community would think of as commodity clusters. That critical technology has grown over the last 10+ years to support a rich set of data fusion and analytics functions across numerous market spaces. Interestingly, given the programming languages background of the team who developed it, ECL is a declarative language that supports automatic runtime optimization of the algorithms being expressed based on the data being analyzed.
The HPC community has long worried about programmer productivity for massively parallel processors (MPPs), and it’s interesting to see the evolution of a high-performance path in analytics that follows almost exactly the opposite philosophy. From the first supercomputer on, simulation of 3D physics evolved down a path of running nearly on bare metal. This has led the community to relentlessly give up functionality in exchange for performance, which has resulted in two problems: a dependence on programming paradigms that must map to the underlying hardware architecture, and persistent concern about whether scientists have access to the HPC resources they need.
I remember a very close friend of mine who was studying the dynamics of glucose in the blood and needed to run a series of simulations. Because he was a scientist and not a computer scientist, he took a month off while the high-powered workstation in his office chugged through the calculations. I remember walking him through the parallel algorithm that would speed up his research and being told, “By the time I write that, my calculations will be complete.” In my mind, enabling the productive use of parallel computers remains one of the field’s major challenges.
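The frustrating part is how little code a parallel version of such a sweep often requires. Here is a hypothetical Python sketch — `run_simulation` and its parameters are invented stand-ins, not his actual model — showing a serial parameter sweep next to a pool-based one that produces identical results:

```python
# A hypothetical sketch of a trivially parallel parameter sweep:
# independent simulation runs fanned out across workers.
from concurrent.futures import ThreadPoolExecutor
import math

def run_simulation(glucose_level: float) -> float:
    # Stand-in for one long, independent simulation run.
    return sum(math.sin(glucose_level * i) for i in range(10_000))

def sweep(levels):
    # Serial sweep: the version that chugs along for a month.
    return [run_simulation(g) for g in levels]

def sweep_parallel(levels):
    # Parallel sweep: same answers, runs distributed to a pool of workers.
    # Threads keep the sketch portable; real CPU-bound work would use
    # processes (ProcessPoolExecutor) or an MPI job on a cluster.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_simulation, levels))
```

Because each run is independent, the parallel sweep needs no locking or coordination — exactly the kind of mechanical transformation a scientist shouldn’t have to hand-write.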
The HPCC systems group took a completely different path, handing much of the work, including the placement and optimization of data, over to the computer. Given that they were working with datasets larger than most individual humans could curate, it makes sense that they’d embrace automation as a core part of their approach. This stands in contrast to imperative approaches to analytics, like MapReduce, which rely on the programmer for manual partitioning of data and selection of algorithms.
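The distinction is easiest to see side by side. The following Python sketch — not HPCC’s actual API, just an illustration — computes the same grouped sum two ways: imperatively, where the programmer spells out the map, shuffle, and reduce steps, and declaratively, where plain SQL (via `sqlite3`, standing in for an engine like ECL) states *what* is wanted and leaves *how* to the engine:

```python
# Illustrative contrast: imperative MapReduce-style code vs. a declarative
# query. Neither is HPCC's API; sqlite3 stands in for a declarative engine.
from collections import defaultdict
import sqlite3

records = [("nyc", 7), ("sf", 3), ("nyc", 5), ("sf", 2)]

def mapreduce_sum(records):
    # Imperative: the programmer writes the map, the shuffle (grouping),
    # and the reduce by hand, and owns any partitioning decisions.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

def declarative_sum(records):
    # Declarative: state the result; the engine plans the execution and
    # is free to reorder, partition, or optimize based on the data.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", records)
    return dict(con.execute("SELECT k, SUM(v) FROM t GROUP BY k"))
```

Both return `{"nyc": 12, "sf": 5}`; the difference is who decides the execution strategy — the programmer or the runtime.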
My presentation advanced the proposition that today’s commodity machines are architecturally unbalanced for large-scale analytics. Combined with the fundamental technology changes the semiconductor industry is seeing, that imbalance only intensifies. I believe the solution lies in computer architecture—different platforms for different markets, enabled by the kinds of design methodologies applied to commodity mobile platforms. I was happy that this somewhat radical proposition was well received by application experts, many of whom brought real-world examples to me after my talk. It’s increasingly clear to me that the people who are performing analytics are losing opportunities to exploit their data because of unbalanced platforms. The solution, as always, lies in the memory.
I think we have a lot to learn from the insight that in the case of data-driven problems, the computer can know more about the right optimizations to apply than the programmer. I look forward to seeing the continued evolution of the HPCC platform that’s being planned now in Florida.
Interested in more on this topic? See Flavio Villanustre’s interview here. Have a question or a clarification for me? Fire away!