Rethinking Analytics Infrastructure
Last year I published a reasonably well-received research document on Hadoop infrastructure, “Building the Foundations for Customer Insight: Hadoop Infrastructure Architecture”. Now, less than a year later, it’s looking obsolete. That is not so much because it was wrong for traditional Hadoop (and yes, it does seem funny to use a word like “traditional” for a technology that is still evolving rapidly and has been in mainstream use for only a handful of years), but because the universe of analytics technology and tools has been evolving at light-speed.
If your analytics are anchored by Hadoop and its underlying MapReduce processing, then the mainstream architecture described in the document, clusters of servers each with their own compute and storage, may still be appropriate. On the other hand, if, like many enterprises, you are adding analysis tools such as NoSQL databases, SQL on Hadoop (Impala, Stinger, Vertica) and particularly Spark, an in-memory analytics technology that is well suited for real-time and streaming data, it may be time to begin reassessing the supporting infrastructure so that it can continue to support Hadoop while also catering to the differing access patterns of the other tool sets. This need to rethink the underlying analytics plumbing was brought home by a recent HP demonstration of a reference architecture for analytics, publicly referred to as the HP Big Data Reference Architecture.
This architecture is based on a cluster of HP SL4500 series servers running only the Hadoop file system, with HP Moonshot servers providing the compute. The demonstration configuration consisted of three SL4540 storage servers and two HP Moonshot systems with a total of 90 low-power Xeon processing nodes. Contrary to initial expectations, the file system servers and the accompanying Gb Ethernet plumbing not only transferred data efficiently between the processing nodes and the storage tier, but also processed mixed analytics workloads across the cluster efficiently, using all resources with no signs of bottlenecks or stranded resources. All in all, it was an impressive demonstration, and HP is now delivering this as a reference architecture to its customers.
While HP’s implementation will be based on its distinctive combination of products, especially the SL4540 storage servers and Moonshot, which until now has not been widely deployed as an analytics platform, I fully expect that the lessons from this and other experiments in analytics architecture will rapidly diffuse into the marketplace. During 2015 I expect to see multiple vendors bring out product offerings, each with its own spin on infrastructure for analytics.
Stay tuned: change is in the wind for analytics infrastructure. This is good news and bad news for practitioners: the options for supporting mixed workloads are improving, but an early investment in a “safe” architecture can leave you with a messier problem when it comes to supporting newer techniques.