First, consider that data informs nearly every decision an organization makes today. Customers across virtually every industry expect to interact with businesses wherever they go, in real time, across a myriad of devices and applications. The result is a flood of information that must be culled, sorted and organized to find actionable data to drive businesses forward.
This evolution mirrors much of what’s taking place in the Apache Hadoop ecosystem as it continues to mature and find its place with a broader business audience.
The Origins & Evolution of Hadoop
Let’s start with Hadoop’s origins. Hadoop began as a framework for big batch processing, which is exactly what early adopters like Yahoo! needed – a system that could crawl all of the content on the Internet to help build large search engines, then take the outputs and monetize them with targeted advertising. That type of use case is predicated entirely on batch processing at a very large scale.
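The batch model that made those early use cases possible is MapReduce: map over the raw data, shuffle the results by key, then reduce each group. Here is a minimal pure-Python sketch of that three-phase flow using word counting, the classic MapReduce example – this is an illustration of the programming model only, not Hadoop itself.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In real Hadoop the map and reduce phases run in parallel across a cluster and the shuffle moves data over the network, but the shape of the computation is the same.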
The next phase centered on how Hadoop would reach a broader customer base. The challenge was to make Hadoop easier for a wider audience to use. It’s possible to do very rich processing with Hadoop, but jobs also have to be programmed very specifically, which makes the platform difficult for enterprise users to apply to business intelligence or reporting. This drove the trend toward SQL on Hadoop, which was the big story about two years ago, with companies like Cloudera, IBM, Pivotal and others entering the space.
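The appeal of SQL on Hadoop is that analysts write the same declarative queries they already know from data warehouses, rather than hand-coding MapReduce jobs. As a sketch of the query style, here Python’s built-in sqlite3 stands in for a SQL-on-Hadoop engine such as Hive or Impala (the table and data are invented for illustration):

```python
import sqlite3

# sqlite3 stands in for a SQL-on-Hadoop engine; the point is that
# analysts write ordinary SQL, not low-level processing code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("pricing", 45), ("home", 80), ("docs", 60)],
)

# A typical BI-style aggregation a warehouse user would expect to run.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views "
    "GROUP BY page ORDER BY SUM(views) DESC"
).fetchall()
print(rows)  # [('home', 200), ('docs', 60), ('pricing', 45)]
```

On a real SQL-on-Hadoop engine, a query like this is compiled down to distributed jobs over files in HDFS, but the user-facing experience is just SQL.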
The third major phase in Hadoop’s evolution, which emerged late last year, was around making it enterprise-grade – ensuring data security and governance as well as data management. This gave enterprises the confidence that a still-emerging framework like Hadoop offered at least as much security as existing enterprise analytics tools like data warehouses.
Indeed, the workloads best suited for Hadoop are changing. Hadoop is at its best in a Hadoop-centric environment doing bulk processing of data. But another use case that is rapidly becoming a higher priority is processing massive amounts of data in close to real time. This will require near-real-time stream processing with real-time decision making.
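The core difference from batch is that a stream processor maintains a running aggregate as events arrive, instead of waiting for a complete dataset. A minimal pure-Python sketch of one such aggregation – a sliding-window average over simulated sensor readings – illustrates the idea (the class and data here are invented for illustration, not part of any Hadoop API):

```python
from collections import deque

class SlidingWindowAverage:
    """Rolling average over the last `window` events – the kind of
    per-window aggregation a stream processor computes continuously."""

    def __init__(self, window):
        self.events = deque(maxlen=window)

    def add(self, value):
        # Each arriving event updates the window and yields a fresh result
        # immediately, rather than after a whole batch completes.
        self.events.append(value)
        return sum(self.events) / len(self.events)

# Simulated readings arriving one at a time as a stream.
stream = [10, 20, 30, 40]
agg = SlidingWindowAverage(window=3)
averages = [agg.add(v) for v in stream]
print(averages)  # [10.0, 15.0, 20.0, 30.0]
```

In production this role is played by stream-processing engines in the Hadoop ecosystem, which run the same kind of windowed logic in parallel over partitioned event streams.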
Why IoT + Hadoop is the Next Frontier
Much of this evolution, in my view, is being driven by movements such as the Internet of Things and the architecture necessary to support it. IoT is a huge development from an architectural framework perspective – there’s tremendous fundamental change that has to happen to support IoT from not only the technology side, but also the business factors for its adoption, utilization and deployment.
To get Hadoop to the next stage of its evolution and reduce many of the barriers to its widespread adoption, it’s also important to close the skills gap with users. There are two ways to approach this, in my opinion. The first is to make Hadoop much more like existing systems, minimizing the need for training (and retraining) for the users who will be adopting the technology.
The second is to leverage tools and expertise from the open source development community, using intelligent modules and solution recipes to automate functions within Hadoop. This falls into the area of machine learning: making the software more intelligent, almost self-training.
Overall, Hadoop’s evolution follows the same progression as software development as a whole: originally you had to write your own low-level algorithms, then those were automated, then object-oriented programming modules came into play.
It will be interesting to watch Hadoop continue to refine its framework. Certainly, as data plays a more strategic role in today’s businesses, the tools to manage it must also grow to accommodate a wider range of users in both technical and business fields.