Wednesday, 28 November 2012

Storm, S4, or HFlame (real-time streaming analysis built into Hadoop) – which is the right choice?

Hadoop's strengths and weaknesses are quite well understood. Its main strengths are offline data analysis at massive scale, high availability, and fault tolerance. However, businesses that need quick, real-time insights into Big Data cannot leverage Hadoop for that analysis and tend to adopt alternative technologies to meet the requirement.

That means Hadoop can neither calculate trending topics on Twitter nor help a fund manager track stock trends in real time. The end result is a hybrid data processing environment: Hadoop powers offline data analysis, while specialized stream processing frameworks deliver real-time insights.

The lack of real-time streaming analysis in Hadoop led to the development of frameworks like Storm and S4. These frameworks are good at what they do and were built on the same principles as Hadoop: massively parallel processing, high availability, and fault tolerance. There are subtle differences in how these frameworks operate, but the underlying principles are the same.

However, if an enterprise is using or planning to use Hadoop for offline data analytics, wouldn't it be better if the same infrastructure could power real-time streaming analysis as well, and better still if it could do so by leveraging core Hadoop pieces like HDFS and the Map-Reduce processing style?

HFlame enhances the Hadoop core with real-time streaming analysis capability. In traditional Hadoop, a Map-Reduce job processes only the current snapshot of available data and ends as soon as that snapshot has been processed; processing any new content requires scheduling another Map-Reduce job. With HFlame-enhanced Hadoop, Map-Reduce jobs can optionally be configured to run in continuous mode, which essentially means the job doesn't end even when no new content is available. As soon as new data is pushed into HDFS, the continuously running Map-Reduce jobs are notified and immediately pass the new content through the Map-Reduce pipeline to extract insights.
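Since this is driven by configuration rather than a new API, a continuous job driver might look like an ordinary Hadoop word count with one extra switch. The sketch below is illustrative only: the `hflame.job.continuous` key is a hypothetical placeholder, as HFlame's actual configuration names are not documented in this post.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContinuousWordCount {

    // Stock Hadoop mapper: split each line into words, emit (word, 1).
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Stock Hadoop reducer: sum the counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical switch: keep the job alive after the current snapshot
        // is processed and feed it each new file that lands in the input path.
        conf.setBoolean("hflame.job.continuous", true);

        Job job = new Job(conf, "continuous word count");
        job.setJarByClass(ContinuousWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // In stock Hadoop this returns once the snapshot is processed;
        // in continuous mode the job would keep running instead.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the mapper and reducer are plain Hadoop classes; only the (hypothetical) configuration switch distinguishes a one-shot job from a continuous one. In addition, HFlame supports the following behavior: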

  1. HFlame runs on top of the customer's Hadoop installation; it is an incremental add-on to existing Hadoop clusters.
  2. No new API. Completely driven by configuration.
  3. HFlame's real-time Map-Reduce jobs are completely fault tolerant. In the event of any failure, failed components are automatically rescheduled on other available Hadoop nodes.
  4. HFlame guarantees no data loss. If any component of a Map-Reduce job or the Hadoop infrastructure fails mid-stream, an automatic recovery procedure takes care of restarting data processing from exactly where it left off.
  5. Allows building a complex mesh of real-time Map-Reduce jobs to support data analysis requirements that cannot be expressed in a single Map-Reduce process.
  6. Supports data analysis frameworks like Pig and Hive.
  7. Real-time Map-Reduce jobs can optionally run in batch mode, i.e. reduce tasks accumulate data for a certain amount of time and then produce aggregated results (see the configuration sketch after this list).
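As a companion to item 7, here is a sketch of how such micro-batch windows might be wired up. All of the `hflame.*` keys are hypothetical placeholders; the post does not document the real configuration names.

```java
import org.apache.hadoop.conf.Configuration;

public class BatchModeConfig {
    // Builds a configuration for a continuous job whose reduce tasks
    // accumulate input for a time window and then emit aggregated results.
    // Every "hflame.*" key below is a hypothetical placeholder.
    public static Configuration continuousWithBatching(long windowMs) {
        Configuration conf = new Configuration();
        conf.setBoolean("hflame.job.continuous", true);            // hypothetical
        conf.setBoolean("hflame.reduce.batch.mode", true);         // hypothetical
        conf.setLong("hflame.reduce.batch.interval.ms", windowMs); // hypothetical
        return conf;
    }
}
```

With one-minute windows (`continuousWithBatching(60000L)`), reducers would emit one aggregated result per key per minute rather than per incoming record.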
[Figure: flow of a real-time Map-Reduce job]
HFlame's compelling argument is a common framework for both offline and real-time massively parallel data analysis, which essentially means no new storage, no new data processing semantics, and continued use of existing high-level abstraction languages like Pig and Hive. For Hadoop users, real-time streaming analysis with HFlame requires zero investment in new infrastructure and no new APIs or tools to learn.

While Storm and S4 are good alternatives for real-time streaming analysis, Hadoop users should find HFlame the more compelling choice.

Check out www.hflame.com or www.dataadvent.com for more details.
Twitter - @hadoopflame