Hadoop's weaknesses and strengths are quite well understood. Its main strengths are offline data analysis at massive scale, high availability, and fault tolerance. However, businesses that need quick, real-time insights into Big Data cannot leverage Hadoop for that analysis and tend to use alternate technologies to meet the requirement. That means Hadoop can neither be used to compute trending topics on Twitter nor be used by a fund manager to follow stock trends in real time. The end result is hybrid data processing environments that meet different business needs: Hadoop powers offline data analysis, while specialized stream processing frameworks deliver real-time insights and stream processing.
The lack of real-time streaming analysis in Hadoop resulted in the development of frameworks like Storm and S4. These frameworks are good at what they do and were developed on the same principles as Hadoop: massively parallel processing, high availability, and fault tolerance. There are subtle differences in how these frameworks function, but the underlying principle is the same.
However, if an enterprise is using, or looking to use, Hadoop for offline data analytics, wouldn't it be better if the same infrastructure could power real-time streaming analysis as well? And better still if it could do so by leveraging Hadoop's core pieces, HDFS and the Map-Reduce processing style?
HFlame enhances the Hadoop core with real-time streaming analysis capability. In traditional Hadoop, a Map-Reduce job processes only the current snapshot of available data and ends right after it finishes processing that snapshot; processing any new content requires scheduling another Map-Reduce job. With HFlame-enhanced Hadoop, Map-Reduce jobs can optionally be configured to run in continuous mode, which essentially means the Map-Reduce job doesn't end even when no new content is available. As soon as new data is pushed into HDFS, the continuously running Map-Reduce jobs are notified; they immediately pass the new content through the Map-Reduce process and extract insights. Alongside this, HFlame supports the following behavior (a configuration sketch follows the list):
- HFlame runs on top of the customer's Hadoop installation; it is an incremental add-on to existing Hadoop clusters.
- No new API; it is completely driven by configuration.
- HFlame's real-time Map-Reduce jobs are completely fault tolerant. In the event of any failure, failed components are automatically rescheduled on other available Hadoop nodes.
- HFlame guarantees no data loss. If any component of a Map-Reduce job or the Hadoop infrastructure fails mid-run, the automatic job/component recovery procedure takes care of restarting data processing from exactly the place where it failed.
- It allows building a complex mesh of real-time Map-Reduce jobs to support data analysis requirements that cannot be described in a single Map-Reduce process.
- It supports data analysis frameworks like Pig and Hive.
- Real-time Map-Reduce jobs can optionally be run in batch mode, i.e. Reduce tasks accumulate data for a certain amount of time and then produce the aggregated results.
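To make the "no new API, configuration only" claim concrete, here is a minimal sketch of what a continuous job might look like: an entirely standard WordCount driver whose only HFlame-specific parts are two configuration properties. The property names (hflame.job.continuous, hflame.reduce.batch.interval.ms) and the HDFS paths (/streams/input, /streams/counts) are hypothetical placeholders for illustration, not documented HFlame keys.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContinuousWordCount {

    // Standard Hadoop mapper -- nothing HFlame-specific here.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Standard Hadoop reducer -- nothing HFlame-specific here.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical HFlame properties: the key names below are
        // illustrative guesses, not documented HFlame configuration.
        conf.setBoolean("hflame.job.continuous", true);          // keep the job alive after the snapshot
        conf.setLong("hflame.reduce.batch.interval.ms", 60000L); // optional micro-batch window for reduces

        Job job = Job.getInstance(conf, "continuous word count");
        job.setJarByClass(ContinuousWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/streams/input"));   // hypothetical watched directory
        FileOutputFormat.setOutputPath(job, new Path("/streams/counts")); // hypothetical output directory
        // In stock Hadoop this returns once the snapshot is processed;
        // under HFlame's continuous mode the job would stay resident.
        job.waitForCompletion(true);
    }
}
```

If this sketch is accurate, the same driver runs unchanged on stock Hadoop, where waitForCompletion returns after the current snapshot, and under HFlame, where the job stays resident and keeps consuming new input.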
The following picture explains the flow of a real-time Map-Reduce job:
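In that flow, the step that pushes new data into HDFS is plain HDFS I/O; there is nothing HFlame-specific about it. A minimal sketch using the standard Hadoop FileSystem API, again with the hypothetical /streams/input directory from the driver above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PushNewData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Drop a new file into the input directory that the continuously
        // running job watches; /streams/input is a hypothetical path.
        Path newFile = new Path("/streams/input/events-" + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(newFile)) {
            out.writeBytes("new record 1\nnew record 2\n");
        }
        // Under HFlame, the arrival of this file would notify the
        // continuous Map-Reduce job, which processes it immediately.
    }
}
```

Any producer that can write into HDFS could play this role.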
HFlame's compelling argument is a common data analysis framework for both offline and real-time massively parallel data analysis, which essentially means no new storage, no new data processing semantics, and continued leverage of existing high-level abstraction languages like Pig and Hive. For Hadoop users, real-time streaming analysis with HFlame requires zero investment in new infrastructure and no new APIs or tools to learn.
While Storm and S4 are good alternatives for real-time streaming analysis, Hadoop users should find HFlame the more compelling choice.
Check out www.hflame.com
or www.dataadvent.com for more details.
Twitter - @hadoopflame