In my early research, batch-oriented MapReduce[1] was one of interesting technology. As all of you know well now, MapReduce programming is very simple and powerful, especially, useful for the aggregation and several basic relational algebraic operations on large data-sets.
However, MapReduce is NOT good for everything. For example, graph algorithms[2], machine learning, and matrix arithmetic. SQL-like Pig, Hive, and MR-based Mahout shows well the scope and limit of MapReduce. Iterative MapReduce also has some problems such as heavy cost for task assignment and I/O overhead. A lack of ability to perform as a real-time was also issue.
Today, many MapReduce alternatives are now available to solve efficiently such problems:
- Apache Hama[3] - BSP (Bulk Synchronous Parallel) computing engine on top of Hadoop
- Apache Giraph - BSP (Bulk Synchronous Parallel) based graph computing framework
- Apache S4 and Twitter Storm - Scalable real-time processing system
1. MapReduce: Simplified Data Processing on Large Clusters
2. Pregel: a system for large-scale graph processing
3. Apache Hama: Bulk Synchronous Parallel Computing Framework
4. Interview with Arun Murthy on Apache YARN
No comments:
Post a Comment