August 26, 2012

MapReduce and Beyond

Hi, in this post I'm going to tell you about past and near future of big data processing. In 2006, I worked as a Senior Software Engineer for web portal company, NHN, corporation. Since then, I had experienced a data explosion (the average pageview per day was one billion), and began to research distributed computing technologies.

In my early research, batch-oriented MapReduce[1] was one of interesting technology. As all of you know well now, MapReduce programming is very simple and powerful, especially, useful for the aggregation and several basic relational algebraic operations on large data-sets.

However, MapReduce is NOT good for everything. For example, graph algorithms[2], machine learning, and matrix arithmetic. SQL-like Pig, Hive, and MR-based Mahout shows well the scope and limit of MapReduce. Iterative MapReduce also has some problems such as heavy cost for task assignment and I/O overhead. A lack of ability to perform as a real-time was also issue.

Today, many MapReduce alternatives are now available to solve efficiently such problems:
  • Apache Hama[3] - BSP (Bulk Synchronous Parallel) computing engine on top of Hadoop
  • Apache Giraph - BSP (Bulk Synchronous Parallel) based graph computing framework
  • Apache S4 and Twitter Storm - Scalable real-time processing system
Wow! too many to learn, but please don't worry. Hadoop 2.0, YARN[4] will manages these new alternatives at once. 

1. MapReduce: Simplified Data Processing on Large Clusters
2. Pregel: a system for large-scale graph processing
3. Apache Hama: Bulk Synchronous Parallel Computing Framework
4. Interview with Arun Murthy on Apache YARN

No comments:

Post a Comment