August 26, 2012

Semi-clustering with Apache Hama

How do you analyze the social networks? Sentiment analysis or Text mining? I think the valuable insight can be found by analyzing the evolution of network structures or, information flows. Network science is needed.

In this post, I'm introducing the powerful graph computing package of Apache Hama and semi-clustering algorithm described Google's Pregel paper[1]. A semi-cluster in a social graph is a group of people who interact frequently with each other and less frequently with others. It is different from ordinary clustering in the sense that a vertex may belong to more than one semi-cluster.

The algorithm is a greedy algorithm. Since Apache Hama provides complete clone of Pregel, you can write a semi-clustering program and run it on large graphs in a few minutes like this:
    @Override
    public void compute(Iterator<SCMessage> messages) throws IOException {
      if (this.getSuperstepCount() == 0) {
        // In superstep 0, V enters itself in that list as a semi-cluster of
        // size 1 and score 1, and publishes itself to all of its neighbors.
      }

      // In subsequent supersteps, V circulates over the received semi-clusters
      SCMessage clusters = null;
      while ((clusters = messages.next()) != null) {
        for (SemiCluster c : clusters.getClusters()) {
          // If a semi-cluster c does not already contain V, V is added to c to
          // form c’.
        }
      }

      // The semi-clusters are sorted by their scores and the
      // best ones are sent to V’s neighbors.

      // Vertex V updates its list of semi-clusters with the
      // semi-clusters that contain V.

      boolean updated = updateLocalClusters(clusters);
      if (!updated) {
        // The algorithm terminates either when the semi-
        // clusters stop changing or the user may provide a
        // limit.

        // At that point, the list of best semi-cluster
        // candidates for each vertex may be aggregated
        // into a global list of best semi-clusters.
        
        voteToHalt();

      } else {
        // The semi-clusters are sorted by their scores and the best ones are
        // sent to V’s neighbors.
      }
    }

What do you think - easy enough to use? :-)

P.S., the implementation issue is still left open[2]. If you want to contribute your code or just want to share something with us, Please feel free to contact Hama team.

1. Pregel: a system for large-scale graph processing
2. Implementation of a Semi-Clustering algorithm, described in Pregel paper.

MapReduce and Beyond

Hi, in this post I'm going to tell you about past and near future of big data processing. In 2006, I worked as a Senior Software Engineer for web portal company, NHN, corporation. Since then, I had experienced a data explosion (the average pageview per day was one billion), and began to research distributed computing technologies.

In my early research, batch-oriented MapReduce[1] was one of interesting technology. As all of you know well now, MapReduce programming is very simple and powerful, especially, useful for the aggregation and several basic relational algebraic operations on large data-sets.

However, MapReduce is NOT good for everything. For example, graph algorithms[2], machine learning, and matrix arithmetic. SQL-like Pig, Hive, and MR-based Mahout shows well the scope and limit of MapReduce. Iterative MapReduce also has some problems such as heavy cost for task assignment and I/O overhead. A lack of ability to perform as a real-time was also issue.

Today, many MapReduce alternatives are now available to solve efficiently such problems:
  • Apache Hama[3] - BSP (Bulk Synchronous Parallel) computing engine on top of Hadoop
  • Apache Giraph - BSP (Bulk Synchronous Parallel) based graph computing framework
  • Apache S4 and Twitter Storm - Scalable real-time processing system
Wow! too many to learn, but please don't worry. Hadoop 2.0, YARN[4] will manages these new alternatives at once. 

1. MapReduce: Simplified Data Processing on Large Clusters
2. Pregel: a system for large-scale graph processing
3. Apache Hama: Bulk Synchronous Parallel Computing Framework
4. Interview with Arun Murthy on Apache YARN