August 26, 2012

Semi-clustering with Apache Hama

How do you analyze the social networks? Sentiment analysis or Text mining? I think the valuable insight can be found by analyzing the evolution of network structures or, information flows. Network science is needed.

In this post, I'm introducing the powerful graph computing package of Apache Hama and semi-clustering algorithm described Google's Pregel paper[1]. A semi-cluster in a social graph is a group of people who interact frequently with each other and less frequently with others. It is different from ordinary clustering in the sense that a vertex may belong to more than one semi-cluster.

The algorithm is a greedy algorithm. Since Apache Hama provides complete clone of Pregel, you can write a semi-clustering program and run it on large graphs in a few minutes like this:
    @Override
    public void compute(Iterator<SCMessage> messages) throws IOException {
      if (this.getSuperstepCount() == 0) {
        // In superstep 0, V enters itself in that list as a semi-cluster of
        // size 1 and score 1, and publishes itself to all of its neighbors.
      }

      // In subsequent supersteps, V circulates over the received semi-clusters
      SCMessage clusters = null;
      while ((clusters = messages.next()) != null) {
        for (SemiCluster c : clusters.getClusters()) {
          // If a semi-cluster c does not already contain V, V is added to c to
          // form c’.
        }
      }

      // The semi-clusters are sorted by their scores and the
      // best ones are sent to V’s neighbors.

      // Vertex V updates its list of semi-clusters with the
      // semi-clusters that contain V.

      boolean updated = updateLocalClusters(clusters);
      if (!updated) {
        // The algorithm terminates either when the semi-
        // clusters stop changing or the user may provide a
        // limit.

        // At that point, the list of best semi-cluster
        // candidates for each vertex may be aggregated
        // into a global list of best semi-clusters.
        
        voteToHalt();

      } else {
        // The semi-clusters are sorted by their scores and the best ones are
        // sent to V’s neighbors.
      }
    }

What do you think - easy enough to use? :-)

P.S., the implementation issue is still left open[2]. If you want to contribute your code or just want to share something with us, Please feel free to contact Hama team.

1. Pregel: a system for large-scale graph processing
2. Implementation of a Semi-Clustering algorithm, described in Pregel paper.

No comments:

Post a Comment