Showing posts from August, 2012

Semi-clustering with Apache Hama

How do you analyze the social networks? Sentiment analysis or Text mining? I think the valuable insight can be found by analyzing the evolution of network structures or, information flows. Network science is needed. In this post, I'm introducing the powerful graph computing package of Apache Hama and semi-clustering algorithm described Google's Pregel paper [1] . A semi-cluster in a social graph is a group of people who interact frequently with each other and less frequently with others. It is different from ordinary clustering in the sense that a vertex may belong to more than one semi-cluster. The algorithm is a greedy algorithm. Since Apache Hama provides complete clone of Pregel, you can write a semi-clustering program and run it on large graphs in a few minutes like this: @Override public void compute(Iterator<SCMessage> messages) throws IOException { if (this.getSuperstepCount() == 0) { // In superstep 0, V enters itself in that list as a

MapReduce and Beyond

Hi, in this post I'm going to tell you about past and near future of big data processing. In 2006, I worked as a Senior Software Engineer for web portal company, NHN, corporation. Since then, I had experienced a data explosion (the average pageview per day was one billion), and began to research distributed computing technologies. In my early research, batch-oriented MapReduce [1] was one of interesting technology. As all of you know well now, MapReduce programming is very simple and powerful, especially, useful for the aggregation and several basic relational algebraic operations on large data-sets. However, MapReduce is NOT good for everything . For example, graph algorithms [2] , machine learning, and matrix arithmetic. SQL-like Pig, Hive, and MR-based Mahout shows well the scope and limit of MapReduce. Iterative MapReduce also has some problems such as heavy cost for task assignment and I/O overhead. A lack of ability to perform as a real-time was also issue. Today, many M