Showing posts from August, 2009

Memo, level synchronous parallel BFS algorithm

Inference anatomy of the Google Pregel

The summary paper of the Google Pregel was distributed -- Pregel: a system for large-scale graph processing.

The high-level organization of Pregel programs is inspired by Valiant's Bulk Synchronous Parallel model. Pregel computations consist of a sequence of iterations, called super-steps.
It's same with Hamburg design as figured below:

During a superstep the framework invokes a user-defined Compute() function for each vertex, conceptually in parallel. The function specifies behavior at a single vertex v and a single superstep S. It can read messages sent to v in superstep S - 1, send messages to other vertices that will be received at superstep S + 1, and modify the state of v and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known. A program terminates when all vertices declare that they are done.
According to our reasoning, user defined Compute() function which can be called during graph trave…

Same matte black different feel -- Lamborghini LP640 vs BMW Z4 vs Hyundai Avante

Lamborghini LP640

2009 BMW Z4

And, Hyundai Avante ... Looks like burned car.

Graph database on Hadoop

Below is the problem list of the recent trends of graph data in my Insight.

- Very large (e.g. Web linked data, Social network, ..., etc)
- Diversified attributes of node and edge
- Requires real-time processing (for exampe, finding the shortest path based on attributes in Google Map)

So, I'm thinking the graph database on hadoop as described below:

HDFS Hama, Map/Reduce Hamburg
graph data -> graph partitioning for locality -> real-time processing

The large graph data can be stored on Hadoop/Hbase and, communication cost can be reduced by partitioning step as bulk processing. Then, finally we can perform the real-time graph processing. What do you think? ;)

Doug Cutting leaves Yahoo, joins Cloudera

The core member of Hadoop, Doug Cutting, is leaving Yahoo to join a startup called Cloudera. Cool... I would like to learn from his footsteps... and eventually soon be a open source developer like him.

The low-power Hadoop cluster

We're understand that the Hadoop is a low-cost way to manage and process the massive data since it has been designed to run on a lot of cheap commodity computers. But, the electric power costs also should be considered when evaluating cost effectiveness. Have you thought them? Since It's a fault tolerant system with active replication, a few servers could go anytime into power saving mode without data loss.

I heard that some guys are trying to handle this problem. See also : On the Energy (In)efficiency of Hadoop Clusters