December 19, 2011

Install a Hama cluster using Whirr

Apache Whirr provides a cloud-neutral way to get a properly configured system running quickly, through libraries, a common service API, smart defaults, and a command-line tool. It currently supports a variety of services, e.g., Hadoop, HBase, Hama, Cassandra, and ZooKeeper. Let's see how simple it is to install a Hama cluster using Whirr.

The following commands install Whirr and start a 5 node Hama cluster on Amazon EC2 in 5 minutes or less.

% curl -O http://apache.tt.co.kr//whirr/whirr-0.7.0/whirr-0.7.0.tar.gz
% tar zxf whirr-0.7.0.tar.gz; cd whirr-0.7.0

% export AWS_ACCESS_KEY_ID=YOUR_ID
% export AWS_SECRET_ACCESS_KEY=YOUR_SECKEY
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

% bin/whirr launch-cluster --config recipes/hama-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

Upon success you should see imok echoed to the console, indicating that Hama is running.


Oh... finished. :)
Now you can run the BSP examples as below:

edward@domU-12-31-39-0C-7D-41:/usr/local/hama-0.3.0-incubating$ bin/hama jar hama-examples-0.3.0-incubating.jar 
An example program must be given as the first argument.
Valid program names are:
  bench: Random Communication Benchmark
  pagerank: PageRank
  pi: Pi Estimator
  sssp: Single Source Shortest Path
  test: Serialize Printing Test
edward@domU-12-31-39-0C-7D-41:/usr/local/hama-0.3.0-incubating$ bin/hama jar hama-examples-0.3.0-incubating.jar pi
11/12/25 11:48:11 INFO bsp.BSPJobClient: Running job: job_201112251143_0001
11/12/25 11:48:14 INFO bsp.BSPJobClient: Current supersteps number: 0
11/12/25 11:48:17 INFO bsp.BSPJobClient: Current supersteps number: 1
11/12/25 11:48:20 INFO bsp.BSPJobClient: The total number of supersteps: 1
Estimated value of PI is 3.147866666666667
Job Finished in 9.635 seconds
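
For intuition about what the pi job computes, here is a single-process sketch of a Monte Carlo estimate. I am assuming that this is the idea behind the Pi Estimator example; the function name, sample count, and seed are illustrative, and this is not Hama's code.

```python
import random

def estimate_pi(samples, seed=None):
    """Estimate pi by sampling points in the unit square (illustrative sketch)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        # Count the point if it falls inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area divided by square area is pi / 4.
    return 4.0 * inside / samples

if __name__ == "__main__":
    print("Estimated value of PI is", estimate_pi(100_000, seed=42))
```

In a BSP setting one would expect each task to run a loop like this and then send its partial result to a single peer for averaging, but that is my reading rather than something the console output above confirms.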

December 13, 2011

SSSP (Single Source Shortest Path) problem with Apache Hama

Since yesterday I've been testing the Apache Hama SSSP (Single Source Shortest Path) example on my small cluster, using a random graph of ~100 million vertices and ~1 billion edges as input. More specifically:
  • Experimental environments
    • One-rack cluster (16 nodes, 256 cores)
    • Hadoop 0.20.2
    • Hama TRUNK r1213634
    • 10G network
  • Task and data partitioning
    • Based on hashing of the vertexID in the graph and on the size of the input data.
  • SSSP algorithm
    • Algorithm described in Pregel paper
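
The setup above can be sketched on a single machine as follows. This is a minimal illustration of hash partitioning by vertex ID combined with a Pregel-style SSSP loop (each vertex takes the minimum incoming distance and, if it improved, relaxes its out-edges), not the code of the Hama example. The names `partition_of` and `sssp` and the toy graph are hypothetical, and the sketch assumes every vertex appears as a key in the graph dictionary.

```python
import math
from collections import defaultdict

def partition_of(vertex_id, num_tasks):
    """Assign a vertex to a task by hashing its ID (hypothetical helper)."""
    # Integer IDs hash to themselves in Python, so this is a plain modulo.
    return hash(vertex_id) % num_tasks

def sssp(graph, source, num_tasks):
    """Pregel-style SSSP; graph maps vertex -> list of (neighbour, weight)."""
    dist = {v: math.inf for v in graph}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:
        outbox = defaultdict(list)
        # Group the active vertices by owning task, as a partitioned run would.
        by_task = defaultdict(list)
        for v in inbox:
            by_task[partition_of(v, num_tasks)].append(v)
        for task in sorted(by_task):
            for v in by_task[task]:
                candidate = min(inbox[v])
                if candidate < dist[v]:
                    # Shorter path found: record it and notify the neighbours.
                    dist[v] = candidate
                    for neighbour, weight in graph[v]:
                        outbox[neighbour].append(candidate + weight)
        inbox = dict(outbox)  # stands in for the barrier between supersteps
        supersteps += 1
    return dist, supersteps

if __name__ == "__main__":
    toy_graph = {
        1: [(2, 4), (3, 1)],
        2: [(4, 1)],
        3: [(2, 2), (4, 5)],
        4: [],
    }
    print(sssp(toy_graph, source=1, num_tasks=2))
```

The loop ends when a superstep produces no messages, so the number of supersteps depends on the shape of the graph and not only on its size.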
And here are some rough results:

Vertices (x10 edges)   Tasks   Supersteps   Job Execution Time
10 million             6       5423         656.393 seconds
20 million             12      2231         449.542 seconds
30 million             18      4398         886.845 seconds
40 million             24      5432         1112.912 seconds
50 million             30      10747        2079.262 seconds
60 million             36      8158         1754.935 seconds
70 million             42      20634        4325.141 seconds
80 million             48      14356        3236.194 seconds
90 million             54      11480        2785.996 seconds
100 million            60      7679         2169.528 seconds

What do you think of these results? I'm quite satisfied, considering that the job execution time includes the data partitioning and loading time (100 ~ 500 seconds) and that there is still plenty of room for improvement. This suggests scalable performance: the SSSP processing time does not increase linearly with the number of vertices.
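
To make the scaling claim easier to check, the sketch below redoes the arithmetic on the figures above. It assumes each row reads as vertices, tasks, supersteps, and seconds (my reading of the columns), and the helper names are hypothetical. Under that reading every run has about 1.67 million vertices per task, and the average cost lands between roughly 0.12 and 0.28 seconds per superstep, so job time appears to track the superstep count more closely than the vertex count.

```python
# (vertices in millions, tasks, supersteps, job seconds), as I read the rows above
RESULTS = [
    (10, 6, 5423, 656.393),
    (20, 12, 2231, 449.542),
    (30, 18, 4398, 886.845),
    (40, 24, 5432, 1112.912),
    (50, 30, 10747, 2079.262),
    (60, 36, 8158, 1754.935),
    (70, 42, 20634, 4325.141),
    (80, 48, 14356, 3236.194),
    (90, 54, 11480, 2785.996),
    (100, 60, 7679, 2169.528),
]

def seconds_per_superstep(rows):
    """Average wall-clock seconds per superstep, keyed by vertices (millions)."""
    return {v: secs / steps for v, _tasks, steps, secs in rows}

def million_vertices_per_task(rows):
    """Millions of vertices handled by each task, keyed by vertices (millions)."""
    return {v: v / tasks for v, tasks, _steps, _secs in rows}

if __name__ == "__main__":
    per_step = seconds_per_superstep(RESULTS)
    per_task = million_vertices_per_task(RESULTS)
    for v, tasks, steps, secs in RESULTS:
        print(f"{v:>4}M  tasks={tasks:<3} supersteps={steps:<6} "
              f"s/superstep={per_step[v]:.3f}  Mvertices/task={per_task[v]:.2f}")
```

Because the job time also includes partitioning and loading, the per-superstep averages overstate the pure compute cost, so treat them as a rough guide only.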