Install a Hama cluster using Whirr

Apache Whirr provides a Cloud-neutral way to run a properly-configured system quickly through libraries, common service API, smart defaults, and command line tool. Currently it supports various Cloud services e.g., Hadoop, HBase, Hama, Cassandra, and ZooKeeper. Let's see how it is simple to install Hama cluster using Whirr.

The following commands install Whirr and start a 5 node Hama cluster on Amazon EC2 in 5 minutes or less.

% curl -O
% tar zxf whirr-0.7.0.tar.gz; cd whirr-0.7.0

% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

% bin/whirr launch-cluster --config recipes/ --private -key-file ~/.ssh/id_rsa_whirr

Upon success you should see imok echoed to the console, indicating that Hama is running.

Oh... finished. :)
Now you can run an BSP examples as below:

edward@domU-12-31-39-0C-7D-41:/usr/local/hama-0.3.0-incubating$ bin/hama jar hama-examples-0.3.0-incubating.jar 
An example program must be given as the first argument.
Valid program names are:
  bench: Random Communication Benchmark
  pagerank: PageRank
  pi: Pi Estimator
  sssp: Single Source Shortest Path
  test: Serialize Printing Test
edward@domU-12-31-39-0C-7D-41:/usr/local/hama-0.3.0-incubating$ bin/hama jar hama-examples-0.3.0-incubating.jar pi
11/12/25 11:48:11 INFO bsp.BSPJobClient: Running job: job_201112251143_0001
11/12/25 11:48:14 INFO bsp.BSPJobClient: Current supersteps number: 0
11/12/25 11:48:17 INFO bsp.BSPJobClient: Current supersteps number: 1
11/12/25 11:48:20 INFO bsp.BSPJobClient: The total number of supersteps: 1
Estimated value of PI is 3.147866666666667
Job Finished in 9.635 seconds

SSSP (Single Source Shortest Path) problem with Apache Hama

From yesterday I'm testing Apache Hama SSSP (Single Source Shortest Path) example with random graph of ~ 100 million vertices and ~ 1 billion edges as a input on my small cluster. More specifically:
  • Experimental environments
    • One rack (16 nodes 256 cores) cluster 
    • Hadoop 0.20.2
    • Hama TRUNK r1213634.
    • 10G network
  • Task and data partitioning
    • Based on hashing of vertextID in graph and size of input data.
  • SSSP algorithm
    • Algorithm described in Pregel paper
And here's rough results for you:

Vertices (x10 edges)TasksSuperstepsJob Execution Time
10 million65423656.393 seconds
20 million122231449.542 seconds
30 million184398886.845 seconds
40 million2454321112.912 seconds
50 million30107472079.262 seconds
60 million3681581754.935 seconds
70 million42206344325.141 seconds
80 million48143563236.194 seconds
90 million54114802785.996 seconds
100 million6076792169.528 seconds

What do you think on this chart? I'm quite satisfied considering that the job execution time contains the data partitioning and loading time (100 ~ 500 seconds) and there is still much to be desired. This surely shows scalable performance, the SSSP processing time will not increase linearly with the number of vertices.

Living life

Have you once thought about how many days you have left until you die? If so you already know that what a cruel thing is time.

Human life is too short.

Everyday is very precious, don't stress the small stuff!

Open-source M&A

There are two types of M&A from a buyer's perspective: social status/influence (valuable users/employees/databases/experiences) and positive business model. Typically, knowledge-based industry's case falls under the former. For examples, art, internet service, software, .., etc. In a similar vein, open source software also have a big opportunity.

If your open source software is able to make some buzz among the people, be ambitious!

Google Summer of Code 2011 T-shirt

Just received my Google Summer of Code 2011 T-shirt gift from Google today, as I was a mentor for GSoC @ Apache this year.

After all, Hama project have found a friend in Thomas.

MapReduce, Twitter Storm, 그리고 Hama BSP

과거 MapReduce는 확실히 batch-oriented 된 processing engine 이었고, 한 동안 서비스개발에서 멀어져있던 나는 그 굴레에서 쉽게 벗어나지를 못하고 있었던것 같다. large-scale과 우아한 알고리듬 처리 .. 만 생각하고 있었다.

그런데 오늘날 서비스들을 잘 보면 트위터 트렌드나 네이버 실시간 급상승 인기 검색어, 등등 .. 실시간으로 변화하고 진화하는 분야의 문제를 위해 이제는 단순히 거대한 big data/large-scale processing 에서 data stream mining[1], online processing, continuous computation 형태로 진화함을 보고 있다.

확실히 기술은 필요에 의해서 발전의 기틀을 마련한다. 국내에 대형 포탈들은 조용한데 이상한 회사들이 big data를 논하고 있는 현상은 ... 그냥 trend 타고있음을 강조하기 위해 사용되는 서술자. ㅋ

어쨌건 그래서 Storm이나 Stream processing엔진들이 나오고 있고, Google의 Pregel도 100% 이런 형태로 사용되고 있음을 짐작한다. Storm은 내가 안봐서 확실히는 모르겠고, M/R과 달리 Hama BSP는 이 분야에 대해 확실한 강점을 갖는다. traffic anomaly detection 을 위한 시스템을 실험해본 결과 너무 훌륭했다랄까. YARN과 통합된 이후 어떻게 발전할지 기대된다. :D


Real-time, continous, and stream processing with BSP?

Recently, I attended some seminar, met some people who want to use Hama or already made something. I expected only some large-scale/batch-oriented data processing applications but heard very interesting use cases. One was that a continuous processing using infinite loop in a bsp function of each task.

I realized that BSP is can also be used easily for real-time, continous, and stream processing unlike MapReduce.

나의 2011년을 뒤돌아 보며, 그리고 2012

3개월이나 남았으므로 아직 끝난것은 아니나 성격이 급해놔서. 

어디보자~ 블로그를 스캔하니 금방이다. 올해는 블로깅이 매우 뜸했다. 특별히 계획은 안보이고 년초에 4월까지의 목표라고 정리한게 보이는데 ...

1. 송도 국제도시로 이사 완료 
2. MongoDB 번역 완료 및 출판 
3. Apache Hama 0.2 릴리즈 
4. 후보 2명을 커미터로 충원 
5. 작년에 인디 오더 넣은 애마 인수

몽땅 올킬. 굿!

그럼 2012년 목표는:

1. Hama를 Hadoop nextGen에 통합
2. 좀 더 많은 커미터 꼬실레이션
3. 1 thousand nodes 대규모 Hama cluster 테스트
4. Hama 0.5까지 릴리즈
5. 아파치 인큐베이터 졸업
6. 실사례, 킬러앱 추가
7. Hama In Action 저술

더 있지만 일단 요정도만.

월요일 잡담

한 때는 순진하게도 어미에게 물려받은 사냥 스킬 하나면 광야를 홀로 누비는 자유로운 삶이 해결된다 믿었다.

한 사람 개인에게 주어지는 한정적인 시간과 능력.

순진한 방법으로는 생존 조차 어려운게 오늘날 우리네 인간사회 현실이다.

Cancers Generate Muscle-Like Contractions ...

There are only two cells that declares independence from our body. A germ cell and cancer cell. Few days ago, I saw the amazing news "Cancers Generate Muscle-Like Contractions to Spread Around the Body ... ".

Do they wants to revert back to a single cell state? Are they independent beings?

VIA-based (Bulk Synchronous Parallel) BSPLib vs. BSPLib

The Virtual Interface Architecture (VIA) is an abstract model of an user-level zero-copy network, and is the basis for InfiniBand, iWARP and RoCE. Created by Microsoft, Intel, and Compaq, the original VIA sought to standardize the interface for high-performance network technologies known as System Area Networks (SAN).

The VIA model of high-performance by simplifying protocol stacks and reducing kernel-copy overheads, can be also used to avoid bottlenecks caused by the communication overheads in Parallel clustering system.

Message (bytes)VIA-based BSPLibBSPLib

The above table presents a comparison of time delay(us) of VIA-based and UDP/IP based BSPLib. VIA-based BSPLib shows the obvious better performance than the performance of UDP/IP based.

Francis Galton’s Ox

(an excerpt from The Wisdom of Crowds by James Surowiecki)

One day in the fall of 1906, the British scientist Francis Galton left his home in the town of Plymouth and headed for a country fair. Galton was eighty-five years old and beginning to feel his age, but he was still brimming with the curiosity that had won him renown—and notoriety—for his work on statistics and the science of heredity. And on that particular day, what Galton was curious about was livestock.

Galton’s destination was the annual West of England Fat Stock and Poultry Exhibition, a regional fair where the local farmers and townspeople gathered to appraise the quality of each other’s cattle, sheep, chickens, horses, and pigs. Wandering through rows of stalls examining workhorses and prize hogs may seem to have been a strange way for a scientist (especially an elderly one) to spend an afternoon, but there was a certain logic to it. Galton was a man obsessed with two things: the measurement of physical and mental qualities, and breeding. And what, after all, is a livestock show but a big showcase for the effects of good and bad breeding?

Breeding mattered to Galton because he believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. At the International Exhibition of 1884 in London, for instance, he set up an “Anthropometric Laboratory,” where he used devices of his own making to test exhibition-goers on, among other things, their “Keenness of Sight and of Hearing, Colour Sense, Judgment of Eye, [and] Reaction Time.” His experiments left him with little faith in the intelligence of the average person, “the stupidity and wrong-headedness of many men and women being so great as to be scarcely credible.” Only if power and control stayed in the hands of the select, well-bred few, Galton believed, could a society remain healthy and strong.

As he walked through the exhibition that day, Galton came across a weight-judging competition. A fat ox had been selected and placed on display, and members of a gathering crowd were lining up to place wagers on the weight of the ox. (Or rather, they were placing wagers on what the weight of the ox would be after it had been “slaughtered and dressed.”) For sixpence, you could buy a stamped and numbered ticket, where you filled in your name, your address, and your estimate. The best guesses would receive prizes.

Eight hundred people tried their luck. They were a diverse lot. Many of them were butchers and farmers, who were presumably expert at judging the weight of livestock, but there were also quite a few people who had, as it were, no insider knowledge of cattle. “Many non-experts competed,” Galton wrote later in the scientific journal Nature, “like those clerks and others who have no expert knowledge of horses, but who bet on races, guided by newspapers, friends, and their own fancies.” The analogy to a democracy, in which people of radically different abilities and interests each get one vote, had suggested itself to Galton immediately. “The average competitor was probably as well fitted for making a just estimate of the dressed weight of the ox, as an average voter is of judging the merits of most political issues on which he votes,” he wrote.

Galton was interested in figuring out what the “average voter” was capable of because he wanted to prove that the average voter was capable of very little. So he turned the competition into an im-promptu experiment. When the contest was over and the prizes had been awarded, Galton borrowed the tickets from the organizers and ran a series of statistical tests on them. Galton arranged the guesses (which totaled 787 in all, after he had to discard thirteen because they were illegible) in order from highest to lowest and graphed them to see if they would form a bell curve. Then, among other things, he added all the contestants’ estimates, and calculated the mean of the group’s guesses. That number represented, you could say, the collective wisdom of the Plymouth crowd. If the crowd were a single person, that was how much it would have guessed the ox weighed.

Galton undoubtedly thought that the average guess of the group would be way off the mark. After all, mix a few very smart people with some mediocre people and a lot of dumb people, and it seems likely you’d end up with a dumb answer. But Galton was wrong. The crowd had guessed that the ox, after it had been slaughtered and dressed, would weigh 1,197 pounds. After it had been slaughtered and dressed, the ox weighed 1,198 pounds. In other words, the crowd’s judgment was essentially perfect. Perhaps breeding did not mean so much after all. Galton wrote later: “The result seems more creditable to the trustworthiness of a democratic judgment than might have been expected.” That was, to say the least, an understatement.

This indicates the reliability of probabilistic statistics.

Rough analysis of Google's Pregel-clone projects.

1. Apache Hama is currently focused on implementing pure BSP computing engine on top of Hadoop and its performance. They also plan to add a Pregel-like graph computing framework.

2. Y!'s giraph is different in the sense that it runs as a map-only job.
See the org.apache.giraph.graph.GiraphJob.BspMapper. They uses existing Hadoop Map/Reduce.

3. JPregel is a prototype level project I think. But, well written code.

4. And, GoldenOrb. Sorry, I didn't read their code.

I gotta autograph from Nancy Lang (modern pop artist)

I gotta autograph from Nancy Lang \o/
I didn't know her very well but she seems such a cool person.

The Survival

Two days ago, I found great illumination in the BBC Wildlife. Competition and collaboration. Both sides has the same purpose.

The Survival.

The social network evolution is influenced by strong instinct, "a desire for existence". Love, money, reputation, ... Now I can understand why many people likes Survival TV shows or social network services. :P The Survival is the perfect story of life.

The power of story

We don't just buy the product, we buy the story behind it. The man loves 'Porsche'. Because there are stories about freedom and romance of youth.

We don't just join a company, we join because of its story and vision.

Create your story, tell your story and leverage your story!!!!!

Apache Hama 0.2.0-incubating Released!

Apache Hama 0.2.0-incubating Released!

Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

This first release includes:

  • BSP computing framework and its examples
  • CLI-based managing and monitoring tool of BSP job

You can be downloaded from the download page of Hama website[2].


MongoDB 완벽 가이드, 책으로 읽어보다

증정용이 도착했다~ \o/
두려운 마음으로 한장 한장 넘겼다.

데드라인이 넘어가면서 역자서문/표지 번역은 출/퇴근길 아이폰으로 버스에서 구글 닥스에 붙어 작업했고 아주 막판엔 지쳐서 2번 정도 대강 읽고 끝내버렸기에 은근히 신경쓰였는데, ... 이거이거 꽤 잘 나왔잖아!!

(아, 물론 내 생각 .. )

GSoC, Implementation of single source shortest path using Hama

Despite the Hama project[1] is still under heavy construction, one of my GSoC students has excellently (or aggressively) finished implementing[2] his plan, "Implementation of single source shortest path using Hama"[3] and also started to contribute improvements to the Hama project. He used an algorithm described in Google Pregel paper[4], and it works nicely on my 2-rack cluster (512 cores).

But, more surprisingly, he is just 19 years old! :o When I was 19 years old?

  • JDK 1.1 released.
  • I first met the monster called Diablo[5].
  • I ridden bike.



[Note] Dijkstra's vs. Bellman-Ford

  • Dijkstra's : relax each edge exactly once (can't handle negative weights)
  • Bellman-Ford : relax each edge |V| -1 times

Anything else?

What's the most popular NoSQL?

I'd say that the answer is a MongoDB.

MongoDB is growing quickly in popularity because it offers a relatively rich range of features, while maintaining impressive speed. The features include built-in Indices, range queries, support for replication, and auto-sharding. A Map/Reduce function allows you to add to the aggregate functions natively supported and do large-scale jobs like nightly reports.

According to some reports, Java-based NoSQL solutions seems are not free from the GC(Garbage Collection) and performance problems. MongoDB is written in C++ but has drivers for Perl, Python, PHP, Java, and Ruby ... So, powerful!

The infinity/infinitesimal does not exist.

life and death, yin and yang, start and end, creation and destruction, ...

The infinity/infinitesimal does not exist.

Software versioning

How do you versioning your software?

This is so embarrassing, I didn't know well about Software Versioning before reading the MongoDB book. Because, I mostly worked for web-service companies which don't develop the software product and care about software versioning. (Other country companies may not, but at least 'korea web-service' companies does.)

Anyway, I recently learned many things while translating the book, 'MongoDB: Definitive Guide'. :D

P.S., The Korean version of 'MongoDB: The Definitive Guide - O'Reilly Media' is coming soon!

Apache Hama - BSP Job Submission Portal

I just made a simple web portal, where you can submit your BSP job and execute it on my test cluster.

I spend 1 hour writing code and didn't test enough. :P

Try to upload hama-0.2.0-examples.jar, and put the args "bench 5 5 1".

Enjoy the cool summer with Apache Hama and GSoC 2011

The Apache Software Foundation was accepted for Google Summer of Code 2011. Enjoy the cool summer with Apache Hama and GSoC!


Apache Hama is a Google's Pregel-like, a distributed computing framework based on Bulk Synchronous Parallel.

GSoC 2011 Projects:

1. Evaluation of Hama BSP communication protocol performance

The goal of this project HAMA-358 is performance evaluation of RPC frameworks (e.g., Hadoop RPC, Thrift, Google Protobuf, ..., etc) to figure out which is the best solution for Hama BSP communication. Currently Hama is using Hadoop RPC to communicate and transfer messages between BSP workers. By this project, students will have learned how to evaluate components in the design phase.
  • A list of prerequisite:
    • Understanding of Bulk Synchronous Parallel model
    • Understanding of RPC (remote procedure call)
  • Programming skills:
    • Shell scripts
    • 2d plotting programming
  • The estimated duration for this project: 12 weeks
  • The level of difficulty: High

2. Development of Shortest Path Finding Algorithm

The goal of this project HAMA-359 is development of parallel algorithm for finding a Shortest Path using Hama BSP. By this project, students will have researched the development of a Message-Passing based parallel algorithm and learned Bulk Synchronous Parallel model.
  • A list of prerequisite:
    • Understanding of the Hama BSP programming model
    • Understanding of Graph theory
  • Programming skills:
    • Java Programming
  • The estimated duration for this project: 12 weeks
  • The level of difficulty: High

3. Runtime Compression of BSP Messages to Improve the Performance

In this project HAMA-367, we investigate BSP message data compression in the context of large-scale distributed message-passing systems to reduce the communication time of individual messages and to improve the bandwidth of the overall system.
  • A list of prerequisite:
    • Understanding of the Bulk Synchronous Parallel and Message-Passing model
  • Programming skills:
    • C or Java Programming
  • The estimated duration for this project: 12 weeks
  • The level of difficulty: High

OSS 릴리즈 관리에 GnuPG (그누피지) 사용하기

GnuPG는 GNU Privacy Guard의 약자로 문서나 파일을 암호화할 때 사용되는 소프트웨어이다. 자세한 내용은 공식 웹사이트를 참조:


일반적으로 오픈소스는 (Apache 프로젝트도 역시) 릴리즈할 때 release tarball 에 GnuPG (그누피지) 를 사용해서 사인을 하는데, 나는 어제 처음 써본것이라 3시간을 이것때문에 삽질했다. :/

어쨌거나 암호학에 생소하면 당췌 문서를 이해하기 쉽지않고 한글 문서는 찾아볼 수도 없어서 이참에 정리를 좀 해보면,

1. 제일 먼저 GPG 키가 없으면 다음과 같은 명령어로 하나 생성해야된다:

[edwardyoon@minotaur:~/public_html/dist]$ gpg --gen-key

2. 질문에 대해서는 대부분 default 로 넘어가면 되고 완료가 되면 다음과 같은 명령어로 ASCII 공개키를 파일로 생성할 수 있다:

[edwardyoon@minotaur:~/public_html/dist]$ gpg --armor --output KEYS --export Edward J. Yoon
Warning: using insecure memory!
[edwardyoon@minotaur:~/public_html/dist]$ cat KEYS 
Version: GnuPG v2.0.16 (FreeBSD)


tarball 파일에 사인은 다음과 같은 명령어로 할 수 있다:

[edwardyoon@minotaur:~/public_html/dist/0.2.0]$ gpg --armor --output hama-0.2.0.tar.gz.asc --detach-sig hama-0.2.0.tar.gz 

이렇게 사인을 해두면, 다운로더는 아래와 같이 "Edward J. Yoon <>"이 사인한 공식적인 release tarball인지 아닌지를 확인해볼 수 있게 된다:

[edwardyoon@minotaur:~/public_html/dist/0.2.0]$ gpg --verify hama-0.2.0.tar.gz.asc hama-0.2.0.tar.gz
Warning: using insecure memory!
gpg: Signature made Wed Mar  2 13:54:36 2011 UTC using DSA key ID 3C50E8D7
gpg: Good signature from "Edward J. Yoon <>"

외국 메일링 리스트를 보면, 이메일 서명에도 GPG를 사용하는 OSS 개발자들이 많더라~

4월 까지의 목표

목표는 길게 잡는것 보다 짧게 잡는게 낫다. 그런 의미에서 4월 까지의 목표들을 정리해보면:
  1. 송도 국제도시로 이사 완료
  2. MongoDB 번역 완료 및 출판
  3. Apache Hama 0.2 릴리즈
  4. 후보 2명을 커미터로 충원
  5. 작년에 인디 오더 넣은 애마 인수
이렇게 5 가지 정도가 되겠다. 사실 요즘의 나에 스케쥴은 매우 빡세게 흘러가는데 ... 그렇다고 나 스스로를 혹사시키지는 않는다. 가고자하는 목표가 확실하기 때문에 이렇게 하나하나 천천히 하다보면 목적지에 도달하겠지.

Apache Hama 0.2: User Guide

Just idea about efficient allocation of VMs on clouds

If we can forcibly group and migration by tendency of people e.g., hobby, fields of work .. etc., we might dramatically reduce the traffic of the town.

Similarly, we might efficiently allocate VMs on clouds using their tendency. (e.g., CPU, network, or disk usage pattens). Then, how can we safely migrate VMs to other hosts? ...

Shuffle Error: MAX_FAILED_UNIQUE_FETCHES; bailing-out

First, long time no use MapReduce! Today I wasted some time figuring this out; "Shuffle Error: MAX_FAILED_UNIQUE_FETCHES; bailing-out".

If you meet this error message w/ the higher version than the hadoop-0.20.2, you should check the file "mapred-site.xml" in {$HADOOP_HOME}/conf directory and the "/etc/hosts" because this happens when the IP addresses are all confused and things aren't on the right ports.

# The following lines are desirable for IPv6 capable hosts
#::1     localhost ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters


    The task tracker http server address and port.
    If the port is 0 then the server will start on a free port.

Or, if you have a lot of map and reduce processes in your cluster, check the "tasktracker.http.threads" property.


무한의 세계

무한 집합의 크기 Cardinality , 즉 원소의 개수를 수학에서는 '농도'라고 말한다. 유한 집합의 크기는 그대로 원소의 개수 이지만, 무한 집합의 경우는 원소의 개수를 낱낱이 셈하는 것은 불가능하기 때문에 '농도'라...