Showing posts from 2011

Install a Hama cluster using Whirr

Apache Whirr provides a Cloud-neutral way to run a properly-configured system quickly through libraries, common service API, smart defaults, and command line tool. Currently it supports various Cloud services e.g., Hadoop, HBase, Hama, Cassandra, and ZooKeeper. Let's see how it is simple to install Hama cluster using Whirr.

The following commands install Whirr and start a 5 node Hama cluster on Amazon EC2 in 5 minutes or less.

% curl -O % tar zxf whirr-0.7.0.tar.gz; cd whirr-0.7.0 % export AWS_ACCESS_KEY_ID=YOUR_ID % export AWS_SECRET_ACCESS_KEY=YOUR_SECKEY % ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr % bin/whirr launch-cluster --config recipes/ --private -key-file ~/.ssh/id_rsa_whirr
Upon success you should see imok echoed to the console, indicating that Hama is running.

Oh... finished. :)
Now you can run an BSP examples as below:


SSSP (Single Source Shortest Path) problem with Apache Hama

From yesterday I'm testing Apache Hama SSSP (Single Source Shortest Path) example with random graph of ~ 100 million vertices and ~ 1 billion edges as a input on my small cluster. More specifically:
Experimental environmentsOne rack (16 nodes 256 cores) cluster Hadoop 0.20.2Hama TRUNK r1213634.10G networkTask and data partitioningBased on hashing of vertextID in graph and size of input data.SSSP algorithmAlgorithm described in Pregel paperAnd here's rough results for you:

Vertices (x10 edges)TasksSuperstepsJob Execution Time10 million65423656.393 seconds20 million122231449.542 seconds30 million184398886.845 seconds40 million2454321112.912 seconds50 million30107472079.262 seconds60 million3681581754.935 seconds70 million42206344325.141 seconds80 million48143563236.194 seconds90 million54114802785.996 seconds100 million6076792169.528 seconds
What do you think on this chart? I'm quite satisfied considering that the job execution time contains the data partitioning and loading t…

Living life

Have you once thought about how many days you have left until you die? If so you already know that what a cruel thing is time.

Human life is too short.

Everyday is very precious, don't stress the small stuff!

Doomsday is nothing



Open-source M&A

There are two types of M&A from a buyer's perspective: social status/influence (valuable users/employees/databases/experiences) and positive business model. Typically, knowledge-based industry's case falls under the former. For examples, art, internet service, software, .., etc. In a similar vein, open source software also have a big opportunity.

If your open source software is able to make some buzz among the people, be ambitious!

Google Summer of Code 2011 T-shirt

Just received my Google Summer of Code 2011 T-shirt gift from Google today, as I was a mentor for GSoC @ Apache this year.

After all, Hama project have found a friend in Thomas.

MapReduce, Twitter Storm, 그리고 Hama BSP

과거 MapReduce는 확실히 batch-oriented 된 processing engine 이었고, 한 동안 서비스개발에서 멀어져있던 나는 그 굴레에서 쉽게 벗어나지를 못하고 있었던것 같다. large-scale과 우아한 알고리듬 처리 .. 만 생각하고 있었다.

그런데 오늘날 서비스들을 잘 보면 트위터 트렌드나 네이버 실시간 급상승 인기 검색어, 등등 .. 실시간으로 변화하고 진화하는 분야의 문제를 위해 이제는 단순히 거대한 big data/large-scale processing 에서 data stream mining[1], online processing, continuous computation 형태로 진화함을 보고 있다.

확실히 기술은 필요에 의해서 발전의 기틀을 마련한다. 국내에 대형 포탈들은 조용한데 이상한 회사들이 big data를 논하고 있는 현상은 ... 그냥 trend 타고있음을 강조하기 위해 사용되는 서술자. ㅋ

어쨌건 그래서 Storm이나 Stream processing엔진들이 나오고 있고, Google의 Pregel도 100% 이런 형태로 사용되고 있음을 짐작한다. Storm은 내가 안봐서 확실히는 모르겠고, M/R과 달리 Hama BSP는 이 분야에 대해 확실한 강점을 갖는다. traffic anomaly detection 을 위한 시스템을 실험해본 결과 너무 훌륭했다랄까. YARN과 통합된 이후 어떻게 발전할지 기대된다. :D


Real-time, continous, and stream processing with BSP?

Recently, I attended some seminar, met some people who want to use Hama or already made something. I expected only some large-scale/batch-oriented data processing applications but heard very interesting use cases. One was that a continuous processing using infinite loop in a bsp function of each task.

I realized that BSP is can also be used easily for real-time, continous, and stream processing unlike MapReduce.

나의 2011년을 뒤돌아 보며, 그리고 2012

3개월이나 남았으므로 아직 끝난것은 아니나 성격이 급해놔서. 

어디보자~ 블로그를 스캔하니 금방이다. 올해는 블로깅이 매우 뜸했다. 특별히 계획은 안보이고 년초에 4월까지의 목표라고 정리한게 보이는데 ...

1. 송도 국제도시로 이사 완료 
2. MongoDB 번역 완료 및 출판 
3. Apache Hama 0.2 릴리즈 
4. 후보 2명을 커미터로 충원 
5. 작년에 인디 오더 넣은 애마 인수

몽땅 올킬. 굿!

그럼 2012년 목표는:

1. Hama를 Hadoop nextGen에 통합
2. 좀 더 많은 커미터 꼬실레이션
3. 1 thousand nodes 대규모 Hama cluster 테스트
4. Hama 0.5까지 릴리즈
5. 아파치 인큐베이터 졸업
6. 실사례, 킬러앱 추가
7. Hama In Action 저술

더 있지만 일단 요정도만.

Introduction of Apache Hama @ National University Of Chungnam

월요일 잡담

한 때는 순진하게도 어미에게 물려받은 사냥 스킬 하나면 광야를 홀로 누비는 자유로운 삶이 해결된다 믿었다.

한 사람 개인에게 주어지는 한정적인 시간과 능력.

순진한 방법으로는 생존 조차 어려운게 오늘날 우리네 인간사회 현실이다.

Cancers Generate Muscle-Like Contractions ...

There are only two cells that declares independence from our body. A germ cell and cancer cell. Few days ago, I saw the amazing news "Cancers Generate Muscle-Like Contractions to Spread Around the Body ... ".

Do they wants to revert back to a single cell state? Are they independent beings?

VIA-based (Bulk Synchronous Parallel) BSPLib vs. BSPLib

The Virtual Interface Architecture (VIA) is an abstract model of an user-level zero-copy network, and is the basis for InfiniBand, iWARP and RoCE. Created by Microsoft, Intel, and Compaq, the original VIA sought to standardize the interface for high-performance network technologies known as System Area Networks (SAN).

The VIA model of high-performance by simplifying protocol stacks and reducing kernel-copy overheads, can be also used to avoid bottlenecks caused by the communication overheads in Parallel clustering system.

Message (bytes)VIA-based BSPLibBSPLibbsp_get()bsp_send()bsp_get()bsp_send()1k3873596857622k5064768259034k683667103310828k105511081421137716k177618152224212932k3245323138853716
The above table presents a comparison of time delay(us) of VIA-based and UDP/IP based BSPLib. VIA-based BSPLib shows the obvious better performance than the performance of UDP/IP based.

Francis Galton’s Ox

(an excerpt from The Wisdom of Crowds by James Surowiecki)

One day in the fall of 1906, the British scientist Francis Galton left his home in the town of Plymouth and headed for a country fair. Galton was eighty-five years old and beginning to feel his age, but he was still brimming with the curiosity that had won him renown—and notoriety—for his work on statistics and the science of heredity. And on that particular day, what Galton was curious about was livestock.

Galton’s destination was the annual West of England Fat Stock and Poultry Exhibition, a regional fair where the local farmers and townspeople gathered to appraise the quality of each other’s cattle, sheep, chickens, horses, and pigs. Wandering through rows of stalls examining workhorses and prize hogs may seem to have been a strange way for a scientist (especially an elderly one) to spend an afternoon, but there was a certain logic to it. Galton was a man obsessed with two things: the measurement of physical and mental qual…

Rough analysis of Google's Pregel-clone projects.

1. Apache Hama is currently focused on implementing pure BSP computing engine on top of Hadoop and its performance. They also plan to add a Pregel-like graph computing framework.

2. Y!'s giraph is different in the sense that it runs as a map-only job.
See the org.apache.giraph.graph.GiraphJob.BspMapper. They uses existing Hadoop Map/Reduce.

3. JPregel is a prototype level project I think. But, well written code.

4. And, GoldenOrb. Sorry, I didn't read their code.

I gotta autograph from Nancy Lang (modern pop artist)

I gotta autograph from Nancy Lang \o/
I didn't know her very well but she seems such a cool person.

The Survival

Two days ago, I found great illumination in the BBC Wildlife. Competition and collaboration. Both sides has the same purpose.

The Survival.

The social network evolution is influenced by strong instinct, "a desire for existence". Love, money, reputation, ... Now I can understand why many people likes Survival TV shows or social network services. :P The Survival is the perfect story of life.

The power of story

We don't just buy the product, we buy the story behind it. The man loves 'Porsche'. Because there are stories about freedom and romance of youth.

We don't just join a company, we join because of its story and vision.

Create your story, tell your story and leverage your story!!!!!

Apache Hama 0.2.0-incubating Released!

Apache Hama 0.2.0-incubating Released!

Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

This first release includes:

BSP computing framework and its examplesCLI-based managing and monitoring tool of BSP job
You can be downloaded from the download page of Hama website[2].


MongoDB 완벽 가이드, 책으로 읽어보다

증정용이 도착했다~ \o/
두려운 마음으로 한장 한장 넘겼다.

데드라인이 넘어가면서 역자서문/표지 번역은 출/퇴근길 아이폰으로 버스에서 구글 닥스에 붙어 작업했고 아주 막판엔 지쳐서 2번 정도 대강 읽고 끝내버렸기에 은근히 신경쓰였는데, ... 이거이거 꽤 잘 나왔잖아!!

(아, 물론 내 생각 .. )

The Mountain

GSoC, Implementation of single source shortest path using Hama

Despite the Hama project[1] is still under heavy construction, one of my GSoC students has excellently (or aggressively) finished implementing[2] his plan, "Implementation of single source shortest path using Hama"[3] and also started to contribute improvements to the Hama project. He used an algorithm described in Google Pregel paper[4], and it works nicely on my 2-rack cluster (512 cores).

But, more surprisingly, he is just 19 years old! :o When I was 19 years old?

JDK 1.1 released.I first met the monster called Diablo[5].I ridden bike.


[Note] Dijkstra's vs. Bellman-Ford

Dijkstra's : relax each edge exactly once (can't handle negative weights)Bellman-Ford : relax each edge |V| -1 times
Anything else?

What's the most popular NoSQL?

I'd say that the answer is a MongoDB.

MongoDB is growing quickly in popularity because it offers a relatively rich range of features, while maintaining impressive speed. The features include built-in Indices, range queries, support for replication, and auto-sharding. A Map/Reduce function allows you to add to the aggregate functions natively supported and do large-scale jobs like nightly reports.

According to some reports, Java-based NoSQL solutions seems are not free from the GC(Garbage Collection) and performance problems. MongoDB is written in C++ but has drivers for Perl, Python, PHP, Java, and Ruby ... So, powerful!

The infinity/infinitesimal does not exist.

life and death, yin and yang, start and end, creation and destruction, ...

The infinity/infinitesimal does not exist.

Software versioning

How do you versioning your software?

This is so embarrassing, I didn't know well about Software Versioning before reading the MongoDB book. Because, I mostly worked for web-service companies which don't develop the software product and care about software versioning. (Other country companies may not, but at least 'korea web-service' companies does.)

Anyway, I recently learned many things while translating the book, 'MongoDB: Definitive Guide'. :D

P.S., The Korean version of 'MongoDB: The Definitive Guide - O'Reilly Media' is coming soon!

Apache Hama - BSP Job Submission Portal

I just made a simple web portal, where you can submit your BSP job and execute it on my test cluster.

I spend 1 hour writing code and didn't test enough. :P

Try to upload hama-0.2.0-examples.jar, and put the args "bench 5 5 1".

Enjoy the cool summer with Apache Hama and GSoC 2011

The Apache Software Foundation was accepted for Google Summer of Code 2011. Enjoy the cool summer with Apache Hama and GSoC!

Apache Hama is a Google's Pregel-like, a distributed computing framework based on Bulk Synchronous Parallel.

GSoC 2011 Projects:

1. Evaluation of Hama BSP communication protocol performance

The goal of this project HAMA-358 is performance evaluation of RPC frameworks (e.g., Hadoop RPC, Thrift, Google Protobuf, ..., etc) to figure out which is the best solution for Hama BSP communication. Currently Hama is using Hadoop RPC to communicate and transfer messages between BSP workers. By this project, students will have learned how to evaluate components in the design phase. A list of prerequisite:Understanding of Bulk Synchronous Parallel modelUnderstanding of RPC (remote procedure call)Programming skills:Shell scripts2d plotting programmingThe estimated duration for this project: 12 weeksThe level of difficulty: High

2. Development of Shortest Path Findin…

Apache Hama 0.2.0 Release Candidates

OSS 릴리즈 관리에 GnuPG (그누피지) 사용하기

GnuPG는 GNU Privacy Guard의 약자로 문서나 파일을 암호화할 때 사용되는 소프트웨어이다. 자세한 내용은 공식 웹사이트를 참조:


일반적으로 오픈소스는 (Apache 프로젝트도 역시) 릴리즈할 때 release tarball 에 GnuPG (그누피지) 를 사용해서 사인을 하는데, 나는 어제 처음 써본것이라 3시간을 이것때문에 삽질했다. :/

어쨌거나 암호학에 생소하면 당췌 문서를 이해하기 쉽지않고 한글 문서는 찾아볼 수도 없어서 이참에 정리를 좀 해보면,

1. 제일 먼저 GPG 키가 없으면 다음과 같은 명령어로 하나 생성해야된다:

[edwardyoon@minotaur:~/public_html/dist]$ gpg --gen-key
2. 질문에 대해서는 대부분 default 로 넘어가면 되고 완료가 되면 다음과 같은 명령어로 ASCII 공개키를 파일로 생성할 수 있다:

[edwardyoon@minotaur:~/public_html/dist]$ gpg --armor --output KEYS --export Edward J. Yoon Warning: using insecure memory! [edwardyoon@minotaur:~/public_html/dist]$ cat KEYS -----BEGIN PGP PUBLIC KEY BLOCK----- Version: GnuPG v2.0.16 (FreeBSD) mQMuBE1uPgIRCADVN8LNKn8God0EM+PuJ4Y7+YV6FY3R/IFa/W5XbHZpDBWT161S tZe+w4e4hl+cK5eHDkjvnj+Xx8g9u/ohxF31SOGDrbNBdDpOMJEG6F9vu3TdNdVb Kz8yLHcjp5v/kQNngyytn/4rVXtYjFCta8Ks5DAHUZCfESoRf8HLvk5fHr4AwXng vG85xHqQ3axW3SGO4ZDnsaLrofaHazemjR+wxmY7XK6No/aV5gyfTpKgq1sDKzge NSNxEs99Om1Xfrap1JgWuiXFgTo0…

Monitoring and Mining Network Traffic in Clouds

Check out this SlideShare Presentation:
Monitoring and mining network traffic in clouds
View more presentations from Edward J. Yoon.

4월 까지의 목표

목표는 길게 잡는것 보다 짧게 잡는게 낫다. 그런 의미에서 4월 까지의 목표들을 정리해보면:
송도 국제도시로 이사 완료MongoDB 번역 완료 및 출판Apache Hama 0.2 릴리즈후보 2명을 커미터로 충원작년에 인디 오더 넣은 애마 인수이렇게 5 가지 정도가 되겠다. 사실 요즘의 나에 스케쥴은 매우 빡세게 흘러가는데 ... 그렇다고 나 스스로를 혹사시키지는 않는다. 가고자하는 목표가 확실하기 때문에 이렇게 하나하나 천천히 하다보면 목적지에 도달하겠지.

Apache Hama 0.2: User Guide

Just idea about efficient allocation of VMs on clouds

If we can forcibly group and migration by tendency of people e.g., hobby, fields of work .. etc., we might dramatically reduce the traffic of the town.

Similarly, we might efficiently allocate VMs on clouds using their tendency. (e.g., CPU, network, or disk usage pattens). Then, how can we safely migrate VMs to other hosts? ...

Shuffle Error: MAX_FAILED_UNIQUE_FETCHES; bailing-out

First, long time no use MapReduce! Today I wasted some time figuring this out; "Shuffle Error: MAX_FAILED_UNIQUE_FETCHES; bailing-out".

If you meet this error message w/ the higher version than the hadoop-0.20.2, you should check the file "mapred-site.xml" in {$HADOOP_HOME}/conf directory and the "/etc/hosts" because this happens when the IP addresses are all confused and things aren't on the right ports.

# The following lines are desirable for IPv6 capable hosts #::1 localhost ip6-localhost ip6-loopback #fe00::0 ip6-localnet #ff00::0 ip6-mcastprefix #ff02::1 ip6-allnodes #ff02::2 ip6-allrouters

<property> <name>mapreduce.task.tracker.http.address</name> <value></value> <description> The task tracker http server address and port. If the port is 0 then the server will start on a free port. </description> </property>
Or, if you have a lot of map and reduce …