Posts

Showing posts from 2009

Apache HTTPD and Tomcat Easy & Fast installation guide

1) Install httpd and httpd-devel using yum.

# yum install httpd httpd-devel

2) Download the latest Tomcat and the Tomcat connectors.

# wget http://mirror.khlug.org/apache/tomcat/tomcat-6/v6.0.20/bin/apache-tomcat-6.0.20.tar.gz
# wget http://www.apache.org/dist/tomcat/tomcat-connectors/jk/source/jk-1.2.28/tomcat-connectors-1.2.28-src.tar.gz

3) Compile the Tomcat connectors.

# tar -xzvf tomcat-connectors-1.2.28-src.tar.gz
# cd tomcat-connectors-1.2.28-src/native
# ./configure --with-apxs=/usr/sbin/apxs
# make
# su -c 'make install'

4) Configuration

4-1) Add the jk module to httpd.conf.

# vi /etc/httpd/conf/httpd.conf

LoadModule jk_module modules/mod_jk.so
#JkMount /*.jsp ajp13
<IfModule jk_module>
JkWorkersFile conf/workers.properties
JkLogFile logs/mod_jk.log
JkLogLevel error
</IfModule>

4-2) Set the Tomcat/JDK home path.

# vi /etc/httpd/conf/workers.properties

workers.tomcat_home=/usr/local/src/t…

BMW Z4 Wallpaper (1680x1050 size)

Image
This is my car at Jungdongjin beach on the East Sea in Korea. The wallpaper is sized 1680x1050.

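Happy Christmas 2009: A Trip to Jungdongjin

Image
My BMW Z4 may be rear-wheel drive, but I'm not the least bit afraid of snowy or rainy roads.. (-_-;;)
Boldly as always, I pushed through Daegwallyeong on the mountain national roads of Gangwon-do.
The photos don't do it justice, but it was truly beautiful.

Along the way there was a wind power research institute.
Of course the wind was fierce, and the wind turbines were spinning round and round~
I thought I was going to freeze to death.

The first thing I did on arriving at Jungdongjin was eat.
Throughout the meal the kitchen ladies were speaking in the Gangwon-do dialect, and I found it cute.
Maybe that's because 'Welcome to Dongmakgol' presented the Gangwon-do dialect in a cute way,
but fundamentally the accent itself seems cute.

Once again, boldly as always, ..
"There's nowhere I can't go", "Wherever I go becomes the road", GoGo all the way onto the beach.
(Anyone can do it with a thick enough skin, lol)

Ah, the sea was beautiful and the sunset was gorgeous too (though most people go there for the sunrise).
It felt like my stifled heart had finally opened up. :)

In any case, if you travel to Jungdongjin, be sure to take the national roads!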

Explore Your Followers With Twitter Cumulus!

Google Image Swirl was an inspiration to me, so I made Twitter Cumulus using the REST API. It's really simple, but it's 3D!!!! :)

All the code is written in JavaScript. You can use it freely, and the source code is available if you want it. How to use it? See this page, or query the URL below:

http://people.apache.org/~edwardyoon/twitter-cumulus.html?id={YOUR_TWITTER_ID}
P.S. If you have a lot of followers, don't use it. It'll be slow.

Explore Images with Google Image Swirl

Image

Removing friendships using Twitter API

I just wanted to remove all my friendships at once. (As you know, if you have a lot of friends, doing it by hand is a rough job...) There is a Java library for the Twitter API, so I could program it. :)

// Assumes: import java.util.List;
public static void main(String args[]) throws Exception {
  Twitter twitter = new Twitter("user_id", "passwd");
  // Typed list, so that list.get(i) yields a User without a cast
  List<User> list = twitter.getFriendsStatuses();
  for (int i = 0; i < list.size(); i++) {
    User user = list.get(i);
    twitter.destroyFriendship(String.valueOf(user.getId()));
    System.out.println("Un-following " + user.getName());
  }
}

[Book] Pro Hadoop

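While browsing Hadoop books I found one that mentions Apache Hama. "Where else could you find such a splendid book?" lol. I was so moved that I'm posting it on the blog. (I haven't finished reading it yet, so I'll skip the review.)

Book overview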

You've heard the hype about Hadoop: it runs petabyte-scale data mining tasks insanely fast, it runs gigantic tasks on clouds for absurdly cheap, it's been heavily committed to by tech giants like IBM, Yahoo!, and the Apache Project, and it's completely open source (thus free). But what exactly is it, and more importantly, how do you even get a Hadoop cluster up and running?

From Apress, the name you've come to trust for hands-on technical knowledge, Pro Hadoop brings you up to speed on Hadoop. You learn the ins and outs of MapReduce; how to structure a cluster, design, and implement the Hadoop file system; and how to build your first cloud-computing tasks using Hadoop. Learn how to let Hadoop take care of distributing and parallelizing your software--you just focus on the code, Hadoop takes care of the rest.

Best of all, you'll …

Microsoft's Bing (Korea) serves its search results from Daum Corp.

Today, I noticed that http://bing.com is automatically redirected to
http://bing.search.daum.net/search?q=bing.

BTW, Daum's and Bing's search results are exactly the same. It looks like a Bing-skinned version of the Daum search engine. LOL

- daum search result
- bing search result

See also,
- Microsoft’s Bing fails to crack Korea

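Hot spring bath, .. a hot spring enthusiast

Image
I don't remember exactly since when, but I go to a hot spring at least once a week without fail. Today was no exception, lol.

This photo was taken on the way over after lunch at Namhansanseong. I went to 'Toechon Spa Greenland' (my favorite route). The weather wasn't too bad today, so even with the top down it wasn't cold; it was just right.

Once I arrived at the hot spring I just couldn't keep up the 'photographer spirit', so the photos end here. Having to strip naked probably played a part too..

If I must put what came next into words: lying still in a warm hot spring and gazing up at the sky is truly what you would call living like an immortal. It is sweeter than luxury accessories, high-end sports cars, delicious food, and anything else.

This is how I'm spending December at 31 (Korean age).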

The BSP package of Hama on Hadoop is now available!!

Serialize Printing of HelloWorld using BSP of Hama

Image
The Apache Hama team has built a BSP package: an implementation, on top of Hadoop, of a computational model based on the concept of supersteps, intended for performing matrix/graph computations with better performance. It provides a more flexible programming model than Map/Reduce and simpler APIs than MPI. The vertical structure of a BSP system is shown below:



Sequential composition of "supersteps".
- Local computation
- Process Communication
- Barrier Synchronization
* More details will be announced very soon.
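The superstep structure listed above can be tried out without Hadoop. The following is a minimal single-machine Python sketch, not Hama's BSP API: the function name `serialized_hello` and the use of `threading.Barrier` are my own assumptions for illustration. Each thread plays one peer; in superstep s only the peer with ID s does its local work, and every peer then waits at the barrier. The communication phase is left out to keep the sketch short.

```python
import threading


def serialized_hello(num_peers):
    """Run num_peers threads; in superstep s only the peer with ID s speaks."""
    barrier = threading.Barrier(num_peers)
    output = []  # exactly one peer appends per superstep

    def peer(peer_id):
        for superstep in range(num_peers):
            # Local computation: only the peer whose turn it is does work.
            if peer_id == superstep:
                output.append(f"Hello BSP from peer {peer_id}")
            # Barrier synchronization: nobody enters the next superstep early.
            barrier.wait()

    # Start the peers in reverse order to show that start order does not matter.
    threads = [threading.Thread(target=peer, args=(i,))
               for i in reversed(range(num_peers))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output


if __name__ == "__main__":
    for line in serialized_hello(10):
        print(line)
```

Because of the barrier, the messages come out ordered by peer ID even though the threads are started in reverse order.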

To help explain synchronization and supersteps, I made a "Serialize Printing of Hello World" example.

The code below creates 10 BSPPeerThreads. Each thread gets a shuffled ID number from 0 to 9.
BSPPeerThread thread;
int[] randomSequence = new int[] { 2, 3, 4, 5, 0, 1, 6, 7, 8, 9 };
for (int i = 0; i < NUM_PEER; i++) {
  conf.set("bsp.peers.num", String.valueOf(NUM_PEER));
  conf.set(BSPConstants.PEER_HOST, "localhost");
  conf.se…

Google Chrome is available today on Linux!

Oh yes!! I love it.

----
Hello everybody out there using Linux -

Google Chrome is go for beta on Linux! Thanks to the many Chromium and WebKit developers who helped make Google Chrome a lean, mean browsing machine. Here are a few fun facts from us on the Google Chrome for Linux team:

60,000 lines of Linux-specific code written
23 developer builds
2,713 Linux-specific bugs fixed
12 external committers and bug editors to the Google Chrome for Linux code base, 48 external code contributors

Thanks for waiting and we hope that you enjoy using Google Chrome!


Google Chrome Team



http://www.google.com/chrome/intl/en/w00t.html

--------

(c) 2009 Google www.google.com 1600 Amphitheatre Parkway, Mountain
View CA 94043 United States of America.

Google is a trademark of Google Inc. All other company and product
names may be trademarks of the respective companies with which they
are associated.

Drawing a pie chart using Google Chart API

Image
Really useful!!!

----
The Google Chart API lets you dynamically generate charts. To see the Chart API in action, open up a browser window and copy the following URL into the address bar:

http://chart.apis.google.com/chart?cht=p3&chd=t:60,40&chs=450x200&chl=Hello|World

Press the Enter or Return key and - presto! - you should see the following image:
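The query parameters in the URL above (cht for chart type, chd for data, chs for size, chl for labels) can also be assembled in code. This is a small Python sketch that only builds the URL string and makes no request; the function name `pie_chart_url` is my own, and whether the Chart API endpoint still responds is not something this sketch checks.

```python
from urllib.parse import urlencode


def pie_chart_url(values, labels, size="450x200"):
    """Build a 3D pie chart URL in the format shown in the post."""
    params = {
        "cht": "p3",                                      # 3D pie chart
        "chd": "t:" + ",".join(str(v) for v in values),   # text-encoded data
        "chs": size,                                      # width x height
        "chl": "|".join(labels),                          # slice labels
    }
    # Keep ':', ',' and '|' unescaped so the URL matches the documented form.
    return "http://chart.apis.google.com/chart?" + urlencode(params, safe=":,|")


if __name__ == "__main__":
    print(pie_chart_url([60, 40], ["Hello", "World"]))
```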


Introducing Google Public DNS

Yesterday the official Google blog introduced Google Public DNS -- Official Google Blog: Introducing Google Public DNS

DNS stands for Domain Name System, the system that maps domain names to IP addresses. When you type a domain such as blog.udanax.org into your browser, a series of steps runs behind the scenes that consults a DNS server's database to find the corresponding server. This is normally provided by your ISP (KT and the like); in short, Google is saying it will offer its own public DNS to make this a bit faster.

How do you use it? Simply change your DNS server addresses to 8.8.8.8 and 8.8.4.4.

(On Linux) change the DNS servers as follows.
[edward@udanax ~]# cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4

On Windows, likewise, change the DNS server IPs to 8.8.8.8 and 8.8.4.4 in the network connection properties.
C:\Documents and Settings\Administrator>ping 210.94.0.73

Pinging 210.94.0.73 with 32 bytes of data:

Reply from 210.94.0.73: bytes=32 time=4ms TTL=243
Reply from 210.94.0.73: bytes=32 time=5ms TTL=243
Reply from 210.94.0.73: bytes=32 time=4ms TTL=243
Reply from 210.94.0.73: bytes=32 time=4ms TTL=243

Ping statistics for 210.94.0.73:
Packets: Sent = 4, Received = 4, Lost = 0 …

Review: The Mist (2007)

Really fun. ★★★★☆!!!
More Information: http://en.wikipedia.org/wiki/The_Mist_(film)

IMO, this movie is about the uncertainties of life.
We can't see the future or how things will end, so there is no correct answer. I think we don't need to worry about the future; acting on reflex is best.

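100 Recommended Japanese Manga

1. 군계 (Shamo) (action, fighting)

Among fighting manga this one is very famous and has many passionate fans. The overall mood of the manga is dark... Let me introduce the plot. The protagonist, Ryo, was a model student. But his parents' oppressive control... in an unconscious reaction against it, he ends up killing them. He then goes through many changes in prison.. and becomes something like evil incarnate..... And then a character who stands against him appears.... My introduction can't keep up with the quality of the manga. It makes you think about good and evil, and also about what fighting means to human beings. Oh.. 군계 means 'fighting cock'. Good naming sense, right?


2. Sanctuary (bootleg title: 빛과 그림자, 'Light and Shadow') (action, politics)

A work by Ryoichi Ikegami, Japan's finest gekiga-style manga artist. It tells the story of two men born in Cambodia as sons of Japanese diplomats. Having survived Cambodia's Killing Fields (a massacre.. a real historical event) and returned home, one of them grows into an underworld boss and the other into a political heavyweight. A friendship as dear as life itself.... what will become of their relationship? He always works with excellent story writers, so the entertainment is guaranteed. However, male chauvinism runs through the whole work, so women may find it off-putting.. you seem to be a man, so I recommend it.


3. 수라의 각 (history, martial arts)

A very exciting historical martial-arts story by the author of 해황기. It takes the stories of many real figures from Japanese history as its backbone and blends in fiction with the author's deft touch. This work also shows up in the comedy manga 수험의 제왕.... which has this line: "If you want to study history, read 수라의 각." The author's other work, 해황기, is also very exciting.


4. Berserk (fantasy)

The greatest masterpiece of fantasy, no words needed.... A supreme work whose outstanding artwork is steeped in philosophical reflection. If you haven't read it, please do..


5.…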

Do you need a Google Wave Invitation?

I have 12 of my 16 Google Wave invitations left. Please post a comment on this post and I'll invite you to Google Wave. :)

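How to Google (googling)

1) Searching for an exact phrase: in a search, white space splits the query into separate keywords that are matched independently rather than as one phrase. So when you want to find a consecutive string, put it in quotation marks. This is handy for finding a fragment of song lyrics. For example, if you hear the line "I gotta feeling" in a song, search like this:
["I gotta feeling" lyrics] Personally I also use this to check whether my grammar is right: if the English phrase I came up with is correct, there is a good chance it appears somewhere on the world's web.

2) Searching a specific site only: literally, search only within the content of a specific site.
Find only the content related to breadth first search on my blog. [site:blog.udanax.org breadth first search]
3) Searching a specific file format only: Google searches not only web documents but also PDF, PPT, and so on. PDF is mainly useful when looking for papers, and you can look up presentation contents as well.
Search only PDF documents that contain social network [filetype:pdf social network]
Finally, the "Show options" button under the search tab lets you add further conditions such as recent documents, recent images, or image size. With these skills, you should be able to find the information you want, shouldn't you?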

Subversion Submitted to Become a Project at The Apache Software Foundation

Today, I noticed that the Subversion proposal/vote was opened on the general mailing list of the Apache Software Foundation.

Hi all, The Subversion Corporation decided recently to submit Subversion to the Apache Software Foundation. I've sent the initial proposal to Apache's Incubator project requesting the move (see below). There are a lot of ramifications to this step. It would take me quite a while to write this email if I detailed this stuff. In short, there will be *no change for users*. This will primarily impact the development community. Please stay tuned for more info! Thanks, -g
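A proposal has been submitted for Subversion, the well-known version control tool, to become an Apache project. The vote is still in progress, but Apache Subversion looks almost 100% certain. :) It seems the ASF is going to rule open source.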

WHY - Villa yacht by Hermes

My Last Dream....

Applying Cramer's rule with MapReduce on Apache Hama

Mr. Seo introduced an M/R-based version of Cramer's rule that uses the eigenvalue computation of Apache Hama.

Latent Semantic Indexing (LSI) A Fast Track Tutorial

Megan Fox Pictures

Image
You know, her body is really good.


Facebook Infrastructure and Scaling

Much to my amusement, this video is most popular in Korea. Korea has no global service yet, and I haven't seen anyone trying to enter international markets.

Curiously... this video shows up as the one Koreans watch and like the most. (-_-;;) I hope Korean IT companies can go after the world market a bit more aggressively too.

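The White House site's new search engine, Apache Solr

According to O'Reilly, Solr, the Apache-licensed open source search engine, is being used as the White House search engine. -- Thoughts on the Whitehouse.gov switch to Drupal

In contrast, as the saying goes that 'government project money is blind money', Korea still seems to quietly(?) make use of open source while hiding it under the floor and merely inflating the size of its projects. It's not only the government; Korean companies seem to do the same. They use open source as core components, oversell the result as the fruit of purely in-house technology, and never mention open source at all. Rather than anyone in particular being at fault, it seems to be a problem of perception across the whole IT community.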

Friend Suggestion using Hama

SNS market share is similar to a game of Othello: the people adjacent to you can flip you. Therefore, to prevent users from emigrating from your service, you need to recommend a few people to each user. Don't leave them alone!! For a social network service to settle in gradually, friend suggestion might be a solution, or at least a useful tool.

This slide is at the conceptual design stage. The main idea is to use k-plex clustering to find communities, and then to suggest friends among the people in a community who don't know each other yet.



I'm still updating.....
If you want to see more detail, please subscribe to my blog. ;)

Silicon Valley And Biased Trend V2

I'm impressed by the last page. ;)

Silicon Valley And Biased Trend V2
View more presentations from yangtheman.

Hamburg will be integrated into the Apache Hama project

Hamburg will be integrated into the Apache Hama project soon.

Hamburg is a BSP-based graph computing framework on top of Hadoop. We're working on polishing the project as open source. Since a matrix is a great tool for storing and processing graph data, we came up with the idea of dealing with graphs and matrices in one place.

Is this the new venture or just another desperation?

Recently, I've been trying to figure out another (better) life, since I've lost my passion for the company.

Of course, despite some failures (e.g. global services), this company has had quite a good year on balance. And, this company is still one of the largest enterprises of its kind.

Well, IMO, the problems are partly due to bad (conservative) management. There's no room for new talent. The fact that many people also think like me adds conviction to my view.

However, I have no plan yet and don't know whether this is right or wrong. Is this the new venture or just another desperation?

Google Code Jam 2009

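Someone told me about Google Code Jam 2009, so I entered yesterday. I had to eat dinner and get some sleep, so I stopped after solving just one problem, but luckily it seems that solving one problem is enough to advance to the next round. 151 Koreans are advancing to the next round this way.

For qualification problems, the English plus the difficulty looked fairly demanding. And wow, the amazing Astein ranks highest among the Koreans; this person seems pretty smart and appears to have past GCJ experience.

I probably have the shabbiest coding skills among Apache committers, so I don't know how many rounds I'll last, but why not aim for a Top 10 ranking and the prize money~ :)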

Memo, level synchronous parallel BFS algorithm

Inference anatomy of the Google Pregel

Image
The summary paper of the Google Pregel was distributed -- Pregel: a system for large-scale graph processing.

The high-level organization of Pregel programs is inspired by Valiant's Bulk Synchronous Parallel model. Pregel computations consist of a sequence of iterations, called super-steps.
This is the same as the Hamburg design, as shown in the figure below:

During a superstep the framework invokes a user-defined Compute() function for each vertex, conceptually in parallel. The function specifies behavior at a single vertex v and a single superstep S. It can read messages sent to v in superstep S - 1, send messages to other vertices that will be received at superstep S + 1, and modify the state of v and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known. A program terminates when all vertices declare that they are done.
According to our reasoning, user defined Compute() function which can be called during graph trave…
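As a rough illustration of the superstep loop quoted above, here is a minimal single-process Python sketch. It is neither Pregel's nor Hamburg's API: the function name `run_pregel_max`, the inbox/outbox dictionaries, and the example graph are all my own assumptions. Each vertex runs the same compute step per superstep: it reads the messages sent in the previous superstep, updates its value, and sends messages along its outgoing edges only when its value changed. The run ends when no messages are in flight.

```python
def run_pregel_max(values, edges):
    """Propagate the largest vertex value along directed edges, superstep by superstep."""
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)            # every vertex is active in superstep 0
    superstep = 0
    while active:
        outbox = {v: [] for v in values}
        for v in active:
            # "Compute" for vertex v: read messages from the previous superstep.
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            # Send along outgoing edges; otherwise the vertex stays silent (halts).
            if superstep == 0 or changed:
                for target in edges.get(v, ()):
                    outbox[target].append(new_value)
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
    print(run_pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph))
```

A vertex that received no messages is simply skipped in the next superstep, which plays the role of voting to halt; an incoming message wakes it up again.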

Same matte black different feel -- Lamborghini LP640 vs BMW Z4 vs Hyundai Avante

Image
Lamborghini LP640




2009 BMW Z4


And the Hyundai Avante ... looks like a burned-out car.

Graph database on Hadoop

Below is my view of the problems posed by recent trends in graph data.

- Very large (e.g. Web linked data, social networks, etc.)
- Diversified attributes of nodes and edges
- Requires real-time processing (for example, finding the shortest path based on attributes in Google Maps)

So, I'm thinking of a graph database on Hadoop, as described below:


HDFS            Hama, Map/Reduce                     Hamburg
graph data  ->  graph partitioning for locality  ->  real-time processing


Large graph data can be stored on Hadoop/HBase, and the communication cost can be reduced by a partitioning step done as bulk processing. Then, finally, we can perform real-time graph processing. What do you think? ;)

Doug Cutting leaves Yahoo, joins Cloudera

Doug Cutting, a core member of Hadoop, is leaving Yahoo to join a startup called Cloudera. Cool... I would like to follow in his footsteps... and eventually become an open source developer like him.

The low-power Hadoop cluster

We understand that Hadoop is a low-cost way to manage and process massive data, since it is designed to run on lots of cheap commodity computers. But electric power costs should also be considered when evaluating cost effectiveness. Have you thought about them? Since it's a fault-tolerant system with active replication, a few servers could go into power-saving mode at any time without data loss.

I heard that some people are trying to tackle this problem. See also: On the Energy (In)efficiency of Hadoop Clusters

Hamburg, a graph computing framework on Hadoop

Image
As mentioned before, I've been forming the Hamburg project with Hyunsik Choi. Let's look in more detail at the diagram of Hamburg's computing method, which is based on the BSP model.



Each worker processes the data fragments stored locally. Then we can do bulk synchronization using the collected communication data. The 'Computation' and 'Bulk synchronization' steps can be performed iteratively, and the data for synchronization can be compressed to reduce network usage.

Plainly, it aims to improve the performance of traversal operations in graph computing. For example, to explore all the neighboring nodes from the root node using Map/Reduce (FYI, Breadth-First Search (BFS) & MapReduce), we need many iterations, one per hop, to reach the next vertices.

If we do BFS under the same conditions using Hamburg, the cost of those iterations is lowered. Let's assume the graph looks like the one presented below:



The graph was stored in Hbase on distributed system as above. The root is 1. Then, we need …
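To make the per-hop iteration concrete, here is a small Python sketch of a level-synchronous BFS in which each pass over the frontier corresponds to one superstep. It is an illustration only, not Hamburg code: the function name `bfs_levels` and the example graph are hypothetical, and the graph is not the one from the missing figure.

```python
def bfs_levels(adjacency, root):
    """Return the hop distance from root for every reachable vertex."""
    level = {root: 0}
    frontier = [root]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # One superstep: expand every vertex of the current frontier.
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in level:
                    level[neighbor] = depth
                    next_frontier.append(neighbor)
        # Synchronization point: the next level starts only after this one ends.
        frontier = next_frontier
    return level


if __name__ == "__main__":
    graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
    print(bfs_levels(graph, 1))
```

In a distributed setting the inner loops would run on the workers that hold each vertex locally, and only the newly discovered vertices would need to be exchanged between levels.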

Massive DDoS attacks

The targets hit by the attacks are listed below:

Korea sites:

www.president.go.kr, www.mnd.go.kr, www.mofat.go.kr, www.assembly.go.kr, www.usfk.mil, blog.naver.com, mail.naver.com, banking.nonghyup.com, ezbank.shinhan.com, ebank.keb.co.kr, www.hannara.or.kr, www.chosun.com, www.auction.co.kr

US sites:

www.whitehouse.gov, www.faa.gov, www.dhs.gov, www.state.gov, www.voanews.com, www.defenselink.mil, www.nyse.com, www.nasdaq.com, finance.yahoo.com, www.usauctionslive.com, www.usbank.com, www.washingtonpost.com, www.ustreas.gov

The attacks seem to be of a political character. Anyway, please check for 'msiexec2.exe' on your computer! -- Trojan.Win32.DDoS-Agent.33841

- S. Korean Web Sites Hit by Cyber Attacks
- MS zero-day vulnerability patch released... may be the cause of the DDoS

Pregel, Google's large scale graph computing framework

According to the Google Research blog, they built Pregel for large-scale graph computing and use it for PageRank calculations, shortest paths, etc.

In order to achieve that, we have created scalable infrastructure,
named Pregel, to mine a wide range of graphs. In Pregel, programs
are expressed as a sequence of iterations. In each iteration,
a vertex can, independently of other vertices, receive messages
sent to it in the previous iteration, send messages to other vertices,
modify its own and its outgoing edges' states, and mutate the
graph's topology (experts in parallel processing will recognize that
the Bulk Synchronous Parallel Model inspired Pregel).

Maybe the most important things are the locality of adjacent vertices and dynamic programming. I talked about this with Hyunsik, a member of the Heart project. We think it's a distributed programming model other than map/reduce, but with the same shared-nothing architecture for better performance. I gues…

Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) is a group of algorithms in multivariate analysis and linear algebra where a non-negative matrix, A, is factorized into (usually) two non-negative matrices, W and H.

A_{M \times N} \approx W_{M \times K} H_{K \times N}

Initialize W and H to random positive matrices, then alternate the two minimizations:
W = \arg\min_{W \ge 0} \| W H - A \|
H = \arg\min_{H \ge 0} \| H^T W^T - A^T \|

Iterate until W H converges.
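The alternating scheme above can be tried on a tiny matrix. The sketch below is pure Python and swaps in the well-known multiplicative update rules (Lee and Seung) in place of solving each minimization exactly, because they keep W and H non-negative without any extra projection step. The helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) and the small constant `eps` are my own assumptions.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(a, w, h):
    """Frobenius norm of A - W*H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for ra, rw in zip(a, wh) for x, y in zip(ra, rw)) ** 0.5


def nmf(a, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(a), len(a[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return w, h


if __name__ == "__main__":
    a = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
    w, h = nmf(a, k=1)
    print(frobenius_error(a, w, h))
```

For real data a numerical library would be the practical choice; the nested lists here only keep the example dependency-free.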

See also,
- http://en.wikipedia.org/wiki/Non-negative_matrix_factorization, WIKIPEDIA
- Non-Negative Matrix Factorization with sparseness constraints
- Non-Negative Matrix Factorization Techniques and Optimizations
- Document Clustering Based On Non-negative Matrix Factorization

Jacobi eigenvalue algorithm

The Jacobi eigenvalue algorithm is a numerical procedure for the calculation of all eigenvalues and eigenvectors of a real symmetric matrix.

Each Jacobi rotation can be done in n steps when the pivot element p is known. However the search for p requires inspection of all N ≈ ½ n² off-diag elements. We can reduce this to n steps too if we introduce an additional index array with the property that m_i is the index of the largest element in row i, (i = 1, … , n - 1) of the current S. Then (k, l) must be one of the pairs (i, m_i). Since only columns k and l change, only m_k and m_l must be updated, which again can be done in n steps. Thus each rotation has O(n) cost and one sweep has O(n³) cost which is equivalent to one matrix multiplication. Additionally the m_i must be initialized before the process starts, this can be done in n² steps.
Typically the Jacobi method converges within numerical precision after a small number of sweeps. Note that multiple eigenvalues reduce the number of iterati…
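A compact way to see the rotations at work is the classic version that searches for the largest off-diagonal element on every step. The Python sketch below does exactly that; it does not implement the index-array optimization described above, so its pivot search costs O(n²) per rotation. The function name and the tolerance values are my own choices.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_rotations=1000):
    """Eigenvalues of a real symmetric matrix via Jacobi rotations, sorted ascending."""
    n = len(matrix)
    s = [row[:] for row in matrix]
    if n < 2:
        return sorted(s[i][i] for i in range(n))
    for _ in range(max_rotations):
        # Pivot: the off-diagonal element with the largest magnitude.
        k, l = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda p: abs(s[p[0]][p[1]]))
        if abs(s[k][l]) < tol:
            break
        # Angle that zeroes s[k][l]: tan(2*theta) = 2*s_kl / (s_kk - s_ll).
        theta = 0.5 * math.atan2(2 * s[k][l], s[k][k] - s[l][l])
        c, sn = math.cos(theta), math.sin(theta)
        # S <- S * R (only columns k and l change) ...
        for i in range(n):
            sik, sil = s[i][k], s[i][l]
            s[i][k] = c * sik + sn * sil
            s[i][l] = -sn * sik + c * sil
        # ... then S <- R^T * S (only rows k and l change).
        for j in range(n):
            skj, slj = s[k][j], s[l][j]
            s[k][j] = c * skj + sn * slj
            s[l][j] = -sn * skj + c * slj
    return sorted(s[i][i] for i in range(n))


if __name__ == "__main__":
    print(jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
```

Eigenvectors could be accumulated by applying the same column rotation to an identity matrix, which is omitted here for brevity.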

NAVER Japan, closed beta service has been started again

Image
The closed beta of the search engine NAVER Japan will run from the 15th of this month for a limited number of people.



Naver Japan had actually launched in the Japanese market earlier (2005), but the service didn't make a splash. Moreover, according to Japanese Wikipedia (http://ja.wikipedia.org/wiki/NaverBot), NAVER Japan seems to have been disgraced by its behavior, in particular by running a bot that impersonated GoogleBot. NAVER Japan will fail again unless the service delivers significantly better performance and a better image.

See also, Japanese Search Engine Market

P.S. The interview model : Julia Oki.

IMAP protocol and server

I realized that IMAP is a really complex protocol: aside from its extensive list of commands, it comes with requirements such as support for concurrent command processing. So a command parser, a session manager, and a request handler are necessary. For example,

LSUB "#news." "comp.mail.*"
A654 FETCH 2:4 (FLAGS BODY[HEADER.FIELDS (DATE FROM)])

Fortunately, there are some open source projects related to IMAP.

- Apache James, fully featured mail server
- Green Mail, mail server, supports SMTP, POP3, IMAP with SSL
- JIMAP, Java-based implementation of the IMAP protocol

According to Robert Burrell Donkin, who is a member of the ASF, "As far as James IMAP goes, the protocol handlers are fine. The Mailbox implementations need more work (but that's on my list for this week).".

The secrets of fast mental arithmetic

Image
'Star King', the 'Britain´s Got Talent' of South Korea, introduced a mental arithmetic expert. She did quick mental arithmetic on large integers in only a few seconds. For example, 258,785,412,354 + 125,465,871,235 + 954,125,487,965 - 845,123,658,792 * 542,136,587,212 ... = ?

I was very impressed. According to her, "I calculated them with an imaginary abacus".



OK, abacus-based mental calculation is the recipe. Generally, we learn to calculate multi-digit numbers from right to left, one digit at a time. It's a carry/serial calculation. But her (abacus) algorithm is a little different from ours. It's a shift-up/parallel calculation.

Frobenius norm with Hama

Recently, the Frobenius norm was nicely implemented as below by Samuel, one of the Hama committers. The Frobenius norm is submultiplicative and is very useful for linear algebra. This norm is often easier to compute than induced norms.


/** Frobenius Norm Mapper */
public static class MatrixFrobeniusNormMapper extends MapReduceBase implements
    MatrixNormMapper {
  @Override
  public void map(IntWritable key, MapWritable value,
      OutputCollector output, Reporter reporter)
      throws IOException {
    // Accumulates the sum of squares of the row's entries
    double rowSqrtSum = 0;
    for (Map.Entry e : value.entrySet()) {
      double cellValue = ((DoubleEntry) e.getValue()).getValue();
      rowSqrtSum += (cellValue * cellValue);
    }

    nValue.set(rowSqrtSum);
    output.collect(nKey, nValue);
  }
}

/** Frobenius Norm Combiner */
public static class MatrixFrobeniusNormCombiner extends MapReduceBase
    implements MatrixNormReducer {
  private double sqrtSum = 0;

  @Override
  public void reduce(IntWr…
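Since the listing above is cut off, here is the same idea reduced to a few lines of plain Python, as a sketch of the computation only and not of the Hama classes: the map step emits the sum of squares of one row, and the reduce step adds the partial sums and takes the square root. The function names are my own.

```python
import math


def map_row(row):
    """Map step: sum of squared entries of a single row."""
    return sum(x * x for x in row)


def frobenius_norm(matrix):
    """Reduce step: add the per-row partial sums, then take the square root."""
    partial_sums = [map_row(row) for row in matrix]
    return math.sqrt(sum(partial_sums))


if __name__ == "__main__":
    print(frobenius_norm([[1, 2], [3, 4]]))  # sqrt(1 + 4 + 9 + 16)
```

Because the partial sums are simply added, a combiner can pre-aggregate them on each node before the final reduce.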

Google support RDFa and Microformats

Meeting with the CTO of Sun Japan

I had a meeting with the CTO of Sun Japan today. He visited me and wanted to know about the possibility of deploying Hama in the scientific/HPC computing area with Hadoop.

Since it's at too early a stage, there was not much to talk about or plan, but I feel sure that it'll be really valuable if implemented. It was a delight to talk with him.

Psychologists Are Better Stockmarket Speculators Than Economists

The report below shows that the stock market is not so much a science as a gamble. ;)

----
From: http://www.medicalnewstoday.com/articles/35953.php

Shareholders seem to be swayed by the buying pattern of other shareholders much less than has hitherto been assumed. This at least is the conclusion arrived at by economists of the Bank of England and the universities of Heidelberg and Bonn. Together with the corporate consultants McKinsey they scrutinised the share-buying behaviour of about 6,500 persons in an Internet experiment. They found no signs of ‘herd instinct’ during the experiment - on the contrary, some of the test subjects decided against buying those specific shares which had just been bought by so many other players. Psychologists, particularly, mistrusted those shares which they regarded as overvalued. This strategy benefited them enormously: on average they were markedly more successful in their speculation than physicists or mathematicians - or even economists.

On average the psychologi…

The spammer is a potential advertiser?

While fighting spammers, I once thought that a spammer is a potential advertiser. Basically, they want to advertise on the mail infrastructure. Should we only block them? Is there a way to convert them into advertisers?

Distressed by excessive work

I usually make up my mind easily, and I'm not on the horns of a dilemma very often. But right now I am. Lately, I've been considering going back to school (or moving to another nice company that could give me some time for my research), since the lack of time for my open source work, my research, and my free time distresses me deeply.

By far, workers in South Korea have the longest work hours among OECD countries. The average South Korean works 2,390 hours each year, according to the OECD. This is over 400 hours longer than the next longest-working country and 34% more hours than the average in the United States.

NHN Corp. is a good company, but do you know that NHN Corp. is in Korea? ;(

Movie: The Happening 2008

Image
Today, I watched this movie and personally I think it's not bad.



The film opens in New York. People start to get confused in Central Park, repeating their words, standing still and sometimes walking backwards. We hear a few screams. A woman reading on a bench takes her silver chopstick-style hair pin out of her hair and stabs herself in the neck with it.

The plants' chemical defenses were the culprit. That insight seems to be based on 'New Study Sheds Light on Plants’ Nighttime Defense'.

Compute the transpose of matrix using Hama

The transpose of a matrix is another matrix in which the rows and columns have been reversed. It will be used for SVD (Singular Value Decomposition).

+             +      +             +
| a11 a12 a13 |      | a11 a21 a31 |
| a21 a22 a23 |  =>  | a12 a22 a32 |
| a31 a32 a33 |      | a13 a23 a33 |
+             +      +             +

- A map task receives a row n as a key, and vector of each row as its value
- emit (Reversed index, the entry with the given index)
- Reduce task sets the reversed values
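The three steps above can be sketched in plain Python. This only illustrates the key/value flow and is not Hama code; the function names are hypothetical and an in-memory dictionary stands in for the shuffle phase.

```python
from collections import defaultdict


def map_row(row_index, row):
    """Map: for every entry, emit (column index, (row index, value))."""
    for col_index, value in enumerate(row):
        yield col_index, (row_index, value)


def reduce_column(pairs):
    """Reduce: order one column's entries by original row index to form a new row."""
    return [value for _, value in sorted(pairs)]


def transpose(matrix):
    groups = defaultdict(list)  # stands in for the shuffle/group-by-key phase
    for i, row in enumerate(matrix):
        for key, pair in map_row(i, row):
            groups[key].append(pair)
    return [reduce_column(groups[j]) for j in sorted(groups)]


if __name__ == "__main__":
    print(transpose([[1, 2, 3], [4, 5, 6]]))
```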

Transposing a 5,000 x 5,000 dense matrix took 12 minutes using Hadoop/Hama (10 nodes). Why store the result? If we store it, the next steps, such as multiplication, get data locality.

The ASF is ten years

Image
The Apache Software Foundation turned 10 years old two weeks ago. Back when it started, I was a university student in Korea, and I never dreamed that I would someday play with Apache. Recently I've been quite busy at work, but I'll do my best for the ASF.



- https://blogs.apache.org/foundation/entry/the_asf_is_ten_years

Spam Filtering using Personalized Ontologies

This paper introduces a user-customized filter based on user preferences and emails, expressed as a personalized ontology.

- http://imsc-dmim.usc.edu/publications/2009_SAC_SWA_final.pdf

Japanese search engine market

I was wondering about the Japanese IT market, since NHN Corp. still plans a Japanese web search offering as far as I know. So I asked 'Tetsuya Kitahata', a Japanese IT entrepreneur who is also famous for his open source work (Apache Jakarta), about it. He said, "Baidu (Chinese leading company in web search engine) is struggling here in japan.", "Yahoo! JAPAN is the most popular web search engine in Japan (Yahoo:Google:MSN = 6:3:1). Yahoo, Inc. is/was mostly owned by the Japanese (to be precise, he has bloodline of Korean), named Masayoshi SON from the beginning (in 1996 or around?). Mr. SON is one of the most famous entrepreneurs in Japan. So, Yahoo! is very popular in Japan. I can not think that Baidu can beat neither Yahoo! Japan nor Google Japan. Mr. SON is the owner and CEO of SOFTBANK, K.K.".

Hmm, I can't forecast the success, but a degree of success seems predictable.

Friday's Dinner

Image
Today, the stock market closed strong and I wound up the week by selling all my stock. But I'll invest again in the near future, since I think I had better invest in stocks: they have more growth potential than I do. Oh, I wish I could live on income from investments.

Anyway, this is my dinner for tonight. Thank you Lord for the food -- amen.

TIP: How to hide the Blogger Navbar

To hide the Blogger Navbar :

1: On your Dashboard, select Layout. This will take you to the Template tab. Click Edit HTML. Under the Edit Template section you will see your blog's HTML.

2: Paste the CSS definition at the top of the template code:

#navbar-iframe {
display: none !important;
}

Google AdSense for Domains

What is AdSense for domains?

AdSense for domains allows publishers with undeveloped domains to help users by providing relevant information including ads, links and search results.
...

Cool, this seems a good solution for useless/undeveloped domains. I wanted to try it immediately. (In all honesty, I heard a rumor that some decoy domain names based on typing errors (e.g. foofle.com/uoutube.com or maver.com) earn a lot of money. *smile*)

So, I just bought a few domains:

- http://www.fadegook.com
- http://www.filegactory.com

I chose these by statistical analysis. I'll report the detailed results if I can. ;)

Nature's Equity

이인로(李仁老), a wise old scholar of Korea, said, "The birds have two wings, but only two legs; the beautiful flowers don't bear fruit, and the colorful clouds are more easily scattered".

TouchGraph - Graph visualization tool

Image
The diagram below is my social graph from the online social network Facebook. TouchGraph shows how my friends are connected. It seems to have been developed as open source under the Apache license - http://sourceforge.net/projects/touchgraph/


Find the maximum absolute row sum of matrix using MapReduce

The find the maximum absolute row sum of matrix is a good fit with MapReduce model as below.

$$\text{The maximum absolute row sum} = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{i,j}|$$


- A map task receives a row index i as a key, and the vector of that row as its value
- It emits (row, the sum of the absolute values of the row's entries)
- The reduce task selects the maximum one

Maybe the map step can be written in Java as below.

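Vector v = givenValue;

// Accumulate the absolute values of the row's entries.
// Assumes VectorEntry exposes its value through a no-argument get();
// adjust this call to the actual API.
double rowSum = 0.0;
for (VectorEntry e : v) {
  rowSum += Math.abs(e.get());
}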
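
The map and reduce steps above can also be simulated without any cluster. The following is only a minimal sketch in plain Python, not Hama or Hadoop code; the function names (`map_row`, `reduce_max`, `max_abs_row_sum`) and the sample matrix are made up for illustration.

```python
def map_row(row_index, row):
    """Map step: emit (row index, sum of the absolute values of the row)."""
    return row_index, sum(abs(x) for x in row)


def reduce_max(emitted):
    """Reduce step: select the largest row sum among the emitted pairs."""
    return max(value for _, value in emitted)


def max_abs_row_sum(matrix):
    """Run the map step on every row, then the single reduce step."""
    emitted = [map_row(i, row) for i, row in enumerate(matrix)]
    return reduce_max(emitted)


if __name__ == "__main__":
    # Hypothetical sample matrix: the row sums are 6, 15 and 24.
    sample = [[1, -2, 3], [-4, 5, -6], [7, 8, -9]]
    print(max_abs_row_sum(sample))  # 24
```

Each row is handled independently in the map step, which is why the problem splits so easily across many map tasks.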

See more of the Hama - Algorithms

PHP similar_text algorithm

The similar_text function returns the number of matching characters of two strings. It can also calculate the similarity of the two strings in percent: (matched character count / average length of the two strings) * 100. For example,

<?php
echo similar_text("Hello World","Hello Peter");
?>
Output: 7

<?php
echo similar_text("Pello World","Hello Weter");
?>
Output: 6, because the letters are not in the correct order.

<?php
similar_text("Hello World","Hello Peter",$percent);
echo $percent;
?>
Output: 63.6363636364
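
To make the idea concrete, here is a rough sketch in Python (not PHP's own source). It assumes the usual description of the algorithm: find the first longest common substring, then recurse on the pieces to its left and to its right. The names `similar_text` and `similarity_percent` here are plain Python functions written for this illustration.

```python
def similar_text(s1, s2):
    """Count matching characters by recursing around the longest common substring."""
    if not s1 or not s2:
        return 0
    best_len, pos1, pos2 = 0, 0, 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            if k > best_len:  # keep the first longest match found
                best_len, pos1, pos2 = k, i, j
    if best_len == 0:
        return 0
    # Recurse on what lies to the left and to the right of the common part.
    left = similar_text(s1[:pos1], s2[:pos2])
    right = similar_text(s1[pos1 + best_len:], s2[pos2 + best_len:])
    return best_len + left + right


def similarity_percent(s1, s2):
    """Matched characters divided by the average length of both strings, times 100."""
    if not s1 and not s2:
        return 0.0
    return similar_text(s1, s2) * 2 * 100.0 / (len(s1) + len(s2))


if __name__ == "__main__":
    print(similar_text("Hello World", "Hello Peter"))  # 7
    print(round(similarity_percent("Hello World", "Hello Peter"), 4))  # 63.6364
```

For "Hello World" and "Hello Peter" the common part "Hello " contributes 6 and the single "r" on the right contributes 1, which gives 7 and 7 * 2 / 22 * 100 percent.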

Gmail and Spam Filtering

"Many Google teams provide pieces of the spam-protection puzzle, from distributed computing to language detection. For example, we use optical character recognition (OCR) developed by the Google Book Search team to protect Gmail users from image spam. And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam," explains Google. "Gmail supports multiple authentication systems, including SPF (Sender Policy Framework), DomainKeys, and DKIM (DomainKeys Identified Mail), so we can be more certain that your mail is from who it says it's from. Also, unlike many other providers that automatically let through all mail from certain senders, making it possible for their messages to bypass spam filters, Gmail puts all senders through the same rigorous checks."

See also:

- Official Gmail Blog: How our spam filter works
- A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

Movie: Gran Torino

This movie was very impressive; it made me think of my father and my life.

Hadoop Install (하둡 설치)

Many Korean people still ask me how to download and install Hadoop. This post describes how to install and configure a Hadoop cluster. If you are comfortable with English, see the links below:

- Hadoop: Quick Start
- Hadoop: Cluster Setup

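Introduction to Hadoop

Hadoop is a Java software framework developed to build large virtualized storage out of ordinary PC-class machines and to process the huge data sets kept in it in parallel. A simplified parallel processing model such as MapReduce helps you develop complex parallel programs in a lightweight way, and the storage's fault-tolerance model makes service and operational maintenance easier.

Requirements

Before installing Hadoop: as mentioned above, Hadoop is a Java software framework, so JDK 1.6.x or later is required (for Hadoop 0.19.x and later). The master and the slaves communicate over SSH, so SSH must be installed as well, and it is best to copy the public key in advance so that connections are accepted automatically without a password prompt.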


$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/udanax/.ssh/id_dsa):
Created directory '/home/udanax/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/udanax/.ssh/id…

Breadth-First Search (BFS) & MapReduce

Breadth-First Search (BFS) with MapReduce was roughly introduced in the Distributed Computing Seminar. The graph is stored as a sparse matrix, and the shortest path is found using Map/Reduce as described below:

Finding the Shortest Path: Intuition

- We can define the solution to this problem inductively:
  - DistanceTo(startNode) = 0
  - For all nodes n directly reachable from startNode, DistanceTo(n) = 1
  - For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)


From Intuition to Algorithm

- A map task receives a node n as a key, and (D, points-to) as its value
  - D is the distance to the node from the start
  - points-to is a list of nodes reachable from n
  - ∀p ∈ points-to, emit (p, D+1)
- Reduce task gathers possible distances to a given p and selects the minimum one
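
The map and reduce steps above can be sketched in plain Python as a single-process simulation. This is only an illustration under a few assumptions that are not in the seminar notes: every node appears as a key in the adjacency dictionary, each map call also re-emits the node's own current distance so it is not lost, and a small driver loop repeats the iteration until no distance changes. All function names and the sample graph are hypothetical.

```python
INF = float("inf")


def map_node(node, value):
    """Map step: for node n with (D, points-to), emit (p, D+1) for each p."""
    distance, points_to = value
    yield node, distance  # assumption: keep the node's own best distance
    if distance != INF:
        for p in points_to:
            yield p, distance + 1


def reduce_node(distances):
    """Reduce step: select the minimum of the candidate distances."""
    return min(distances)


def bfs_iteration(graph, dist):
    """One simulated MapReduce pass over all nodes."""
    grouped = {}
    for node, points_to in graph.items():
        for key, d in map_node(node, (dist[node], points_to)):
            grouped.setdefault(key, []).append(d)
    return {node: reduce_node(ds) for node, ds in grouped.items()}


def shortest_distances(graph, start):
    """Repeat the pass until the distances stop changing."""
    dist = {n: INF for n in graph}
    dist[start] = 0
    while True:
        new_dist = bfs_iteration(graph, dist)
        if new_dist == dist:
            return dist
        dist = new_dist


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
    print(shortest_distances(graph, "A"))
```

Each pass pushes the frontier one hop further, so the number of passes grows with the longest shortest path from the start node.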

According to the idea above, a map task receives a node n as a key, and (D, points-to) as its value. It means that an "input" is a set of all reachable paths from 'sta…

Chinese Proverb - 兎死狗烹

Do you know '兎死狗烹'? It's a Chinese proverb. Put literally, a man uses a hound to hunt rabbits but shoos the hound out onto the street once the hunt is over. It means that something is appreciated while it is needed, but once it is no longer needed, it is treated as useless and abandoned.

Sexy or Intelligent Internet

Is it better to be intelligent or sexy? For a scientist, it is important to be intelligent. For a model, a sexy appearance is important. For a school professor, it is important to be intelligent. Either being intelligent or being sexy can help you succeed in your relations with other people.

Then, should a web application be intelligent or sexy? IMO, it's the same as above. A search engine should be intelligent enough to bring you the best information based on what you mean. An entertainment application should be sexy, fresh and suggestive to provide more fun.

We can find examples of these characteristics in google.com and youtube.com.

What is the most dominant operating system in Korea?

I noticed that someone visited my blog with this question - "what are the operating system most dominantly use in korea?". So I would like to answer that question.

In Korea, Microsoft Windows (and the IE browser) is still used on as many as 90 percent of the country's PCs, because most Korean web sites (e.g. internet banking, game launch sites, multimedia sites, etc.) require ActiveX on top of the high-speed internet infrastructure. It boomed without any special reason.

So, I think that non-IE products (e.g., iPhone, Google Chrome, Safari, Firefox) haven't had much appeal in the Korean market.

Google Chubby And Distributed Systems

Chubby is a sort of external lock-management server for reliability and availability, and is used to solve asynchronous consensus and other problems in distributed computing.

Building Chubby was an engineering effort required to fill the needs mentioned above; it was not research. We claim no new algorithms or techniques. The purpose of this paper is to describe what we did and why, rather than to advocate it. In the sections that follow, we describe Chubby’s design and implementation, and how it has changed in the light of experience. We describe unexpected ways in which Chubby has been used, and features that proved to be mistakes. We omit details that are covered elsewhere in the literature, such as the details of a consensus protocol or an RPC system.

As the authors claim, the papers focus on the overall system architecture. Click below to see more detail:

- The Chubby lock service for loosely-coupled distributed systems
- Paxos made live

Updated:
- There is a similar open source called zookee…

Google PowerMeter

Google PowerMeter, now in prototype, will receive information from utility smart meters and energy management devices and provide anyone who signs up access to her home electricity consumption right on her iGoogle homepage. The graph below shows how someone could use this information to figure out how much energy is used by different household activities.




Great, it represents a kind of ubiquitous computing system.

Naver storage hosting service, called 'N drive'

Naver is internally working on an online file sharing and storage service called 'N drive', which stores large amounts of user files using OwFS (Ownership-based File System). I don't know exactly when it will open, but it clearly shows that NHN has also started to deal with cloud-related problems.

FT: IBM Creates a Cloud Computing Division

by Timothy Prickett Morgan

You know that Big Blue is getting serious about something when it creates a formal division to manage it. Last week, IBM announced that it was creating a cloud computing division just as it had also announced that key systems software would soon be available for deployment on Amazon's Elastic Compute Cloud (EC2).

Technically speaking, Erich Clementi, who is currently IBM's vice president for strategy and who was formerly a general manager of the Business Systems division (which peddles gear to small and medium businesses) and the System z mainframe business, is now also general manager of Enterprise Initiatives, which is where the company is currently parking all of its cloud computing efforts. And instead of reporting up through Systems and Technology Group or Software Group or Global Services, like the other IBM groups do, this Enterprise Initiatives division reports directly up to IBM president, chief executive officer, and chairman, Sam Palmisano.

A…