Edward J. Yoon's Blog: 03/2009

Japanese search engine market

I was wondering about the Japanese IT market, since NHN Corp. still plans to offer web search in Japan, as far as I know. So I asked Tetsuya Kitahata, a Japanese who works in the IT industry and is also well known for his open source work (Apache Jakarta). He said, "Baidu (the leading Chinese web search company) is struggling here in Japan," and "Yahoo! JAPAN is the most popular web search engine in Japan (Yahoo:Google:MSN = 6:3:1). Yahoo, Inc. is/was mostly owned by a Japanese man (to be precise, he is of Korean descent) named Masayoshi SON from the beginning (in 1996 or around?). Mr. SON is one of the most famous entrepreneurs in Japan. So Yahoo! is very popular in Japan. I cannot imagine that Baidu can beat either Yahoo! Japan or Google Japan. Mr. SON is the owner and CEO of SOFTBANK, K.K."

Hmm, I can't forecast how successful it will be, but some degree of success seems predictable.

Friday's Dinner

Today the stock market closed strong, and I wrapped up the week by selling all my stock. But I'll invest again in the near future; I figure I had better invest in stocks because they have more growth potential than I do. Oh, I wish I could live on income from investments.

Anyway, this is my dinner for tonight. Thank you Lord for the food -- amen.

TIP: How to hide the Blogger Navbar

To hide the Blogger Navbar:

1: On your Dashboard, select Layout. This will take you to the Template tab. Click Edit HTML. Under the Edit Template section you will see your blog's HTML.

2: Paste the following CSS definition at the top of the template code:


#navbar-iframe {
   display: none !important;
}

Google AdSense for Domains


What is AdSense for domains?

AdSense for domains allows publishers with undeveloped domains to help users by providing relevant information including ads, links and search results.
...

Cool, this seems like a good solution for unused or undeveloped domains. I wanted to try it immediately. (In all honesty, I have heard rumors that some decoy domain names based on typing errors (e.g. foofle.com, uoutube.com, or maver.com) earn a lot of money. *smile*)

So, I just bought a few domains:

- http://www.fadegook.com
- http://www.filegactory.com

I chose these by statistical analysis; I'll report the detailed results if I can. ;)

Nature's Equity

Yi In-ro (이인로, 李仁老), a wise old Korean scholar, said, "Birds have two wings but only two legs, beautiful flowers do not bear fruit, and colorful clouds scatter more easily."

TouchGraph - Graph visualization tool

The diagram below is my social graph from the Facebook online social network. TouchGraph shows how my friends are connected. It appears to have been developed as open source under the Apache license: http://sourceforge.net/projects/touchgraph/

Find the maximum absolute row sum of matrix using MapReduce

Finding the maximum absolute row sum of a matrix is a good fit for the MapReduce model, as shown below.


The maximum absolute row sum = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{i,j}|


- A map task receives a row index as its key and that row's vector as its value
 - it emits (row, the sum of the absolute values of the row's entries)
- The reduce task selects the maximum of the emitted sums

The row-sum part could be written in Java roughly as follows.


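  Vector v = givenValue;

  // Accumulate the absolute values of this row's entries.
  double rowSum = 0.0;  // a local variable must be initialized before "+=" will compile
  for (VectorEntry e : v) {
    // Pseudocode: e.get(i) stands for the value of entry i; the exact accessor depends on the Hama API.
    rowSum += Math.abs(e.get(i));
  }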
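
The snippet above relies on Hama's vector types. As a self-contained illustration, here is a minimal plain-Java sketch of the same map and reduce steps. The class and method names (MaxAbsRowSum, map, reduce, maxAbsRowSum) are made up for this example, and the driver loop only stands in for what the framework would do; it is not Hama or Hadoop code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the map and reduce steps; no Hadoop or Hama classes are used.
class MaxAbsRowSum {

    // Map step: for one row, emit (row index, sum of the absolute values of its entries).
    static Map.Entry<Integer, Double> map(int row, double[] vector) {
        double rowSum = 0.0;
        for (double entry : vector) {
            rowSum += Math.abs(entry);
        }
        return new SimpleEntry<>(row, rowSum);
    }

    // Reduce step: select the largest of the emitted row sums.
    static double reduce(List<Map.Entry<Integer, Double>> emitted) {
        double max = 0.0; // absolute row sums are never negative
        for (Map.Entry<Integer, Double> pair : emitted) {
            max = Math.max(max, pair.getValue());
        }
        return max;
    }

    // Hypothetical driver standing in for the framework: map every row, then reduce once.
    static double maxAbsRowSum(double[][] matrix) {
        List<Map.Entry<Integer, Double>> emitted = new ArrayList<>();
        for (int i = 0; i < matrix.length; i++) {
            emitted.add(map(i, matrix[i]));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        double[][] a = {
            { 1, -2,  3},   // row sum 6
            {-4,  5, -6},   // row sum 15
            { 7, -8,  0}    // row sum 15
        };
        System.out.println(maxAbsRowSum(a)); // prints 15.0
    }
}
```

Because each row sum is independent of the others, the map calls could run on different machines, and only one number per row has to travel to the reducer.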

See more of the Hama - Algorithms

PHP similar_text algorithm

similar_text returns the number of matching characters in two strings. It can also calculate the similarity of the two strings as a percentage: (number of matching characters / average length of the two strings) * 100. For example,


<?php
echo similar_text("Hello World","Hello Peter");
?>
Output: 7

<?php
echo similar_text("Pello World","Hello Weter");
?>
Output: 7. The longest common substring "ello W" contributes 6 matches, and the "r" in what remains to its right adds one more.

<?php
similar_text("Hello World","Hello Peter",$percent);
echo $percent;
?>
Output: 63.6363636364
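
To show where these numbers come from, here is a rough Java sketch of the recursion behind similar_text as I understand it: find the longest common substring, then repeat on the pieces to its left and to its right. The class and method names (SimilarText, similarText, similarPercent) are made up for this example. Note also that PHP compares bytes while this sketch compares Java chars, so results can differ for non-ASCII text.

```java
// Sketch of the longest-common-substring recursion; not PHP's actual source code.
class SimilarText {

    // Counts matching characters: longest common substring first,
    // then the same procedure on the parts to its left and to its right.
    static int similarText(String a, String b) {
        int max = 0, posA = 0, posB = 0;
        // Find the longest common substring; on ties the earliest match is kept.
        for (int i = 0; i < a.length(); i++) {
            for (int j = 0; j < b.length(); j++) {
                int len = 0;
                while (i + len < a.length() && j + len < b.length()
                        && a.charAt(i + len) == b.charAt(j + len)) {
                    len++;
                }
                if (len > max) {
                    max = len;
                    posA = i;
                    posB = j;
                }
            }
        }
        if (max == 0) {
            return 0; // nothing in common, recursion ends here
        }
        int left = similarText(a.substring(0, posA), b.substring(0, posB));
        int right = similarText(a.substring(posA + max), b.substring(posB + max));
        return max + left + right;
    }

    // Similarity in percent: matches divided by the average length, times 100.
    static double similarPercent(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // avoid dividing by zero
        }
        return similarText(a, b) * 2.0 * 100.0 / (a.length() + b.length());
    }

    public static void main(String[] args) {
        System.out.println(similarText("Hello World", "Hello Peter"));    // prints 7
        System.out.println(similarPercent("Hello World", "Hello Peter")); // roughly 63.64
    }
}
```

For "Hello World" and "Hello Peter" the longest common substring is "Hello " (6 characters), and the remaining "World" and "Peter" share only an "r", which gives 7. The percentage is 7 * 2 / (11 + 11) * 100.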

Gmail and Spam Filtering

"Many Google teams provide pieces of the spam-protection puzzle, from distributed computing to language detection. For example, we use optical character recognition (OCR) developed by the Google Book Search team to protect Gmail users from image spam. And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam," explains Google. "Gmail supports multiple authentication systems, including SPF (Sender Policy Framework), DomainKeys, and DKIM (DomainKeys Identified Mail), so we can be more certain that your mail is from who it says it's from. Also, unlike many other providers that automatically let through all mail from certain senders, making it possible for their messages to bypass spam filters, Gmail puts all senders through the same rigorous checks."

See also:

- Official Gmail Blog: How our spam filter works
- A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce
- or Parallelizing Support Vector Machines on Distributed Computers
- Sender Reputation in a Large Webmail Service
- Spam Filtering using Google/GMAIL

Movie: Gran Torino

This movie was very impressive; it made me think of my father and my life.

Hadoop Install

Many Korean people still ask me how to download and install Hadoop. This post describes how to install and configure a Hadoop cluster. If you are comfortable with English, see the links below:

- Hadoop: Quick Start
- Hadoop: Cluster Setup

Introduction to Hadoop

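Hadoop is a Java software framework developed to build large virtualized storage out of ordinary PC-class machines and to process the huge data sets stored in it in parallel. A simplified parallel processing model such as MapReduce helps you develop complex parallel programs in a lightweight way, and the storage layer's fault-tolerance model makes service operation and maintenance easier.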

Requirements

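Before installing Hadoop: as mentioned above, Hadoop is a Java software framework, so JDK 1.6.x or later is required (as of Hadoop 0.19.x and later). The master and the slaves communicate over SSH, so SSH must be installed as well, and it is a good idea to copy the public key to each host in advance so that connections are accepted automatically without a password prompt.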


$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/udanax/.ssh/id_dsa):
Created directory '/home/udanax/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/udanax/.ssh/id_dsa.
Your public key has been saved in /home/udanax/.ssh/id_dsa.pub.
The key fingerprint is: blah~ blah~
$ _

$ cat ~/.ssh/id_dsa.pub | ssh id@host "cat >> .ssh/authorized_keys"
password: enter the password
$ _

Installation / Configuration

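Download Hadoop here. All you need to do is extract the archive into a suitable directory (the same one on every server). After that, set up the cluster through the configuration files, which are located in ${HADOOP_HOME}/conf.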


$ cd ${HADOOP_HOME}
$ vi ./conf/hadoop-env.sh
 # Set the Java home path
 export JAVA_HOME=/usr/local/java
 
$ vi ./conf/masters
 # Write the host name of the master server in this file
 MASTER_SERVER_NAME
 e.g. master.udanax.org

$ vi ./conf/slaves
 # List the slaves in this file, one per line
 SLAVE_SERVER_NAME_1
 SLAVE_SERVER_NAME_2
 e.g. slave1.udanax.org

Then open the hadoop-site.xml file and set the basic paths and URLs as shown below.

- hadoop.tmp.dir : data storage directory
- fs.default.name : HDFS file system address (the master server running the NameNode)
- mapred.job.tracker : JobTracker address (the server running the JobTracker)


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/udanax/hadoop-storage</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master.udanax.org:9000/</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master.udanax.org:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>3</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
</configuration>
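
Before starting the cluster, it can help to list the properties that a file like the one above actually defines, for instance to catch a typo in a name. The following is only a sketch using the JDK's built-in XML parser; it is not Hadoop's own configuration loader, and the names SiteConfigReader and readProperties are made up for this example. The XML string in main is a shortened copy of the settings above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Reads <property><name>/<value> pairs from hadoop-site.xml-style text using only the JDK.
class SiteConfigReader {

    // Returns the name/value pairs in document order.
    static Map<String, String> readProperties(String xml) {
        Map<String, String> props = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = text(property, "name");
                String value = text(property, "value");
                if (name != null && value != null) {
                    props.put(name, value); // properties missing a name or value are skipped
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
        return props;
    }

    // Trimmed text of the first descendant element with the given tag, or null if absent.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<configuration>\n"
                + "<property><name>fs.default.name</name>"
                + "<value>hdfs://master.udanax.org:9000/</value></property>\n"
                + "<property><name>mapred.job.tracker</name>"
                + "<value>master.udanax.org:9001</value></property>\n"
                + "<property><name>dfs.replication</name><value>3</value></property>\n"
                + "</configuration>";
        for (Map.Entry<String, String> e : readProperties(xml).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

To check a real file, read its contents into a string and pass that to readProperties instead of the inline sample.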

Running


$ cd ${HADOOP_HOME}
$ ./bin/hadoop namenode -format
$ ./bin/start-all.sh

$ ./bin/hadoop jar hadoop-example.jar pi 10 10
