A Plan for the New Year

This year has been frantically busy for me. (My 2008 plan worked out pretty well.)

- My position change has been made.
- I've been building up the Hama/Heart projects as open source.
- A New York local chapter of the Heart project was organized.

I believe that's a good start even if there are no results yet. For the new year,

- I'll improve our projects.
- I've been thinking about taking a job in the US next year.

And I want my parents to stay healthy during 2009. :-)

BigTable for GMail


Aram said...

Does anyone know why Google does not use Bigtable for Gmail?

Rob Kohr said...

@aram

How do you know they don't?

Aram said...

@ Rob

I don't remember where I read it for the first time. But I asked Jeff Dean
this question. He said maybe Gmail is older than Bigtable, which I don't
think is the real reason, because Orkut is even older than Gmail, and it
does use Bigtable. He also mentioned it could be because of the size of
Gmail, not a surprising answer to me.

-- http://glinden.blogspot.com/2005/09/googles-bigtable.html

Personally, I think they do use Bigtable for Gmail, but they don't answer because email is a delicate matter and there is no need to highlight it.

Recently, I've been considering storing large-scale webmail data on Hbase. I expect it to address both real-time and batch issues, as well as cost-effective management of physical and human infrastructure resources -- ROI, and how sustaining the spam/system load impacts corporate revenue and productivity.

http://wiki.apache.org/hama/HbaseForWebMail

This is intended to explain and illustrate the concept of Hadoop/Hbase-based web-mail storage. There are three main parts:

- Stable, reliable, fault-tolerant system
- Scalability
- Efficiency & cost-effectiveness

These are simply inherited from Hadoop/Hbase by the web-mail storage.
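
Just as a rough illustration (my own sketch, not the layout on the wiki page), a web-mail table on Hbase might be laid out like this:

    Table: webmail
      Row key:               <userId>/<reversed receive-timestamp>
      Column family msg:     msg:from, msg:subject, msg:body
      Column family status:  status:read, status:spam, status:folder

With a layout like that, single-row reads would serve the real-time side, while full-table MapReduce scans over the msg: family would serve the batch/spam-analysis side.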

Google adds 'promote' and 'remove' buttons to search results

Google has added 'promote' and 'remove' buttons to search results. Commenting is also possible. My wild guess is that it may be a system for collecting votes, which could be used for many kinds of applications (e.g. finding dead links, personalized applications, etc.).

'BigTable' talk with a Google engineer

@ google open source talkfest.

Edward Yoon :

IMO, the BigTable array structure seems similar to a multi-dimensional database, so I thought of it as a data layer for bulk/analytical processing. Can you tell me how BigTable is used at Google? Is BigTable suitable for storing any kind of data?

Google Engineer :

BigTable is one of our storage solutions, and it is used for all sorts of things. It handles high-frequency read/write workloads well, so BigTable suits both real-time services and large-scale processing. But there's no objective data.

Edward Yoon :

OK. BigTable entries are expressed as row-column-time, so I thought it could be used for financial engineering, topological algebra, and linear algebra processing. Can you give me some examples if that's true?

Google Engineer :

Yes, it's possible. But giving concrete examples is beyond my ability.

NHN DeView 2008

NHN Corporation has officially started to share things with outside developers (perhaps it already had... I'm one of NHN Corp. myself). Because open source has become an essential component, open-sourcing is attracting tremendous interest.

NHN DeView is a chance to learn about NHN developer products (zeroboard, cubrid, nForge, etc.) from the engineers who built them. Cubrid is an open source database, and nForge is NHN's open source project hosting site.

This is good news. However, IMO, if they bring only value/marketing/strategy-oriented thinking, it will end in empty talk.

MATLAB becoming more cluster and grid friendly

According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and MATLAB Distributed Computing Server with EGEE's middleware, gLite.

http://insidehpc.com/2008/10/22/matlab-becoming-more-cluster-and-grid-friendly/

Parallel Matrix Multiply on Hadoop Map/Reduce: 2D blocking

In matrix multiplication, calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all rows (m rows) of A once and re-reading one column of B m times. Thus, calculating n columns of C requires reading all rows of A n times and all columns of B m times.

In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,

C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)

This is the well-known cache blocking technique for efficient memory access. By analogy, we can think of a memory blocking technique for efficient disk access and parallel (or distributed) processing on Hadoop MapReduce, as described below.

Each mapper performs one block-by-block multiplication, and the reducer sums up all the partial sub-matrices.


map(BlockID, TwoDArrayWritable)
-> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...

reduce(BlockID', <TwoDArrayWritable>*)
-> <C_block(1,1), A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)>, ...


In my own work it runs great, but I'm not sure it can be auto-optimized.
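
For reference, here is a minimal sketch of that idea in Java against the old org.apache.hadoop.mapred API. It assumes square blocks and a hypothetical InputFormat (not shown) that delivers each A block stacked on top of its matching B block, keyed by the id of the C block they contribute to; none of this is Hama code.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.TwoDArrayWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class BlockMultiply {

      /**
       * Multiplies one pair of blocks. Each input record holds an n x n
       * A block stacked on top of the matching n x n B block, keyed by
       * the id of the C block this partial product contributes to.
       */
      public static class Map extends MapReduceBase
          implements Mapper<Text, TwoDArrayWritable, Text, TwoDArrayWritable> {

        public void map(Text cBlockId, TwoDArrayWritable stackedBlocks,
            OutputCollector<Text, TwoDArrayWritable> output, Reporter reporter)
            throws IOException {
          Writable[][] rows = stackedBlocks.get();
          int n = rows.length / 2;
          DoubleWritable[][] c = new DoubleWritable[n][n];
          for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
              double sum = 0;
              for (int k = 0; k < n; k++) {
                sum += ((DoubleWritable) rows[i][k]).get()       // A_block[i][k]
                     * ((DoubleWritable) rows[n + k][j]).get();  // B_block[k][j]
              }
              c[i][j] = new DoubleWritable(sum);
            }
          }
          TwoDArrayWritable partial = new TwoDArrayWritable(DoubleWritable.class);
          partial.set(c);
          output.collect(cBlockId, partial);
        }
      }

      /** Sums up all partial products that belong to the same C block. */
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, TwoDArrayWritable, Text, TwoDArrayWritable> {

        public void reduce(Text cBlockId, Iterator<TwoDArrayWritable> partials,
            OutputCollector<Text, TwoDArrayWritable> output, Reporter reporter)
            throws IOException {
          double[][] sum = null;
          while (partials.hasNext()) {
            Writable[][] p = partials.next().get();
            if (sum == null) {
              sum = new double[p.length][p[0].length];
            }
            for (int i = 0; i < p.length; i++) {
              for (int j = 0; j < p[i].length; j++) {
                sum[i][j] += ((DoubleWritable) p[i][j]).get();
              }
            }
          }
          DoubleWritable[][] c = new DoubleWritable[sum.length][sum[0].length];
          for (int i = 0; i < sum.length; i++) {
            for (int j = 0; j < sum[i].length; j++) {
              c[i][j] = new DoubleWritable(sum[i][j]);
            }
          }
          TwoDArrayWritable result = new TwoDArrayWritable(DoubleWritable.class);
          result.set(c);
          output.collect(cBlockId, result);
        }
      }
    }

A combiner that does the same summation as the reducer would cut the shuffle volume whenever several partial products for the same C block are produced on one node.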

Sun Tech Day 2008 Seoul

Sun Tech Days kicked off a few days ago in Seoul: over 2,000 attendees, completely charged, willing to discuss their issues, an amazing amount of energy, and all there to learn about different Sun technologies.

The Zembly session (Sang Shin) was packed with 400 attendees and there were attendees all over the floor.

And I took pictures with Sang Shin and Rich Green.

Hadoop 0.18.x: performance improved

After updating Hama from Hadoop 0.17.x to 0.18.x, the speed of the JUnit tests and some processing improved dramatically without any Hama algorithm changes.


Before:

[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec

After:

[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec

Here are the issues that impact 0.18.x.

The official Heart project group blog.

We published an official Heart project blog; it will show you how we work.

Heart is the Highly Extensible & Available RDF Table project, which develops a Hadoop subsystem for RDF data storage and a distributed processing engine that uses Hbase + MapReduce to store and process RDF data.

BigTable & Hbase, October 1, 2008 @ KAIST

http://wiki.apache.org/hama-data/attachments/Presentations/attachments/BigTable_and_Hbase.pdf

Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One was "Bayesian Spam Filtering".
Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words)=Pr(words|spam)Pr(spam)/Pr(words)
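
For illustration with made-up numbers: if the word 'viagra' appears in 50% of spam, 20% of all mail is spam, and the word appears in 11% of all mail, then Pr(spam|'viagra') = 0.5 * 0.2 / 0.11 ≈ 0.91.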

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the per-category word probability P(word|category) from files of categorized (bad/good) e-mails, as described below:

Update: emit <category, probability> pairs and have the reducer simply sum up
the probabilities for a given category.

Then it'll be simpler. :)

Map:
    /**
     * Counts word frequency. Assumes the enclosing class defines
     * splitregex (e.g. "\\s+"), wordregex (e.g. Pattern.compile("\\w+")),
     * the constant count = new IntWritable(1), and spamTotal, the running
     * total of words seen in this category.
     */
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          spamTotal++;
          output.collect(new Text(word), count);
        }
      }
    }
Reduce:
    /**
     * Computes bad (or good) word count / total bad (or good) words.
     * spamTotal is the total word count of the category; in practice it
     * has to be made available to the reducer (e.g. through the JobConf).
     */
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      output.collect(key, 
        new FloatWritable((float) sum / spamTotal));
    }
2. For each word key we can get its rBad/rGood values from the two data sets (the per-category word probabilities). Once we are finished adding words, we finalize the results, much like a join map/reduce, as described below:
    /**
     * Implements Bayes' rule to compute how likely this word is "spam",
     * clamping the probability to the range [0.01, 0.99].
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }

An automated CI server for Hama integration builds

The Apache Hama project is now using Hudson for our nightly/patch builds, as the Hadoop project does.

The logic is simple:

All the Jira state changes are sent to the Hama mailing list. A daemon on the build machine is subscribed to the mailing list and processes the mail using the processEmail.sh script. The patch "queue" created by that script is administered by the Hama-Patch-Admin build, which runs the QueueAdmin.sh script every few minutes. When there is a patch to test, the Hama-Patch build is kicked off, running the BuildPatch.sh script.

Typing my name into Google

His name is Wesley Gibson (Wanted, 2008). And my name is Edward J. Yoon (윤진석).
According to the Google search engine, my name is related to NHN.

I write under various names (e.g. edward, udanax, 윤진석, etc.). However, since I have no activity in local communities, my Korean name (윤진석) and ID (Udanax) are dying out, so I guess the number of times a search query happens is used for the related keywords.

ScaLAPACK

I found something amazing.

Jaeyoung Choi (a professor at Soongsil University) is a member of the ScaLAPACK team. :) I would like to meet him.

Current Hama mailing list status

Mailing list hama-commits @ incubator.apache.org

Traffic statistics cover a total of 22 days.

Current subscribers: 10
Current digest subscribers: 0
Total posts (22 days): 25
Mean posts per day: 1.14



--------------------------------------------------------------------------------
Mailing list hama-dev @ incubator.apache.org

Traffic statistics cover a total of 22 days.

Current subscribers: 17
Current digest subscribers: 0
Total posts (22 days): 63
Mean posts per day: 2.86



--------------------------------------------------------------------------------
Mailing list hama-user @ incubator.apache.org

Traffic statistics cover a total of 22 days.

Current subscribers: 14
Current digest subscribers: 0
Total posts (22 days): 3
Mean posts per day: 0.14

DNA Jigsaw Puzzle

A new mathematical and statistical method allows the virus population in a diseased organism to be determined quickly and economically. Using this method, medicines and vaccines against diseases caused by viral infections could be developed and deployed in a more targeted way in the future.
Through their diversity resulting from continuous mutation, viruses easily develop drug resistance. This is also why the manufacture of a vaccine against HIV has been unsuccessful up to now. To bring both under control, the strains of virus present in the host must be known. A new method developed by researchers from Switzerland and America now promises help in identifying diverse virus populations.
The method is based on a next generation, high throughput DNA sequencing technique called pyrosequencing. Niko Beerenwinkel, Assistant Professor at the Department of Biosystems of ETH Zurich, explains that this involves a technique that has been in use since 2003, with which the sequencing can be carried out efficiently and economically. He is a co-author of a study published recently in the scientific journal PLoS Computational Biology, in which the researchers successfully identified the virus strains of four patients infected with the HIV virus.

Light signals identify structural elements

In their study, the scientists used the pyrosequencing technique to examine the DNA of HIV/AIDS viruses. As Beerenwinkel explains: “This involves determining the sequence of the DNA modules by synthesising the complementary DNA strand. Each newly incorporated base is recognised by a light signal. This method is reliable only for short DNA segments, but at the same time it can be highly parallelised, which ultimately leads to a very large number of short segments.”
The scientists now used a computer-assisted method which they had developed to combine together the DNA segments from the HIV samples to form virus strains. This involves piecing together matching DNA segments like a jigsaw puzzle to form the complete sequences of the various strains. Beerenwinkel explains that one does not know beforehand either how many or which strains the sample contains, or which DNA segments belong to the same strain.

Error rate minimized

But, according to Beerenwinkel, the disadvantage would be that the segments are very short and may conceal a high error rate. However, by using the mathematical and statistical error correction tools developed by the researchers, they were able to reduce the error rate with their method by a factor of 30, thus determining reliably the virus strains belonging to the DNA segments. This is shown by a comparison with conventional methods in which long DNA segments can be determined with high precision.
For this purpose they chose individual viruses at random from the population and used traditional methods to determine the sequence of the individual DNA structural elements. According to Beerenwinkel, this revealed that the results were very similar, but the pyrosequencing method works faster and more economically and – in contrast to conventional methods – enables entire virus populations to be identified more easily.
The method that has been developed allows the efficient use of new sequencing techniques with which the genetic diversity of the whole virus population of an infected patient can be determined. This can mean a big step forward for use of medicines to treat virus diseases and for vaccine development. This is because it could enable medicines to be used in a more targeted way, thus preventing the development of resistance. It could also facilitate the development of vaccines. — ScienceDaily (May 5, 2008)

Hama Incubation vote has started.

The domain of Hama (parallel matrix computation) is one of the most mammoth areas of scientific research. Therefore, I will seek advice for the Hama project from KAIST (Department of Mathematical Sciences) and various laboratories.

And the Hama Incubation vote has started. It's my first time proposing a project to Apache, so I'm a bit nervous. :)

Semantic Indexing

Different forms of a term (e.g. singular, plural, synonyms, polysemes, etc.) can end up classified as separate keywords even though they carry the same meaning. So the use of semantic indexing seems to improve results in the sample below. For example (singular/plural): I heard a song on the radio that I wanted to hear again, so I tried to search for it using the phrase 'picture of you', which is part of the song's lyrics.

1) When I used the Naver search engine, I couldn't find any clue about that song even after turning through pages; the song's title was 'Pictures of You'.

-- For terms that express the same meaning even though they take different surface forms (singular, plural, synonyms, lexical variants, etc.), Naver search does not handle 'picture', its plural 'pictures', and the synonym '사진' (photo) as the same index term 'picture'. So even when searching for 'picture', it appears that documents containing 'pictures' cannot be retrieved.



2) However, when I used the Google search engine, I could find singular/plural forms of 'picture', synonyms, and other relevant documents in the top 10 list. I guess this is the effect of LSI (Latent Semantic Indexing).

-- Google, on the other hand, uses LSI techniques to surface singular/plural forms as well as synonyms and polysemes, distributed appropriately within the top 10 list.




So, what effect does Latent Semantic Indexing have on Google ranking? There are many references that touch on the subject of Google's LSI.

Here are a few:
- How Does Google Use Latent Semantic Indexing (LSI)?
- Google Semantically Related Words & Latent Semantic Indexing Technology

Network Cost.

To reduce the number of remote calls and avoid the associated overhead (network cost), what should we do? Maybe we need a high-performance parallel optimizer for vague/medium-sized operations, but I lack experience there. Now I'm waiting for the large test cluster with a gigabit switch. :)

PSVM: Parallelizing Support Vector Machines on Distributed Computers

http://books.nips.cc/papers/files/nips20/NIPS2007_0435.pdf

Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides

http://research.yahoo.com/node/2104

I couldn't go there; I received official notice from the Ministry of Defense that I was summoned to army reserve forces training on March 25th.

Sherpa: Cloud computing of the third kind

Raghu (a former professor at the University of Wisconsin-Madison, now at Yahoo!) is leading a very interesting project on large-scale storage (Sherpa). Here you can find some of my unconnected notes.

Software as a service requires both CPU and data. Cloud computing is often equated with Map-Reduce grids, but those decouple computation and data. For instance, Condor is great for high-throughput computing, but on the data side you run into SSDS, Hadoop, etc. There is a third kind: transactional storage. Moreover, SQL is the most widely used parallel programming language, and Raghu wonders why we can't build on the lessons learned from RDBMSs for OLTP. Sherpa aims not to support full ACID semantics, but to be massively scalable via relaxation. Updates: creation or simple object updates. Queries: selection with filtering. The vision is to start in a box; if it needs to scale, that should be transparent.

PNUTS is part of Sherpa, and it is the technology for geographic replication, uniform updates, queries, and flexible schemas. He then goes on to describe the inner parts of PNUTS and the software stack. Some interesting notes: no logging, message validation, no traditional transactions. The lower levels put and get a key; on top of that, ranges and sorted access; PNUTS on top provides the querying facility (insert, select, remove). Flexible schemas: the fields are declared at the table level, but do not need to be present (flexible growth). Records are mastered on different nodes, and during use the masters can migrate depending on usage. The basic consistency model is based on a timeline: master writes and reads are ordered, and others can catch up in time. Load balancing is done by splitting and migration, and guaranteed by the Yahoo! Message Broker. The goal: simple, light, massively scalable.
Raghu Angadi and Raghu Ramakrishnan are different people with the same first name.

SQL Server Data Services (SSDS)

SQL Server Data Services (SSDS) are highly scalable, on-demand data storage and query processing utility services. Built on robust SQL Server database and Windows Server technologies, these services provide high availability and security, and support standards-based web interfaces for easy programming and quick provisioning.
SSDS homepage : http://www.microsoft.com/sql/dataservices/default.mspx

It's highly scalable, web-facing data storage, and the SSDS data model really takes after Google's BigTable.

* Container: a set of entities
* Entity: a flat bag of scalar properties
* Property: name/value pairs
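
As a made-up illustration of that model (the names are mine, not from the SSDS documentation):

    Container: "edward-mail"
      Entity: "msg-001"
        Property: subject = "Hello"
        Property: from = "friend@example.com"
        Property: size = 2048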

My Identity

I had a very rough time recently. It was a difficult choice for me to leave one thing I loved to go to another thing I loved. However, I learned the most important thing: I should keep my identity.

Now I'm working hard to realize my dream.

Bean ID overlap-addition

I've run into a problem where the same bean ID gets defined more than once by different developers on one project. When that happens, Spring returns the bean defined last, without any exception. Isn't that problematic?

Here's the Q&A page for this issue:
http://forum.springframework.org/showthread.php?t=52696

Hadoop summit

Hadoop Overview - Doug Cutting and Eric Baldeschwieler

Doug Cutting - pretty much the father of Hadoop - gave an overview of Hadoop history. An interesting comment was that Hadoop achieved web scale in early 2008...

Eric14 (Eric B...): Grid computing at Yahoo. 500M unique users per month, billions of interesting events per day. 'Data analysis is the inner loop' at Yahoo.

Y's vision and focus: on-demand shared access to a vast pool of resources, support for massively parallel execution, Data Intensive Super Computer (DISC), centrally provisioned and managed, service oriented... Y's focus is not grid computing in terms of Globus, etc., and not external usage a la Amazon EC2/S3. The biggest grid is about 2,000 nodes.

Open Source Stack: commitment to open source development; Yahoo is an Apache Platinum Sponsor.

Tools used to implement Yahoo's grid vision: Hadoop, Pig, Zookeeper (high-availability directory and config services), Simon (cluster and app monitoring).

Simon: Very early days, internal to Yahoo right now...similar to Ganglia "but more configurable". Highly configurable aggregation system - gathering data from various nodes to produce (customizable?) reports.

HOD - Hadoop On Demand. The current Hadoop scheduler is FIFO - jobs will run in parallel to the extent that the previous job doesn't saturate the nodes. HOD is built on Torque (www.clusterresources.com) to build virtual clusters, separate file systems, etc. Yahoo has taken its development about as far as they want... cleaning up code, etc. The future direction for Yahoo is to invest more heavily in the Hadoop scheduler. Does HOD disrupt data locality? Yup, it does. Good news: future Hadoop work will improve rack locality handling significantly.

Hadoop, HOD, and Pig are all part of Apache today.

Multiple grids inside Yahoo: tens of thousands of nodes, hundreds of thousands of cores, TBs of memory, PBs of disk...ingests TBs of data daily.

M45 Project: Open Academic Cluster in collaboration with CMU: 500 nodes, 3TB RAM, 1.5P disk, high bandwidth, located conveniently in a semi-truck trailer.

Open source project and Apache: the goal is for Hadoop to remain a viable open source project. Yahoo has invested heavily... very excited to see additional contributors and committers. "Yahoo is very proud of what we've done with Hadoop over the past few years." Interesting metric: megawatts of Hadoop.

Hadoop Interest Growing: 400 people expressed interest in today's conference, 28 organizations registered their Hadoop usage/cluster, it is in use at universities on multiple continents, and Yahoo has now started hiring employees with Hadoop experience.

GET INVOLVED: Fix a bug, submit a test case, write some docs, help out!

Random notes: more than 50% of conference attendees are running Hadoop, many with grids of more than 20 nodes, and several with grids > 100 nodes.

Yahoo just announced collaboration with Computational Research Labs (CRL) in India to "jointly support cloud computing research"...CRL runs EKA - the 4th fastest supercomputer on the planet.

Moving

I'll move from the R&D center to the service development center.
Therefore, I'm no longer working on open source as a full-time developer.
But I'll continue, because I love it.

..┏┓┏┓..
┏┻┫┣┻┓
┃━┫┣━┃
┃━┫┣━┃
┗━┛┗━┛

Language Support Installation On Linux

To install additional language support from the Languages group, use Pirut via Applications > Add/Remove Software, or run this command:

su -c 'yum groupinstall {language}-support'
ex) bash-3.00# su -c 'yum groupinstall korean-support'

In the command above, {language} is one of assamese, bengali, chinese, gujarati, hindi, japanese, kannada, korean, malayalam, marathi, oriya, punjabi, sinhala, tamil, thai, or telegu.

Users upgrading from earlier releases of Fedora are strongly recommended to install scim-bridge-gtk, which works well with 3rd party C++ applications linked against older versions of libstdc++.

To add SCIM support to input a particular language, install scim-lang-LANG, where LANG is one of assamese, bengali, chinese, dhivehi, farsi, gujarati, hindi, japanese, kannada, korean, latin, malayalam, marathi, oriya, punjabi, sinhalese, tamil, telugu, thai, or tibetan.

What is a Blog?

A blog is a species of interactive electronic diary by means of which the unpublishable, untrammeled by editors or the rules of grammar, can communicate their thoughts via the web. (Though it sounds like something you would find stuck in a drain, the ugly neologism blog is a contraction of "web log.") Until recently, I had not spent much time thinking about blogs or Blog People. But I know that this has become a key factor in the rapid rise of web culture, and that's why I should participate in this meeting.