Posts

Showing posts from 2008

A Plan for the New Year

This year has been frantically busy for me. (My 2008 plan worked out pretty well.)

- My position change has been made.
- I've been forming up the Hama/Heart projects as open source.
- A New York local chapter of the Heart project was organized.

I believe that's a good start, even if there are no results yet. For this new year,

- I'll improve our projects.
- I've been thinking about taking a job in the US next year.

And, I want my parents to be healthy during 2009. :-)

BigTable for GMail

Aram said...

Does anyone know why Google does not use Bigtable for Gmail?

Rob Kohr said...

@aram

How do you know they don't?

Aram said...

@ Rob

I don't remember where I read it for the first time, but I asked
Jeff Dean this question. He said maybe Gmail is older than Bigtable,
which I don't think is the real reason, because Orkut is even older than
Gmail, and it does use Bigtable. He also mentioned it could be because of
the size of Gmail, not a surprising answer for me.

-- http://glinden.blogspot.com/2005/09/googles-bigtable.html

Personally, I think they do use BigTable for Gmail, but they won't say so because email is a delicate matter and there is no need to highlight it.

Recently, I've been considering storing large-scale webmail data in HBase. I expect it to address both real-time and batch issues, as well as cost-effective management of physical/human infrastructure resources -- ROI and how spam/system sustaining impacts corporate revenue and pr…

Google adds 'promote' and 'remove' buttons to search results

Google adds 'promote' and 'remove' buttons to search results. Commenting is also possible. My wild guess is that it may be a system for collecting votes, which could be used for many kinds of applications (e.g. finding dead links, personalized applications, etc.).

'BigTable' Talk with google engineer

@ google open source talkfest.

Edward Yoon :

IMO, the BigTable array structure seems similar to a multi-dimensional database, so I thought of it as a data layer for bulk/analytical processing. Can you tell me how BigTable is used at Google? Is BigTable suitable for storing any kind of data?

Google Engineer :

BigTable is one of our storage solutions, and it is used for all sorts of things. It handles high-frequency read/write workloads well, so BigTable suits both real-time services and large-scale processing. But there's no objective data.

Edward Yoon :

OK. BigTable entries are expressed as (row, column, time), so I thought it could be used for financial engineering, topological algebra, and linear algebra processing. Can you give me some examples if that's true?

Google Engineer :

Yes, it's possible, but giving examples is beyond my ability.

NHN DeView 2008

NHN corporation has officially started (perhaps it already had... since I'm one of NHN's people) to share things with outside developers. Because open source has become an essential component, open-sourcing efforts are attracting tremendous interest.

NHN DeView is a chance to learn about NHN developer products (Zeroboard, CUBRID, nForge, etc.) from the engineers who built them. CUBRID is an open source database, and nForge is NHN's open source project hosting web site.

This is good news. However, IMO, if they have only value/marketing/strategy-oriented thinking, it'll end in empty talk.

MATLAB becoming more cluster and grid friendly

According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and MATLAB Distributed Computing Server with EGEE's middleware, gLite.

http://insidehpc.com/2008/10/22/matlab-becoming-more-cluster-and-grid-friendly/

Parallel Matrix Multiply on Hadoop Map/Reduce: 2D blocking

In matrix multiplication, calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all rows (m rows) of A once and re-reading one column of B m times. Thus, calculating n columns of C requires reading all rows of A n times and all columns of B m times.

In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,

C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)

This is the well-known cache blocking technique for efficient memory access. We can then think of a memory blocking technique for efficient disk access and parallel (or distributed?) execution on Hadoop MapReduce, as described below.

Each mapper processes a matrix multiplication of blocks, and the reducer sums up all the sub-matrices.


map(BlockID, TwoDArrayWritable)
-> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...

reduce(BlockID’, <TwoDArrayW…
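The map/reduce flow above can be simulated outside Hadoop. Here is a minimal Python sketch; the (i, j) block IDs, the 2x2 block partitioning, and the helper functions are my own illustrative assumptions, not Hama code:

```python
def mat_mul(X, Y):
    # Plain (sub-)matrix product of two lists-of-lists.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mapper(A, B, nb):
    # Emit <C_block(i,j), A_block(i,k) * B_block(k,j)> for every k.
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                yield (i, j), mat_mul(A[i][k], B[k][j])

def reducer(pairs):
    # Sum all partial products that share the same C-block ID.
    out = {}
    for key, part in pairs:
        out[key] = mat_add(out[key], part) if key in out else part
    return out

def split(M):
    # Split a 4x4 matrix into a 2x2 grid of 2x2 blocks.
    return [[[row[2*j:2*j+2] for row in M[2*i:2*i+2]] for j in range(2)]
            for i in range(2)]

# Usage: multiply A by 2*I, so each C block is just 2 * A_block.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
twoI = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2]]
C = reducer(mapper(split(A), split(twoI), 2))
assert C[(0, 0)] == [[2, 4], [10, 12]]
```

In the real job, each `mat_mul` call would run in its own map task and the shuffle phase would group the partial products by block ID before the reduce-side summation.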

Sun Tech Day 2008 Seoul

Sun Tech Days kicked off a few days ago in Seoul: over 2,000 attendees, completely charged, willing to discuss their issues, an amazing amount of energy, and all there to learn about different Sun technologies.

The Zembly session (Sang Shin) was packed with 400 attendees and there were attendees all over the floor.

And I took pictures with Sang Shin and Rich Green.

Gripping links

Hyunsik Choi sent me these links.

http://www.informatik.uni-trier.de/~ley/db/
http://dblp.uni-trier.de/xml/

We thought this data could be used nicely for building/exploring RDF data.

Hadoop 0.18.x performance improvements

After updating Hama from Hadoop 0.17.x to Hadoop 0.18.x, the speed of the JUnit tests and some processing improved dramatically without any Hama algorithm changes.


Before:

[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec

After:

[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec

Here are the issues that impact 0.18.x:

The official Heart project group blog.

We have published the official Heart project blog; it will show you how we work.

Heart is a Highly Extensible & Available RDF Table project, which develops a Hadoop subsystem for RDF data storage and a distributed processing engine that uses HBase + MapReduce to store and process RDF data.

BigTable & Hbase, October 1, 2008 @ KAIST

A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One of them was Bayesian spam filtering.

Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email: Pr(spam|words) = Pr(words|spam) Pr(spam) / Pr(words)

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good e-mails) as described below:

Update: --emit <category,probability> pairs and have the redu…
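As a plain, non-distributed sketch of step 1 and the Bayes combination above (the toy corpora, the smoothing constant for unseen words, and the two-category normalization are my own illustrative assumptions, not the Hadoop implementation):

```python
from collections import Counter

def word_probs(messages):
    # P(word | category): relative word frequency within one category's corpus.
    counts = Counter(w for msg in messages for w in msg.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def spam_score(words, p_spam_words, p_good_words, p_spam=0.5):
    # Naive-Bayes combination over the two categories:
    # Pr(spam|words) = Pr(words|spam)Pr(spam)
    #                  / (Pr(words|spam)Pr(spam) + Pr(words|good)Pr(good))
    ps, pg = p_spam, 1.0 - p_spam
    for w in words:
        ps *= p_spam_words.get(w, 1e-6)  # tiny floor for unseen words
        pg *= p_good_words.get(w, 1e-6)
    return ps / (ps + pg)

# Usage with tiny hand-made corpora:
spam = ["buy viagra now", "cheap viagra offer"]
good = ["meeting agenda attached", "lunch offer tomorrow"]
score = spam_score(["viagra", "offer"], word_probs(spam), word_probs(good))
assert score > 0.5  # 'viagra' appears only in the spam corpus
```

On Hadoop, the `word_probs` step is what the mappers would compute per category, with the reducers aggregating the counts across splits.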

Added Hama site link to Hadoop website.

The Hama project is now linked from the related projects tab on the official Hadoop web site.

An automated CI server for Hama integration builds

The Apache Hama project is now using Hudson for our nightly/patch builds, as the Hadoop project does.

The logic is simple:

All Jira state changes are sent to the Hama mailing list. A daemon on the build machine is subscribed to the mailing list and processes the mail using the processEmail.sh script. The patch "queue" created by that script is administered by the Hama-Patch-Admin build, which runs the QueueAdmin.sh script every few minutes. When there is a patch to test, the Hama-Patch build is kicked off, running the BuildPatch.sh script.

Type in my name in Google

His name is Wesley Gibson (Wanted, 2008). And my name is Edward J. Yoon (윤진석).
According to the Google search engine, my name is related with NHN.

I write under various names (e.g. edward, udanax, 윤진석, etc). However, since I have no activities in local communities, my Korean name (윤진석) and ID (udanax) are dying out, so I guess the number of times a search query happens is used for related keywords.

Google’s Personalized Page Ranking

Image
http://blog.wired.com/monkeybites/2007/11/google-experime.html

V^T is a historical vector of a person. I just guess it is related to BigTable and other systems.

Hudson configuration

The Apache Hama project is now apparently using Hudson for our nightly builds. Yes, Hudson was already installed on the Apache zone. However, it was difficult to configure a new job. :)

For more information about Hudson:

Documentation: http://hudson.gotdns.com/wiki/display/HUDSON/Use+Hudson
Plugins: http://hudson.gotdns.com/wiki/display/HUDSON/Plugins
Change log: https://hudson.dev.java.net/changelog.html
SVN: https://svn.apache.org/repos/asf/infrastructure/hudson

Apache Con US 2008

- http://apacheconus2008.crowdvine.com/

I want to go to ApacheCon 2008, and I expect an accidental meeting with you!! :)

ScaLAPACK

I found something amazing. Jaeyoung Choi (a professor at Soongsil University) is a member of the ScaLAPACK team. :) I would like to meet him.

Current Hama mailing list status

Image
Mailing list hama-commits @ incubator.apache.org

Traffic statistics cover a total of 22 days.

Current subscribers: 10
Current digest subscribers: 0
Total posts (22 days): 25
Mean posts per day: 1.14



--------------------------------------------------------------------------------
Mailing list hama-dev @ incubator.apache.org

Traffic statistics cover a total of 22 days.

Current subscribers: 17
Current digest subscribers: 0
Total posts (22 days): 63
Mean posts per day: 2.86



--------------------------------------------------------------------------------
Mailing list hama-user @ incubator.apache.org

Traffic statistics cover a total of 22 days.

Current subscribers: 14
Current digest subscribers: 0
Total posts (22 days): 3
Mean posts per day: 0.14

DNA Jigsaw Puzzle

A new mathematical and statistical method allows the virus population in a diseased organism to be determined quickly and economically. Using this method, medicines and vaccines against diseases caused by viral infections could be developed and deployed in a more targeted way in the future.
Through their diversity resulting from continuous mutation, viruses easily develop drug resistance. This is also why the manufacture of a vaccine against HIV has been unsuccessful up to now. To bring both under control, the strains of virus present in the host must be known. A new method developed by researchers from Switzerland and America now promises help in identifying diverse virus populations.
The method is based on a next generation, high throughput DNA sequencing technique called pyrosequencing. Niko Beerenwinkel, Assistant Professor at the Department of Biosystems of ETH Zurich, explains that this involves a technique that has been in use since 2003, with which the sequencing can be carried …

Hama project has been accepted into the Incubator

http://www.mail-archive.com/general@incubator.apache.org/msg17706.html
I'm happy that the Hama project has been accepted into the Apache Incubator. However, it has only just begun.

Hama Incubation vote has started.

The domain of Hama (parallel matrix computation) is one of the most mammoth areas of scientific research. Therefore, I will seek advice for the Hama project from KAIST (Department of Mathematical Sciences) and various laboratories.

And the Hama Incubation vote has started. It's my first time proposing a project to Apache, so I'm a bit nervous. :)

Semantic Indexing

Image
The different forms (e.g. singular, plural, synonyms, polynyms, etc.) of a term can be classified as different keywords even when they carry the same meaning. So the use of semantic indexing seems to improve results in the sample case below. For example (singular/plural): I heard a song on the radio that I wanted to hear again, so I tried to search for this song using the phrase 'picture of you', which is part of the song's lyrics.

1) When I used the Naver search engine, I couldn't find any clue about that song while paging through the results; that song's title was 'Pictures of you'.

-- For terms whose surface forms differ (singular, plural, synonyms, lexical variants) but whose meaning is the same, Naver search does not handle 'picture', its plural 'pictures', and the synonym '사진' (photo) as one index term 'picture'; so even when searching for 'picture', documents containing 'pictures' appear not to be retrievable.



2) However, when I used the Google search engine, I could find the singular/plural forms of 'picture', synonyms, and other relevant documents in the top 10 list. I just g…
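The singular/plural conflation described above is typically done by stemming terms at index time. A minimal sketch using a deliberately crude suffix-stripping rule (my own toy rule, not Naver's or Google's actual method):

```python
def naive_stem(word):
    # Crude suffix stripping for illustration only; real engines use
    # stemmers such as Porter's algorithm, or full lemmatization.
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

def index_terms(text):
    # Index a document (or query) by its stemmed terms.
    return {naive_stem(w) for w in text.split()}

doc = "Pictures of you"
query = "picture of you"
# Both map to the same index terms, so the plural title
# matches the singular query.
assert index_terms(query) <= index_terms(doc)
```

With such normalization, 'picture' and 'pictures' collapse into one index term, which is exactly the behavior the Naver example above was missing.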

Network Cost.

To reduce the number of remote calls and avoid the associated overhead (network cost), what should we do? Maybe we need a high-performance parallel optimizer for vague/medium-sized operations, but I lack the experience. Now I'm waiting for the large test cluster with a gigabit switch. :)

PSVM: Parallelizing Support Vector Machines on Distributed Computers

Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides

http://research.yahoo.com/node/2104

I couldn't go there; I received official notice from the Ministry of Defense that I was summoned to army reserve forces training on March 25th.

Sherpa: Cloud computing of the third kind

Raghu (a former professor at Wisconsin-Madison, now at Yahoo!) is leading a very interesting project on large-scale storage (Sherpa). Here you can find some of my unconnected notes. Software as a service requires both CPU and data. Cloud computing is often assimilated to Map-Reduce grids, but they decouple computation and data. For instance, Condor is great for high-throughput computing, but on the data side you run into SSDS, Hadoop, etc. But there is a third kind: transactional storage. Moreover, SQL is the most widely used parallel programming language. Raghu wonders why we can't build on the lessons learned from RDBMS for OLTP. Sherpa aims not to support full ACID semantics, but to be massively scalable via relaxation. Updates: creation or simple object updates. Queries: selection with filtering. The vision is to start in a box; if it needs to scale, that should be transparent. PNUTS is part of Sherpa, and it is the technology for geographic replication, uniform updates, queries, and flexible schem…

SQL Server Data Services (SSDS)

SQL Server Data Services (SSDS) is a highly scalable, on-demand data storage and query processing utility service. Built on robust SQL Server database and Windows Server technologies, it provides high availability and security, and supports standards-based web interfaces for easy programming and quick provisioning. SSDS homepage: http://www.microsoft.com/sql/dataservices/default.mspx

It's highly scalable, web-facing data storage, and the SSDS data model really takes after Google's BigTable.

* Container: set of entities
* Entity: flat scalar property bags
* Property: name/value pairs

My Identity

I had a very rough time recently. It was a difficult choice to leave one thing I loved for another thing I loved. However, I learned the most important thing: I should keep my identity.

Now I'm working hard to make my dream come true.

Bean ID overlap-addition

I have a problem with bean ID collisions (overlapping additions) by many developers on one project. If an overlapping addition occurs, Spring returns the bean that was defined last, without any exception. Isn't that problematic?
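A minimal sketch of the situation (the file names and bean classes here are hypothetical, for illustration only): when two config files each define a bean with the same id and are loaded into one ApplicationContext, the definition loaded last silently wins.

```xml
<!-- service-a.xml (developer A) -->
<bean id="mailSender" class="com.example.SmtpMailSender"/>

<!-- service-b.xml (developer B), loaded later into the same context -->
<bean id="mailSender" class="com.example.MockMailSender"/>

<!-- context.getBean("mailSender") now returns a MockMailSender,
     with no exception or warning about the overridden definition. -->
```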

Here's the Q&A page for this issue :
http://forum.springframework.org/showthread.php?t=52696

Hadoop summit

Hadoop Overview - Doug Cutting and Eric Baldeschwieler

Doug Cutting, pretty much the father of Hadoop, gave an overview of Hadoop's history. An interesting comment was that Hadoop achieved web scale in early 2008...

Eric14 (Eric B...): Grid computing at Yahoo. 500M unique users per month, billions of interesting events per day. "Data analysis is the inner loop" at Yahoo.

Y!'s vision and focus: on-demand shared access to a vast pool of resources, support for massively parallel execution, Data Intensive Super Computing (DISC), centrally provisioned and managed, service oriented... Y!'s focus is not grid computing in terms of Globus, etc., and not external usage à la Amazon EC2/S3. The biggest grid is about 2,000 nodes.

Open Source Stack: commitment to open source development; Yahoo is an Apache Platinum Sponsor.

Tools used to implement Yahoo's grid vision: Hadoop, Pig, ZooKeeper (highly available directory and config services), Simon (cluster and app monitoring).

Simon: Very early…

A New Parallel Algorithm For Evaluating The Determinant of A Matrix of Order n

Here's the link to the PPT file.
http://www.math.tu-berlin.de/EuroComb05/Talks/Poster/p12-Teimoori_Faal.ppt

I think one of these partitioning ideas can be used for Hama.

Hama, Stepping into Apache Incubator.

Moving

I'll move from the R&D center to the service development center.
Therefore, I'm no longer a full-time open source developer.
But I'll continue, because I love it.

..┏┓┏┓..
┏┻┫┣┻┓
┃━┫┣━┃
┃━┫┣━┃
┗━┛┗━┛

Language Support Installation On Linux

To install additional language support from the Languages group, use Pirut via Applications Add/Remove Software, or run this command:

su -c 'yum groupinstall {language}-support'
ex) bash-3.00# su -c 'yum groupinstall korean-support'

In the command above, {language} is one of assamese, bengali, chinese, gujarati, hindi, japanese, kannada, korean, malayalam, marathi, oriya, punjabi, sinhala, tamil, thai, or telugu.

Users upgrading from earlier releases of Fedora are strongly recommended to install scim-bridge-gtk, which works well with 3rd party C++ applications linked against older versions of libstdc++.

To add SCIM support to input a particular language, install scim-lang-LANG, where LANG is one of assamese, bengali, chinese, dhivehi, farsi, gujarati, hindi, japanese, kannada, korean, latin, malayalam, marathi, oriya, punjabi, sinhalese, tamil, telugu, thai, or tibetan.

What is a Blog?

A blog is a species of interactive electronic diary by means of which the unpublishable, untrammeled by editors or the rules of grammar, can communicate their thoughts via the web. (Though it sounds like something you would find stuck in a drain, the ugly neologism blog is a contraction of "web log.") Until recently, I had not spent much time thinking about blogs or Blog People. But I know that this conference has been a key factor in web culture's rapid rise, which is why I should participate in this meeting.

Google News Personalization System Analysis & Clone Project