Edward J. Yoon's Blog: 10/2008

MATLAB becoming more cluster and grid friendly

According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and MATLAB Distributed Computing Server with EGEE’s middleware, gLite.

http://insidehpc.com/2008/10/22/matlab-becoming-more-cluster-and-grid-friendly/

Parallel Matrix Multiply on Hadoop Map/Reduce: 2D blocking

In matrix multiplication, calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all m rows of A once and re-reading one column of B m times. Thus, calculating n columns of C requires reading all rows of A n times and all columns of B m times.

In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,

C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)

This is the well-known cache blocking technique for efficient memory access. By analogy, we can think of a memory blocking technique for efficient disk access and parallel (or distributed) processing on Hadoop MapReduce, as described below.

Each mapper multiplies one pair of blocks, and each reducer sums the resulting sub-matrices that belong to the same block of C.


map(BlockID, TwoDArrayWritable)
   -> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...

reduce(BlockID’, <TwoDArrayWritable>*) 
   -> <C_block(1,1), A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)>, ...
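The map and reduce steps above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class and method names (BlockedMultiply, block, multiply, addInto, blockedMultiply) are hypothetical, the matrices are square, and their size is a multiple of the block size. The inner loop plays the role of the mappers (one block product each) and the HashMap keyed by block ID plays the role of the reducers (summing partial products per C block).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of 2D-blocked matrix multiplication (hypothetical names, no Hadoop). */
class BlockedMultiply {

  /** Copies the bs x bs block at block coordinates (bi, bj) out of m. */
  static double[][] block(double[][] m, int bi, int bj, int bs) {
    double[][] out = new double[bs][bs];
    for (int r = 0; r < bs; r++) {
      System.arraycopy(m[bi * bs + r], bj * bs, out[r], 0, bs);
    }
    return out;
  }

  /** "Map" step: the product of one A block and one B block. */
  static double[][] multiply(double[][] a, double[][] b) {
    int n = a.length;
    double[][] c = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
          c[i][j] += a[i][k] * b[k][j];
    return c;
  }

  /** "Reduce" step: adds one partial product into the running sum of a C block. */
  static void addInto(double[][] sum, double[][] part) {
    for (int i = 0; i < sum.length; i++)
      for (int j = 0; j < sum.length; j++)
        sum[i][j] += part[i][j];
  }

  /** Multiplies square matrices whose size is a multiple of the block size bs. */
  static double[][] blockedMultiply(double[][] a, double[][] b, int bs) {
    int n = a.length, nb = n / bs;
    Map<String, double[][]> cBlocks = new HashMap<>();
    for (int i = 0; i < nb; i++)
      for (int j = 0; j < nb; j++)
        for (int k = 0; k < nb; k++) {
          // map: emit <C_block(i,j), A_block(i,k) * B_block(k,j)>
          double[][] part = multiply(block(a, i, k, bs), block(b, k, j, bs));
          // reduce: sum all partial products that share the key C_block(i,j)
          double[][] sum = cBlocks.computeIfAbsent(i + "," + j, key -> new double[bs][bs]);
          addInto(sum, part);
        }
    // Assemble the summed C blocks back into one matrix.
    double[][] c = new double[n][n];
    for (int i = 0; i < nb; i++)
      for (int j = 0; j < nb; j++) {
        double[][] blk = cBlocks.get(i + "," + j);
        for (int r = 0; r < bs; r++)
          System.arraycopy(blk[r], 0, c[i * bs + r], j * bs, bs);
      }
    return c;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[][] b = {{5, 6}, {7, 8}};
    // With 1x1 blocks, every element is its own block.
    System.out.println(Arrays.deepToString(blockedMultiply(a, b, 1)));
    // prints [[19.0, 22.0], [43.0, 50.0]]
  }
}
```

In a real job each block product would run in a separate map task and the blocks would be read from disk, so the block size trades the number of tasks against the memory needed per task.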

In my own tests it runs very well, but I'm not sure whether it can be auto-optimized.

Sun Tech Day 2008 Seoul

Sun Tech Days kicked off a few days ago in Seoul, with over 2,000 attendees, all fully charged, willing to discuss their issues, bringing an amazing amount of energy, and there to learn about different Sun technologies.

The Zembly session (Sang Shin) was packed with 400 attendees and there were attendees all over the floor.

I also took pictures with Sang Shin and Rich Green.

Some gripping links

Hyunsik Choi sent me these links.

http://www.informatik.uni-trier.de/~ley/db/
http://dblp.uni-trier.de/xml/

We think this data could be put to good use for building and exploring RDF data.

Hadoop 0.18.x: performance improved

After updating Hama from Hadoop 0.17.x to Hadoop 0.18.x, the JUnit tests and some processing became dramatically faster without any changes to the Hama algorithms. In the runs below, TestDenseMatrix took about 34% less time and TestDenseVector about 23% less.


Before:

   [junit] Running org.apache.hama.TestDenseMatrix
   [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec
   [junit] Running org.apache.hama.TestDenseVector
   [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec

After:

   [junit] Running org.apache.hama.TestDenseMatrix
   [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec
   [junit] Running org.apache.hama.TestDenseVector
   [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec

Here are issues that impact 0.18.x

The official Heart project group blog.

We have published an official Heart project blog. It will show you how we work.

Heart is a Highly Extensible & Available RDF Table project. It develops a Hadoop subsystem for RDF data storage and a distributed processing engine that uses Hbase + MapReduce to store and process RDF data.

BigTable & Hbase, October 1, 2008 @ KAIST

http://wiki.apache.org/hama-data/attachments/Presentations/attachments/BigTable_and_Hbase.pdf

A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One of them is "Bayesian Spam Filtering".

Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words)=Pr(words|spam)Pr(spam)/Pr(words)
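As a worked example of the formula above, here is a minimal Java sketch. The class name BayesExample, the method names, and all the probabilities used in it are made-up assumptions for illustration only. Pr(words) is expanded with the law of total probability over spam and non-spam mail.

```java
/** Worked example of Bayes' theorem for spam; all numbers are illustrative assumptions. */
class BayesExample {

  /** Pr(words) = Pr(words|spam) Pr(spam) + Pr(words|not spam) (1 - Pr(spam)). */
  static double evidence(double pWordsGivenSpam, double pWordsGivenHam, double pSpam) {
    return pWordsGivenSpam * pSpam + pWordsGivenHam * (1.0 - pSpam);
  }

  /** Pr(spam|words) = Pr(words|spam) Pr(spam) / Pr(words). */
  static double posterior(double pWordsGivenSpam, double pSpam, double pWords) {
    return pWordsGivenSpam * pSpam / pWords;
  }

  public static void main(String[] args) {
    double pWordsGivenSpam = 0.2;  // assumed: the words appear in 20% of spam
    double pWordsGivenHam = 0.01;  // assumed: the words appear in 1% of good mail
    double pSpam = 0.5;            // assumed: half of all mail is spam

    double pWords = evidence(pWordsGivenSpam, pWordsGivenHam, pSpam);
    System.out.println(posterior(pWordsGivenSpam, pSpam, pWords));
  }
}
```

With these assumed numbers, Pr(words) is 0.105 and Pr(spam|words) comes out at roughly 0.95.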

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good e-mails) as described below:

Update: emit <category,probability> pairs and have the reducer simply sum up the probabilities for a given category.

Then it becomes much simpler. :)

Map:

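    /**
     * Counts word frequency: emits <word, count> for every valid word token.
     * splitregex, wordregex, count and spamTotal are fields defined elsewhere
     * in the class; count is assumed to be an IntWritable holding 1.
     */
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (String token : tokens) {
        String word = token.toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          spamTotal++; // running total of words seen in this category
          output.collect(new Text(word), count);
        }
      }
    }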

Reduce:

    /**
     * Computes bad (or good) count / total bad (or good) words.
     * The value type is assumed to be IntWritable.
     */
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // spamTotal must hold the category's total word count at this point
      output.collect(key,
        new FloatWritable((float) sum / spamTotal));
    }
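To try the counting logic of step 1 without a cluster, the map and reduce phases can be collapsed into a single local method. This is a rough sketch under assumptions: the class name WordProbabilities and both regular expressions are hypothetical, since the post does not show the values of splitregex and wordregex. Note that on a real cluster mappers and reducers usually run in separate JVMs, so the total word count would have to reach the reducer some other way than a shared field (for example a job counter or a separate pass); here everything runs in one process.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Local, single-process sketch of step 1: P(word|category) = count(word) / total words. */
class WordProbabilities {
  // Assumed patterns; the post does not show its splitregex / wordregex values.
  static final Pattern SPLIT = Pattern.compile("\\W+");
  static final Pattern WORD = Pattern.compile("[a-z]+");

  static Map<String, Float> compute(Iterable<String> lines) {
    Map<String, Integer> counts = new HashMap<>();
    int total = 0;

    // "Map" phase: one count per valid word token.
    for (String line : lines) {
      for (String token : SPLIT.split(line)) {
        String word = token.toLowerCase();
        if (WORD.matcher(word).matches()) {
          total++;
          counts.merge(word, 1, Integer::sum);
        }
      }
    }

    // "Reduce" phase: per-word count divided by the category total.
    Map<String, Float> probs = new HashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      probs.put(e.getKey(), (float) e.getValue() / total);
    }
    return probs;
  }
}
```

For the two lines "Buy cheap pills" and "cheap cheap offer!", the word "cheap" accounts for 3 of the 6 tokens, so its probability is 0.5.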

2. A join-style map/reduce over the two data sets (the word probability sets for bad and good e-mails) gives us the rBad and rGood values for each key. Once all words have been added, we finalize the results as described below:

    /**
     * Implements Bayes' rule to compute
     * how likely this word is "spam"
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }
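The join in step 2 can be sketched locally with two maps standing in for the bad and good probability sets. This is a minimal illustration under assumptions: the class name SpamScoreJoin and the join method are hypothetical, a word missing from one set is assumed to have probability 0 there, and pSpam is assumed to start at 0 when a word appears in neither set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Local sketch of step 2: join per-word bad/good probabilities and finalize pSpam. */
class SpamScoreJoin {

  /** Same rule as finalizeProb above, written as a pure function. */
  static float finalizeProb(float rBad, float rGood) {
    float pSpam = 0.0f; // assumed default when the word was never seen
    if (rGood + rBad > 0) {
      pSpam = rBad / (rBad + rGood);
    }
    // Clamp to [0.01, 0.99] so no single word is ever decisive.
    return Math.max(0.01f, Math.min(0.99f, pSpam));
  }

  /** Joins the two probability sets on the word key. */
  static Map<String, Float> join(Map<String, Float> bad, Map<String, Float> good) {
    Set<String> words = new HashSet<>(bad.keySet());
    words.addAll(good.keySet());

    Map<String, Float> out = new HashMap<>();
    for (String w : words) {
      float rBad = bad.getOrDefault(w, 0.0f);   // assumed 0 if absent from spam
      float rGood = good.getOrDefault(w, 0.0f); // assumed 0 if absent from good mail
      out.put(w, finalizeProb(rBad, rGood));
    }
    return out;
  }
}
```

A word seen only in spam ends up at the upper clamp of 0.99, and a word seen only in good mail at the lower clamp of 0.01.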
