**MATLAB Distributed Computing Server with EGEE’s middleware, gLite**.

http://insidehpc.com/2008/10/22/matlab-becoming-more-cluster-and-grid-friendly/

According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and **MATLAB Distributed Computing Server with EGEE’s middleware, gLite**.


In matrix multiplication (C = A*B, where C has m rows and n columns), calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all m rows of A once and re-reading one column of B m times. Thus, calculating all n columns of C requires reading every row of A n times and every column of B m times.
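To make the access counts above concrete, here is a minimal Java sketch of the naive column-by-column multiplication that simply counts how often each row of A and each column of B is needed. The class name `NaiveAccessCount` and its fields are hypothetical, chosen only for this illustration.

```java
// Illustrative sketch only: the class and field names are hypothetical.
final class NaiveAccessCount {
    final int[] rowReadsOfA;  // rowReadsOfA[i]: how often row i of A was read
    final int[] colReadsOfB;  // colReadsOfB[j]: how often column j of B was read
    final double[][] c;       // the product C = A * B

    NaiveAccessCount(double[][] a, double[][] b) {
        int m = a.length;      // rows of A and of C
        int k = b.length;      // columns of A, rows of B
        int n = b[0].length;   // columns of B and of C
        rowReadsOfA = new int[m];
        colReadsOfB = new int[n];
        c = new double[m][n];
        for (int j = 0; j < n; j++) {          // one column of C at a time
            for (int i = 0; i < m; i++) {      // one element of that column
                rowReadsOfA[i]++;              // C[i][j] needs all of row i of A
                colReadsOfB[j]++;              // ... and all of column j of B
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                c[i][j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}};        // m = 3
        double[][] b = {{1, 0, 2, 1}, {0, 1, 1, 3}};    // n = 4
        NaiveAccessCount run = new NaiveAccessCount(a, b);
        System.out.println("reads of row 0 of A: " + run.rowReadsOfA[0]);
        System.out.println("reads of column 0 of B: " + run.colReadsOfB[0]);
    }
}
```

With the 3-by-2 and 2-by-4 inputs in `main`, every row of A is counted n = 4 times and every column of B m = 3 times, which is the re-reading cost that blocking tries to reduce.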

In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,

C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)

This is the well-known cache blocking technique for efficient memory access. By analogy, we can think of a memory blocking technique for efficient disk access and parallel (or distributed) processing on Hadoop MapReduce, as described below.

Each mapper multiplies one pair of blocks, and the reducers sum up all the resulting sub-matrices.

map(BlockID, TwoDArrayWritable)

-> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...

reduce(BlockID’, <TwoDArrayWritable>*)

-> <C_block(1,1), A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)>, ...
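The map and reduce steps above can be sketched in plain Java without any Hadoop classes. This is only an in-memory illustration under stated assumptions: square matrices whose size is a multiple of the block size, 0-based block indices, and a string such as `"0,1"` standing in for the BlockID. The class `BlockMultiplySketch` and all of its method names are hypothetical and are not Hama or Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the map and reduce steps; no Hadoop classes are used.
// All names are hypothetical, and block indices are 0-based here.
final class BlockMultiplySketch {

    /** Plain dense product of two matrices (used for the small blocks). */
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    /** Element-wise sum of two equally sized matrices. */
    static double[][] add(double[][] x, double[][] y) {
        double[][] s = new double[x.length][x[0].length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x[0].length; j++)
                s[i][j] = x[i][j] + y[i][j];
        return s;
    }

    /** Copies the bs-by-bs block at block position (bi, bj). */
    static double[][] block(double[][] m, int bi, int bj, int bs) {
        double[][] out = new double[bs][bs];
        for (int i = 0; i < bs; i++)
            for (int j = 0; j < bs; j++)
                out[i][j] = m[bi * bs + i][bj * bs + j];
        return out;
    }

    /** "Map": emit <C block id, A_block(i,k) * B_block(k,j)> for every i, j, k. */
    static Map<String, List<double[][]>> map(double[][] a, double[][] b, int bs) {
        int nb = a.length / bs;  // assumes square matrices whose size is a multiple of bs
        Map<String, List<double[][]>> emitted = new HashMap<>();
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                for (int k = 0; k < nb; k++)
                    emitted.computeIfAbsent(i + "," + j, id -> new ArrayList<>())
                           .add(multiply(block(a, i, k, bs), block(b, k, j, bs)));
        return emitted;
    }

    /** "Reduce": sum all partial products that share a C block id. */
    static Map<String, double[][]> reduce(Map<String, List<double[][]>> emitted) {
        Map<String, double[][]> c = new HashMap<>();
        for (Map.Entry<String, List<double[][]>> e : emitted.entrySet()) {
            double[][] sum = null;
            for (double[][] part : e.getValue())
                sum = (sum == null) ? part : add(sum, part);
            c.put(e.getKey(), sum);
        }
        return c;
    }
}
```

Calling `BlockMultiplySketch.reduce(BlockMultiplySketch.map(a, b, 2))` on two 4-by-4 matrices yields four 2-by-2 blocks of C. In a real job each emitted partial product would be an independent unit of work, which is where the parallelism comes from.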

In my own work, it runs very well, but I'm not sure whether it can be auto-optimized.

Sun Tech Days kicked off a few days ago in Seoul. There were over 2000 attendees, completely charged, willing to discuss their issues, with an amazing amount of energy, and all there to learn about different Sun technologies.

The Zembly session (Sang Shin) was packed with 400 attendees and there were attendees all over the floor.

And I took pictures with Sang Shin and Rich Green.


Hyunsik Choi sent me these links.

http://www.informatik.uni-trier.de/~ley/db/

http://dblp.uni-trier.de/xml/

We thought this data could be put to good use for building and exploring RDF data.


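After updating Hama from Hadoop 0.17.x to Hadoop 0.18.x, the JUnit tests and some processing ran noticeably faster without any changes to the Hama algorithms. In the runs below, TestDenseMatrix dropped from 89.078 sec to 58.842 sec (about 34% less time) and TestDenseVector from 34.394 sec to 26.519 sec (about 23% less time).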


Before:

[junit] Running org.apache.hama.TestDenseMatrix

[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec

[junit] Running org.apache.hama.TestDenseVector

[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec

After:

[junit] Running org.apache.hama.TestDenseMatrix

[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec

[junit] Running org.apache.hama.TestDenseVector

[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec

Here are issues that impact 0.18.x

We have published an official Heart project blog; it will show you how we work.

Heart is a Highly Extensible & Available RDF Table project, which develops a Hadoop subsystem for RDF data storage and a distributed processing engine that uses HBase + MapReduce to store and process RDF data.


In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One was "Bayesian Spam Filtering".

Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email: Pr(spam|words) = Pr(words|spam) Pr(spam) / Pr(words)

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good e-mails) as described below:

**Update**: emit <category, probability> pairs and have the reducer simply sum up the probabilities for a given category. Then it will be simpler. :)

Map:

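    /**
     * Counts word frequency.
     * splitregex, wordregex, count and spamTotal are fields not shown in this post.
     * The generic type parameters were missing from the post; IntWritable counts
     * (with count holding 1) are an assumption.
     */
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String[] tokens = line.split(splitregex);

        // For every word token
        for (int i = 0; i < tokens.length; i++) {
            String word = tokens[i].toLowerCase();
            Matcher m = wordregex.matcher(word);
            if (m.matches()) {
                spamTotal++;
                output.collect(new Text(word), count);
            }
        }
    }

Reduce: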

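    /**
     * Computes bad (or good) count / total bad (or good) words.
     * The generic type parameters were missing from the post; IntWritable
     * input values are an assumption.
     */
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, FloatWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += (int) values.next().get();
        }
        // spamTotal is the field incremented in map(); it is only visible here
        // when map and reduce run in the same JVM.
        output.collect(key, new FloatWritable((float) sum / spamTotal));
    }

2. For each word, we can get the rBad/rGood values by matching the same key in the two data sets (the per-word probability sets for bad and good e-mails). Once all words have been added, we finalize the results, for example with a join-style map/reduce, as described below: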

    /**
     * Implements Bayes' rule to compute
     * how likely this word is "spam".
     */
    public void finalizeProb() {
        if (rGood + rBad > 0)
            pSpam = rBad / (rBad + rGood);
        if (pSpam < 0.01f)
            pSpam = 0.01f;
        else if (pSpam > 0.99f)
            pSpam = 0.99f;
    }
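To see how steps 1 and 2 fit together, here is a small single-machine Java sketch with no Hadoop classes. It is only an illustration under stated assumptions: the class `SpamProbabilitySketch`, its method names, the `\W+` tokenizer and the sample mails are all hypothetical, and `pSpam` is assumed to start at 0 for a word that was never seen. It mirrors the count / total ratio of the reducer and the clamping of `finalizeProb`.

```java
import java.util.HashMap;
import java.util.Map;

// Single-machine sketch of steps 1 and 2; all names and sample data are hypothetical.
final class SpamProbabilitySketch {

    /** Step 1: relative frequency of each word within one category of mails. */
    static Map<String, Float> wordFrequencies(String[] mails) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String mail : mails) {
            for (String token : mail.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;      // split may yield a leading empty token
                counts.merge(token, 1, Integer::sum);
                total++;
            }
        }
        Map<String, Float> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), (float) e.getValue() / total);
        }
        return freq;
    }

    /** Step 2: combine rBad and rGood as in finalizeProb, clamped to [0.01, 0.99]. */
    static float spamProbability(float rBad, float rGood) {
        float pSpam = 0.0f;  // assumption: starts at 0 when the word was never seen
        if (rGood + rBad > 0) pSpam = rBad / (rBad + rGood);
        if (pSpam < 0.01f) return 0.01f;
        if (pSpam > 0.99f) return 0.99f;
        return pSpam;
    }

    public static void main(String[] args) {
        Map<String, Float> bad = wordFrequencies(new String[] {"buy cheap pills", "cheap cheap offer"});
        Map<String, Float> good = wordFrequencies(new String[] {"meeting at noon", "cheap lunch at noon"});
        for (String word : new String[] {"cheap", "buy", "noon"}) {
            float p = spamProbability(bad.getOrDefault(word, 0f), good.getOrDefault(word, 0f));
            System.out.println(word + " -> " + p);
        }
    }
}
```

The lookup with `getOrDefault` plays the role of the join on the word key: a word missing from one category contributes 0 there, and the clamp keeps the result away from 0 and 1.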


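In mathematics, the size of an infinite set, that is, its number of elements, is called its cardinality. For a finite set, the size is simply the number of elements, but for an infinite set it is impossible to count the elements one by one, so cardinality...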