According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and MATLAB Distributed Computing Server with EGEE's middleware, gLite.
http://insidehpc.com/2008/10/22/matlab-becoming-more-cluster-and-grid-friendly/
Parallel Matrix Multiply on Hadoop Map/Reduce: 2D blocking
In matrix multiplication, calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all m rows of A once and re-reading one column of B m times. Thus, calculating all n columns of C requires reading all rows of A n times and all columns of B m times.
In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,
C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)
This is the well-known cache blocking technique for efficient memory access. By analogy, we can think of a memory blocking technique for efficient disk access and distributed parallelism on Hadoop MapReduce, as described below.
Each mapper performs a matrix multiplication of one pair of blocks, and the reducer sums up all the resulting sub-matrices.
map(BlockID, TwoDArrayWritable)
  -> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...
reduce(BlockID', <TwoDArrayWritable>*)
  -> <C_block(1,1), A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)>, ...
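To make the scheme concrete, here is a minimal plain-Java sketch of what each mapper and reducer computes. The names and the driver are hypothetical, not the actual Hama code; in a real job the blocks would arrive as Writable values keyed by block ID.

/**
 * Minimal sketch of 2D blocking (hypothetical names, not the Hama code).
 * mapMultiply() is what one mapper does to a pair of blocks; reduceSum()
 * is what the reducer does to all partial products sharing a C-block key.
 */
public class BlockMultiplySketch {

  // map: <C_block(i,j), A_block(i,k) * B_block(k,j)>
  static double[][] mapMultiply(double[][] a, double[][] b) {
    int n = a.length, m = b[0].length, kk = b.length;
    double[][] partial = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int k = 0; k < kk; k++)      // i-k-j loop order for cache locality
        for (int j = 0; j < m; j++)
          partial[i][j] += a[i][k] * b[k][j];
    return partial;
  }

  // reduce: element-wise sum of the partial products for one C block
  static double[][] reduceSum(java.util.List<double[][]> partials) {
    int n = partials.get(0).length, m = partials.get(0)[0].length;
    double[][] sum = new double[n][m];
    for (double[][] p : partials)
      for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
          sum[i][j] += p[i][j];
    return sum;
  }

  public static void main(String[] args) {
    // C_block(1,1) = A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)
    double[][] a11 = {{1, 2}, {3, 4}}, a12 = {{5, 6}, {7, 8}};
    double[][] b11 = {{1, 0}, {0, 1}}, b21 = {{2, 0}, {0, 2}};
    double[][] c11 = reduceSum(java.util.Arrays.asList(
        mapMultiply(a11, b11), mapMultiply(a12, b21)));
    // prints [[11.0, 14.0], [17.0, 20.0]]
    System.out.println(java.util.Arrays.deepToString(c11));
  }
}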
In my own work it runs well, but I'm not sure it can be auto-optimized.
Sun Tech Day 2008 Seoul
Sun Tech Days kicked off a few days ago in Seoul: over 2,000 attendees, completely charged, willing to discuss their issues, an amazing amount of energy, and all there to learn about different Sun technologies.
The Zembly session (Sang Shin) was packed with 400 attendees and there were attendees all over the floor.
I also took pictures with Sang Shin and Rich Green.
Gripping links
Hyunsik Choi sent me these links.
http://www.informatik.uni-trier.de/~ley/db/
http://dblp.uni-trier.de/xml/
We thought this data could be put to good use for building and exploring RDF data.
Hadoop 0.18.x: performance improved
After updating Hama from Hadoop 0.17.x to Hadoop 0.18.x, the JUnit tests and some processing got dramatically faster without any changes to the Hama algorithms.
Before:
[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec
After:
[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec
Here are issues that impact 0.18.x
The official Heart project group blog.
We have published an official Heart project blog; it will show you how we work.
Heart is a Highly Extensible & Available RDF Table project, which develops a Hadoop subsystem for RDF data storage and a distributed processing engine that uses HBase + MapReduce to store and process RDF data.
Distributed Bayesian Spam Filtering using Hadoop Map/Reduce
In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One was "Bayesian Spam Filtering".
Bayesian email filters take advantage of Bayes' theorem. In the context of spam, Bayes' theorem says that the probability that an email is spam, given that it contains certain words, is equal to the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:
Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words)
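To make the rule concrete, here is a toy calculation with made-up numbers (the helper name is mine, not from any real filter):

/** Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words) */
static double bayes(double prWordsGivenSpam, double prSpam, double prWords) {
  return prWordsGivenSpam * prSpam / prWords;
}
// bayes(0.8, 0.5, 0.45) returns ~0.889: an email containing
// those words is very likely spam.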
In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).
1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good e-mails), as described below:
Update: emit <category, probability> pairs and have the reducer simply sum up the probabilities for a given category. Then it'll be even simpler. :)
Map:
/**
 * Counts word frequency.
 * Fields assumed on the enclosing class: splitregex (token delimiter regex),
 * wordregex (compiled Pattern for legal words), count (a reused IntWritable(1)),
 * and spamTotal (running total of word tokens seen).
 */
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  String[] tokens = line.split(splitregex);
  // For every word token
  for (int i = 0; i < tokens.length; i++) {
    String word = tokens[i].toLowerCase();
    Matcher m = wordregex.matcher(word);
    if (m.matches()) {
      spamTotal++;
      output.collect(new Text(word), count);
    }
  }
}

Reduce:
/** Computes bad (or good) count / total bad (or good) words. */
public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, FloatWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  output.collect(key, new FloatWritable((float) sum / spamTotal));
}

2. We can get the rBad/rGood values for the same key from each dataset (the spam-probability sets of the words). Once we are finished adding words, we finalize the results with a join-style map/reduce, as described below:
/** Implements Bayes' rule to compute how likely this word is "spam". */
public void finalizeProb() {
  if (rGood + rBad > 0)
    pSpam = rBad / (rBad + rGood);
  if (pSpam < 0.01f)
    pSpam = 0.01f;
  else if (pSpam > 0.99f)
    pSpam = 0.99f;
}
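As a toy usage example (made-up frequencies; WordProb is just a hypothetical holder for the rBad, rGood, and pSpam fields above):

WordProb w = new WordProb(); // hypothetical container for rBad, rGood, pSpam
w.rBad = 0.03f;              // word makes up 3% of the spam word mass
w.rGood = 0.001f;            // and 0.1% of the good word mass
w.finalizeProb();            // pSpam = 0.03 / 0.031 ~ 0.97: strong spam indicator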