According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and MATLAB Distributed Computing Server with EGEE's middleware, gLite.
http://insidehpc.com/2008/10/22/matlab-becoming-more-cluster-and-grid-friendly/
Parallel Matrix Multiply on Hadoop Map/Reduce: 2D blocking
In matrix multiplication, calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all m rows of A once and re-reading one column of B m times. Thus, calculating all n columns of C requires reading all rows of A n times and all columns of B m times.
In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,
C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)
This is the well-known cache blocking technique for efficient memory access. By analogy, we can think of a disk-level blocking technique for efficient disk access and distributed parallelism on Hadoop MapReduce, as described below.
Each mapper performs one block-level matrix multiplication, and the reducer sums up the resulting sub-matrices.
map(BlockID, TwoDArrayWritable)
  -> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...

reduce(BlockID', <TwoDArrayWritable>*)
  -> <C_block(1,1), A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)>, ...
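As a rough single-process sketch of this scheme (plain Java with no Hadoop dependencies; all class and method names here are illustrative, not Hama's actual code), the "map" step emits one block product per (i, j, k) keyed by the destination C block, and the "reduce" step sums the partial products sharing a key:

```java
import java.util.HashMap;
import java.util.Map;

public class BlockMatMul {

  // Multiply two b x b blocks (the work of one mapper).
  static double[][] multiplyBlock(double[][] a, double[][] b) {
    int n = a.length;
    double[][] c = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
          c[i][j] += a[i][k] * b[k][j];
    return c;
  }

  // Add src into dst element-wise (the work of the reducer).
  static void addInto(double[][] dst, double[][] src) {
    for (int i = 0; i < dst.length; i++)
      for (int j = 0; j < dst[i].length; j++)
        dst[i][j] += src[i][j];
  }

  // Copy the (bi, bj) block of size b out of m.
  static double[][] block(double[][] m, int bi, int bj, int b) {
    double[][] out = new double[b][b];
    for (int r = 0; r < b; r++)
      for (int c = 0; c < b; c++)
        out[r][c] = m[bi * b + r][bj * b + c];
    return out;
  }

  /** C = A * B for n x n matrices, n divisible by blockSize. */
  static double[][] multiply(double[][] a, double[][] b, int blockSize) {
    int n = a.length, nb = n / blockSize;
    // "Map" phase: emit <C_block(i,j), A_block(i,k) * B_block(k,j)>;
    // "Reduce" phase: merge() sums partial products with the same key.
    Map<String, double[][]> reduced = new HashMap<>();
    for (int i = 0; i < nb; i++)
      for (int j = 0; j < nb; j++)
        for (int k = 0; k < nb; k++) {
          double[][] partial = multiplyBlock(
              block(a, i, k, blockSize), block(b, k, j, blockSize));
          reduced.merge(i + "," + j, partial,
              (acc, p) -> { addInto(acc, p); return acc; });
        }
    // Stitch the reduced blocks back into the full result matrix.
    double[][] c = new double[n][n];
    for (int i = 0; i < nb; i++)
      for (int j = 0; j < nb; j++) {
        double[][] blk = reduced.get(i + "," + j);
        for (int r = 0; r < blockSize; r++)
          for (int col = 0; col < blockSize; col++)
            c[i * blockSize + r][j * blockSize + col] = blk[r][col];
      }
    return c;
  }
}
```

On a real cluster each (i, j, k) product would run in its own mapper and the `merge` call would be a reducer keyed by `BlockID`; this sketch just runs the same data flow in one JVM.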
In my own tests it runs well, but I'm not sure it can be auto-optimized.
Sun Tech Day 2008 Seoul
Sun Tech Days kicked off a few days ago in Seoul: over 2,000 attendees, fully charged, willing to discuss their issues, an amazing amount of energy, and all there to learn about different Sun technologies.
The Zembly session (Sang Shin) was packed with 400 attendees and there were attendees all over the floor.
And I took pictures with Sang Shin and Rich Green.
Some gripping links
Hyunsik Choi sent me these links.
http://www.informatik.uni-trier.de/~ley/db/
http://dblp.uni-trier.de/xml/
We thought this data could be put to good use for building and exploring RDF data.
Hadoop 0.18.x: performance improved
After updating Hama from Hadoop 0.17.x to Hadoop 0.18.x, the speed of the JUnit tests and some processing improved dramatically, without any changes to the Hama algorithms.
Here are the issues that affect 0.18.x:
Before:
[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec
After:
[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec
The official Heart project group blog.
We've published an official Heart project blog; it will show you how we work.
Heart (Highly Extensible & Available RDF Table) is a project to develop a Hadoop subsystem for RDF data storage and a distributed processing engine, using HBase + MapReduce to store and process RDF data.
Distributed Bayesian Spam Filtering Using Hadoop Map/Reduce
To address the growing problem of junk email on the Internet, I examined methods for the automated construction of filters that eliminate such unwanted messages from a user's mail stream. One of them was Bayesian spam filtering.
Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it contains certain words, equals the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email: Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words)
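To make the formula concrete, here is a tiny single-word example with made-up numbers (not from any real corpus): suppose 80% of spam contains a given word, 1% of good mail contains it, and 40% of all mail is spam.

```java
public class BayesExample {

  /**
   * Pr(spam | word) via Bayes' theorem. The denominator Pr(word) is
   * expanded by total probability over the spam/ham split.
   */
  static double posterior(double pWordGivenSpam, double pWordGivenHam,
                          double pSpam) {
    double pWord = pWordGivenSpam * pSpam + pWordGivenHam * (1 - pSpam);
    return pWordGivenSpam * pSpam / pWord;
  }

  public static void main(String[] args) {
    // 0.8 * 0.4 / (0.8 * 0.4 + 0.01 * 0.6) = 0.32 / 0.326
    System.out.println(posterior(0.8, 0.01, 0.4)); // ~0.9816
  }
}
```

So an email containing that one word is about 98% likely to be spam under these assumed rates.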
In this post, I'll introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).
1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good emails) as described below:
Update: emit <category, probability> pairs and have the reducer simply sum up the probabilities for a given category. That makes it even simpler. :)
Map:
/**
 * Counts word frequency.
 */
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  String[] tokens = line.split(splitregex);
  // For every word token
  for (int i = 0; i < tokens.length; i++) {
    String word = tokens[i].toLowerCase();
    Matcher m = wordregex.matcher(word);
    if (m.matches()) {
      spamTotal++;
      output.collect(new Text(word), count);
    }
  }
}

Reduce:
/**
 * Computes bad (or good) count / total bad (or good) words.
 */
public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, FloatWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  output.collect(key, new FloatWritable((float) sum / spamTotal));
}

2. For each key we now have an rBad/rGood value from each dataset (the per-word spam probability sets). Once we are finished adding words, we finalize the results, much like a join map/reduce, as described below:
/**
 * Implements Bayes' rule to compute
 * how likely this word is "spam".
 */
public void finalizeProb() {
  if (rGood + rBad > 0)
    pSpam = rBad / (rBad + rGood);
  if (pSpam < 0.01f)
    pSpam = 0.01f;
  else if (pSpam > 0.99f)
    pSpam = 0.99f;
}
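One step the post doesn't show is turning the per-word pSpam values into a score for a whole message. A common way (a sketch under a naive word-independence assumption; the class name is mine, not from this project) is to combine them as p = Πp / (Πp + Π(1-p)):

```java
public class SpamScore {

  /**
   * Combines per-word spam probabilities into one message-level score,
   * assuming the words are independent given the class.
   */
  static double combine(double[] pSpamPerWord) {
    double prodSpam = 1.0; // product of p_i
    double prodHam = 1.0;  // product of (1 - p_i)
    for (double p : pSpamPerWord) {
      prodSpam *= p;
      prodHam *= (1.0 - p);
    }
    return prodSpam / (prodSpam + prodHam);
  }

  public static void main(String[] args) {
    // Two spammy words and one neutral word push the score near 1.
    System.out.println(combine(new double[] {0.95, 0.90, 0.50}));
  }
}
```

Note that clamping pSpam to [0.01, 0.99] in finalizeProb() above is exactly what keeps these products from collapsing to 0 or 1 on a single extreme word. For long messages, summing log-probabilities instead of multiplying would avoid floating-point underflow.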