According to HPC news, the latest MATLAB upgrade includes built-in support for the European EGEE grid (Enabling Grids for E-sciencE). This was accomplished by integrating the Parallel Computing Toolbox and MATLAB Distributed Computing Server with EGEE’s middleware, gLite.
http://insidehpc.com/2008/10/22/matlabbecomingmoreclusterandgridfriendly/
Parallel Matrix Multiply on Hadoop Map/Reduce: 2D blocking
In matrix multiplication, calculating one element of C requires reading an entire row of A and an entire column of B. Calculating an entire column of C (m elements) requires reading all m rows of A once and rereading one column of B m times. Thus, calculating n columns of C requires reading all rows of A n times and all columns of B m times.
In blocking, matrix multiplication of the original arrays is transformed into matrix multiplication of blocks. For example,
C_block(1,1)=A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)
This is the well-known cache blocking technique for efficient memory access. In the same way, we can think of a memory blocking technique for efficient disk access and parallel (or distributed) execution on Hadoop MapReduce, as described below.
Each mapper multiplies one pair of blocks, and the reducer sums up all the resulting submatrices that belong to the same block of C.
map(BlockID, TwoDArrayWritable)
  -> <C_block(1,1), A_block(1,1)*B_block(1,1)>, ...
reduce(BlockID’, <TwoDArrayWritable>*)
  -> <C_block(1,1), A_block(1,1)*B_block(1,1) + A_block(1,2)*B_block(2,1)>, ...
In my own work, it runs very well. But I'm not sure whether it can be auto-optimized.
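The map and reduce steps above can be sketched in plain Java without Hadoop. This is only an illustration under stated assumptions: the class and method names (BlockedMultiplySketch, block, multiply, sum, blockedMultiply) are made up for this example, the matrices are square with a size that is a multiple of the block size, and a HashMap stands in for the shuffle between mappers and reducers. It is not the Hama implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of 2D-blocked matrix multiply in map/reduce style (no Hadoop). */
final class BlockedMultiplySketch {

  /** Copies the bs x bs block at block coordinates (bi, bj) out of matrix m. */
  static double[][] block(double[][] m, int bi, int bj, int bs) {
    double[][] out = new double[bs][bs];
    for (int r = 0; r < bs; r++) {
      System.arraycopy(m[bi * bs + r], bj * bs, out[r], 0, bs);
    }
    return out;
  }

  /** "Map" step: multiply one block of A by one block of B. */
  static double[][] multiply(double[][] a, double[][] b) {
    int n = a.length, inner = b.length, p = b[0].length;
    double[][] c = new double[n][p];
    for (int i = 0; i < n; i++)
      for (int x = 0; x < inner; x++)
        for (int j = 0; j < p; j++)
          c[i][j] += a[i][x] * b[x][j];
    return c;
  }

  /** "Reduce" step: sum all partial products that share one C block id. */
  static double[][] sum(List<double[][]> parts) {
    int n = parts.get(0).length, p = parts.get(0)[0].length;
    double[][] c = new double[n][p];
    for (double[][] part : parts)
      for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
          c[i][j] += part[i][j];
    return c;
  }

  /** Multiplies square matrices whose size is a multiple of the block size bs. */
  static double[][] blockedMultiply(double[][] a, double[][] b, int bs) {
    int nb = a.length / bs;
    // Map phase: emit <"i,j", A_block(i,k) * B_block(k,j)> for every k.
    Map<String, List<double[][]>> shuffled = new HashMap<>();
    for (int i = 0; i < nb; i++)
      for (int j = 0; j < nb; j++)
        for (int k = 0; k < nb; k++)
          shuffled.computeIfAbsent(i + "," + j, key -> new ArrayList<>())
              .add(multiply(block(a, i, k, bs), block(b, k, j, bs)));
    // Reduce phase: sum the partial products and copy each C block into place.
    double[][] c = new double[a.length][a.length];
    for (Map.Entry<String, List<double[][]>> e : shuffled.entrySet()) {
      String[] id = e.getKey().split(",");
      int bi = Integer.parseInt(id[0]), bj = Integer.parseInt(id[1]);
      double[][] cb = sum(e.getValue());
      for (int r = 0; r < bs; r++)
        System.arraycopy(cb[r], 0, c[bi * bs + r], bj * bs, bs);
    }
    return c;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
    double[][] b = {{2, 0, 1, 3}, {1, 1, 0, 2}, {0, 4, 1, 1}, {5, 0, 2, 0}};
    System.out.println(java.util.Arrays.deepToString(blockedMultiply(a, b, 2)));
  }
}
```

With a 4x4 input and a block size of 2, each of the four C blocks receives two partial products, which matches the C_block(1,1) example above. In a real job the block size would presumably be chosen so that a pair of blocks fits in a mapper's memory.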
Sun Tech Day 2008 Seoul
Sun Tech Days kicked off a few days ago in Seoul, with over 2,000 attendees, completely charged, willing to discuss their issues, bringing an amazing amount of energy, and all there to learn about different Sun technologies.
The Zembly session (Sang Shin) was packed with 400 attendees, and there were attendees all over the floor.
I also took pictures with Sang Shin and Rich Green.
Gripping links
Hyunsik Choi sent me these links.
http://www.informatik.uni-trier.de/~ley/db/
http://dblp.uni-trier.de/xml/
We think this data could be put to good use for building and exploring RDF data.
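Hadoop 0.18.x: performance has improved
After updating Hama from Hadoop 0.17.x to Hadoop 0.18.x, the JUnit tests and some processing ran dramatically faster without any changes to the Hama algorithms. In the runs below, elapsed time dropped by about 34% for TestDenseMatrix and about 23% for TestDenseVector.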
Before:
[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 89.078 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 34.394 sec
After:
[junit] Running org.apache.hama.TestDenseMatrix
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 58.842 sec
[junit] Running org.apache.hama.TestDenseVector
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 26.519 sec
Here are issues that impact 0.18.x
The official Heart project group blog.
We have published an official Heart project blog, which will show you how we work.
Heart is a Highly Extensible & Available RDF Table project, which develops a Hadoop subsystem for an RDF data store and a distributed processing engine that uses HBase + MapReduce to store and process RDF data.
A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce
In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One was "Bayesian Spam Filtering".
Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:
Pr(spam|words) = Pr(words|spam) Pr(spam) / Pr(words)
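As a small worked example of the theorem, here is a sketch in Java. The class name BayesRuleSketch, the method posterior, and all of the numbers are made up for illustration; the denominator Pr(words) is obtained from the law of total probability over the two categories.

```java
/** Worked example of Bayes' theorem for spam, using made-up probabilities. */
final class BayesRuleSketch {

  /** Pr(spam | words) = Pr(words | spam) * Pr(spam) / Pr(words). */
  static double posterior(double pWordsGivenSpam, double pSpam, double pWords) {
    if (pWords <= 0.0) {
      throw new IllegalArgumentException("Pr(words) must be positive");
    }
    return pWordsGivenSpam * pSpam / pWords;
  }

  public static void main(String[] args) {
    // Hypothetical inputs: the words appear in 40% of spam and 5% of good mail,
    // and half of all mail is spam.
    double pSpam = 0.5;
    double pWordsGivenSpam = 0.40;
    double pWordsGivenGood = 0.05;
    // Law of total probability: Pr(words) summed over spam and good mail.
    double pWords = pWordsGivenSpam * pSpam + pWordsGivenGood * (1 - pSpam);
    System.out.println(posterior(pWordsGivenSpam, pSpam, pWords));
  }
}
```

With these inputs the posterior is 0.2 / 0.225, or roughly 0.89, so a message containing those words would be judged much more likely spam than not.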
In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).
1. We can get the spam probability P(word|category) of the words from the files of each category (bad/good emails) as described below:
Update: emit <category, probability> pairs and have the reducer simply sum up the probabilities for a given category. Then it will be even simpler. :)
Map:
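/**
 * Counts word frequency.
 * splitregex, wordregex, count and spamTotal are fields of the enclosing
 * class (not shown); count is assumed here to be an IntWritable holding 1.
 * Needs org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
 */
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  String[] tokens = line.split(splitregex);

  // For every word token
  for (int i = 0; i < tokens.length; i++) {
    String word = tokens[i].toLowerCase();
    Matcher m = wordregex.matcher(word);
    if (m.matches()) {
      spamTotal++;
      output.collect(new Text(word), count);
    }
  }
}

Reduce: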
/**
 * Computes bad (or good) count / total bad (or good) words.
 * Assumes the mapper emits IntWritable counts; spamTotal is the field
 * incremented in map().
 */
public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, FloatWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  output.collect(key, new FloatWritable((float) sum / spamTotal));
}

2. For each word we now have an rBad and an rGood value under the same key, one from each data set (the per-word probabilities for spam and for good mail). Once all words have been added, we finalize the results, for example with a join map/reduce, as described below:
/**
 * Implements Bayes' rule to compute
 * how likely this word is "spam".
 */
public void finalizeProb() {
  if (rGood + rBad > 0) {
    pSpam = rBad / (rBad + rGood);
  }
  if (pSpam < 0.01f) {
    pSpam = 0.01f;
  } else if (pSpam > 0.99f) {
    pSpam = 0.99f;
  }
}
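To see the two steps end to end without a cluster, here is a minimal single-process sketch in Java. It is not the Hadoop job itself: the class WordSpamProbabilitySketch, its methods, the "\\W+" tokenizer, the sample mails, and the neutral default of 0.5 for unseen words are all assumptions made for this illustration. Only the rBad / (rBad + rGood) rule and the clamping to the range 0.01 to 0.99 come from finalizeProb() above.

```java
import java.util.HashMap;
import java.util.Map;

/** Single-process sketch of the two steps: word frequencies, then pSpam per word. */
final class WordSpamProbabilitySketch {

  /** Step 1: count(word) / total words, for the mails of one category. */
  static Map<String, Float> wordFrequencies(String[] mails) {
    Map<String, Integer> counts = new HashMap<>();
    int total = 0;
    for (String mail : mails) {
      // Assumed tokenizer: split on runs of non-word characters.
      for (String token : mail.toLowerCase().split("\\W+")) {
        if (token.isEmpty()) {
          continue;
        }
        counts.merge(token, 1, Integer::sum);
        total++;
      }
    }
    Map<String, Float> freq = new HashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      freq.put(e.getKey(), (float) e.getValue() / total);
    }
    return freq;
  }

  /** Step 2: same rule as finalizeProb(), clamped to [0.01, 0.99]. */
  static float spamProbability(float rBad, float rGood) {
    float pSpam = 0.5f; // assumption: neutral score when the word was never seen
    if (rGood + rBad > 0) {
      pSpam = rBad / (rBad + rGood);
    }
    return Math.max(0.01f, Math.min(0.99f, pSpam));
  }

  public static void main(String[] args) {
    Map<String, Float> bad = wordFrequencies(new String[] {"cheap pills", "cheap offer now"});
    Map<String, Float> good = wordFrequencies(new String[] {"meeting notes", "offer accepted"});
    // "Join" on the word: look up the same key in both result sets.
    for (String word : new String[] {"cheap", "offer", "meeting"}) {
      float rBad = bad.getOrDefault(word, 0f);
      float rGood = good.getOrDefault(word, 0f);
      System.out.println(word + " -> " + spamProbability(rBad, rGood));
    }
  }
}
```

In this toy data "cheap" only occurs in bad mail and "meeting" only in good mail, so they hit the upper and lower clamps, while "offer" occurs in both and lands in between.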