Edward J. Yoon's Blog: A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One of them is Bayesian spam filtering.

Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words) = Pr(words|spam) Pr(spam) / Pr(words)
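To make the formula concrete, here is a minimal sketch in plain Java. The class name, method name, and all the numbers are hypothetical and chosen only for illustration; Pr(words) is obtained from the law of total probability over spam and non-spam mail.

```java
public class BayesRuleExample {

    /** Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words). */
    static double posterior(double pWordsGivenSpam, double pSpam, double pWords) {
        if (pWords <= 0.0) {
            throw new IllegalArgumentException("Pr(words) must be positive");
        }
        return pWordsGivenSpam * pSpam / pWords;
    }

    public static void main(String[] args) {
        // Hypothetical values, not measured from any real corpus.
        double pWordsGivenSpam = 0.20; // Pr(words|spam)
        double pWordsGivenGood = 0.01; // Pr(words|good mail)
        double pSpam = 0.50;           // Pr(spam), the prior

        // Law of total probability:
        // Pr(words) = Pr(words|spam) Pr(spam) + Pr(words|good) Pr(good)
        double pWords = pWordsGivenSpam * pSpam + pWordsGivenGood * (1.0 - pSpam);

        System.out.println("Pr(spam|words) = "
            + posterior(pWordsGivenSpam, pSpam, pWords));
    }
}
```

With these assumed numbers, Pr(words) is 0.105 and the posterior comes out at roughly 0.952, so a message containing those words would be judged very likely to be spam.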

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good e-mails), as described below:

Update: alternatively, emit <category, probability> pairs and have the reducer simply sum up
the probabilities for a given category.

Then it becomes even simpler. :)

Map:

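    /**
     * Counts word frequency: emits <word, count> for every valid token.
     */
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (String token : tokens) {
        String word = token.toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          // Caution: spamTotal is a field of this task's JVM only. On a
          // real cluster it is not shared between map and reduce tasks,
          // so the total has to reach the reducer some other way.
          spamTotal++;
          // count is assumed to be an IntWritable field holding 1
          output.collect(new Text(word), count);
        }
      }
    }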

Reduce:

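    /**
     * Computes bad (or good) count / total bad (or good) words
     */
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      // Sum the occurrences of this word
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // spamTotal must hold the total word count of the category and
      // be non-zero here; see the caution in the map step.
      output.collect(key,
        new FloatWritable((float) sum / spamTotal));
    }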
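The same counting logic can be tried without a cluster. The following is a minimal single-process sketch in plain Java, not the Hadoop job itself: the class name, the two regular expressions (stand-ins for the post's `splitregex` and `wordregex`, whose real values are not shown), and the sample lines are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class WordProbabilitySketch {
    // Assumed stand-ins for the splitregex and wordregex fields.
    private static final String SPLIT_REGEX = "\\W+";
    private static final Pattern WORD_REGEX = Pattern.compile("[a-z]{2,}");

    /** Returns count(word) / total valid words for one category's text. */
    static Map<String, Float> wordProbabilities(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;

        // "Map" phase: one count per valid token.
        for (String line : lines) {
            for (String token : line.split(SPLIT_REGEX)) {
                String word = token.toLowerCase();
                if (WORD_REGEX.matcher(word).matches()) {
                    counts.merge(word, 1, Integer::sum);
                    total++;
                }
            }
        }

        // "Reduce" phase: per-word sum divided by the category total.
        Map<String, Float> probabilities = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probabilities.put(e.getKey(), (float) e.getValue() / total);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Hypothetical "bad" e-mail lines.
        Map<String, Float> bad = wordProbabilities(
            Arrays.asList("Buy cheap pills", "cheap cheap offer"));
        System.out.println(bad);
    }
}
```

Here the total is known before any division happens, which is the property the distributed version also needs: the reducer can only compute count / total once the category total is available to it.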

2. Joining the two data sets (the bad and good probability sets of the words) on the same key gives us an rBad and an rGood value for each word. Once all words have been added, we finalize the results, for example in a join map/reduce, as described below:

    /**
     * Applies Bayes' rule to compute
     * how likely this word is "spam"
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }
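As a rough illustration of the join step, the sketch below combines two in-memory maps of per-word probabilities and applies the same clamped rule as `finalizeProb()`. It is plain Java rather than a map/reduce join; the class name, the `join` helper, the sample values, and the neutral 0.5 default for a word with no evidence are assumptions, not part of the original code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SpamProbabilityJoinSketch {

    /** rBad / (rBad + rGood), clamped to [0.01, 0.99]. */
    static float finalizeProb(float rGood, float rBad) {
        // Assumption: stay neutral when there is no evidence at all.
        float pSpam = 0.5f;
        if (rGood + rBad > 0) {
            pSpam = rBad / (rBad + rGood);
        }
        return Math.max(0.01f, Math.min(0.99f, pSpam));
    }

    /** Joins the bad and good probability sets on the word key. */
    static Map<String, Float> join(Map<String, Float> bad,
                                   Map<String, Float> good) {
        Set<String> words = new HashSet<>(bad.keySet());
        words.addAll(good.keySet());

        Map<String, Float> result = new HashMap<>();
        for (String word : words) {
            // A word missing from one set contributes 0 for that side.
            float rBad = bad.getOrDefault(word, 0f);
            float rGood = good.getOrDefault(word, 0f);
            result.put(word, finalizeProb(rGood, rBad));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-word probabilities from the previous step.
        Map<String, Float> bad = new HashMap<>();
        bad.put("cheap", 0.5f);
        bad.put("hello", 0.1f);
        Map<String, Float> good = new HashMap<>();
        good.put("hello", 0.3f);
        good.put("meeting", 0.2f);

        System.out.println(join(bad, good));
    }
}
```

The clamping keeps a word seen in only one category from being treated as absolute proof: in this example "cheap" ends up at 0.99 rather than 1.0, and "meeting" at 0.01 rather than 0.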

88 comments:

Miles Osborne, 8/10/08 00:37
You probably want to smooth the estimates too (especially for zero counts, i.e. unknown tokens).

A simple approach is "add one": for some count n, pretend you saw it n+1 times.
Edward J. Yoon, 8/10/08 00:42
Oh, good point, Miles!! Thanks for your review. :)
Bryan, 22/2/09 18:42
The materials should be small and definitive; that is to say, a distributed algorithm won't help much in this case. But it's a nice example of Map/Reduce.
Edward J. Yoon, 22/2/09 18:53
I agree in part that they should be small and definitive, but I was thinking of per-user Bayesian filtering for a large-scale web-mail service, where there are a lot of users. ;)
amar, 26/7/17 00:44
The practical parts are very good, but the theoretical material is thin... still, it's good for the technical phase.