Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

To address the growing problem of junk email on the Internet, I examined methods for automatically constructing filters that remove such unwanted messages from a user's mail stream. One of them is "Bayesian spam filtering".
Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words)=Pr(words|spam)Pr(spam)/Pr(words)
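
As a quick worked instance of the formula above, the following sketch plugs in made-up numbers for a single word. The three probabilities and the class name are illustrative assumptions, not measurements from any corpus.

```java
// Worked example of Bayes' theorem for a single word.
// All numbers below are hypothetical, chosen only for illustration.
class BayesExample {
    /** Pr(spam|word) = Pr(word|spam) * Pr(spam) / Pr(word) */
    static double posterior(double pWordGivenSpam, double pSpam, double pWord) {
        return pWordGivenSpam * pSpam / pWord;
    }

    public static void main(String[] args) {
        double pWordGivenSpam = 0.2; // assumed: word appears in 20% of spam
        double pSpam = 0.4;          // assumed: 40% of all mail is spam
        double pWord = 0.1;          // assumed: word appears in 10% of all mail
        // With these inputs the posterior comes out at roughly 0.8.
        System.out.println(posterior(pWordGivenSpam, pSpam, pWord));
    }
}
```

With these inputs, a message containing the word is about four times as likely to be spam as not, even though only 40% of mail is spam overall.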

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the spam probability P(word|category) of each word from the files of a category (bad/good e-mails) as described below:

Update: emit <category, probability> pairs and have the reducer simply sum up the probabilities for a given category.

That makes it even simpler. :)

Map:
    /**
     * Counts word frequency
     */
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          // Count every accepted word toward the category total
          spamTotal++;
          // count is assumed to be an IntWritable holding 1
          output.collect(new Text(word), count);
        }
      }
    }
Reduce:
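    /**
     * Computes bad (or good) count / total bad (or good) words
     */
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      // Sum the counts emitted by the mappers for this word
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // spamTotal must be the category's total word count. A field
      // incremented in the mapper is not shared with reducers running
      // in other JVMs, so it has to be supplied separately (e.g. via
      // the job configuration).
      output.collect(key,
        new FloatWritable((float) sum / spamTotal));
    }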
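
To see what the two phases compute without a cluster, here is a minimal plain-Java sketch that imitates the map step (one count per matching token) and the reduce step (sum per word, divide by the total). The class name, both regular expressions, and the sample lines are assumptions for illustration; the post does not show the actual values of splitregex or wordregex.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Local, single-process imitation of the map and reduce steps above.
// The regexes are assumptions; the original splitregex/wordregex are not shown.
class WordProbability {
    static final String SPLIT_REGEX = "\\W+";
    static final Pattern WORD_REGEX = Pattern.compile("[a-z]{2,}");

    /** Returns count(word) / total accepted words for one category. */
    static Map<String, Float> compute(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        // "Map" phase: one count per accepted token.
        for (String line : lines) {
            for (String token : line.split(SPLIT_REGEX)) {
                String word = token.toLowerCase();
                if (WORD_REGEX.matcher(word).matches()) {
                    total++;
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        // "Reduce" phase: turn each summed count into a relative frequency.
        Map<String, Float> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue().floatValue() / total);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Hypothetical spam lines, used only to exercise the code.
        List<String> spam = Arrays.asList("Buy cheap pills", "cheap cheap offer");
        System.out.println(compute(spam));
    }
}
```

Because the total is computed in the same process here, the sketch sidesteps the question of how a real job shares that total between map and reduce tasks.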
2. Once all words have been added, we join the two data sets (the per-word probabilities for spam and for good mail) on the same key to get the rBad and rGood values for each word, then finalize the result, for example with a join map/reduce, as described below:
    /**
     * Implement Bayes' rule to compute
     * how likely this word is "spam"
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }
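
The join and the finalization step can be sketched in plain Java as follows. This is a minimal illustration that assumes the two per-category outputs fit in memory as maps, that pSpam starts at 0, and that a word missing from one category has frequency 0 there; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for the join map/reduce: combine per-word
// frequencies from the bad and good sets into a spam probability.
class SpamProbabilityJoin {
    /** Same rule as finalizeProb(): rBad / (rBad + rGood), clamped to [0.01, 0.99]. */
    static float finalizeProb(float rBad, float rGood) {
        float pSpam = 0f; // assumed initial value
        if (rGood + rBad > 0) {
            pSpam = rBad / (rBad + rGood);
        }
        return Math.max(0.01f, Math.min(0.99f, pSpam));
    }

    /** Joins the two maps on the word key; a missing word counts as 0. */
    static Map<String, Float> join(Map<String, Float> bad, Map<String, Float> good) {
        Set<String> words = new HashSet<>(bad.keySet());
        words.addAll(good.keySet());
        Map<String, Float> result = new HashMap<>();
        for (String w : words) {
            result.put(w, finalizeProb(bad.getOrDefault(w, 0f), good.getOrDefault(w, 0f)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-category frequencies.
        Map<String, Float> bad = new HashMap<>();
        bad.put("cheap", 0.5f);
        bad.put("offer", 0.25f);
        Map<String, Float> good = new HashMap<>();
        good.put("offer", 0.25f);
        good.put("meeting", 0.4f);
        System.out.println(join(bad, good));
    }
}
```

The clamping keeps a word seen in only one category from pushing the probability to exactly 0 or 1, which would otherwise dominate any later combination of per-word scores.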

Comments:

  1. You probably want to smooth the estimates too (especially for zero counts, i.e. unknown tokens).

    A simple approach is "add one": for some count n, pretend you saw it n+1 times.

  2. Oh, Good point, miles!! Thanks for your review. :)

  3. The materials should be small and definitive; that is to say, a distributed algorithm won't help much in this case. But it's a nice example for Map/Reduce.

  4. I agree in part that it should be small and definitive, but I was thinking of per-user Bayesian filtering for a large-scale web-mail service, where there are a lot of users. ;)
