Distributed Bayesian Spam Filtering Using Hadoop Map/Reduce

While looking into the growing problem of junk email on the Internet, I examined methods for automatically constructing filters that remove such unwanted messages from a user's mail stream. One of them is Bayesian spam filtering.
Bayesian email filters take advantage of Bayes' theorem. In the context of spam, the theorem says that the probability that an email is spam, given that it contains certain words, equals the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words)
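
As a quick illustration of the formula, here is a minimal plain-Java sketch. The class name, method names, and all of the numbers are assumptions made up for this example; Pr(words) is expanded over the two classes (spam and non-spam) by the law of total probability.

```java
// Minimal sketch of Bayes' theorem for spam; names and numbers are illustrative only.
final class BayesRule {
  /** Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words). */
  static double posterior(double pWordsGivenSpam, double pSpam, double pWords) {
    if (pWords <= 0.0) {
      throw new IllegalArgumentException("Pr(words) must be positive");
    }
    return pWordsGivenSpam * pSpam / pWords;
  }

  /** Pr(words) summed over both classes: spam and non-spam ("ham"). */
  static double evidence(double pWordsGivenSpam, double pWordsGivenHam, double pSpam) {
    return pWordsGivenSpam * pSpam + pWordsGivenHam * (1.0 - pSpam);
  }

  public static void main(String[] args) {
    double pSpam = 0.5;            // assumed prior: half of all mail is spam
    double pWordsGivenSpam = 0.2;  // assumed: the words show up in 20% of spam
    double pWordsGivenHam = 0.05;  // assumed: the words show up in 5% of good mail
    double pWords = evidence(pWordsGivenSpam, pWordsGivenHam, pSpam);
    System.out.println("Pr(spam|words) = " + posterior(pWordsGivenSpam, pSpam, pWords));
  }
}
```

With these made-up inputs the posterior comes out at roughly 0.8: the words are four times as likely in spam as in good mail, and the prior is even.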

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the probability P(word|category) of each word from the files of a category (bad/good e-mails), as described below:

Update: emit <category,probability> pairs and have the reducer simply sum up
the probabilities for a given category.

Then it'll be even simpler. :)

Map:
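    /**
     * Counts word frequency
     */
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // The type parameters above are assumed; "count" is taken to be
      // an IntWritable field holding 1.
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (String token : tokens) {
        String word = token.toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          // Caveat: spamTotal is a field of this map task only; it is
          // not shared with other map tasks or with the reducers.
          spamTotal++;
          output.collect(new Text(word), count);
        }
      }
    }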
Reduce:
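    /**
     * Computes bad (or good) count / total bad (or good) words
     */
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      // Type parameters are assumed (IntWritable counts in, floats out).
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // spamTotal must hold the total word count for the category; on a
      // real cluster it has to reach the reducer by some other route,
      // because the mapper's field is not visible here.
      output.collect(key,
        new FloatWritable((float) sum / spamTotal));
    }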
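
To make the data flow of step 1 easier to follow, here is a small in-memory sketch of the same computation in plain Java, without any Hadoop classes. The class name, the tokenizing regex, and the word pattern are assumptions for illustration (the post does not show splitregex or wordregex). Because everything runs in one process, the total word count is simply a local variable; in a real job that total would have to be made available to the reducers some other way, for example through a counter or a preliminary counting job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// In-memory sketch of "count words, then divide by the category total".
final class WordProbabilitySketch {
  // Assumed stand-ins for the post's splitregex and wordregex.
  private static final String SPLIT_REGEX = "\\W+";
  private static final Pattern WORD_REGEX = Pattern.compile("[a-z]+");

  /** Returns P(word|category) for every word in one category's lines. */
  static Map<String, Float> wordProbabilities(List<String> lines) {
    Map<String, Integer> counts = new TreeMap<>();
    int total = 0;

    // "Map" phase: emit (word, 1) for each valid token and sum per word.
    for (String line : lines) {
      for (String token : line.split(SPLIT_REGEX)) {
        String word = token.toLowerCase();
        if (WORD_REGEX.matcher(word).matches()) {
          total++;
          counts.merge(word, 1, Integer::sum);
        }
      }
    }

    // "Reduce" phase: per-word count divided by the category total.
    Map<String, Float> probabilities = new TreeMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      probabilities.put(e.getKey(), (float) e.getValue() / total);
    }
    return probabilities;
  }

  public static void main(String[] args) {
    List<String> spamLines = Arrays.asList("Buy cheap pills", "buy now");
    System.out.println(wordProbabilities(spamLines));
  }
}
```

Running the same method once over the spam files and once over the good files gives the two probability sets that step 2 joins.
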
2. Next, we join the two data sets (the per-category word probability sets) on the word key to get the rBad and rGood values for each word. Once all words have been added, we finalize the result for each word, for example in a join map/reduce job, as described below:
    /**
     * Apply Bayes' rule to compute
     * how likely this word is "spam"
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }
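
The join itself is not shown above, so here is a minimal in-memory sketch of what step 2 might look like, again in plain Java without Hadoop. The class and method names are hypothetical. A word missing from one category is treated as having probability 0 there, and the neutral 0.5 fallback for a word with no evidence at all is an assumption of this sketch, not something the post specifies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory sketch of joining the bad/good word probabilities per word.
final class SpamJoinSketch {
  /** Same rule as finalizeProb(): rBad / (rBad + rGood), clamped to [0.01, 0.99]. */
  static float spamProbability(float rBad, float rGood) {
    float pSpam = 0.5f; // assumed neutral value when there is no evidence
    if (rGood + rBad > 0) {
      pSpam = rBad / (rBad + rGood);
    }
    return Math.max(0.01f, Math.min(0.99f, pSpam));
  }

  /** Joins the two probability sets on the word key. */
  static Map<String, Float> join(Map<String, Float> bad, Map<String, Float> good) {
    Set<String> words = new TreeSet<>(bad.keySet());
    words.addAll(good.keySet());

    Map<String, Float> result = new TreeMap<>();
    for (String word : words) {
      float rBad = bad.getOrDefault(word, 0f);   // absent from spam set
      float rGood = good.getOrDefault(word, 0f); // absent from good set
      result.put(word, spamProbability(rBad, rGood));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Float> bad = new HashMap<>();
    bad.put("pills", 0.3f);
    bad.put("hello", 0.1f);
    Map<String, Float> good = new HashMap<>();
    good.put("hello", 0.3f);
    good.put("meeting", 0.2f);
    System.out.println(join(bad, good));
  }
}
```

The clamping keeps any single word from being treated as absolute proof either way: a word seen only in spam ends up at 0.99 rather than 1.0.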

Comments:

  1. You probably want to smooth the estimates too (especially for zero counts, i.e. unknown tokens).

    A simple approach is "add one": for some count n, pretend you saw it n+1 times.

  2. Oh, good point, Miles! Thanks for your review. :)

  3. The materials should be small and definitive; that is to say, a distributed algorithm won't help much in this case. But it's a nice example for Map/Reduce.

  4. I agree with you in part that it should be small and definitive, but I was thinking of per-user Bayesian filtering for a large-scale webmail service, where there are a lot of users. ;)
