Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

To address the growing problem of junk email on the Internet, I examined methods for automatically constructing filters that remove such unwanted messages from a user's mail stream. One of them is Bayesian spam filtering.
Bayesian email filters take advantage of Bayes' theorem. In the context of spam, Bayes' theorem says that the probability that an email is spam, given that it contains certain words, equals the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words)=Pr(words|spam)Pr(spam)/Pr(words)
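To make the formula concrete, here is a minimal sketch that plugs numbers into it. The class name `BayesRule`, the method `posterior`, and the probabilities in `main` are made up for illustration and are not part of the filter described in this post.

```java
// Hypothetical helper, not part of the original filter.
final class BayesRule {
  /** Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words). */
  static double posterior(double pWordsGivenSpam, double pSpam, double pWords) {
    if (pWords <= 0.0) {
      throw new IllegalArgumentException("Pr(words) must be positive");
    }
    return pWordsGivenSpam * pSpam / pWords;
  }

  public static void main(String[] args) {
    // Made-up numbers: the words occur in 50% of spam, 40% of all mail
    // is spam, and the words occur in 25% of all mail.
    System.out.println(posterior(0.5, 0.4, 0.25));
  }
}
```

With these made-up inputs the posterior comes out at about 0.8, i.e. a message containing those words would be judged much more likely to be spam than the 40% base rate.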

In this post, I introduce an implementation of a parallel Bayesian spam filtering algorithm on a distributed system (Hadoop).

1. We can get the probability P(word|category) of each word from the files of a category (bad/good e-mails) as described below:

Update: alternatively, emit <category,probability> pairs and have the reducer
simply sum up the probabilities for a given category.

That makes it even simpler. :)

Map:
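    /**
     * Counts word frequency: emits <word, count> for every token
     * that matches wordregex.
     */
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          // spamTotal is a field of this class, so it only counts the
          // words seen by this task.
          spamTotal++;
          // count is assumed to be an IntWritable holding 1.
          output.collect(new Text(word), count);
        }
      }
    }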
Reduce:
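    /**
     * Computes bad (or good) count / total bad (or good) words
     */
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // spamTotal must be the total word count of the category. A field
      // incremented in map() is only visible here when map and reduce
      // run in the same JVM (e.g. local mode); on a cluster the total
      // has to reach the reducer some other way.
      output.collect(key,
        new FloatWritable((float) sum / spamTotal));
    }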
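To try step 1 without a cluster, the same logic can be simulated in plain Java. This is only a sketch under assumptions: `WordFrequencySketch` and `wordProbabilities` are hypothetical names, and `SPLIT_REGEX` and `WORD_REGEX` are guesses standing in for `splitregex` and `wordregex`, whose definitions are not shown in this post.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical single-process simulation of the map and reduce steps above.
final class WordFrequencySketch {
  // Assumed stand-ins for splitregex and wordregex.
  private static final String SPLIT_REGEX = "\\s+";
  private static final Pattern WORD_REGEX = Pattern.compile("[a-z]+");

  /** Returns word -> (count of word) / (total matching words) for one category. */
  static Map<String, Float> wordProbabilities(Iterable<String> lines) {
    Map<String, Integer> counts = new HashMap<>();
    int total = 0;

    // "Map" phase: count every token that looks like a word.
    for (String line : lines) {
      for (String token : line.split(SPLIT_REGEX)) {
        String word = token.toLowerCase();
        if (WORD_REGEX.matcher(word).matches()) {
          total++;
          counts.merge(word, 1, Integer::sum);
        }
      }
    }

    // "Reduce" phase: divide each word count by the category total.
    Map<String, Float> probabilities = new HashMap<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      probabilities.put(entry.getKey(), (float) entry.getValue() / total);
    }
    return probabilities;
  }

  public static void main(String[] args) {
    // Two made-up "bad" messages.
    System.out.println(
        wordProbabilities(List.of("Buy cheap pills", "cheap cheap offer")));
  }
}
```

Here the total is known before any division happens because everything runs in one process. In a real job the category total would have to be made available to the reducers explicitly.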
2. Joining the two data sets (the per-word probability sets for bad and good e-mails) on the same key gives an rBad and an rGood value for each word. Once all words have been added, we finalize the results, for example in a join map/reduce, as described below:
    /**
     * Implements Bayes' rule to compute
     * how likely this word is "spam"
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }
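The join in step 2 can be sketched in plain Java as follows. The class and method names are hypothetical, the two maps stand in for the outputs of step 1, and the neutral 0.5 default for a word with no counts at all is my assumption (the snippet above leaves pSpam unchanged in that case).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory version of the join and finalize step.
final class SpamProbabilityJoinSketch {
  /** Same rule as finalizeProb() above, written as a pure function. */
  static float finalizeProb(float rGood, float rBad) {
    float pSpam = 0.5f; // assumed default when the word was never seen
    if (rGood + rBad > 0) {
      pSpam = rBad / (rBad + rGood);
    }
    // Clamp to [0.01, 0.99] so no single word is treated as certain.
    return Math.max(0.01f, Math.min(0.99f, pSpam));
  }

  /** Joins the good and bad probability sets on the word key. */
  static Map<String, Float> join(Map<String, Float> good, Map<String, Float> bad) {
    Set<String> words = new HashSet<>(good.keySet());
    words.addAll(bad.keySet());

    Map<String, Float> result = new HashMap<>();
    for (String word : words) {
      float rGood = good.getOrDefault(word, 0f);
      float rBad = bad.getOrDefault(word, 0f);
      result.put(word, finalizeProb(rGood, rBad));
    }
    return result;
  }
}
```

A word that appears only in good mail ends up at 0.01 and a word that appears only in bad mail at 0.99, which is exactly what the clamping in finalizeProb() is for.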

Comments

  1. You probably want to smooth the estimates too (especially for zero counts, i.e. unknown tokens).

    A simple approach is "add one": for some count n, pretend you saw it n+1 times.

  2. Oh, Good point, miles!! Thanks for your review. :)

  3. The material should be small and definitive; that is to say, a distributed algorithm won't help much in this case. But it's a nice example of Map/Reduce.

  4. I agree in part that it should be small and definitive, but I was thinking of per-user Bayesian filtering for a large-scale web-mail service, where there are a lot of users. ;)

