A Distributed Bayesian Spam Filtering using Hadoop Map/Reduce

In addressing the growing problem of junk email on the Internet, I examined methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. One was "Bayesian Spam Filtering".
Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

Pr(spam|words)=Pr(words|spam)Pr(spam)/Pr(words)

In this post, I'm introduce about the implementation of a Parallel Bayesian Spam Filtering Algorithm on the distributed system (Hadoop).

1. We can get the spam probability P(wordcategory) of the words from an files of category (bad/good e-mails) as describe below:

Update: --emit <category,probability> pairs and have the reducer simply sum-up
the probabilities for a given category.

Then, it'll be more simplified. :)

Map:
    /**
     * Counts word frequency
     */
    public void map(LongWritable key, Text value,
        OutputCollector output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          spamTotal++;
          output.collect(new Text(word), count);
        }
      }
    }
Reduce:
    /**
     * Computes bad (or good) count / total bad (or good) words
     */
    public void reduce(Text key, Iterator values,
        OutputCollector output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += (int) values.next().get();
      }

      output.collect(key, 
        new FloatWritable((float) sum / spamTotal));
    }
2. We can get a rBad/rGood value of same key in each data (spam probability set of the words), We are finished adding words so finalize the results () such as a join map/reduce as describe below:
    /**
     * Implement bayes rules to computer 
     * how likely this word is "spam"
     */
    public void finalizeProb() {
      if (rGood + rBad > 0)
        pSpam = rBad / (rBad + rGood);
      if (pSpam < 0.01f)
        pSpam = 0.01f;
      else if (pSpam > 0.99f)
        pSpam = 0.99f;
    }

Comments

  1. You probably want to smooth the estimates too (especially for zero counts ie unknown tokens).

    A simple approach is "add one": for some count n, pretend you saw it n+1 times.

    ReplyDelete
  2. Oh, Good point, miles!! Thanks for your review. :)

    ReplyDelete
  3. The materials should be small and definitive, that is to say, distributed algorithm won't much help in this case. But nice example for Map/Reduce.

    ReplyDelete
  4. I agree with you in part that should be small and definitive, But I thought per-user based bayesian for a large-scale web-mail service, There are a lot of users. ;)

    ReplyDelete
  5. Updating with the recent skills and applying it is the only tactic to live in our vocation. You have done really a great job by sharing this blog in here. Keep writing blog like this. .Hadoop Training in Bangalore | Data Science Training in Bangalore

    ReplyDelete
  6. really awesome blog It was helpful.

    ReplyDelete
  7. Go to the BGAOC website and get instant winnings best internet casino in the world Do not miss your chance.

    ReplyDelete
  8. Современная диодная лента по всем стандартам отличного качества я обычно беру у компании Ekodio

    ReplyDelete
  9. We as a team of real-time industrial experience with a lot of knowledge in developing applications in python programming (7+ years) will ensure that we will deliver our best in python training in vijayawada. , and we believe that no one matches us in this context.

    ReplyDelete
  10. Nice blog,I understood the topic very clearly,And want to study more like this.
    Data Scientist Course

    ReplyDelete
  11. I have to search sites with relevant information on given topic and provide them to teacher our opinion and the article.

    machine learning course

    artificial intelligence course in mumbai

    ReplyDelete
  12. Attend The Data Analytics Courses From ExcelR. Practical Data Analytics Courses Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Analytics Courses.
    ExcelR Data Analytics Courses
    Data Science Interview Questions

    ReplyDelete
  13. I am really enjoying reading your well written articles. It looks like you spend a lot of effort and time on your blog. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work.

    artificial Intelligence course

    machine learning courses in mumbai

    ReplyDelete

Post a Comment

Popular posts from this blog

일본만화 추천 100선

공유 모빌리티 회사로 합류

딥 러닝의 블랙 박스 열어보기

How socket.io-redis adapter works? 레디스 어댑터 동작원리