Author: gravity0905

Description: Naive Bayes Spam Classifier
Language: Jupyter Notebook
Repository: git://github.com/gravity0905/SpamClassifier.git
Created: 2017-07-23T08:41:19Z
Project page: https://github.com/gravity0905/SpamClassifier

License: MIT License


SpamClassifier

This project compares the performance of two popular Naive Bayes spam classifiers:

  • Multi-variate Bernoulli event model
  • Multinomial event model
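
To make the difference between the two event models concrete, here is a minimal sketch in plain Python. It is not the code from this repository: the function names, the toy documents, and the use of add-one (Laplace) smoothing are all assumptions made for illustration. The Bernoulli model scores every vocabulary word as present or absent in the message, while the multinomial model scores each token occurrence.

```python
import math
from collections import Counter


def train(docs, labels):
    """Collect the per-class counts that both event models need."""
    vocab = sorted({w for d in docs for w in d})
    stats = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        stats[c] = {
            "n_docs": len(class_docs),
            # number of documents in the class that contain each word
            "doc_freq": Counter(w for d in class_docs for w in set(d)),
            # number of occurrences of each word in the class
            "term_freq": Counter(w for d in class_docs for w in d),
        }
    return vocab, stats, len(docs)


def bernoulli_log_score(doc, c, vocab, stats, n_total):
    """Multi-variate Bernoulli: every vocabulary word contributes,
    whether it is present in the document or absent from it."""
    s = stats[c]
    present = set(doc)
    score = math.log(s["n_docs"] / n_total)  # log prior
    for w in vocab:
        p = (s["doc_freq"][w] + 1) / (s["n_docs"] + 2)  # add-one smoothing (assumed)
        score += math.log(p if w in present else 1 - p)
    return score


def multinomial_log_score(doc, c, vocab, stats, n_total):
    """Multinomial: only the tokens that occur contribute,
    once per occurrence; out-of-vocabulary tokens are skipped."""
    s = stats[c]
    total = sum(s["term_freq"].values())
    known = set(vocab)
    score = math.log(s["n_docs"] / n_total)  # log prior
    for w in doc:
        if w in known:
            score += math.log((s["term_freq"][w] + 1) / (total + len(vocab)))
    return score


def classify(doc, score_fn, vocab, stats, n_total):
    """Return the class with the highest log score."""
    return max(stats, key=lambda c: score_fn(doc, c, vocab, stats, n_total))


# Toy data, invented for this example (already tokenized).
docs = [
    ["free", "money", "free"],
    ["win", "money", "now"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "now"],
]
labels = ["spam", "spam", "ham", "ham"]
vocab, stats, n_total = train(docs, labels)

print(classify(["free", "money"], bernoulli_log_score, vocab, stats, n_total))
print(classify(["free", "money"], multinomial_log_score, vocab, stats, n_total))
```

Note that in the Bernoulli score a word repeated within a message counts once, whereas the multinomial score grows with every repetition.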

Spam Dataset

The Ling-Spam corpus is used for training the models.

Preprocessing

All the mails in the corpus's bare subdirectory were preprocessed with the process.py script, and the results were stored in a separate directory.
The following preprocessing and normalization steps were applied, in the order listed:

  • Lower casing
  • Stripping HTML tags
  • Normalizing URLs
  • Normalizing email addresses
  • Normalizing numbers
  • Normalizing currency symbols
  • Removal of non-word characters
  • Stop Word removal
  • Word Stemming
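
The steps above can be sketched as a single function. This is an illustration under stated assumptions, not the contents of process.py: the placeholder tokens (httpaddr, emailaddr, number, currency), the regular expressions, and the tiny stop word list are hypothetical, and the stemmer is passed in as a callable (identity by default) so that any Porter implementation can be plugged in.

```python
import re

# Illustrative stop word list only; a real run would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "at", "is", "of", "or", "the", "to", "you"}


def preprocess(text, stem=lambda w: w, stop_words=STOP_WORDS):
    """Apply the normalization steps in the listed order and return tokens."""
    text = text.lower()                                           # lower casing
    text = re.sub(r"<[^<>]+>", " ", text)                         # strip HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " httpaddr ", text)  # normalize URLs
    text = re.sub(r"\S+@\S+", " emailaddr ", text)                # normalize email addresses
    text = re.sub(r"\d+(?:[.,]\d+)*", " number ", text)           # normalize numbers
    text = re.sub(r"[$€£]", " currency ", text)                   # normalize currency symbols
    tokens = re.findall(r"[a-z]+", text)                          # drop non-word characters
    tokens = [t for t in tokens if t not in stop_words]           # stop word removal
    return [stem(t) for t in tokens]                              # word stemming (pluggable)


sample = "<b>WIN</b> $1,000 now! Visit http://example.com or mail bob@example.com"
print(preprocess(sample))
```

The order matters: URLs and email addresses are replaced before numbers and non-word characters are handled, so digits and punctuation inside an address do not get split into separate tokens.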

Word Stemming

Word stemming uses the Porter stemming algorithm, in a Python port of the
ANSI C version written by the algorithm's author.

License

Copyright (c) 2017 Garvit Aggarwal