Spam filtering at CSAIL with SpamAssassin: technical details

Content-based spam filtering is done using SpamAssassin, augmented with third-party and custom rule sets. We also maintain local blacklists of netblocks and IP addresses that have a history of abuse.

SpamAssassin compares the contents and headers of each incoming message against a large set of rules. These rules look for traits common to spam, as well as traits that are likely to indicate that a message is legitimate. Because spammers constantly change their obfuscation tactics to try to bypass tools like SpamAssassin, we continually monitor the effectiveness of our installed rule set and modify it as necessary to keep up.
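
As a rough illustration of rule-based scoring, the sketch below adds up points for every rule that matches a message and compares the total to a threshold. The rule names, patterns, and point values are invented for this example, and the threshold of 5.0 is an assumption rather than a statement about our configuration. Real SpamAssassin rules are written in its own configuration syntax, not Python.

```python
import re

# Hypothetical rules for illustration only: (name, pattern, points).
# Positive points suggest spam; negative points suggest legitimate mail.
RULES = [
    ("MENTIONS_FREE_MONEY", re.compile(r"free money", re.I), 2.5),
    ("OBFUSCATED_DRUG_NAME", re.compile(r"v[i1!]agra", re.I), 3.0),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.0),
    ("HAS_IN_REPLY_TO", re.compile(r"^In-Reply-To:", re.I | re.M), -1.0),
]

THRESHOLD = 5.0  # assumed cutoff; the value used on a real server may differ


def score_message(raw):
    """Return (total score, names of matching rules) for a raw message."""
    hits = [(name, points) for name, pattern, points in RULES
            if pattern.search(raw)]
    total = sum(points for _, points in hits)
    return total, [name for name, _ in hits]


def is_spam(raw):
    """A message is tagged as spam when its total reaches the threshold."""
    total, _ = score_message(raw)
    return total >= THRESHOLD
```

For example, a message with the subject "Free money!!!" and the body "Buy V1agra now" matches the first three rules and is tagged as spam, while a reply carrying an In-Reply-To header and none of the spam traits ends up with a negative score.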

SpamAssassin also uses a Naive Bayes classifier to dynamically learn the difference between spam and legitimate mail. Whenever a message scores very high on the rule-based spam tests, SpamAssassin automatically uses the message as training data for the Bayes classifier. Messages that are obviously legitimate are used as training data in the same way.
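
The idea can be sketched with a textbook naive Bayes classifier plus a simple auto-learn step. This is a minimal illustration, not SpamAssassin's actual implementation: the class TinyBayes, the function maybe_autolearn, the whitespace tokenizer, and the two threshold values are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumed auto-learn cutoffs on the rule-based score (illustrative values).
AUTOLEARN_SPAM = 12.0   # at or above this, learn the message as spam
AUTOLEARN_HAM = 0.1     # at or below this, learn the message as legitimate


class TinyBayes:
    """Textbook naive Bayes over whitespace-separated tokens."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        self.tokens[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"]))
        total_msgs = sum(self.messages.values())
        log_odds = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            # Class prior and token likelihoods, both with add-one smoothing.
            log_p = math.log((self.messages[label] + 1) / (total_msgs + 2))
            n = sum(self.tokens[label].values())
            for tok in text.lower().split():
                log_p += math.log((self.tokens[label][tok] + 1) / (n + vocab + 1))
            log_odds += sign * log_p
        return 1 / (1 + math.exp(-log_odds))


def maybe_autolearn(classifier, text, rule_score):
    """Train only on messages whose rule-based score is clearly one-sided."""
    if rule_score >= AUTOLEARN_SPAM:
        classifier.learn(text, "spam")
        return "spam"
    if rule_score <= AUTOLEARN_HAM:
        classifier.learn(text, "ham")
        return "ham"
    return None  # ambiguous score: do not train on this message
```

Messages with middling scores are deliberately skipped, since training on uncertain classifications would risk teaching the classifier its own mistakes.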

There are two reasons why you might want to train the Bayesian classifier by hand. First, the classifier becomes more effective with more training data, so it helps to train it with a large sample of mail, both legitimate and spam. Second, when SpamAssassin makes a mistake, you can train it to recognize the traits of the misclassified message. To train the classifier by hand, you provide it with messages to use as training data by saving mail into your INBOX.MissedSpam and INBOX.NotSpam mailboxes: spam goes into MissedSpam and legitimate mail goes into NotSpam. To ensure SpamAssassin's accuracy, you should periodically save samples of spam to MissedSpam and samples of legitimate mail to NotSpam, even if they were properly tagged. This keeps the tokens in the Bayes database fresh, which matters because tokens appearing in new messages are compared with the tokens stored in that database.
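
For readers curious how folders like these can be turned into training data, here is a hedged sketch of a server-side job built on SpamAssassin's stock sa-learn tool and its --spam and --ham options. It does not describe how our mail servers are actually set up: the Maildir++ directory layout, the example path, and the function names are assumptions made for illustration.

```python
import subprocess
from pathlib import PurePosixPath

# Folder names come from the text above. Mapping INBOX.MissedSpam to a
# ".MissedSpam" directory assumes a Maildir++ layout, which may not match
# the real server.
FOLDER_FLAGS = {
    ".MissedSpam": "--spam",  # learn these messages as spam
    ".NotSpam": "--ham",      # learn these messages as legitimate mail
}


def build_commands(maildir):
    """Return one sa-learn command line per training folder."""
    commands = []
    for folder, flag in FOLDER_FLAGS.items():
        path = PurePosixPath(maildir) / folder / "cur"
        commands.append(["sa-learn", flag, str(path)])
    return commands


def train(maildir):
    """Run the commands; requires sa-learn to be installed on the host."""
    for cmd in build_commands(maildir):
        subprocess.run(cmd, check=True)
```

Calling build_commands("/home/alice/Maildir") (a hypothetical path) yields two command lines, one that learns the contents of .MissedSpam as spam and one that learns the contents of .NotSpam as legitimate mail.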

Note: The Bayesian classifier needs to see 200 distinct spam messages and 200 distinct legitimate messages before it activates. It is therefore strongly recommended that you save a lot of mail to your MissedSpam and NotSpam folders when you first start using our spam filtering service. The sooner the system sees 200 messages of each type, the sooner you start seeing the benefits of the Bayesian classification engine. If you save fewer than that to each folder, the classifier stays inactive until SpamAssassin has auto-learned enough messages with very high and very low spam scores to reach both thresholds, which takes time.
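
The activation rule amounts to a check on two independent counters, which the small sketch below makes explicit. The function name bayes_status and the shape of its return value are invented for the example; only the figure of 200 comes from the note above.

```python
MIN_SPAM = 200  # spam messages required before the classifier activates
MIN_HAM = 200   # legitimate messages required before it activates


def bayes_status(nspam, nham):
    """Report whether the classifier is active and how much mail is missing."""
    spam_needed = max(0, MIN_SPAM - nspam)
    ham_needed = max(0, MIN_HAM - nham)
    return {
        "active": spam_needed == 0 and ham_needed == 0,
        "spam_needed": spam_needed,
        "ham_needed": ham_needed,
    }
```

With 250 spam messages but only 120 legitimate ones learned, the classifier is still inactive and needs 80 more legitimate messages; a surplus of one type does not make up for a shortfall of the other.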
Topic revision: 10 Jul 2009, amitra
 
