Tweaking your anti-spam settings

While there are a few other arrows in our arsenal, the main tool we use for spam filtering at CSAIL is SpamAssassin. This page describes how to customize how our SpamAssassin installation filters your mail.

(This page is not an exhaustive guide to what you can do to tweak your SpamAssassin experience; it just addresses a few topics in greater detail than they’re covered elsewhere and brings them together.)

The SpamAssassin customization form

We’ve set up a (very mid-’90s) web form for CSAIL folks to customize their SpamAssassin settings. You can get to it by going to http://mail.csail.mit.edu/ (which will redirect you elsewhere) and following the “Edit SpamAssassin settings” link. (Note that this is different from the “Change your password and anti-spam settings” link! That form lets you turn on or off SpamAssassin filtering for your mail altogether; the “Edit SpamAssassin settings” form lets you customize it once it’s turned on.)

This gets you to the “Spamassassin Settings for [your login name]” page, which has a bunch of per-user settings you can adjust.

Two stages of SpamAssassin processing

In order to understand CSAIL’s spam filtering, and the effects of changes you may make to your anti-spam settings, it’s important to know that there are (up to) two passes through SpamAssassin a message will take before it gets delivered. The first one is CSAIL-wide and controls whether the message is accepted at all, or (for very high-scoring messages) refused outright by our mail server. The second one (if you’ve enabled it) is specific to your address, and controls whether the message is marked as spam or not before delivery – and then you can use that to determine how to filter your mail.

The first SpamAssassin scan is on a per-message basis, and a single message can have multiple recipients. (For instance, a message might be sent To: you, Cc:’ing your labmates, and Bcc:’ing the sender’s email account if that happens to be how they keep track of their sent mail. That would probably come in to our server as a single message with multiple recipients listed. Or a piece of spam might be sent to a couple dozen different addresses at CSAIL as a single transaction.) At that stage, assuming we don’t have some reason to particularly trust the message, it’s run through SpamAssassin, and if it scores very high, the mail server refuses it. (In the unlikely event it was a legitimate message, that will result in the mail server that connected to us generating a bounce to the sender. It’s important that we’re not generating the bounce, because spam is often sent with forged sender addresses, so if we generate a bounce message in response to a piece of spam it often goes to an innocent bystander and in effect we’re sending spam to some other victim.)

The second SpamAssassin scan is done once separately for each recipient (who has turned on SpamAssassin, see below) before the message is delivered. This is the one you can configure to your liking as described below.

The first scan refuses outright any mail (from many, but not all, sources) that the site-wide SpamAssassin run scores over 8. The second scan, by default, marks as spam any message that your customized SpamAssassin configuration (with your whitelist, your blacklist, and your spam-filtering history) scores over 5, though you can adjust that threshold.

What this means is that a message that SpamAssassin scores as very likely to be spam will not be let through by your whitelist, since it will already have been refused by the server. On the other hand, if a message is from a sender on your blacklist, SpamAssassin will score your copy of that message as spam on the second scan, so it will be marked as spam and probably delivered into your Spam folder, but other recipients might see it scored as non-spam and delivered to their inboxes normally.
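The interaction between the two scans can be sketched in a few lines of Python. This is only an illustration of the logic described above, not our mail server’s actual configuration: the function names, the trusted_source flag, and the idea of passing in precomputed scores are all assumptions made for the example. Only the thresholds (8 and 5) come from the text.

```python
SITE_REFUSE_THRESHOLD = 8.0   # first scan: site-wide, from the text above
DEFAULT_USER_THRESHOLD = 5.0  # second scan: default "Required hits"


def first_scan(site_score, trusted_source=False):
    """Return "refuse" or "accept" for the whole message (all recipients)."""
    if not trusted_source and site_score > SITE_REFUSE_THRESHOLD:
        return "refuse"
    return "accept"


def second_scan(user_score, filtering_enabled=True,
                required_hits=DEFAULT_USER_THRESHOLD):
    """Return "spam" or "ham" for one recipient's copy of the message."""
    if filtering_enabled and user_score > required_hits:
        return "spam"
    return "ham"


def deliver(site_score, user_scores, trusted_source=False):
    """Map each recipient to an outcome; user_scores is {recipient: score}."""
    if first_scan(site_score, trusted_source) == "refuse":
        # A per-user whitelist never gets a chance to run.
        return {rcpt: "refused" for rcpt in user_scores}
    return {rcpt: second_scan(score) for rcpt, score in user_scores.items()}
```

For example, deliver(9.0, {"alice": -91.0}) refuses the message even though alice’s whitelist would have given her copy a very low score, while deliver(3.0, {"alice": 103.0, "bob": 3.0}) marks only alice’s copy as spam because only she blacklisted the sender. (The scores here are made up for illustration.)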

(There are CSAIL-wide whitelists and blacklists, and CSAIL can write custom rules to help deal with particular sites’ or senders’ problems getting mail to us, though, so if you have a correspondent who sometimes has trouble getting mail to CSAIL at all, please let us know by sending mail to help@csail.mit.edu.)

Reducing the “Required hits” necessary to mark a message as spam

Most important is the “Required hits” field, which defaults to “5”. That means that any message that SpamAssassin (our main anti-spam software) scores over five points will be treated as spam – marked up with special headers that cause it to be filed into your Spam folder later on.

You can reduce this value (or increase it, not that you’d want to do so), but reducing it increases the risk that legitimate mail will be filed into your Spam folder.

The SpamAssassin project volunteers tune all their rulesets on the assumption that a message with a score of 5 or above will be marked as spam. Negative scores are possible (meaning extremely unlikely to be spam) but typical legitimate mail scores in the 2-3 range, and legitimate mail scoring 4 is entirely possible. If a legitimate message scores 4.5 with a ruleset being tested (or even 4.9), the SpamAssassin team does not consider that a problem. (If it scores 5.0 or 5.1, they do, and will adjust scores or make rules more strict appropriately.)

So, decreasing that value almost certainly means that legitimate mail will be marked as spam more often. Now, exactly how severe a problem that will be will vary a lot based on the typical characteristics of your incoming mail. (For instance, somebody who’s a nutritionist might get a lot of mail with “diabetes” or “supplement” or “physician” in the subject, which are also words that tend to be associated with pharmaceutical spam, so their legitimate mail would tend to score higher. On the other hand, somebody who only used email to correspond with their daughter at UCLA would probably have legitimate mail that scored exceptionally low.)

I spot-checked the mail in my inbox recently, and I only found two messages in the past few years that a “Required hits” value of 4 would have marked as spam (one of them being an exchange with a colleague about spam), so in my case reducing the threshold to 4 would not be a big problem. But your mail flow may be significantly different. If you do reduce the “Required hits” value, I would suggest lowering it by 0.2 or 0.5 at a time, and keeping a close eye on your “Spam” folder after making changes (and checking it fairly regularly even after things seem stable, at least eyeballing sender addresses of messages filed as spam).
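If you want to do a similar spot check yourself, here is a rough Python sketch. It assumes that your delivered mail carries an X-Spam-Status header of the usual SpamAssassin form (something like “No, score=4.3 required=5.0 ...”; older releases wrote “hits=” instead of “score=”), and that you have the raw messages available as strings. The function names and the sample messages are invented for the example.

```python
import re
from email import message_from_string

# Matches e.g. "X-Spam-Status: No, score=4.3 required=5.0 tests=..."
# (older SpamAssassin releases wrote "hits=" instead of "score=").
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the recorded SpamAssassin score, or None if there is none."""
    status = message_from_string(raw_message).get("X-Spam-Status")
    if status is None:
        return None
    match = SCORE_RE.search(status)
    return float(match.group(1)) if match else None


def newly_flagged(raw_messages, new_threshold, old_threshold=5.0):
    """Subjects of messages that a lower threshold would start flagging."""
    flagged = []
    for raw in raw_messages:
        score = spam_score(raw)
        if score is not None and new_threshold < score <= old_threshold:
            flagged.append(message_from_string(raw).get("Subject", ""))
    return flagged


# Made-up sample messages, for illustration only.
SAMPLE = [
    "Subject: Re: spam rules\nX-Spam-Status: No, score=4.3 required=5.0\n\nbody\n",
    "Subject: lunch?\nX-Spam-Status: No, score=-1.2 required=5.0\n\nbody\n",
]
print(newly_flagged(SAMPLE, new_threshold=4.0))
```

Run against a copy of your own inbox, the list it prints is the legitimate mail you would have had to fish out of your Spam folder at the lower setting.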

Bayesian filtering

The “Bayesian filtering” checkbox lets you turn on or off the simplistic built-in Bayesian filtering system SpamAssassin supports. Essentially, this takes messages that SpamAssassin is very confident are ham (non-spam) or spam, and uses them to learn the characteristics of each.

As a very oversimplified example, if you get a lot of non-spam messages about ACM conferences or publications, you’ll probably get a lot more very low-scoring messages with the term “ACM” in them than very high-scoring messages, and gradually SpamAssassin will learn to slightly reduce the score of messages with that term in them. On the other hand, if you get a lot of spam that has the word “invest” in it, SpamAssassin will gradually learn to slightly increase the score of messages with that term in them. (But if you also get mail from your broker or your spouse that uses the word “invest”, SpamAssassin will see that messages that are not spam and messages that are spam both use that term, and won’t be as likely to use it for scoring; Bayesian spam filtering is about finding the distinctive characteristics of otherwise high-scoring and low-scoring messages.)
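To make the “ACM” and “invest” example a little more concrete, here is a toy Python sketch of the general idea. It is not SpamAssassin’s actual Bayes implementation, which tokenizes and combines evidence in a more sophisticated way; the function names, the smoothing, and the tiny training sets are all assumptions made for illustration.

```python
from collections import Counter


def train(messages):
    """Count, per token, how many messages in the list contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spamminess(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) with add-one smoothing.

    A value near 0.5 means the token carries little signal either way.
    """
    p_token_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_spam / (p_token_spam + p_token_ham)


# Tiny made-up training sets.
SPAM = ["invest now", "invest in pills", "cheap pills"]
HAM = ["acm conference deadline", "acm publication accepted",
       "should we invest in index funds"]

spam_counts, ham_counts = train(SPAM), train(HAM)
for word in ("acm", "invest", "pills"):
    print(word, token_spamminess(word, spam_counts, ham_counts,
                                 len(SPAM), len(HAM)))
```

With these samples, “acm” comes out well below 0.5 (it pulls scores down), “pills” well above it, and “invest” lands closer to 0.5 than “pills” does, because it shows up in both spam and legitimate mail.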

Most of this scoring will be automatic, based on messages that SpamAssassin has a high confidence it has already classified correctly, but you can also manually train SpamAssassin’s Bayesian classifier by moving messages into your MissedSpam folder to learn them as spam (and then delete them) or copying messages into your NotSpam folder to learn them as non-spam (and then delete the copies). (Perhaps it would have been wiser for us to set things up so that messages in NotSpam weren’t deleted after learning, but that’s been the behavior for many years.)

If you find that lots of messages are getting classified as spam just because, say, they happen to be in a language you also get a lot of spam in, you might want to turn this off.

What mail gets checked for spam

To reduce the load on our servers, decrease the likelihood of legitimate intra-CSAIL mail getting marked as spam, and avoid checking the same message multiple times (which tends to decrease accuracy for complicated reasons), we skip anti-spam filtering for mail originating within CSAIL. That includes mail resent from CSAIL mailing lists (which, if it originally came from outside addresses, was generally already checked for spam as it came in to the list). However, different mailing lists have different configurations (at the list owner’s discretion) for how they handle mail that’s been marked as spam. The default, when somebody at CSAIL creates a new mailing list, is to hold suspected spam for moderation, but list owners can change that, so some lists may have been (re)configured to allow spam through automatically (or some list owners may not look at the queue of held messages very carefully when approving or rejecting messages, and may make mistakes).

Also, most mail that originates directly from a machine on the CSAIL network is not checked for spam. This includes authenticated mail sent by humans with ordinary email clients, but it also includes software-generated mail. One source of software-generated mail is things like error logs from nightly jobs and automated reports. Another, though, is things like web forms and wiki update reports: mail generated by software on web servers or other interactive services accessible to the public. Most of those are put up not by TIG sysadmins but by CSAIL researchers in support of their research, so they’re not necessarily secured very well, and that’s a possible additional source of spam.

The obvious solution is to subject this mail, or some of it, to more extensive scrutiny and block messages based on content. But there are complicated historical and technical reasons why we can’t safely do that without either coordinating software changes with all the people running their own software on our web servers or accepting that lots of stuff will break, perhaps days before a research deadline.

If we’d had the sort of infrastructure five or ten years ago that made it easy for research groups to bring up their own web servers, isolated from each other and easily identifiable by us, we wouldn’t have accumulated all this software that depends on the current behavior in the first place, and we would easily be able to identify the people or groups we needed to coordinate with about changes like this. At this point, though, our users have lots of legacy code, and we don’t have easy ways to find which research groups are sending which messages (although Mark Pearrow is doing some good work that will help with this). We do intend to fix this, but it’s not going to be an immediate fix.

(And just for the sake of accuracy, there are some CSAIL network ranges – like the IP addresses associated with the StataCenter unauthenticated wireless network – that we do do spam processing for, and a bunch of non-CSAIL MIT network ranges that we don’t do spam processing for, for historical reasons similar to those behind the web-app-generated mail.)
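In rough terms, the decision about whether to scan a message depends on the address of the machine that handed it to us. The Python sketch below shows the shape of that decision only. The network ranges are reserved documentation ranges standing in for the real ones, and the list names and the should_scan function are hypothetical; none of this is our actual mail server configuration.

```python
import ipaddress

# Hypothetical ranges (reserved documentation networks), NOT real CSAIL ones.
INTERNAL_NETS = [ipaddress.ip_network("192.0.2.0/24")]
# Ranges inside the internal space that are scanned anyway,
# e.g. an unauthenticated wireless network.
SCANNED_EXCEPTIONS = [ipaddress.ip_network("192.0.2.128/25")]
# Outside ranges that are skipped for historical reasons.
EXTERNAL_SKIPPED = [ipaddress.ip_network("198.51.100.0/24")]


def should_scan(client_ip):
    """Return True if mail from this client address gets spam-checked."""
    addr = ipaddress.ip_address(client_ip)
    # Exceptions are checked first because they overlap the internal range.
    if any(addr in net for net in SCANNED_EXCEPTIONS):
        return True
    if any(addr in net for net in INTERNAL_NETS + EXTERNAL_SKIPPED):
        return False
    return True
```

Under these made-up ranges, should_scan("192.0.2.10") is False (ordinary internal machine), should_scan("192.0.2.200") is True (the wireless exception), and should_scan("203.0.113.5") is True (the outside world).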

Whitelisting senders

If mail from a particular (legitimate) correspondent sometimes incorrectly gets marked as spam and delivered to your Spam folder, you can add their address to your “From: whitelist”. (Again, you do this on the “Edit SpamAssassin settings” page, which you can get to from http://mail.csail.mit.edu/.) You can only add one address at a time, but when you add an address and choose Submit, another field opens up so you can add another address; in this way you can add as many addresses as you need.

If mail from a particular legitimate correspondent consistently scores so high that we refuse the message altogether, rather than just marking it as spam, this won’t work, and the sender will get a bounce message. That’s pretty rare, but if it does happen, you can contact us (at help@csail.mit.edu) and let us know, and we can investigate the situation and either add their address to a site-wide whitelist, or tweak the SpamAssassin rules or scores to prevent their messages scoring so high. (Sometimes, what we do is work with their postmaster to help them correct a configuration problem.)

You can use the “*” pattern-matching character, so “*@eecs.mit.edu” will whitelist any (possibly forged) EECS sender address.
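As a sketch of what that kind of pattern matching amounts to, here is a small Python example. It assumes the patterns behave like simple globs in which “*” matches any run of characters, “?” matches a single character, and matching ignores case; the function names are invented, and this is not SpamAssassin’s own code.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob using only "*" and "?" into an anchored regex."""
    escaped = re.escape(pattern)
    escaped = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile(escaped + r"\Z", re.IGNORECASE)


def is_listed(address, patterns):
    """Return True if the address matches any pattern in the list."""
    return any(glob_to_regex(p).match(address) for p in patterns)


WHITELIST = ["*@eecs.mit.edu"]
print(is_listed("alice@eecs.mit.edu", WHITELIST))
print(is_listed("alice@eecs.mit.edu.example.com", WHITELIST))
```

Note that the match is against whatever sender address the message claims, which is why a wildcard entry also lets through forged addresses in that domain.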

(See “Additional Links” below, or the help text available on the form itself, for details about what message fields are used for whitelist and blacklist matching.)

Blacklisting senders

If a spammer repeatedly uses the same address, you can blacklist it using the “From: blacklist” field. (As with the whitelist field, you can only add one address at a time, but when you do, a new field will appear so you can keep adding addresses.)

This is of somewhat less utility than you might hope, since these addresses are generally under the control of spammers, and they are often randomly chosen. But some spammers do persistently use the same address or domain, and they can be blacklisted.

Be warned that some spammers spoof real email addresses. So if you get a fraudulent phishing message that falsely claims to be from your bank, it might be spoofing the real address that legitimate mail from your bank would come from, and if you blacklist it you might miss real mail, too. More typically, though, spammers will spoof an address that looks plausible but isn’t actually correct. (And most often the addresses have nothing to do with the content of the spam, on the assumption that most people just look at the human-readable name and don’t even see the email address.)

(See “Additional Links” below, or the help text available on the form itself, for details about what message fields are used for whitelist and blacklist matching.)

Blacklisting recipients

You can also use the form to blacklist recipients – i.e., mark as spam messages with a particular recipient address or address pattern. Why would you want to do that? There are at least two reasons:

(See “Additional Links” below, or the help text available on the form itself, for details about what message fields are used for whitelist and blacklist matching.)