Stop Spam on Your Mail Server


Casual users of email are only mildly irritated, and even occasionally amused, by spam. “Just click delete!” they say. “One keypress and it’s gone! What could be easier?” The more of it you see, though, and the more wear your Delete key gets, the less tolerant you become. It’s like crazy people coming up to you on the street, perhaps. If you only ever see one, you laugh about his antics forever. If you see one a day, you start to think, “What a shame! Can’t something be done for these poor, poor people?” And if, everywhere you go, you are surrounded by crazy people raving in your ears and blocking your progress, it becomes impossible to get anything done. At that point, you’re basically working in Hollywood.

Spam, for the most part, is not profitable for the advertisers who pay to have it sent. It has an incredibly low success rate, and only seems like a good idea because it’s so cheap to reach millions of inboxes. The only ones who make a profit are the middlemen: the spamhouses that take money from hapless breast-enlargement-pill manufacturers in exchange for almost-worthless bulk mailings. They use shifty techniques like forged email headers, automated freemail accounts, and bulk-mailing software.

When you start getting a lot of spam, or when you manage email for a number of people, it becomes crucial to sort the noise out of the signal. Because sorting by hand is tedious and unfeasible on even a moderate scale, the key is, of course, finding a way that a computer can distinguish spam from non-spam. A number of interesting solutions to this problem have been attempted.

In this article, I assume that you are running a mail server like the one described in Set Up IMAP on Your Mail Server. Many of the techniques described here still apply on any Unix system, even if it’s just a mail client machine, and the principles apply to any email handling process.

Contents

  1. Procmail
  2. An Assassination Attempt
  3. Sharpen Those Razors
  4. Filters That Learn
  5. Last-minute Touches

Procmail

First, we need a tool that can place mail in various folders for us, according to rules that we specify. The Unix gods have seen fit to bestow upon us Procmail, which is an extremely powerful mail processing program. It runs each message through a list of matching instructions, or “recipes,” and deals with the message accordingly.

Procmail should already be installed on any standard Linux distribution. It will run automatically for any user who has a .procmailrc file in her home directory. This file contains some basic configuration settings; I like to keep the actual recipe in a separate file, for ease of manipulation. (When creating new user accounts on a Unix system, each user’s home directory is created containing copies of the files in the /etc/skel directory. Put a basic .procmailrc in there and it will work for each new user on your system. Alternately, or in addition, you can put some master rules in /etc/procmailrc.)

The .procmailrc file should look like this:

 MAILDIR=$HOME/mail	# where mail lives
 DEFAULT=$MAILDIR/in	# default place for mail to go
 INCLUDERC=$HOME/recipe.rc	# location of Procmail recipe file

Now let’s create a very basic example Procmail recipe file. This will include just a couple of recipes. Each recipe consists of three elements: a flag line, a condition line, and an action line. The flag line, in the generic case, simply signals the beginning of a recipe, and looks like this:

:0:

It can also contain flags that tell Procmail to treat this recipe in a certain way.

The condition line contains the condition which must be met. If this condition is met, then the action on the action line is carried out, and the recipe terminates. If the condition is not met, then Procmail proceeds to the next rule. Condition lines start with a *, followed by a regular expression pattern, which Procmail tries to match. A condition line might look like this:

* ^(To|Cc).*adams@pote.com

For those of you who don’t know already, ^ denotes the beginning of a line. So this regular expression matches any email message that contains a line beginning with either “To” or “Cc” followed by “adams@pote.com”. In practice, this finds messages sent to me; i.e., not BCCed. The condition line is optional – if it is omitted, then the action line will be applied to every message. Alternately, a series of condition lines can be given in a recipe, so that the message has to meet all of them.
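If you want to see what that pattern catches before trusting it with real mail, you can try it outside Procmail. Here is a rough sketch in Python (not part of the mail setup itself). The sample headers are invented for illustration, and the ignore-case flag is there because Procmail ignores case in conditions unless told otherwise:

```python
import re

# re.MULTILINE lets ^ anchor at the start of every header line;
# re.IGNORECASE mirrors Procmail's default case-insensitive matching.
CONDITION = re.compile(r"^(To|Cc).*adams@pote.com", re.IGNORECASE | re.MULTILINE)


def addressed_to_me(headers: str) -> bool:
    """Return True if a line starting with To or Cc mentions the address."""
    return CONDITION.search(headers) is not None


# Hypothetical sample headers, made up for this example.
direct = "From: friend@example.org\nTo: adams@pote.com\nSubject: lunch\n"
blind = "From: bulk@example.net\nTo: undisclosed-recipients:;\nSubject: FREE\n"

print(addressed_to_me(direct))  # True
print(addressed_to_me(blind))   # False
```

Note that the unescaped dot in pote.com matches any character, in Procmail as in Python; writing pote\.com makes the pattern stricter.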

Finally, the action line, to be executed if the condition is met. This can be as simple as the name of a mail folder into which the message should be placed:

 To-Me

 

Let’s add a second rule, with no condition line, to place all other mail in a spam folder, and look at the whole thing. (Note: A filter like this, that weeds out BCCed messages, will block maybe 50 percent of spam. Not bad, but clearly we need to do better.)

:0:
* ^(To|Cc).*adams@pote.com
To-Me

:0:
Spam

This is the gist of what Procmail does, but it gets a fair bit more involved. It can forward mail, run other programs, nest instructions, assign scores to recipes, and more. The procmailex man page is extremely informative about such things. But now that we have a basic infrastructure for dealing with spam once we identify it, let’s take a look at some ways to do just that.

An Assassination Attempt

The principle of identifying spam is to look at what sets it apart from non-spam mail. The content of spam messages is typically rather distinctive (“Your FREE Vacation is READY!!!! dc7bs9”). Spam filters like SpamAssassin capitalize on the unique hallmarks of spam – things like lines of all-caps, mention of so-called spam law “H.R. 3113,” and so forth – to catch it. Using an extensive rule base (which makes for an interesting read in and of itself), SpamAssassin gives each message a numerical score based on telltale features of the headers and body text that make it distinctively look like spam or non-spam. A Procmail recipe can then be created to filter messages based on their score.
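To make the scoring idea concrete, here is a toy version in Python. It is only a sketch of the general technique: the four rules, their weights, and the 5.0 cutoff are all invented for illustration and are not taken from SpamAssassin’s actual rule base.

```python
import re

# Hypothetical rules and weights, made up for this sketch.
RULES = [
    ("ALL_CAPS_LINE", re.compile(r"^[A-Z0-9 !?.,']{10,}$", re.MULTILINE), 1.5),
    ("MENTIONS_HR_3113", re.compile(r"H\.R\. ?3113"), 2.5),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.0),
    ("FREE_IN_CAPS", re.compile(r"\bFREE\b"), 1.0),
]
THRESHOLD = 5.0  # assumed cutoff for this example


def score(message: str):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, weight) for name, pattern, weight in RULES
            if pattern.search(message)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


def looks_like_spam(message: str) -> bool:
    total, _ = score(message)
    return total >= THRESHOLD


# Invented sample messages.
spam = "YOUR FREE VACATION IS READY!!!!\nThis message complies with H.R. 3113.\n"
ham = "Hi Paul,\nAre we still on for lunch on Friday?\n"

print(score(spam), looks_like_spam(spam))
print(score(ham), looks_like_spam(ham))
```

The point of summing weights is that no single feature condemns a message; it takes several telltale signs together to push it over the threshold.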

To install SpamAssassin, it’s necessary to have Perl on your machine. Probably it’s there already; if not, it can be downloaded and installed from Perl.com. Perl includes a magical module, called CPAN, which allows shell access to the Comprehensive Perl Archive Network and hence makes it very easy to install Perl-based software like SpamAssassin. (Please pardon me if this is all old news to you. Skip to the next paragraph.) If you’ve never used CPAN before, you’ll need to configure it. As the superuser, type:

perl -MCPAN -e shell 

This will launch the CPAN shell. It will ask if you want to configure it manually. Say no, and answer any additional questions it may ask. You’ll probably need to choose a few servers: pick ones that are near you. When the configuration is done, you will see a prompt that looks like this:

cpan>

On to the install. At the CPAN shell prompt, type:

install Mail::SpamAssassin

Follow the prompts to download any prerequisites that may be required. When that installation is done, install Net::DNS, by typing:

install Net::DNS

Finally, type q to quit the CPAN shell. SpamAssassin has been installed! It should be located in the .cpan/build directory in your home directory. It’s a good idea to copy the SpamAssassin directory into /usr/share, where it can be more readily accessed:

 cd ~/.cpan/build
 cp -Ruv Mail-SpamAssassin-VERSION /usr/share

(Alternately, you can install SpamAssassin from an RPM package or source code, both available on the SpamAssassin site.)

SpamAssassin can run in the background as a daemon, and integrates nicely with Procmail. The daemon, spamd, is called by the client, spamc, whenever it is needed. This is a lighter, quicker way to use the tool than the default method of starting the entire program for each message. Here’s how to set it up:

Configure spamd to run at startup. In the SpamAssassin directory, there should be a subdirectory called spamd. This contains several different startup scripts, customized for various Unix setups. Use the most appropriate one of these for your system. If you are running Red Hat Linux, for example, copy the Red Hat startup script into /etc/rc.d/init.d, and run chkconfig:

 cd Mail-SpamAssassin-VERSION/spamd
 cp -Ruv redhat-rc-script.sh /etc/rc.d/init.d/spamd
 chkconfig --add spamd

The next time you reboot, spamd will start automatically. For now, you can start it with:

 /etc/rc.d/init.d/spamd start

 

Spamd is now running in the background. Now we have to configure Procmail to run messages through it, using the spamc client (which just passes the message to spamd). Open up your Procmail recipe file and add the following at the beginning:

:0fw
| /usr/bin/spamc

This rule runs spamc on all messages. This will add an “X-Spam-Status” header to each message indicating whether it is spam or not.

In the first line, you will notice the f flag, which specifies that the rule is a filter, and that Procmail should continue on to the next rule after this one runs – by default, Procmail stops processing once a recipe delivers the message. And the w flag tells Procmail to wait for spamc to finish running before it moves along to the next rule.

Here is the next rule:

:0:
* ^X-Spam-Status: Yes
Spam

This rule looks at the X-Spam-Status header and sorts those messages deemed spam into the Spam folder. And so we are done!

Except, not quite. SpamAssassin is good, but hardly perfect. The rules it sets out are finely tuned to catch a lot of spam, but rules are made to be broken. As you read this, spammers are modifying their strategies to sneak past SpamAssassin’s rules. If you want to stop all spam, further measures are necessary.
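For the curious, the header test in the second recipe of this section boils down to something like the following Python sketch. The sample messages and the details after “Yes” and “No” are invented; the only thing assumed about the real header is that its value starts with Yes when the message is judged spam.

```python
from email import message_from_string


def flagged_by_spamassassin(raw_message: str) -> bool:
    """True if the X-Spam-Status header value starts with 'Yes'."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    return status.strip().lower().startswith("yes")


# Hypothetical messages as they might look after passing through the filter.
tagged = "X-Spam-Status: Yes, hits=7.2 required=5.0\nSubject: FREE\n\nBuy now\n"
clean = "X-Spam-Status: No, hits=0.3 required=5.0\nSubject: lunch\n\nFriday?\n"
untouched = "Subject: lunch\n\nFriday?\n"

for sample in (tagged, clean, untouched):
    print(flagged_by_spamassassin(sample))
```

A message with no X-Spam-Status header at all is treated as not flagged, which is also what the Procmail condition does: it simply fails to match.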

Sharpen Those Razors

Another approach to spam-catching is distributed collaborative filtering. This involves a large network of participants. Whenever a participant receives a spam message, that message is given a unique checksum number, which is then propagated around the network. If the same spam is sent to other members of the network, it can then be recognized automatically and filtered out.
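In miniature, the idea looks something like this Python sketch. It is deliberately naive: the shared network is faked with an in-memory set, the class and function names are my own, and the checksum is a plain SHA-1 of the normalized body, which is an assumption for illustration rather than how any particular network computes its signatures.

```python
import hashlib


def checksum(body: str) -> str:
    """Digest of the body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SharedRegistry:
    """Stand-in for the network's shared list of reported checksums."""

    def __init__(self):
        self._reported = set()

    def report(self, body: str) -> None:
        """A participant reports a message as spam."""
        self._reported.add(checksum(body))

    def is_known_spam(self, body: str) -> bool:
        """Another participant checks a newly arrived message."""
        return checksum(body) in self._reported


registry = SharedRegistry()
registry.report("Your FREE Vacation is READY!!!!")

print(registry.is_known_spam("your free   vacation is ready!!!!\n"))
print(registry.is_known_spam("Are we still on for lunch on Friday?"))
```

An exact checksum like this one is easy to defeat: append a random string such as “dc7bs9” to each copy and every message hashes differently.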

There are a number of implementations of this concept. Vipul’s Razor is perhaps the best known, but DCC (Distributed Checksum Clearinghouse) is extremely effective, as is Pyzor. All of these integrate with SpamAssassin and complement it nicely, and I have had about equal success with each of them. I find, however, that the remote checking necessary for these networked methods slows down mail processing more than I would like. Instead of these methods, I use smart filters.

Filters That Learn

Thomas Bayes was born in 1702 in London. Despite the relatively primitive state of the Internet at that time, Bayes nevertheless had a very interesting theorem about spam. Well, perhaps it’s applicable to other things as well, but Paul Graham explains how it can be used in a spam filter.

In brief, a statistical filter evaluates every email message in terms of the individual words of which it is comprised. Each word has a score representing how likely it is to occur in a spam message, and another score representing how likely it is to occur in a non-spam message. Based on these scores, the filter is able to make extremely accurate guesses about whether a message is spam or not.

Where do the scores come from? This is the even-cleverer part. When the filter is installed, you can feed it a corpus of spam, and it will automatically categorize all the words it encounters therein. Likewise with a corpus of non-spam. Thenceforth, the filter can be set to compile statistics about each word it encounters, growing more and more knowledgeable about the sordid world of unsolicited email, and more and more sophisticated at separating bad from good. This type of filtering has several advantages over other methods: It is more accurate because it is customized to each user; it adapts itself to spammers’ changing strategies; and it has very few false-positive results.
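Here is a stripped-down sketch of such a filter in Python, to show how word counts turn into a verdict. It is a simplification for illustration, not the algorithm of any particular tool: the class name, the tokenizer, the 0.4 score for unseen words, and the 0.01/0.99 clamps are all assumptions of this example, and the training messages are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased set of distinct words in a message."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class WordFilter:
    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        """Count each distinct word once per message."""
        if is_spam:
            self.spam_words.update(tokens(text))
            self.spam_msgs += 1
        else:
            self.ham_words.update(tokens(text))
            self.ham_msgs += 1

    def word_probability(self, word):
        """Estimated chance that a message containing `word` is spam."""
        spam_rate = self.spam_words[word] / max(self.spam_msgs, 1)
        ham_rate = self.ham_words[word] / max(self.ham_msgs, 1)
        if spam_rate + ham_rate == 0:
            return 0.4  # assumed score for a never-seen word
        # Clamp so that no single word is absolutely decisive.
        return min(0.99, max(0.01, spam_rate / (spam_rate + ham_rate)))

    def spam_probability(self, text):
        """Combine word scores: prod(p) / (prod(p) + prod(1 - p))."""
        log_odds = sum(
            math.log(p) - math.log(1 - p)
            for p in map(self.word_probability, tokens(text))
        )
        log_odds = max(-700.0, min(700.0, log_odds))  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))


f = WordFilter()
f.train("Your FREE vacation is ready, click now", is_spam=True)
f.train("Cheap pills, click here now", is_spam=True)
f.train("Are we still on for lunch on Friday?", is_spam=False)
f.train("Meeting notes from Friday are attached", is_spam=False)

print(f.spam_probability("free pills now"))
print(f.spam_probability("lunch notes friday"))
```

With only four training messages the numbers are extreme; a real corpus of thousands of messages gives the filter far more nuanced word scores, which is why the size and variety of the training mail matter so much.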

Graham’s ideas, with a few key modifications, have been implemented numerous times. Email clients are shipping with built-in versions of this grand new method. Even SpamAssassin, as of this writing, is incorporating statistical filtering into its new version. In keeping with the classical Unix modular way of doing things, though, I like to use a separate tool for my statistical filtering: Bogofilter. I’m sure other tools do fine jobs as well – Bogofilter just happens to be the one I chose. It is fast, written in C, and easy to install and use. Most importantly, it fits into my mail chain very smoothly. Let’s download it together now.

Get the stable version from the project’s SourceForge page, as either an RPM package or source code. Logged in as the superuser, install the RPM package, or unpack the source and run ./configure ; make ; make install. You should now have an executable file at /usr/bin/bogofilter.

Our first step is to train it. Bogofilter is an apt pupil. It has a couple of important flags for training purposes: -n and -s. Running bogofilter -s tells it that the input consists of one or more spam messages. bogofilter -n specifies that the input is not spam.

Initially, we will train it on your archived mail. The other users on your system can use that corpus as a starting point, if you want, and add to it as they choose.

The larger and more varied the corpus fed to Bogofilter, the more accurate its filtering will be. cd to your mail directory. (You should no longer be root.) Find a nice folder full of non-spam mail you have received — say it’s called oldmail. Now feed it to Bogofilter like so (the folder won’t be changed, just perused):

  bogofilter -n < oldmail

  

Repeat this process with other folders of mail of various kinds. It’s important to give Bogofilter a taste of different flavors of mail you want, so if you have some folders devoted to different mailing lists, you probably want to feed those in as well.

Now the filter needs to take its first harsh taste of spam. If you have saved a folder full of spam – perhaps what was weeded out by another filter – you can feed that to it. Otherwise you may want to download and unzip a public corpus of spam, like the ones available from spamarchive.org. Either way, feed it to the filter like so:

 bogofilter -s < spamfolder

 

Repeat as desired. Now your filter is trained! Give it a biscuit. The data files that it creates are in your home directory, under the .bogofilter subdirectory.

Last-minute Touches

Finally, let’s integrate the filter into our mail processing line. Open up your Procmail recipe file again, and add the following to the beginning:

# pass messages through Bogofilter, updating the corpus each time, and logging
:0fw
| /usr/bin/bogofilter -u -e -p -l

# retry if bogofilter fails
:0e
{ EXITCODE=75 HOST }

# if Bogofilter says it's spam, spamcan it
:0:
* ^X-Bogosity: Yes
Spam

Now you have two spam filters working for you, and complementing each other. But let’s streamline the process a bit. With the above rules placed at the beginning of your recipe file, any spam that Bogofilter catches won’t be passed to SpamAssassin. This saves overhead. But we want Bogofilter to keep learning. Any time SpamAssassin catches a spam message that Bogofilter misses, let’s put the message in Bogofilter’s corpus of spam, so it will know better next time. Earlier, we created a rule in your Procmail recipe file that specifies what to do when SpamAssassin catches a piece of spam:

:0:
* ^X-Spam-Status: Yes
Spam

Let’s modify this as follows:

:0
* ^X-Spam-Status: Yes
  {
     :0cw
     | /usr/bin/bogofilter -Ns

     :0:
     Spam
  }

Now both of the instructions nested in braces are executed when SpamAssassin finds a match. First, it calls Bogofilter and trains it on the new message (telling it at the same time to undo its previous training that this message was non-spam); second, it spamcans the message. Keep that up, and you won’t need SpamAssassin at all! If you wish, you can sort spam caught by the two different filters into two different folders, so you can compare their relative aptitude. On my system, Bogofilter catches about 95 percent of spam, and SpamAssassin takes care of the rest. For reference, the whole Procmail recipe file looks like this:

# pass messages through Bogofilter, updating the corpus each time, and logging
:0fw
| /usr/bin/bogofilter -u -e -p -l

# retry if bogofilter fails
:0e
{ EXITCODE=75 HOST }

# if Bogofilter says it's spam, spamcan it
:0:
* ^X-Bogosity: Yes
Spam

# otherwise, run SpamAssassin
:0fw
| /usr/bin/spamc

# if spam, pass to Bogofilter and spamcan
:0
* ^X-Spam-Status: Yes
  {
     :0cw
     | /usr/bin/bogofilter -Ns

     :0:
     Spam
  }
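If the order of operations in that file is hard to follow, here is the same control flow sketched in Python. This is only a model of the logic, with the two filters replaced by stand-in functions that I made up for the example; the retry-on-failure recipe is left out, and “in” is the default folder from the .procmailrc shown earlier.

```python
def deliver(message, bogofilter_says_spam, spamassassin_says_spam, retrain_as_spam):
    """Return the folder the recipe file would choose for `message`."""
    if bogofilter_says_spam(message):
        return "Spam"                 # SpamAssassin is never consulted
    if spamassassin_says_spam(message):
        retrain_as_spam(message)      # stands in for `bogofilter -Ns`
        return "Spam"
    return "in"                       # falls through to $DEFAULT


# Stand-in filters, invented for illustration.
learned = set()
spamassassin_calls = []


def fake_bogofilter(message):
    return message in learned


def fake_spamassassin(message):
    spamassassin_calls.append(message)
    return "FREE" in message


junk = "Your FREE vacation is READY!!!!"
first = deliver(junk, fake_bogofilter, fake_spamassassin, learned.add)
second = deliver(junk, fake_bogofilter, fake_spamassassin, learned.add)
normal = deliver("Lunch on Friday?", fake_bogofilter, fake_spamassassin, learned.add)

print(first, second, normal, len(spamassassin_calls))
```

The first time the junk message arrives, the statistical filter misses it, SpamAssassin catches it, and the statistical filter is retrained. The second time, the statistical filter catches it on its own and SpamAssassin is never called.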

To equip the other users of your mail server with the same spam-fighting powers, simply copy into their home directories your .bogofilter directory, and give them the Procmail recipes that invoke the filters. Place the same in /etc/skel for new users, and you are all set. Running a similar system, I haven’t received an unfiltered spam message in several months.

But it’s an ongoing battle. As long as spammers roam the earth, they will spam us. But we will fight them. Yes. We. Will!