How I set up my spam filtering

It’s true that the initial configuration of SpamAssassin is not for the faint of heart: it involves Unix shell access over SSH, procmail, crontab, and so on. Do it right and it rocks. But make one mistake, even a typo, and you may end up deleting all incoming mail rather than just filtering it.

But once that technical hurdle is passed, the solution is both simple and effective. Of the 1000 or so mails filtered so far, none were erroneously marked as spam. Some spam gets through, but less every day as the filter learns and adapts.

This is how it works:

Step 1: The server analyzes every incoming mail automatically and immediately, word for word and including the headers, and calculates a spam probability based on all earlier messages received. This probability is averaged against previous scores from the same sender, so that even an unusual message from a known sender is treated correctly. Good mail goes to the INBOX, bad mail to the SPAM folder.

Step 2: If I find a spammy message in my inbox, I move it to the SPAM folder.

Step 3: Occasionally I check the SPAM and SPAM-archive folders for any misplaced non-spam and move it to a HAM folder. (None has been misplaced so far.)

Step 4: Once daily, the server looks in the HAM and SPAM folders and learns that messages there are to be treated as HAM or SPAM, respectively. Then the messages are moved to HAM-archive and SPAM-archive folders so that nothing is lost.
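
To make Step 4 concrete, here is a minimal Python sketch of what such a daily learn-and-archive pass might look like. It is not the actual server setup described above: it assumes each folder is a plain directory holding one message per file, and the `learn` callback is a hypothetical stand-in for whatever trains the filter (in a SpamAssassin setup this would typically be a call to `sa-learn --ham` or `sa-learn --spam`).

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def daily_pass(mail_root: Path, learn: Callable[[Path, str], None]) -> Dict[str, int]:
    """Train on everything in HAM and SPAM, then archive it so nothing is lost.

    Assumptions (not from the original post): folders are plain directories
    with one message per file, and file names are unique within each archive.
    """
    counts = {}
    for label in ("HAM", "SPAM"):
        folder = mail_root / label
        archive = mail_root / f"{label}-archive"
        archive.mkdir(parents=True, exist_ok=True)

        # Snapshot the folder first so that moving files does not disturb iteration.
        messages = sorted(p for p in folder.iterdir() if p.is_file()) if folder.is_dir() else []

        for msg in messages:
            learn(msg, label.lower())          # teach the filter: "ham" or "spam"
            shutil.move(str(msg), str(archive / msg.name))  # then archive the message

        counts[label] = len(messages)
    return counts
```

Run once a day (for example from crontab), a script like this would empty the HAM and SPAM folders into their archives after every training pass, which is why the folders from Step 2 and Step 3 only ever contain corrections made since the previous run.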

That last step is where the real magic happens, because the analysis is so powerful that it can later recognize the same patterns in completely different messages. That’s the reason why less and less spam gets into my inbox: the filter adapts and learns all the time – and that’s what makes this solution much more powerful than simple rule-based spam filters.
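
The idea behind this kind of learning can be illustrated with a toy word-based Bayesian classifier. The Python sketch below is a simplified illustration under my own assumptions, not SpamAssassin's actual algorithm: the class `TinyBayes`, the tokenizer, the add-one smoothing, and the equal-weight sender averaging in `blend_with_sender` are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes filter: counts in how many ham/spam messages each word appears."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercased set of words; each word counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def learn(self, text, label):
        self.messages[label] += 1
        self.tokens[label].update(self.tokenize(text))

    def spam_probability(self, text):
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Start from the prior odds, then add the evidence of each word (in log space).
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (n_spam + 2)  # add-one smoothing
            p_ham = (self.tokens["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # clamp to avoid overflow in exp()
        return 1 / (1 + math.exp(-log_odds))


def blend_with_sender(score, sender_history):
    """Average the new score with the sender's mean past score (assumed 50/50 weighting)."""
    if not sender_history:
        return score
    return (score + sum(sender_history) / len(sender_history)) / 2


f = TinyBayes()
f.learn("cheap pills buy now", "spam")
f.learn("buy cheap watches now", "spam")
f.learn("meeting notes for monday", "ham")
f.learn("lunch on monday?", "ham")

# A message the filter has never seen, built from words it has learned:
print(f.spam_probability("cheap watches and pills") > 0.5)   # True
print(f.spam_probability("monday meeting") < 0.5)            # True
```

The point of the example is that the filter never stores whole messages as rules. It only keeps per-word statistics, so a new message that merely shares vocabulary with earlier spam is still pushed towards the SPAM folder, and every training pass shifts those statistics a little further.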

2 Responses to “How I set up my spam filtering”

  1. Torben says:

    Two weeks and 1000+ spams later: Still going strong!

    I thought I’d post a little update now that two weeks have passed since I implemented this new filter. As of now, the filter has caught well over a thousand spams, and made a handful of mistakes to either side – but it’s clearly learning.

    Spam: In the first days, lots of spams weren’t caught by the filter. I manually moved these “false negatives” to the spam folder so that the filter could learn them on its next daily pass, and this training works well: I can see in the spam archive that I get lots of identical or nearly identical spam, but once the filter has been trained on one of them, subsequent copies are caught properly. The automatic filter also improves constantly on its own; it learns based on what it already knows.

    Today, I get nearly zero spams in my inbox. More importantly, I get less spam in my inbox now than I did before I began this filtering, when I just relied on the automatic junk filters of the hosting provider.

    Ham: During the learning process, I knew there would be lots of “false positives”: regular mail that is falsely marked as spam. So I have checked the spam folder and spam archive at least daily to fix any filtering mistakes. The filter made many mistakes, especially with short messages and newsletters. I moved these ham messages to my ham folder so that the filter could learn them on its next daily pass, just like I did for spam.

    Today, the filter barely catches any hams at all, but I do not trust it yet! If I were to extrapolate an accuracy based on current values, the rating would be horrible: around 0.5% false positives, or 1 ham caught in every 200 mails. That number should be fifty times smaller, more like 0.01% (one ham in 10,000 messages). Obviously I’m not there yet.

    A whitelist could be a huge step towards that goal. I am not currently using any whitelist to force certain senders to pass the filter, but I am considering adding the senders behind the most glaring mistakes to a whitelist for a while. Over time, the filter will learn that these are ham too, and the whitelist entries should no longer be necessary, though there would be no reason to remove them then.

    Summary: The filter still learns about 25 hams and about 100 spams every day, so it is still improving. But it is already much better than the filters I had before, so I declare this a limited success. Limited, because there has been a considerable training effort, but I’m expecting this to diminish as the filter makes ever fewer mistakes.

    I am only two weeks into this experiment and the results look promising already. Let’s give it another month and then take another look. I expect to be pleased!

  2. [...] other mail application, and I can access it anywhere, and it has even better spam filtering than my own Bayesian filter (which is pretty good, in fact) on my (not at all good) webmail. So Gmail is for [...]
