Open Source Filtering Solutions and the Spam Problem

Let us face it, modern e-mail communication relying on SMTP is fundamentally broken – there is no sender authentication. There are lot of countermeasures in form of filtering and add-on authentication, but neither of them are proved to be 100% successful (that is 100% hit ratio with 0% of false positives). Spammers always find new ways of confusing filters with random noise, bad grammar, hidden HTML code, padding, bitmap-rendered messages etc. Our world is becoming an overloaded and unusable mailbox of spam. This article will nevertheless try to cover some of the spam problems and possible solutions, but bare in mind that all of these are just no more than a temporary fix.

Product spam, financial spam, frauds, scams, phishing, health spam, Internet spam, adult spam, political spam, you-name-it spam. Despite Bill Gates’ brave promise in 2004 (“Two years from now, spam will be solved”) e-mail spam has significantly increased worldwide in the last two years in both volume and size, making over 70% of total e-mail traffic. According to the First MAAWG Global Spam Report from Q1 2006, around 80% of incoming e-mail was detected as abusive. A bit later in Q3 2006 various Internet service providers in the world have reported an alarming increase of unsolicited e-mail in a very short period due to the range of new spamming techniques involved. At the end of 2006 an estimated number of the world’s total spam is around 85 billion messages per day (obviously this number is rather approximate) – and it is exponentially increasing. We all know how much it is going to cost (quick spam calculator).

Spammers have undoubtedly adapted and evolved: up to now they used a single IP setup for delivering their unwanted e-mail, usually hopping from one dialup to another. They have used open proxies, open mail relays and other similar easy-to-track sources. Unfortunately, it has changed – current spamming methods now include huge networks (called botnets) of zombie-computers used for distributed spam delivery and Denial of Service attacks. Various new viruses and worms are targeting user computers, making them eventually into huge spam clusters. Not only Microsoft Windows PCs are hacked, more and more Unix and Linux servers are affected too. And it is not for the fame and the glory, but to enable crackers to install and run scripts for the remote controlled spamming. In the meantime, nobody knows how many spambots are currently harvesting the Web in search of new e-mail addresses, their new victims. There is nothing sophisticated in their attacks, only brute force and numbers. Spammers earn a living by making and delivering spam and they do it darn well.

Reality check, 123
Due to the troublesome nature of the Internet today, the spammers and the script kiddies can easily put an anti-spam provider out of a job by simply DDoSing them to death (and doing a lot of collateral damage) – and exactly it happened to Blue Security with their successful but quite controversial Blue Frog service. A person known as PharmaMaster took their campaign as open war declaration and wiped them off from the face of the Internet within a single day. Lessons have been learned: the spammers are to be taken seriously and it seems they cannot be dealt by a single uniform blow nor with a single anti-spam provider.

What can we do about spam? There are numerous commercial solutions against unsolicited e-mail (SurfControl, Websense, Brightmail, IronPort, etc.) and some of them are rather expensive. Depending on the available budget, requirements and resources at hand, an Open Source solution could be substantially cheaper and possibly equally effective as the commercial counterpart. There is a whole range of readily available Open Source solutions for each of the popular anti-spam techniques for e-mail receivers. Some of them are in the core of the even most advanced commercial solutions. As most of the readers probably know, anti-spam solutions are most effective when different methods are combined together, forming several layers of analysis and filtering. Let us name a few of the most popular.

DNS blacklisting is a simple and cheap way of filtering the remote MTA (Mail Transfer Agent) peers. For every remote peer the SMTP service will reverse its IP and check the forward (“A”) record in the BL domain of DNSBL system. The advantage of the method is in its low processing overhead: checking is usually done in the initial SMTP session and unsolicited e-mail never hits the incoming queue. Due to the spam-zombie attacks coming from the hundreds of thousands of fresh IP address every day, this method is today significantly less effective today than it used to be and no more than 40% of total inbound spam can be filtered using this method. There are a lot of free DNSBL services in world, but it is probably best to use well known and reliable providers (and there are even supscription-based DNSBL services) which do not enlist half of the Internet overnight. Some of most widely recognized are Spamhaus and SpamCop, for instance. Almost all FLOSS (Free/Libre/Open-Source Software) SMTP daemons have full RBL support and so does Postfix, Exim, Sendmail, etc. For the SMTP services which do not support DNSBL out of the box it is possible to use DNSBL tests in SpamAssassin, but that usually means no session-time checking. Another variant which Spfilter uses is to store a several DNSBL exports in the form of local blacklists for faster processing. Of course, such a database needs to be synchronized manually from time to time, preferably on a daily basis.

The greylisting method is a recent but fairly popular method which slightly delays an e-mail delivery from any unknown SMTP peer. A server with the greylisting enabled tracks the triplets of the information for every e-mail received: the IP address of every MTA peer, the envelope sender address and the envelope recipient address. When a new e-mail has been received, the triplet gets extracted and compared with a local greylisting database. For every yet unseen triplet the MTA will reject the remote peer with a temporary SMTP failure error and log it into a local database. According to the SMTP RFC, every legitimate SMTP peer should try to reconnect after a while and try to redeliver the failed messages. This method usually requires minimum time to configure and has rather low resource requirements. As a side benefit it rate-limits the incoming SMTP flow from the unknown sources, lowering the cumulative load on the SMTP server.

There are still some mis-configured SMTP servers which actually do not retry the delivery since they interpret the temporary SMTP failure as a permanent error. Secondly, the impact of the initial greylisting of all new e-mail is substantial for an any company that treats e-mail communication as the realtime-alike service, since all of the initial e-mail correspondence will be delayed at least 300 seconds or more, depending on the SMTP retry configuration of the remote MTA peers. Finally, the greylisting does not do any good to the big SMTP providers which have large pools of mail exchangers (ie. more than /24). The problems can be fixed by whitelisting manually each and every of domains or network blocks affected. Regarding the software which does the greylisting almost every Open Source MTA has several greylisting implementations available: Emserver, Postgrey, Milter-greylist, etc.

Sender verify callout
SMTP callback verification or the sender verify callout is a simple way of checking whether the sender address found in the envelope is a really deliverable address or not. Unfortunately, verification probes are usually blocked by the remote ISP if they happen too often. Further, a remote MTA does not have to reject the unknown destinations (ie. Qmail MTA usually responds with “252 send some mail, i’ll try my best”). To conclude: it is best to do verification per known spammer source domains which can be easily extracted from results of the other methods (such as the content analysis). The sender verification is supported in most FLOSS MTA: Postfix, Exim, Sendmail (via milter plugin), etc.

Content analysis
The content-based filtering is probably the core of most anti-spam filters available. It usually consists of several subtypes, so let us state a few. Static filtering is a type which triggers e-mail rejection on special patterns (“bad” words and phrases, regular expressions, blacklisted URI, “evil” numbers and similar) typically found in the e-mail headers or a body of an e-mail itself. False positives are quite possible with this method, so this type is best used in conjunction with policy-based systems (often named as heuristic filters) such as SpamAssassin and Policyd-weight. Such filters use the weighted results of several tests, typically hundreds of, to calculate a total score and decide if the e-mail is a spam or a ham. In this way, a failure in a single test does not necessarily decide the fate of an e-mail. At least several tests have to indicate a found spam content to accumulate the spam score enough for an e-mail to be flagged as a spam, so this results in a more reliable system. Of course, weighted/scoring type of a filter can contain all of the other filter types for its scoring methods.

The next type of the content analysis is the statistical filtering which mostly uses the naive Bayesian classifer for the frequency analysis of word occurrences in an e-mail. Such filtering, depending on an implementation, requires the initial training on an already presorted content and some retraining (albeit in much smaller scale) later on to obtain a maximum efficiency. The Bayesian filtering is surprisingly efficient and robust in all real life examples. It is implemented in the very popular SpamAssassin and DSPAM solutions, as well as Bogofilter, SpamBayes, POPFile and even in user e-mail clients such as Mozilla Thunderbird. Some implementations such as SpamAssassin use an output of other spam filtering methods for a retraining which gradually improves the hit/miss ratio. Most of the implementations (DSPAM, SpamAssassin) have a Web interface which allows a per-user view of the quarantined e-mail as well as the per e-mail retraining. It improves the quality of either the global dictionary (a database of learned tokens) or the individual per user dictionaries. DSPAM, for an instance, supports a whole range of additional features such as: combining of extracted tokens together to obtain a better accuracy, tunable classifiers, the initial training sedation, the automatic whitelisting, etc.

Another popular solution is CRM114 which is a superior classification system featuring 6 different classificators. It uses Sparse Binary Polynomial Hashing with Bayesian Chain Rule evaluation with full Bayesian matching and Markov weighting. CRM114 is both the classifier and a language. DSPAM and CRM114 are currently the two most popular and most advanced solutions in this field, and they are easily plugged into most SMTP services.

Note that plain Bayesian filters can be fooled with quite common Bayesian White Noise attacks which usually look like random nonsensical words (also known as a hashbuster) in a form of a simple poem. Such words are randomly chosen by a spammer mailer software to reflect a personal e-mail correspondence and therefore thwart the classifier. Most of the modern content analysis filters do detect such attacks – and so does SpamAssassin and DSPAM.

Checksum-based filtering
A small but significant amount of unsolicited e-mail is the same for every recipient. A checksum-based filter strips all usual varying parts of an e-mail and calculates a checksum from the leftovers. Such a checksum is then compared to a collaborative or distributed database of the all known spam checksums. Unfortunately, spammers usually insert various poisoning content (already mentioned hashbusters) unique to an every e-mail. It causes the checksums to change and an e-mail is not recognized as known spam any more. Two of the most popular services for this typeare Distributed Checksum Clearinghouse and Vipul’s Razor which both have their own software and they are both supported in third-party spam-filtering software such as SpamAssassin.

E-mail authentication
Finally, we are left with several methods of the authentication that basically try to ensure the identity of a remote sender via some kind of an automated process. The identification makes it possible to reject all of the e-mail from the known spam sources, to negatively score or even to deny an e-mail with the identified sender forgeries and to whitelist an e-mail which is valid and comes from the known reputable domains. This method should minimize the possibility of the false positives because a valid e-mail should get higher positive scores (used for the policy filter) right from the start or even completely bypass the spam filters – which can be made more sensitive in return. There are several similar autentication mechanisms available: SPF (Sender Policy Framework), CSV (Certified Server Validation), SenderID and DomainKeys. They are mostly available as third-party plugins for most popular OSS MTA, usually in the form of Perl scripts available at CPAN. Unfortunately, neither of them is a solution recognized widely enough to be used in an every SMTP service in the world.

DomainKeys and enhanced DKIM (DomainKeys Identified Mail) protocol use a digital signature to authenticate the domain name of sender as well as content of a message. By using a sender domain name and the received headers, a receiving MTA can obtain the public key of a such domain through simple DNS queries and validate the signature of the received message. A success proves that the e-mail has not been tampered with as well as that the sender domain has not been forged.

SPF comes in form of the DNS TXT entries in each SPF-enabled Internet domain. These records can be used to authorize any e-mail in transit from a such domain. SPF records publish the policy of how to handle the e-mail forgeries or the successful validation as well as the list of possible e-mail originating addresses. If none of those match the sender address in received e-mail, the e-mail is probably forged and the receiver can decide on the future of such e-mail depending on SPF qualifiers (SOFTFAIL, FAIL, NEUTRAL, PASS) from a SPF policy. The problem is that SPF breaks e-mail forwarding to the other valid e-mail accounts if the domain administrator decides to use SPF FAIL policy (hard fail), although in the future SRS (Sender Rewriting Scheme) could eventually help it.

SenderID is a crossover between SPF and Caller ID with some serious standardization issues and does not work well with mailing lists (necessary Sender or Resent-Sender headers). CSV is about verifying the SMTP HELO identity of the remote MTA by using simple DNS queries to check if the domain in question is permitted to have a remote IP from the current SMTP session and if it has got a good reputation in a reputable Accreditation Service.