Given that the new open-source approaches to spam filtering are capable of virtually eliminating unwanted e-mail and preserving the good stuff, why do many companies continue to struggle with spam?
Jonathan Zdziarski, the developer of the DSPAM open-source Bayesian spam blocker, believes IT departments of most small- to medium-sized businesses are afraid to try free programs or meet resistance from higher-level company executives.
“Most mid-sized companies just pull an appliance off the shelf,” he said, noting there are “a million anti-spam companies out there with boxes loaded with a hodgepodge” of solutions. “That’s one of the reasons these businesses, if you ask them, are convinced spam filtering is ineffective … A lot of these companies are running technology that’s five to seven years old.”
Some popular, commercially distributed solutions claim to employ Bayesian filters. When used alone, as in DSPAM and other similar programs, these filters use statistical analysis to yield incredibly accurate spam control.
However, Zdziarski, the author of a new book titled “Ending Spam,” asserts that, most of the time, the Bayesian filtering in these “hybrid” commercial products, if present at all, is rendered virtually ineffective because it filters only the mail that finds its way through the commercial programs’ outdated “heuristic” filtering layer.
Good mail, or “ham,” that was improperly deemed to be suspicious by the heuristic filter may never reach the Bayesian filter layer. This prevents the filter from learning what makes good mail “hammy” and it further increases the application’s error rate.
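The layering problem can be illustrated with a minimal sketch. Everything in it is hypothetical: the `heuristic_flags` rule, the `BayesianLayer` class and the sample messages are invented for illustration and do not represent any commercial product or DSPAM itself. The point is only that a message stopped by the first layer never reaches the second, so the statistical layer cannot learn from it.

```python
from collections import Counter


def heuristic_flags(message: str) -> bool:
    """Hypothetical static rule: treat any mention of 'free' as suspicious."""
    return "free" in message.lower()


class BayesianLayer:
    """Stand-in for a statistical layer; it only records the words it sees."""

    def __init__(self) -> None:
        self.seen = Counter()

    def observe(self, message: str) -> None:
        self.seen.update(message.lower().split())


def hybrid_pipeline(messages, bayes):
    """Run the heuristic layer first; only survivors reach the Bayesian layer."""
    quarantined = []
    for msg in messages:
        if heuristic_flags(msg):
            quarantined.append(msg)  # never reaches the statistical layer
            continue
        bayes.observe(msg)
    return quarantined


# Two legitimate messages; the first trips the hypothetical heuristic rule.
inbox = [
    "Free parking passes for the staff meeting",
    "Quarterly budget attached",
]
bayes = BayesianLayer()
quarantined = hybrid_pipeline(inbox, bayes)
```

In this toy run the parking message is quarantined by the heuristic rule, and the statistical layer never sees the word “parking” in a legitimate context.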
The ability to tell good mail from spam is one of the most touted attributes of open-source spam-blocking programs using Bayesian statistical filtering as suggested by Paul Graham in “A Plan for Spam.” Anyone who’s ever absent-mindedly deleted an important e-mail that was improperly routed to a spam bucket can relate.
“False positives are innocent e-mails that get mistakenly identified as spams,” wrote Graham in his paper three years ago. “For most users, missing legitimate e-mail is an order of magnitude worse than receiving spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.”
Zdziarski contends the Bayesian element mentioned on the boxes and ads for most commercial spam blockers “is more of a marketing term, really, than any type of real component of the solution.” That’s because a company selling an adaptive Bayesian spam filter would have a tough time staying in business.
“A true, adaptive solution is like fine wine,” Zdziarski said. “You can take a tool like DSPAM … install it in a system, stick it in a closet and let it do its job with just basic user input. Let it sit there and run, without upgrades, for a couple years. When you take a look at it again, it will be performing better than it did on Day One.”
Systems administrators and others facing a decision about e-mail filtering must weigh the cost of using commercial products against the fact that statistical language classification filters, while free, work best if users are trained to help them out.
Employees need to cooperate by “teaching” the programs the difference between spam and ham, a simple task that gets easier over time as the programs gain knowledge.
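What that “teaching” amounts to can be sketched in a few lines. The following is a simplified naive Bayes scorer with add-one smoothing, written for illustration only; it is not DSPAM’s algorithm, nor the exact formula from “A Plan for Spam,” and the class name, method names and sample messages are all assumptions.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy word-based filter that learns from messages labeled by users."""

    def __init__(self) -> None:
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Record which distinct words appeared in a message of this label."""
        self.tokens[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        """Combine smoothed per-word evidence into a score between 0 and 1."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))  # smoothed prior
        for tok in set(text.lower().split()):
            p_spam = (self.tokens["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayesFilter()
f.train("cheap pills buy now", "spam")
f.train("limited offer buy cheap", "spam")
f.train("meeting agenda for monday", "ham")
f.train("lunch on monday", "ham")

score_spam = f.spam_probability("buy cheap pills")
score_ham = f.spam_probability("monday meeting")

# A user correction: an unfamiliar word is scored, then taught as ham.
before = f.spam_probability("newsletter")
f.train("newsletter", "ham")
after = f.spam_probability("newsletter")
```

With this tiny training set the first message scores well toward spam and the second well toward ham, and the single correction pulls the score for “newsletter” down. Each label a user supplies shifts the word counts in the same way, which is why the chore shrinks as the filter accumulates data.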
IT people must also determine the company’s tolerance level for spam. Maybe 95 percent accuracy is good enough, even though it means up to five errors per hundred e-mails, or 10 times more than would pass through a good statistical filter.
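The arithmetic behind that comparison is easy to check. In the sketch below, the 95 percent figure and the factor of 10 come from the text above; the daily volume of 2,000 messages is a made-up number used only to show the scale.

```python
def errors_per_hundred(accuracy_percent: float) -> float:
    """Misclassified messages per 100 at a given accuracy."""
    return 100.0 - accuracy_percent


baseline_errors = errors_per_hundred(95.0)       # errors per 100 at 95 percent
statistical_errors = baseline_errors / 10        # "10 times" fewer errors
implied_accuracy = 100.0 - statistical_errors    # accuracy that figure implies

# Hypothetical mail volume, chosen only for illustration.
daily_messages = 2000
baseline_daily = daily_messages * baseline_errors / 100
statistical_daily = daily_messages * statistical_errors / 100
```

Taken at face value, the comparison implies a good statistical filter running at about 99.5 percent accuracy, or roughly one error per 200 messages.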
Zdziarski says he’s sometimes unnerved by his filter’s uncanny accuracy. For most systems administrators, the thought of employees opening spam containing viruses is scarier still, almost as bad as accidentally deleting that important e-mail from the CEO.