Microsoft Research Automates Hunt for Search Engine Spam
Researchers at Microsoft are working on an ambitious new project to hunt down and neutralize large-scale search engine spammers.
The Redmond, Wash., software giant's Cybersecurity and Systems Management Research Group has taken the wraps off Strider Search Defender, an experimental project that automates the discovery of search spammers through non-content analysis.
The project integrates technology from two previous Microsoft Research prototypes, Strider HoneyMonkey and Strider URL Tracer, and promises a new approach to removing junk results from search engine queries.
"The Web is so badly spammed, you can find a spam site on just about every search query," said Yi-Min Wang, the researcher heading up the project at Microsoft, in an interview with eWEEK. "We think this approach can pinpoint the big spammers and use their own tactic against them."
According to data from Automattic Kismet, a tool that helps bloggers thwart comment spammers, a whopping 93 percent of all blog comments are spam. With Strider Search Defender, Wang's team is taking a context-based approach that uses URL-redirection analysis to pinpoint spammers.
"For the spammers to be successful, they have to post millions of fake comments on message boards and blogs. That's the only way to get picked up by search engines. If we can find a way to pinpoint them before they get indexed by search engines, the problem is solved," Wang said.
"They want to be found by search engines, that's why they're spamming. Well, now we're finding you," he added.
The problem is tied to the use of spam blogs, or splogs, to earn money from pay-per-click advertising programs offered by Google, Yahoo and MSN. Content on fake blogs often contains text stolen from legitimate Web sites and includes an unusually high number of links to sites associated with the splog creator. The sole purpose is to boost the search engine rank of the affiliated sites and cash in on ad impressions from unsuspecting surfers.
During the early stages of the Microsoft research, Wang discovered that successful large-scale spammers create a huge number of "doorway pages" on reputable domains to trick search engine users into clicking on a fake site. It is well-known that Google's BlogSpot, Yahoo's GeoCities and AOL's Hometown services are all used by spammers to create doorway pages.
The doorway pages are then spammed to millions of forums, blog comments and archived newsgroups, pushing the pages up the search engine results for certain target keywords. A user clicking on a doorway-page link in search listings gets redirected to a target page controlled by the spammer or, in some cases, Wang explained, the browser is instructed to either redirect to or fetch ad listings operated by the spammer.
The Microsoft Research team is now proposing to treat each spam page as a dynamic program rather than a static page and use a "monkey program" to analyze the traffic resulting from visiting each page with an actual browser. "By identifying those domains that serve target pages for a large number of doorway pages, we can catch major spammers' domains together with all their doorway pages and doorway domains," Wang explained.
Strider Search Defender starts with a seed list of confirmed spam URLs and uses a homegrown tool called Spam Hunter to run link queries on search engines. This is an automated process that pinpoints the forums and guest books on which the known spam URLs were posted. On these pages, additional spam links are scraped to automatically generate a list of potential spam URLs.
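The seed-expansion step described above can be illustrated with a rough sketch. This is not Microsoft's Spam Hunter code; the function name `expand_seed_list` and the two callables it takes (`find_linking_pages`, standing in for a search engine link query, and `extract_links`, standing in for scraping a forum or guest book page) are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Set


def expand_seed_list(
    seeds: Iterable[str],
    find_linking_pages: Callable[[str], Iterable[str]],
    extract_links: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Return candidate spam URLs found next to known spam URLs.

    find_linking_pages(url): hypothetical stand-in for a search engine
        link query; yields pages (forums, guest books) that link to url.
    extract_links(page): hypothetical stand-in for a scraper; yields
        every outbound link posted on that page.
    """
    known = set(seeds)
    candidates: Set[str] = set()
    for spam_url in known:
        for page in find_linking_pages(spam_url):
            for link in extract_links(page):
                # Links that share a page with confirmed spam are only
                # suspects; they still need to be checked for false positives.
                if link not in known:
                    candidates.add(link)
    return candidates
```

The result is deliberately a list of suspects rather than confirmed spam, since a legitimate link can sit on the same forum page as a spammed one.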
To filter out false positives, Microsoft feeds the list of potential spam URLs to the Strider URL Tracer, a tool released earlier this year by Microsoft to help trademark owners find typo-squatting domains of their Web sites.
Using the URL Tracer, Wang's team can launch an actual browser to visit each URL and record all secondary URLs visited as a result. At the end of that automated scan, the researchers can figure out which target-page domains are associated with a large number of doorway-page URLs.
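A minimal sketch of that grouping step, assuming the browser traces are already available as a mapping from each doorway URL to the secondary URLs it caused the browser to visit. The function name `rank_target_domains`, the `min_doorways` threshold and the use of the full host name as the "domain" are assumptions made for illustration, not details of the URL Tracer itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def rank_target_domains(
    traces: Dict[str, Iterable[str]], min_doorways: int = 2
) -> List[Tuple[str, int]]:
    """Rank third-party hosts by how many doorway URLs lead to them.

    traces maps a doorway URL to the secondary URLs recorded while an
    actual browser visited it (assumed input format).
    """
    doorways_by_host = defaultdict(set)
    for doorway_url, secondary_urls in traces.items():
        doorway_host = urlsplit(doorway_url).hostname
        for url in secondary_urls:
            host = urlsplit(url).hostname
            # Ignore traffic to the doorway's own host (images, styles).
            if host and host != doorway_host:
                doorways_by_host[host].add(doorway_url)
    ranked = sorted(
        doorways_by_host.items(), key=lambda item: (-len(item[1]), item[0])
    )
    return [
        (host, len(doorways))
        for host, doorways in ranked
        if len(doorways) >= min_doorways
    ]
```

Hosts that appear behind many unrelated doorway pages float to the top of the list, which is the signal Wang describes for catching a large-scale spammer's target domains.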
In one scenario, Wang said the Spam Hunter collected more than 17,000 BlogSpot URLs and fed them into the URL Tracer. The group was able to identify the top 25 target-page domains that are behind the Google-hosted splogs. The top six are particularly active, Wang said, identifying them as s-e-arch.com, speedsearcher.net, abcsearcher.com, eash.info, paysefeed.net and veryfastsearch.com, which collectively were responsible for approximately 45 percent of the BlogSpot URLs.
Wang said the Strider Search Defender project has already helped to remove junk results from MSN Search. "The more widely spammed a URL is, the easier it is for the Spam Hunter to find it. Once a spammed forum is identified, it becomes a 'HoneyForum' that can be used to capture new spam URLs in new comment postings," he said. "Ideally, since there is a delay between spamming and its effect on search engine results, our spam hunter should be able to identify new spam URLs and notify the search engine before the URLs enter top search results."