December 16, 2010
I launched a new WordPress blog for a client a few months ago and have been watching as comment spammers find the site and do what they do best, leave spam. As the first few spam comments started coming in I would flag them then manually copy/paste the source IPs and URLs to the built-in WordPress “Comment Blacklist“.
Over time I obviously started seeing some duplicates and I noticed bursts of spam from certain IPs or linking to specific spam URLs. I was curious to see what these top IPs and URLs were, as they would seem the best candidates to include in the blacklist.
So, let’s dig around in the database a bit…
Top Spam IPs
Here is some SQL to count and sort the top spammer source IPs:
SELECT comment_author_IP, COUNT( comment_author_IP ) AS spam_count FROM wp_comments WHERE comment_approved = "spam" GROUP BY comment_author_IP HAVING spam_count > 1 ORDER BY spam_count DESC
This selects all the comments from the WordPress comments table that are flagged as “spam”, then groups them by source IP (comment_author_IP) so they can be counted and sorted. Then, only show those IPs that have spammed more than once, with the highest counts listed first.
Here are the first few lines of output from that call (spammer IPs redacted to protect the guilty)
213.5.69.x 99 213.5.71.x 85 173.234.94.x 54 173.234.211.x 37 64.186.155.x 31 ...
Top Spam URLs
Here is some SQL to count and sort the top spammer author URLs
SELECT comment_author_url, COUNT( comment_author_url ) AS spam_count FROM wp_comments WHERE comment_approved = "spam" AND DATE_SUB( CURDATE( ) , INTERVAL 30 DAY ) <= comment_date GROUP BY comment_author_url HAVING spam_count > 1 ORDER BY spam_count DESC
This is similar to the previous IP counting SQL, except now we are focusing on the source URL (comment_author_url). Additionally, on line 4, we are just counting spam from the last 30 days.
Again, the first few lines of output:
http://data[x]site.com 33 http://garden[x].net/ 20 http://www.[x]philadelphia.net 17 http://www.seo[x].com 16 http://www.webdesign[x].com 16 ...
As you can see, there are definitely some repeat offenders, which I put straight on the blacklist.
The Arms Race
Although this simple technique has worked well to catch large amounts of spam on the new blog, it’s just an exercise in curiosity as any manual process of dealing with spam won’t be sustainable. The proper solution is still a combination of collaborative filtering services like Askimet, Project Honey Pot and Stop Forum Spam with spam fighting tools like Bad Behavior and Conditional CAPTCHA.
* Note: The SQL used in these examples is intentionally a bit verbose (i.e. not tuned for performance) for readability.