Protecting a website against spammers and robots

This post was originally going to be a write-up of what I did with CallamMcMillan.com to stop comment spam, followed by another article on how I dealt with a robots problem on the article voting system. After re-reading it, though, I decided to explain the problem a bit better and make this article informative enough that you can apply the same approach to your own websites.

If you’re like me then you’ll enjoy getting feedback on your work. The feedback may not always agree with your point of view, or it may suggest that your technical solutions are lacking, but at least it shows that somebody has taken the time to stop, read and respond to your hard work. On the other hand, it’s incredibly frustrating to find a poorly written comment, generated by a robot and full of links to dubious websites pushing equally dubious products. And when you see dozens of these low-quality comments spammed across your website, it makes you want to take a crowbar to the heads of those responsible for them.

As the saying goes, the only way to win the game is to not play it. I’ve tried a variety of methods to stop comment spam over the past seven years. Registration is cumbersome and doesn’t really do anything other than slow down the spammers. Captchas frustrate regular users while doing little to slow down the advanced OCR systems used by the spammers. This time around, I have taken a different approach. Today, anybody can come and post a comment. The thing is, it has to get past me first. This may sound like even more effort, but it is relatively simple.

Every time a comment is made, the name, email address and IP address are checked against a list managed by a group called Stop Forum Spam. This tells me whether any of these three key attributes have been flagged on other forums as sources of spam. When a user posts a comment, it’s entered into the database and scored against the Stop Forum Spam system, and a flag is set which stops the post from being displayed prior to moderation. The result of the scoring is shown in the moderation queue to help me decide whether to allow the comment. While the comment is awaiting moderation, none of the details about who posted it are displayed on the site, so there is little value in trying to embed information in the username.
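To make the lookup step concrete, here is a minimal sketch in Python of querying Stop Forum Spam’s public JSON API for the three attributes. The endpoint and response fields follow their published API as I understand it; the exact check used on this site isn’t shown in this post, so treat the field handling and any thresholds as illustrative.

    # Minimal sketch of a Stop Forum Spam lookup via their public JSON API.
    # Endpoint and field names follow the published API; the way the result
    # is interpreted here is illustrative, not this site's actual code.
    import json
    import urllib.parse
    import urllib.request

    API_URL = "https://api.stopforumspam.org/api"

    def check_commenter(username, email, ip):
        """Return {attribute: frequency} for any attribute SFS has seen before."""
        params = urllib.parse.urlencode({
            "username": username,
            "email": email,
            "ip": ip,
            "json": "",          # request a JSON response instead of XML
        })
        with urllib.request.urlopen(f"{API_URL}?{params}", timeout=5) as resp:
            data = json.loads(resp.read().decode("utf-8"))

        flagged = {}
        for field in ("username", "email", "ip"):
            entry = data.get(field, {})
            if entry.get("appears"):
                flagged[field] = int(entry.get("frequency", 0))
        return flagged

    # Example: record the result so it can be shown in the moderation queue.
    flags = check_commenter("BuyCheapStuff", "bot@example.com", "203.0.113.7")
    if flags:
        print("Flagged attributes:", flags)

A hit on any attribute doesn’t have to reject the comment outright; it simply gives the moderator (or an automatic scorer) something to weigh.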

This, for the time being, takes care of the problem. My moderation queue is light enough that I can clear it easily a couple of times per day. Should the quantity of comments requiring moderation increase further, I have an automatic scoring system that applies a weighting to various spam factors. Should a comment score too low, it will be auto-hidden. This further reduces the size of the moderation queue and only occasionally requires a check of the auto-hidden queue for false positives.
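The post doesn’t list the individual factors or weights, so the following is only a rough sketch of how such a weighted score might be combined. The factor names, weight values and auto-hide threshold are invented for illustration.

    # Rough sketch of a weighted spam score. The factors, weights and the
    # auto-hide threshold are placeholders, not the values used on this site.
    WEIGHTS = {
        "sfs_username_flagged": -40,   # Stop Forum Spam hit on the username
        "sfs_email_flagged":    -50,   # Stop Forum Spam hit on the email
        "sfs_ip_flagged":       -30,   # Stop Forum Spam hit on the IP address
        "contains_links":       -15,   # links in the comment body
        "previous_approved":    +25,   # commenter already has an approved comment
    }

    AUTO_HIDE_THRESHOLD = -60          # at or below this, the comment is auto-hidden

    def score_comment(factors):
        """Sum the weights of every factor that applies to this comment."""
        return sum(WEIGHTS[name] for name, present in factors.items() if present)

    factors = {
        "sfs_username_flagged": False,
        "sfs_email_flagged": True,
        "sfs_ip_flagged": True,
        "contains_links": True,
        "previous_approved": False,
    }

    score = score_comment(factors)
    if score <= AUTO_HIDE_THRESHOLD:
        print(f"Score {score}: auto-hidden, check later for false positives")
    else:
        print(f"Score {score}: held in the normal moderation queue")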

Another problem faced by a website is robots following links despite the rel="nofollow" attribute. A little research suggests that all of the major search engine bots will follow links whether a nofollow attribute is set or not. A nofollow should therefore be accompanied by an exclusion line in the robots.txt file.
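As a concrete illustration, a voting link might carry the nofollow hint on the link itself and also sit under a path that robots.txt excludes. The /vote/ path below is a hypothetical example of a voting URL structure, not the actual one used on this site.

    <!-- The nofollow attribute on the link itself (a hint, not a rule) -->
    <a href="/vote/42/up" rel="nofollow">Vote up</a>

    # robots.txt: explicitly exclude voting URLs from well-behaved crawlers
    User-agent: *
    Disallow: /vote/

The robots.txt rule only stops crawlers that choose to honour it, so server-side checks on the voting system itself are still worthwhile.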
