Jim over at ReveNews has a post about a new product, Botsense, currently in beta, that is designed to stop rogue bots. It sounds like an interesting product that, unfortunately, will be needed more and more given the way content theft is going these days.
Has anyone used it? I’d be interested to hear your experiences.
This tool does not sound too effective. I am the author of “Programming Spiders, Bots and Aggregators in Java”. It relies on the User-Agent header in the HTTP request. For bots that identify themselves honestly, yes, you can block them. But a custom-built bot can easily identify itself as Internet Explorer, and many of the commercial bot packages let you pick any “agent” string you want, so you can look just like IE. (A quick sketch of how trivial that is follows below.)
And not all bots are bad. The ones I’ve written always clearly identify themselves and are used for tasks like link checking, submitting to search engines, aggregating multiple email accounts into one, scanning RSS feeds, and so on.
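To show how trivial the masquerade is, here is a rough PHP/cURL sketch; the URL and the exact IE string are just placeholders, not anything Botsense actually looks for:

    <?php
    // A minimal bot that claims to be Internet Explorer.
    // Any filter that trusts the User-Agent header will wave it through.
    $ch = curl_init('http://example.com/some-page.html'); // placeholder URL
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'); // pretend to be IE
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string
    $html = curl_exec($ch);
    curl_close($ch);
    // $html now holds the page, fetched under a borrowed identity.
    ?>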
Jeff
I’d probably feel more confident if they could spell “Rogue” properly in their press release title…
There’s not much to it. It’s basically a script that autogenerates an .htaccess file to block the user agents you specify (including “random string” bots). You can certainly do the same thing by hand if you have a list of bots you want to block; it would just take 15-20 minutes instead of 3 or 4. Honestly, I like the fact that it targets those programs that allow users to download your entire site – minus advertising, of course – and browse it offline. Now, if someone will just write a rewrite rule to block anyone using adblocker plugins….
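If you want to roll your own, the hand-written version is just a few mod_rewrite lines. Something along these lines; this is my own guess at the sort of rules it writes, not BotSense’s actual output, and the bot names are just well-known examples:

    # Requires Apache's mod_rewrite
    RewriteEngine On
    # A few well-known offline downloaders / harvesters, case-insensitive...
    RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|WebZIP|EmailSiphon) [NC,OR]
    # ...plus anything that sends no user agent at all
    RewriteCond %{HTTP_USER_AGENT} ^$
    # Refuse the request with a 403
    RewriteRule .* - [F,L]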
Jeff is right in this case, but it is still a handy tool. I talked to the developer, who tells me this is phase one and that later phases will address that problem.
For someone like me it is useful. I am going to give it a whirl on my site (since it’s free) because I am tired of email harvesters, content scrapers and the like. I don’t have the time or the knowledge to keep up with these rogues, so I welcome the service. Time will tell whether it is effective, and I am looking forward to the next phases of the program. Maybe I can get the developer to make some comments here.
Lenny
Hm, as the apparent resident expert on bots like this, I’ll say that I already did this months ago, in PHP! Which means it drops in nicely as a WordPress plugin.
Unlike .htaccess solutions, which aren’t very flexible, in PHP I can look at whether a bot is trying to pretend to be MSIE, and if I catch it in the act, goodbye! (There’s a rough sketch of that kind of check at the end of this comment.)
Not only that, it runs on WordPress, MediaWiki, Drupal, Pixelpost, Geeklog, and several other platforms that I’ve forgotten.
It’s meant to block link spam, but also takes care of those common email address harvesters.
What it’s NOT meant to do is prevent someone from looking at your site, which means it doesn’t prevent people from using tools like wget to download your pages. You could easily add that in, of course.
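For the curious, the general idea looks something like the toy check below. This is only an illustration of one such test, not the plugin’s actual logic; the heuristic it leans on is that a normal copy of MSIE sends an Accept header, so a “browser” that claims to be MSIE but omits it is probably a bot:

    <?php
    // Toy check: a client claiming to be MSIE that sends no Accept header
    // is almost certainly a bot wearing a costume. The real plugin's
    // checks are considerably more involved than this single test.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (strpos($ua, 'MSIE') !== false && empty($_SERVER['HTTP_ACCEPT'])) {
        header('HTTP/1.1 403 Forbidden');
        exit('Access denied.');
    }
    ?>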
From what I can tell from talking to Don, the developer, this makes sense, because I’m not a tech guy. I’m a marketer. I don’t “do” techy stuff like .htaccess files, and I have no idea which bots are which, etc…
The biggest thing I took away from this was that Botsense has already done the extensive research on the bots, and when you sign up and go through the tool, you can see an explanation of what each bot does and how it can affect your system, then choose whether or not to block it.
I do know this. I can already see a decline in bandwidth because the tool allowed me to block a major bot that was hitting ReveNews hard.
OK, after looking at Botsense, I am not impressed. I didn’t expect to be, though. It does exactly what I thought it would do: it bans several user-agent snippets, and doesn’t even get half the bots (by user agent) I know about. It also bans a couple of small IP address ranges.
If you have a serious bot problem, this isn’t going to do much for you, especially since you’ve noticed the bots are turning up with quite real-looking user agents.
Now is where I am supposed to insert a shameless plug for Bad Behavior, which I wrote to combat the bot problem. It isn’t perfect either, but it gets much closer. And you don’t have to give up your email address to use it.
To: IO ERROR:
I love your “Bad Behavior” module for Drupal. Thanks for writing it.
I love IO Error’s plugin for WordPress, but I can also understand why non-techies need something that they can relate to easily.
Whether this product is capable of filling that gap or not is another question, but I can understand why a marketed product is needed.
Jeff and IO Error, you are both correct. The .htaccess file generated by the BotSense .htaccess generation tool does in fact block rogue bots based on the user agent field (and some bots based on their IPs).
While this may not be the most effective way to block all bad bots, not everyone has the technical know-how or the ability to install scripts, configure a database, or meet any other requirements that some solutions may have.
The tool is meant to provide a simple method for anyone regardless of their knowledge or technical background to easily generate an .htaccess file to block rogue bots which may be roaming their site.
It is a very simple tool to use. Log in, select bots (or entire categories), select a few options and generate your .htaccess file. We even provide an interactive feature that allows you to test your new .htaccess file after you upload it to your site.
The .htaccess generation tool is a beta product; you can expect the rogue bot database to grow as the product continues to mature.
I do appreciate and respect all comments made here.
I am with Jeff and IO ERROR.
BotSense does nothing to impress me so far. Sorry, Don. Even people who don’t do techie stuff still have to update their .htaccess file manually, and overwriting the file will get you in trouble. I hope that task is not too difficult for non-techies.
This is the bad bot list I came up with during a quick Google search (using the keywords “bad bot list”):
http://www.webmasterworld.com/forum11/2634.htm
Yes, the list should NOT be copied and used as-is, because there is a small bug that will block any bot with “bot” anywhere in its user agent string, but you get the idea.
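If you do adapt a list like that, a safer shape for the .htaccess rules is to match specific names rather than the bare string “bot”; the names below are just examples:

    # Flag known bad user agents by name; a bare "bot" pattern here
    # would also shut out Googlebot, msnbot and friends.
    SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "WebCopier"  bad_bot
    SetEnvIfNoCase User-Agent "WebZIP"     bad_bot
    <Limit GET POST HEAD>
        Order Allow,Deny
        Allow from all
        Deny from env=bad_bot
    </Limit>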
If all we want is to save bandwidth, I think a short list of popular bad bots will do the job. Let’s remember that the User-Agent string is just a string provided by the client, not something that can’t be modified. Why waste time on the 80% that will consume 20% when we can just block the 20% that consume 80%?