Controlling the Googlebot

Posted By Guest Blogger 9th of July 2006 General 0 Comments

How do you control what Google indexes, and when? That is the question. Most of the time, we want Google to snarf up as many pages as possible. In my own experience, though, there have been a few times when I did not want pages indexed and had to go back to Google to have them removed.

In the immortal words of Darren Rowse, there’s a tangent to follow. In my case, a church website I run inadvertently published information about missionaries working in sensitive areas of the world, and that information actually placed them in danger. I typically wanted information about the church and its functions indexed for people to find, but I did not want this information indexable. While I spun my wheels correcting the sensitive information, I realized that anyone in the world could still find enough about these people via Google, so I had to resort to Google’s URL removal tool.

Matt Cutts provides a concise, link-filled guide to controlling the Googlebot’s indexing. The details can be found at Matt’s site, but here is a short rundown:

At a site or directory level, use .htaccess to add password protection.
At a site or directory level, make use of a robots.txt file.
At a page level, use the noindex <meta> tag.
At a link level, use a nofollow attribute.
If the content has already been crawled, use the Google URL removal tool as a last resort.
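
To make the middle three items a little more concrete, here is a minimal sketch in Python (my choice of language, not something from Matt’s guide) that checks how a well-behaved crawler would read them. The robots.txt rules, the sample page, the example.com URLs, and the helper names `allowed_by_robots` and `RobotsHintParser` are all made up for illustration. Keep in mind that these signals are only requests to crawlers; they do not protect the content the way a password does.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page carrying a page-level noindex tag and one nofollow link.
PAGE_HTML = """\
<html><head><meta name="robots" content="noindex"></head>
<body>
<a href="/about.html">About</a>
<a href="/private/roster.html" rel="nofollow">Roster</a>
</body></html>
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsHintParser(HTMLParser):
    """Collect the noindex meta tag and nofollow links found in a page."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # content may list several directives, e.g. "noindex, nofollow"
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True
        elif tag == "a":
            # rel is a space-separated list of link types
            if "nofollow" in (attrs.get("rel") or "").lower().split():
                self.nofollow_links.append(attrs.get("href"))


if __name__ == "__main__":
    for path in ("/about.html", "/private/roster.html"):
        url = "https://example.com" + path
        print(path, allowed_by_robots(ROBOTS_TXT, "Googlebot", url))

    hints = RobotsHintParser()
    hints.feed(PAGE_HTML)
    print("noindex:", hints.noindex)
    print("nofollow links:", hints.nofollow_links)
```

Running a check like this before you publish is a cheap way to confirm that the rules say what you think they say. It does nothing for pages that have already been crawled.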

I would add just one point about common sense. There is a saying that nothing you do on the internet is anonymous, and there is something to be said for thinking before you act. It’s harder to clean up after a boneheaded mistake, such as the one I made for the church whose site I ran, than it is to think before posting anything online. If you don’t want the world to see it, then don’t rely on the mechanisms listed above. Simply don’t post it.


Comments

raj says: 07/09/2006 at 10:12 am

There is an old saying that if two people know something, it’s a secret. If three people know it, it’s public knowledge. If you need to publish sensitive info, you might want to try a password-protected site. If the info is truly life-threatening, there’s no way you should post it on the site at all. Because what if someone else links to it? [Provided you haven’t changed your .htaccess file or used Google’s URL remover.]
Brad says: 07/09/2006 at 2:26 pm

If I don’t want anyone to read something, then I don’t publish it on any of my websites! But, since my blog is still fairly new, I’m happy anytime I get a visit from the googlebot in hopes it helps my search engine ranking.
Nukes and Candy says: 07/09/2006 at 2:45 pm

“…If you don’t want the world to see it, then don’t rely on the mechanisms listed above. Simply don’t post it.”

This phrase is paramount.
Michael Moncur says: 07/09/2006 at 4:50 pm

I always operate on the assumption that there are only two possibilities: something is private (unpublished) or it’s public (published). If I publish, I assume that everyone in the world will have a chance to read everything on the site at the worst possible moment.

I haven’t had to beg Google to remove a URL yet…

Okay, that’s a lie, I’ve used Google’s URL Removal Tool a few times to correct a less dangerous mistake, but it’s one you should also watch out for:

* Always password-protect “beta” or “temporary” versions of a site.

It’s convenient to have a beta site online to test a new design or function, but be careful lest Google spider the whole thing and start substituting it for the results on your original site…
John Evans (Syntagma) says: 07/09/2006 at 7:53 pm

Matt Mullenweg actually had the noindex tag put on his site maliciously. Now that’s a dismal thought.

Aaron, any info on when this delayed PR update is going to happen? Matt Cutts is back. What’s the holdup?
Aaron Brazell says: 07/09/2006 at 10:36 pm

John, I’ll presume that you realize I have no clue about Google updates. I wait like anyone else. I’ll also presume that you were not aware that there was a PR update last week.
» links for 2006-07-10 Ian’s Messy Desk says: 07/10/2006 at 11:00 pm

[…] Controlling the Googlebot (tags: blogging Search Tools Google Bots) […]
flys (Marketing Lessons) says: 07/11/2006 at 2:11 am

Very interesting article.
How long does Google take to remove pages using the URL removal tool?
bye
adfi says: 08/13/2006 at 7:31 am

Yes. Very good site!
Brian says: 09/07/2006 at 6:35 am

this site rocks! http://www.sis.utk.edu/Members/derik/5.html
What to do when you publish by mistake « Church website and blog ideas says: 09/22/2006 at 10:18 pm

[…] One webmaster’s comments and help for when things are posted by mistake have been written about on Problogger. […]
Mark says: 09/25/2007 at 1:38 am

flys, Google can take quite a while to remove pages. How long depends, of course, on how often the Googlebot visits your site.
Netbooks says: 02/24/2009 at 6:23 am

I am a big believer in using robots.txt
Netbook Review says: 03/09/2009 at 2:05 am

I think robots.txt is a good tool, but be careful what you restrict!
Netbooks UK says: 03/17/2009 at 10:41 pm

Sure, robots.txt might work, but what about after the pages have already been indexed? How do you get Google to remove them?