
Controlling the Googlebot

Posted By 9th of July 2006 General 0 Comments

How do you control what gets indexed by Google, and when? That is the question. Most of the time, we want Google to snarf up as many pages as possible. In my own experience, though, there have been a few times when I did not want pages indexed and had to go back to Google to have them removed.

In the immortal words of Darren Rowse, there’s a tangent to follow:
In my case, a church website I run inadvertently published information about missionaries who were in sensitive areas of the world, and that information actually placed them in danger. I wanted information about the church and its functions indexed for people to find, but I did not want this information indexable. While I spun my wheels correcting the sensitive information, I realized that anyone in the world could find enough about these people via Google that I had to resort to Google’s URL removal tool.

Matt Cutts provides a concise, link-filled guide to controlling what the Googlebot indexes. The details can be found at Matt’s site, but here is a short rundown:

  1. At a site or directory level, use .htaccess to add password protection.
  2. At a site or directory level, make use of a robots.txt file.
  3. At a page level, use the noindex <meta> tag.
  4. At a link level, use a nofollow attribute.
  5. If the content has already been crawled, use the Google URL removal tool as a last resort.
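
Items 2 and 3 in the rundown are easy to get subtly wrong, so it can help to check them before trusting them. The following is only a minimal sketch in Python using the standard library. The robots.txt rules, the `/missionaries/` and `/beta/` paths, the example.org URLs, and the helper names `crawl_allowed` and `has_noindex` are all made up for illustration; they do not come from Matt’s guide. Keep in mind that robots.txt is a request that well-behaved crawlers honor, not access control.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every crawler to stay out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /missionaries/
Disallow: /beta/
"""


def crawl_allowed(robots_txt, agent, url):
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


if __name__ == "__main__":
    print(crawl_allowed(ROBOTS_TXT, "Googlebot", "http://example.org/missionaries/list.html"))
    print(crawl_allowed(ROBOTS_TXT, "Googlebot", "http://example.org/events/"))
    print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))
```

A check like this only tells you what your own files say. It does not tell you what Google has already crawled, which is why the removal tool remains the last resort in item 5.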

I would add just one point about common sense. There is a saying that nothing you do on the internet is anonymous, and there is something to be said for thinking before you act. It’s harder to clean up after a boneheaded mistake, such as the one I made for the church whose site I ran, than it is to think before posting anything online. If you don’t want the world to see it, then don’t rely on the mechanisms listed above. Simply don’t post it.

  • raj

    There is an old saying that if two people know something, it’s a secret. If three people know it, it’s public knowledge. If you need to publish sensitive info, you might want to try a password-protected site. If the info is truly life-threatening, there’s no way you should post it on the site at all. Because what if someone else links to it? [Provided you haven’t changed your .htaccess file or used Google’s URL remover.]

  • If I don’t want anyone to read something, then I don’t publish it on any of my websites! But, since my blog is still fairly new, I’m happy anytime I get a visit from the googlebot in hopes it helps my search engine ranking.

  • “…If you don’t want the world to see it, then don’t rely on the mechanisms listed above. Simply don’t post it.”

    This phrase is paramount.

  • I always operate on the assumption that there are only two possibilities: something is private (unpublished) or it’s public (published). If I publish, I assume that everyone in the world will have a chance to read everything on the site at the worst possible moment.

    I haven’t had to beg Google to remove a URL yet…

    Okay, that’s a lie, I’ve used Google’s URL Removal Tool a few times to correct a less dangerous mistake, but it’s one you should also watch out for:

    * Always password-protect “beta” or “temporary” versions of a site.

    It’s convenient to have a beta site online to test a new design or function, but be careful lest Google spider the whole thing and start substituting it for the results on your original site…

  • Matt Mullenweg actually had the noindex tag put on his site maliciously. Now that’s a dismal thought.

    Aaron, any info on when this delayed PR update is going to happen? Matt Cutts is back. What’s the holdup?

  • John, I’ll presume that you realize I have no clue about Google updates. I wait like anyone else. I’ll also presume that you were not aware that there was a PR update last week.


  • Very interesting article.
    How long does Google take to remove pages submitted through the URL removal tool?

  • Yes. Very good site!


  • flys – Google can take quite a while to remove pages. How long depends, of course, on how often its crawler visits.

  • I am a big believer in using robots.txt

  • I think robots.txt is a good tool, but be careful what you restrict!

  • Sure, robots.txt might work, but what about after the pages have already been indexed? How do you get Google to remove them?