How do you control what gets indexed by Google, and when? That is the question. Most of the time, we want Google to snarf up as many pages as possible, but I can think of a few times in my own experience when indexing was not something I wanted, and I had to go back to Google to actually have pages removed.
In the immortal words of Darren Rowse, there’s a tangent to follow: in my case, a church website I run inadvertently included information about missionaries working in sensitive areas of the world, and that information actually placed them in danger. While I typically wanted information about the church and its functions indexed for people to find, I did not want this information indexable. While I spun my wheels correcting the sensitive information, I realized that anyone in the world could still find enough about these people via Google that I had to resort to Google’s URL removal tool.
Matt Cutts provides a concise, link-filled guide to controlling the Googlebot’s indexing. The details can be found at Matt’s site, but here is a short rundown (example snippets follow the list):
- At a site or directory level, use .htaccess to add password protection.
- At a site or directory level, make use of a robots.txt file.
- At a page level, use the noindex meta tag.
- At a link level, use a nofollow attribute on the link.
- If the content has already been crawled, use the Google URL removal tool as a last resort.
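To make those options concrete, here are rough sketches of what each mechanism looks like. The directory and file names are placeholders, not references to any real site. The .htaccess approach assumes an Apache server with basic authentication available:

```
# .htaccess in the directory to protect (site or directory level);
# the .htpasswd file is created separately with Apache's htpasswd utility
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user
```

A robots.txt file goes at the root of the site and asks crawlers to stay out of a path:

```
# robots.txt: ask all well-behaved crawlers to skip a directory
User-agent: *
Disallow: /private/
```

At the page and link level, the hints go in the HTML itself:

```
<!-- page level: ask search engines not to index this page -->
<meta name="robots" content="noindex">

<!-- link level: ask search engines not to follow this particular link -->
<a href="/private/updates.html" rel="nofollow">Private updates</a>
```

Keep in mind that robots.txt and the meta/link hints are requests, not security; they rely on the crawler’s good behavior. Password protection is the only option above that actually keeps content away from prying eyes, which ties into the point below.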
I would add just one point on common sense and smart web practice. There is a saying that nothing you do on the internet is anonymous, and there is something to be said for thinking before you act. It’s harder to clean up from a boneheaded mistake, such as the one I made on the church site I run, than it is to think before posting anything online. If you don’t want the world to see it, don’t rely on the mechanisms listed above. Simply don’t post it.