Webmonkey is a property of Wired Digital.

Control Bot Access to Your Site

Have you ever wondered why your server logs show 404 errors for a file named robots.txt when you've never linked to or created any such file?

The answer is that web crawlers, the kind that Google, Yahoo and Microsoft use to scour the web for content, always look for a file named robots.txt first when they index your site. The file ostensibly gives the robots permission to index some pages and leave the rest alone.

If you've got 404 errors, it means your site is missing a robots.txt file. If you haven't supplied one, the bots will go on crawling your site without instructions, essentially just winging it. But why not help them out and gain control over what is indexed in the process? For example, you can restrict bots from downloadable content in your admin section. When used in conjunction with a sitemap, it might even help improve your search engine ranking.

What You'll Need

  • A text editor
  • An FTP client
  • Admin access to your web server


Get Started

As the name implies, robots.txt is simply a flat text file with a few simple directions that tell robots, or specific crawlers, what parts of your site to index.

To get started writing your own, let's use a simple example. Imagine you have a WordPress site at http://www.mysite.com, with its admin area at http://www.mysite.com/wp-admin/. You don't want the robots to index your admin login page because it's private, so create a new file at the root level of your site and name it robots.txt. Next, add these lines:

User-Agent: *
Disallow: /wp-admin

This tells all bots crawling your site to ignore the wp-admin directory and everything below it. The * is used as a wildcard to match any user agent.
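
If you want to check a rule before uploading it, one option is Python's standard-library urllib.robotparser module. The sketch below is only a rough check: it assumes real crawlers read the rule the same way Python's parser does, and "ExampleBot" is a made-up crawler name used for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same two lines you saved in robots.txt.
RULES = """\
User-Agent: *
Disallow: /wp-admin
"""


def is_allowed(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" is a hypothetical crawler name, not a real bot.
print(is_allowed(RULES, "ExampleBot", "http://www.mysite.com/wp-admin/"))  # False
print(is_allowed(RULES, "ExampleBot", "http://www.mysite.com/about/"))     # True
```

Python's parser treats the Disallow value as a path prefix, which is why everything below wp-admin is covered by the single rule.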

The basic format for all robots.txt rules is:

User-Agent: [Bot name]
Disallow: [Directory or File Name]
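
Because the format is so regular, you could also generate the file from a script. Here is a minimal sketch in Python; the render_robots function and its dictionary layout are our own invention for illustration, not part of any standard tool.

```python
def render_robots(groups):
    """Render {bot name: [paths to disallow]} as robots.txt text.

    `groups` is a hypothetical layout chosen for this example.
    """
    blocks = []
    for agent, paths in groups.items():
        lines = [f"User-Agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        blocks.append("\n".join(lines))
    # A blank line separates one bot's group of rules from the next.
    return "\n\n".join(blocks) + "\n"


print(render_robots({"*": ["/wp-admin"]}))
```

Called with the dictionary shown, it prints the same two lines as the wp-admin example.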

So let's modify the wp-admin example so that only Googlebot is excluded (for no good reason other than for the sake of example):

User-Agent: Googlebot
Disallow: /wp-admin

Here's a more practical example for preventing the Google image scraping bot from indexing your images folder:

User-Agent: Googlebot-Image
Disallow: /images

Let's say you really hate the Lycos web crawler. Well, just disallow your whole site:

User-Agent: T-Rex
Disallow: /

The Lycos user agent is strangely "T-Rex," which raises the question: where do you find out the names of all the various crawlers and their user agent signatures?

The answer is to head over to the Robotstxt website and check out the list of bots in the wild. You'll note that there are over 300 different bots listed there, most of which you've probably never heard of. Don't worry, neither have we.

In most cases you can get by with rules that just use the * wildcard, but should you ever need to target a specific bot, now you know how.
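
To see that bot-specific rules touch only the bots they name, here is a small sketch that combines the Googlebot-Image and T-Rex rules from above into one file and tests them with Python's urllib.robotparser. It assumes Python's parser is a fair stand-in for how the real bots behave, and "SomeOtherBot" is a made-up name.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-Agent: Googlebot-Image
Disallow: /images

User-Agent: T-Rex
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def blocked(agent, path):
    """Return True if the rules tell `agent` to stay away from `path`."""
    return not parser.can_fetch(agent, path)


print(blocked("T-Rex", "/"))                            # True: whole site is off limits
print(blocked("Googlebot-Image", "/images/logo.png"))   # True
print(blocked("Googlebot-Image", "/about/"))            # False
# "SomeOtherBot" is hypothetical; no group names it, so nothing is blocked.
print(blocked("SomeOtherBot", "/images/logo.png"))      # False
```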

More Complex Scenarios

So far we've just created very basic rules, but you can actually get quite complex. Let's say, for example, that we want all bots to ignore our WordPress admin pages, and we want every bot except Googlebot-Image to ignore our images directory.

Here's what that would look like:

User-agent: *
Disallow: /wp-admin
Disallow: /images

User-agent: Googlebot-Image
Disallow: /wp-admin

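First we address all bots and tell them to ignore both of the directories we want to keep hidden. Then we specifically address the Google Image bot and tell it to ignore only the wp-admin directory. The specific rule overrides the general one: a bot that finds a group naming it follows that group and skips the * group entirely. That's why the wp-admin rule has to be repeated for Googlebot-Image, and why the Google Image bot will be free to crawl the images directory.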
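
The override behavior is easy to get wrong, so it can be worth testing. The sketch below feeds the rules above to Python's urllib.robotparser; treat it as a sanity check under the assumption that the crawlers you care about resolve groups the same way. "OtherBot" is a made-up name standing in for any bot covered only by the * group.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin
Disallow: /images

User-agent: Googlebot-Image
Disallow: /wp-admin
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "OtherBot" is hypothetical; it falls through to the * group.
CHECKS = [
    ("Googlebot-Image", "/images/photo.jpg"),
    ("Googlebot-Image", "/wp-admin/"),
    ("OtherBot", "/images/photo.jpg"),
    ("OtherBot", "/wp-admin/"),
    ("OtherBot", "/"),
]

results = {(agent, path): parser.can_fetch(agent, path) for agent, path in CHECKS}

for (agent, path), allowed in results.items():
    print(f"{agent:16} {path:20} {'allowed' if allowed else 'blocked'}")
```

Only Googlebot-Image should come back as allowed for the images path, while both bots are blocked from wp-admin.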

Caveats

Most well-behaved bots will obey your robots.txt rules. However, it's important to note that this isn't a security method. Just because you tell the bots to ignore your private files doesn't mean that a) they will (there are badly behaved bots out there) or b) anyone else will.

Robots.txt files are merely guides, not a way to make sure no one sees your pages. If you're looking to secure your files, use something like a password-protected directory. This way you'll stop both the bots and the humans.

Conclusion

That's really all there is to robots.txt. If you'd like to learn more about robots and see some other examples, head over to the Robotstxt website, which is the web's most comprehensive source for all things related to web crawlers.

  • This page was last modified 15:55, 17 October 2008.