Control Bot Access to Your Site
Revision as of 15:00, 17 October 2008
Have you ever wondered why your server logs show 404 errors for a file named robots.txt when you've never linked to or created any such file? The answer is that web crawlers, the kind that search engines like Google, Yahoo and Microsoft use to scour the web for content, always look for a file named robots.txt first. The file ostensibly tells the robots which pages they may index and which to leave alone.
If you see those 404 errors, your site is missing a robots.txt file. The bots will go on crawling your site without instructions, just winging it. Why not help them out and gain a little control over what is indexed in the process? For example, you can tell bots not to download content in your admin section. When used in conjunction with a sitemap, it might even help improve your search engine ranking.
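One rough way to confirm that crawlers are asking for a missing robots.txt is to scan your access log for failed requests. The sketch below assumes the log uses the Common Log Format; the pattern, the function name `count_missing_robots` and the sample lines are all invented for illustration, so adjust them to whatever your server actually writes.

```python
import re

# Pulls the requested path and the status code out of one log line.
# Assumes Common Log Format; other formats need a different pattern.
LOG_PATTERN = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def count_missing_robots(log_lines):
    """Count requests for /robots.txt that were answered with a 404."""
    count = 0
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("path") == "/robots.txt" and match.group("status") == "404":
            count += 1
    return count


# Made-up sample lines (documentation IP addresses, not real crawler traffic).
SAMPLE_LOG = [
    '192.0.2.1 - - [17/Oct/2008:15:00:01 +0000] "GET /robots.txt HTTP/1.1" 404 209',
    '192.0.2.1 - - [17/Oct/2008:15:00:02 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '198.51.100.7 - - [17/Oct/2008:15:03:10 +0000] "GET /robots.txt HTTP/1.0" 404 209',
]

print(count_missing_robots(SAMPLE_LOG))
```

Any count above zero suggests that bots are looking for the file and not finding it.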
Get Started
As the name implies, robots.txt is simply a flat text file with a few simple directions that tell all robots, or specific crawlers, what parts of your site to index.
To get started writing your own, let's use a simple example. Imagine you have a site at http://mysite.com and you use WordPress. You access the site's admin pages at the URL: http://www.mysite.com/wp-admin/. You don't want the robots to index your admin login page because it's private, so create a new file at the root level of your site and name it robots.txt. Next, add these lines:
    User-Agent: *
    Disallow: /wp-admin
This tells all bots crawling your site to ignore the wp-admin directory and everything below it. Rules match by prefix, so the line covers any path that begins with /wp-admin. The * is used as a wildcard to match any user agent.
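To see how a parser reads these two lines, here is a minimal sketch using `urllib.robotparser` from Python's standard library. The helper name `is_allowed` and the bot name `AnyBot` are made up for illustration, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule set from the wp-admin example.
RULES = """\
User-Agent: *
Disallow: /wp-admin
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `rules` let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# "AnyBot" is a hypothetical crawler name; the * group applies to it.
print(is_allowed(RULES, "AnyBot", "http://www.mysite.com/wp-admin/"))
print(is_allowed(RULES, "AnyBot", "http://www.mysite.com/about/"))
```

The first call should report that the admin URL is off limits, while the second URL, which no rule mentions, remains open to crawling.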
The basic format for all robots.txt rules is:
    User-Agent: [Bot name]
    Disallow: [Directory or File Name]
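If you would rather generate the file than type it by hand, the format is simple enough to produce from a small mapping. This is only a sketch: the function name `render_robots_txt` and the idea of keeping rules in a dictionary are assumptions for illustration, not part of any robots.txt tooling.

```python
from typing import Dict, List


def render_robots_txt(groups: Dict[str, List[str]]) -> str:
    """Build robots.txt text from a mapping of bot name -> disallowed paths."""
    blocks = []
    for bot_name, paths in groups.items():
        lines = [f"User-Agent: {bot_name}"]
        lines.extend(f"Disallow: {path}" for path in paths)
        blocks.append("\n".join(lines))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"


print(render_robots_txt({"*": ["/wp-admin"]}))
```

Save the returned text as robots.txt at the root level of your site.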
So let's modify the wp-admin example so that only Googlebot is excluded (for no good reason other than for the sake of example):
    User-Agent: Googlebot
    Disallow: /wp-admin
Here's a more practical example that keeps Google's image crawler from indexing your images folder:
    User-Agent: Googlebot-Image
    Disallow: /images
Let's say you really hate the Lycos web crawler. Well, just disallow your whole site:
    User-Agent: T-Rex
    Disallow: /
The Lycos user agent is strangely "T-Rex," which raises the question: where do you find out the names of all the various crawlers and their user agent signatures?
The answer is to head over to the Robotstxt website (http://www.robotstxt.org) and check out the list of bots in the wild (http://www.robotstxt.org/db.html). You'll note that there are over 300 different bots listed there, most of which you've probably never heard of. Don't worry, neither have we.
In most cases you can get by with rules that just use the * wildcard, but should you ever need to target a specific bot, now you know how.
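Rules aimed at a single bot should leave every other crawler untouched. The sketch below checks that with `urllib.robotparser` from Python's standard library, combining the Googlebot-Image and T-Rex rules into one file. The helper `allowed`, the paths and the name `SomeOtherBot` are hypothetical, and Python's matching is only one interpretation of the format.

```python
from urllib.robotparser import RobotFileParser

# Two bot-specific groups, separated by a blank line.
RULES = """\
User-Agent: Googlebot-Image
Disallow: /images

User-Agent: T-Rex
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let `user_agent` fetch `path`."""
    return parser.can_fetch(user_agent, path)


# Each named bot is limited only by its own group.
print(allowed("Googlebot-Image", "/images/logo.png"))
print(allowed("T-Rex", "/about.html"))
# A bot with no matching group (hypothetical name) is not restricted.
print(allowed("SomeOtherBot", "/images/logo.png"))
```

With no * group in the file, any crawler that is not named explicitly may fetch everything.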
More complex scenarios
So far we've just created very basic rules, but you can get more complex. Let's say, for example, that we want all bots to ignore our WordPress admin pages and we want all bots except Google's image bot to ignore our images directory.
Here's what that would look like:
    User-agent: *
    Disallow: /wp-admin
    Disallow: /images

    User-agent: Googlebot-Image
    Disallow: /wp-admin
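First we address all bots and tell them to ignore both of the directories we want to keep hidden. Then we specifically address the Google Image bot and tell it to ignore only the wp-admin directory. A bot follows only the most specific group that matches its user agent, so the specific rule overrides the general one: the Google Image bot skips the * group entirely, which is why the wp-admin rule has to be repeated for it, and it is free to crawl the images directory.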
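The override can be checked with a short sketch, again leaning on `urllib.robotparser` from Python's standard library. The helper `allowed`, the file paths and the name `OtherBot` are assumptions for illustration; how a given crawler resolves overlapping groups is ultimately up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The general group plus the Googlebot-Image override from this section.
RULES = """\
User-agent: *
Disallow: /wp-admin
Disallow: /images

User-agent: Googlebot-Image
Disallow: /wp-admin
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let `user_agent` fetch `path`."""
    return parser.can_fetch(user_agent, path)


# The image bot uses its own group, so /images is open to it.
print(allowed("Googlebot-Image", "/images/photo.jpg"))
print(allowed("Googlebot-Image", "/wp-admin/"))
# Every other bot (hypothetical name) falls back to the * group.
print(allowed("OtherBot", "/images/photo.jpg"))
print(allowed("OtherBot", "/index.html"))
```

Only the first and last checks should come back as allowed: the image bot may fetch the photo, and other bots may fetch pages that no rule mentions.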
Caveats
Most well-behaved bots will obey your robots.txt rules. However, it's important to note that this isn't a security method. Just because you tell the bots to ignore your private files doesn't mean that a) they will (there are badly behaved bots out there) or b) anyone else will.
Robots.txt files are merely guides, not a way to make sure no one sees your pages. If you're looking to secure your files, use something like a password-protected directory. This way you'll stop the bots and the humans.
Conclusion
That's really all there is to robots.txt. If you'd like to learn more about robots and see some other examples, head over to the Robotstxt website (http://www.robotstxt.org/), which is the web's most comprehensive source for all things related to web crawlers.
What you'll need
- admin access to your website
