Robots.txt Howto, Tutorial And Reference
1.1 Robots.txt File
A robots.txt is the first file a crawler will grab from a site, and this file indicates where on a site the crawler is not allowed to go.
The file must be placed in the root of a site and be called robots.txt, e.g. http://www.webmasterworld.com/robots.txt, which is a particularly thorough example. This applies to every host name, including sub-domains. As such, example.example.com needs a robots.txt file of its own, as does www.example.com.
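Because the file always sits at the root of the host, its location can be derived from any page URL. The following is a minimal Python sketch of that rule; the helper name robots_url is hypothetical and only illustrates the point.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Each host name, including each sub-domain, has its own file.
www_robots = robots_url("http://www.example.com/pdfs/a.pdf?page=2")
sub_robots = robots_url("http://example.example.com/index.htm")
```

The two results differ, which is why a sub-domain cannot rely on the robots.txt of the main site.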
There are only three elements that make up a robots.txt file, and these are:
- User-agent: This indicates which user agent or bot the following directives are aimed at, e.g.
User-agent: Googlebot
This indicates that the commands on the following lines are aimed at Google. Every bot has its own user agent (UA) name, and there is also a generic value, *, that addresses all bots.
- Disallow: This command indicates which areas a bot is not allowed to index. For example:
User-agent: Teoma
Disallow: /
This tells Teoma, the Ask Jeeves bot, not to index anything (see below for more detail).
- Comments (#): Everything after a hash (#) on a line is treated as a comment and ignored.
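The three elements above can be seen working together with Python's standard urllib.robotparser module. This is only a sketch: the robots.txt content and the /private/ folder are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content using all three elements.
ROBOTS_TXT = """\
# Keep Teoma out of the whole site
User-agent: Teoma
Disallow: /

User-agent: *
Disallow: /private/   # trailing comments are ignored too
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Teoma matches its own record and is shut out completely.
teoma_allowed = parser.can_fetch("Teoma", "http://www.example.com/index.htm")
# Any other bot falls back to the generic * record.
other_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.htm")
other_private = parser.can_fetch("Googlebot", "http://www.example.com/private/a.htm")
```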
There also exist several non-standard robots.txt commands, such as Allow (which explicitly allows a search engine access to a folder or file), Yahoo and MSN’s Crawl-delay, and a number of regular expression style Disallow commands that are officially supported by only a few engines, such as Google. It is probably best to avoid such commands, as they may cause issues, although as Google has such a strong market share in Australia, the Google-specific commands may be OK.
1.2 Disallowing Pages
The following are examples of disallowing pages based upon different commands:
1.2.1 Disallow Indexing of Whole Directories
Adding a trailing slash indicates that all the files in a folder are disallowed, e.g.
User-agent: *
Disallow: /pdfs/
Would disallow the robots from indexing anything in the www.example.com/pdfs/ folder.
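As a rough check of how the trailing slash behaves, the rule can be fed to Python's standard urllib.robotparser. This is a sketch only: ExampleBot and the test paths are invented, and other parsers may differ in detail.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /pdfs/"])


def allowed(path):
    """Ask whether a hypothetical bot may fetch a path on www.example.com."""
    return rules.can_fetch("ExampleBot", "http://www.example.com" + path)


in_folder = allowed("/pdfs/report.pdf")        # inside the disallowed folder
similar_name = allowed("/pdfs-archive/a.pdf")  # different folder, same start
elsewhere = allowed("/index.htm")
```

With the trailing slash in place, only URLs inside /pdfs/ are blocked; a folder that merely begins with the same letters is left alone.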
1.2.2 Disallow Indexing Specific Pages
Similarly, individual pages can be disallowed, e.g.
User-agent: *
Disallow: /embarrassment.htm
Would disallow www.example.com/embarrassment.htm from being indexed.
1.2.3 Disallow Indexing Anything That Starts With A Set Prefix
The command
User-agent: *
Disallow: /dog
Would stop anything that starts with www.example.com/dog from being indexed, e.g. www.example.com/dogybag.htm, www.example.com/dog/dog.htm or www.example.com/dog.pdf. This is a useful command, but it has obvious problems: it is easy to block more than intended, and it can be hard to keep track of all the disallowed prefixes. It is far better to simply lock off a directory than to use this method.
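The prefix behaviour boils down to a simple "starts with" test on the URL path. The sketch below spells that out in Python; is_disallowed is a hypothetical helper, not part of any crawler, and /cat.htm is an invented control case.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value."""
    return any(path.startswith(rule) for rule in disallow_rules if rule)


paths = ["/dogybag.htm", "/dog/dog.htm", "/dog.pdf", "/cat.htm"]

# "Disallow: /dog" catches everything that begins with /dog ...
blocked_by_prefix = [p for p in paths if is_disallowed(p, ["/dog"])]
# ... while "Disallow: /dog/" catches only the directory's contents.
blocked_by_directory = [p for p in paths if is_disallowed(p, ["/dog/"])]
```

Comparing the two lists shows why locking off a directory is the more predictable choice.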
1.2.4 Proprietary Commands
These are all unsupported and non-standard commands, and should only be used with an understanding that they may cause issues with other crawlers.
1.2.4.1 Crawl-Delay
This is a command that both Yahoo and MSN accept that slows down their crawlers (see http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html), e.g.
User-agent: Slurp
Crawl-delay: 20
This command supposedly limits the crawler to no more than one request every 20 seconds. It is the opinion of WMS that this is ill-advised. The hardest part of SEO is getting all of a site’s pages indexed, and anything that makes a crawler change its crawl schedule is likely to be bad. If search engine crawlers are clogging a server, it is better to optimise code and database calls and to issue 304 response codes than to attempt to slow the crawlers down.
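For reference, this is roughly how a crawler-side library might read the value. The sketch uses the crawl_delay method of Python's standard urllib.robotparser (available from Python 3.6); how Yahoo's or MSN's crawlers actually apply the delay is not shown here and is not something this example can confirm.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: Slurp", "Crawl-delay: 20"])

# The delay applies only to the bot named in the record.
slurp_delay = parser.crawl_delay("Slurp")
# A bot with no matching record gets no delay at all (None).
other_delay = parser.crawl_delay("Googlebot")
```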
1.2.4.2 Allow
Some crawlers accept the Allow command, which explicitly permits specific directories or files to be crawled. Although this has some merit, it is probably better to simply put excluded material in password-protected or disallowed directories.
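A short sketch of Allow in use, again with Python's standard urllib.robotparser. The file names and ExampleBot are invented. Note the assumption about ordering: this particular parser applies the first rule that matches, so the Allow line is placed before the broader Disallow line. Other crawlers may resolve such conflicts differently, which is one reason the command can cause surprises.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Allow: /pdfs/public.pdf",  # exception, listed first for this parser
    "Disallow: /pdfs/",         # everything else in the folder is blocked
])

public_ok = parser.can_fetch("ExampleBot", "http://www.example.com/pdfs/public.pdf")
secret_ok = parser.can_fetch("ExampleBot", "http://www.example.com/pdfs/secret.pdf")
```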
1.2.4.3 Regular Expressions
Google and MSN both allow a limited form of pattern matching in commands: a * wildcard and a $ end-of-URL anchor, rather than full regular expressions. This can be useful in banning specific filetypes, e.g.
User-agent: msnbot
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$
Would disallow msnbot from indexing any URL ending in .PDF, .jpeg or .exe, wherever it sits on the site.
This is a very useful addition, but as a non-standard extension it is problematic (especially the use of special characters in the commands), and there are simply better ways to achieve the same goal (e.g. putting PDFs in a disallowed folder).
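To make the pattern syntax concrete, here is a sketch that translates such a rule into a Python regular expression. It assumes the commonly described semantics (* matches any run of characters, a trailing $ anchors the end of the URL, matching is case-sensitive); the function names are hypothetical, and no search engine's real matcher is being reproduced.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped "*" back into "match anything".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """Return True if path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


nested_pdf = is_blocked("/*.PDF$", "/docs/file.PDF")
with_query = is_blocked("/*.PDF$", "/docs/file.PDF?page=2")
lower_case = is_blocked("/*.PDF$", "/docs/file.pdf")
```

Under these assumptions the rule reaches into sub-folders, but it misses URLs with a query string or a differently cased extension, which illustrates how easily such patterns behave unexpectedly.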
1.2.5 Disallowing Crawling and Indexing
There can often be some confusion as to how a URL can be excluded but still indexed. This is due to the semantics of robots.txt, and the timeframe between finding a URL and downloading a site’s robots.txt file.
Robots.txt, as written, does not stop a search engine from having a specific URL in its index. The protocol bans a search engine from accessing a page, but not from listing it. With the rise of off-page analysis, a search engine can build quite a bit of context for a page without ever visiting it, and as such can conceivably have a lot of URLs in its index that it has never fetched.
The only way to ensure a page or URL is not indexed is to use the robots meta tag to specifically ban the indexing of that URL. Even then, until the page has been visited, it may still reside in the index. Note that a crawler has to fetch the page to see the tag, so the page must not also be disallowed in robots.txt.
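As an illustration of what the robots meta tag looks like to a program, the sketch below uses Python's standard html.parser to check a page for a noindex directive. The class and function names and the sample page are hypothetical; real search engines apply their own parsing rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html):
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
page_blocked = has_noindex(page)
plain_blocked = has_noindex("<html><head><title>Hi</title></head></html>")
```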
1.2.6 Additional Robots.txt Resources
http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.1 - W3C HTML 4.01 standard
http://www.robotstxt.org/ - The robots.txt information site. This hasn’t been updated in a very long time.
http://www.robotstxt.org/wc/exclusion-admin.html - A how-to for web administrators on creating robots.txt files
http://www.robotstxt.org/wc/exclusion.html - Information on excluding search engine robots
http://www.google.com/webmasters/3.html - Google’s information about robots.txt and the robots meta tag
help.yahoo.com/us/ysearch/deletions/deletions-03.html - Yahoo’s information on robots.txt and deletions
search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm - MSN’s information on their robots.txt implementation

