Note that this page is not a specification -- for the formal syntax and full definition, see the specification.
When a compliant Web Robot visits a site, it first checks for a "/robots.txt" URL on the site. If this URL exists, the robot parses its contents for directives that instruct it not to visit certain parts of the site.
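For illustration, Python's standard urllib.robotparser module performs this same check; this is only a sketch, and the robot name "MyBot" and the page URL below are placeholders, not anything defined by this document:

    from urllib import robotparser

    # A compliant robot fetches the site's single "/robots.txt" before crawling...
    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.w3.org/robots.txt")
    rp.read()  # fetch and parse the directives

    # ...and consults the parsed directives before requesting any other URL on the site.
    if rp.can_fetch("MyBot", "http://www.w3.org/some/page.html"):
        print("allowed to fetch")
    else:
        print("excluded by /robots.txt")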
As a Web Server Administrator you can create directives that make sense for your site. This page tells you how.
The "/robots.txt" URL always sits at the top level of the site's URL space, directly under the host (and port):

    Site URL                   Corresponding robots.txt URL
    http://www.w3.org/         http://www.w3.org/robots.txt
    http://www.w3.org:80/      http://www.w3.org:80/robots.txt
    http://www.w3.org:1234/    http://www.w3.org:1234/robots.txt
    http://w3.org/             http://w3.org/robots.txt
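If you need to derive that location programmatically, here is a minimal sketch using Python's urllib.parse; the helper name robots_txt_url is just for this example:

    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_url(site_url):
        # Keep the scheme, host and port; replace the path with "/robots.txt"
        # and drop any query string or fragment.
        parts = urlsplit(site_url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    print(robots_txt_url("http://www.w3.org/"))       # http://www.w3.org/robots.txt
    print(robots_txt_url("http://www.w3.org:1234/"))  # http://www.w3.org:1234/robots.txt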
Note that there can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this, your users might want to use the Robots META Tag instead.
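Here is one rough sketch of such a merge, assuming a hypothetical layout where each user keeps a personal robots.txt of plain Disallow lines under /home/<user>/public_html/ and is served under /~<user>/; the usernames and paths are illustrative only:

    import os

    # Hypothetical layout: each user's personal robots.txt contains only
    # "Disallow:" lines meant for all robots.
    users = ["joe", "ann"]
    merged = ["User-agent: *"]

    for user in users:
        path = f"/home/{user}/public_html/robots.txt"
        if not os.path.exists(path):
            continue
        with open(path) as f:
            for line in f:
                if line.strip().lower().startswith("disallow:"):
                    suffix = line.split(":", 1)[1].strip()
                    if suffix:
                        # Re-root the user's path under their URL prefix.
                        merged.append(f"Disallow: /~{user}{suffix}")

    print("\n".join(merged))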
Also, remember that URLs are case sensitive, and "/robots.txt" must be all lower-case.
Pointless robots.txt URLs:

    http://www.w3.org/admin/robots.txt
    http://www.w3.org/~timbl/robots.txt
    ftp://ftp.w3.com/robots.txt
So you need to provide the "/robots.txt" file at the top level of your URL space. How to do this depends on your particular server software and configuration.
For most servers this means creating a file in your top-level server directory. On a UNIX machine this might be /usr/local/etc/httpd/htdocs/robots.txt.
Here is a sample "/robots.txt" file:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/
In this example, three directories are excluded.
Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
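In other words, a Disallow value is matched as a literal URL-path prefix. Here is a rough sketch of that matching, using the sample record above; the allowed() helper is hypothetical, not part of any robots.txt library:

    disallowed_prefixes = ["/cgi-bin/", "/tmp/", "/~joe/"]

    def allowed(path):
        # A Disallow value is a literal path prefix, not a pattern:
        # "/tmp/" blocks "/tmp/scratch.html" but not "/temporary.html".
        return not any(path.startswith(prefix) for prefix in disallowed_prefixes)

    print(allowed("/tmp/scratch.html"))  # False -- matches the /tmp/ prefix
    print(allowed("/index.html"))        # True  -- no prefix matches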
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here are some examples:
To exclude all robots from the entire server:

    User-agent: *
    Disallow: /
To allow all robots complete access:

    User-agent: *
    Disallow:
Or create an empty "/robots.txt" file.
To exclude all robots from parts of the server:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/
To exclude a single robot:

    User-agent: BadBot
    Disallow: /
To allow a single robot (and exclude all others):

    User-agent: WebCrawler
    Disallow:

    User-agent: *
    Disallow: /
To exclude all files except one: there is no "Allow" field, so the easiest way is to put all files you want disallowed into a separate directory, leave the one file in the level above it, and disallow that directory:

    User-agent: *
    Disallow: /~joe/docs/

Alternatively you can explicitly disallow every page you want excluded:
    User-agent: *
    Disallow: /~joe/private.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html
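If you want to sanity-check records like the "allow a single robot" example above before publishing them, Python's urllib.robotparser can evaluate them per user-agent; this is only a sketch, and the example.com URLs and robot names are placeholders:

    from urllib import robotparser

    rules = [
        "User-agent: WebCrawler",
        "Disallow:",
        "",
        "User-agent: *",
        "Disallow: /",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)  # parse the record lines directly, without fetching anything

    print(rp.can_fetch("WebCrawler", "http://example.com/page.html"))    # True
    print(rp.can_fetch("SomeOtherBot", "http://example.com/page.html"))  # False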