In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:
The Robots Exclusion Protocol
    A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, at http://.../robots.txt.

The Robots META tag
    A Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag.
Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:

    User-agent: *
    Disallow: /

to see if it is allowed to retrieve the document. The precise details on how these rules can be specified, and what they mean, can be found in the Robots Exclusion Protocol standard.
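As an illustrative sketch only (not part of the protocol itself), a robot written in Python could perform this check with the standard library's urllib.robotparser module; the robot name "ExampleBot" and the example URL below are invented for the illustration:

    from urllib import robotparser

    # Hypothetical robot name and document URL, used only for illustration.
    ROBOT_NAME = "ExampleBot"
    SITE = "http://www.foobar.com"

    # Fetch and parse the site's /robots.txt.
    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()

    # Ask whether this robot is allowed to retrieve a given document.
    url = SITE + "/some/page.html"
    if rp.can_fetch(ROBOT_NAME, url):
        print("robots.txt allows retrieving", url)
    else:
        print("robots.txt disallows retrieving", url)

With the record shown above (User-agent: *, Disallow: /), such a check would report every URL on the site as disallowed, for every robot.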
Note that currently only a few robots implement the Robots META tag. In this simple example:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

a robot should neither index this document, nor analyse it for links.
Full details on how this tag works are provided in the Robots META tag documentation.
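By way of illustration (again a sketch, not any particular robot's implementation), a robot could extract these directives with Python's html.parser before deciding whether to index a page or follow its links; the class name below is invented for the example:

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        # Collects NOINDEX/NOFOLLOW directives from <META NAME="ROBOTS"> tags.
        def __init__(self):
            super().__init__()
            self.may_index = True    # may the page be indexed?
            self.may_follow = True   # may its links be analysed?

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = {name: (value or "") for name, value in attrs}
            if attrs.get("name", "").lower() != "robots":
                return
            directives = {d.strip().lower() for d in attrs.get("content", "").split(",")}
            if "noindex" in directives:
                self.may_index = False
            if "nofollow" in directives:
                self.may_follow = False

    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    parser = RobotsMetaParser()
    parser.feed(page)
    print(parser.may_index, parser.may_follow)   # prints: False False

For the example tag above, the parser reports that the page may neither be indexed nor analysed for links.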