In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:
The Robots Exclusion Protocol
    A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, at http://.../robots.txt.

The Robots META tag
    A Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag.
Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:

    User-agent: *
    Disallow: /

to see if it is allowed to retrieve the document. The precise details on how these rules can be specified, and what they mean, can be found in the Robots Exclusion Protocol standard.
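As an illustrative sketch only (not part of the protocol itself), a robot written in Python could perform this check with the standard library's urllib.robotparser module; the robot name "ExampleBot" and the example URL below are invented for the illustration:

    from urllib import robotparser

    # Hypothetical robot name and document URL, used only for illustration.
    ROBOT_NAME = "ExampleBot"
    SITE = "http://www.foobar.com"

    # Fetch and parse the site's /robots.txt.
    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()

    # Ask whether this robot is allowed to retrieve a given document.
    url = SITE + "/some/page.html"
    if rp.can_fetch(ROBOT_NAME, url):
        print("robots.txt allows retrieving", url)
    else:
        print("robots.txt disallows retrieving", url)

With the record shown above (User-agent: *, Disallow: /), such a check would report every URL on the site as disallowed, for every robot.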
Note that currently only a few robots implement the Robots META tag. In this simple example:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

a robot should neither index this document, nor analyse it for links.
Full details on how this tag works are provided in the Robots META tag documentation.
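By way of illustration (again a sketch, not any particular robot's implementation), a robot could extract these directives with Python's html.parser before deciding whether to index a page or follow its links; the class name below is invented for the example:

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        # Collects NOINDEX/NOFOLLOW directives from <META NAME="ROBOTS"> tags.
        def __init__(self):
            super().__init__()
            self.may_index = True    # may the page be indexed?
            self.may_follow = True   # may its links be analysed?

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = {name: (value or "") for name, value in attrs}
            if attrs.get("name", "").lower() != "robots":
                return
            directives = {d.strip().lower() for d in attrs.get("content", "").split(",")}
            if "noindex" in directives:
                self.may_index = False
            if "nofollow" in directives:
                self.may_follow = False

    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    parser = RobotsMetaParser()
    parser.feed(page)
    print(parser.may_index, parser.may_follow)   # prints: False False

For the example tag above, the parser reports that the page may neither be indexed nor analysed for links.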