The Standard for Robots Exclusion (SRE) was first proposed in 1994, as a mechanism for keeping robots out of unwanted areas of the Web. Such unwanted areas included:
The main design considerations to achieve this goal were:
Instead, the mechanism uses a specially formatted resource at a known location in the server's URL space. In its simplest form the resource could be a text file produced with a text editor, placed in the root-level server directory.
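As a purely illustrative example (the paths are made up), such a file might contain nothing more than:

    # /robots.txt for this server
    User-agent: *          # these rules apply to all robots
    Disallow: /cgi-bin/    # don't visit CGI scripts
    Disallow: /tmp/        # don't visit temporary files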
This formatted-file approach satisfied the design considerations: administration was simple, because the file format was easy to understand and required no special software to produce; implementation was simple, because the format was easy to parse and apply; and deployment was simple, because no client or server changes were required.
Indeed the majority of robot authors rapidly embraced this proposal, and it has received a great deal of attention in both Web-based documentation and the printed press. This in turn has promoted awareness and acceptance amongst users.
In the years since the initial proposal, a lot of practical experience with the SRE has been gained, and a considerable number of suggestions for improvements or extensions have been made. They broadly fall into the following categories:
I will discuss some of the most frequent suggestions in that order, and give some arguments in favour of or against them.
One main point to keep in mind is that it is difficult to gauge how much of an issue these problems are in practice, and how widespread support for extensions would be. When considering further development of the SRE it is important to guard against second-system syndrome.
These relate to the administration of the SRE, and therefore to the effectiveness of the approach for its purpose.
The SRE specifies a location for the resource, in the root level of a server's URL space. Modifying this file generally requires administrative access to the server, which may not be granted to a user who would like to add exclusion directives to the file. This is especially common in large multi-user systems.
It can be argued this is not a problem with the SRE, which after all does not specify how the resource is administered. It is for example possible to programmatically collect individuals' '~/robots.txt' files, combining them into a single '/robots.txt' file on a regular basis. How this could be implemented depends on the operating system, server software, and publishing process. In practice users find their administrators unwilling or unable to provide such a solution. This indicates again how important it is to stress simplicity; even if the extra effort required is minuscule, requiring changes in practices, procedures, or software is a major barrier to deployment.
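As a sketch of what such a collection step could look like (the paths, and the assumption that users keep a '~/robots.txt' in their home directory, are purely illustrative; the details depend entirely on the site):

    #!/usr/bin/env python
    # Sketch: gather per-user robots.txt fragments into one server-wide file.
    import glob, os

    DOCUMENT_ROOT = "/var/www/htdocs"                 # assumed document root
    user_files = sorted(glob.glob("/home/*/robots.txt"))

    records = []
    for path in user_files:
        user = path.split("/")[2]                     # '/home/<user>/robots.txt'
        body = open(path).read().strip()
        if body:
            records.append("# rules contributed by %s\n%s" % (user, body))

    # Note: a real implementation would also have to merge fragments aimed
    # at the same robot, since simple concatenation can produce several
    # records for one User-agent.
    out = open(os.path.join(DOCUMENT_ROOT, "robots.txt"), "w")
    out.write("\n\n".join(records) + "\n")
    out.close()

Run periodically, this would satisfy the letter of the SRE without giving users write access to the server root; but as argued above, even this small amount of setup is often a barrier in practice.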
Suggested ways of alleviating the problem include a CGI script which combines multiple individual files on the fly, and listing multiple referral files in the '/robots.txt' which the robot can retrieve and combine. Both options suffer from the same problem: some administrative access is still required.
This is the most painful operational problem, and cannot be sufficiently addressed in the current design. It seems that the only solution is to move the robot policy closer to the user, in the URL space they do control.
The SRE allows only a single method for specifying parts of the URL space: by substring anchored at the front. People have asked for substrings anchored at the end, as in "Disallow: *.shtml", as well as generalised regular expression parsing, as in "Disallow: *sex*".
The issue with this extension is that it increases the complexity of both administration and implementation. In this case I feel the added complexity may be justified.
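As a sketch of what support for this could look like on the robot side (the "*" pattern syntax is the proposed extension, not part of the current SRE):

    import re

    def pattern_to_regex(pattern):
        if "*" not in pattern:
            # existing SRE behaviour: substring anchored at the front
            return re.compile("^" + re.escape(pattern))
        regex = re.escape(pattern).replace(r"\*", ".*")
        if not pattern.endswith("*"):
            regex += "$"          # e.g. "*.shtml" is anchored at the end
        return re.compile("^" + regex)

    def disallowed(path, patterns):
        return any(pattern_to_regex(p).match(path) for p in patterns)

    print(disallowed("/docs/page.shtml", ["*.shtml"]))        # True
    print(disallowed("/private/sex/index.html", ["*sex*"]))   # True
    print(disallowed("/tmp/scratch", ["/tmp/"]))              # True (unchanged)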
The SRE allows for specific directives for individual robots. This may result in considerable repetition of rules common to all robots. It has been suggested that an OO inheritance scheme could address this.
In practice the per-robot distinction is not that widely used, and the need seems to be sporadic. The increased complexity of both administration and implementation seems prohibitive in this case.
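To illustrate the repetition (the robot name and paths are made up): because a robot that matches its own record ignores the default "*" record, a site that merely wants to relax one rule for one robot has to copy all the common rules into that robot's record:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /drafts/

    User-agent: SomeRobot
    Disallow: /cgi-bin/
    Disallow: /tmp/

An inheritance scheme would let the second record refer to the first instead of duplicating its lines, at the cost of the extra complexity discussed above.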
The SRE groups all rules for the server into a single file. This doesn't scale well to thousands or millions of individually specified URL's.
This is a fundamental problem, and one that can only be solved by moving beyond a single file, and bringing the policy closer to the individual resources.
These are problems faced by the Web at large, which could be addressed (at least for robots) separately using extensions to the SRE. I am against following that route, as it fixes the problem in the wrong place. These issues should be addressed by a proper general solution separate from the SRE.
The use of multiple domain names sharing a logical network interface is a common practice (even without vanity domains), which often leads to problems with indexing robots, which may end up using an undesired domain name for a given URL.
This could be addressed by adding a "preferred" address, or even encoding "preferred" domain names for certain parts of a URL space. This again increases complexity, and doesn't solve the problem for non-robots, which can suffer the same fate.
The issue here is that deployed HTTP software doesn't have a facility to indicate the host part of the HTTP URL, and a server therefore cannot use that to decide the availability of a URL. HTTP 1.1 and later address this using a Host header and full URI's in the request line. This will solve the problem across the board, but will take time to be deployed and used.
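For example, an HTTP 1.1 client identifies the intended host on every request (the host name is illustrative), either with a Host header:

    GET /index.html HTTP/1.1
    Host: www.example.com

or by using the full URI in the request line:

    GET http://www.example.com/index.html HTTP/1.1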
Some servers, such as "webcrawler.com", run identical URL spaces on several different machines, for load balancing or redundancy purposes. This can lead to problems when a robot uses only the IP address to uniquely identify a server; the robot would traverse and list each instance of the server separately.
It is possible to list alternative IP addresses in the '/robots.txt' file, indicating equivalence. However, in the common case where a single domain name is used for these separate IP addresses, this information is already obtainable from the DNS.
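A robot can discover this for itself with an ordinary address lookup; a minimal sketch, assuming the mirrored machines are all registered under the one domain name:

    import socket

    # gethostbyname_ex returns (canonical name, aliases, all IP addresses);
    # a robot can treat all the returned addresses as one logical server.
    name, aliases, addresses = socket.gethostbyname_ex("webcrawler.com")
    print(name, addresses)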
Currently robots can only track updates by frequent revisits. There seem to be a few options: the robot could request a notification when a page changes, the robot could ask for modification information in bulk, or the SRE could be extended to suggest expirations on URL's.
This is a more general problem, and ties in with caching issues and link consistency. I will not go into the first two options, as they do not concern the SRE. The last option would duplicate existing HTTP-level mechanisms such as Expires, merely because they are currently difficult to configure in servers. It seems to me this is the wrong place to solve that problem.
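For completeness, the existing mechanism is simply an HTTP response header (dates illustrative):

    HTTP/1.1 200 OK
    Date: Mon, 06 Jan 1997 10:00:00 GMT
    Expires: Mon, 13 Jan 1997 10:00:00 GMT
    Content-Type: text/html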
These concern further suggestions to reduce robot-generated problems for a server. All of these are easy to add, at the cost of more complex administration and implementation. This also brings up the issue of partial compliance: not all robots may be willing or able to support all of these. Given that the importance of these extensions is secondary to the SRE's purpose, I suggest they be listed as MAY or SHOULD options, not MUST.
The SRE doesn't allow multiple URL prefixes on a single line, as in "Disallow: /users /tmp". In practice people do this, so the implementation (if not the SRE) could be changed to condone this practice.
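A sketch of how a robot's parser could condone it, simply splitting the field value on whitespace (a single-prefix file parses unchanged):

    def disallow_prefixes(line):
        # "Disallow: /users /tmp" -> ["/users", "/tmp"]
        # "Disallow:"             -> []   (nothing is disallowed)
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            return []
        return value.split()

    print(disallow_prefixes("Disallow: /users /tmp"))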
This directive could indicate to a robot how long to wait between requests to the server. Currently it is accepted practice to wait at least 30 seconds between requests, but this is too fast for some sites, too slow for others.
A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space.
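Purely to illustrate the idea (the directive name and syntax are hypothetical, not part of the SRE), such a rule might read:

    User-agent: *
    Request-delay: 60    # wait at least 60 seconds between requests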
This directive could indicate how long a robot should wait before revisiting pages on the server.
A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space.
This appears to duplicate some of the existing (and future) cache-consistency measures such as Expires.
This is a special version of the directive above, specifying how often the '/robots.txt' file should be refreshed.
Again Expires could be used to do this.
It has often been suggested to list certain hours as "preferred hours" for robot accesses. These would be given in GMT, and would probably list local low-usage time.
A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space.
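Again purely as an illustration (the directive name and syntax are hypothetical):

    User-agent: *
    Visit-time: 0200-0600    # preferred visiting hours, in GMT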
The SRE specifies URL prefixes that are not to be retrieved. In practice we find it is used both for URL's that are not to be retrieved and for URL's that are not to be indexed, and that the distinction is not explicit.
For example, a page with links to a company's employees' pages may not be desirable to appear in an index, whereas the employees' pages themselves are; the robot should be allowed to recurse through the parent page to get to the child pages and index them, without indexing the parent itself.
This could be addressed by adding a "DontIndex" directive.
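Continuing the example above (the path is illustrative only), the parent page could then be marked as traversable but not indexable:

    User-agent: *
    DontIndex: /people/index.html    # follow its links, but keep it out of the index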
The SRE's aim was to reduce abuses by robots, by specifying what is off-limits. It has often been suggested to add more constructive information. I strongly believe such constructive information would be of immense value, but I question whether the '/robots.txt' file is the best place for it. In the first place, there may be a number of different schemes for providing such information; keeping exclusion and "inclusion" separate allows multiple inclusion schemes to be used, or the inclusion scheme to be changed, without affecting the exclusion parts. Given the broad debates on meta information this seems prudent.
Some of you may not be aware of ALIWEB, a separate pilot project I set up in 1994 which used a '/site.idx' file in IAFA format as one way of making such inclusive information available. A full analysis of ALIWEB is beyond the scope of this document, but as it used the same concept as '/robots.txt' (a single resource at a known URL), it shares many of the problems outlined in this document. In addition there were issues with the exact nature of the meta data, the complexity of administration, the restrictiveness of the RFC822-like format, and internationalisation. That experience suggests to me that this does not belong in the '/robots.txt' file, except possibly in its most basic form: a list of URL's to visit.
For the record, people's suggestions for inclusive information included:
This association can be done in a few ways:
I suggest the first option should be an immediate first step, with the other options possibly following later.
The measures above address some of the problems in the SRE in a more scalable and flexible way than adding a multitude of directives to the '/robots.txt' file.
I believe that of the suggested additions, this one will have the most benefit, without adding complexity: