This problem has prompted experiments with automated browsing by "robots". A Web robot is a program that traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. These programs are sometimes called "spiders", "web wanderers", or "web worms". These names, while perhaps more appealing, may be misleading: the terms "spider" and "wanderer" give the false impression that the robot itself moves, and the term "worm" might imply that the robot multiplies itself, like the infamous Internet worm [2]. In reality a robot is implemented as a single software system that retrieves information from remote sites using standard Web protocols.
A robot that verifies references, such as MOMspider [4], can assist an author in locating these dead links, and as such can assist in the maintenance of the hypertext structure. Robots can help maintain the content as well as the structure, by checking for HTML [5] compliance, conformance to style guidelines, regular updates, etc., but this is not common practice. Arguably this kind of functionality should be an integrated part of HTML authoring environments, as these checks can then be repeated when the document is modified, and any problems can be resolved immediately.
On the Web, mirroring can be implemented with a robot, but at the time of writing no sophisticated mirroring tools exist. There are some robots that will retrieve a subtree of Web pages and store it locally, but they don't have facilities for updating only those pages that have changed. A second problem unique to the Web is that the references in the copied pages need to be rewritten: where they reference pages that have also been mirrored they may need to be changed to point to the copies, and where relative links point to pages that haven't been mirrored they need to be expanded into absolute links. The need for mirroring tools for performance reasons is much reduced by the arrival of sophisticated caching servers [6], which do offer selective updates, can guarantee that a cached document is up-to-date, and are largely self-maintaining. However, it is expected that mirroring tools will be developed in due course.
This means that rather than relying solely on browsing, a Web user can combine browsing and searching to locate information; even if the database doesn't contain the exact item you want to retrieve, it is likely to contain references to related pages, which in turn may reference the target item.
The second advantage is that these databases can be updated automatically at regular intervals, so that dead links in the database will be detected and removed. This is in contrast to manual document maintenance, where verification is often sporadic and not comprehensive. The use of robots for resource discovery will be further discussed below.
Traditionally the Internet has been perceived to be "free", as individual users did not have to pay for its operation. This perception is coming under scrutiny, as corporate users especially do feel a direct cost associated with network usage. A company may feel that the service to its (potential) customers is worth this cost, but that automated transfers by robots are not.
Besides placing demands on the network, a robot also places extra demand on servers. Depending on the frequency with which it requests documents from the server this can result in a considerable load, which results in a lower level of service for other Web users accessing the server. Especially when the host is also used for other purposes this may not be acceptable. As an experiment the author ran a simulation of 20 concurrent retrievals from his server running the Plexus server on a Sun 4/330. Within minutes the machine slowed down to a crawl and was unusable for anything. Even with only consecutive retrievals the effect can be felt. In the very week that this paper was written, a robot visited the author's site with rapid-fire requests. After 170 consecutive retrievals the server, which had been operating fine for weeks, crashed under the extra load.
This shows that rapid-fire requests need to be avoided. Unfortunately even modern manual browsers (e.g. Netscape) contribute to this problem by retrieving in-line images concurrently. The Web's protocol, HTTP [8], has been shown to be inefficient for this kind of transfer [9], and new protocols are being designed to remedy this [10].
HTTP does provide the "If-Modified-Since" mechanism, whereby the user-agent can specify the modification time-stamp of a cached document along with a request for the document. The server will then only transfer the contents if the document has been modified since it was cached.
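As an illustration, the following minimal sketch (in Python, using only the standard library) performs such a conditional retrieval; the URL and time-stamp are invented, and a real robot would take both from its own database.

import urllib.request
import urllib.error

url = "http://www.example.com/index.html"           # hypothetical document
cached_stamp = "Sat, 29 Oct 1994 19:43:31 GMT"      # stored at the last retrieval

request = urllib.request.Request(url)
request.add_header("If-Modified-Since", cached_stamp)

try:
    with urllib.request.urlopen(request) as response:
        body = response.read()                      # 200 OK: the document has changed
        print("Document modified; re-index", len(body), "bytes")
except urllib.error.HTTPError as error:
    if error.code == 304:                           # 304 Not Modified: cached copy is current
        print("Document unchanged since", cached_stamp)
    else:
        raise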
This facility can only be used by a robot if it retains the relationship between the summary data it extracts from a document, its URL, and the time-stamp of the retrieval. This places extra requirements on the size and complexity of the database, and is not widely implemented.
Client-side robots/agents
The load on the network is especially an issue with the category of robots that are used by end-users, and implemented as part of a general-purpose Web client (e.g. the Fish Search [11] and the tkWWW robot [12]). One feature that is common in these end-user robots is the ability to pass on search terms to search engines found while traversing the Web. This is touted as improving resource discovery by querying several remote resource discovery databases automatically. However, it is the author's opinion that this feature is unacceptable for two reasons. Firstly, a search operation places a far higher load on a server than a simple document retrieval, so a single user can cause considerable overhead on several servers in a far shorter period than normal. Secondly, it is a fallacy to assume that the same search terms are relevant, syntactically correct, let alone optimal for a broad range of databases, and the range of databases is totally hidden from the user. For example, the query "Ford and garage" could be sent to a database on 17th-century literature, a database that doesn't support Boolean operators, or a database that specifies that queries specific to automobiles should start with the word "car:". And the user isn't even aware of this.
Another dangerous aspect of a client-side robot is that once it is distributed no bugs can be fixed, no knowledge of problem areas can be added and no new efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version.
The most dangerous aspect however is the sheer number of possible users. While some people are likely to use such a facility sensibly, i.e. bounded by some maximum, on a known local area of the web, and for a short period of time, there will be people who will abuse this power, through ignorance or arrogance. It is the author's opinion that remote robots should not be distributed to end-users, and fortunately it has so far been possible to convince at least some robot authors to cancel releases [13].
Even leaving these dangers aside, client-side robots pose an ethical question: whereas the use of a robot may be acceptable to the community if its data is then made available to the community, client-side robots may not be acceptable, as they operate only for the benefit of a single user. The ethical issues will be discussed further below.
End-user "intelligent agents" [14] and "digital assistants" are currently a popular research topic in computing, and often viewed as the future of networking. While this may indeed be the case, and it is already apparent that automation is invaluable for resource discovery, a lot more research is required for them to be effective. Simplistic user-driven Web robots are far removed from intelligent network agents: an agent needs to have some knowledge of where to find specific kinds of information (i.e. which services to use) rather than blindly traversing all information. Compare the situation where a person is searching for a book shop; they use the Yellow Pages for a local area, find the list of shops, select one or a few, and visit those. A client-side robot would walk into all shops in the area asking for books. On a network, as in real life, this is inefficient on a small scale, and prohibitve on a larger scale.
The author has observed several identical robot runs accessing his server. While in some cases this was caused by people using the site for testing (instead of a local server), in some cases it became apparent that it was caused by lax implementation. Repeated retrievals can occur when either no history of accessed locations is stored (which is unforgivable), or when a robot does not recognise cases where several URLs are syntactically equivalent, e.g. where different DNS aliases for the same IP address are used, or where URLs aren't canonicalised by the robot, e.g. "foo/bar/../baz.html" is equivalent to "foo/baz.html".
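As an illustration, the minimal Python sketch below canonicalises URLs by lower-casing the host, dropping the default port and resolving "." and ".." path segments; the example URLs are invented, and resolving DNS aliases to a single canonical host would need an extra lookup step that is omitted here.

import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalise(url):
    """Reduce syntactically equivalent URLs to a single canonical form."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    netloc = netloc.lower()                     # host names are case-insensitive
    if netloc.endswith(":80"):                  # drop the default HTTP port
        netloc = netloc[:-3]
    path = posixpath.normpath(path or "/")      # "foo/bar/../baz.html" -> "foo/baz.html"
    return urlunsplit((scheme, netloc, path, query, ""))

# Both forms below map to the same canonical URL, so a robot that keeps a
# history of canonical URLs retrieves the document only once.
print(canonicalise("HTTP://WWW.Example.COM:80/foo/bar/../baz.html"))
print(canonicalise("http://www.example.com/foo/baz.html"))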
Robots sometimes retrieve document types, such as GIFs and PostScript files, which they cannot handle and thus ignore.
Another danger is that some areas of the Web are near-infinite. For example, consider a script that returns a page with a link one level further down: it will start with, say, "/cgi-bin/pit/", and continue with "/cgi-bin/pit/a/", "/cgi-bin/pit/a/a/", etc. Because such URL spaces can trap robots that fall into them, they are often called "black holes". See also the discussion of the Proposed Standard for Robot Exclusion below.
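One simple defence, sketched below in Python, is to refuse to follow URLs whose path is suspiciously deep or repeats the same segment over and over; the limits are arbitrary illustrative values, not figures taken from any existing robot.

from urllib.parse import urlsplit

MAX_DEPTH = 8          # maximum number of path segments a robot will follow
MAX_REPEATS = 3        # maximum number of times any one segment may repeat

def looks_like_black_hole(url):
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    return any(segments.count(s) > MAX_REPEATS for s in set(segments))

print(looks_like_black_hole("http://host/cgi-bin/pit/a/a/a/a/"))    # True
print(looks_like_black_hole("http://host/public/perl/perl.html"))   # False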
There is too much material, and it's too dynamic
One measure of effectiveness of an information retrieval approach is "recall", the fraction of all relevant documents that were actually found. Brian Pinkerton [15] states that recall in Internet indexing systems is adequate, as finding enough relevant documents is not the problem. However, if one considers the complete set of information available on the Internet as a basis, rather than the database created by the robot, recall cannot be high, as the amount of information is enormous, and changes are very frequent. So in practice a robot database may not contain a particular resource that is available, and this will get worse as the Web grows.
In an attempt to alleviate this situation somewhat, the robot community has adopted "A Standard for Robot Exclusion" [16]. This standard describes the use of a simple structured text file, available at a well-known place on a server ("/robots.txt"), to specify which parts of its URL space should be avoided by robots (see Figure 1). This facility can also be used to warn robots about black holes. Individual robots can be given specific instructions, as some may behave more sensibly than others, or may be known to specialise in a particular area. The standard is voluntary, but it is very simple to implement, and there is considerable public pressure for robots to comply.
Determining how to traverse the Web is a related problem. Given that most Web servers are organised hierarchically, a breadth-first traversal from the top down to a limited depth is likely to find a broader and higher-level set of documents and services more quickly than a depth-first traversal, and is therefore preferable for resource discovery. A depth-first traversal, however, is more likely to find individual users' home pages with links to other, potentially new, servers, and is therefore more likely to find new sites to traverse.
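A minimal Python sketch of such a depth-limited, breadth-first traversal is given below; link_extractor is a hypothetical function returning the URLs referenced by a page, and politeness delays and exclusion checks are omitted for brevity.

from collections import deque

def breadth_first(start_url, link_extractor, max_depth=3):
    """Visit pages level by level, never going deeper than max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])       # (URL, depth below the starting page)
    while queue:
        url, depth = queue.popleft()      # FIFO order gives breadth-first traversal
        yield url
        if depth >= max_depth:
            continue
        for link in link_extractor(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))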
# /robots.txt for http://www.site.com/

User-agent: *              # attention all robots:
Disallow: /cyberworld/map  # infinite URL space
Disallow: /tmp/            # temporary files

Figure 1: An example robots.txt file
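As an illustration, a robot written in Python could honour the standard with the standard urllib.robotparser module, checked here against the example file of Figure 1; the robot name and document URL are invented.

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("http://www.site.com/robots.txt")
parser.read()                                  # fetch and parse the exclusion file

url = "http://www.site.com/cyberworld/map/index.html"
if parser.can_fetch("MyRobot/1.0", url):       # ask before every retrieval
    print("allowed to retrieve", url)
else:
    print("excluded by robots.txt; skipping", url)   # the case for this URL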
These methods are good general measures, and can be automatically applied to all Web pages, but cannot be as effective as manual indexing by the author. HTML provides a facility to attach general meta-information to documents by specifying a <META> element, e.g. <META NAME="Keywords" CONTENT="Ford Car Maintenance">. However, no semantics have (yet) been defined for specific values of the attributes of this tag, and this severely limits its acceptance, and therefore its usefulness.
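As an illustration, the Python sketch below uses the standard html.parser module to extract such keywords; treating the name "Keywords" as a list of index terms is, as noted, only a convention rather than a defined semantic.

from html.parser import HTMLParser

class MetaKeywordExtractor(HTMLParser):
    """Collect the contents of <META NAME="Keywords" ...> elements."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)           # attribute names arrive lower-cased
        if tag == "meta" and attributes.get("name", "").lower() == "keywords":
            self.keywords.extend(attributes.get("content", "").split())

extractor = MetaKeywordExtractor()
extractor.feed('<META NAME="Keywords" CONTENT="Ford Car Maintenance">')
print(extractor.keywords)                  # ['Ford', 'Car', 'Maintenance']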
This results in a low "precision", the proportion of the total number of documents retrieved that is relevant to the query. Advanced features such as Boolean operators, weighted matches like WAIS, or relevance feedback can improve this, but given that the information on the Internet is enormously diverse, this will continue to be a problem.
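A toy calculation with invented figures illustrates the two measures:

relevant_on_internet = 1000    # relevant documents that exist somewhere on the Internet
relevant_in_database = 100     # of those, how many the robot's database actually holds
retrieved = 50                 # documents returned in answer to a query
retrieved_relevant = 10        # of those, how many are actually relevant

recall = relevant_in_database / relevant_on_internet    # 0.10: most material is missed
precision = retrieved_relevant / retrieved              # 0.20: most hits are noise
print(f"recall = {recall:.2f}, precision = {precision:.2f}")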
The META tag discussed above could provide a mechanism for authors to classify their own documents. The question then arises which classification system to use, and how to apply it. Even traditional libraries don't use a single universal system, but adopt one of a few, and adopt their own conventions for applying them. This gives little hope for an immediate universal solution for the Web.
Related to the above problem is that the contents of Web pages are often written for a specific context, provided by the access structure, and may not make sense outside that context. For example, a page describing the goals of a project may refer to "the project" without fully specifying its name or giving a link to the welcome page. Another problem is that of moved URLs. Often when service administrators reorganise their URL structure they will provide mechanisms for backward compatibility with the previous URL structure, to prevent broken links. On some servers this can be achieved by specifying a redirection configuration, which makes the server refer clients to the new URL over HTTP when they try to access the old one. However, when symbolic links are used instead, it is not possible to tell the difference between the two. An indexing robot can in these cases store the deprecated URL, prolonging the requirement for a Web administrator to provide backward compatibility.
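As an illustration, the Python sketch below uses http.client (which does not follow redirects automatically) to notice such a redirection and record the new URL rather than the deprecated one; the host and path are invented.

import http.client

connection = http.client.HTTPConnection("www.example.com")
connection.request("HEAD", "/old/path/page.html")     # ask about the old URL
response = connection.getresponse()

if response.status in (301, 302):                     # the server redirects us
    print("index the new URL instead:", response.getheader("Location"))
else:
    print("no redirection; index the URL as requested")
connection.close()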
A related problem is that a robot might index a mirror of a particular service, rather than the original site. If both source and mirror are visited there will be duplicate entries in the database, and bandwidth is being wasted repeating identical retrievals to different hosts. If only the mirror is visited users may be referred to out-of-date information even when up-to-date information is available elsewhere.
When some of the acceptability issues first became apparent (after a few incidents with robots doubling the load on servers) the author developed a set of Guidelines for Robot Writers [19], as a first step to identify problem areas and promote awareness. These guidelines can be summarised as follows:
The fact that most Robot writers have already implemented these guidelines indicates that they are conscious of the issues, and eager to minimise any negative impact. The public discussion forum provided by the robots mailing list speeds up the discussion of new problem areas, and the public overview of the robots on the Active list provides a certain community pressure on robot behaviour [21].
This maturation of the robot field means there have recently been fewer incidents where robots have upset information providers. In particular, the standard for robot exclusion means that people who don't approve of robots can prevent their sites from being visited. Experiences from several projects that have deployed robots have been published, especially at the World-Wide Web conferences at CERN in July 1994 and in Chicago in October 1994, and these help to educate, and discourage, would-be robot writers. However, with the increasing popularity of the Internet in general, and the Web in particular, it is inevitable that more robots will appear, and it is likely that some will not behave appropriately.
ALIWEB has a simple model for distributed, human-generated indexing of services on the Web, loosely based on Archie [24]. In this model, aggregate indexing information is made available by hosts on the Web. This information indexes only local resources, not resources available from third parties. In ALIWEB this is implemented with IAFA templates [25], which give typed resource information in a simple text-based format (see Figure 2). These templates can be produced manually, or can be constructed by automated means, for example from titles and META elements in a document tree. The ALIWEB gathering engine retrieves these index files through normal Web access protocols and combines them into a searchable database. Note that it is not a robot, as it doesn't recursively retrieve documents found in the index.
Template-Type: SERVICE
Title: The ArchiePlex Archie Gateway
URL: /public/archie/archieplex/archieplex.html
Description: A Full Hypertext interface to Archie.
Keywords: Archie, Anonymous FTP.

Template-Type: DOCUMENT
Title: The Perl Page
URL: /public/perl/perl.html
Description: Information on the Perl Programming Language. Includes hypertext versions of the Perl 5 Manual and the latest FAQ.
Keywords: perl, programming language, perl-faq

Figure 2: An IAFA index file
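As an illustration, the minimal Python sketch below reads such an index file into a list of records; the field handling is deliberately simplified and is not a complete implementation of the IAFA template format.

def parse_iafa(text):
    """Parse 'Field: value' lines; each Template-Type line starts a new record."""
    records, last_field = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                              # blank line between records
            last_field = None
        elif ":" in stripped:
            field, _, value = stripped.partition(":")
            field, value = field.strip(), value.strip()
            if field == "Template-Type":
                records.append({})
            if records:
                records[-1][field] = value
                last_field = field
        elif records and last_field:                  # wrapped continuation line
            records[-1][last_field] += " " + stripped
    return records

# Feeding the text of Figure 2 to parse_iafa() yields one dictionary per template,
# e.g. {'Template-Type': 'SERVICE', 'Title': 'The ArchiePlex Archie Gateway', ...}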
There are several advantages to this approach. The quality of human-generated index information is combined with the efficiency of automated update mechanisms. The integrity of the information is higher than with traditional "hotlists", as only local index information is maintained. Because the information is typed in a computer-readable format, search interfaces can offer extra facilities to constrain queries. There is very little network overhead, as the index information is retrieved in a single request. The simplicity of the model and the index file means any information provider can immediately participate.
There are some disadvantages. The manual maintenance of indexing information can appear to place a large burden on the information provider, but in practice the indexing information for major services doesn't change often. There have been experiments with index generation from TITLE and META tags in the HTML, but this requires the local use of a robot, and carries the danger that the quality of the index information suffers. A second limitation is that in the current implementation information providers have to register their index files at a central registry, which limits scalability. Finally, updates are not optimally efficient, as an entire index file needs to be retrieved even if only one of its records was modified.
ALIWEB has been in operation since October 1993, and the results have been encouraging. The main operational difficulty appeared to be a lack of understanding; initially people often attempted to register their own HTML files instead of IAFA index files. The other problem is that as a personal project ALIWEB is run on a spare-time basis and receives no funding, so further development is slow.
Harvest is a distributed resource discovery system recently released by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD). It offers software for automated indexing of the contents of documents, efficient replication and caching of such index information on remote hosts, and searching of this data through an interface on the Web. Initial reactions to this system have been very positive.
One disadvantage of Harvest is that it is a large and complex system which requires considerable human and computing resources, making it less accessible to information providers.
The use of Harvest to form a common platform for the interworking of existing databases is perhaps its most exciting aspect. It is reasonably straightforward for other systems to interwork with Harvest; experiments have shown that ALIWEB for example can operate as a Harvest broker. This gives ALIWEB the caching and searching facilities Harvest offers, and offers Harvest a low-cost entry mechanism.
These two systems offer attractive alternatives to the use of robots for resource discovery: ALIWEB provides a simple, high-level index, while Harvest provides a comprehensive indexing system based on low-level information. However, neither system is targeted at indexing third-party sites that don't actively participate, so it is expected that robots will continue to be used for that purpose, in co-operation with systems such as ALIWEB and Harvest.