This problem has prompted experiments with automated browsing by "robots". A Web robot is a program that traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. These programs are sometimes called "spiders", "web wanderers", or "web worms". These names, while perhaps more appealing, may be misleading: the terms "spider" and "wanderer" give the false impression that the robot itself moves, and the term "worm" might imply that the robot multiplies itself, like the infamous Internet worm [2]. In reality a robot is implemented as a single software system that retrieves information from remote sites using standard Web protocols.
A robot that verifies references, such as MOMspider [4], can assist an author in locating these dead links, and as such can assist in the maintenance of the hypertext structure. Robots can help maintain the content as well as the structure, by checking for HTML [5] compliance, conformance to style guidelines, regular updates, etc., but this is not common practice. Arguably this kind of functionality should be an integrated part of HTML authoring environments, as these checks can then be repeated when the document is modified, and any problems can be resolved immediately.
On the Web, mirroring can be implemented with a robot, but at the time of writing no sophisticated mirroring tools exist. There are some robots that will retrieve a subtree of Web pages and store it locally, but they don't have facilities for updating only those pages that have changed. A second problem unique to the Web is that the references in the copied pages need to be rewritten: where they reference pages that have also been mirrored they may need to be changed to point to the copies, and where relative links point to pages that haven't been mirrored they need to be expanded into absolute links. The need for mirroring tools for performance reasons is much reduced by the arrival of sophisticated caching servers [6], which do offer selective updates, can guarantee that a cached document is up-to-date, and are largely self-maintaining. However, it is expected that mirroring tools will be developed in due course.
This means that rather than relying solely on browsing, a Web user can combine browsing and searching to locate information; even if the database doesn't contain the exact item you want to retrieve, it is likely to contain references to related pages, which in turn may reference the target item.
The second advantage is that these databases can be updated automatically at regular intervals, so that dead links in the database will be detected and removed. This is in contrast to manual document maintenance, where verification is often sporadic and not comprehensive. The use of robots for resource discovery will be further discussed below.
Traditionally the Internet has been perceived to be "free", as individual users did not have to pay for its operation. This perception is coming under scrutiny, as corporate users especially do feel a direct cost associated with network usage. A company may feel that the service to its (potential) customers is worth this cost, but that automated transfers by robots are not.
Besides placing demands on the network, a robot also places extra demand on servers. Depending on the frequency with which it requests documents from the server this can result in a considerable load, which results in a lower level of service for other Web users accessing the server. Especially when the host is also used for other purposes this may not be acceptable. As an experiment the author ran a simulation of 20 concurrent retrievals from his server running the Plexus server on a Sun 4/330. Within minutes the machine slowed down to a crawl and was unusable for anything. Even with only consecutive retrievals the effect can be felt. In the very week that this paper was written, a robot visited the author's site with rapid-fire requests. After 170 consecutive retrievals the server, which had been operating fine for weeks, crashed under the extra load.
This shows that rapid-fire requests need to be avoided. Unfortunately even modern manual browsers (e.g. Netscape) contribute to this problem by retrieving in-line images concurrently. The Web's protocol, HTTP [8], has been shown to be inefficient for this kind of transfer [9], and new protocols are being designed to remedy this [10].
HTTP does provide the "If-Modified-Since" mechanism, whereby the user-agent can specify the modification time-stamp of a cached document along with a request for the document. The server will then only transfer the contents if the document has been modified since it was cached.
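As an illustration, the following minimal sketch (in Python, using only the standard library) performs such a conditional retrieval; the URL and time-stamp are invented, and a real robot would take both from its own database.

import urllib.request
import urllib.error

url = "http://www.example.com/index.html"           # hypothetical document
cached_stamp = "Sat, 29 Oct 1994 19:43:31 GMT"      # stored at the last retrieval

request = urllib.request.Request(url)
request.add_header("If-Modified-Since", cached_stamp)

try:
    with urllib.request.urlopen(request) as response:
        body = response.read()                      # 200 OK: the document has changed
        print("Document modified; re-index", len(body), "bytes")
except urllib.error.HTTPError as error:
    if error.code == 304:                           # 304 Not Modified: cached copy is current
        print("Document unchanged since", cached_stamp)
    else:
        raise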
This facility can only be used by a robot if it retains the relationship between the summary data it extracts from a document, its URL, and the time-stamp of the retrieval. This places extra requirements on the size and complexity of the database, and is not widely implemented.
Client-side robots/agents
The load on the network is especially an issue with the category of robots that are used by end-users, and implemented as part of a general-purpose Web client (e.g. the Fish Search [11] and the tkWWW robot [12]). One feature that is common in these end-user robots is the ability to pass on search terms to search engines found while traversing the Web. This is touted as improving resource discovery by querying several remote resource discovery databases automatically. However, it is the author's opinion that this feature is unacceptable for two reasons. Firstly, a search operation places a far higher load on a server than a simple document retrieval, so a single user can cause considerable overhead on several servers in a far shorter period than normal. Secondly, it is a fallacy to assume that the same search terms are relevant, syntactically correct, let alone optimal for a broad range of databases, and the range of databases is totally hidden from the user. For example, the query "Ford and garage" could be sent to a database on 17th-century literature, a database that doesn't support Boolean operators, or a database that specifies that queries specific to automobiles should start with the word "car:". And the user isn't even aware of this.
Another dangerous aspect of a client-side robot is that once it is distributed no bugs can be fixed, no knowledge of problem areas can be added and no new efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version.
The most dangerous aspect however is the sheer number of possible users. While some people are likely to use such a facility sensibly, i.e. bounded by some maximum, on a known local area of the web, and for a short period of time, there will be people who will abuse this power, through ignorance or arrogance. It is the author's opinion that remote robots should not be distributed to end-users, and fortunately it has so far been possible to convince at least some robot authors to cancel releases [13].
Even leaving these dangers aside, client-side robots pose an ethical question: whereas the use of a robot may be acceptable to the community if its data is then made available to the community, client-side robots may not be acceptable, as they operate only for the benefit of a single user. The ethical issues will be discussed further below.
End-user "intelligent agents" [14] and "digital assistants" are currently a popular research topic in computing, and often viewed as the future of networking. While this may indeed be the case, and it is already apparent that automation is invaluable for resource discovery, a lot more research is required for them to be effective. Simplistic user-driven Web robots are far removed from intelligent network agents: an agent needs to have some knowledge of where to find specific kinds of information (i.e. which services to use) rather than blindly traversing all information. Compare the situation where a person is searching for a book shop; they use the Yellow Pages for a local area, find the list of shops, select one or a few, and visit those. A client-side robot would walk into all shops in the area asking for books. On a network, as in real life, this is inefficient on a small scale, and prohibitve on a larger scale.
The author has observed several identical robot runs accessing his server. While in some cases this was caused by people using the site for testing (instead of a local server), in some cases it became apparent that it was caused by lax implementation. Repeated retrievals can occur when either no history of accessed locations is stored (which is unforgivable), or when a robot does not recognise cases where several URLs are syntactically equivalent, e.g. where different DNS aliases for the same IP address are used, or where URLs aren't canonicalised by the robot, e.g. "foo/bar/../baz.html" is equivalent to "foo/baz.html".
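As an illustration, the minimal Python sketch below canonicalises URLs by lower-casing the host, dropping the default port and resolving "." and ".." path segments; the example URLs are invented, and resolving DNS aliases to a single canonical host would need an extra lookup step that is omitted here.

import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalise(url):
    """Reduce syntactically equivalent URLs to a single canonical form."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    netloc = netloc.lower()                     # host names are case-insensitive
    if netloc.endswith(":80"):                  # drop the default HTTP port
        netloc = netloc[:-3]
    path = posixpath.normpath(path or "/")      # "foo/bar/../baz.html" -> "foo/baz.html"
    return urlunsplit((scheme, netloc, path, query, ""))

# Both forms below map to the same canonical URL, so a robot that keeps a
# history of canonical URLs retrieves the document only once.
print(canonicalise("HTTP://WWW.Example.COM:80/foo/bar/../baz.html"))
print(canonicalise("http://www.example.com/foo/baz.html"))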
Robots sometimes retrieve document types, such as GIFs and PostScript files, which they cannot handle and thus ignore.
Another danger is that some areas of the Web are near-infinite. For example, consider a script that returns a page with a link one level further down: it will start with, say, "/cgi-bin/pit/", and continue with "/cgi-bin/pit/a/", "/cgi-bin/pit/a/a/", etc. Because such URL spaces can trap robots that fall into them, they are often called "black holes". See also the discussion of the Proposed Standard for Robot Exclusion below.
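One simple defence, sketched below in Python, is to refuse to follow URLs whose path is suspiciously deep or repeats the same segment over and over; the limits are arbitrary illustrative values, not figures taken from any existing robot.

from urllib.parse import urlsplit

MAX_DEPTH = 8          # maximum number of path segments a robot will follow
MAX_REPEATS = 3        # maximum number of times any one segment may repeat

def looks_like_black_hole(url):
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    return any(segments.count(s) > MAX_REPEATS for s in set(segments))

print(looks_like_black_hole("http://host/cgi-bin/pit/a/a/a/a/"))    # True
print(looks_like_black_hole("http://host/public/perl/perl.html"))   # False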
There is too much material, and it's too dynamic
One measure of effectiveness of an information retrieval approach is "recall", the fraction of all relevant documents that were actually found. Brian Pinkerton [15] states that recall in Internet indexing systems is adequate, as finding enough relevant documents is not the problem. However, if one considers the complete set of information available on the Internet as a basis, rather than the database created by the robot, recall cannot be high, as the amount of information is enormous, and changes are very frequent. So in practice a robot database may not contain a particular resource that is available, and this will get worse as the Web grows.
In an attempt to alleviate this situation somewhat, the robot community has adopted "A Standard for Robot Exclusion" [16]. This standard describes the use of a simple structured text file, available at a well-known place on a server ("/robots.txt"), to specify which parts of its URL space should be avoided by robots (see Figure 1). This facility can also be used to warn robots about black holes. Individual robots can be given specific instructions, as some may behave more sensibly than others, or may be known to specialise in a particular area. The standard is voluntary, but it is very simple to implement, and there is considerable public pressure for robots to comply.
Determining how to traverse the Web is a related problem. Given that most Web servers are organised hierarchically, a breadth-first traversal from the top down to a limited depth is likely to find a broader and higher-level set of documents and services more quickly than a depth-first traversal, and is therefore preferable for resource discovery. A depth-first traversal, however, is more likely to find individual users' home pages with links to other, potentially new, servers, and is therefore more likely to find new sites to traverse.
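A minimal Python sketch of such a depth-limited, breadth-first traversal is given below; link_extractor is a hypothetical function returning the URLs referenced by a page, and politeness delays and exclusion checks are omitted for brevity.

from collections import deque

def breadth_first(start_url, link_extractor, max_depth=3):
    """Visit pages level by level, never going deeper than max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])       # (URL, depth below the starting page)
    while queue:
        url, depth = queue.popleft()      # FIFO order gives breadth-first traversal
        yield url
        if depth >= max_depth:
            continue
        for link in link_extractor(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))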
# /robots.txt for http://www.site.com/

User-agent: *              # attention all robots:
Disallow: /cyberworld/map  # infinite URL space
Disallow: /tmp/            # temporary files

Figure 1: An example robots.txt file
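As an illustration, a robot written in Python could honour the standard with the standard urllib.robotparser module, checked here against the example file of Figure 1; the robot name and document URL are invented.

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("http://www.site.com/robots.txt")
parser.read()                                  # fetch and parse the exclusion file

url = "http://www.site.com/cyberworld/map/index.html"
if parser.can_fetch("MyRobot/1.0", url):       # ask before every retrieval
    print("allowed to retrieve", url)
else:
    print("excluded by robots.txt; skipping", url)   # the case for this URL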
These methods are good general measures, and can be automatically applied to all Web pages, but cannot be as effective as manual indexing by the author. HTML provides a facility to attach general meta-information to documents by specifying a <META> element, e.g. <META NAME="Keywords" CONTENT="Ford Car Maintenance">. However, no semantics have (yet) been defined for specific values of the attributes of this tag, and this severely limits its acceptance, and therefore its usefulness.
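As an illustration, the Python sketch below uses the standard html.parser module to extract such keywords; treating the name "Keywords" as a list of index terms is, as noted, only a convention rather than a defined semantic.

from html.parser import HTMLParser

class MetaKeywordExtractor(HTMLParser):
    """Collect the contents of <META NAME="Keywords" ...> elements."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)           # attribute names arrive lower-cased
        if tag == "meta" and attributes.get("name", "").lower() == "keywords":
            self.keywords.extend(attributes.get("content", "").split())

extractor = MetaKeywordExtractor()
extractor.feed('<META NAME="Keywords" CONTENT="Ford Car Maintenance">')
print(extractor.keywords)                  # ['Ford', 'Car', 'Maintenance']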
This results in a low "precision", the proportion of the total number of documents retrieved that is relevant to the query. Advanced features such as Boolean operators, weighted matches like WAIS, or relevance feedback can improve this, but given that the information on the Internet is enormously diverse, this will continue to be a problem.
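A toy calculation with invented figures illustrates the two measures:

relevant_on_internet = 1000    # relevant documents that exist somewhere on the Internet
relevant_in_database = 100     # of those, how many the robot's database actually holds
retrieved = 50                 # documents returned in answer to a query
retrieved_relevant = 10        # of those, how many are actually relevant

recall = relevant_in_database / relevant_on_internet    # 0.10: most material is missed
precision = retrieved_relevant / retrieved              # 0.20: most hits are noise
print(f"recall = {recall:.2f}, precision = {precision:.2f}")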
The META tag discussed above could provide a mechanism for authors to classify their own documents. The question then arises which classification system to use, and how to apply it. Even traditional libraries don't use a single universal system, but adopt one of a few, and adopt their own conventions for applying them. This gives little hope for an immediate universal solution for the Web.
Related to the above problem is that the contents of Web pages are often written for a specific context, provided by the access structure, and may not make sense outside that context. For example, a page describing the goals of a project may refer to "the project" without fully specifying its name or giving a link to the welcome page. Another problem is that of moved URLs. Often when service administrators reorganise their URL structure they will provide mechanisms for backward compatibility with the previous URL structure, to prevent broken links. On some servers this can be achieved by specifying a redirection configuration, which makes the server refer clients to the new URL over HTTP when they try to access the old one. However, when symbolic links are used instead, it is not possible to tell the difference between the two. An indexing robot can in these cases store the deprecated URL, prolonging the requirement for a Web administrator to provide backward compatibility.
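As an illustration, the Python sketch below uses http.client (which does not follow redirects automatically) to notice such a redirection and record the new URL rather than the deprecated one; the host and path are invented.

import http.client

connection = http.client.HTTPConnection("www.example.com")
connection.request("HEAD", "/old/path/page.html")     # ask about the old URL
response = connection.getresponse()

if response.status in (301, 302):                     # the server redirects us
    print("index the new URL instead:", response.getheader("Location"))
else:
    print("no redirection; index the URL as requested")
connection.close()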
A related problem is that a robot might index a mirror of a particular service, rather than the original site. If both source and mirror are visited there will be duplicate entries in the database, and bandwidth is being wasted repeating identical retrievals to different hosts. If only the mirror is visited users may be referred to out-of-date information even when up-to-date information is available elsewhere.
When some of the acceptability issues first became apparent (after a few incidents with robots doubling the load on servers) the author developed a set of Guidelines for Robot Writers [19], as a first step to identify problem areas and promote awareness. These guidelines can be summarised as follows:
The fact that most Robot writers have already implemented these guidelines indicates that they are conscious of the issues, and eager to minimise any negative impact. The public discussion forum provided by the robots mailing list speeds up the discussion of new problem areas, and the public overview of the robots on the Active list provides a certain community pressure on robot behaviour [21].
This maturation of the robot field means there have recently been fewer incidents where robots have upset information providers. In particular, the standard for robot exclusion means that people who don't approve of robots can prevent their sites from being visited. Experiences from several projects that have deployed robots have been published, especially at the World-Wide Web conferences at CERN in July 1994 and in Chicago in October 1994, and these help to educate, and discourage, would-be robot writers. However, with the increasing popularity of the Internet in general, and the Web in particular, it is inevitable that more robots will appear, and it is likely that some will not behave appropriately.
ALIWEB has a simple model for distributed, human-generated indexing of services on the Web, loosely based on Archie [24]. In this model, aggregate indexing information is made available by hosts on the Web. This information indexes only local resources, not resources available from third parties. In ALIWEB this is implemented with IAFA templates [25], which give typed resource information in a simple text-based format (see Figure 2). These templates can be produced manually, or can be constructed by automated means, for example from titles and META elements in a document tree. The ALIWEB gathering engine retrieves these index files through normal Web access protocols and combines them into a searchable database. Note that it is not a robot, as it doesn't recursively retrieve documents found in the index.
Template-Type: SERVICE
Title: The ArchiePlex Archie Gateway
URL: /public/archie/archieplex/archieplex.html
Description: A Full Hypertext interface to Archie.
Keywords: Archie, Anonymous FTP.

Template-Type: DOCUMENT
Title: The Perl Page
URL: /public/perl/perl.html
Description: Information on the Perl Programming Language. Includes hypertext versions of the Perl 5 Manual and the latest FAQ.
Keywords: perl, programming language, perl-faq

Figure 2: An IAFA index file
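As an illustration, the minimal Python sketch below reads such an index file into a list of records; the field handling is deliberately simplified and is not a complete implementation of the IAFA template format.

def parse_iafa(text):
    """Parse 'Field: value' lines; each Template-Type line starts a new record."""
    records, last_field = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                              # blank line between records
            last_field = None
        elif ":" in stripped:
            field, _, value = stripped.partition(":")
            field, value = field.strip(), value.strip()
            if field == "Template-Type":
                records.append({})
            if records:
                records[-1][field] = value
                last_field = field
        elif records and last_field:                  # wrapped continuation line
            records[-1][last_field] += " " + stripped
    return records

# Feeding the text of Figure 2 to parse_iafa() yields one dictionary per template,
# e.g. {'Template-Type': 'SERVICE', 'Title': 'The ArchiePlex Archie Gateway', ...}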
There are several advantages to this approach. The quality of human-generated index information is combined with the efficiency of automated update mechanisms. The integrity of the information is higher than with traditional "hotlists", as only local index information is maintained. Because the information is typed in a computer-readable format, search interfaces can offer extra facilities to constrain queries. There is very little network overhead, as the index information is retrieved in a single request. The simplicity of the model and the index file means any information provider can immediately participate.
There are some disadvantages. The manual maintenance of indexing information can appear to place a large burden on the information provider, but in practice the indexing information for major services doesn't change often. There have been experiments with index generation from TITLE and META tags in the HTML, but this requires the local use of a robot, and carries the danger that the quality of the index information suffers. A second limitation is that in the current implementation information providers have to register their index files at a central registry, which limits scalability. Finally, updates are not optimally efficient, as an entire index file needs to be retrieved even if only one of its records was modified.
ALIWEB has been in operation since October 1993, and the results have been encouraging. The main operational difficulty appeared to be a lack of understanding; initially people often attempted to register their own HTML files instead of IAFA index files. The other problem is that as a personal project ALIWEB is run on a spare-time basis and receives no funding, so further development is slow.
Harvest is a distributed resource discovery system recently released by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD). It offers software for automated indexing of the contents of documents, efficient replication and caching of such index information on remote hosts, and searching of this data through an interface on the Web. Initial reactions to this system have been very positive.
One disadvantage of Harvest is that it is a large and complex system which requires considerable human and computing resources, making it less accessible to information providers.
The use of Harvest to form a common platform for the interworking of existing databases is perhaps its most exciting aspect. It is reasonably straightforward for other systems to interwork with Harvest; experiments have shown that ALIWEB for example can operate as a Harvest broker. This gives ALIWEB the caching and searching facilities Harvest offers, and offers Harvest a low-cost entry mechanism.
These two systems offer attractive alternatives to the use of robots for resource discovery: ALIWEB provides a simple, high-level index, while Harvest provides a comprehensive indexing system based on low-level information. However, neither system is targeted at indexing third-party sites that don't actively participate, so it is expected that robots will continue to be used for that purpose, in co-operation with systems such as ALIWEB and Harvest.