Resource discovery, replication (WWW Announcements archives?)

Daniel W. Connolly ([email protected])
Tue, 03 May 1994 15:21:00 -0500


Is www-announce and/or comp.infosystems.announce archived? I keep a
lot of copies of announcements, thinking "someday I might want to look
at that..." and I'd get rid of my local copies if I knew I could
replace them if I wanted to...

It seems to me that a database of these articles, with a WAIS search
index, would be an extremely valuable resource discovery
application. Possibly more useful than databases currently built by
knowbots such as WWWWorm, AliWeb, Veronica, etc.

It would be even better if the messages were written in HTML and
tagged with MIME-Version: 1.0 and Content-Type: text/html (or even
better yet: multipart/alternative, with text/plain first and text/html
as the second part...), so that the links could be interpreted
reliably. For now, I'll settle for copying links out of the
announcements by hand (or having tools pick them out by heuristic
parsing...).
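
For concreteness, the multipart/alternative case might look roughly
like this (the boundary string, host names, and announcement text are
placeholders of my own):

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="xyzzy"

--xyzzy
Content-Type: text/plain

Announcing the foo archive: ftp://host/dir1/

--xyzzy
Content-Type: text/html

<TITLE>The foo archive</TITLE>
Announcing the <A HREF="ftp://host/dir1/">foo archive</A>.

--xyzzy--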

If there is no archive of these messages, I will be very disappointed.
And I would encourage anyone who already has resources (disk space,
tools, time, etc.) for archiving newsgroups to please do this.

Hmmm... about maintenance... it's not really feasible to expect sites
that are going down to send out notification, nor is a centralized
polling strategy feasible... perhaps a hierarchical polling strategy
would work...

But my favorite idea for this problem is a broadcast strategy (this
could possibly even be a candidate for the whole URN/URC
administration problem): we could deploy a set of conventions like
FAQ posting, where WWW server sites (be they HTTP, gopher, ftp,
WAIS...) post an announcement via USENET periodically, summarizing
their status and contents. The article would go to the typically
relevant newsgroups, plus a special well-known newsgroup, say
news.resources (much like news.answers).
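
Concretely, the headers on such a posting might look something like
this (the group besides news.resources, the subject convention, and
the date are placeholders of my own invention):

Newsgroups: news.resources,comp.infosystems.announce
Subject: [RESOURCE] example.site.org archive -- monthly status and contents
Expires: Fri, 3 Jun 1994 00:00:00 GMT
MIME-Version: 1.0
Content-Type: text/html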

This newsgroup could be used in the typical news applications:
newsreaders, web browsers, etc.

But beyond that, it allows a distributed solution to the resource
discovery problem: Any site could build an index of available internet
resources just by archiving news.resources, indexing the contents, and
expiring old articles.
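
Here's a very rough sketch, in Python, of the kind of indexer I have
in mind; the spool path, the 30-day expiry window, and the idea of
just pattern-matching URLs out of the article text are all assumptions
of mine (a real indexer would hand the articles to WAIS or the like):

import os, re, time

SPOOL = "/var/spool/news/news/resources"   # placeholder spool directory
MAX_AGE = 60 * 60 * 24 * 30                # expire articles older than ~30 days

url_pat = re.compile(r'\b(?:http|ftp|gopher|wais)://[^\s>"]+')

def build_index(spool=SPOOL, max_age=MAX_AGE):
    """Map each URL mentioned in a recent article to the articles citing it."""
    index = {}
    now = time.time()
    for name in os.listdir(spool):
        path = os.path.join(spool, name)
        if now - os.path.getmtime(path) > max_age:
            continue                        # "expire" old announcements
        with open(path, errors="replace") as f:
            text = f.read()
        for url in url_pat.findall(text):
            index.setdefault(url, []).append(name)
    return index

if __name__ == "__main__":
    for url, articles in sorted(build_index().items()):
        print(url, "<-", ", ".join(articles))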

Let's see if this idea meets the requirements of an internet
distributed application:

* Portability: It's as portable as USENET news. You don't get much
more portable than that. And for sites that don't want the full burden
of USENET news administration, it should be simple to implement
scaled-down applications that only receive news.resources.

* Scalability: It seems scalable to me. Witness FAQ distribution as
evidence.

* Security: It's only as secure as USENET news, but PEM (Privacy
Enhanced Mail) techniques can be layered on top to provide
authentication of announcements through digital signatures, and
privacy of announcements through encryption.

* Reliability: hmmm... how often do news articles get dropped? And
even if an article gets dropped, the fault only lasts until the next
time the resource is announced. Seems suitably reliable and fault
tolerant.

It's fairly important that the contents of the articles can be
processed by machine: a method for following references from the
articles to the resources themselves must be well-defined. Plain-text
messages containing URLs should be deprecated in favor of articles
that use MIME to label body parts as text/html,
application/wais-source, message/external-body, etc.

This could also be used as a way of distributing information about
replicated data. A mirror archive site could post a summary of its
contents, with (a) references to its own contents, call them A; (b)
references to the original materials that it mirrors, call them B;
and (c) a machine-readable indication that A is a copy of B. Then any
client looking for any part of B that also has access to (c) can try
looking in A if it's closer or more available.

In order to exploit this, we need agreement between clients and
servers (read: specifications) about what it means for one reference
to refer to "part of" what another reference refers to. E.g., the
fact that
ftp://host/dir1/dir2/file1
is part of:
ftp://host/dir1
must be part of the specification, if we're to use URLs as references.
(Plug: This is why I think URLs should be explicitly hierarchical.)
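
Here's the sort of containment test I mean, sketched in Python (the
function name is mine; the point is just that the test is a
path-prefix check, which only works if URLs are guaranteed to be
hierarchical):

def is_part_of(child, parent):
    """True if `child` falls under `parent` in the URL hierarchy."""
    if not parent.endswith("/"):
        parent = parent + "/"          # treat the parent as a "directory"
    return child == parent.rstrip("/") or child.startswith(parent)

# is_part_of("ftp://host/dir1/dir2/file1", "ftp://host/dir1")  ->  True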

We also need clients to have access to a database of these "A is a
copy of B" factoids. I think we should extend HTML to express this,
à la:

See <A HREF="http://host1/dir1/dir2/file">more info</A>
<REPLICA ORIGINAL="http://host1/dir1/" COPY="ftp://host2/dir3/">

Then, any client that parses this document would know that it can
retrieve http://host1/dir1/dir2/file as ftp://host2/dir3/dir2/file
if it prefers. It could also scribble that REPLICA factoid away, in
memory or in a database, for use in other queries.
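
The rewrite a client could do with such a factoid is just a prefix
substitution; a sketch in Python (the REPLICA element above is only my
proposed extension, and the function name here is mine):

def apply_replica(url, original, copy):
    """Rewrite `url` to point at the replica, if it lives under `original`."""
    if url.startswith(original):
        return copy + url[len(original):]
    return url

# apply_replica("http://host1/dir1/dir2/file",
#               "http://host1/dir1/", "ftp://host2/dir3/")
#   ->  "ftp://host2/dir3/dir2/file"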

It's kinda like the way many clients look for
setenv wais_proxy http://nearby.proxy/
and learn to make proxy queries.

In this same way, proxy servers could learn about all sorts of
replicas and proxy servers just from documents that pass through
them. The factoids are subject to forgery, but so is any other info in
an HTML document, and the same sorts of authenticity techniques
(digital signatures...) apply.

Nifty, huh? I think so... I think I'll code it up and give it a try...

Daniel W. Connolly "We believe in the interconnectedness of all things"
Software Engineer, Hal Software Systems, OLIAS project (512) 834-9962 x5010
<[email protected]> http://www.hal.com/%7Econnolly/index.html