Re: RE Machine-readable server announcements

Daniel W. Connolly ([email protected])
Wed, 9 Mar 1994 01:10:46 --100


In message <[email protected]>, Terry Allen writes:
>I'm quite interested in Dan's SGML/MIME thread, but when
>I reread his proposal for making server announcement
>knowbot-friendly, I think I missed something. Dan,
>is there anything about the format of the message that
>describes its content as a server announcement?

Only the Newsgroups, Mime-Version, and Content-Type headers.

>Or would any message sent to the server-announcement
>newsgroup that contained a URL be (mis)interpreted
>as a server announcement?

I expect any message sent to the server-announcement newsgroup would
be interpreted as a server announcement... is there some reason not
to? The question is: who does the "interpreting." Plain text messages
are fine for interpretation by humans, but not so great for automated
newsreaders.

The feature I'm after is a reliable way to extract URLs from these
announcements. Currently, folks post plain text in any of the
following forms:

See <A HREF="http://foo.com/index.html">our server</A>

Have a look at the server on foo.com

Point your browser at <http://foo.com/index.html> and go!

etc. And while there are plenty of heuristic methods that would find
90% of the URLs posted today, that sort of thing doesn't scale well.
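
To make the fragility concrete, here is a rough sketch of the kind of
heuristic extractor one ends up writing for free-form postings. The
patterns and the guess_urls name are made up for illustration, and the
"server on <host>" rule has to guess both the scheme and the path:

```python
import re

# Hypothetical heuristics, one per posting style seen in the wild.
HREF_RE = re.compile(r'<A\s+HREF="([^"]+)"', re.IGNORECASE)
ANGLE_RE = re.compile(r'<((?:http|ftp|gopher|wais)://[^>\s]+)>')
HOST_RE = re.compile(r'\bserver on ([A-Za-z0-9.-]+\.[A-Za-z]{2,})')


def guess_urls(text):
    """Return URLs guessed from free-form announcement text."""
    urls = HREF_RE.findall(text)
    urls += ANGLE_RE.findall(text)
    # A bare host name carries no scheme or path, so both are guesses.
    urls += ["http://%s/" % host for host in HOST_RE.findall(text)]
    return urls
```

Every new phrasing needs another pattern, and a wrong guess is
indistinguishable from a right one, which is the scaling problem.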

The object of the game is that (1) we settle on a format, or a small
number of formats, and register them with the IANA as MIME
content-types (this may already be the case for wais-sources... HTML
is headed in that direction); (2) folks use those formats to distribute
announcements (and label them as such using MIME headers); and
finally, (3) other folks have well-defined ways to extract resource
pointers from announcements. They may choose to (4) stick the
announcement in a full-text indexed database for local resource
discovery.
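
As a sketch of what step (3) could look like once announcements are
labeled, the following uses Python's standard email and html.parser
modules. It assumes the agreed format is HTML labeled as text/html,
which is only one possible outcome of step (1); the extract_pointers
and LinkCollector names are hypothetical:

```python
from email import message_from_string
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the HREF value of every anchor element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_pointers(raw_message):
    """Return resource pointers from a MIME-labeled announcement.

    Unlabeled messages are left for humans: no guessing is done.
    Assumption: announcements are single-part text/html messages.
    """
    msg = message_from_string(raw_message)
    if msg["MIME-Version"] is None:
        return []
    if msg.get_content_type() != "text/html":
        return []
    collector = LinkCollector()
    collector.feed(msg.get_payload())
    collector.close()
    return collector.links
```

The point of the design is that the headers, not the body text,
decide whether a program should try to interpret the message at all.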

As for the SGML/MIME stuff... I'm also interested in expressing lots
of other sentiments in a machine-readable way. Such things as:

"This data is also mirrored at the following sites..."
"The latest version is always available from ..."
(caching, replication)
"The document is available as text, postscript..."
(format conversion/negotiation)
"This text was written by Daniel W. Connolly on March 8, 1994"
(digital signatures)
"The following is a quote from document X, as of March 1, 1994..."
(verifiable links)
"Only folks that have a license to this data can read it"
(authentication, authorization)

In order to evolve the set of sentiments we can express in
machine-readable fashion, I think it's critical to develop a system
where we distribute more self-describing SGML documents, i.e.
documents that contain (at least a pointer to) their DTD.

For example, there's no handy way to validate an HTML document, since
most of them have an instance with no hint of a prologue. This is
largely due to some bad decisions I made a year or so ago... I was
naive enough to expect that we'd all agree on the same DTD. Not in
this lifetime :-{
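
A minimal sketch of what "self-describing" buys a program: it can look
for a document type declaration and find out which DTD to validate
against, or learn that there is none. This is a simplified regular
expression check, not an SGML parser; it only recognizes declarations
with a quoted PUBLIC or SYSTEM identifier, and the find_dtd_pointer
name is hypothetical:

```python
import re

# Simplified: matches <!DOCTYPE name PUBLIC|SYSTEM "identifier" ...
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(\w+)\s+(PUBLIC|SYSTEM)\s+"([^"]*)"',
    re.IGNORECASE,
)


def find_dtd_pointer(document):
    """Return (doctype name, PUBLIC or SYSTEM, identifier), or None.

    None means the document is a bare instance with no prologue, so
    a validator has nothing to tell it which DTD applies.
    """
    match = DOCTYPE_RE.search(document)
    if match is None:
        return None
    return match.group(1), match.group(2).upper(), match.group(3)
```

Most HTML documents posted today would come back as None here, which
is exactly the validation problem described above.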

HTML serves the needs of simple situations like campus-wide
information systems pretty well. But imagine preparing a hypertext
legal briefing: you'd want to be SURE that the documents you link to
don't change out from under you (or at least that you can tell if they
do...). You might be willing to pay to get access to documents... you
might pay more for better indexing... you might pay a hypertext
librarian to organize the documents you have access to with respect to
a particular vertical market...

You might think this is far-fetched, but there are already seeds of
electronically distributed research journals using WWW and other
Internet tools. From there, it will bleed into entertainment, news
media, etc.

Dan