Global HyperLinks was: quotes around tags and escape sequences

Dan Connolly ([email protected])
Tue, 01 Dec 92 00:35:22 CST


OK, now you're asking for it. I've been mulling this
stuff over in my head for a couple weeks, and I've got some
pretty good ideas as to how it all fits together.

My model of global hypermedia includes the following terms:

Entity -- SGML and MIME use this term. WAIS calls it a document.
Gopher calls it an item or a textfile or something.
WWW used to call it a document, and now calls it
a resource.

The meaning is the same in all of them: a unit
of retrieval [from the URL document].

Content-Type -- MIME coined this term. SGML calls it a NOTATION.
WAIS used to call it :type, but they'll call
it :content-type if they follow up on what they
told me. Most gopher types fall under this scheme
(telnet, cso, and other types that don't use gopher
protocol don't fit)

Reference -- This is the WWW anchor, the Gopher Menu item, the WAIS
:document-id structure, the MIME message/external-body. It is
enough information to 1) decide whether to retrieve the entity,
2) perform the retrieval transaction, and 3) process the entity
once you've got it.

>Really, though, the gopher reference is (in gopherspeak)
>
>Name=An arbitrary, but meaningful name
>Host=gopher.micro.umn.edu
>Port=70
>Type=0
>Path=Some Stuff

NOTE: Some Stuff is terminated by a newline, and may not contain tabs.

>And the "href=" is just a way to squash it down to a single string.
>It could just as well be a set of attributes and not a single one.
>E.g.
>
><a gopherhost="gopher.micro.umn.edu"
> gopherport="70"
> gopherpath="/Some Stuff"
> gophertype="0">
>An arbitrary, but meaningful, name</a>

NOTE: for type 7 items, you need gophersearch="terms" too.

>expresses the meaning of what's going on in a way that's far closer to
>how SGML might do it as far as I have been able to make out...Dan is
>that actually legal SGML?

Sure, that's legal. I suggested that URLs be expressed in SGML a long
time ago. Tim said it was overkill, and I'm starting to agree.
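
To make the "squash it down to a single string" point concrete, here is a
minimal sketch in Python. The GopherRef class, the to_href/from_href helpers
and the exact string layout (type character first in the path, selector
percent-escaped) are assumptions made for illustration, not something the
messages above specify. The Name field is carried separately, since it
becomes the anchor content rather than part of the string.

```python
from dataclasses import dataclass
from urllib.parse import quote, unquote, urlsplit


@dataclass
class GopherRef:
    """Hypothetical container for the five Gopher reference fields."""
    name: str  # human-readable label; becomes the anchor text
    host: str
    port: int
    type: str  # one-character Gopher type, e.g. "0" for a text file
    path: str  # selector; may not contain tabs or newlines


def to_href(ref):
    """Squash host, port, type and path into one string (assumed layout)."""
    if "\t" in ref.path or "\n" in ref.path:
        raise ValueError("Gopher path may not contain tabs or newlines")
    return "gopher://{}:{}/{}{}".format(
        ref.host, ref.port, ref.type, quote(ref.path))


def from_href(href, name):
    """Recover the separate fields from the single string."""
    parts = urlsplit(href)
    if parts.scheme != "gopher":
        raise ValueError("not a gopher reference: " + href)
    type_and_path = parts.path[1:]  # drop the leading "/"
    return GopherRef(
        name=name,
        host=parts.hostname,
        port=parts.port or 70,  # 70 is the port used in the example above
        type=type_and_path[:1],
        path=unquote(type_and_path[1:]),
    )
```

Either form carries the same information; the single string is just more
compact to embed in an attribute.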

Let's take a closer look at references:

1) What features allow users and clients to decide to retrieve an entity:

WWW context and content of the anchor (Is it relevant?)

MIME content-id (do I have this entity cached already?)
content-description (relevant?)
content-type (can I process it once I've got it?)
SIZE (is it too big to bother?)

WAIS :score (relevant to my query?)
:headline (relevant?)
:doc-id (in cache?)
original/distributor-server,database,local-id particularly useful
:number-of-lines, :number-of-bytes (too big?)
:type, :content-type (can I process it?)
:date (how old is it?)

Gopher name (is it interesting?)
type (can I process it?)

2) What features allow the client to make the transfer?

WWW URL -- protocol, host, port, path, type, size, search terms
handles local files, HTTP, gopher, WAIS connections.
includes search terms for fulltext indexes.
scheme mechanism allows gateways to new protocols

MIME access-type, etc.: handles ftp, anon-ftp, local-file
Ghost body allows arbitrary extra data.

Gopher host, port, path, search words

WAIS source (host, port, database), doc-id, search terms,
relevant documents (this is the novel feature; quite handy)

3) What features allow the client to process the entity?
(Keep in mind that these are features of the reference -- this
is information we have _before_ we transfer the entity).

WWW processing is tied to the protocol. Content-Type
of local files is inferred from file extensions.

Entities from HTTP connections are assumed to
be text/x-html.

Gopher entities are typed: 0=text/plain, 1=application/x-gopher,
w=text/x-html.

WAIS entities are typed: TEXT=text/plain, WSRC=application/x-wais.

MIME content-type mechanism is quite expressive. Any content-type
can be encapsulated in a message/rfc822 entity. Multiple
entities can be encapsulated in a multipart/mixed entity.

Gopher gopher type tells you what to do with the data.
text/plain, application/x-gopher are universally supported.
other types are supported by pilot projects.

WAIS :type tells what to do. text/plain and application/x-wsrc are
supported. Other types are supported by pilot projects.
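
The WWW rules listed under 3) amount to a lookup keyed on the protocol. A
rough sketch in Python follows; the content_type function and the file
extension table are hypothetical, and only the Gopher and WAIS type codes
and the HTTP default are taken from the list above.

```python
import os

# Type codes as given in the WWW entry above.
GOPHER_TYPES = {"0": "text/plain", "1": "application/x-gopher", "w": "text/x-html"}
WAIS_TYPES = {"TEXT": "text/plain", "WSRC": "application/x-wais"}

# Assumed extension table; the message only says types are inferred
# from file extensions, not which extensions map to which types.
EXTENSIONS = {".txt": "text/plain", ".html": "text/x-html"}


def content_type(protocol, type_code=None, path=None):
    """Guess a Content-Type from the reference alone, before any transfer.

    Returns None when the reference does not say enough to decide.
    """
    if protocol == "http":
        return "text/x-html"  # HTTP entities are assumed to be HTML
    if protocol == "gopher":
        return GOPHER_TYPES.get(type_code)
    if protocol == "wais":
        return WAIS_TYPES.get(type_code)
    if protocol == "file":
        extension = os.path.splitext(path or "")[1].lower()
        return EXTENSIONS.get(extension)
    return None
```

A None result is the case an explicit content-type on the reference would
remove, since the client would no longer have to guess from the protocol.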

Now let's see how we should change the WWW reference mechanism.

Here's what we've got currently:

<!ELEMENT A - - (#PCDATA)>
<!ATTLIST A
NAME ID #IMPLIED
HREF CDATA #IMPLIED
TYPE CDATA #IMPLIED
>

What's the TYPE used for? It's not a data type. There's some
code in LineMode to handle it, but I'm not sure what it does.

The NAME identifies the anchor as the target of some other anchor.
We should have NAME (or ID) attributes on pretty much all elements,
for example:

<DL>
<DT ID=term>term<DD>definition
</DL>

The HREF attribute is enough information to retrieve an Entity.
Good. But it's got the #anchor stuck on the end. That should
be a separate attribute. It should be an IDREF, so that we
can validate that it references an existing ID with an SGML
parser.
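
Pulling the #anchor off the end of an existing HREF is mechanical. A small
Python sketch using the standard library; the split_href name and the
example.org addresses are placeholders chosen for illustration.

```python
from urllib.parse import urldefrag


def split_href(href):
    """Split an HREF into (location, anchor); anchor is None if absent."""
    location, anchor = urldefrag(href)
    return location, (anchor or None)


# The location part is what the retrieval needs; the anchor part is what
# would move into a separate IDREF-style attribute.
location, anchor = split_href("http://example.org/doc.html#intro")
```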

"But," you say, "what if it references an ID outside the current document?"

I suggest we treat a group of nodes that reference each other not
as separate documents, but as entities of one big document. That
way, an author can validate the internal links in his/her web.

I suggest two new elements: XREF, for intra-document links (i.e.
links within the local web), and SEE for inter-document links
(i.e. links that go outside the local web).

<!ELEMENT XREF - - (#PCDATA)
-- This element is for links within an HTML document. (a document
is a collection of entities, or a web of nodes).
-->
<!ATTLIST XREF
CONTEXT CDATA #IMPLIED -- entity containing the XREF is implied --
-- SGML purists would make this attribute an ENTITY reference,
and put the URL in the SYSTEM identifier in the prologue.
For expediency, we put the URL right in the attribute.
--
ORIGIN CDATA #IMPLIED
-- another URL, used as an identifier, rather than a locator.
Ala the WAIS original-server,database,local-id triple.
--
REF IDREF #REQUIRED -- ID of referent element --
>
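
The check that an IDREF declaration buys can be approximated without an
SGML parser. The Python sketch below treats a dictionary of entity texts as
one document and reports every XREF whose REF matches no ID anywhere in
that set. It is an illustration only: the dangling_xrefs name is made up,
the CONTEXT and ORIGIN attributes are ignored, and names are compared
case-sensitively, which is a simplification.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collect ID attributes and XREF REF values from one entity."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        if tag == "xref" and attributes.get("ref"):
            self.refs.append(attributes["ref"])


def dangling_xrefs(web):
    """Return (entity name, ref) pairs that match no ID in the local web.

    `web` maps an entity name to its text; all entities are treated as
    parts of one big document, so an ID in any of them satisfies a REF.
    """
    ids = set()
    refs = []
    for name, text in web.items():
        collector = _Collector()
        collector.feed(text)
        collector.close()
        ids |= collector.ids
        refs.extend((name, ref) for ref in collector.refs)
    return [(name, ref) for name, ref in refs if ref not in ids]
```

An author could run something like this over a local web before publishing
it; an empty result means every internal link has a target.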

<!ELEMENT SEE - - (#PCDATA)
-- This element is for links from an HTML document to any entity
in the global web. The location and content-type of the entity
are sufficient to resolve the reference.

The other attributes could be specified in the text of the
SEE content, but by making them attributes, the client software
can process them, for example, to display a table of references
sorted by date.
-->
<!ATTLIST SEE
LOCATION CDATA #REQUIRED -- URL of referent entity --
CONTENT-TYPE CDATA #REQUIRED -- MIME Content-Type for the entity --
CHUNK CDATA #IMPLIED
-- This is the analogue of the #anchor mechanism.
If the entity at LOCATION is an SGML entity, this would be an ID,
though it won't be validated.
However, if it is a text file, this could be a line number.
The meaning is defined by the content-type.
--
ORIGIN CDATA #IMPLIED
FROM CDATA #IMPLIED -- email address or name of author/provider --
DATE NUMBER #IMPLIED -- in ISO format: YYYYMMDDHHMMSSZ --
BYTES NUMBER #IMPLIED -- useful in many cases --
MD5 CDATA #IMPLIED -- data signature --
>
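
As a sketch of what client software could do with these attributes, the
Python below collects the SEE elements of a document, orders them by DATE,
and checks a retrieved entity against BYTES and MD5. Assumptions made for
illustration: the function names are invented, DATE is read as UTC in the
YYYYMMDDHHMMSSZ form, and MD5 is taken to be a hexadecimal digest (the
declaration above does not say how the signature is encoded).

```python
import hashlib
from datetime import datetime, timezone
from html.parser import HTMLParser


class _SeeCollector(HTMLParser):
    """Collect the attributes of every SEE element."""

    def __init__(self):
        super().__init__()
        self.sees = []

    def handle_starttag(self, tag, attrs):
        if tag == "see":
            self.sees.append(dict(attrs))  # attribute names are lowercased


def parse_date(value):
    """Parse YYYYMMDDHHMMSSZ, treating the trailing Z as UTC."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def references_by_date(text):
    """Return SEE attribute dicts, oldest first; undated ones go last."""
    collector = _SeeCollector()
    collector.feed(text)
    collector.close()
    dated = [see for see in collector.sees if see.get("date")]
    undated = [see for see in collector.sees if not see.get("date")]
    return sorted(dated, key=lambda see: parse_date(see["date"])) + undated


def matches(see, data):
    """Check retrieved bytes against the optional BYTES and MD5 attributes."""
    if see.get("bytes") and int(see["bytes"]) != len(data):
        return False
    if see.get("md5") and see["md5"].lower() != hashlib.md5(data).hexdigest():
        return False
    return True
```

Since everything except LOCATION and CONTENT-TYPE is #IMPLIED, both
functions treat a missing attribute as "no information" rather than as an
error.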

What do you think?

Dan