SGML for URLs

Dan Connolly ([email protected])
Fri, 24 Jul 92 10:32:14 CDT


--cut-here
Content-Type: multipart/alternative; boundary=alt

--alt

OBJECTIVE

The issue of what to call these things we're defining has been discussed at
length. First it was Universal Document Identifier. The name has changed as
the objective has been refined. The latest name is Universal Resource
Locator. The provisional charter is:

To define a printable string syntax to allow

The expression of the address on the network of any accessible object
using existing information retrieval protocols;

The expression of the name of any object held in a directory system or
unique naming space on the network;

The distinction to be made easily in the syntax between such protocols
and directories and name spaces;

New protocols, directories and naming schemes to be included as and when
they are developed. [1]

Clearly what we are about is defining a language, i.e. a syntax and
semantics for communicating some information.

The information is the location and/or identity of some information
object in the global hypertext. It's a citation or a reference or a
hypertext link anchor.

I propose a specification for the language of URLs, in the context of a
specification for a language of global hypertext references.

These global hypertext references include more semantics than just
differentiating between protocols and accessing data. There are also
issues of determining the type and the identity of the referent data.

SGML as a syntactic specification tool

That's what it's for, after all. What I propose is a DTD that (with the
default SGML declaration) defines the language of global hypertext
references.

Some examples of the language:

<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">

<prospero host="archie.mgil.ca" path="pub/ftp">

<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>

<ftp host="export.lcs.mit.edu" dir="contrib"
name="XcRichText-1.2.tar.Z">

<usenet group="comp.infosystems.gopher">
<usenet article="<[email protected]>">

<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" bytes=1000
path="/usr/local/wais/README" >

<telnet host="info.cern.ch">

<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >

The DTD uses only the most basic features of SGML, and thus the resulting
language is not very complex. Implementing a parser for this particular
SGML language is a far simpler task than implementing a general SGML parser.
At the same time, we get the benefits of a rigorously defined language based
on established standards.
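
As a rough illustration of how small such a parser can be, here is a sketch in Python. It is not part of the proposal: the function name parse_reference and the regular expressions are assumptions made for this example, and it handles only a single start tag with quoted or unquoted attribute values and numeric character references.

```python
import re

# Hypothetical helper; only the tag syntax itself comes from the examples above.
_NAME = r"[A-Za-z][A-Za-z0-9.-]*"
_ATTR = re.compile(
    r"\s+(" + _NAME + r")\s*=\s*"
    r'''(?:"([^"]*)"|'([^']*)'|([A-Za-z0-9.-]+))'''
)
_CHAR_REF = re.compile(r"&#([0-9]+);")


def parse_reference(text):
    """Parse one start tag such as <http host="..." path="..."> into (scheme, attrs)."""
    text = text.strip()
    m = re.match(r"<(" + _NAME + r")", text)
    if not m:
        raise ValueError("expected '<' followed by a scheme name")
    # Names are folded to lower case here; the default SGML declaration
    # treats element and attribute names as case-insensitive.
    scheme = m.group(1).lower()
    attrs = {}
    pos = m.end()
    while True:
        a = _ATTR.match(text, pos)
        if not a:
            break
        name = a.group(1).lower()
        # Exactly one of the three value groups matched (possibly as "").
        raw = next(g for g in a.group(2, 3, 4) if g is not None)
        # Expand numeric character references such as &#34;
        attrs[name] = _CHAR_REF.sub(lambda r: chr(int(r.group(1))), raw)
        pos = a.end()
    if text[pos:].strip() != ">":
        raise ValueError("malformed attribute list or missing '>'")
    return scheme, attrs
```

Calling parse_reference on the last gopher example yields the scheme name "gopher" and a dictionary in which the selector value has its two character references expanded to double quotes. Attribute order is not preserved in any meaningful way, which matches the idea that order is not significant.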

Note: I haven't studied the HyTime standard very carefully.
I think it's beyond the scope of the task at hand, but
I'd like to have that opinion substantiated by someone
who really knows. In particular, its Finite Coordinate
Systems could be used to model positions within
documents: characters, lines, paragraphs.

RELEVANT ISSUES

Verbosity
    This syntax is somewhat verbose, but I think that implicit markup
    (punctuation rather than names) will lead to a mass of quoting in
    many cases. And the consistency between schemes is not necessarily
    very high.

Long URLs
    Extra whitespace between tokens has no effect. There is still the
    problem of quoted strings that are longer than a mailer allows.
    Certainly there's some SGML feature that I'm not aware of that
    addresses the issue.

    I don't believe there's a way to restrict the length of an element,
    though there is a 960 character limit on the length of an attribute
    value (in the default SGML declaration).

Quoting
    The SGML numeric character reference (e.g. &#128;) allows an
    attribute value literal to represent any sequence of bytes.

NAMELEN
    The default SGML declaration specifies that names of elements and
    attributes be 8 characters or less. It's a conceptually simple
    matter to operate under an SGML declaration where NAMELEN is higher.

Extensibility
    One problem with the current UDI syntax specification is that it
    seems to allow new schemes to add arbitrary complexity to the
    grammar. This specification limits the language to an SGML start
    tag.

    If we adopt this spec, we need to give it a public text identifier,
    and maintain a registry of the names used (probably with the IANA).
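
The Quoting point above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names quote_value and unquote_value and the choice of which characters are left unescaped are inventions of this example, and each byte is mapped to the character reference with the same number.

```python
import re
import string

# Bytes written literally; everything else becomes a numeric character
# reference. The exact safe set is an assumption of this sketch.
_SAFE = frozenset((string.ascii_letters + string.digits + " ./-_~:").encode("ascii"))
_CHAR_REF = re.compile(r"&#([0-9]+);")


def quote_value(data: bytes) -> str:
    """Encode arbitrary bytes as a double-quoted attribute value literal."""
    out = []
    for b in data:
        if b in _SAFE:
            out.append(chr(b))
        else:
            out.append("&#%d;" % b)
    return '"' + "".join(out) + '"'


def unquote_value(literal: str) -> bytes:
    """Invert quote_value: strip the quotes and expand character references."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("expected a double-quoted literal")
    body = _CHAR_REF.sub(lambda m: chr(int(m.group(1))), literal[1:-1])
    # Code points 0..255 map back to single bytes.
    return body.encode("latin-1")
```

With these definitions, quote_value(b'foo "bar"') produces the same literal used for the gopher selector in the examples, and unquote_value recovers the original bytes. Note that a heavily escaped value grows to several characters per byte, which bears on the Long URLs point.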

DEPLOYMENT AND USAGE

The first place to try this specification out is in the WWW browser. (I'll
try to make the code changes if I find time.) It's a simple matter of
elevating UDIs as SGML attributes to URLs as SGML elements. I'd like to
have someone who really knows SGML look at this DTD and see if it can be
improved. And I'd like to study the HyTime standard, the Davenport DASH,
the CFCM standard, etc. to see how this element meshes with their
citation strategies. Also, it would be nice to have explicit support from
WAIS and Gopher clients -- drag and drop comes to mind.

SGML and semantics

SGML is famous for being divorced from application semantics. Most of the
semantics of URLs is in the constituent protocols. All we need to do is
define a way to parse a URL and pass the various bits to the protocol. But
as long as we're going to all the trouble to gather information accessible
with all these protocols into one specification, it makes sense to define
some semantics common to most applications that will use URLs.
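
The "parse a URL and pass the various bits to the protocol" step might look like the following Python sketch. Everything here is hypothetical: the HANDLERS table, the handler decorator, the stub functions, and the fallback port numbers (80 and 23) are assumptions of this example, and the stubs only describe the connection they would make rather than opening one.

```python
# Registry mapping a scheme (element) name to the code that speaks its protocol.
HANDLERS = {}


def handler(scheme):
    """Decorator that registers a function as the handler for one scheme."""
    def register(func):
        HANDLERS[scheme] = func
        return func
    return register


@handler("http")
def http_request(attrs):
    # Port 80 as a fallback is an assumption of this sketch.
    port = int(attrs.get("port", 80))
    return ("connect", attrs["host"], port, attrs["path"])


@handler("telnet")
def telnet_session(attrs):
    # Port 23 as a fallback is an assumption of this sketch.
    port = int(attrs.get("port", 23))
    return ("connect", attrs["host"], port, attrs.get("user"))


def dispatch(scheme, attrs):
    """Hand the parsed attributes to whichever handler owns the scheme."""
    try:
        func = HANDLERS[scheme]
    except KeyError:
        raise ValueError("no handler registered for scheme %r" % scheme)
    return func(attrs)
```

A new scheme is added by registering one more function, which leaves the parsing code untouched. That mirrors the point that most of the semantics live in the constituent protocols.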

DATA TYPES

Some of the schemes have explicit type information (wais, gopher), some have
implicit typing (http, USENET), and some have no typing at all (file, ftp).
The MIME content-type system is general and useful enough to warrant
support. An application should be able to determine the content-type of the
data regardless of the protocol.
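
One way an application could determine the content-type regardless of protocol is sketched below in Python. The per-scheme defaults are copied from the ATTLIST declarations in the DTD that accompanies this message; the function name content_type and the tables expanding the 8-character abbreviations to full MIME names are assumptions of this example.

```python
# (type, subtype) defaults per scheme, as declared in the DTD's ATTLISTs.
# None stands for #IMPLIED, i.e. no default is declared.
DEFAULTS = {
    "http": ("text", None),
    "prospero": ("appl", "octets"),
    "file": ("appl", "octets"),
    "ftp": ("appl", "octets"),
    "usenet": ("message", "rfc822"),
    "wais": ("text", "plain"),
    "gopher": (None, None),
}

# Expansion of the NAMELEN-limited abbreviations to MIME names
# (an assumption about what each abbreviation stands for).
TYPE_NAMES = {"appl": "application", "multi": "multipart"}
SUBTYPE_NAMES = {
    "octets": "octet-stream",
    "ps": "postscript",
    "altern": "alternative",
    "external": "external-body",
}


def content_type(scheme, attrs):
    """Return a MIME content-type string, or None if it cannot be determined."""
    if scheme not in DEFAULTS:
        return None  # e.g. telnet declares no content attributes
    default_type, default_subtype = DEFAULTS[scheme]
    main = attrs.get("type", default_type)
    sub = attrs.get("subtype", default_subtype)
    if main is None or sub is None:
        return None
    return "%s/%s" % (TYPE_NAMES.get(main, main), SUBTYPE_NAMES.get(sub, sub))
```

For a usenet reference with no explicit type attributes this returns "message/rfc822", and for an ftp reference it returns "application/octet-stream". An http reference without a subtype yields None, which reflects the implicit typing noted above: the type is only known once the server answers.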

RESOURCE IDENTITY

Many applications need to determine whether two URLs refer to the
same information. Various schemes (such as USENET article IDs) may have
semantics for identifying resources. But I think this capability is so
widely useful that it should be coherently supported for all protocols.
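
A possible identity test is sketched below in Python; it is one interpretation, not a defined semantics. It assumes references have already been parsed into (scheme, attributes) pairs, treats two references carrying the same datasig value (the MD5 attribute in the DTD) as the same information, and otherwise compares the access attributes with the host name folded to lower case. The names access_key and same_resource are invented for this example.

```python
# Attributes that describe the content rather than how to reach it. The names
# come from the DTD; splitting them out this way is this sketch's choice.
CONTENT_ATTRS = {"type", "subtype", "encoding", "datasig", "bytes", "lines"}


def access_key(scheme, attrs):
    """Reduce a reference to a hashable key built from its access attributes."""
    items = []
    for name, value in attrs.items():
        name = name.lower()
        if name in CONTENT_ATTRS:
            continue
        if name == "host":
            value = value.lower()  # domain names compare case-insensitively
        items.append((name, value))
    # Sorting makes the key independent of attribute order.
    return scheme.lower(), tuple(sorted(items))


def same_resource(a, b):
    """Decide whether two (scheme, attrs) pairs refer to the same information."""
    (scheme_a, attrs_a), (scheme_b, attrs_b) = a, b
    sig_a, sig_b = attrs_a.get("datasig"), attrs_b.get("datasig")
    if sig_a is not None and sig_b is not None:
        # Equal signatures are taken to mean equal data, even across schemes.
        return sig_a == sig_b
    return access_key(scheme_a, attrs_a) == access_key(scheme_b, attrs_b)
```

Under this sketch, two http references that differ only in attribute order, host-name case, or content attributes compare equal, while references with different paths do not. Comparing by signature is the only case here that works across protocols.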

[email protected]
--alt
Content-Type: text/x-html

<!DOCTYPE html SYSTEM>
<title>Using SGML to define Universal Resource Locators</title>

<H1>Objective</H1>

The issue of what to call these things we're defining has been
discussed at length. First it was Universal Document Identifier. The
name has changed as the objective has been refined. The latest name is
Universal Resource Locator. The provisional charter is:

<a HREF="x-message-id:<[email protected]>">
<h4>To define a printable string syntax to allow</h4>

<ol>
<li>The expression of the address on the network of any accessible
object using existing information retrieval protocols;

<li>The expression of the name of any object held in a directory
system or unique naming space on the network;

<li>The distinction to be made easily in the syntax between such
protocols and directories and name spaces;

<li>New protocols, directories and naming schemes to be included as
and when they are developed.
</ol>
</a>

<p>
Clearly what we are about is defining a language, i.e. a syntax and
semantics for communicating some information.
<p>

The information is the location and/or identity of some information
object in the global hypertext. It's a citation or a reference or a
hypertext link anchor.
<p>

I propose a specification for the language of URLs, in the context of
a specification for a language of global hypertext references.
<p>

These global hypertext references include more semantics than just
differentiating between protocols and accessing data. There are also
issues of determining the type and the identity of the referent data.

<H2>SGML as a syntactic specification tool</H2>

That's what it's for, after all. What I propose is a DTD that
(with the default SGML declaration) defines the language of
global hypertext references.
<p>

<h4>Some examples of the language:</h4>
<XMP>
<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">

<prospero host="archie.mgil.ca" path="pub/ftp">

<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>

<ftp host="export.lcs.mit.edu" dir="contrib"
name="XcRichText-1.2.tar.Z">

<usenet group="comp.infosystems.gopher">
<usenet article="<[email protected]>">

<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" bytes=1000
path="/usr/local/wais/README" >

<telnet host="info.cern.ch">

<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >
</XMP>

The DTD uses only the most basic features of SGML, and thus
the resulting language is not very complex. Implementing
a parser for this particular SGML language is a far simpler
task than implementing a general SGML parser.

At the same time, we get the benefits of a rigorously
defined language based on established standards.

<dl><dt>Note:
<dd>I haven't studied the HyTime standard very
carefully. I think it's beyond the scope of the task at
hand, but I'd like to have that opinion substantiated by someone
who really knows. In particular, its Finite Coordinate Systems
could be used to model positions within documents: characters,
lines, paragraphs.
</dl><p>

<h3>Relevant Issues</h3>

<dl>
<dt>Verbosity <dd>This syntax is somewhat verbose, but I think that
implicit markup (punctuation rather than names) will lead to a mass of
quoting in many cases. And the consistency between schemes is not
necessarily very high.

<dt>Long URLs
<dd>Extra whitespace between tokens has no effect. There is still
the problem of quoted strings that are longer than a mailer allows.
Certainly there's some SGML feature that I'm not aware of that
addresses the issue.
<p>
I don't believe there's a way to restrict the length of an element,
though there is a 960 character limit on the length of an attribute
value (in the default SGML declaration).

<dt>Quoting
<dd>The SGML numeric character reference (e.g. &#128;) allows
an attribute value literal to represent any sequence of bytes.

<dt>NAMELEN
<dd>The default SGML declaration specifies that names of
elements and attributes be 8 characters or less. It's a
conceptually simple matter to operate under an SGML declaration
where NAMELEN is higher.

<dt>Extensibility
<dd>One problem with the current UDI syntax specification is that it
seems to allow new schemes to add arbitrary complexity to the grammar.
This specification limits the language to an SGML start tag.
<p>
If we adopt this spec, we need to give it a public text identifier,
and maintain a registry of the names used (probably with the IANA).
</dl>

<h3>Deployment and Usage</h3>

The first place to try this specification out is in the
WWW browser. (I'll try to make the code changes if I find
time). It's a simple matter of elevating UDIs as SGML
attributes to URLs as SGML elements.

I'd like to have someone who really knows SGML look
at this DTD and see if it can be improved. And I'd like
to study the HyTime standard, the Davenport DASH, the CFCM
standard, etc. to see how this element meshes with their
citation strategies.

Also, it would be nice to have explicit support from WAIS and Gopher
clients -- drag and drop comes to mind.

<h2>SGML and semantics</h2>

SGML is famous for being divorced from application semantics.
Most of the semantics of URLs is in the constituent protocols.
All we need to do is define a way to parse a URL and pass
the various bits to the protocol.

But as long as we're going to all the trouble to gather information
accessible with all these protocols into one specification, it makes
sense to define some semantics common to most applications that will
use URLs.

<h3>Data Types</h3>

Some of the schemes have explicit type information (wais, gopher),
some have implicit typing (http, USENET), and some have no typing at
all (file, ftp). The MIME content-type system is general and useful
enough to warrant support. An application should be able to determine
the content-type of the data regardless of the protocol.

<h3>Resource Identity</h3>

Many applications need to determine whether two URLs refer
to the same information. Various schemes (such as USENET article
IDs) may have semantics for identifying resources. But I think this
capability is so widely useful that it should be coherently supported
for all protocols.

<address>[email protected]</>

--alt--
--cut-here

<!-- Universal Resource Locator specification
derived from http://info.cern.ch/hypertext/WWW/Addressing/BNF.html
on 24 July 1992
by [email protected] -->

<!-- Typical usage:
<!DOCTYPE url SYSTEM>
(we need a public identifier)
or as part of another SGML document type:
<!ENTITY % url SYSTEM>
%url;
-->

<!-- minimization? I believe you can omit the name= part
of an SGML attribute specification in some circumstances.
I don't think it works with CDATA attributes because
order is not significant. -->

<!-- news: scheme renames USENET -->
<!-- file: is somewhat vague. I suggest explicit support for FTP: -->
<!ENTITY % schemes "http|file|ftp|usenet|telnet|prospero|gopher|wais">

<!ELEMENT url - - (%schemes;)* >
<!-- content model of URL: more than one element in a URL? (obviously
an application can use multiple URLs. The question is whether
to define semantics for multiple elements in a single URL.)

Also, what about type, size, search information? Perhaps
one element should describe the connection information,
another element or elements describes the path to the data
(allowing us to define semantics of hierarchical databases)
and another element defines the type of information there.

-->

<!ELEMENT (%schemes;) - O EMPTY >

<!-- TCP connection info: internet domain address and port number -->
<!ENTITY % host "host CDATA #REQUIRED" >
<!ENTITY % hostp "%host; port NUMBER #IMPLIED" >

<!ENTITY % types "text|image|audio|video|message|multi|appl">
<!ENTITY % stypes "plain|richtext|
gif|g3fax|
basic|
mpeg|
rfc822|external|partial|
mixed|altern|parallel|
octets|ps|oda">
<!-- content-type parameters? -->

<!ENTITY % cte "7bit|8bit|qp|base64|binary"
-- we could define several of the gopher types
in terms of encodings and types
e.g. x-binhex, application/x-stuffit
-->

<!ENTITY % MD5 "datasig CDATA #IMPLIED" -- MD5 data signature -->
<!ENTITY % bytes "bytes NUMBER #IMPLIED">
<!ENTITY % lines "lines NUMBER #IMPLIED">

<!ATTLIST http
-- information accessing attributes --
%hostp;
path CDATA #REQUIRED -- server local name --
-- must match xalpha [/ path ] --
-- can a CDATA attribute contain an arbitrary bytestream? --
search CDATA #IMPLIED -- search terms --
anchor CDATA #IMPLIED -- HTML anchor name --

-- information content attributes --
type (%types) text
subtype (%stypes) #IMPLIED
encoding (%cte) 7bit
%MD5;
%bytes;
>

<!ATTLIST prospero
%hostp;
path CDATA #REQUIRED
-- prospero path should not be constrained to WWW path syntax --

-- information content attributes --
type (%types) appl
subtype (%stypes) octets
encoding (%cte) binary
%MD5;
%bytes;
>

<!ATTLIST file
%host;
path CDATA #REQUIRED
-- unix path should not be constrained to WWW path syntax --

-- information content attributes --
type (%types) appl
subtype (%stypes) octets
encoding (%cte) binary
%MD5;
%bytes;
>

<!ATTLIST ftp
%hostp;
dir CDATA #REQUIRED -- directory for cd command --
name CDATA #REQUIRED -- name for get command --
user CDATA "anonymous" -- anonymous ftp by default --
password CDATA #IMPLIED -- not always needed --

-- information content attributes --
type (%types) appl
subtype (%stypes) octets
encoding (%cte) binary -- use 7bit for ascii transfers --
%MD5;
%bytes;
>

<!ATTLIST usenet
group CDATA #IMPLIED -- usenet newsgroup name --
article CDATA #IMPLIED -- article message-id --

-- information content attributes --
type (%types) message
subtype (%stypes) rfc822
encoding (%cte) 7bit
%MD5;
%lines; -- you can add headers without changing a USENET
article, so bytes isn't a good measure --
>

<!-- should we split this into two nodes so that
we can put #REQUIRED on the size and type for documents? -->
<!ATTLIST wais
%hostp;
database CDATA #IMPLIED -- WAIS database name --
search CDATA #IMPLIED -- search terms --
-- what about relevant documents? --
wtype CDATA #IMPLIED -- WAIS data type --
-- this should be obsoleted by the MIME type system --
bytes NUMBER #IMPLIED
path CDATA #IMPLIED -- split into original x, y? --

-- information content attributes --
type (%types) text
subtype (%stypes) plain
encoding (%cte) binary
%MD5;
>

<!ATTLIST telnet
%hostp;
user CDATA #IMPLIED -- username --
>

<!ATTLIST gopher
%hostp;
gtype CDATA "1" -- gopher type --
-- again, MIME types should be used --
-- www browser can be inundated by non-text data
unless it recognizes other types --
selector CDATA "" -- gopher object selector --
search CDATA #IMPLIED -- fulltext search terms --

-- information content attributes --
type (%types) #IMPLIED
subtype (%stypes) #IMPLIED
encoding (%cte) binary
%MD5;
%bytes;
>
--cut-here
Content-type: text/sgml
Content-Description: Example URLs

<!DOCTYPE url SYSTEM>
<url>
<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">

<prospero host="archie.mgil.ca" path="pub/ftp">

<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>

<ftp host="export.lcs.mit.edu" dir="contrib"
name="XcRichText-1.2.tar.Z">

<usenet group="comp.infosystems.gopher">
<usenet article="<[email protected]>">

<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" bytes=1000
path="/usr/local/wais/README" >

<telnet host="info.cern.ch">

<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >
</url>

--cut-here--