On languages and character repertoires and all that

13 Dec 93 14:19


=========================================================================
E-mail from: Prof J Larmouth         J.Larmouth @ ITI.SALFORD.AC.UK
             Director                Telephone: +44 61 745 5657
             IT Institute            Fax:       +44 61 745 8169
             University of Salford   Telex:     668680 (Sulib)
             Salford M5 4WT
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

To: www-talk @ info.cern.ch

Subject: On languages and character repertoires and all that

Hope this is not going over too much old ground.

It is clear that a Japanese text can be supported by an appropriate G-set
in the ISO 2022 framework. However, a marked-up Japanese text is no
longer purely a Japanese text: the mark-up characters require additional
G-sets. A clear separation of "language" (e.g. "Japanese with mark-up"
versus "Japanese") from the character repertoires needed to support that
"language" is important.

More generally, the discussions I have been hearing recently on this
list seem to be focussed entirely on ISO 2022. ISO 2022 is a kludge on
ISO 646 which has done a fair (only fair) job for the last decade or so,
but the character standard for the next decade/century will be ISO 10646.

It may be right to base the HTML+ work on ISO 2022, as it is here now
(tho' personally I disagree), but it is certainly wrong to ignore ISO
10646.

Using ISO 2022 to support "Japanese with mark-up" will require frequent
escape sequences, and in general little of the software around today
does a good job with ISO 2022. By contrast, ISO 10646 with a
"selected subset" such as KATAKANA provides a much cleaner
solution for most languages. Moreover, I think most computer vendors
are working on, or have already provided, support for ISO 10646 16-bit
encodings. (Note that for some way-out (?) languages the BMP of ISO
10646 will not suffice, and a 32-bit encoding may be needed.)
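
To make the escape-sequence point concrete, here is a rough sketch in
Python. It rests on assumptions that are not part of the original
discussion: Python's "iso2022_jp" codec stands in for ISO 2022 generally,
"utf-16-be" stands in for the ISO 10646 16-bit form (the two agree for
BMP characters), and the sample string is invented for illustration.

```python
# Sketch: compare an ISO 2022 encoding with a fixed-width 16-bit encoding
# for Japanese text that contains ASCII mark-up.
# Assumptions: "iso2022_jp" stands in for ISO 2022, and "utf-16-be" stands
# in for the ISO 10646 16-bit form (identical for BMP-only text).

ESC = b"\x1b"  # every ISO 2022 designation sequence starts with this byte


def escape_count(data: bytes) -> int:
    """Count the escape sequences in a 7-bit ISO 2022 byte stream."""
    return data.count(ESC)


# Invented sample: Japanese text interleaved with ASCII tags.
marked_up = "<p>日本語の<b>テキスト</b>です</p>"

as_2022 = marked_up.encode("iso2022_jp")
as_16bit = marked_up.encode("utf-16-be")

# Each switch between the ASCII set and the Japanese set costs one escape.
print("ISO 2022 escape sequences:", escape_count(as_2022))

# The 16-bit form needs no switching: one code unit per character.
print("16-bit code units:", len(as_16bit) // 2,
      "for", len(marked_up), "characters")
```

Every tag boundary forces a switch out of the Japanese set and back
again, so the number of escapes grows with the density of the mark-up,
while the 16-bit form stays at one code unit per character.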

However, once we determine that a particular character repertoire is
needed, the choice of ISO 2022 or ISO 10646 as the encoding mechanism
(and the choice of encodings within 10646) is just the same as the choice
of GIF or JPEG for an image, and should be negotiated by HTTP in the
same way.
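
A minimal sketch of what such a negotiation could look like, in Python.
The function name, the encoding labels and the idea of an ordered client
list are all assumptions made for illustration; nothing here is taken
from an existing HTTP proposal.

```python
# Sketch: treat the character encoding as a negotiable representation,
# much as an image format would be. All names and labels are hypothetical.

def choose_encoding(client_accepts, server_offers):
    """Return the first encoding in the client's list that the server
    can supply, or None if the two have nothing in common."""
    for name in client_accepts:      # client order expresses preference
        if name in server_offers:
            return name
    return None


# The repertoire (say, Japanese plus mark-up) is fixed by the document;
# only the encoding used on the wire varies.
server_offers = {"iso-2022-jp", "iso-10646-ucs-2"}

print(choose_encoding(["iso-10646-ucs-2", "iso-2022-jp"], server_offers))
print(choose_encoding(["iso-8859-1"], server_offers))
```

The point of the sketch is only that the repertoire decision and the
encoding decision are separate: the second can be settled per request
without touching the first.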

Turning now to user-level requirements for languages, it seems to me
that the model should be that an HTML document is in English (by default)
or in one of a listed set of languages (plus markup; the tags are of
course not translated). The http: URL should identify the language if
it is not the default, but we should also allow the identification to be
"multiple". In this case the document is available in multiple
languages, and RUN-TIME NEGOTIATION over HTTP can be used to determine
which one is to be fetched.

I envisage that a browser would support the selection of a "preferred
language" for use in the negotiation, but would also allow the
"preferred language" to be UNSET, in which case the user is told the
languages available for a particular document and asked which one
to fetch.
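
The browser behaviour described above might be sketched as follows, in
Python. The function name and the language labels are hypothetical, and
one case is my own assumption rather than something stated above: when a
preferred language is set but the document is not available in it, the
sketch falls back to asking the user.

```python
# Sketch of the browser-side choice for a document marked "multiple".
# All names and labels are hypothetical.

def negotiate_language(available, preferred=None):
    """Return (language_to_fetch, choices_to_show_user).

    - A preferred language that is available is fetched directly.
    - If the preference is UNSET, the user is shown what is available.
    - Assumption: a preference that cannot be met is treated like UNSET.
    """
    if preferred is not None and preferred in available:
        return preferred, []
    return None, sorted(available)


available = ["en", "ja", "fr"]       # languages offered for one document

print(negotiate_language(available, preferred="ja"))   # fetch directly
print(negotiate_language(available))                   # ask the user
```

An empty list of choices means the negotiation settled the matter; a
non-empty list means the browser has to put the question to the user.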

I am not sure the current proposals provide quite this functionality?

John L