Re: Unicode and HTML

Daniel W. Connolly ([email protected])
Tue, 25 Oct 1994 18:40:04 -0500


In message <[email protected]>, "Richard L. Goerwitz" writes:
>A week or two ago, the subject of Unicode and HTTP/HTML came up.
>Then it died. Is this because its utility is so obvious that no-
>body considered it worth discussing? Or is it because implemen-
>ting Unicode on current procrustean systems is so unrealistic
>that the issue was simply ignored? Something in-between?

My guess is that nobody has any deliverables hanging over their head
in this area. In other words, nobody has any immediate problems to get
solved. Necessity is the mother of invention, after all.

But I wouldn't say that nobody's interested in the issue of expanding
the web to effectively support non-western writing systems. I have
gotten a number of inquiries on this subject from different
directions: folks on ISO committees, folks writing HTML books, folks
making strategic corporate decisions.

I don't have any good answers. I point them to the various archives
and wish them good luck. UTF-8 looks like a good idea, since it has
some of the right backwards-compatibility characteristics. There's a
version of Mosaic available that groks certain ISO2022 constructs.
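
To make the backwards-compatibility point concrete, here is a minimal
sketch in Python. The sample strings and the helper names (utf8_bytes,
ascii_survives) are made up for illustration; the only thing it relies
on is the standard UTF-8 encoding itself. It shows that plain ASCII
markup encodes to the very same bytes, and that the bytes of non-ASCII
characters all fall at 0x80 or above, so they can't be mistaken for
markup characters like "<" or "&".

```python
def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8."""
    return text.encode("utf-8")


def ascii_survives(text: str) -> bool:
    """Check that the bytes below 0x80 in the UTF-8 encoding are exactly
    the ASCII characters of the input, in the same order."""
    encoded = utf8_bytes(text)
    ascii_from_bytes = bytes(b for b in encoded if b < 0x80)
    ascii_from_text = "".join(c for c in text if ord(c) < 0x80).encode("ascii")
    return ascii_from_bytes == ascii_from_text


# Illustrative samples (assumptions, not taken from any real document).
ascii_markup = "<TITLE>Unicode and HTML</TITLE>"
mixed_markup = "<P>na\u00efve \u65e5\u672c\u8a9e</P>"

# ASCII-only text is byte-for-byte identical under UTF-8.
print(utf8_bytes(ascii_markup) == ascii_markup.encode("ascii"))

# Tags in mixed text remain findable with plain byte matching.
print(ascii_survives(mixed_markup))
print(utf8_bytes(mixed_markup).startswith(b"<P>"))
```

An ASCII-only parser that passes high bytes through untouched would
still find the tags in the second sample, which is roughly the property
that makes UTF-8 attractive for existing HTML software.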

I haven't had time to read either of the specs (Unicode/ISO10646 or
ISO2022). (And given my experience reading ISO documents in the past,
I'm not looking forward to it.)

From following the discussions on comp.text.sgml, and reading the IETF
drafts, the only thing that's clear to me is that language and
character set issues are a mess.

My advice: find specific, solvable problems, and attack them. Don't
try to be everything to everybody. Have a body of information and an
audience in mind, and design and deploy an application. In other
words, we're in the research phase.

> Still others thought
>the whole issue of multilingual text to be far too arcane to be
>worth worrying about at this juncture.

Well, the response to some proposals has been "but what about ancient
Martian, where the tense of verbs is communicated through the ion
radiation of the ink?" What I mean is that there are some writing
systems where the cost of implementation and deployment exceeds the
benefit to be gained. We must keep in mind that HTML and WWW are
essentially cheap technology. Otherwise, it gets too big to maintain,
and only commercial enterprises can support it, and their products
have short life cycles.

Hmmm... in general, it looks like the problem is that language and
character issues are expensive to solve, and nobody has found a cheap
solution that scales. HTML took off because the start-up costs were
low, but enough folks saw the value of hypertext functionality so that
a large body of data accumulated, so that as time went by, the value
of "joining the web" increased. Obviously, critical mass was reached
and BOOM! Terabytes of HTML.

So if you find something relatively simple that will solve 80% of the
problems, for God's sake let us know! Write it up. Give examples. Get
somebody to code it up.

One unfortunate side effect of the growth of the web is that there
are a lot more people to convince when you want to introduce a new
feature. But don't let that stop you!

Dan