i18n and Unicode/10646

David Goldsmith ([email protected])
Tue, 6 Dec 1994 13:40:55 -0800


I noticed that there has been some discussion here over the last few months
of using Unicode/ISO 10646 for internationalization of HTTP. I am
interested in use of Unicode/10646 on the Internet (all of Taligent's
software is based on Unicode), and so I've joined this list.

Gavin Nicol ([email protected]) recently proposed using Unicode as a kind of
character set "bus". Servers and clients would deal with their own, local
character sets, but by using Unicode for transmission, you get maximal
interoperability between a server and a client not using the same character
set.

Advantages I see:

1. Since there is complete round-trip mapping between Unicode and the
character sets in use today, a client and server using the same character
set should get the same result. Unicode has round-trip mappings with all of
the ISO-8859-x sets, JIS 0208, JIS 0212, and KSC 5601. It also supports all
of the Big5 and GB character sets for Chinese, except for a handful of
variant forms of radicals which can be represented in Unicode/10646 in
other ways (I can get the details on this for anyone who is interested, but
Taligent's Chinese character set expert is out sick today).
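In modern terms, the round-trip property can be checked directly. This is
only a sketch in Python; the codec names are Python's, and the sample
strings are mine, not from the discussion:

```python
# Round-trip check: legacy bytes -> Unicode -> the same legacy bytes.
# Sample strings and codec names are illustrative.
samples = {
    "iso-8859-1": "café",     # Latin-1
    "iso-8859-7": "αβγδ",     # Greek
    "shift_jis":  "日本語",    # JIS X 0208 repertoire
    "euc_kr":     "한국어",    # KSC 5601 repertoire
}

for charset, text in samples.items():
    legacy = text.encode(charset)         # local character set bytes
    via_unicode = legacy.decode(charset)  # map into Unicode
    # The mapping back out is lossless for these repertoires.
    assert via_unicode.encode(charset) == legacy, charset

print("round trips OK")
```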

2. Clients and servers using different character sets will get as much of
the text as possible. For example, a Japanese client reading from a server
using Greek will see the Greek, because JIS 0208 has the Greek character
set, and the mapping through Unicode will preserve it. If clients or
servers support more than one character set locally (some operating systems
allow this), they can try mapping to the primary character set first, then
try the others for characters which can't be mapped.
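The "bus" and the primary-then-fallback mapping can be sketched as follows
(Python, with Python's codec names; the `to_local` helper is hypothetical,
not part of any proposal here):

```python
# A Greek server's local bytes travel as Unicode and are re-encoded into
# the Japanese client's local charset; JIS X 0208 includes the Greek
# alphabet, so nothing is lost in transit.
greek_bytes = "αβγ".encode("iso-8859-7")   # server's local encoding
text = greek_bytes.decode("iso-8859-7")    # server -> Unicode "bus"
client_bytes = text.encode("shift_jis")    # Unicode -> client's charset
assert client_bytes.decode("shift_jis") == "αβγ"

# Hypothetical helper: try the client's primary charset first, then the
# other locally supported ones, as described above.
def to_local(text, charsets):
    for cs in charsets:
        try:
            return cs, text.encode(cs)
        except UnicodeEncodeError:
            continue
    return None  # nothing fits; caller must degrade further

# ISO 8859-1 cannot hold Greek, so the second charset is chosen.
assert to_local("αβγ", ["iso-8859-1", "shift_jis"])[0] == "shift_jis"
```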

3. Clients or servers that use Unicode can interoperate with clients and
servers that don't.

The only disadvantage I see is how to enable such use of Unicode while
still supporting existing clients that don't understand it (the issue raised
by Sandra O'Donnell of OSF). In the absence of a character set negotiation
protocol in HTTP, it's not clear how to do this. It is certainly not the
case that everyone will switch to using Unicode this way overnight, so any
plan to enable use of Unicode has to make it optional.

I also wanted to respond to a couple of issues that were raised:

1. Using UTF-7 vs. UTF-8 vs. UCS-2 forms of Unicode: UTF-7 is a 7-bit code,
which you would really only want for mail and other channels where sending
8 bits is not safe. UTF-8 has the advantage that it looks like ASCII for
the first 128 octets, and so may work better with non-Unicode-aware
clients. It has the disadvantage that Asian characters take three bytes
each, a 50% increase in overhead over the encoding methods used now.
Straight Unicode imposes no overhead on Asian users, but doubles the size
of 8859-x text. Since it is very easy to mechanically
convert between UTF-8 and Unicode, if there were a character set
negotiation protocol, you could select the appropriate encoding based on
client preferences.
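The overhead arithmetic can be verified directly. A Python sketch, with
sample strings of my own choosing, and "straight Unicode" rendered as
big-endian two-byte UCS-2 (UTF-16 without a byte order mark):

```python
for label, text in [
    ("ASCII",    "Hello, world"),
    ("Greek",    "αβγδε"),
    ("Japanese", "日本語のテキスト"),
]:
    utf8 = len(text.encode("utf-8"))
    ucs2 = len(text.encode("utf-16-be"))  # "straight Unicode", no BOM
    print(f"{label}: {len(text)} chars, UTF-8 {utf8} bytes, UCS-2 {ucs2} bytes")

# Asian characters: 3 bytes in UTF-8 vs 2 in the legacy sets (+50%);
# straight Unicode doubles the size of 8859-x text (1 byte -> 2 bytes).
assert len("語".encode("utf-8")) == 3
assert len("語".encode("utf-16-be")) == 2
assert len("α".encode("utf-16-be")) == 2 * len("α".encode("iso-8859-7"))
```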

Another issue with using straight Unicode is that the MIME spec (at least,
the new draft version) specifies that all subtypes of the "text" content
type must use the CRLF (0x0D 0x0A) line-break convention, which rules out
any character set that does not use ASCII as a base, and certainly rules
out a 16-bit character set like Unicode.
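The CRLF point is visible in the raw octets. A Python sketch, again showing
straight Unicode as big-endian UCS-2:

```python
# In a 16-bit encoding, CRLF is not the bare octets 0x0D 0x0A that the
# MIME "text" rule expects; each character carries an extra NUL octet.
assert "\r\n".encode("ascii")     == b"\x0d\x0a"
assert "\r\n".encode("utf-16-be") == b"\x00\x0d\x00\x0a"
# UTF-8, by contrast, keeps all ASCII (including CRLF) as single octets.
assert "\r\n".encode("utf-8")     == b"\x0d\x0a"
```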

2. Han unification and language tagging: While tagging text with its
language gives better presentation and aesthetics, it is not necessary for
basic legibility. Han unification in Unicode was performed in
such a way that characters were not unified if they had substantially
different appearances, or could be confused semantically (the Han
unification process is described in The Unicode Standard, Version 1.0,
Volume 2, Addison-Wesley, ISBN 0-201-60845-6. I strongly urge anyone who is
concerned about Han unification issues to read this and judge for
yourself).

Consider an example where this would be an issue. Suppose a Japanese user
receives some text which is in Chinese. Would language tagging help? There
are two cases:

a. The Japanese user does not have Chinese support on their system, only
Japanese. No Chinese fonts are installed. There are two options open:
translate the Unicode into JIS code and display it, or don't display
anything. The first option is preferable, because the text will be legible
even if not displayed in an ideal fashion.
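The first option can be sketched like this (Python; the codec name is
Python's, and the sample text is mine, chosen from Han characters shared
between the Chinese sets and JIS X 0208):

```python
# Chinese text, held as Unicode, translated into the Japanese client's
# charset. Thanks to Han unification, characters shared with JIS X 0208
# map cleanly; any unmappable ones are replaced rather than losing the
# whole text.
chinese = "漢字"  # unified Han, present in JIS X 0208
jis = chinese.encode("iso2022_jp", errors="replace")
assert jis.decode("iso2022_jp") == "漢字"  # legible on the Japanese system
```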

b. The Japanese user does have Chinese support on their system, as well as
Chinese fonts. Here the issue is that because the "primary" character set
is Japanese, in the absence of language tagging the Unicode will be
converted to a Japanese character set and displayed in a Japanese font.
Thus, it would be desirable for language tagging to be supported in WWW
(and elsewhere on the Internet, for that matter). However, this situation
is not too bad, because:

- The text is still legible, as in a. above.
- The end user can specify that their preferred character set is Chinese,
and refetch the data.
- If the client system uses Unicode, they can just change the font to get
correct display.

Also note that this only occurs when the recipient has a multilingual,
multi-character set operating system with the right fonts installed. This
capability is still a rarity. In all other cases, the language tagging
won't make any difference in the end result.

In conclusion, I'm pretty familiar with Unicode but not very familiar with
HTTP and HTML. I am hoping to work together with the people on this list to
enable use of Unicode on WWW.

----------------------------
David Goldsmith
[email protected]
Senior Scientist
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA 95014-2233