develooper Front page | perl.perl6.internals | Postings from June 2001

Re: Should we care much about this Unicode-ish criticism?

From:
Simon Cozens
Date:
June 5, 2001 10:55
Subject:
Re: Should we care much about this Unicode-ish criticism?
Message ID:
20010605185503.A4636@deep-dark-truthful-mirror.pmb.ox.ac.uk
On Tue, Jun 05, 2001 at 01:31:38PM -0400, Dan Sugalski wrote:
> The other issue it actively brought up was the complaint about having to 
> share glyphs amongst several languages, which didn't strike me as all that 
> big a deal either, except perhaps as a matter of national pride and/or easy 
> identification of the language of origin for a glyph. Not being literate in 
> any of the languages in question, though, I didn't feel particularly 
> qualified to make a judgement as to the validity of the complaints.

There are a number of related problems here; the Han unification effort has
pissed off some Asians on several counts. The easiest part to explain is
display; this isn't something that Perl particularly needs to care about, but
the same character may need to look different in Chinese than it does in
Japanese.

For the rest, I refer the assembly to my undergraduate dissertation :) :

--------
Unicode itself is, like the JIS standard, simply an enumeration of
characters with their orderings; it says nothing about how the data is
represented to the computer, and must be supplemented by one of several
Unicode Transformation Formats which describe the encoding. 
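The split between the abstract enumeration and its concrete byte-level encodings can be sketched in a few lines of Python (the choice of U+5263, one form of the character for "sword", is purely illustrative):

```python
# One abstract Unicode code point, several concrete transformation formats.
ch = "\u5263"

print(f"U+{ord(ch):04X}")        # the code point itself: U+5263
print(ch.encode("utf-8"))        # b'\xe5\x89\xa3'         - three bytes
print(ch.encode("utf-16-be"))    # b'\x52\x63'             - two bytes
print(ch.encode("utf-32-be"))    # b'\x00\x00\x52\x63'     - four bytes
```

The code point never changes; only the Unicode Transformation Format chosen for storage or interchange does.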

However, despite the huge benefits to programmers worldwide, two
critical problems are hindering the adoption of Unicode amongst the
Japanese computer-using community. The first objection is technical, and
the second is more sociological. 

The technical objection stems from the fact that the Unicode Consortium
initially assigned a finite space for all Japanese, Chinese and Korean
characters, allowing only just under 28,000 characters. This space has
nearly been filled, with 20,902 basic characters already accepted, and
6,585 new characters under review; the situation is not going to get any
better as Chinese characters are invented for use in proper names and so
on. It is evident that 28,000 characters is not going to be anywhere
near enough, and programmers have felt betrayed: the promised `fully
universal character set' seemed set to satisfy every language but
theirs.

Thankfully, the Unicode Consortium has recently assigned another
extension plane for CJK characters and adopted a further 42,711
characters, meaning that all the characters in the Chinese Han Yu Da
Zidian and the Japanese Morohashi Dai Kanwa Jiten are now adopted into
Unicode. However, many programmers are unaware of the extension plane
and still feel that the Unicode Consortium is ignoring their plight.
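The figures above fall straight out of code-point arithmetic on the standard block ranges (the ranges below are the ones eventually standardised; note that the final Extension A block ended up with 6,582 characters, slightly fewer than the 6,585 then under review):

```python
# Code-point arithmetic for the CJK blocks discussed in the text.
uro   = 0x9FA5  - 0x4E00  + 1   # CJK Unified Ideographs (BMP): 20,902
ext_a = 0x4DB5  - 0x3400  + 1   # Extension A (BMP): 6,582
ext_b = 0x2A6D6 - 0x20000 + 1   # Extension B (Plane 2): 42,711

print(uro, ext_a, ext_b)        # 20902 6582 42711
print(uro + ext_a)              # 27484 - the "just under 28,000" on the BMP
```

Extension B is the "extension plane" allocation that brought in the Han Yu Da Zidian and Morohashi characters.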

More serious, however, is the decision to unify equivalent characters in
the Chinese, Japanese and Korean character sets into a single table
known as `Unihan'[10]. This has proved controversial primarily through
lack of understanding of the nature of `equivalent characters': the
Unihan table does not constitute a dumbing down of the character set, as
simplified and traditional forms of characters have been maintained.
However, Chinese and Japanese variants of the same single character have
been unified. The Unicode standard seeks to encode characters rather than
glyphs[11], and hence variant characters which come about purely through
differences in writing style have been unified. On the other hand,
characters whose variants differ structurally have not been unified.
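The character-not-glyph policy is easy to see from a program's point of view. A commonly cited example is U+76F4, whose printed form differs visibly between Chinese and Japanese typography yet which Unihan encodes as a single character (the choice of example is mine, not the dissertation's):

```python
import unicodedata

# U+76F4: one code point, regionally different glyphs.
ch = "\u76f4"
print(unicodedata.name(ch))   # CJK UNIFIED IDEOGRAPH-76F4

# Nothing in the character itself records which regional glyph is meant;
# that choice is left entirely to the font - which is precisely the
# objection discussed in the following paragraphs.
```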

The principles on which Han unification took place are, according to
[Graham, 2000], not dissimilar to those used to unify characters in the
legacy JIS and other character sets. Three rules were used to determine
whether or not two kanji should be considered equivalent: 

Source Separation Rule 

If two kanji were distinct in a primary source character set (JIS in the
case of Japanese, GB2312-80 and other GB standards for Chinese,
KSC5601-1987 for Korean, and so on) then they should not be unified.
This would allow round-trip conversion between Unicode and the original
source. For instance, the following variants of the character for
tsurugi, sword, were not unified: 

    [Picture omitted]

Non-Cognate Rule 

Kanji which are not cognate are not variants; this prohibits, for
instance, the unification of the following characters:

    [Picture omitted]

Component Structure 

If a unification is acceptable under the above rules, unification is
only carried out if the characters share the same radicals and component
features, taking into consideration their arrangement. 


Using these rules, the CJK Joint Research Group of the ISO technical
committee on Unicode reduced a candidate set of some 121,000 Han
characters to 20,902 unique characters [12]. On the other hand, there are
some valid objections from Japanese users, on three specific counts [13]: 

Firstly, the JIS standard defines, along with the ordering and
enumeration of its characters, their glyph shape. Unicode, on the other
hand, does not. This means that as far as Unicode is concerned, there is
literally no distinction between two distinct shapes, and hence no way to
specify which should be used. This becomes particularly emotive when one
is, for instance, attempting to represent a person's name: if they have
a particular preferred variant character with which they write their
name, there is no way to communicate that to the computer, and
information is lost. 

The second objection is again related to character versus glyph issues:
since Chinese, Japanese and Korean forms of glyphs are unified into a
single character, display of a CJK text becomes difficult. As there is
no indication of the language of the input, software displaying Unicode
text has no hints about the style in which characters should be
displayed. Chinese and Japanese fonts have distinct styles, and it is
impossible to devise a font in which Japanese and Chinese texts could
both be displayed concurrently without appearing `alien'. For instance,
a Chinese user could conceivably see recognisably Japanese variants of
characters appearing in his Chinese text, and vice versa. 

In defence of Unicode, it provides (but discourages) `non-printing'
characters which tag the following text as being in a particular
language [14]. For instance, the three Unicode characters U-000E0001
U-000E006A U-000E0061 signify that the following text is in Japanese,
allowing an application to select the correct font style. 
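These Plane 14 tag characters are simply the ASCII tag letters shifted up by 0xE0000, so the sequence above spells out the language tag "ja". A short sketch:

```python
# LANGUAGE TAG (U+E0001) followed by tag copies of 'j' and 'a',
# i.e. the three characters cited in the text above.
tag = "\U000E0001\U000E006A\U000E0061"

# Each tag letter is the corresponding ASCII character plus 0xE0000.
assert all(ord(c) - 0xE0000 == ord(p) for c, p in zip(tag[1:], "ja"))
print([f"U+{ord(c):X}" for c in tag])   # ['U+E0001', 'U+E006A', 'U+E0061']
```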

Finally, there is a historiographical issue; when computers are used to
digitise and store historical literature containing archaic characters,
specifying the exact variant character becomes an important
consideration. Once again, this can be made more emotive by considering
the digitisation of Japanese or Korean family records. In such a case,
one would want to make a faithful representation of the original source
document, something which Unihan unification does not permit.


10 The characters are known in Unicode as Han ideographs; the name `Han'
is a reference to the Chinese origin of the characters, which are known
as hanzi in Chinese, hanja in Korean, chu Han in Vietnamese and, of
course, kanji in Japanese.

11 See Appendix B

12 [Lunde, 1999, pp.124-125] presents a graphical view of the unification
process. 

13 This section adopted from [Cheong, 1999] 

14 [Whistler and Adams, 2001]

-- 
"He was a modest, good-humored boy.  It was Oxford that made him insufferable."



nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About