develooper Front page | perl.perl6.internals | Postings from June 2001

Re: Should we care much about this Unicode-ish criticism?

From: Larry Wall
Date: June 5, 2001 19:02
Subject: Re: Should we care much about this Unicode-ish criticism?
Message ID: 200106060159.SAA06233@kiev.wall.org

Dan Sugalski writes:
: At 04:44 PM 6/5/2001 -0700, Larry Wall wrote:
: >(Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)
: 
: I know we can, but is it really a good idea? 32 bits is really stretching 
: it for character encoding, and 64 seems rather excessive.

Such large values would not typically be used for standard characters, but
as a means of embedding an inline chunk of non-character data, such as a
pointer, or a set of metadata bits.
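
To make the "values beyond the Unicode range" idea concrete, here is a minimal Python sketch of a loose encoder in the style of the original UTF-8 definition (RFC 2279), which allowed up to six bytes and 31-bit values. The function name `encode_loose_utf8` is made up for illustration. This is not Perl's actual implementation, and it deliberately stops at 31 bits rather than guessing at the 13-byte form mentioned above.

```python
def encode_loose_utf8(cp):
    """Encode a non-negative integer up to 31 bits using the original
    (RFC 2279 style) UTF-8 bit layout, without enforcing the 0x10FFFF
    ceiling. Hypothetical helper, for illustration only."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])  # one byte, plain ASCII

    # (largest value, total length in bytes, lead-byte prefix)
    layouts = [
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ]
    for limit, length, lead in layouts:
        if cp <= limit:
            break

    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = lead | cp  # remaining high bits go into the lead byte
    return bytes(out)
```

For values inside the standard range the output matches ordinary UTF-8; the first value past the ceiling, 0x110000, still encodes in four bytes, which is exactly the kind of thing a strict UTF-8 decoder would reject.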

: Really 
: space-wasteful as well, if we maintain a character type with a fixed width 
: large enough to hold the largest decoded variable-width character.

True 'nuff.  I suspect most people would want to stick within 32 bits,
which is sufficiently wasteful for most purposes.

: And I 
: really, *really* want to do as little as possible internally with 
: variable-width encodings. Yech.

Mmm, the difficulty of that is overrated.  Very seldom do you want to
do anything other than find the next character, or the previous
character, and those are pretty easy to do in utf8.
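
As a rough illustration of why stepping is easy: every continuation byte in utf8 has the bit pattern 10xxxxxx, so moving to the next or previous character is just a matter of skipping those bytes. The Python sketch below assumes a well-formed buffer and an index that already sits on a character boundary; the function names are invented for the example.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


def next_char(buf, i):
    """Return the index of the character after the one starting at i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


def prev_char(buf, i):
    """Return the index of the character before the boundary at i."""
    i -= 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


# "a", EURO SIGN (three bytes), "b"
buf = "a\u20acb".encode("utf-8")
```

Neither direction needs to decode anything or look at more than a few bytes, which is the point being made above.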

: >They also arbitrarily define UTF-32 to not use higher values than
: >0x10ffff, but that doesn't mean we're gonna send in the high-bit Nazis
: >if people want higher values for their own purposes.
: 
: Well, that'd be inappropriate since a good chunk of the rest of the set's 
: been dedicated to future expansion. I think it might be a reasonable idea 
: for -w to grumble if someone's used a character in the unassigned range, 
: though. (IIRC there's a piece set aside for folks to do whatever they want 
: with)

Certainly, but it's easy to come up with reasons to want to stuff more
bits inline than the private use areas will support.  Rather than have
-w grumble about such characters, I'd rather see an optional output
discipline that enforces strict Unicode output.
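
One way to picture such an output discipline, sketched in Python rather than as any actual Perl interface: strings are held internally as plain integers with no range restriction, and the strict check happens only at the output boundary. The name `strict_unicode_output` and the choice to raise an error (rather than substitute or drop the value) are assumptions made for the example.

```python
MAX_UNICODE = 0x10FFFF  # highest code point Unicode defines


def strict_unicode_output(code_points):
    """Encode a sequence of integers as strict UTF-8, refusing anything
    that is not a legal Unicode scalar value. Hypothetical filter."""
    chars = []
    for cp in code_points:
        if cp < 0 or cp > MAX_UNICODE:
            raise ValueError("value %#x is outside the Unicode range" % cp)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("value %#x is a surrogate" % cp)
        chars.append(chr(cp))
    return "".join(chars).encode("utf-8")
```

Code that never selects the strict discipline is free to carry larger values around internally; only output that promises real Unicode pays for the check.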

: >But since the names UTF-8 and UTF-32 are becoming associated with those
: >arbitrary restrictions, it's getting even more important to refer to
: >Perl's looser style as utf8 (and, potentially, utf32).  I don't know
: >if Perl will have a utf16 that is distinguished from UTF-16.
: 
: I'd as soon not do UTF-16 at all, or at least no more than we need to 
: convert to UTF-32 or UTF-8.

Well, as you pointed out above, we might not use any kind of "UTF"
internally, but just arrays of properly sized integers, which are never
variable length.  (UTF-32 is the only UTF that's not a variable-length
encoding.)

On the other hand, maybe there's some use for a data structure that is
a sequence of integers of various sizes, where the representation of
different chunks of the array/string might be different sizes.  Would
make some aspects of copy-on-write more efficient to be able to chunk
strings and integer arrays.  And of course this would all be transparent
at the language level, in the absence of explicit syntax to treat an
array as a string or a string as an array.
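
A very rough Python sketch of what such a chunked structure might look like, purely to illustrate the idea: each chunk stores its integers at the narrowest width that fits, while indexing treats the whole thing as one flat sequence. The class name `ChunkedInts` and its methods are invented here, values are assumed to be non-negative and to fit in 64 bits, and nothing about copy-on-write is implemented.

```python
from array import array


def smallest_typecode(values):
    """Pick the narrowest unsigned array typecode that holds every value."""
    biggest = max(values)
    for code in ("B", "H", "I", "L", "Q"):
        if biggest < 1 << (8 * array(code).itemsize):
            return code
    raise OverflowError("value too large for a 64-bit chunk")


class ChunkedInts:
    """A sequence of integers stored as chunks of differing element widths."""

    def __init__(self):
        self.chunks = []

    def append_chunk(self, values):
        values = list(values)
        if values:
            self.chunks.append(array(smallest_typecode(values), values))

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if index < 0:
            raise IndexError("index out of range")
        # Walk the chunks; the caller never sees where the boundaries are.
        for chunk in self.chunks:
            if index < len(chunk):
                return chunk[index]
            index -= len(chunk)
        raise IndexError("index out of range")

    def widths(self):
        """Bytes per element in each chunk, to show the mixed representation."""
        return [chunk.itemsize for chunk in self.chunks]


s = ChunkedInts()
s.append_chunk(b"hello")            # fits in one byte per element
s.append_chunk([0x20AC, 0x1F600])   # needs a wider element type
```

Here the first chunk costs one byte per element and only the second pays for wide elements, yet `s[0]` and `s[6]` are read the same way.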

Larry


