Dan Sugalski writes:
: At 04:44 PM 6/5/2001 -0700, Larry Wall wrote:
: >(Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)
:
: I know we can, but is it really a good idea? 32 bits is really stretching
: it for character encoding, and 64 seems rather excessive.

Such large values would not typically be used for standard characters,
but as a means of embedding an inline chunk of non-character data, such
as a pointer, or a set of metadata bits.

: Really
: space-wasteful as well, if we maintain a character type with a fixed width
: large enough to hold the largest decoded variable-width character.

True 'nuff.  I suspect most people would want to stick within 32 bits,
which is sufficiently wasteful for most purposes.

: And I
: really, *really* want to do as little as possible internally with
: variable-width encodings. Yech.

Mmm, the difficulty of that is overrated.  Very seldom do you want to do
anything other than find the next character, or the previous character,
and those are pretty easy to do in utf8.

: >They also arbitrarily define UTF-32 to not use higher values than
: >0x10ffff, but that doesn't mean we're gonna send in the high-bit Nazis
: >if people want higher values for their own purposes.
:
: Well, that'd be inappropriate since a good chunk of the rest of the set's
: been dedicated to future expansion. I think it might be a reasonable idea
: for -w to grumble if someone's used a character in the unassigned range,
: though. (IIRC there's a piece set aside for folks to do whatever they want
: with)

Certainly, but it's easy to come up with reasons to want to stuff more
bits inline than the private use areas will support.  Rather than have
-w grumble about such characters, I'd rather see an optional output
discipline that enforces strict Unicode output.

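
To illustrate the remark above that finding the next or previous
character is easy in utf8, here is a minimal Python sketch.  It is not
Perl's implementation; the function names are invented for this example,
and it assumes well-formed input and an offset that already sits on a
character boundary.  The only fact it relies on is that continuation
bytes always have the bit pattern 10xxxxxx, so a character boundary can
be found by skipping them in either direction.

```python
def is_continuation(byte: int) -> bool:
    # Continuation bytes look like 10xxxxxx; lead bytes never do.
    return (byte & 0xC0) == 0x80


def next_char(buf: bytes, i: int) -> int:
    """Offset of the character following the one that starts at i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


def prev_char(buf: bytes, i: int) -> int:
    """Offset of the character that ends just before offset i (i > 0)."""
    i -= 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def char_starts(buf: bytes) -> list:
    """Walk forward and collect the starting offset of every character."""
    starts = []
    i = 0
    while i < len(buf):
        starts.append(i)
        i = next_char(buf, i)
    return starts


# One character each of 1, 2, 3 and 4 bytes.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
forward = char_starts(sample)

# Walk backward from the end of the buffer to the start.
backward = []
pos = len(sample)
while pos > 0:
    pos = prev_char(sample, pos)
    backward.append(pos)
```

Neither direction needs to decode the character or know its width in
advance, which is why stepping costs so little in practice.
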
: >But since the names UTF-8 and UTF-32 are becoming associated with those
: >arbitrary restrictions, it's getting even more important to refer to
: >Perl's looser style as utf8 (and, potentially, utf32). I don't know
: >if Perl will have a utf16 that is distinguished from UTF-16.
:
: I'd as soon not do UTF-16 at all, or at least no more than we need to
: convert to UTF-32 or UTF-8.

Well, as you pointed out above, we might not use any kind of "UTF"
internally, but just arrays of properly sized integers, which are never
variable length.  (UTF-32 is the only UTF that's not a variable-length
encoding.)

On the other hand, maybe there's some use for a data structure that is
a sequence of integers of various sizes, where the representation of
different chunks of the array/string might be different sizes.  Would
make some aspects of copy-on-write more efficient to be able to chunk
strings and integer arrays.

And of course this would all be transparent at the language level, in
the absence of explicit syntax to treat an array as a string or a string
as an array.

Larry