Re: Should we care much about this Unicode-ish criticism?

From: Larry Wall
Date: June 6, 2001 10:52
Subject: Re: Should we care much about this Unicode-ish criticism?
Message ID: 200106061750.KAA10547@kiev.wall.org

Dan Sugalski writes:
: At 06:59 PM 6/5/2001 -0700, Larry Wall wrote:
: >Such large values would not typically be used for standard characters, but
: >as a means of embedding an inline chunk of non-character data, such as a
: >pointer, or a set of metadata bits.
: 
: Ah. In that case, perhaps extended utf-8 processing isn't really the most 
: appropriate way to go. If the intent is to do embedded binary bits in a 
: text stream, maybe we should build input and output filters to do that instead.

I see the issue of filters as orthogonal to the issue of representation.
In other words, I don't understand what you're saying.

: >Mmm, the difficulty of that is overrated.  Very seldom do you want to
: >do anything other than find the next character, or the previous
: >character, and those are pretty easy to do in utf8.
: 
: As Hong pointed out to me on more than one occasion. I'm not sure I buy 
: that, and I have serious reservations about the speed of dealing with 
: variable length characters instead of fixed-length ones.

Whether you buy it or not, I wasn't offering it as a mere conjecture.
That is precisely what Perl 5.6+ is already doing for Unicode data.
It's not a big deal unless your program is full of substr().  And it
saves on input processing, if the input is already known to be UTF-8.
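
[Editor's sketch, not part of the original message.] The claim that stepping to the next or previous character is easy in UTF-8 rests on one property of the encoding: every continuation byte has the bit pattern 10xxxxxx, so a character boundary can be found by skipping such bytes. The Python below is a minimal illustration that assumes well-formed UTF-8 input. The function names are made up for this note, and this is not how Perl 5.6 implements it internally. The last helper shows why code full of substr() is the costly case: finding the n-th character takes a linear scan.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always look like 10xxxxxx.
    return (byte & 0xC0) == 0x80

def next_char(buf: bytes, i: int) -> int:
    """Offset of the character following the one that starts at offset i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i

def prev_char(buf: bytes, i: int) -> int:
    """Offset of the character that ends just before offset i."""
    i -= 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i

def char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based), found by scanning from the start."""
    i = 0
    for _ in range(n):
        i = next_char(buf, i)
    return i

# 'a' is 1 byte, 'é' is 2, '€' is 3, 'z' is 1: seven bytes in total.
buf = "aé€z".encode("utf-8")
print(next_char(buf, 1), prev_char(buf, 6), char_offset(buf, 3))
```

Each step touches at most a few bytes, while char_offset() costs time proportional to n.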

: >Certainly, but it's easy to come up with reasons to want to stuff more
: >bits inline than the private use areas will support.
: 
: Maybe. That trips my "way too clever" reflex, though, and makes me think 
: that perhaps it's not the best way to go about that sort of thing. Rather 
: than making non-text things look like text, maybe we'd be better off coming 
: up with a better way to intermingle text and non-text things. It'd be more 
: space-efficient as well, since utf-8 encoding random binary things will 
: tend to expand them more than would seem necessary.

True 'nuff.  There are many valid reasons to want to annotate
substrings with various kinds of out-of-band metadata.  The trick
is to keep the out-of-band data in sync with the in-band data.
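
[Editor's sketch, not part of the original message.] Dan's space-efficiency point can be checked with a small experiment. The mapping used here, one code point in U+0000..U+00FF per binary byte, is only an assumption chosen for illustration. Nobody in the thread proposes this exact scheme, and other packings would give different ratios. Under that mapping, every byte with the high bit set costs two bytes once encoded as UTF-8.

```python
def as_utf8_text(data: bytes) -> bytes:
    # Hypothetical scheme: treat each byte as code point U+0000..U+00FF,
    # then encode the resulting string as UTF-8.
    return data.decode("latin-1").encode("utf-8")

every_byte = bytes(range(256))
encoded = as_utf8_text(every_byte)

# Bytes 0x00..0x7F stay one byte each; bytes 0x80..0xFF become two bytes each.
print(len(every_byte), len(encoded))  # 256 384
```

For uniformly random binary data that works out to roughly a 1.5x expansion under this particular mapping.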

: >On the other hand, maybe there's some use for a data structure that is
: >a sequence of integers of various sizes, where the representation of
: >different chunks of the array/string might be different sizes.  Would
: >make some aspects of copy-on-write more efficient to be able to chunk
: >strings and integer arrays.  And of course this would all be transparent
: >at the language level, in the absence of explicit syntax to treat an
: >array as a string or a string as an array.
: 
: I think that'd be a better solution than fibbing about what a piece of a 
: data stream is.
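
[Editor's sketch, not part of the original message.] The chunked structure Larry describes, a sequence of integers in which different chunks use different element sizes, might look roughly like the following. The class name, the three width tiers, and the linear chunk lookup are all assumptions made for this note and do not come from any Perl design document. A real implementation would also need mutation, copy-on-write sharing of chunks, and a faster index.

```python
from array import array

def pack_chunk(values):
    """Store a run of non-negative integers in the narrowest array type that fits."""
    top = max(values, default=0)
    typecode = "B" if top < 1 << 8 else "H" if top < 1 << 16 else "Q"
    return array(typecode, values)

class ChunkedSeq:
    """Hypothetical read-only sequence whose chunks may differ in element width."""

    def __init__(self, chunks):
        self.chunks = [pack_chunk(c) for c in chunks]

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indices are not supported in this sketch")
        # Walk the chunks until the index falls inside one of them.
        for chunk in self.chunks:
            if index < len(chunk):
                return chunk[index]
            index -= len(chunk)
        raise IndexError("index out of range")

    def widths(self):
        # Bytes per element in each chunk.
        return [c.itemsize for c in self.chunks]

seq = ChunkedSeq([list(b"plain ascii"), [0x263A, 0x20AC], [2**40]])
print(len(seq), seq[11] == 0x263A, seq.widths())
```

The ASCII run is held at one byte per element, and only the chunk containing the large value pays for wide elements. Callers index the sequence the same way regardless of the widths underneath, which is the transparency at the language level that the quoted paragraph asks for.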

If you want to attach the label "fib" to an intentional violation of
cultural convention, I suppose I can't stop you.  But Rosa Parks wasn't
pretending to be white when she sat in the front of the bus.  :-)

Larry