Dan Sugalski writes:
: At 06:59 PM 6/5/2001 -0700, Larry Wall wrote:
: >Such large values would not typically be used for standard characters, but
: >as a means of embedding an inline chunk of non-character data, such as a
: >pointer, or a set of metadata bits.
:
: Ah. In that case, perhaps extended utf-8 processing isn't really the most
: appropriate way to go. If the intent is to do embedded binary bits in a
: text stream, maybe we should build input and output filters to do that instead.

I see the issue of filters as orthogonal to the issue of representation.
In other words, I don't understand what you're saying.

: >Mmm, the difficulty of that is overrated. Very seldom do you want to
: >do anything other than find the next character, or the previous
: >character, and those are pretty easy to do in utf8.
:
: As Hong pointed out to me on more than one occasion. I'm not sure I buy
: that, and I have serious reservations about the speed of dealing with
: variable length characters instead of fixed-length ones.

Whether you buy it or not, I wasn't offering it as a mere conjecture.
That is precisely what Perl 5.6+ is already doing for Unicode data.
It's not a big deal unless your program is full of substr(). And it
saves on input processing, if the input is already known to be UTF-8.

: >Certainly, but it's easy to come up with reasons to want to stuff more
: >bits inline than the private use areas will support.
:
: Maybe. That trips my "way too clever" reflex, though, and makes me think
: that perhaps it's not the best way to go about that sort of thing. Rather
: than making non-text things look like text, maybe we'd be better off coming
: up with a better way to intermingle text and non-text things. It'd be more
: space-efficient as well, since utf-8 encoding random binary things will
: tend to expand them more than would seem necessary.

True 'nuff. There are many valid reasons to want to annotate substrings
with various kinds of out-of-band metadata. The trick is to keep the
out-of-band data in sync with the in-band data.

: >On the other hand, maybe there's some use for a data structure that is
: >a sequence of integers of various sizes, where the representation of
: >different chunks of the array/string might be different sizes. Would
: >make some aspects of copy-on-write more efficient to be able to chunk
: >strings and integer arrays. And of course this would all be transparent
: >at the language level, in the absence of explicit syntax to treat an
: >array as a string or a string as an array.
:
: I think that'd be a better solution than fibbing about what a piece of a
: data stream is.

If you want to attach the label "fib" to an intentional violation of
cultural convention, I suppose I can't stop you. But Rosa Parks wasn't
pretending to be white when she sat in the front of the bus. :-)

Larry
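
For readers following the "find the next character, or the previous character" point in the thread, here is a minimal sketch of why those two operations are cheap in UTF-8. It is not code from Perl or from anyone in this thread; the function names next_char and prev_char are made up for illustration, and the sketch assumes the buffer holds well-formed UTF-8 and that the starting index sits on a character boundary. The only fact it relies on is that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the top two bits set to 10.
    return (byte & 0xC0) == 0x80


def next_char(buf: bytes, i: int) -> int:
    """Return the byte index where the character after the one at i starts.

    Assumes buf is well-formed UTF-8 and i is a character boundary
    (hypothetical helper, for illustration only).
    """
    if i >= len(buf):
        raise IndexError("already at end of buffer")
    i += 1
    # Skip the continuation bytes of the current character.
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


def prev_char(buf: bytes, i: int) -> int:
    """Return the byte index where the character before index i starts.

    Same assumptions as next_char (hypothetical helper).
    """
    if i <= 0:
        raise IndexError("already at start of buffer")
    i -= 1
    # Back up over continuation bytes until a lead byte is found.
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


if __name__ == "__main__":
    data = "a\u00e9\u20acb".encode("utf-8")  # 1-, 2-, 3- and 1-byte characters
    pos = 0
    while pos < len(data):
        nxt = next_char(data, pos)
        print(pos, data[pos:nxt].decode("utf-8"))
        pos = nxt
```

Each step touches only the bytes of one character, so its cost does not grow with the length of the string. Indexing by character position, which is roughly what a substr() call needs, would instead have to walk from some known boundary, which fits the remark above that the cost shows up mainly in programs full of substr().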