develooper Front page | perl.perl6.internals | Postings from January 2002

RE: on parrot strings

Thread Previous | Thread Next
From:
Brent Dax
Date:
January 18, 2002 00:26
Subject:
RE: on parrot strings
Message ID:
FJELLKOPEAGHOOODKEDPMEFIDBAA.brentdax@cpan.org
Jarkko Hietaniemi: <from attachment>
About the implementation of character classes: since the Unicode code
point range is big, a single big bitmap won't work any more: firstly,
it would be big.  Secondly, for most cases, it would be wastefully
sparse.  A balanced binary tree of (begin, end) points of ranges is
suggested.  That would seem to give the required flexibility and
reasonable compromise betwen speed and space for implementing the
operations required by both the traditional regular expression
character classes (complement, case-ignorance) and the new Unicode
character class semantics (difference, intersection) (see the Unicode
Technical Report #18, I<Unicode Regular Expression Guidelines>,
http://www.unicode.org/unicode/reports/tr18/ )

Another, possible simpler way would be to use inversion lists:
1-dimensional arrays where odd (starting from zero) indices store
the beginnings of ranges belonging to the class, and and even indices
store the beginnings of ranges not belonging to the class.
Note "array" instead of (a linked) "list": with an array one can do
binary search to determine membership (so an inversion list is in
effect a flattened binary tree).  Yet another way would be to use
various two-level table schemes.  The choice of the appropriate data
structure, as always, depends on the expected operational (read vs
modify) mix and the expected data distribution.

###

Since I seem to be the main regex hacker for Parrot, I'll respond to
this as best I can.

Currently, we are using bitmaps for character classes.  Well, sort of.
A Bitmap in Parrot is defined like this:

	typedef struct bitmap_t {
		char*		bmp;
		STRING*	bigchars;
	} Bitmap;

Characters <256 are stored as a bitmap in bmp; other characters are
stored in bigchars and linear-searched.  This is a temporary measure,
since Parrot isn't yet dealing with many characters outside of ASCII.
Several schemes have been proposed for the final version; I'm currently
leaning towards an array of arrays of arrays of bitmaps (one level for
each byte of the character):

	INTVAL ch;
	return
bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)][FORTH_BYTE(ch)
>>3] & (1<<(FORTH_BYTE(ch) & 7));

Ungainly, but it works.  It would actually be a bit more
complicated--only the arrays that we actually used would be allocated to
save space--but you get the idea.  (However, I'm quite flexible on the
implementation chosen.  I'll look at the ideas you propose in more
detail; if anyone else has any suggestions, suggest them.)

As for character encodings, we're forcing everything to UTF-32 in
regular expressions.  No exceptions.  If you use a string in a regex,
it'll be transcoded.  I honestly can't think of a better way to
guarantee efficient string indexing.

--Brent Dax
brentdax@cpan.org
Parrot Configure pumpking and regex hacker

<obra> mmmm. hawt sysadmin chx0rs
<lathos> This is sad. I know of *a* hawt sysamin chx0r.
<obra> I know more than a few.
<lathos> obra: There are two? Are you sure it's not the same one?


Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About