develooper Front page | perl.perl6.internals | Postings from January 2002

parrot rx engine

Thread Next
From:
Ashley Winters
Date:
January 30, 2002 08:27
Subject:
parrot rx engine
Message ID:
20020130161355.56041.qmail@web10103.mail.yahoo.com
Hello p6i,

Who the hell am I?
I've been only a weblog-lurker till now. It's been a couple years since
I last contributed to Perl5. I just read the latest Apocalypse and it
inspired me to get a parrot snapshot and look around.

What's my beef?
I don't like the rx_literal and rx_oneof ops, and I don't like how the
"on parrot strings" thread is being related to the regex engine. I know
how useless "wouldn't it be nice" messages are, so please understand my
advocacy on this is more than just raving lunacy.

Basically, I see a black-box being built in the interests of speed.
Voodoo array formats, bitmaps, and other such things to avoid actually
spelling out what the regular expression is doing *in parrot code*.

What the hell am I talking about?
Let me cut&paste some code from the great rx.ops documentation (million
thanks to the authors).

                rx_setprops P0, "i", 2
                branch $start0
        $advance:
                rx_advance P0, $fail
        $start0:
                rx_literal P0, "a", $advance

First, we set the rx engine to case-insensitive. Why is that bad? It's
setting a runtime property for what should be compile-time
unicode-character-kung-fu. Assuming your "CPU" knows what the gritty
details of unicode in the first place just feels wrong, but I digress.

Next, a branch. No problem there.

Next, a comparison between the string in P0, and whatever the hell "a"
means. In this case, it probably means at least:

0041 LATIN CAPITAL LETTER A
0061 LATIN SMALL LETTER A
FF21 FULLWIDTH LATIN CAPITAL LETTER A
FF41 FULLWIDTH LATIN SMALL LETTER A

If you include various diacritic thingies through some voodoo
/switch-fu

LATIN CAPITAL LETTER A WITH .*
LATIN SMALL LETTER A WITH .*
AKA.
0041, 0061, 00C0-00C5, 00E0-00E5, 0100-0105,
01CD-01E1, 01FA, 01FB, 0200-0203, 0226, 0227,
1E00, 1E01, 1E9A, 1EA0-1EB7, and perhaps more.

Now, the current CVS rx engine is/would do this at runtime. I read that
someone else is working on doing that at compile-time (a necessity) and
caching the results in some data structure. The "some data structure"
part bothers me. Using Perl to create "some data structure" which is
needed by C seems dubious at best. Whatever. Moving on...

What I see is that rx_literal is a speed hack to avoid compiling this
into parrot code:

given $a_utf32_code_point {
    when U+41 {}
    when U+61 {}
    when U+C0 <= $_ <= U+C5 {}
    when U+E0 <= $_ <= U+E5 {}
    # ..... on and on and on .....
    default { next ON_SOME_LOOP }
}

I think that's exactly what you should be doing! Neither parrot nor the
rx engine should try to be a full compiler. The rx engine definitely
should have opcodes in the virtual machine, but those opcodes should
simply contain state-machine/backtracking info, not godly unicode info.

If you want to optimize a regular expression, you should write that
optimizer in Perl6, or Python, or Scheme, or whatever, not in C.

So, what am I saying?

Once you squash rx_literal and friends, any attempt to benchmark the
"rx" engine really becomes a benchmark of parrot itself. When you speed
up parrot, you speed up regular expressions. Voila, no more black box.
If Parrot is just too damn slow for you, whip out libmylang and do the
nitty gritty yourself. Since this is mostly a "just don't do it" post,
no code is actually *required* from me, right? :)

Here is an example based on the rx.ops example written in
psuedo-Perl6.apocalypse.4. It may or may not be relevant. Any errors
are my own. Any formatting problems are my fault for using an inferior
mail system. This is purely for entertainment purposes, no warranty or
specification expressed or implied.

#!/usr/bin/perl6
# /ab*[cd]+/i
sub match ($string) {
    return false if $string.length < 2;
    my $r = rx::allocateinfo($string);
ADVANCE:
    loop {
	NEXT { $r.advance or last ADVANCE }
	# /a/
	given $r.current_code_point {   # whatever
	    when U+41, U+61 {} # it's an "a"
	    default { next ADVANCE }
	}
	# rx_literal used to move the current pointer
	# upon success. replace with rx_next_code_point?
	$r.next_code_point;
	$r.pushmark;	# backtracking starts here

	# /b*/
	loop {
	    NEXT { $r.next_code_point or last; $r.pushindex }
	    given $r.current_code_point {
		when U+42, U+62 {}   # it's a "b"
		# one-to-many unicode ops resolved at compile-time?
#		when $_ =~ any(toupper(U+62), tolower(U+62),
#			       totitle(U+62), tofold(U+62)) {}
		default { last }
	    }
	}

	loop {
	    NEXT {
		if $r.distance_from_last_index > 1 {
		    return true;	   # success...
		} else {
		    $r.popindex or last;   # backtrack or start over
		}
	    }
	    # /[cd]+/
	    loop {
		NEXT { $r.next_code_point or last }
		given $r.current_code_point {
		    when U+43, U+44, U+63, U+64 {}
		    # since a switch is actually executed in order, the
		    # optimizer can not only merge sequential codepoints,
		    # it can order the comparisons to make the most likely
		    # matches (ascii) be tested first, and for oddball
		    # "equivalent" character comparisons be done later.
		    #
		    # just showing what the optimizer might do with /[a-z]/i
		    #if U+41 <= $_ <= U+5A or U+61 <= $_ <= U+7A {}
		    default { last }
		}
	    }
	}
    }

    return false;
}

match("xxabbBBBBBBcdcdcdzz");
__END__


Am I fool, or an idiot? Discuss.

Mostly, I'd like to hear how either Unicode character-ranges aren't
deterministic at compile-time (I doubt that) or how crippling to
performance this would be (and by implication how slow parrot will be)
in either time or space.

Ashley Winters

__________________________________________________
Do You Yahoo!?
Great stuff seeking new owners in Yahoo! Auctions! 
http://auctions.yahoo.com

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About