RE: parrot rx engine

Front page | perl.perl6.internals | Postings from January 2002

Thread Previous | Thread Next

From:

Brent Dax

Date:

January 30, 2002 09:27

Subject:

RE: parrot rx engine

Message ID:

FJELLKOPEAGHOOODKEDPGEFFDCAA.brentdax@cpan.org

Ashley Winters:
# Who the hell am I?
# I've been only a weblog-lurker till now. It's been a couple
# years since
# I last contributed to Perl5. I just read the latest Apocalypse and it
# inspired me to get a parrot snapshot and look around.

Welcome back to the land of the living.  :^)

# What's my beef?
# I don't like the rx_literal and rx_oneof ops, and I don't like how the
# "on parrot strings" thread is being related to the regex
# engine. I know
# how useless "wouldn't it be nice" messages are, so please
# understand my
# advocacy on this is more than just raving lunacy.
#
# Basically, I see a black-box being built in the interests of speed.
# Voodoo array formats, bitmaps, and other such things to avoid actually
# spelling out what the regular expression is doing *in parrot code*.
#
# What the hell am I talking about?
# Let me cut&paste some code from the great rx.ops
# documentation (million
# thanks to the authors).

And a million "you're welcome"s.  :^)  (To be fair, japhy and Angel Faus
helped with the design a LOT.)

#                 rx_setprops P0, "i", 2
#                 branch $start0
#         $advance:
#                 rx_advance P0, $fail
#         $start0:
#                 rx_literal P0, "a", $advance
#
# First, we set the rx engine to case-insensitive. Why is that bad? It's
# setting a runtime property for what should be compile-time
# unicode-character-kung-fu. Assuming your "CPU" knows what the gritty
# details of unicode in the first place just feels wrong, but I digress.

That "i" does a once-off case-folding operation on the target string.
All other input to the engine MUST already be case-folded for speed.

# Next, a branch. No problem there.
#
# Next, a comparison between the string in P0, and whatever the hell "a"
# means. In this case, it probably means at least:
#
# 0041 LATIN CAPITAL LETTER A
# 0061 LATIN SMALL LETTER A
# FF21 FULLWIDTH LATIN CAPITAL LETTER A
# FF41 FULLWIDTH LATIN SMALL LETTER A

Which have been case-folded and normalized (Normalization Form KC,
probably) to LATIN SMALL LETTER A.

# If you include various diacritic thingies through some voodoo
# /switch-fu
#
# LATIN CAPITAL LETTER A WITH .*
# LATIN SMALL LETTER A WITH .*
# AKA.
# 0041, 0061, 00C0-00C5, 00E0-00E5, 0100-0105,
# 01CD-01E1, 01FA, 01FB, 0200-0203, 0226, 0227,
# 1E00, 1E01, 1E9A, 1EA0-1EB7, and perhaps more.

Once again, the switch-fu plus normalization will have converted all of
that to LATIN SMALL LETTER A.

# Now, the current CVS rx engine is/would do this at runtime. I
# read that
# someone else is working on doing that at compile-time (a
# necessity) and
# caching the results in some data structure. The "some data structure"
# part bothers me. Using Perl to create "some data structure" which is
# needed by C seems dubious at best. Whatever. Moving on...

Why?  If you know what info is contained in the "some data structure"
and it gives you a speedup, who cares?

# What I see is that rx_literal is a speed hack to avoid compiling this
# into parrot code:

That's more or less true.  Speed hacks are extremely important in regex
engines.

# given $a_utf32_code_point {
#     when U+41 {}
#     when U+61 {}
#     when U+C0 <= $_ <= U+C5 {}
#     when U+E0 <= $_ <= U+E5 {}
#     # ..... on and on and on .....
#     default { next ON_SOME_LOOP }
# }
#
# I think that's exactly what you should be doing! Neither
# parrot nor the
# rx engine should try to be a full compiler. The rx engine definitely
# should have opcodes in the virtual machine, but those opcodes should
# simply contain state-machine/backtracking info, not godly
# unicode info.

This "godly Unicode info" is actually limited to "call a transcoding
function, a normalizing function, and a case-folding function if /i is
used".  All of this information is built-in to the string library from
the start.

# If you want to optimize a regular expression, you should write that
# optimizer in Perl6, or Python, or Scheme, or whatever, not in C.

C will almost always be faster with this sort of thing than Perl or any
other language, simply because of all the extra layers of Stuff you'll
have to go through in a higher-level language.

# So, what am I saying?
#
# Once you squash rx_literal and friends, any attempt to benchmark the
# "rx" engine really becomes a benchmark of parrot itself. When
# you speed
# up parrot, you speed up regular expressions. Voila, no more black box.

This is true already.  The regex engine consists of normal opcodes, so a
fast, general Parrot opcode dispatch speedup will speed up regexes, too.

# If Parrot is just too damn slow for you, whip out libmylang and do the
# nitty gritty yourself. Since this is mostly a "just don't do it" post,
# no code is actually *required* from me, right? :)
#
# Here is an example based on the rx.ops example written in
# psuedo-Perl6.apocalypse.4. It may or may not be relevant. Any errors
# are my own. Any formatting problems are my fault for using an inferior
# mail system. This is purely for entertainment purposes, no warranty or
# specification expressed or implied.
#
# #!/usr/bin/perl6
# # /ab*[cd]+/i
# sub match ($string) {
#     return false if $string.length < 2;
#     my $r = rx::allocateinfo($string);
# ADVANCE:
#     loop {
# 	NEXT { $r.advance or last ADVANCE }
# 	# /a/
# 	given $r.current_code_point {   # whatever
# 	    when U+41, U+61 {} # it's an "a"
# 	    default { next ADVANCE }
# 	}
# 	# rx_literal used to move the current pointer
# 	# upon success. replace with rx_next_code_point?
# 	$r.next_code_point;
# 	$r.pushmark;	# backtracking starts here
#
# 	# /b*/
# 	loop {
# 	    NEXT { $r.next_code_point or last; $r.pushindex }
# 	    given $r.current_code_point {
# 		when U+42, U+62 {}   # it's a "b"
# 		# one-to-many unicode ops resolved at compile-time?
# #		when $_ =~ any(toupper(U+62), tolower(U+62),
# #			       totitle(U+62), tofold(U+62)) {}
# 		default { last }
# 	    }
# 	}
#
# 	loop {
# 	    NEXT {
# 		if $r.distance_from_last_index > 1 {
# 		    return true;	   # success...
# 		} else {
# 		    $r.popindex or last;   # backtrack or start over
# 		}
# 	    }
# 	    # /[cd]+/
# 	    loop {
# 		NEXT { $r.next_code_point or last }
# 		given $r.current_code_point {
# 		    when U+43, U+44, U+63, U+64 {}
# 		    # since a switch is actually executed in order, the
# 		    # optimizer can not only merge sequential
# codepoints,
# 		    # it can order the comparisons to make the
# most likely
# 		    # matches (ascii) be tested first, and for oddball
# 		    # "equivalent" character comparisons be done later.
# 		    #
# 		    # just showing what the optimizer might do
# with /[a-z]/i
# 		    #if U+41 <= $_ <= U+5A or U+61 <= $_ <= U+7A {}
# 		    default { last }
# 		}
# 	    }
# 	}
#     }
#
#     return false;
# }
#
# match("xxabbBBBBBBcdcdcdzz");
# __END__
#
#
# Am I fool, or an idiot? Discuss.
#
# Mostly, I'd like to hear how either Unicode character-ranges aren't
# deterministic at compile-time (I doubt that) or how crippling to

One word: locale.

# performance this would be (and by implication how slow parrot will be)
# in either time or space.

Lemmie put it this way:  rx_is_w and rx_is_s used to consist of several
range tests.  Switching them to bitmaps doubled their speed.

I understand your argument and see how it would be a good thing, but
everything I've experimented with has led me to believe that your way
won't give us the speed we need.

--Brent Dax
brentdax@cpan.org
Parrot Configure pumpking and regex hacker
Check out the Parrot FAQ: http://www.panix.com/~ziggy/parrot.html (no,
it's not mine)

<obra> mmmm. hawt sysadmin chx0rs
<lathos> This is sad. I know of *a* hawt sysamin chx0r.
<obra> I know more than a few.
<lathos> obra: There are two? Are you sure it's not the same one?

Thread Previous | Thread Next