Re: Regex and Matched Delimiters

List: perl.perl6.language
From: Larry Wall
Date: April 22, 2002 18:54
Subject: Re: Regex and Matched Delimiters
Message ID: 200204230153.SAA20791@wall.org

Me writes:
: > Very nice (but, I assume you meant {$foo data})!
: 
: I didn't mean that (even if I should have).
: 
: Aiui, Mike's final suggestion was that parens end up
: doing all the (ops data) tricks, and braces are used
: purely to do code insertions. (I really liked that idea.)
: 
: So:
: 
: Perl 5            Perl6
: (data)            ( data)
: (?opsdata)        (ops data)
: ({})              {}  

Hmm.  Let me spill a few beans about where I'm going with A5.  I've
been thinking similar thoughts about the problem of overloading parens
so heavily in Perl 5, but I'm going in a slightly different direction
with it.  The basic principles for the new regexen are:

    * Parens always capture.
    * Braces are always closures.
    * Square brackets are always character classes.
    * Angle brackets are always metasyntax (along with backslash).

So a first whack at the differences might be:

    Old			New
    ---			---
    //			/<prior>/  ???
    ?pat?		/<?f:pat>/  ???
    /pat/i		m:i/pat/ or /<?i:pat>/ or even m<?i:pat> ???
    /pat/x		/pat/
    /^pat$/m		/^^pat$$/
    /./s		/<any>/ or /<.>/ ???

    \p{prop}		<+prop>  ???
    \P{prop}		<-prop>  ???
    space		<sp> (or \h for "horizontal"?)
    {n,m}		<n,m>

    \t			also <tab>
    \n			also <lf> or <nl> (latter matching logical newline)
    \r			also <cr>
    \f			also <ff>
    \a			also <bell>
    \e			also <esc>
    \033		same
    \x1B		same
    \x{263a}		\x<263a> ???
    \c[			same
    \N{name}		<name>
    \l			same
    \u			same
    \Lstring\E		\L<string>
    \Ustring\E		\U<string>
    \E			gone
    [\040\t]		\h	plus any Unicode horizontal whitespace
    [\r\n\ck]		\v      plus any Unicode vertical whitespace

    \b			same
    \B			same
    \A			^
    \Z			same?
    \z			$
    \G			<pos>, but assumed in nested patterns?
 
    \1			$1

    \Q$var\E		$var    always assumed literal, so $1 is literal backref
    $var		<$var>  assumed to be regex
    =~ $re		=~ /<$re>/   ouch?

    (??{$rule})		<rule>
    (?{ code })		{ code } with failure semantics
    (?#...)		{"..."}		:-)
    (?:...)		<:...>
    (?=...)		<before: ...>
    (?!...)		<!before: ...>
    (?<=...)		<after: ...>
    (?<!...)		<!after: ...>
    (?>...)		<grab: ...>
    (?(cond)t|f)	Not sure.  Could just use { if ... }
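
For comparison, here is a minimal Perl 5 sketch of what several entries in the "Old" column do today. It only exercises existing Perl 5 syntax and makes no claim about how the proposed replacements will behave. The sample strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "foobar foobaz";   # sample data, made up for illustration

# (?=...) positive lookahead: "foo" only when "bar" follows.
my @before = $text =~ /foo(?=bar)/g;

# (?!...) negative lookahead: "foo" only when "bar" does not follow.
my @not_before = $text =~ /foo(?!bar)/g;

# (?<=...) positive lookbehind: "ba." only when "foo" precedes it.
my @after = $text =~ /(?<=foo)ba./g;

# (?:...) groups without capturing, so the second group is the one in $1.
my ($captured) = $text =~ /(?:foo)(ba.)/;

# (?>...) atomic group: a+ takes every "a" and never gives one back,
# so the trailing "a" has nothing left to match.
my $atomic_matches = ("aaa" =~ /^(?>a+)a$/) ? 1 : 0;

# /m lets ^ and $ match at line boundaries inside the string.
my @line_starts = "one\ntwo\n" =~ /^(\w+)$/mg;

print "lookahead:     @before\n";        # foo
print "neg lookahead: @not_before\n";    # foo
print "lookbehind:    @after\n";         # bar baz
print "non-capturing: $captured\n";      # bar
print "atomic:        $atomic_matches\n"; # 0
print "multiline:     @line_starts\n";   # one two
```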

Obviously the <word> and <word:...> syntaxes will be user extensible.
We have to be able to support full grammars.  I consider it a feature
that <foo> looks like a non-terminal in standard BNF notation.  I do
not consider it a misfeature that <foo> resembles an HTML or XML tag,
since most of those languages need to be matched with a fancy rule
named <tag> anyway.

An interesting idea would be that if you say

    m<foo: pat>

or

    m{code}

it's as if you said

    m/<foo: pat>/
    
or
    
    m/{code}/

The latter is particularly interesting to me in that I can see uses for
patterns that are Perl code at the top level rather than a regex
literal.  Any closure within a regular expression has full access to
the current state object for the match.  So most of the RFCs proposing
ad hoc mechanisms for saving submatches in various kinds of variables
can be handled with closures.

    /(...)(...)(...) { @array = .all } /

or

    /(...) { $first  = $+ }
     (...) { $second = $+ }
     (...) { $third  = $+ }/

or

    /<IF> (<COND>) (<BLOCK>) { .node = ["if",$1,$2] } /  # shades of yacc

or whatever.  Could have a <$foo=...> as syntactic sugar, perhaps.
But we need the general mechanism for building up parse trees of
arrays of hashes of arrays of arrays of hashes of arrays of hashes of...
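
As a rough Perl 5 approximation of the closure examples above, the sketch below uses the existing `(?{ code })` construct together with `$^N` (the most recently closed group) to save submatches as the match proceeds, and then builds a yacc-style node from the captures. This is not the proposed syntax, and the input strings, variable names, and the tiny "if" pattern are assumptions made for illustration.

```perl
use strict;
use warnings;

# Package variables, so the embedded code blocks can assign to them
# regardless of Perl 5 version.
our ($first, $second, $third);

my $input = "abcdefghi";   # sample data, made up for illustration

# Each (?{ ... }) runs right after the group before it closes;
# $^N holds the text of that most recently closed group.
if ($input =~ /(...) (?{ $first  = $^N })
               (...) (?{ $second = $^N })
               (...) (?{ $third  = $^N })/x) {
    print "$first $second $third\n";   # abc def ghi
}

# Building a parse-tree node from captures, done after the match here
# rather than inside it.
my $src = "if (x) {y}";    # hypothetical toy input
my $node;
if ($src =~ /\bif \s* \( ([^)]*) \) \s* \{ ([^}]*) \}/x) {
    $node = ["if", $1, $2];
}
print join(",", @$node), "\n";         # if,x,y
```

One caveat with the Perl 5 construct: an assignment made inside `(?{ ... })` is not undone if the regex later backtracks past it, unless it is written with `local`.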

I haven't decided yet whether matches embedded in the closure should
automatically pick up where the outer match is, or whether there should
be some explicit match op to mean that, much like \G only better.  I'm
thinking when the current topic is a match state, we automatically
continue where we left off, and require explicit =~ to start an unrelated
match.
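
For reference, this is roughly how "pick up where the last match left off" is spelled in Perl 5 today, using `\G` with the `/gc` modifiers on the current topic. It is a minimal sketch of the existing idiom, not of the behavior being considered above. The token names and sample input are assumptions.

```perl
use strict;
use warnings;

my $src = "x = 42";   # sample input, made up for illustration
my @tokens;

for ($src) {          # alias $_ to $src so the matches below share its pos()
    while (1) {
        # \G anchors at the position where the previous match ended;
        # /c keeps that position when a match fails instead of resetting it.
        if    (/\G\s+/gc)            { next }
        elsif (/\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT:$1" }
        elsif (/\G(\d+)/gc)          { push @tokens, "NUM:$1" }
        elsif (/\G(=)/gc)            { push @tokens, "OP:$1" }
        else                         { last }   # nothing matched: stop
    }
}

print "@tokens\n";    # IDENT:x OP:= NUM:42
```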

I also haven't committed to any particular mechanism for defining a
set of related rules in a grammar.  Obviously it needs to be a good
enough mechanism to parse Perl and its variants, which means it
probably needs to be OO based, and you make new grammars by derivation
from the base grammar and overriding the rules you want to change.
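
One way to picture "derive a grammar and override the rules you want to change" with nothing but Perl 5 classes is sketched below. Each rule is a method returning a compiled pattern, and a subclass replaces a single rule. This is only an illustration of the idea, not the mechanism under design. The package names `BaseGrammar` and `HexGrammar`, the rule names, and the inputs are all hypothetical.

```perl
use strict;
use warnings;

package BaseGrammar;                 # hypothetical base grammar
sub new    { my ($class) = @_; return bless {}, $class }
sub number { qr/\d+/ }               # rule: decimal integer
sub ident  { qr/[A-Za-z_]\w*/ }      # rule: identifier
sub assign {                         # rule composed from the other rules
    my ($self) = @_;
    my ($ident, $number) = ($self->ident, $self->number);
    return qr/^($ident)\s*=\s*($number)$/;
}

package HexGrammar;                  # hypothetical dialect
our @ISA = ('BaseGrammar');
sub number { qr/0x[0-9A-Fa-f]+/ }    # override just this one rule

package main;

my %result;
for my $grammar (BaseGrammar->new, HexGrammar->new) {
    for my $src ("x = 42", "x = 0x2A") {
        # assign() looks up number() on the object, so the subclass
        # override is used without touching the assign rule itself.
        my $ok = ($src =~ $grammar->assign) ? "matches" : "fails";
        $result{ ref $grammar }{$src} = $ok;
        print ref($grammar), ": '$src' $ok\n";
    }
}
```

The point of the design is that `assign` never changes: it picks up whichever `number` rule the object's class provides.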

Sorry if this is a bit delirious--I'm fighting off some kind of
infection, and my nights have been shortchanged lately by the
neighborhood panhandler who doesn't seem to understand either
complicated concepts like "bedtime" or simple concepts like "no".

Larry
