13 Posix Regular Expressions
This whole section was written by Dorai Sitaram. It consists of the documentation of the pregexp package, which may be found at http://www.ccs.neu.edu/~dorai/pregexp/pregexp.html.
The regexp notation supported is modeled on Perl’s, and includes such powerful directives as numeric and nongreedy quantifiers, capturing and non-capturing clustering, POSIX character classes, selective case- and space-insensitivity, backreferences, alternation, backtrack pruning, positive and negative lookahead and lookbehind, in addition to the more basic directives familiar to all regexp users.
A regexp is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which we will call the text string. The text string is treated as raw text and not as a pattern.
Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern "abc" matches a string that contains the characters a, b, and c in succession.
In the regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern "a.c", the characters a and c do stand for themselves, but the metacharacter . can match any character (other than newline). Therefore, the pattern "a.c" matches an a, followed by any character, followed by a c.
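The two patterns above can be tried directly against a few text strings. The following is a minimal sketch, assuming the pregexp-match procedure of the pregexp package, which returns a list of matched substrings on success and #f on failure. The text strings are made up for illustration.

```scheme
;; Assumes pregexp-match from the pregexp package is available.

;; Literal pattern: each character matches itself, anywhere in the text.
(pregexp-match "abc" "xxabcxx")   ; succeeds on the substring "abc"

;; The metacharacter . stands for any one character other than newline.
(pregexp-match "a.c" "abc")       ; succeeds: . matches b
(pregexp-match "a.c" "a-c")       ; succeeds: . matches -
(pregexp-match "a.c" "ac")        ; fails: no character between a and c
```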
If we need to match the character . itself, we escape it, i.e., precede it with a backslash (\). The character sequence \. is thus a metasequence, since it doesn’t match itself but rather just the single character . alone. So, to match a followed by a literal . followed by c, we use the regexp pattern "a\\.c".(4)
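The difference between the escaped and the unescaped dot can be sketched as follows, again assuming pregexp-match from the pregexp package and using made-up text strings. Note that the backslash is doubled only because the pattern is written as a Scheme string literal. The matcher itself sees a single backslash.

```scheme
;; Assumes pregexp-match from the pregexp package is available.

;; "a\\.c" is the four-character pattern  a \ . c
(pregexp-match "a\\.c" "a.c")   ; succeeds: \. matches the literal .
(pregexp-match "a\\.c" "abc")   ; fails: b is not a literal .

;; The unescaped pattern accepts both text strings.
(pregexp-match "a.c" "a.c")     ; succeeds
(pregexp-match "a.c" "abc")     ; succeeds
```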
Another example of a metasequence is \t, which is a readable way to represent the tab character.
We will call the string representation of a regexp the U-regexp, where U can be taken to mean Unix-style or universal, because this notation for regexps is universally familiar. Our implementation uses an intermediate tree-like representation called the S-regexp, where S can stand for Scheme, symbolic, or s-expression. S-regexps are more verbose and less readable than U-regexps, but they are much easier for Scheme’s recursive procedures to navigate.
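The relationship between the two representations can be sketched as follows, assuming the pregexp procedure of the pregexp package, which takes a U-regexp and returns the corresponding S-regexp. The variable name cxr-re is hypothetical. In the reference pregexp implementation the compiled value is a tree along the lines of (:sub (:or (:seq #\c :any #\r))), but its exact shape is an implementation detail and may differ between versions.

```scheme
;; Assumes pregexp and pregexp-match from the pregexp package.

;; Convert the U-regexp (a string) into its S-regexp form once.
;; cxr-re is a hypothetical name chosen for this example.
(define cxr-re (pregexp "c.r"))

;; The matcher accepts the precompiled S-regexp ...
(pregexp-match cxr-re "car")

;; ... as well as the original U-regexp string.
(pregexp-match "c.r" "cdr")
```

Compiling once is worthwhile when the same pattern is matched against many text strings, since the string form would otherwise be converted on each call.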
13.1 Regular Expressions Procedures
13.2 Regular Expressions Pattern Language
13.3 An Extended Example
This document was generated on March 31, 2014 using texi2html 5.0.