File: gawk.info,  Node: Regexp Operator Details,  Next: Interval Expressions,  Up: Regexp Operators

3.3.1 Regexp Operators in 'awk'
-------------------------------

The escape sequences described in *note Escape Sequences:: are valid
inside a regexp.  They are introduced by a '\' and are recognized and
converted into corresponding real characters as the very first step in
processing regexps.

   Here is a list of metacharacters.  All characters that are not escape
sequences and that are not listed here stand for themselves:

'\'
     This suppresses the special meaning of a character when matching.
     For example, '\$' matches the character '$'.

'^'
     This matches the beginning of a string.  '^@chapter' matches
     '@chapter' at the beginning of a string, for example, and can be
     used to identify chapter beginnings in Texinfo source files.  The
     '^' is known as an "anchor", because it anchors the pattern to
     match only at the beginning of the string.

     It is important to realize that '^' does not match the beginning of
     a line (the point right after a '\n' newline character) embedded in
     a string.  The condition is not true in the following example:

          if ("line1\nLINE 2" ~ /^L/) ...

'$'
     This is similar to '^', but it matches only at the end of a string.
     For example, 'p$' matches a record that ends with a 'p'.  The '$'
     is an anchor and does not match the end of a line (the point right
     before a '\n' newline character) embedded in a string.  The
     condition in the following example is not true:

          if ("line1\nLINE 2" ~ /1$/) ...
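
     Both anchor rules can be checked with a short shell session.  This
     is a sketch rather than one of the manual's own examples: it
     assumes a POSIX-compatible 'awk' on the search path, and the names
     's' and 'anchors' are made up for illustration.

          ```shell
          anchors=$(awk 'BEGIN {
              s = "line1\nLINE 2"
              print (s ~ /^l/)   # 1: the string itself starts with "l"
              print (s ~ /^L/)   # 0: "L" follows a newline, not the string start
              print (s ~ /2$/)   # 1: the string itself ends with "2"
              print (s ~ /1$/)   # 0: "1" precedes a newline, not the string end
          }')
          echo "$anchors"
          ```

     Each 'print' shows '1' when the match succeeds and '0' when it
     does not.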

'.' (period)
     This matches any single character, _including_ the newline
     character.  For example, '.P' matches any single character followed
     by a 'P' in a string.  Using concatenation, we can make a regular
     expression such as 'U.A', which matches any three-character
     sequence that begins with 'U' and ends with 'A'.

     In strict POSIX mode (*note Options::), '.' does not match the NUL
     character, which is a character with all bits equal to zero.
     Otherwise, NUL is just another character.  Other versions of 'awk'
     may not be able to match the NUL character.
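
     The newline behavior of '.' can be seen in a small test.  This is
     only a sketch; it assumes a POSIX-compatible 'awk', and the shell
     variable 'dot' is an illustrative name.  It does not exercise the
     NUL case, which varies between implementations.

          ```shell
          dot=$(awk 'BEGIN {
              print ("U\nA" ~ /U.A/)    # 1: "." matches the embedded newline
              print ("UA" ~ /U.A/)      # 0: "." must consume exactly one character
              print ("xUSAx" ~ /U.A/)   # 1: the match may occur anywhere in the string
          }')
          echo "$dot"
          ```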

'['...']'
     This is called a "bracket expression".(1)  It matches any _one_ of
     the characters that are enclosed in the square brackets.  For
     example, '[MVX]' matches any one of the characters 'M', 'V', or 'X'
     in a string.  A full discussion of what can be inside the square
     brackets of a bracket expression is given in *note Bracket
     Expressions::.

'[^'...']'
     This is a "complemented bracket expression".  The first character
     after the '[' _must_ be a '^'.  It matches any single character
     _except_ those in the square brackets.  For example, '[^awk]'
     matches any character that is not an 'a', 'w', or 'k'.
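
     The two kinds of bracket expression can be compared side by side.
     The following sketch assumes a POSIX-compatible 'awk'; the test
     strings and the shell variable 'brackets' are invented for
     illustration.

          ```shell
          brackets=$(awk 'BEGIN {
              print ("MAX" ~ /[MVX]/)       # 1: contains "M" (and "X")
              print ("abc" ~ /[MVX]/)       # 0: contains none of "M", "V", "X"
              print ("wak" ~ /[^awk]/)      # 0: every character is "a", "w", or "k"
              print ("awkward" ~ /[^awk]/)  # 1: "r" and "d" are outside the set
          }')
          echo "$brackets"
          ```

     Note that a complemented bracket expression still has to match one
     character, so it fails on a string made up entirely of excluded
     characters.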

'|'
     This is the "alternation operator" and it is used to specify
     alternatives.  The '|' has the lowest precedence of all the regular
     expression operators.  For example, '^P|[aeiouy]' matches any
     string that matches either '^P' or '[aeiouy]'.  This means it
     matches any string that starts with 'P' or contains (anywhere
     within it) a lowercase English vowel.

     The alternation applies to the largest possible regexps on either
     side.
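
     Because '|' has the lowest precedence, the '^' in '^P|[aeiouy]'
     anchors only the first alternative.  The sketch below contrasts
     this with a grouped form (parentheses are described in the next
     item).  It assumes a POSIX-compatible 'awk'; the test strings and
     the shell variable 'alt' are illustrative.

          ```shell
          alt=$(awk 'BEGIN {
              print ("Perl" ~ /^P|[aeiouy]/)    # 1: starts with "P"
              print ("gawk" ~ /^P|[aeiouy]/)    # 1: contains the vowel "a"
              print ("xPz" ~ /^P|[aeiouy]/)     # 0: "P" is not first, and no vowel
              print ("PST" ~ /^(P|[aeiouy])/)   # 1: grouping anchors both alternatives
              print ("gawk" ~ /^(P|[aeiouy])/)  # 0: starts with neither "P" nor a vowel
          }')
          echo "$alt"
          ```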

'('...')'
     Parentheses are used for grouping in regular expressions, as in
     arithmetic.  They can be used to concatenate regular expressions
     containing the alternation operator, '|'.  For example,
     '@(samp|code)\{[^}]+\}' matches both '@code{foo}' and '@samp{bar}'.
     (These are Texinfo formatting control sequences.  The '+' is
     explained further on in this list.)

     The left or opening parenthesis is always a metacharacter; to match
     one literally, precede it with a backslash.  However, the right or
     closing parenthesis is only special when paired with a left
     parenthesis; an unpaired right parenthesis is (silently) treated as
     a regular character.
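
     The Texinfo pattern above can be tried on a few sample lines.
     This is a sketch under stated assumptions: a POSIX-compatible
     'awk' is available, and the input lines (including '@var{baz}' and
     '@code{}') and the shell variable 'group' are made up for the
     demonstration.

          ```shell
          group=$(printf '%s\n' '@code{foo}' '@samp{bar}' '@var{baz}' '@code{}' |
              awk '/@(samp|code)\{[^}]+\}/ { print "match: " $0 }')
          echo "$group"
          ```

     Only the first two lines are printed.  '@var{baz}' fails because
     'var' is not one of the grouped alternatives, and '@code{}' fails
     because '[^}]+' needs at least one character between the braces.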

'*'
     This symbol means that the preceding regular expression should be
     repeated as many times as necessary to find a match, including
     zero times.  For example, 'ph*' applies the '*' symbol to the
     preceding 'h' and looks for matches of one 'p' followed by any
     number of 'h's.  This also matches just 'p' if no 'h's are present.

     There are two subtle points to understand about how '*' works.
     First, the '*' applies only to the single preceding regular
     expression component (e.g., in 'ph*', it applies just to the 'h').
     To cause '*' to apply to a larger subexpression, use parentheses:
     '(ph)*' matches 'ph', 'phph', 'phphph', and so on (as well as the
     empty string, because zero repetitions are allowed).

     Second, '*' finds as many repetitions as possible.  If the text to
     be matched is 'phhhhhhhhhhhhhhooey', 'ph*' matches all of the 'h's.
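
     Both points can be observed with the 'match()' function, which
     sets 'RSTART' and 'RLENGTH' to the position and length of the
     matched text (*note String Functions::).  The following is a
     sketch assuming a POSIX-compatible 'awk'; the shortened test
     strings and the shell variable 'star' are illustrative.

          ```shell
          star=$(awk 'BEGIN {
              s = "phhhooey"
              match(s, /ph*/)
              print RSTART, RLENGTH             # 1 4: "p" plus all three "h"s
              print substr(s, RSTART, RLENGTH)  # phhh
              match("pooey", /ph*/)
              print RSTART, RLENGTH             # 1 1: zero "h"s is still a match
              match("phphx", /(ph)*/)
              print RSTART, RLENGTH             # 1 4: the group repeats as a unit
          }')
          echo "$star"
          ```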

'+'
     This symbol is similar to '*', except that the preceding expression
     must be matched at least once.  This means that 'wh+y' would match
     'why' and 'whhy', but not 'wy', whereas 'wh*y' would match all
     three.
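
     The difference between '+' and '*' can be tabulated for the three
     words just mentioned.  This sketch assumes a POSIX-compatible
     'awk'; the shell variable 'plus' is an illustrative name.  Each
     output line shows the word, then the result for 'wh+y', then the
     result for 'wh*y'.

          ```shell
          plus=$(printf '%s\n' why whhy wy |
              awk '{ print $0, ($0 ~ /wh+y/), ($0 ~ /wh*y/) }')
          echo "$plus"
          ```

     Only the last word, 'wy', gives different results: '0' for 'wh+y'
     and '1' for 'wh*y'.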

'?'
     This symbol is similar to '*', except that the preceding expression
     can be matched either once or not at all.  For example, 'fe?d'
     matches 'fed' and 'fd', but nothing else.
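
     To test whole words against 'fe?d', the regexp has to be anchored
     at both ends; otherwise it would also succeed on any longer string
     that merely contains 'fd' or 'fed'.  The sketch below adds '^' and
     '$' for that reason.  It assumes a POSIX-compatible 'awk', and the
     shell variable 'opt' is an illustrative name.

          ```shell
          opt=$(printf '%s\n' fd fed feed |
              awk '{ print $0, ($0 ~ /^fe?d$/) }')
          echo "$opt"
          ```

     The word 'feed' is rejected because '?' allows at most one 'e'.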

'{'N'}'
'{'N',}'
'{'N','M'}'
     One or two numbers inside braces denote an "interval expression".
     If there is one number in the braces, the preceding regexp is
     repeated N times.  If there are two numbers separated by a comma,
     the preceding regexp is repeated N to M times.  If there is one
     number followed by a comma, then the preceding regexp is repeated
     at least N times:

     'wh{3}y'
          Matches 'whhhy', but not 'why' or 'whhhhy'.

     'wh{3,5}y'
          Matches 'whhhy', 'whhhhy', or 'whhhhhy' only.

     'wh{2,}y'
          Matches 'whhy', 'whhhy', and so on.

   In regular expressions, the '*', '+', and '?' operators, as well as
the braces '{' and '}', have the highest precedence, followed by
concatenation, and finally by '|'.  As in arithmetic, parentheses can
change how operators are grouped.
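
   These precedence rules can be confirmed with a few anchored tests.
The following is a sketch, assuming a POSIX-compatible 'awk'; the test
strings and the shell variable 'prec' are invented for illustration.

     ```shell
     prec=$(awk 'BEGIN {
         # Repetition binds tighter than concatenation: ab+ is a(b+).
         print ("abab" ~ /^ab+$/)     # 0
         print ("abbb" ~ /^ab+$/)     # 1
         # Parentheses make the repetition apply to the whole group.
         print ("abab" ~ /^(ab)+$/)   # 1
         # Concatenation binds tighter than alternation: ab|cd is (ab)|(cd).
         print ("acd" ~ /^(ab|cd)$/)  # 0
         print ("acd" ~ /^a(b|c)d$/)  # 1
     }')
     echo "$prec"
     ```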

   In POSIX 'awk' and 'gawk', the '*', '+', and '?' operators stand for
themselves when there is nothing in the regexp that precedes them.  For
example, '/+/' matches a literal plus sign.  However, many other
versions of 'awk' treat such a usage as a syntax error.

                     What About The Empty Regexp?

   We describe here an advanced regexp usage.  Feel free to skip it upon
first reading.

   You can supply an empty regexp constant ('//') in all places where a
regexp is expected.  Is this useful?  What does it match?

   It is useful.  It matches the (invisible) empty string at the start
and end of a string of characters, as well as the empty string between
characters.  This is best illustrated with the 'gsub()' function, which
makes global substitutions in a string (*note String Functions::).
Normal usage of 'gsub()' is like so:

     $ awk '
     > BEGIN {
     >     x = "ABC_CBA"
     >     gsub(/B/, "bb", x)
     >     print x
     > }'
     -| AbbC_CbbA

   We can use 'gsub()' to see where the empty strings are that match the
empty regexp:

     $ awk '
     > BEGIN {
     >     x = "ABC"
     >     gsub(//, "x", x)
     >     print x
     > }'
     -| xAxBxCx

   ---------- Footnotes ----------

   (1) In other literature, you may see a bracket expression referred to
as either a "character set", a "character class", or a "character list".