info gawk


File: gawk.info,  Node: Bracket Expressions,  Next: Leftmost Longest,  Prev: Regexp Operators,  Up: Regexp

3.4 Using Bracket Expressions
=============================

As mentioned earlier, a bracket expression matches any character among
those listed between the opening and closing square brackets.

   Within a bracket expression, a "range expression" consists of two
characters separated by a hyphen.  It matches any single character that
sorts between the two characters, based upon the system's native
character set.  For example, '[0-9]' is equivalent to '[0123456789]'.
(See *note Ranges and Locales:: for an explanation of how the POSIX
standard and 'gawk' have changed over time.  This is mainly of
historical interest.)
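
   The range expression above can be tried from the shell.  The
following is a minimal sketch rather than part of the original text; the
function name 'digits_only' and the sample input are made up for
illustration.

```shell
# Hypothetical helper: print only the input lines that consist
# entirely of characters in the range '0' through '9'.
digits_only() {
    awk '/^[0-9]+$/'
}

printf '2024\n20x4\n007\n' | digits_only
```

   The second input line contains an 'x', which falls outside the range,
so only '2024' and '007' are printed.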

   With the increasing popularity of the Unicode character standard
(http://www.unicode.org), there is an additional wrinkle to consider.
Octal and hexadecimal escape sequences inside bracket expressions are
taken to represent only single-byte characters (characters whose values
fit within the range 0-255).  To match a range of characters where the
endpoints of the range are larger than 255, enter the multibyte
encodings of the characters directly.

   To include one of the characters '\', ']', '-', or '^' in a bracket
expression, put a '\' in front of it.  For example:

     [d\]]

matches either 'd' or ']'.  Additionally, if you place ']' right after
the opening '[', the closing bracket is treated as one of the characters
to be matched.
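
   Both ways of matching a literal ']' can be compared side by side.
This is a sketch under the assumption that the installed 'awk' follows
the backslash rule described here; the function names 'match_escaped'
and 'match_leading' and the sample input are hypothetical.

```shell
# Hypothetical helpers: both print the lines that contain 'd' or ']'.
match_escaped() {
    awk '/[d\]]/'       # ']' escaped with a backslash
}
match_leading() {
    awk '/[]d]/'        # ']' placed right after the opening '['
}

printf 'a]b\nxyz\nred\n' | match_escaped
printf 'a]b\nxyz\nred\n' | match_leading
```

   Each pipeline should print 'a]b' and 'red' and skip 'xyz'.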

   The treatment of '\' in bracket expressions is compatible with other
'awk' implementations and is also mandated by POSIX.  The regular
expressions in 'awk' are a superset of the POSIX specification for
Extended Regular Expressions (EREs).  POSIX EREs are based on the
regular expressions accepted by the traditional 'egrep' utility.

   "Character classes" are a feature introduced in the POSIX standard.
A character class is a special notation for describing lists of
characters that have a specific attribute, but the actual characters can
vary from country to country and/or from character set to character set.
For example, the notion of what is an alphabetic character differs
between the United States and France.

   A character class is only valid in a regexp _inside_ the brackets of
a bracket expression.  Character classes consist of '[:', a keyword
denoting the class, and ':]'.  *note Table 3.1: table-char-classes.
lists the character classes defined by the POSIX standard.


Class       Meaning
--------------------------------------------------------------------------
'[:alnum:]' Alphanumeric characters
'[:alpha:]' Alphabetic characters
'[:blank:]' Space and TAB characters
'[:cntrl:]' Control characters
'[:digit:]' Numeric characters
'[:graph:]' Characters that are both printable and visible (a space is
            printable but not visible, whereas an 'a' is both)
'[:lower:]' Lowercase alphabetic characters
'[:print:]' Printable characters (characters that are not control
            characters)
'[:punct:]' Punctuation characters (characters that are not letters,
            digits, control characters, or space characters)
'[:space:]' Space characters (these are: space, TAB, newline, carriage
            return, formfeed and vertical tab)
'[:upper:]' Uppercase alphabetic characters
'[:xdigit:]' Characters that are hexadecimal digits

Table 3.1: POSIX character classes

   For example, before the POSIX standard, you had to write
'/[A-Za-z0-9]/' to match alphanumeric characters.  If your character set
had other alphabetic characters in it, this would not match them.  With
the POSIX character classes, you can write '/[[:alnum:]]/' to match the
alphabetic and numeric characters in your character set.
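
   As a small illustration (a sketch, assuming an 'awk' that supports
POSIX character classes; the function name 'alnum_only' and the sample
input are made up), note that the class name sits inside the brackets of
a bracket expression, which is why the brackets appear doubled:

```shell
# Hypothetical helper: print lines made up only of alphanumeric
# characters, as defined by the current locale.
alnum_only() {
    awk '/^[[:alnum:]]+$/'
}

printf 'abc123\nfoo bar\nx_y\n' | alnum_only
```

   Only 'abc123' is printed; the space in 'foo bar' and the underscore
in 'x_y' are not alphanumeric characters.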

   Some utilities that match regular expressions provide a nonstandard
'[:ascii:]' character class; 'awk' does not.  However, you can simulate
such a construct using '[\x00-\x7F]'.  This matches all values
numerically between zero and 127, which is the defined range of the
ASCII character set.  Use a complemented bracket expression
('[^\x00-\x7F]') to match any single-byte character that is not in the
ASCII range.

     NOTE: Some older versions of Unix 'awk' treat '[:blank:]' like
     '[:space:]', incorrectly matching more characters than they should.
     Caveat Emptor.

   Two additional special sequences can appear in bracket expressions.
These apply to non-ASCII character sets, which can have single symbols
(called "collating elements") that are represented with more than one
character.  They can also have several characters that are equivalent
for "collating", or sorting, purposes.  (For example, in French, a plain
"e" and a grave-accented "è" are equivalent.)  These sequences are:

Collating symbols
     Multicharacter collating elements enclosed between '[.' and '.]'.
     For example, if 'ch' is a collating element, then '[[.ch.]]' is a
     regexp that matches this collating element, whereas '[ch]' is a
     regexp that matches either 'c' or 'h'.

Equivalence classes
     Locale-specific names for a list of characters that are equal.  The
     name is enclosed between '[=' and '=]'.  For example, the name 'e'
     might be used to represent all of "e," "ê," "è," and "é."  In this
     case, '[[=e=]]' is a regexp that matches any of 'e', 'ê', 'é', or
     'è'.

   These features are very valuable in non-English-speaking locales.

     CAUTION: The library functions that 'gawk' uses for regular
     expression matching currently recognize only POSIX character
     classes; they do not recognize collating symbols or equivalence
     classes.

   Inside a bracket expression, an opening bracket ('[') that does not
start a character class, collating symbol, or equivalence class is taken
literally.  This is also true of '.' and '*'.
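
   The following sketch shows these three characters losing their
special meaning inside a bracket expression.  The function name
'has_meta' and the sample input are hypothetical.  The '[' is listed
last on purpose, so that it cannot be mistaken for the start of '[.',
'[:', or '[='.

```shell
# Hypothetical helper: print lines containing a literal '.', '*',
# or '['.  Inside the brackets, none of the three is an operator.
has_meta() {
    awk '/[.*[]/'
}

printf 'a.b\na*b\na[b\nab\n' | has_meta
```

   The first three input lines are printed; 'ab' contains none of the
listed characters and is skipped.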