3.4 Using Bracket Expressions
As mentioned earlier, a bracket expression matches any character amongst those listed between the opening and closing square brackets.
Within a bracket expression, a range expression consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, based upon the system’s native character
set. For example, ‘[0-9]’ is equivalent to ‘[0123456789]’.
(See Regexp Ranges and Locales: A Long Sad Story, for an explanation of how the POSIX standard and gawk have changed over time. This is mainly of historical interest.)
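As a minimal sketch of the equivalence between ‘[0-9]’ and ‘[0123456789]’, the following shell snippet filters the same input with both forms. The function names has_digit_range and has_digit_list and the sample input are invented for illustration; the sketch assumes a POSIX-conforming awk such as gawk is installed as awk.

```shell
# Print only the input lines that contain at least one digit.
# The range form and the explicit list select the same lines.
has_digit_range() {
    awk '/[0-9]/'
}

has_digit_list() {
    awk '/[0123456789]/'
}

printf 'abc\nroom 42\nx9\n' | has_digit_range
printf 'abc\nroom 42\nx9\n' | has_digit_list
```

Both pipelines print the lines ‘room 42’ and ‘x9’ and skip ‘abc’.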
To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a bracket expression, put a ‘\’ in front of it. For example:
[d\]]
matches either ‘d’ or ‘]’.
This treatment of ‘\’ in bracket expressions is compatible with other awk implementations and is also mandated by POSIX.
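The escaping rule can be tried out with a short sketch like the one below. The function name mask_d_or_bracket and the sample string are hypothetical; the sketch assumes gawk or another awk that follows the POSIX treatment of ‘\’ inside bracket expressions.

```shell
# Replace every 'd' or ']' with 'X'. Inside the bracket expression,
# '\]' stands for a literal ']' rather than the end of the list.
mask_d_or_bracket() {
    awk '{ gsub(/[d\]]/, "X"); print }'
}

echo 'a]bd[c' | mask_d_or_bracket
```

With gawk this prints ‘aXbX[c’: the ‘]’ and the ‘d’ are replaced, while the ‘[’ is left alone because it is not in the list.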
The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.
A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 3.1 lists the character classes defined by the POSIX standard.
Class | Meaning |
---|---|
[:alnum:] | Alphanumeric characters. |
[:alpha:] | Alphabetic characters. |
[:blank:] | Space and TAB characters. |
[:cntrl:] | Control characters. |
[:digit:] | Numeric characters. |
[:graph:] | Characters that are both printable and visible. (A space is printable but not visible, whereas an ‘a’ is both.) |
[:lower:] | Lowercase alphabetic characters. |
[:print:] | Printable characters (characters that are not control characters). |
[:punct:] | Punctuation characters (characters that are not letters, digits, control characters, or space characters). |
[:space:] | Space characters (such as space, TAB, and formfeed, to name a few). |
[:upper:] | Uppercase alphabetic characters. |
[:xdigit:] | Characters that are hexadecimal digits. |
Table 3.1: POSIX Character Classes
For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
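To see several of the classes from Table 3.1 in use, here is a small sketch that counts how many characters of each kind appear on a line. The function name count_classes and the sample input are invented for illustration. It assumes an awk that supports POSIX character classes (gawk does), and the input is plain ASCII so that the counts do not depend on the locale.

```shell
# Count characters per class on each input line.
# gsub() returns the number of substitutions it made; replacing
# each match with "&" (the matched text) leaves the string unchanged.
count_classes() {
    awk '{
        line = $0
        alpha = gsub(/[[:alpha:]]/, "&", line)
        digit = gsub(/[[:digit:]]/, "&", line)
        punct = gsub(/[[:punct:]]/, "&", line)
        space = gsub(/[[:space:]]/, "&", line)
        printf "alpha=%d digit=%d punct=%d space=%d\n", alpha, digit, punct, space
    }'
}

echo 'Hello, World 42!' | count_classes
```

For this input the output is ‘alpha=10 digit=2 punct=2 space=2’. Note that each class name appears inside a second pair of brackets, as in /[[:alpha:]]/; the ‘[:alpha:]’ notation by itself is not a complete bracket expression.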
Two additional special sequences can appear in bracket expressions. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character. They can also have several characters that are equivalent for collating, or sorting, purposes. (For example, in French, a plain “e” and a grave-accented “è” are equivalent.) These sequences are:
- Collating symbols
  Multicharacter collating elements enclosed between ‘[.’ and ‘.]’. For example, if ‘ch’ is a collating element, then [[.ch.]] is a regexp that matches this collating element, whereas [ch] is a regexp that matches either ‘c’ or ‘h’.
- Equivalence classes
  Locale-specific names for a list of characters that are equal. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “è,” and “é.” In this case, [[=e=]] is a regexp that matches any of ‘e’, ‘é’, or ‘è’.
These features are very valuable in non-English-speaking locales.
CAUTION: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.