File: grep.info,  Node: Problematic Expressions,  Next: Character Encoding,  Prev: Basic vs Extended,  Up: Regular Expressions

3.7 Problematic Regular Expressions
===================================

Some strings are “invalid regular expressions” and cause ‘grep’ to issue
a diagnostic and fail.  For example, ‘xy\1’ is invalid because there is
no parenthesized subexpression for the back-reference ‘\1’ to refer to.

   Also, some regular expressions have “unspecified behavior” and should
be avoided even if ‘grep’ does not currently diagnose them.  For
example, ‘xy\0’ has unspecified behavior because ‘0’ is not a special
character and ‘\0’ is not a special backslash expression (*note Special
Backslash Expressions::).  Unspecified behavior can be particularly
problematic because the set of matched strings might be only partially
specified, or not be specified at all, or the expression might even be
invalid.

   The following regular expression constructs are invalid on all
platforms conforming to POSIX, so portable scripts can assume that
‘grep’ rejects these constructs:

   • A basic regular expression containing a back-reference ‘\N’
     preceded by fewer than N closing parentheses.  For example,
     ‘\(a\)\2’ is invalid.

   • A bracket expression containing ‘[:’ that does not start a
     character class; and similarly for ‘[=’ and ‘[.’.  For example,
     ‘[a[:b]’ and ‘[a[:ouch:]b]’ are invalid.
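
   Because an invalid expression makes ‘grep’ issue a diagnostic and
fail, a script can test a pattern before relying on it.  The following
is a minimal sketch, not part of ‘grep’ itself: the helper name
‘is_valid_bre’ is hypothetical, and the sketch assumes the usual exit
status convention of 0 for a match, 1 for no match, and a greater value
for an error.

```shell
# Hypothetical helper: succeed if grep accepts PATTERN as a basic
# regular expression, fail if grep rejects it with a diagnostic.
is_valid_bre() {
    status=0
    # Empty input: a valid pattern yields "no match" (status 1);
    # an invalid pattern yields an error status greater than 1.
    grep -e "$1" </dev/null >/dev/null 2>&1 || status=$?
    [ "$status" -le 1 ]
}

# Try a valid pattern and the invalid examples from the text above.
for pattern in 'xy' '\(a\)\1' 'xy\1' '\(a\)\2' '[a[:b]' '[a[:ouch:]b]'; do
    if is_valid_bre "$pattern"; then
        printf 'accepted: %s\n' "$pattern"
    else
        printf 'rejected: %s\n' "$pattern"
    fi
done
```

   Reading from ‘/dev/null’ keeps the check independent of any input
data, so only the pattern itself determines the result.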

   GNU ‘grep’ treats the following constructs as invalid.  However,
other ‘grep’ implementations might allow them, so portable scripts
should not rely on their being invalid:

   • Unescaped ‘\’ at the end of a regular expression.

   • Unescaped ‘[’ that does not start a bracket expression.

   • A ‘\{’ in a basic regular expression that does not start an
     interval expression.

   • A basic regular expression with unbalanced ‘\(’ or ‘\)’, or an
     extended regular expression with unbalanced ‘(’.

   • In the POSIX locale, a range expression like ‘z-a’ that represents
     zero elements.  A non-GNU ‘grep’ might treat it as a valid range
     that never matches.

   • An interval expression with a repetition count greater than 32767.
     (The portable POSIX limit is 255, and even interval expressions
     with smaller counts can be impractically slow on all known
     implementations.)

   • A bracket expression that contains at least three elements, the
     first and last of which are both ‘:’, or both ‘.’, or both ‘=’.
     For example, a non-GNU ‘grep’ might treat ‘[:alpha:]’ like
     ‘[[:alpha:]]’, or like ‘[:ahlp]’.
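
   Since portable scripts cannot rely on these constructs being
rejected, it is usually simpler to write the intended match in a form
that every implementation should read the same way.  The sketch below
shows some such spellings; the sample lines and variable names are
illustrative assumptions only.

```shell
# Three sample lines: one with '[', one with '{', one ending in '\'.
sample=$(printf '%s\n' 'a[1]' 'x{2}' 'end\')

# A literal '[' written as a one-element bracket expression.
lit_bracket=$(printf '%s\n' "$sample" | grep -c '[[]')

# In a basic regular expression an unescaped '{' is an ordinary character.
lit_brace=$(printf '%s\n' "$sample" | grep -c 'x{2}')

# A literal backslash at end of line: escape it, then anchor with '$'.
lit_backslash=$(printf '%s\n' "$sample" | grep -c '\\$')

# A character class must sit inside a bracket expression.
letters=$(printf '%s\n' "$sample" | grep -c '[[:alpha:]]')

printf '%s %s %s %s\n' "$lit_bracket" "$lit_brace" "$lit_backslash" "$letters"
```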

   The following constructs have well-defined behavior in GNU ‘grep’.
However, they have unspecified behavior elsewhere, so portable scripts
should avoid them:

   • Special backslash expressions like ‘\b’, ‘\<’, and ‘\]’.  *Note
     Special Backslash Expressions::.

   • A basic regular expression that uses ‘\?’, ‘\+’, or ‘\|’.

   • An extended regular expression that uses back-references.

   • An empty regular expression, subexpression, or alternative.  For
     example, ‘(a|bc|)’ is not portable; a portable equivalent is
     ‘(a|bc)?’.

   • In a basic regular expression, an anchoring ‘^’ that appears
     directly after ‘\(’, or an anchoring ‘$’ that appears directly
     before ‘\)’.

   • In a basic regular expression, a repetition operator that directly
     follows another repetition operator.

   • In an extended regular expression, unescaped ‘{’ that does not
     begin a valid interval expression.  GNU ‘grep’ treats the ‘{’ as an
     ordinary character.

   • A null character or an encoding error in either pattern or input
     data.  *Note Character Encoding::.

   • An input file that ends in a non-newline character, where GNU
     ‘grep’ silently supplies a newline.
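
   Several of the constructs above have straightforward portable
spellings.  The following sketch illustrates a few of them and is not a
complete catalogue; the sample input and variable names are
illustrative assumptions.

```shell
input=$(printf '%s\n' 'ab' 'aab' 'b' 'cd')

# Instead of the basic regular expression 'a\+b', use an interval.
one_or_more=$(printf '%s\n' "$input" | grep -c 'a\{1,\}b')

# Instead of 'a\?b', use an interval (-x matches whole lines only).
zero_or_one=$(printf '%s\n' "$input" | grep -c -x 'a\{0,1\}b')

# Instead of 'ab\|cd', give each alternative its own -e option.
either=$(printf '%s\n' "$input" | grep -c -e 'ab' -e 'cd')

# Instead of the empty alternative in '(a|bc|)', make the group optional.
optional=$(printf '%s\n' 'x' 'xa' 'xbc' 'xd' | grep -E -c -x 'x(a|bc)?')

printf '%s %s %s %s\n' "$one_or_more" "$zero_or_one" "$either" "$optional"
```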

   The following constructs have unspecified behavior, in both GNU and
other ‘grep’ implementations.  Scripts should avoid them whenever
possible:

   • A backslash escaping an ordinary character, unless it is a
     back-reference like ‘\1’ or a special backslash expression like
     ‘\<’ or ‘\b’.  *Note Special Backslash Expressions::.  For example,
     ‘\x’ has unspecified behavior now, and a future version of ‘grep’
     might specify ‘\x’ to have a new behavior.

   • A repetition operator that appears directly after an anchor, or at
     the start of a complete regular expression, parenthesized
     subexpression, or alternative.  For example, ‘+|^*(+a|?-b)’ has
     unspecified behavior, whereas ‘\+|^\*(\+a|\?-b)’ is portable.

   • A range expression outside the POSIX locale.  For example, in some
     locales ‘[a-z]’ might match some characters that are not lowercase
     letters, or might not match some lowercase letters, or might be
     invalid.  With GNU ‘grep’ it is not documented whether these range
     expressions use native code points, or use the collating sequence
     specified by the ‘LC_COLLATE’ category, or have some other
     interpretation.  Outside the POSIX locale, it is portable to use
     ‘[[:lower:]]’ to match a lowercase letter, or
     ‘[abcdefghijklmnopqrstuvwxyz]’ to match an ASCII lowercase letter.
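
   The alternatives to a locale-dependent range can be compared side by
side.  The sketch below is illustrative only: the sample words and
variable names are assumptions, and setting ‘LC_ALL=C’ is shown as one
way to run ‘grep’ in the POSIX locale, where ‘[a-z]’ is well defined.
The last command escapes ‘+’ in an extended regular expression so that
it is matched literally rather than read as a repetition operator.

```shell
words=$(printf '%s\n' 'abc' 'ABC' '123')

# A range expression, forced into the POSIX locale.
range_c=$(printf '%s\n' "$words" | LC_ALL=C grep -c '[a-z]')

# A character class, which is portable in any locale.
class_any=$(printf '%s\n' "$words" | grep -c '[[:lower:]]')

# An explicit list of the ASCII lowercase letters.
list_any=$(printf '%s\n' "$words" | grep -c '[abcdefghijklmnopqrstuvwxyz]')

# A literal '+' in an extended regular expression.
literal_plus=$(printf '%s\n' 'a+b' 'ab' | grep -E -c 'a\+b')

printf '%s %s %s %s\n' "$range_c" "$class_any" "$list_any" "$literal_plus"
```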
