File: grep.info,  Node: Problematic Expressions,  Next: Character Encoding,  Prev: Basic vs Extended,  Up: Regular Expressions

3.7 Problematic Regular Expressions
===================================

Some strings are “invalid regular expressions” and cause ‘grep’ to issue
a diagnostic and fail.  For example, ‘xy\1’ is invalid because there is
no parenthesized subexpression for the back-reference ‘\1’ to refer to.

   Also, some regular expressions have “unspecified behavior” and should
be avoided even if ‘grep’ does not currently diagnose them.  For
example, ‘xy\0’ has unspecified behavior because ‘0’ is not a special
character and ‘\0’ is not a special backslash expression (*note Special
Backslash Expressions::).  Unspecified behavior can be particularly
problematic because the set of matched strings might be only partially
specified, or not be specified at all, or the expression might even be
invalid.

   The following regular expression constructs are invalid on all
platforms conforming to POSIX, so portable scripts can assume that
‘grep’ rejects these constructs:

   • A basic regular expression containing a back-reference ‘\N’
     preceded by fewer than N closing parentheses.  For example,
     ‘\(a\)\2’ is invalid.

   • A bracket expression containing ‘[:’ that does not start a
     character class; and similarly for ‘[=’ and ‘[.’.  For example,
     ‘[a[:b]’ and ‘[a[:ouch:]b]’ are invalid.

   GNU ‘grep’ treats the following constructs as invalid.  However,
other ‘grep’ implementations might allow them, so portable scripts
should not rely on their being invalid:

   • Unescaped ‘\’ at the end of a regular expression.

   • Unescaped ‘[’ that does not start a bracket expression.

   • A ‘\{’ in a basic regular expression that does not start an
     interval expression.

   • A basic regular expression with unbalanced ‘\(’ or ‘\)’, or an
     extended regular expression with unbalanced ‘(’.

   • In the POSIX locale, a range expression like ‘z-a’ that represents
     zero elements.
     A non-GNU ‘grep’ might treat it as a valid range that never
     matches.

   • An interval expression with a repetition count greater than 32767.
     (The portable POSIX limit is 255, and even interval expressions
     with smaller counts can be impractically slow on all known
     implementations.)

   • A bracket expression that contains at least three elements, the
     first and last of which are both ‘:’, or both ‘.’, or both ‘=’.
     For example, a non-GNU ‘grep’ might treat ‘[:alpha:]’ like
     ‘[[:alpha:]]’, or like ‘[:ahlp]’.

   The following constructs have well-defined behavior in GNU ‘grep’.
However, they have unspecified behavior elsewhere, so portable scripts
should avoid them:

   • Special backslash expressions like ‘\b’, ‘\<’, and ‘\]’.  *Note
     Special Backslash Expressions::.

   • A basic regular expression that uses ‘\?’, ‘\+’, or ‘\|’.

   • An extended regular expression that uses back-references.

   • An empty regular expression, subexpression, or alternative.  For
     example, ‘(a|bc|)’ is not portable; a portable equivalent is
     ‘(a|bc)?’.

   • In a basic regular expression, an anchoring ‘^’ that appears
     directly after ‘\(’, or an anchoring ‘$’ that appears directly
     before ‘\)’.

   • In a basic regular expression, a repetition operator that directly
     follows another repetition operator.

   • In an extended regular expression, unescaped ‘{’ that does not
     begin a valid interval expression.  GNU ‘grep’ treats the ‘{’ as an
     ordinary character.

   • A null character or an encoding error in either pattern or input
     data.  *Note Character Encoding::.

   • An input file that ends in a non-newline character, where GNU
     ‘grep’ silently supplies a newline.

   The following constructs have unspecified behavior, in both GNU and
other ‘grep’ implementations.  Scripts should avoid them whenever
possible.

   • A backslash escaping an ordinary character, unless it is a
     back-reference like ‘\1’ or a special backslash expression like
     ‘\<’ or ‘\b’.  *Note Special Backslash Expressions::.
     For example, ‘\x’ has unspecified behavior now, and a future
     version of ‘grep’ might specify ‘\x’ to have a new behavior.

   • A repetition operator that appears directly after an anchor, or at
     the start of a complete regular expression, parenthesized
     subexpression, or alternative.  For example, ‘+|^*(+a|?-b)’ has
     unspecified behavior, whereas ‘\+|^\*(\+a|\?-b)’ is portable.

   • A range expression outside the POSIX locale.  For example, in some
     locales ‘[a-z]’ might match some characters that are not lowercase
     letters, or might not match some lowercase letters, or might be
     invalid.  With GNU ‘grep’ it is not documented whether these range
     expressions use native code points, or use the collating sequence
     specified by the ‘LC_COLLATE’ category, or have some other
     interpretation.  Outside the POSIX locale, it is portable to use
     ‘[[:lower:]]’ to match a lower-case letter, or
     ‘[abcdefghijklmnopqrstuvwxyz]’ to match an ASCII lower-case letter.
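
   As an illustration of the first category above, the following sketch
probes whether the local ‘grep’ accepts a basic regular expression.  It
assumes the usual exit status convention, in which ‘grep’ exits with
status 0 or 1 when the pattern is accepted and with a greater status
when an error occurs.  The helper name ‘regex_status’ is hypothetical
and is not part of ‘grep’.

```shell
# regex_status is a hypothetical helper, not a grep feature.
# It runs grep on empty input, so an accepted pattern yields status 1
# (no match) and a rejected pattern yields a status greater than 1.
regex_status() {
  rc=0
  grep -e "$1" </dev/null >/dev/null 2>&1 || rc=$?
  if [ "$rc" -le 1 ]; then
    echo valid
  else
    echo invalid
  fi
}

# 'xy' is an ordinary pattern.  The other three are the examples that
# the text lists as invalid on all POSIX-conforming platforms.
for pattern in 'xy' 'xy\1' '\(a\)\2' '[a[:b]'; do
  printf '%s\t%s\n' "$pattern" "$(regex_status "$pattern")"
done
```

   A probe like this shows only whether one implementation diagnoses a
pattern.  It cannot show that a pattern is portable, because an
expression with unspecified behavior may be accepted silently.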