info gawk

9.1.3.1 More About ‘\’ and ‘&’ with sub(), gsub(), and gensub()

When you use sub(), gsub(), or gensub() and want literal backslashes and ampersands in the replacement text, remember that several levels of escape processing are involved.

First, there is the lexical level, which is when awk reads your program and builds an internal copy of it that can be executed. Then there is the runtime level, which is when awk actually scans the replacement string to determine what to generate.

At both levels, awk looks for a defined set of characters that can come after a backslash. At the lexical level, it looks for the escape sequences listed in Escape Sequences. Thus, for every ‘\’ that awk processes at the runtime level, you must type two backslashes at the lexical level. When a character that is not valid for an escape sequence follows the ‘\’, Brian Kernighan’s awk and gawk both simply remove the initial ‘\’ and put the next character into the string. Thus, for example, "a\qb" is treated as "aqb".
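The lexical level can be observed on its own by printing string constants, with no substitution function involved. The following is a minimal sketch, assuming gawk is installed and on your PATH; the variable name lexical is only for illustration, and standard error is discarded because gawk warns about the unknown escape ‘\q’.

```shell
# Lexical level only: print the string that gawk builds from each constant.
# Assumption: gawk is installed; 2>/dev/null hides its warning about "\q".
lexical=$(gawk 'BEGIN {
    print "a\qb"     # "\q" is not a valid escape, so the backslash is dropped
    print "a\\qb"    # "\\" is a valid escape and yields one backslash
    print "\\\\&"    # four typed backslashes become two in the string
}' 2>/dev/null)
printf '%s\n' "$lexical"
```

With gawk, this prints:

-| aqb
-| a\qb
-| \\&

Other awk implementations may treat an unknown escape such as ‘\q’ differently, so the first line is not portable.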

At the runtime level, the various functions handle sequences of ‘\’ and ‘&’ differently. The situation is (sadly) somewhat complex. Historically, the sub() and gsub() functions treated the two-character sequence ‘\&’ specially; this sequence was replaced in the generated text with a single ‘&’. Any other ‘\’ within the replacement string that did not precede an ‘&’ was passed through unchanged. This is illustrated in Table 9.1.

 You type         sub() sees          sub() generates
 --------         ----------          ---------------
     \&              &            the matched text
    \\&             \&            a literal ‘&’
   \\\&             \&            a literal ‘&’
  \\\\&            \\&            a literal ‘\&’
 \\\\\&            \\&            a literal ‘\&’
\\\\\\&           \\\&            a literal ‘\\&’
    \\q             \q            a literal ‘\q’

Table 9.1: Historical Escape Sequence Processing for sub() and gsub()

This table shows both the lexical-level processing, where an odd number of backslashes becomes an even number at the runtime level, and the runtime processing done by sub(). (For simplicity, the remaining tables show only the case of an even number of backslashes entered at the lexical level.)

The problem with the historical approach is that there is no way to get a literal ‘\’ followed by the matched text.

The POSIX rules state that ‘\&’ in the replacement string produces a literal ‘&’, ‘\\’ produces a literal ‘\’, and ‘\’ followed by anything else is not special; the ‘\’ is placed straight into the output. These rules are presented in Table 9.2.

 You type         sub() sees         sub() generates
 --------         ----------         ---------------
\\\\\\&           \\\&            a literal ‘\&’
  \\\\&            \\&            a literal ‘\’, followed by the matched text
    \\&             \&            a literal ‘&’
    \\q             \q            a literal ‘\q’
   \\\\             \\            a literal ‘\’

Table 9.2: POSIX rules for sub() and gsub()

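gawk follows the POSIX rules when you invoke it with the --posix option. Version 4.0.0 made these rules the default, but versions 4.0.1 and later otherwise apply a closely related rule set that agrees with Table 9.2 in every row except the last: there, typing ‘\\\\’ generates ‘\\’ rather than ‘\’.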
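The rows of Table 9.2 can be checked directly. The following is a minimal sketch, assuming gawk is installed; it passes --posix so that the POSIX rules are in effect regardless of the defaults of the gawk version at hand. The sample input abc, the regexp /b/, and the variable name posix_rows are only for illustration.

```shell
# One sub() call per row of Table 9.2, in table order.
# Assumption: gawk is installed; --posix selects the POSIX replacement rules.
posix_rows=$(echo abc | gawk --posix '{
    s = $0; sub(/b/, "\\\\\\&", s); print s    # sub() sees \\\&
    s = $0; sub(/b/, "\\\\&", s);   print s    # sub() sees \\&
    s = $0; sub(/b/, "\\&", s);     print s    # sub() sees \&
    s = $0; sub(/b/, "\\q", s);     print s    # sub() sees \q
    s = $0; sub(/b/, "\\\\", s);    print s    # sub() sees \\
}')
printf '%s\n' "$posix_rows"
```

Here the matched text is ‘b’, so the output is:

-| a\&c
-| a\bc
-| a&c
-| a\qc
-| a\c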

The rules for gensub() are considerably simpler. At the runtime level, whenever gawk sees a ‘\’, if the following character is a digit, then the text that matched the corresponding parenthesized subexpression is placed in the generated output. Otherwise, no matter what character follows the ‘\’, it appears in the generated text and the ‘\’ does not, as shown in Table 9.3.

  You type          gensub() sees         gensub() generates
  --------          -------------         ------------------
      &                    &            the matched text
    \\&                   \&            a literal ‘&’
   \\\\                   \\            a literal ‘\’
  \\\\&                  \\&            a literal ‘\’, then the matched text
\\\\\\&                 \\\&            a literal ‘\&’
    \\q                   \q            a literal ‘q’

Table 9.3: Escape Sequence Processing for gensub()
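The gensub() rules can be exercised the same way. This is a minimal sketch, assuming gawk is installed (gensub() is a gawk extension); the sample input abc and the variable name gensub_rows are only for illustration. The first six calls follow the rows of Table 9.3 in order, and the last one uses digits to refer to parenthesized subexpressions.

```shell
# One gensub() call per row of Table 9.3, then a backreference example.
# Assumption: gawk is installed; gensub() is not available in other awks.
gensub_rows=$(echo abc | gawk '{
    print gensub(/b/, "&", 1)              # gensub() sees &
    print gensub(/b/, "\\&", 1)            # gensub() sees \&
    print gensub(/b/, "\\\\", 1)           # gensub() sees \\
    print gensub(/b/, "\\\\&", 1)          # gensub() sees \\&
    print gensub(/b/, "\\\\\\&", 1)        # gensub() sees \\\&
    print gensub(/b/, "\\q", 1)            # gensub() sees \q
    print gensub(/(a)(b)/, "\\2\\1", 1)    # gensub() sees \2\1
}')
printf '%s\n' "$gensub_rows"
```

With the matched text ‘b’ (and ‘ab’ in the last call), this prints:

-| abc
-| a&c
-| a\c
-| a\bc
-| a\&c
-| aqc
-| bac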

Because of the complexity of the lexical- and runtime-level processing and the special cases for sub() and gsub(), we recommend the use of gawk and gensub() when you have to do substitutions.

Advanced Notes: Matching the Null String

In awk, the ‘*’ operator can match the null string. This is particularly important for the sub(), gsub(), and gensub() functions. For example:

$ echo abc | awk '{ gsub(/m*/, "X"); print }'
-| XaXbXcX

Although this makes a certain amount of sense, it can be surprising.
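Two variations make it easier to see where the null matches fall. In this minimal sketch (the variable name null_match is only for illustration), sub() replaces just the first null match, which is at the start of the string, and the regexp ‘m+’, which requires at least one ‘m’, matches nothing at all:

```shell
# Compare a regexp that can match the null string with one that cannot.
null_match=$(echo abc | awk '{
    s = $0; sub(/m*/, "X", s);  print s    # only the first (empty) match is replaced
    s = $0; gsub(/m+/, "X", s); print s    # "m+" cannot match the null string
}')
printf '%s\n' "$null_match"
```

This prints:

-| Xabc
-| abc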
