File: gawk.info,  Node: Gory Details,  Up: String Functions

9.1.4.1 More about '\' and '&' with 'sub()', 'gsub()', and 'gensub()'
.....................................................................

     CAUTION: This subsubsection has been reported to cause headaches.
     You might want to skip it upon first reading.

   When using 'sub()', 'gsub()', or 'gensub()', and trying to get
literal backslashes and ampersands into the replacement text, you need
to remember that there are several levels of "escape processing" going
on.

   First, there is the "lexical" level, which is when 'awk' reads your
program and builds an internal copy of it to execute.  Then there is the
runtime level, which is when 'awk' actually scans the replacement string
to determine what to generate.

   At both levels, 'awk' looks for a defined set of characters that can
come after a backslash.  At the lexical level, it looks for the escape
sequences listed in *note Escape Sequences::.  Thus, for every '\' that
'awk' processes at the runtime level, you must type two backslashes at
the lexical level.  When a character that is not valid for an escape
sequence follows the '\', BWK 'awk' and 'gawk' both simply remove the
initial '\' and put the next character into the string.  Thus, for
example, '"a\qb"' is treated as '"aqb"'.

   At the runtime level, the various functions handle sequences of '\'
and '&' differently.  The situation is (sadly) somewhat complex.
Historically, the 'sub()' and 'gsub()' functions treated the
two-character sequence '\&' specially; this sequence was replaced in the
generated text with a single '&'.  Any other '\' within the REPLACEMENT
string that did not precede an '&' was passed through unchanged.  This
is illustrated in *note Table 9.1: table-sub-escapes.


      You type         'sub()' sees          'sub()' generates
      --------         ------------          -----------------
          '\&'              '&'            The matched text
         '\\&'             '\&'            A literal '&'
        '\\\&'             '\&'            A literal '&'
       '\\\\&'            '\\&'            A literal '\&'
      '\\\\\&'            '\\&'            A literal '\&'
     '\\\\\\&'           '\\\&'            A literal '\\&'
         '\\q'             '\q'            A literal '\q'

Table 9.1: Historical escape sequence processing for 'sub()' and
'gsub()'

This table shows the lexical-level processing, where an odd number of
typed backslashes yields the same runtime string as the next-lower even
number, as well as the runtime processing done by 'sub()'.  (For the
sake of simplicity, the remaining tables show only the case of even
numbers of backslashes entered at the lexical level.)

   The problem with the historical approach is that there is no way to
get a literal '\' followed by the matched text.

   Several editions of the POSIX standard attempted to fix this problem
but were not successful.  The details are no longer relevant.

   At one point, the 'gawk' maintainer submitted proposed text for a
revised standard that reverts to rules that correspond more closely to
the original existing practice.  The proposed rules have special cases
that make it possible to produce a '\' preceding the matched text.  This
is shown in *note Table 9.2: table-sub-proposed.


      You type         'sub()' sees         'sub()' generates
      --------         ------------         -----------------
     '\\\\\\&'           '\\\&'            A literal '\&'
       '\\\\&'            '\\&'            A literal '\', followed by the matched text
         '\\&'             '\&'            A literal '&'
         '\\q'             '\q'            A literal '\q'
        '\\\\'             '\\'            '\\'

Table 9.2: 'gawk' rules for 'sub()' and backslash

   In a nutshell, at the runtime level, there are now three special
sequences of characters ('\\\&', '\\&', and '\&') whereas historically
there was only one.  However, as in the historical case, any '\' that is
not part of one of these three sequences is not special and appears in
the output literally.

   'gawk' 3.0 and 3.1 follow these rules for 'sub()' and 'gsub()'.  The
POSIX standard took much longer to be revised than was expected.  In
addition, the 'gawk' maintainer's proposal was lost during the
standardization process.  The final rules are somewhat simpler.  The
results are similar except for one case.

   The POSIX rules state that '\&' in the replacement string produces a
literal '&', '\\' produces a literal '\', and '\' followed by anything
else is not special; the '\' is placed straight into the output.  These
rules are presented in *note Table 9.3: table-posix-sub.


      You type         'sub()' sees         'sub()' generates
      --------         ------------         -----------------
     '\\\\\\&'           '\\\&'            A literal '\&'
       '\\\\&'            '\\&'            A literal '\', followed by the matched text
         '\\&'             '\&'            A literal '&'
         '\\q'             '\q'            A literal '\q'
        '\\\\'             '\\'            '\'

Table 9.3: POSIX rules for 'sub()' and 'gsub()'

   The only case where the difference is noticeable is the last one:
'\\\\' is seen as '\\' and produces '\' instead of '\\'.

   Starting with version 3.1.4, 'gawk' followed the POSIX rules when
'--posix' was specified (*note Options::).  Otherwise, it continued to
follow the proposed rules, as that had been its behavior for many years.

   When version 4.0.0 was released, the 'gawk' maintainer made the POSIX
rules the default, breaking well over a decade's worth of backward
compatibility.(1)  Needless to say, this was a bad idea, and as of
version 4.0.1, 'gawk' resumed its historical behavior; it follows the
POSIX rules only when '--posix' is given.

   The rules for 'gensub()' are considerably simpler.  At the runtime
level, whenever 'gawk' sees a '\', if the following character is a
digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output.  Otherwise, no matter
what character follows the '\', it appears in the generated text and the
'\' does not, as shown in *note Table 9.4: table-gensub-escapes.


       You type          'gensub()' sees         'gensub()' generates
       --------          ---------------         --------------------
           '&'                    '&'            The matched text
         '\\&'                   '\&'            A literal '&'
        '\\\\'                   '\\'            A literal '\'
       '\\\\&'                  '\\&'            A literal '\', then the matched text
     '\\\\\\&'                 '\\\&'            A literal '\&'
         '\\q'                   '\q'            A literal 'q'

Table 9.4: Escape sequence processing for 'gensub()'

   Because of the complexity of the lexical- and runtime-level
processing and the special cases for 'sub()' and 'gsub()', we recommend
the use of 'gawk' and 'gensub()' when you have to do substitutions.

   ---------- Footnotes ----------

   (1) This was rather naive of him, despite there being a note in this
minor node indicating that the next major version would move to the
POSIX rules.