File: gawk.info,  Node: gawk split records,  Prev: awk split records,  Up: Records

4.1.2 Record Splitting with 'gawk'
----------------------------------

When using 'gawk', the value of 'RS' is not limited to a one-character
string.  If it contains more than one character, it is treated as a
regular expression (*note Regexp::).  (c.e.)  In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string.  This general rule is
actually at work in the usual case, where 'RS' contains just a newline:
a record ends at the beginning of the next matching string (the next
newline in the input), and the following record starts just after the
end of this string (at the first character of the following line).  The
newline, because it matches 'RS', is not part of either record.

   When 'RS' is a single character, 'RT' contains the same single
character.  However, when 'RS' is a regular expression, 'RT' contains
the actual input text that matched the regular expression.

   If the input file ends without any text matching 'RS', 'gawk' sets
'RT' to the null string.

   The following example illustrates both of these features.  It sets
'RS' equal to a regular expression that matches either a newline or a
series of one or more uppercase letters with optional leading and/or
trailing whitespace:

     $ echo record 1 AAAA record 2 BBBB record 3 |
     > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
     >             { print "Record =", $0,"and RT = [" RT "]" }'
     -| Record = record 1 and RT = [ AAAA ]
     -| Record = record 2 and RT = [ BBBB ]
     -| Record = record 3 and RT = [
     -| ]

The square brackets delineate the contents of 'RT', letting you see the
leading and trailing whitespace.  The final value of 'RT' is a newline.
*Note Simple Sed:: for a more useful example of 'RS' as a regexp and
'RT'.
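
   The two boundary cases described earlier (a single-character 'RS',
and input that ends without any text matching 'RS') can be seen in a
small sketch.  The input string and the bracket formatting are
illustrative choices, not taken from a real data file:

```shell
# Assumes gawk is installed and on PATH.
# RS is a single comma, so RT is "," after each matched separator.
# The input has no trailing comma or newline, so the last record
# ends at end-of-file and RT is the null string.
out=$(printf 'a,b,c' | gawk 'BEGIN { RS = "," }
                                   { print $0 " [" RT "]" }')
printf '%s\n' "$out"
# Prints:
#   a [,]
#   b [,]
#   c []
```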

   If you set 'RS' to a regular expression that allows optional trailing
text, such as 'RS = "abc(XYZ)?"', it is possible, due to implementation
constraints, for 'gawk' to match the leading part of the regular
expression but not the trailing part, particularly if the input text
that could match the trailing part is fairly long.  'gawk' attempts to
avoid this problem, but currently there is no guarantee that this will
never happen.

            Caveats When Using Regular Expressions for 'RS'

   Remember that in 'awk', the '^' and '$' anchor metacharacters match
the beginning and end of a _string_, and not the beginning and end of a
_line_.  As a result, something like 'RS = "^[[:upper:]]"' can only
match at the beginning of a file.  This is because 'gawk' views the
input file as one long string that happens to contain newline
characters.  It is thus best to avoid anchor metacharacters in the value
of 'RS'.

   Record splitting with regular expressions works differently from
regexp matching with the 'sub()', 'gsub()', and 'gensub()' functions
(*note String Functions::).  Those functions allow a regexp to match the
empty string; record splitting does not.  Thus, for example,
'RS = "()"' does _not_ split records between characters.
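
   The contrast can be sketched as follows.  The sample strings are
arbitrary, and the second command only counts records, because the text
above promises no more than that 'RS = "()"' does not split between
characters:

```shell
# Assumes gawk is installed and on PATH.
# gsub() lets /x*/ match the empty string at every position.
subbed=$(echo abc | gawk '{ gsub(/x*/, "-"); print }')
printf '%s\n' "$subbed"        # prints -a-b-c-

# As a record separator, a regexp that matches only the empty string
# does not split the three characters into three records.
count=$(printf 'abc' | gawk 'BEGIN { RS = "()" } END { print NR }')
printf '%s\n' "$count"
```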

   The use of 'RS' as a regular expression and the 'RT' variable are
'gawk' extensions; they are not available in compatibility mode (*note
Options::).  In compatibility mode, only the first character of the
value of 'RS' determines the end of the record.
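
   As a rough illustration, the same program can be run in both modes.
The sample input is made up, and '--traditional' is the option used here
to select compatibility mode:

```shell
# Assumes gawk is installed and on PATH.
# Default mode: "ab" is a regexp, so the records are x, y, and z.
normal=$(printf 'xabyabz' | gawk 'BEGIN { RS = "ab" } { print }')
printf '%s\n' "$normal"

# Compatibility mode: only "a" ends a record, so the "b" stays
# in the data and the records are x, by, and bz.
compat=$(printf 'xabyabz' |
         gawk --traditional 'BEGIN { RS = "ab" } { print }')
printf '%s\n' "$compat"
```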

   'mawk' has allowed 'RS' to be a regexp for decades.  As of October
2019, BWK 'awk' also supports it.  Neither version supplies 'RT',
however.

                      'RS = "\0"' Is Not Portable

   There are times when you might want to treat an entire data file as a
single record.  The only way to make this happen is to give 'RS' a value
that you know doesn't occur in the input file.  This is hard to do in a
general way, such that a program always works for arbitrary input files.

   You might think that for text files, the NUL character, whose bits
are all equal to zero, is a good value to use for 'RS' in this case:

     BEGIN { RS = "\0" }  # whole file becomes one record?

   'gawk' in fact accepts this, and uses the NUL character for the
record separator.  This works for certain special files, such as
'/proc/PID/environ' on GNU/Linux systems (where PID is a process ID),
in which the NUL character is in fact the record separator.  However,
this usage is _not_ portable to most other 'awk' implementations.
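
   For instance, NUL-separated input can be split like this in 'gawk'.
The 'printf' command merely simulates such a file, and the entries in
the sample data are invented:

```shell
# Assumes gawk is installed and on PATH; other awks may differ.
# printf emits two entries, each terminated by a NUL byte (\000).
out=$(printf 'HOME=/root\000SHELL=/bin/sh\000' |
      gawk 'BEGIN { RS = "\0" } { print NR ": " $0 }')
printf '%s\n' "$out"
# Prints:
#   1: HOME=/root
#   2: SHELL=/bin/sh
```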

   Almost all other 'awk' implementations(1) store strings internally as
C-style strings.  C strings use the NUL character as the string
terminator.  In effect, this means that 'RS = "\0"' is the same as 'RS =
""'.  (d.c.)

   It happens that recent versions of 'mawk' can use the NUL character
as a record separator.  However, this is a special case: 'mawk' does not
allow embedded NUL characters in strings.  (This may change in a future
version of 'mawk'.)

   *Note Readfile Function:: for an interesting way to read whole files.
If you are using 'gawk', see *note Extension Sample Readfile:: for
another option.
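
   One 'gawk'-specific idiom for this, sketched here as an assumption
rather than as a portable technique, sets 'RS' to the regexp '^$'.
Because the anchors apply to the whole input rather than to each line,
that regexp cannot match in any nonempty file, so 'gawk' never finds a
record separator:

```shell
# Assumes gawk is installed and on PATH.
# RS = "^$" matches only an empty input, so a nonempty file is read
# as a single record that still contains its newlines.
out=$(printf 'line 1\nline 2\n' |
      gawk 'BEGIN { RS = "^$" } { print NR, length($0) }')
printf '%s\n' "$out"           # prints 1 14
```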

   ---------- Footnotes ----------

   (1) At least, all those that we know about.