File: gawk.info, Node: gawk split records, Prev: awk split records, Up: Records

4.1.2 Record Splitting with 'gawk'
----------------------------------

When using 'gawk', the value of 'RS' is not limited to a one-character
string. If it contains more than one character, it is treated as a
regular expression (*note Regexp::). (c.e.) In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string. This general rule is
actually at work in the usual case, where 'RS' contains just a newline:
a record ends at the beginning of the next matching string (the next
newline in the input), and the following record starts just after the
end of this string (at the first character of the following line). The
newline, because it matches 'RS', is not part of either record.

   When 'RS' is a single character, 'RT' contains the same single
character. However, when 'RS' is a regular expression, 'RT' contains
the actual input text that matched the regular expression.

   If the input file ends without any text matching 'RS', 'gawk' sets
'RT' to the null string.

   The following example illustrates both of these features. It sets
'RS' equal to a regular expression that matches either a newline or a
series of one or more uppercase letters with optional leading and/or
trailing whitespace:

     $ echo record 1 AAAA record 2 BBBB record 3 |
     > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
     >             { print "Record =", $0,"and RT = [" RT "]" }'
     -| Record = record 1 and RT = [ AAAA ]
     -| Record = record 2 and RT = [ BBBB ]
     -| Record = record 3 and RT = [
     -| ]

The square brackets delineate the contents of 'RT', letting you see the
leading and trailing whitespace. The final value of 'RT' is a newline.
*Note Simple Sed:: for a more useful example of 'RS' as a regexp and
'RT'.

   If you set 'RS' to a regular expression that allows optional trailing
text, such as 'RS = "abc(XYZ)?"', it is possible, due to implementation
constraints, that 'gawk' may match the leading part of the regular
expression, but not the trailing part, particularly if the input text
that could match the trailing part is fairly long. 'gawk' attempts to
avoid this problem, but currently, there's no guarantee that this will
never happen.

           Caveats When Using Regular Expressions for 'RS'

   Remember that in 'awk', the '^' and '$' anchor metacharacters match
the beginning and end of a _string_, and not the beginning and end of a
_line_. As a result, something like 'RS = "^[[:upper:]]"' can only
match at the beginning of a file. This is because 'gawk' views the
input file as one long string that happens to contain newline
characters. It is thus best to avoid anchor metacharacters in the value
of 'RS'.

   Record splitting with regular expressions works differently than
regexp matching with the 'sub()', 'gsub()', and 'gensub()' (*note String
Functions::). Those functions allow a regexp to match the empty string;
record splitting does not. Thus, for example 'RS = "()"' does _not_
split records between characters.

   The use of 'RS' as a regular expression and the 'RT' variable are
'gawk' extensions; they are not available in compatibility mode (*note
Options::). In compatibility mode, only the first character of the
value of 'RS' determines the end of the record.

   'mawk' has allowed 'RS' to be a regexp for decades. As of October,
2019, BWK 'awk' also supports it. Neither version supplies 'RT',
however.

                     'RS = "\0"' Is Not Portable

   There are times when you might want to treat an entire data file as a
single record. The only way to make this happen is to give 'RS' a value
that you know doesn't occur in the input file. This is hard to do in a
general way, such that a program always works for arbitrary input files.

   You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good value to
use for 'RS' in this case:

     BEGIN { RS = "\0" }  # whole file becomes one record?

   'gawk' in fact accepts this, and uses the NUL character for the
record separator. This works for certain special files, such as
'/proc/PID/environ' on GNU/Linux systems, where the NUL character is in
fact the record separator. However, this usage is _not_ portable to
most other 'awk' implementations.

   Almost all other 'awk' implementations(1) store strings internally as
C-style strings. C strings use the NUL character as the string
terminator. In effect, this means that 'RS = "\0"' is the same as 'RS =
""'. (d.c.)

   It happens that recent versions of 'mawk' can use the NUL character
as a record separator. However, this is a special case: 'mawk' does not
allow embedded NUL characters in strings. (This may change in a future
version of 'mawk'.)

   *Note Readfile Function:: for an interesting way to read whole files.
If you are using 'gawk', see *note Extension Sample Readfile:: for
another option.

   ---------- Footnotes ----------

   (1) At least that we know about.