4.1 How Input Is Split into Records
The awk utility divides the input for your awk program into records and fields.
awk keeps track of the number of records that have been read so far
from the current input file. This value is stored in a
built-in variable called FNR. It is reset to zero when a new
file is started. Another built-in variable, NR, records the total
number of input records read so far from all data files. It starts at zero,
but is never automatically reset to zero.
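As a small illustration of the difference between FNR and NR, the following sketch runs awk over two throwaway files. The file names one.txt and two.txt and the shell variable fnr_nr_out are made up for this example; any POSIX-style awk is assumed to behave the same way.

```shell
# Create two small input files in a temporary directory (hypothetical names).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\ne\n' > "$tmpdir/two.txt"

# Print FNR, NR, and the record itself for every input record.
fnr_nr_out=$(awk '{ print FNR, NR, $0 }' "$tmpdir/one.txt" "$tmpdir/two.txt")
printf '%s\n' "$fnr_nr_out"

rm -rf "$tmpdir"
```

The first column (FNR) restarts at 1 when the second file begins, while the second column (NR) keeps counting, so the five output lines are ‘1 1 a’, ‘2 2 b’, ‘1 3 c’, ‘2 4 d’, and ‘3 5 e’.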
Records are separated by a character called the record separator.
By default, the record separator is the newline character.
This is why records are, by default, single lines.
A different character can be used for the record separator by
assigning the character to the built-in variable RS.
Like any other variable, the value of RS can be changed in the awk program
with the assignment operator, ‘=’
(see section Assignment Expressions).
The new record-separator character should be enclosed in quotation marks,
which indicate a string constant. Often the right time to do this is
at the beginning of execution, before any input is processed,
so that the very first record is read with the proper separator.
To do this, use the special BEGIN pattern
(see section The BEGIN and END Special Patterns).
For example:
    awk 'BEGIN { RS = "/" } { print $0 }' BBS-list
changes the value of RS to "/", before reading any input.
This is a string whose first character is a slash; as a result, records
are separated by slashes. Then the input file is read, and the second
rule in the awk program (the action with no pattern) prints each
record. Because each print statement adds a newline at the end of
its output, this awk program copies the input
with each slash changed to a newline. Here are the results of running
the program on ‘BBS-list’:
    $ awk 'BEGIN { RS = "/" }
    >      { print $0 }' BBS-list
    -| aardvark 555-5553 1200
    -| 300 B
    -| alpo-net 555-3412 2400
    -| 1200
    -| 300 A
    -| barfly 555-7685 1200
    -| 300 A
    -| bites 555-1675 2400
    -| 1200
    -| 300 A
    -| camelot 555-0542 300 C
    -| core 555-2912 1200
    -| 300 C
    -| fooey 555-1234 2400
    -| 1200
    -| 300 B
    -| foot 555-6699 1200
    -| 300 B
    -| macfoo 555-6480 1200
    -| 300 A
    -| sdace 555-3430 2400
    -| 1200
    -| 300 A
    -| sabafoo 555-2127 1200
    -| 300 C
    -|
Note that the entry for the ‘camelot’ BBS is not split. In the original data file (see section Data Files for the Examples), the line looks like this:
    camelot 555-0542 300 C
It has one baud rate only, so there are no slashes in the record,
unlike the others which have two or more baud rates.
In fact, this record is treated as part of the record
for the ‘core’ BBS; the newline separating them in the output
is the original newline in the data file, not the one added by
awk when it printed the record!
Another way to change the record separator is on the command line, using the variable-assignment feature (see section Other Command-Line Arguments):
    awk '{ print $0 }' RS="/" BBS-list
This sets RS to ‘/’ before processing ‘BBS-list’.
Using an unusual character such as ‘/’ for the record separator produces correct behavior in the vast majority of cases. However, the following (extreme) pipeline prints a surprising ‘1’:
    $ echo | awk 'BEGIN { RS = "a" } ; { print NF }'
    -| 1
There is one field, consisting of a newline. The value of the built-in
variable NF is the number of fields in the current record.
(Some awk versions treat the newline as whitespace when splitting fields
and print ‘0’ instead.)
Reaching the end of an input file terminates the current input record,
even if the last character in the file is not the character in RS.
(d.c.)
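The end-of-file rule can be checked with a short pipeline. This is only a sketch: the input string and the shell variable eof_out are invented for illustration, and printf is used so that the input ends with ‘c’ rather than with a separator or a newline.

```shell
# The input ends with "c"; there is no trailing "/" and no trailing newline.
eof_out=$(printf 'a/b/c' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$eof_out"
```

Three records are reported (‘1: a’, ‘2: b’, ‘3: c’); the last one is ended by the end of the input rather than by a slash.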
The empty string "" (a string without any characters)
has a special meaning as the value of RS. It means that records are separated
by one or more blank lines and nothing else.
See section Multiple-Line Records, for more details.
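A minimal sketch of this behavior follows. The sample input and the variable para_out are made up for the example; the gsub call only makes the newline embedded in the first record visible as ‘ + ’.

```shell
# Two blank lines separate the first two-line record from the second record.
para_out=$(printf 'line 1\nline 2\n\n\nline 3\n' |
    awk 'BEGIN { RS = "" }
         { gsub(/\n/, " + "); print NR ": " $0 }')
printf '%s\n' "$para_out"
```

The run of two blank lines counts as a single separator, so only two records are printed: ‘1: line 1 + line 2’ and ‘2: line 3’.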
If you change the value of RS in the middle of an awk run,
the new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, are not
affected.
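The following sketch switches separators after the first record. It assumes an awk that reads records one at a time, as gawk does; the input text and the variable switch_out are hypothetical.

```shell
# Record 1 is read with the default newline separator.
# Its rule then sets RS to "/", which applies to every later record.
switch_out=$(printf 'x\ny/z/w' |
    awk 'NR == 1 { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$switch_out"
```

The first record, ‘x’, was already delimited by a newline and is unaffected by the assignment; the rest of the input is then split at slashes into ‘y’, ‘z’, and ‘w’.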
After the end of the record has been determined, gawk
sets the variable RT to the text in the input that matched RS.
When using gawk, the value of RS is not limited to a one-character
string. It can be any regular expression
(see section Regular Expressions). (c.e.)
In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string. This general rule is
actually at work in the usual case, where RS contains just a
newline: a record ends at the beginning of the next matching string (the
next newline in the input), and the following record starts just after
the end of this string (at the first character of the following line).
The newline, because it matches RS, is not part of either record.
When RS is a single character, RT
contains the same single character. However, when RS is a
regular expression, RT contains
the actual input text that matched the regular expression.
If the input file ended without any text that matches RS,
gawk sets RT to the null string.
The following example illustrates both of these features.
It sets RS equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:
    $ echo record 1 AAAA record 2 BBBB record 3 |
    > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
    >             { print "Record =", $0, "and RT =", RT }'
    -| Record = record 1 and RT =  AAAA
    -| Record = record 2 and RT =  BBBB
    -| Record = record 3 and RT =
    -|
The final line of output has an extra blank line. This is because the
value of RT is a newline, and the print statement
supplies its own terminating newline.
See section A Simple Stream Editor, for a more useful example
of RS as a regexp and RT.
If you set RS to a regular expression that allows optional
trailing text, such as ‘RS = "abc(XYZ)?"’, it is possible, due
to implementation constraints, that gawk may match the leading
part of the regular expression, but not the trailing part, particularly
if the input text that could match the trailing part is fairly long.
gawk attempts to avoid this problem, but currently, there’s
no guarantee that this will never happen.
NOTE: Remember that in awk, the ‘^’ and ‘$’ anchor metacharacters match the beginning and end of a string, and not the beginning and end of a line. As a result, something like ‘RS = "^[[:upper:]]"’ can only match at the beginning of a file. This is because gawk views the input file as one long string that happens to contain newline characters in it. It is thus best to avoid anchor characters in the value of RS.
The use of RS as a regular expression and the RT
variable are gawk extensions; they are not available in
compatibility mode
(see section Command-Line Options).
In compatibility mode, only the first character of the value of
RS is used to determine the end of the record.
Advanced Notes: RS = "\0" Is Not Portable
There are times when you might want to treat an entire data file as a
single record. The only way to make this happen is to give RS
a value that you know doesn’t occur in the input file. This is hard
to do in a general way, such that a program always works for arbitrary
input files.
You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good
value to use for RS in this case:
    BEGIN { RS = "\0" }  # whole file becomes one record?
gawk in fact accepts this, and uses the NUL
character for the record separator.
However, this usage is not portable
to other awk implementations.
All other awk implementations (19) store strings internally as C-style strings. C strings use the
NUL character as the string terminator. In effect, this means that
‘RS = "\0"’ is the same as ‘RS = ""’.
(d.c.)
The best way to treat a whole file as a single record is to simply read the file in, one record at a time, concatenating each record onto the end of the previous ones.
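A minimal sketch of this concatenation approach follows. It assumes the default RS (a newline); the variable names text and whole_out and the sample input are invented for illustration. Because awk strips the separator from each record, the sketch appends RS again after every record, which also adds a final newline if the file lacked one.

```shell
# Accumulate every record, plus the newline awk removed, in "text".
# The END rule then sees the whole input as one string.
whole_out=$(printf 'one\ntwo\nthree\n' |
    awk '{ text = text $0 RS }
         END { printf "%d characters in %d records\n", length(text), NR }')
printf '%s\n' "$whole_out"
```

For this 14-character input the result is ‘14 characters in 3 records’; in a real program the END rule would process text instead of merely measuring it.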