File: gawk.info,  Node: Regexp Field Splitting,  Next: Single Character Fields,  Prev: Default Field Splitting,  Up: Field Separators

4.5.2 Using Regular Expressions to Separate Fields
--------------------------------------------------

The previous node discussed the use of single characters or simple
strings as the value of 'FS'.  More generally, the value of 'FS' may be
a string containing any regular expression.  In this case, each match in
the record for the regular expression separates fields.  For example,
the assignment:

     FS = ", \t"

makes every area of an input line that consists of a comma followed by a
space and a TAB into a field separator.  ('\t' is an "escape sequence"
that stands for a TAB; *note Escape Sequences::, for the complete list
of similar escape sequences.)
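
   As a minimal sketch of this assignment in use, the following feeds
'awk' a record whose fields are separated by exactly that
three-character sequence.  The sample record and the shell function
name 'split_comma_space_tab' are illustrative assumptions, not part of
'awk' itself:

```shell
# Hypothetical helper: split stdin on comma + space + TAB and
# print the field count followed by each field on its own line.
split_comma_space_tab() {
    awk 'BEGIN { FS = ", \t" }
         { print NF
           for (i = 1; i <= NF; i++)
               print $i }'
}

# printf expands \t into a real TAB in the sample record.
# Prints 3, then apple, banana, and cherry on separate lines.
printf 'apple, \tbanana, \tcherry\n' | split_comma_space_tab
```

Because the whole sequence must match, a comma followed only by a space
does not separate fields under this value of 'FS'.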

   For a less trivial example of a regular expression, suppose you want
each single space to separate fields, the way single commas are used.
'FS' can be set to '"[ ]"' (left bracket, space, right bracket).  This
regular expression matches a single space and nothing else (*note
Regexp::).
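
   A short sketch of the consequence, assuming a made-up sample record
and a hypothetical helper named 'count_single_space_fields': because
each space is its own separator, two adjacent spaces enclose an empty
field.

```shell
# Hypothetical helper: count fields when every single space separates.
count_single_space_fields() {
    awk 'BEGIN { FS = "[ ]" } { print NF }'
}

# The fields are "a", "", "b", and "c", so this prints 4.
echo 'a  b c' | count_single_space_fields
```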

   There is an important difference between the two cases of 'FS = " "'
(a single space) and 'FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, TABs, or newlines).  For both values of 'FS', fields
are separated by "runs" (multiple adjacent occurrences) of spaces, TABs,
and/or newlines.  However, when the value of 'FS' is '" "', 'awk' first
strips leading and trailing whitespace from the record and then decides
where the fields are.  For example, the following pipeline prints 'b':

     $ echo ' a b c d ' | awk '{ print $2 }'
     -| b

However, this pipeline prints 'a' (note the extra spaces around each
letter):

     $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
     >                                  { print $2 }'
     -| a

In this case, the first field is null, or empty, because the leading
space in the record is itself a match for the regular expression.
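
   The same difference shows up in 'NF'.  The following sketch uses a
hypothetical helper, 'count_fields', that passes the separator with
'-v' so that 'awk' processes the escape sequences in it; the sample
record is the one used above.

```shell
# Hypothetical helper: report NF for each record under a given FS value.
count_fields() {
    awk -v FS="$1" '{ print NF }'
}

# Default splitting strips leading and trailing whitespace first.
echo ' a  b  c  d ' | count_fields ' '          # prints 4

# With the regexp, the leading and trailing spaces each delimit
# an empty field, giving "", a, b, c, d, "".
echo ' a  b  c  d ' | count_fields '[ \t\n]+'   # prints 6
```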

   The stripping of leading and trailing whitespace also comes into play
whenever '$0' is recomputed.  For instance, study this pipeline:

     $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
     -|    a b c d
     -| a b c d

The first 'print' statement prints the record as it was read, with
leading whitespace intact.  The assignment to '$2' rebuilds '$0' by
concatenating '$1' through '$NF' together, separated by the value of
'OFS' (which is a space by default).  Because the leading whitespace was
ignored when finding '$1', it is not part of the new '$0'.  Finally, the
last 'print' statement prints the new '$0'.
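
   Setting 'OFS' to something other than a space makes the rebuilding
easier to see.  This is only a sketch; the helper name 'rebuild_record'
and the sample input are assumptions made for illustration.

```shell
# Hypothetical helper: force awk to rebuild $0 using the given OFS.
rebuild_record() {
    awk -v OFS="$1" '{ $1 = $1; print }'
}

# The leading whitespace disappears and every run of spaces between
# fields becomes a single OFS; this prints a-b-c-d.
echo '   a b   c d' | rebuild_record '-'
```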

   There is an additional subtlety to be aware of when using regular
expressions for field splitting.  It is not well specified in the POSIX
standard, or anywhere else, what '^' means when splitting fields.  Does
the '^' match only at the beginning of the entire record?  Or is each
field separator a new string?  It turns out that different 'awk'
versions answer this question differently, and you should not rely on
any specific behavior in your programs.  (d.c.)

   As a point of information, BWK 'awk' allows '^' to match only at the
beginning of the record.  'gawk' also works this way.  For example:

     $ echo 'xxAA  xxBxx  C' |
     > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
     >                             printf "-->%s<--\n", $i }'
     -| --><--
     -| -->AA<--
     -| -->xxBxx<--
     -| -->C<--

The first field is empty because the leading 'xx' is matched by '^x+'
and so acts as a separator.  The 'xx' on either side of 'B' is not at
the beginning of the record, so it remains part of the third field.

   Finally, field splitting with regular expressions works differently
from regexp matching with the 'sub()', 'gsub()', and 'gensub()'
functions (*note String Functions::).  Those functions allow a regexp to
match the empty string; field splitting does not.  Thus, for example,
'FS = "()"' does _not_ split fields between characters.
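
   The contrast can be sketched with a regexp such as 'x*', which, like
'()', can match the empty string.  The helper names and the sample
record below are assumptions made for illustration.

```shell
# Hypothetical helper: replace every match of x* with "-".
# gsub() accepts the empty matches at each position in the record.
mark_matches() {
    awk '{ gsub(/x*/, "-"); print }'
}

# Hypothetical helper: count fields with the same regexp as FS.
# Field splitting ignores empty matches, so nothing is split here.
count_fields_on_xstar() {
    awk -F 'x*' '{ print NF }'
}

echo 'abc' | mark_matches            # prints -a-b-c-
echo 'abc' | count_fields_on_xstar   # prints 1
```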