4.7 Defining Fields By Content
NOTE: This section discusses an advanced feature of gawk. If you are a novice awk user, you might want to skip it on the first reading.
Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is.
However, there are times when you really want to define the fields by
what they are, and not by what they are not.
The most notorious such case is so-called comma-separated value (CSV) data. Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If only commas separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. While there is no formal standard specification for CSV data(22), in such cases, most programs embed the field in double quotes. So we might have data like this:
    Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
The FPAT variable offers a solution for cases like this. The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field.
In the case of CSV data as presented above, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see section Regular Expressions), we would have /([^,]+)|("[^"]+")/.
Writing this as a string requires us to escape the double quotes, leading to:
    FPAT = "([^,]+)|(\"[^\"]+\")"
Putting this to use, here is a simple program to parse the data:
    BEGIN {
        FPAT = "([^,]+)|(\"[^\"]+\")"
    }

    {
        print "NF = ", NF
        for (i = 1; i <= NF; i++) {
            printf("$%d = <%s>\n", i, $i)
        }
    }
When run, we get the following:
    $ gawk -f simple-csv.awk addresses.csv
    NF =  7
    $1 = <Robbins>
    $2 = <Arnold>
    $3 = <"1234 A Pretty Street, NE">
    $4 = <MyTown>
    $5 = <MyState>
    $6 = <12345-6789>
    $7 = <USA>
Note the embedded comma in the value of $3.
A straightforward improvement when processing CSV data of this sort would be to remove the quotes when they occur, with something like this:
    if (substr($i, 1, 1) == "\"") {
        len = length($i)
        $i = substr($i, 2, len - 2)    # Get text within the two quotes
    }
As with FS, the IGNORECASE variable (see section Built-in Variables That Control awk) affects field splitting with FPAT.
Similar to FIELDWIDTHS, the value of PROCINFO["FS"] will be "FPAT" if content-based field splitting is being used.
NOTE: Some programs export CSV data that contains embedded newlines between the double quotes. gawk provides no way to deal with this. Since there is no formal specification for CSV data, there isn’t much more to be done; the FPAT mechanism provides an elegant solution for the majority of cases, and the gawk maintainer is satisfied with that.
As written, the regexp used for FPAT requires that each field have at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:
    FPAT = "([^,]*)|(\"[^\"]+\")"
Finally, the patsplit() function makes the same functionality available for splitting regular strings (see section String-Manipulation Functions).