manpagez: man pages & more
info gawk
Home | html | info | man

File: gawk.info,  Node: More CSV,  Next: FS versus FPAT,  Up: Splitting By Content

4.7.1 More on CSV Files
-----------------------

Manuel Collado notes that in addition to commas, a CSV field can also
contains quotes, that have to be escaped by doubling them.  The
previously described regexps fail to accept quoted fields with both
commas and quotes inside.  He suggests that the simplest 'FPAT'
expression that recognizes this kind of fields is
'/([^,]*)|("([^"]|"")+")/'.  He provides the following input data to
test these variants:

     p,"q,r",s
     p,"q""r",s
     p,"q,""r",s
     p,"",s
     p,,s

And here is his test program:

     BEGIN {
          fp[0] = "([^,]+)|(\"[^\"]+\")"
          fp[1] = "([^,]*)|(\"[^\"]+\")"
          fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
          FPAT = fp[fpat+0]
     }

     {
          print "<" $0 ">"
          printf("NF = %s ", NF)
          for (i = 1; i <= NF; i++) {
              printf("<%s>", $i)
          }
          print ""
     }

   When run on the third variant, it produces:

     $ gawk -v fpat=2 -f test-csv.awk sample.csv
     -| 
     -| NF = 3 

<"q,r"> -| -| NF = 3

<"q""r"> -| -| NF = 3

<"q,""r"> -| -| NF = 3

<""> -| -| NF = 3

<> In general, using 'FPAT' to do your own CSV parsing is like having a bed with a blanket that's not quite big enough. There's always a corner that isn't covered. We recommend, instead, that you use Manuel Collado's 'CSVMODE' library for 'gawk' (http://mcollado.z15.es/xgawk/).

© manpagez.com 2000-2025
Individual documents may contain additional copyright information.