File: gawk.info,  Node: wc program,  Prev: Using extensions,  Up: Wc Program

11.2.7.3 Code for 'wc.awk'
..........................

The usage for 'wc' is as follows:

     'wc' ['-lwcm'] [FILES ...]

   If no files are specified on the command line, 'wc' reads its
standard input.  If there are multiple files, it also prints total
counts for all the files.  The options and their meanings are as
follows:

'-c'
     Count only bytes.  Once upon a time, the 'c' in this option stood
     for "characters."  But, as explained earlier, bytes and characters
     are no longer synonymous with each other.

'-l'
     Count only lines.

'-m'
     Count only characters.

'-w'
     Count only words.  A "word" is a contiguous sequence of
     nonwhitespace characters, separated by spaces and/or TABs.
     Luckily, this is the normal way 'awk' separates fields in its input
     data.

   Implementing 'wc' in 'awk' is particularly elegant, because 'awk'
does a lot of the work for us: it splits lines into words (i.e., fields)
and counts them, it counts lines (i.e., records), and it can easily tell
us how long a line is in characters.
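
   As a minimal sketch of how much 'awk' already provides (an
illustration only, not part of 'wc.awk'), the following one-rule program
reports the field count and the length of each record:

     # Illustrative only: NR, NF, and length() do all the counting
     { print NR ": " NF " fields, " length($0) " characters" }

   Under the default field splitting, leading and repeated whitespace
does not create empty fields, so a line such as '  one two   three'
still yields three fields.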

   This program uses the 'getopt()' library function (*note Getopt
Function::) and the file-transition functions (*note Filetrans
Function::).

   This version has one notable difference from older versions of 'wc':
it always prints the counts in the order lines, words, characters and
bytes.  Older versions note the order of the '-l', '-w', and '-c'
options on the command line, and print the counts in that order.  POSIX
does not mandate this behavior, though.

   The 'BEGIN' rule does the argument processing.  The variable
'print_total' is true if more than one file is named on the command
line:

     # wc.awk --- count lines, words, characters, bytes

     # Options:
     #    -l    only count lines
     #    -w    only count words
     #    -c    only count bytes
     #    -m    only count characters
     #
     # Default is to count lines, words, bytes
     #
     # Requires getopt() and file transition library functions
     # Requires mbs extension from gawkextlib

     @load "mbs"

     BEGIN {
         # let getopt() print a message about
         # invalid options. we ignore them
         while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) {
             if (c == "l")
                 do_lines = 1
             else if (c == "w")
                 do_words = 1
             else if (c == "c")
                 do_bytes = 1
             else if (c == "m")
                 do_chars = 1
         }
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         # if no options, do lines, words, bytes
         if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
             do_lines = do_words = do_bytes = 1

         print_total = (ARGC - i > 1)
     }
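
   To see how the last statement of the rule works, consider a
hypothetical invocation with one option and two files.  The sketch below
does not call 'getopt()'; the variables 'argc' and 'optind' are
illustrative stand-ins for 'ARGC' and 'Optind', set by hand to the
values they would plausibly have:

     BEGIN {
         # As if run with the arguments: -l file1 file2
         argc = 4      # stand-in for ARGC: "awk", "-l", "file1", "file2"
         optind = 2    # stand-in for Optind: index of "file1"

         # like the option-clearing loop, this exits with i == optind
         for (i = 1; i < optind; i++)
             cleared++

         nfiles = argc - i
         print "files:", nfiles, "print_total:", (nfiles > 1)
     }

   Two file names remain here, so the comparison yields 1 and a total
line would be printed.  With a single file name the difference is 1, and
the comparison yields 0.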

   The 'beginfile()' function is simple; it just resets the counts of
lines, words, characters and bytes to zero, and saves the current file
name in 'fname':

     function beginfile(file)
     {
         lines = words = chars = bytes = 0
         fname = FILENAME
     }

   The 'endfile()' function adds the current file's numbers to the
running totals of lines, words, characters, and bytes.  It then prints
out those numbers for the file that was just read.  It relies on
'beginfile()' to reset the numbers for the following data file:

     function endfile(file)
     {
         tlines += lines
         twords += words
         tchars += chars
         tbytes += bytes
         if (do_lines)
             printf "\t%d", lines
         if (do_words)
             printf "\t%d", words
         if (do_chars)
             printf "\t%d", chars
         if (do_bytes)
             printf "\t%d", bytes
         printf "\t%s\n", fname
     }
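
   Neither function is called from 'wc.awk' itself; the file-transition
library code does that.  As a rough sketch of the idea (simplified, and
not necessarily identical to the library's code; the variable
'_oldfilename' and the line-counting function bodies are illustrative
assumptions), a rule can watch for a change in 'FILENAME':

     # Sketch of file-transition handling
     FILENAME != _oldfilename {
         if (_oldfilename != "")
             endfile(_oldfilename)     # finish the previous file
         _oldfilename = FILENAME
         beginfile(FILENAME)           # start the new one
     }

     { lines++ }

     END { endfile(_oldfilename) }     # finish the last file

     # Illustrative bodies: count and report lines per file
     function beginfile(file) { lines = 0 }
     function endfile(file)   { printf "%s: %d lines\n", file, lines }

   This sketch assumes named data files.  It does not notice empty
files, and it does not treat standard input specially.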

   There is one rule that is executed for each line.  It adds the length
of the record, plus one, to 'chars'.  The extra one is needed because
the newline character separating records (the value of 'RS') is not part
of the record itself, and thus is not included in its length.
Similarly, it adds the length of the record in bytes, plus one, to
'bytes'.  Next, 'lines' is incremented for each line read, and 'words'
is incremented by the value of 'NF', which is the number of "words" on
this line:

     # do per line
     {
         chars += length($0) + 1    # get newline
         bytes += mbs_length($0) + 1
         lines++
         words += NF
     }
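
   The effect of the "plus one" can be checked with a reduced version of
this rule (a sketch that keeps only the character count and prints it at
the end):

     { chars += length($0) + 1 }    # +1 for the newline ending the record
     END { print chars + 0 }        # + 0 prints 0 for empty input

   For a two-line input consisting of 'abc' and 'de', this prints 7:
three characters plus a newline, then two characters plus a newline.
Input whose last line lacks a trailing newline is counted as if it had
one.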

   Finally, the 'END' rule simply prints the totals for all the files:

     END {
         if (print_total) {
             if (do_lines)
                 printf "\t%d", tlines
             if (do_words)
                 printf "\t%d", twords
             if (do_chars)
                 printf "\t%d", tchars
             if (do_bytes)
                 printf "\t%d", tbytes
             print "\ttotal"
         }
     }