File: gawk.info,  Node: Uniq Program,  Next: Wc Program,  Prev: Tee Program,  Up: Clones

11.2.6 Printing Nonduplicated Lines of Text
-------------------------------------------

The 'uniq' utility reads sorted lines of data on its standard input, and
by default removes duplicate lines.  In other words, it only prints
unique lines--hence the name.  'uniq' has a number of options.  The
usage is as follows:

     'uniq' ['-udc' ['-f N'] ['-s N']] [INPUTFILE [OUTPUTFILE]]

   The options for 'uniq' are:

'-d'
     Print only repeated (duplicated) lines.

'-u'
     Print only nonrepeated (unique) lines.

'-c'
     Count lines.  This option overrides '-d' and '-u'.  Both repeated
     and nonrepeated lines are counted.

'-f N'
     Skip N fields before comparing lines.  The definition of fields is
     similar to 'awk''s default: nonwhitespace characters separated by
     runs of spaces and/or TABs.

'-s N'
     Skip N characters before comparing lines.  Any fields specified
     with '-f' are skipped first.

'INPUTFILE'
     Data is read from the input file named on the command line, instead
     of from the standard input.

'OUTPUTFILE'
     The generated output is sent to the named output file, instead of
     to the standard output.

   With none of these options, 'uniq' behaves as if both '-d' and '-u'
were provided: it prints one copy of every line, repeated or not.
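
   The default behavior, and the reason the input is expected to be
sorted, can be illustrated with a much smaller program.  The following
is only a sketch for a POSIX shell, not the program developed in this
section; the function name 'adjacent_uniq' and the variable 'prev' are
made up for the illustration.  It prints a line only when it differs
from the line just before it:

```shell
# Hypothetical helper: print each line that differs from the previous one.
adjacent_uniq() {
    awk 'NR == 1 || $0 != prev { print }
         { prev = $0 }'
}

# Sorted input: each distinct line appears once.
printf 'a\na\nb\nb\n' | adjacent_uniq

# Unsorted input: the two "b" lines are not adjacent, so both are printed.
printf 'b\na\nb\n' | adjacent_uniq
```

   Because only neighboring lines are compared, duplicates that are not
adjacent survive, which is why the data is normally sorted first.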

   'uniq' uses the 'getopt()' library function (*note Getopt Function::)
and the 'join()' library function (*note Join Function::).

   The program begins with a 'usage()' function and then a brief outline
of the options and their meanings in comments:

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt() and join() library functions

     function usage()
     {
         print("Usage: uniq [-udc [-f fields] [-s chars]] " \
               "[ in [ out ]]") > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only nonrepeated lines
     # -f n  skip n fields
     # -s n  skip n characters, skip fields first

   The POSIX standard for 'uniq' allows options to start with '+' as
well as with '-'.  An initial 'BEGIN' rule traverses the arguments
changing any leading '+' to '-' so that the 'getopt()' function can
parse the options:

     # As of 2020, '+' can be used as the option character in addition to '-'
     # Previously allowed use of -N to skip fields and +N to skip
     # characters is no longer allowed, and not supported by this version.

     BEGIN {
         # Convert + to - so getopt can handle things
         for (i = 1; i < ARGC; i++) {
             first = substr(ARGV[i], 1, 1)
             if (ARGV[i] == "--" || (first != "-" && first != "+"))
                 break
             else if (first == "+")
                 # Replace "+" with "-"
                 ARGV[i] = "-" substr(ARGV[i], 2)
         }
     }
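
   To see what this loop does without running the whole program, the
same logic can be tried on a plain string.  This is a sketch under
stated assumptions: the shell function 'plus_to_minus' and the variables
'args' and 'arg' are hypothetical, and the arguments are passed as one
space-separated string rather than through 'ARGV':

```shell
# Hypothetical stand-alone version: takes the arguments as one
# space-separated string instead of reading ARGV.
plus_to_minus() {
    awk -v args="$1" 'BEGIN {
        n = split(args, arg, " ")
        for (i = 1; i <= n; i++) {
            first = substr(arg[i], 1, 1)
            if (arg[i] == "--" || (first != "-" && first != "+"))
                break
            else if (first == "+")
                arg[i] = "-" substr(arg[i], 2)
        }
        # Print the (possibly rewritten) arguments on one line
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? " " : "") arg[i]
        print out
    }'
}

plus_to_minus '+c +f2 -- +x'    # prints: -c -f2 -- +x
plus_to_minus '+f 2 +c'         # prints: -f 2 +c
```

   The second call shows that the loop stops at the first argument that
does not begin with '-' or '+'.  Here that is the separate option
argument '2', so the '+c' that follows it is left unchanged.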

   The next 'BEGIN' rule deals with the command-line arguments and
options.  If neither '-d' nor '-u' is supplied, then the default is
taken, to print both repeated and nonrepeated lines.  The variable
'outputfile' is initialized to the standard output, '/dev/stdout'; if an
output file is named on the command line, it is assigned to 'outputfile'
instead:

     BEGIN {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udcf:s:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (c == "f")
                 fcount = Optarg + 0
             else if (c == "s")
                 charcount = Optarg + 0
             else
                 usage()
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

   The following function, 'are_equal()', compares the current line,
'$0', to the previous line, 'last'.  It handles skipping fields and
characters.  If no field count and no character count are specified,
'are_equal()' returns one or zero depending upon the result of a simple
string comparison of 'last' and '$0'.

   Otherwise, things get more complicated.  If fields have to be
skipped, each line is broken into an array using 'split()' (*note String
Functions::); the desired fields are then joined back into a line using
'join()'.  The joined lines are stored in 'clast' and 'cline'.  If no
fields are skipped, 'clast' and 'cline' are set to 'last' and '$0',
respectively.  Finally, if characters are skipped, 'substr()' is used to
strip off the leading 'charcount' characters in 'clast' and 'cline'.
The two strings are then compared and 'are_equal()' returns the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }
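
   The effect of skipping fields and characters is easier to see when
the comparison string itself is printed.  The following sketch assumes a
POSIX shell; the names 'compare_key' and 'key' are hypothetical, and a
small loop stands in for the 'join()' library function, joining the
remaining fields with single spaces:

```shell
# Hypothetical helper: print the part of each input line that would be
# compared after skipping F fields and then S characters.
compare_key() {
    awk -v fcount="$1" -v charcount="$2" '
    function key(line,    n, i, parts, out) {
        out = line
        if (fcount > 0) {
            # Rebuild the line from the fields after the skipped ones
            n = split(line, parts)
            out = ""
            for (i = fcount + 1; i <= n; i++)
                out = out (i > fcount + 1 ? " " : "") parts[i]
        }
        if (charcount > 0)
            out = substr(out, charcount + 1)
        return out
    }
    { print key($0) }'
}

# Skip one field: both lines yield the same comparison string.
printf '1 apple pie\n2 apple pie\n' | compare_key 1 0

# Skip one field, then two characters.
printf '1 apple pie\n' | compare_key 1 2
```

   With one field skipped, both lines in the first call reduce to 'apple
pie', so they would count as duplicates.  In the second call, two more
characters are dropped after the field, leaving 'ple pie'.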

   The following two rules are the body of the program.  The first one
is executed only for the very first line of data.  It sets 'last' equal
to '$0', so that subsequent lines of text have something to be compared
to.

   The second rule does the work.  The variable 'equal' is one or zero,
depending upon the results of 'are_equal()''s comparison.  If 'uniq' is
counting lines ('-c'), and the lines are equal, then it increments the
'count' variable.  Otherwise, it prints the count and the saved line,
and resets 'count', because the two lines are not equal.

   If 'uniq' is not counting, and if the lines are equal, 'count' is
incremented.  Nothing is printed, as the point is to remove duplicates.
Otherwise, if 'uniq' is printing repeated lines ('-d') and more than one
line was seen, or if 'uniq' is printing nonrepeated lines ('-u') and
only one line was seen, then the saved line is printed, and 'count' is
reset.

   Finally, similar logic is used in the 'END' rule to print the final
line of input data:

     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (NR > 0) {    # empty input has no final line to print
             if (do_count)
                 printf("%4d %s\n", count, last) > outputfile
             else if ((repeated_only && count > 1) ||
                     (non_repeated_only && count == 1))
                 print last > outputfile
         }
         close(outputfile)
     }
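
   The run-counting logic of these rules can also be tried in isolation.
The following is a simplified sketch for a POSIX shell, not a
replacement for the program above.  It leaves out 'getopt()', field and
character skipping, and the output file; the function names
'uniq_sketch' and 'emit' and the 'mode' variable are hypothetical:

```shell
# Hypothetical helper; MODE is one of: all, d, u, c
uniq_sketch() {
    awk -v mode="$1" '
    # Print the finished run of identical lines according to MODE
    function emit() {
        if (mode == "c")
            printf("%4d %s\n", count, last)
        else if (mode == "all" ||
                 (mode == "d" && count > 1) ||
                 (mode == "u" && count == 1))
            print last
    }
    NR == 1    { last = $0; count = 1; next }
    $0 == last { count++; next }
               { emit(); last = $0; count = 1 }
    END        { if (NR > 0) emit() }'
}

printf 'a\na\nb\nc\nc\nc\n' | uniq_sketch d    # prints "a" and "c"
printf 'a\na\nb\nc\nc\nc\n' | uniq_sketch u    # prints "b"
printf 'a\na\nb\nc\nc\nc\n' | uniq_sketch c    # prints counts 2, 1, 3
```

   As in the full program, a run of identical lines is reported only
when the next different line arrives, so the last run must be handled in
the 'END' rule.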

   As a side note, this program does not follow our recommended
convention of naming global variables with a leading capital letter.
Doing that would make the program a little easier to follow.