File: gawk.info,  Node: Uniq Program,  Next: Wc Program,  Prev: Tee Program,  Up: Clones

11.2.6 Printing Nonduplicated Lines of Text
-------------------------------------------

The 'uniq' utility reads sorted lines of data on its standard input, and
by default removes duplicate lines.  In other words, it only prints
unique lines--hence the name.  'uniq' has a number of options.  The
usage is as follows:

     'uniq' ['-udc' ['-f N'] ['-s N']] [INPUTFILE [OUTPUTFILE]]

   The options for 'uniq' are:

'-d'
     Print only repeated (duplicated) lines.

'-u'
     Print only nonrepeated (unique) lines.

'-c'
     Count lines.  This option overrides '-d' and '-u'.  Both repeated
     and nonrepeated lines are counted.

'-f N'
     Skip N fields before comparing lines.  The definition of fields is
     similar to 'awk''s default: nonwhitespace characters separated by
     runs of spaces and/or TABs.

'-s N'
     Skip N characters before comparing lines.  Any fields specified
     with '-f' are skipped first.

'INPUTFILE'
     Data is read from the input file named on the command line, instead
     of from the standard input.

'OUTPUTFILE'
     The generated output is sent to the named output file, instead of
     to the standard output.

   Normally 'uniq' behaves as if both the '-d' and '-u' options are
provided.

   'uniq' uses the 'getopt()' library function (*note Getopt
Function::) and the 'join()' library function (*note Join Function::).

   The program begins with a 'usage()' function and then a brief outline
of the options and their meanings in comments:

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt() and join() library functions

     function usage()
     {
         print("Usage: uniq [-udc [-f fields] [-s chars]] " \
               "[ in [ out ]]") > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only nonrepeated lines
     # -f n  skip n fields
     # -s n  skip n characters, skip fields first

   The POSIX standard for 'uniq' allows options to start with '+' as
well as with '-'.
An initial 'BEGIN' rule traverses the arguments changing any leading
'+' to '-' so that the 'getopt()' function can parse the options:

     # As of 2020, '+' can be used as the option character in addition to '-'
     # Previously allowed use of -N to skip fields and +N to skip
     # characters is no longer allowed, and not supported by this version.

     BEGIN {
         # Convert + to - so getopt can handle things
         for (i = 1; i < ARGC; i++) {
             first = substr(ARGV[i], 1, 1)
             if (ARGV[i] == "--" || (first != "-" && first != "+"))
                 break
             else if (first == "+")
                 # Replace "+" with "-"
                 ARGV[i] = "-" substr(ARGV[i], 2)
         }
     }

   The next 'BEGIN' rule deals with the command-line arguments and
options.  If no options are supplied, then the default is taken, to
print both repeated and nonrepeated lines.  The output file, if
provided, is assigned to 'outputfile'.  Early on, 'outputfile' is
initialized to the standard output, '/dev/stdout':

     BEGIN {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udcf:s:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (c == "f")
                 fcount = Optarg + 0
             else if (c == "s")
                 charcount = Optarg + 0
             else
                 usage()
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

   The following function, 'are_equal()', compares the current line,
'$0', to the previous line, 'last'.  It handles skipping fields and
characters.  If no field count and no character count are specified,
'are_equal()' returns one or zero depending upon the result of a simple
string comparison of 'last' and '$0'.

   Otherwise, things get more complicated.  If fields have to be
skipped, each line is broken into an array using 'split()' (*note String
Functions::); the desired fields are then joined back into a line using
'join()'.  The joined lines are stored in 'clast' and 'cline'.
If no fields are skipped, 'clast' and 'cline' are set to 'last' and
'$0', respectively.  Finally, if characters are skipped, 'substr()' is
used to strip off the leading 'charcount' characters in 'clast' and
'cline'.  The two strings are then compared and 'are_equal()' returns
the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }

   The following two rules are the body of the program.  The first one
is executed only for the very first line of data.  It sets 'last' equal
to '$0', so that subsequent lines of text have something to be compared
to.

   The second rule does the work.  The variable 'equal' is one or zero,
depending upon the results of 'are_equal()''s comparison.  If 'uniq' is
counting lines ('-c'), and the lines are equal, then it increments the
'count' variable.  Otherwise, it prints the count and the previous
line, 'last', and resets 'count', because the two lines are not equal.

   If 'uniq' is not counting, and if the lines are equal, 'count' is
incremented.  Nothing is printed, as the point is to remove duplicates.
Otherwise, if 'uniq' is printing repeated lines ('-d') and more than one
copy of the line was seen, or if 'uniq' is printing nonrepeated lines
('-u') and only one copy was seen, then the previous line is printed,
and 'count' is reset.
Finally, similar logic is used in the 'END' rule to print the final line
of input data:

     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
         close(outputfile)
     }

   As a side note, this program does not follow our recommended
convention of naming global variables with a leading capital letter.
Doing that would make the program a little easier to follow.
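
   To see what 'are_equal()' ends up comparing when '-f' and '-s' are
in effect, the following sketch prints the comparison key for each input
line.  It is not part of 'uniq.awk': the shell function name
'compare_key' is hypothetical, and the three-argument 'join()' defined
here is a simplified stand-in for the library function (*note Join
Function::), assuming a single space as the separator:

```shell
# Hypothetical helper, not part of uniq.awk.
# Usage: compare_key FIELDS CHARS
# Prints, for each input line, the text that would be compared after
# skipping FIELDS fields and then CHARS characters.
compare_key() {
    awk -v fcount="$1" -v charcount="$2" '
    # Simplified stand-in for the join() library function:
    # always joins with a single space.
    function join(array, start, end,    result, i)
    {
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result " " array[i]
        return result
    }

    {
        key = $0
        if (fcount > 0) {
            # Skip fields first, as are_equal() does
            n = split($0, fields)
            key = join(fields, fcount + 1, n)
        }
        if (charcount > 0)
            # Then skip leading characters of what remains
            key = substr(key, charcount + 1)
        print key
    }'
}

printf '%s\n' '10 apple pie' '20 apple pie' '30 grape pie' | compare_key 1 0
```

   With one field skipped, the first two lines yield the same key,
'apple pie', so they would be treated as duplicates under '-f 1'; the
third line yields 'grape pie'.  Running 'compare_key 1 2' on the same
input additionally drops the first two characters of each key.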