File: gawk.info,  Node: History Sorting,  Next: Extract Program,  Prev: Word Sorting,  Up: Miscellaneous Programs

11.3.6 Removing Duplicates from Unsorted Text
---------------------------------------------

The 'uniq' program (*note Uniq Program::) removes duplicate lines from
_sorted_ data.

   Suppose, however, you need to remove duplicate lines from a data file
but want to preserve the order in which the lines appear.  A good
example of this might be a shell history file.  The history file keeps a
copy of all the commands you have entered, and it is not unusual to
repeat a command several times in a row.  Occasionally you might want to
compact the history by removing duplicate entries.  Yet it is desirable
to maintain the order of the original commands.

   This simple program does the job.  It uses two arrays.  The 'data'
array is indexed by the text of each line.  For each line, 'data[$0]' is
incremented.  If a particular line has not been seen before, then
'data[$0]' is zero before the increment.  In this case, 'count' is
incremented and the text of the line is stored in 'lines[count]'.  Each
element of 'lines' is a unique command, and the indices of 'lines'
indicate the order in which those lines are first encountered.  The
'END' rule simply prints out the lines, in order:

     # histsort.awk --- compact a shell history file
     # Thanks to Byron Rakitzis for the general idea

     {
         if (data[$0]++ == 0)
             lines[++count] = $0
     }

     END {
         for (i = 1; i <= count; i++)
             print lines[i]
     }
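
   As a quick check, the program can be tried on a small sample
history.  The following is only a sketch: the sample commands and the
shell function names 'sample_history' and 'histsort' are invented for
illustration, and the program text is given inline rather than read
from 'histsort.awk':

```shell
# Hypothetical sample history; the commands are invented for illustration.
sample_history() {
    printf '%s\n' 'ls -l' 'cd /tmp' 'ls -l' 'make' 'cd /tmp' 'make'
}

# Same logic as histsort.awk, given inline so the example is self-contained.
histsort() {
    awk '
    {
        # data[$0] is zero only the first time a line is seen
        if (data[$0]++ == 0)
            lines[++count] = $0
    }

    END {
        # print the unique lines in the order they first appeared
        for (i = 1; i <= count; i++)
            print lines[i]
    }'
}

sample_history | histsort
```

With this input, each of the three distinct commands should appear
once, in the order 'ls -l', 'cd /tmp', 'make'.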

   This program also provides a foundation for generating other useful
information.  For example, using the following 'print' statement in the
'END' rule indicates how often a particular command is used:

     print data[lines[i]], lines[i]

This works because 'data[$0]' is incremented each time a line is seen.
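
   Put together, the counting variant might look like the following
sketch.  The sample input and the shell function names
'sample_history' and 'histcount' are assumptions made for
illustration; only the 'print' statement differs from 'histsort.awk':

```shell
# Hypothetical sample history; the commands are invented for illustration.
sample_history() {
    printf '%s\n' 'ls -l' 'cd /tmp' 'ls -l' 'make' 'ls -l'
}

# histsort.awk with the alternative print statement in the END rule.
histcount() {
    awk '
    {
        if (data[$0]++ == 0)
            lines[++count] = $0
    }

    END {
        # data[lines[i]] holds the number of times that line was seen
        for (i = 1; i <= count; i++)
            print data[lines[i]], lines[i]
    }'
}

sample_history | histcount
```

Each output line starts with the number of occurrences, followed by the
command itself; the order of first appearance is still preserved.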

   Rick van Rein offers the following one-liner to do the same job of
removing duplicates from unsorted text:

     awk '{ if (! seen[$0]++) print }'

   This can be simplified even further, at the risk of becoming almost
too obscure:

     awk '! seen[$0]++'

This version uses the expression as a pattern, relying on 'awk''s
default action of printing the line when the pattern is true.
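
   The two one-liners can be compared side by side on the same input.
This is a minimal sketch; the sample data and the shell function names
'sample_history', 'dedup_action', and 'dedup_pattern' are hypothetical:

```shell
# Hypothetical sample history; the commands are invented for illustration.
sample_history() {
    printf '%s\n' 'ls -l' 'cd /tmp' 'ls -l' 'make' 'cd /tmp'
}

# Explicit action: print the line only if it has not been seen before.
dedup_action() {
    awk '{ if (! seen[$0]++) print }'
}

# Pattern only: the default action prints the line when the pattern is true.
dedup_pattern() {
    awk '! seen[$0]++'
}

sample_history | dedup_action
sample_history | dedup_pattern
```

Both forms print each line as soon as it is first seen, so they do not
need a second array to record the order, nor an 'END' rule.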
