manpagez: man pages & more
info gawk
Home | html | info | man

File: gawk.info,  Node: Dupword Program,  Next: Alarm Program,  Up: Miscellaneous Programs

11.3.1 Finding Duplicated Words in a Document
---------------------------------------------

A common error when writing large amounts of prose is to accidentally
duplicate words.  Typically you will see this in text as something like
"the the program does the following..." When the text is online, often
the duplicated words occur at the end of one line and the beginning of
another, making them very difficult to spot.

   This program, 'dupword.awk', scans through a file one line at a time
and looks for adjacent occurrences of the same word.  It also saves the
last word on a line (in the variable 'prev') for comparison with the
first word on the next line.

   The first two statements make sure that the line is all lowercase, so
that, for example, "The" and "the" compare equal to each other.  The
next statement replaces nonalphanumeric and nonwhitespace characters
with spaces, so that punctuation does not affect the comparison either.
The characters are replaced with spaces so that formatting controls
don't create nonsense words (e.g., the Texinfo '@code{NF}' becomes
'codeNF' if punctuation is simply deleted).  The record is then resplit
into fields, yielding just the actual words on the line, and ensuring
that there are no empty fields.

   If there are no fields left after removing all the punctuation, the
current record is skipped.  Otherwise, the program loops through each
word, comparing it to the previous one:

     # dupword.awk --- find duplicate words in text
     {
         $0 = tolower($0)
         gsub(/[^[:alnum:][:blank:]]/, " ");
         $0 = $0         # re-split
         if (NF == 0)
             next
         if ($1 == prev)
             printf("%s:%d: duplicate %s\n",
                 FILENAME, FNR, $1)
         for (i = 2; i <= NF; i++)
             if ($i == $(i-1))
                 printf("%s:%d: duplicate %s\n",
                     FILENAME, FNR, $i)
         prev = $NF
     }

© manpagez.com 2000-2025
Individual documents may contain additional copyright information.