13.3.5 Generating Word-Usage Counts
When working with large amounts of text, it can be interesting to know how often different words appear. For example, an author may overuse certain words, in which case she might wish to find synonyms to substitute for words that appear too often. This subsection develops a program for counting words and presenting the frequency information in a useful format.
At first glance, a program like this would seem to do the job:
# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The program relies on awk’s default field-splitting mechanism to break each line up into “words,” and uses an associative array named freq, indexed by each word, to count the number of times the word occurs. In the END rule, it prints the counts.
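The counting mechanism can be seen in isolation with a throwaway one-liner (the input text here is invented purely for illustration): each whitespace-separated field becomes an array subscript, and ‘++’ tallies it.

echo "to be or not to be" |
awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
     END { for (w in freq) print w, freq[w] }'

In some unspecified order, this prints ‘to 2’, ‘be 2’, ‘or 1’, and ‘not 1’.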
This program has several problems that prevent it from being useful on real text files:
- The awk language considers upper- and lowercase characters to be distinct. Therefore, “bartender” and “Bartender” are not treated as the same word. This is undesirable, since in normal text, words are capitalized if they begin sentences, and a frequency analyzer should not be sensitive to capitalization.
- Words are detected using the awk convention that fields are separated just by whitespace. Other characters in the input (except newlines) don’t have any special meaning to awk. This means that punctuation characters count as part of words (a short example following this list illustrates both of these problems).
- The output does not come out in any useful order. You’re more likely to be interested in which words occur most frequently or in having an alphabetized table of how frequently each word occurs.
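A hypothetical run makes the case and punctuation problems concrete (the input line is made up for the example):

echo "The Bartender asked the bartender." |
awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
     END { for (word in freq) printf "%s\t%d\n", word, freq[word] }'

The output counts “The” and “the” as two different words, and likewise “Bartender” and “bartender.”, the latter keeping its trailing period.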
The first problem can be solved by using tolower() to remove case distinctions. The second problem can be solved by using gsub() to remove punctuation characters. Finally, we solve the third problem by using the system sort utility to process the output of the awk script. Here is the new version of the program:
# wordfreq.awk --- print list of word frequencies
{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
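To see what the tolower() and gsub() calls do to a single line, here is a small illustrative one-liner (the input text is made up, and the snippet is not part of wordfreq.awk itself):

echo "Isn't THIS nice?" |
awk '{
    $0 = tolower($0)                        # fold case
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)  # strip punctuation
    print                                   # prints: isnt this nice
}'

The bracket expression keeps letters, digits, underscores, and blanks, and deletes everything else, so the apostrophe and the question mark disappear.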
Assuming we have saved this program in a file named ‘wordfreq.awk’, and that the data is in ‘file1’, the following pipeline:
awk -f wordfreq.awk file1 | sort -k 2nr
produces a table of the words appearing in ‘file1’ in order of decreasing frequency.
The awk program suitably massages the data and produces a word frequency table, which is not ordered. The awk script’s output is then sorted by the sort utility and printed on the screen. The options given to sort specify a sort that uses the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise ‘15’ would come before ‘5’), and that the sorting should be done in descending (reverse) order.
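If an alphabetized table is wanted instead (the other useful ordering mentioned earlier), one option is to drop the sort keys entirely; since the word comes first on each output line, plain sort arranges the table alphabetically by word:

awk -f wordfreq.awk file1 | sort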
The sort could even be done from within the program, by changing the END action to:
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
}
This way of sorting must be used on systems that do not have true pipes at the command-line (or batch-file) level. See the general operating system documentation for more information on how to use the sort program.
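As an aside, and as an assumption beyond what this section uses: gawk (as opposed to other awk implementations) can order the END loop itself through the PROCINFO["sorted_in"] setting, which removes the need for an external sort altogether. A minimal sketch of the modified END rule:

END {
    # gawk only: traverse the array by element value, numerically, descending
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}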