File: gawk.info,  Node: Egrep Program,  Next: Id Program,  Prev: Cut Program,  Up: Clones

11.2.2 Searching for Regular Expressions in Files
-------------------------------------------------

The 'grep' family of programs searches files for patterns.  These
programs have an unusual history.  Initially there was 'grep' (Global
Regular Expression Print), which used what are now called Basic Regular
Expressions (BREs).  Later there was 'egrep' (Extended 'grep') which
used what are now called Extended Regular Expressions (EREs).  (These
are almost identical to those available in 'awk'; *note Regexp::).
There was also 'fgrep' (Fast 'grep'), which searched for matches of one
or more fixed strings.

   POSIX chose to combine these three programs into one, simply named
'grep'.  On a POSIX system, 'grep''s default behavior is to search using
BREs.  You use '-E' to specify the use of EREs, and '-F' to specify
searching for fixed strings.
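
   As a quick illustration of the three modes, the following shell
sketch runs the system 'grep' (assumed here to be POSIX-conforming)
over three sample lines.  The variable names are purely illustrative:

```shell
# Three sample lines: a literal "a+b", then "aab" and "ab".
sample=$(printf '%s\n' 'a+b' 'aab' 'ab')

# Default (BRE): "+" is an ordinary character, so only the literal line matches.
bre=$(printf '%s\n' "$sample" | grep 'a+b')

# -E (ERE): "+" means "one or more of the preceding item".
ere=$(printf '%s\n' "$sample" | grep -E 'a+b')

# -F (fixed string): no character is special.
fixed=$(printf '%s\n' "$sample" | grep -F 'a+b')

echo "BRE matched:";   echo "$bre"
echo "ERE matched:";   echo "$ere"
echo "fixed matched:"; echo "$fixed"
```

   The BRE and fixed-string searches should select only 'a+b', whereas
the ERE search should select 'aab' and 'ab'.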

   In practice, systems continue to come with separate 'egrep' and
'fgrep' utilities, for backwards compatibility.  This section provides
an 'awk' implementation of 'egrep', which supports all of the
POSIX-mandated options.  You invoke it as follows:

     egrep [OPTIONS] 'PATTERN' FILES ...

   The PATTERN is a regular expression.  In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as file name wildcards.  Normally, 'egrep' prints the
lines that matched.  If multiple file names are provided on the command
line, each output line is preceded by the name of the file and a colon.

   The options to 'egrep' are as follows:

'-c'
     Print a count of the lines that matched the pattern, instead of the
     lines themselves.

'-e PATTERN'
     Use PATTERN as the regexp to match.  The purpose of the '-e' option
     is to allow patterns that start with a '-'.

'-i'
     Ignore case distinctions in both the pattern and the input data.

'-l'
     Only print (list) the names of the files that matched, not the
     lines that matched.

'-n'
     Precede each output line with its line number within the file.

'-q'
     Be quiet.  No output is produced and the exit value indicates
     whether the pattern was matched.

'-s'
     Be silent.  Do not print error messages for files that could not be
     opened.

'-v'
     Invert the sense of the test.  'egrep' prints the lines that do
     _not_ match the pattern and exits successfully if at least one
     line does not match.

'-x'
     Match the entire input line in order to consider the match as
     having succeeded.
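
   To make the effect of a few of these options concrete, here is a
small sketch that uses the system 'grep -E' (assumed to behave as POSIX
describes) as a stand-in for the 'awk' version.  The file names and
contents are made up for the example:

```shell
dir=$(mktemp -d)
printf '%s\n' 'apple' 'banana' 'cherry' > "$dir/one.txt"
printf '%s\n' 'an' 'plum' > "$dir/two.txt"

# Two files: each matching line is prefixed with its file name and a colon.
plain=$(cd "$dir" && grep -E 'an' one.txt two.txt)

# -c: print a per-file count of matching lines instead of the lines.
counts=$(cd "$dir" && grep -E -c 'an' one.txt two.txt)

# -x: the whole line must match, so "banana" no longer qualifies.
whole=$(cd "$dir" && grep -E -x 'an' one.txt two.txt)

# -v with a single file: print the lines that do NOT match.
inverted=$(cd "$dir" && grep -E -v 'an' one.txt)

printf '%s\n' "$plain" "$counts" "$whole" "$inverted"
rm -rf "$dir"
```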

   This version uses the 'getopt()' library function (*note Getopt
Function::) and 'gawk''s 'BEGINFILE' and 'ENDFILE' special patterns
(*note BEGINFILE/ENDFILE::).

   The program begins with descriptive comments and then a 'BEGIN' rule
that processes the command-line arguments with 'getopt()'.  The '-i'
(ignore case) option is particularly easy with 'gawk'; we just use the
'IGNORECASE' predefined variable (*note Built-in Variables::):

     # egrep.awk --- simulate egrep in awk
     #
     # Options:
     #    -c    count of lines
     #    -e    argument is pattern
     #    -i    ignore case
     #    -l    print filenames only
     #    -n    add line number to output
     #    -q    quiet - use exit value
     #    -s    silent - don't print errors
     #    -v    invert test, success if no match
     #    -x    the entire line must match
     #
     # Requires getopt library function
     # Uses IGNORECASE, BEGINFILE and ENDFILE
     # Invoke using gawk -f egrep.awk -- options ...

     BEGIN {
         while ((c = getopt(ARGC, ARGV, "ce:ilnqsvx")) != -1) {
             if (c == "c")
                 count_only++
             else if (c == "e")
                 pattern = Optarg
             else if (c == "i")
                 IGNORECASE = 1
             else if (c == "l")
                 filenames_only++
             else if (c == "n")
                 line_numbers++
             else if (c == "q")
                 no_print++
             else if (c == "s")
                 no_errors++
             else if (c == "v")
                 invert++
             else if (c == "x")
                 full_line++
             else
                 usage()
         }

Note the comment about invocation: Because several of the options
overlap with 'gawk''s, a '--' is needed to tell 'gawk' to stop looking
for options.
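
   The effect of '--' can be seen with a tiny stand-alone program.  This
is only a sketch: the temporary program file and the arguments are
invented for the example, and it uses plain 'awk' rather than the full
'egrep.awk':

```shell
prog=$(mktemp)
cat > "$prog" <<'EOF'
BEGIN {
    # Show what the awk program itself receives in ARGV.
    for (i = 1; i < ARGC; i++)
        printf "ARGV[%d] = %s\n", i, ARGV[i]
}
EOF

# Everything after "--" is left alone by awk and handed to the program,
# even arguments that look like options.
args=$(awk -f "$prog" -- -c -i 'pat' file.txt)
echo "$args"
rm -f "$prog"
```

   Without the '--', 'awk' would try to interpret '-c' and '-i' as its
own options instead of passing them through.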

   Next comes the code that handles the 'egrep'-specific behavior.
'egrep' uses the first nonoption on the command line if no pattern is
supplied with '-e'.  If the pattern is empty, that means no pattern was
supplied, so it's necessary to print an error message and exit.  The
'awk' command-line arguments before 'ARGV[Optind]' are cleared, so that
'awk' won't try to process them as files.  If no files are specified,
the standard input is used, and if multiple files are specified, we make
sure to note this so that the file names can precede the matched lines
in the output:

         if (pattern == "")
             pattern = ARGV[Optind++]

         if (pattern == "")
             usage()

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (Optind >= ARGC) {
             ARGV[1] = "-"
             ARGC = 2
         } else if (ARGC - Optind > 1)
             do_filenames++
     }
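
   The technique of blanking out 'ARGV' elements can be tried in
isolation.  The following sketch (with an invented data file, and
without 'getopt()') takes its first operand as the pattern and clears
that slot so that 'awk' does not try to open it as a file:

```shell
dir=$(mktemp -d)
printf '%s\n' 'abc' 'xyz' 'abba' > "$dir/data.txt"

# The first operand is the pattern, not a file.  Saving it and then
# setting ARGV[1] to "" makes awk skip that slot when opening input files.
out=$(awk 'BEGIN { pattern = ARGV[1]; ARGV[1] = "" }
           $0 ~ pattern { print }' 'b+' "$dir/data.txt")
echo "$out"
rm -rf "$dir"
```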

   The 'BEGINFILE' rule executes when each new file is processed.  In
this case, it is fairly simple; it initializes a variable 'fcount' to
zero.  'fcount' tracks how many lines in the current file matched the
pattern.

   Here also is where we implement the '-s' option.  We check if 'ERRNO'
has been set, and if '-s' was supplied.  In that case, it's necessary to
move on to the next file.  Otherwise 'gawk' would exit with an error:

     BEGINFILE {
         fcount = 0
         if (ERRNO && no_errors)
             nextfile
     }
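
   The following sketch shows the same 'ERRNO' test on its own.  It
assumes that 'gawk' is installed (the block does nothing otherwise),
and the file names are invented; 'missing.txt' deliberately does not
exist:

```shell
if command -v gawk >/dev/null 2>&1; then
    dir=$(mktemp -d)
    printf '%s\n' 'hello' > "$dir/real.txt"

    # Because the BEGINFILE rule runs nextfile when ERRNO is set,
    # gawk skips the unreadable file quietly and goes on to real.txt.
    skipped=$(cd "$dir" && gawk 'BEGINFILE { if (ERRNO) nextfile }
                                 { print FILENAME ": " $0 }' \
                                missing.txt real.txt 2>&1)
    echo "$skipped"
    rm -rf "$dir"
fi
```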

   The 'ENDFILE' rule executes after each file has been processed.  It
affects the output only when the user wants a count of the number of
lines that matched.  'no_print' is true only if the exit status is
desired.  'count_only' is true if line counts are desired.  'egrep'
therefore only prints line counts if printing and counting are enabled.
The output format must be adjusted depending upon the number of files to
process.  Finally, 'fcount' is added to 'total', so that we know the
total number of lines that matched the pattern:

     ENDFILE {
         if (! no_print && count_only) {
             if (do_filenames)
                 print FILENAME ":" fcount
             else
                 print fcount
         }

         total += fcount
     }
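
   Per-file counting with 'BEGINFILE' and 'ENDFILE' can be sketched
separately from the rest of the program.  This example assumes that
'gawk' is installed (it does nothing otherwise), hard-codes the pattern
'/an/', and uses invented file names:

```shell
if command -v gawk >/dev/null 2>&1; then
    dir=$(mktemp -d)
    printf '%s\n' 'apple' 'banana' 'mango' > "$dir/one.txt"
    printf '%s\n' 'plum' > "$dir/two.txt"

    counts=$(cd "$dir" && gawk '
        BEGINFILE { fcount = 0 }     # reset the counter for each file
        /an/      { fcount++ }       # count matching lines
        ENDFILE   { print FILENAME ":" fcount; total += fcount }
        END       { print "total:" total }
    ' one.txt two.txt)
    echo "$counts"
    rm -rf "$dir"
fi
```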

   The following rule does most of the work of matching lines.  The
variable 'matches' is true (non-zero) if the line matched the pattern.
If the user specified that the entire line must match (with '-x'), the
code checks this condition by looking at the values of 'RSTART' and
'RLENGTH'.  If those indicate that the match is not over the full line,
'matches' is set to zero (false).

   If the user wants lines that did not match, we invert the sense of
'matches' using the '!' operator.  We then increment 'fcount' with the
value of 'matches', which is either one or zero, depending upon a
successful or unsuccessful match.  If the line does not match, the
'next' statement just moves on to the next input line.

   We make a number of additional tests, but only if we are not counting
lines.  First, if the user only wants the exit status ('no_print' is
true), then it is enough to know that _one_ line in this file matched,
and we can skip on to the next file with 'nextfile'.  Similarly, if we
are only printing file names, we can print the file name, and then skip
to the next file with 'nextfile'.  Finally, each line is printed,
preceded where appropriate by the file name and a colon and by the line
number and a colon:

     {
         # match() returns the position of the match; reduce it to 1 or 0
         matches = (match($0, pattern) != 0)
         if (matches && full_line && (RSTART != 1 || RLENGTH != length()))
             matches = 0

         if (invert)
             matches = ! matches

         fcount += matches    # 1 or 0

         if (! matches)
             next

         if (! count_only) {
             if (no_print)
                 nextfile

             if (filenames_only) {
                 print FILENAME
                 nextfile
             }

             if (do_filenames)
                 printf("%s:", FILENAME)
             if (line_numbers)
                 printf("%d:", FNR)
             print
         }
     }
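
   The whole-line test relies on what 'match()' leaves behind.  The
following sketch, with a hard-coded pattern and two made-up input
lines, prints the return value of 'match()', then 'RSTART' and
'RLENGTH', and finally a one-or-zero flag (the hypothetical variable
'whole') saying whether the match covers the entire line:

```shell
out=$(printf '%s\n' 'xxabc' 'abc' | awk '{
    pos = match($0, /abc/)      # starting position of the match, or 0
    # The match spans the line only if it starts at 1 and is as long as $0.
    whole = (pos && RSTART == 1 && RLENGTH == length($0))
    print $0, pos, RSTART, RLENGTH, whole
}')
echo "$out"
```

   Note that 'match()' returns the position of the match rather than
simply one, so its result has to be reduced to one or zero before it
can be used as a count.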

   The 'END' rule takes care of producing the correct exit status.  If
there are no matches, the exit status is one; otherwise, it is zero:

     END {
         exit (total == 0)
     }
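
   The exit-status idiom can be checked from the shell.  In this
sketch, 'any_match' is a hypothetical helper that keeps only the
counting and the 'END' rule:

```shell
# Hypothetical helper: exit 0 if any line of standard input matches $1,
# and exit 1 otherwise.
any_match() {
    awk -v re="$1" '$0 ~ re { total++ } END { exit (total == 0) }'
}

hit=0;  printf 'abc\n' | any_match 'b' || hit=$?
miss=0; printf 'abc\n' | any_match 'z' || miss=$?
echo "hit=$hit miss=$miss"
```

   The comparison 'total == 0' yields one when nothing matched and zero
otherwise, which is exactly the exit status that 'grep' users expect.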

   The 'usage()' function prints a usage message in case of invalid
options, and then exits:

     function usage()
     {
         print("Usage:\tegrep [-cilnqsvx] [-e pat] [files ...]") > "/dev/stderr"
         print("\tegrep [-cilnqsvx] pat [files ...]") > "/dev/stderr"
         exit 1
     }