File: gawk.info,  Node: Egrep Program,  Next: Id Program,  Prev: Cut Program,  Up: Clones

11.2.2 Searching for Regular Expressions in Files
-------------------------------------------------

The 'grep' family of programs searches files for patterns.  These
programs have an unusual history.  Initially there was 'grep' (Global
Regular Expression Print), which used what are now called Basic Regular
Expressions (BREs).  Later there was 'egrep' (Extended 'grep') which
used what are now called Extended Regular Expressions (EREs).  (These
are almost identical to those available in 'awk'; *note Regexp::).
There was also 'fgrep' (Fast 'grep'), which searched for matches of one
or more fixed strings.

   POSIX chose to combine these three programs into one, simply named
'grep'.  On a POSIX system, 'grep''s default behavior is to search using
BREs.  You use '-E' to specify the use of EREs, and '-F' to specify
searching for fixed strings.

   In practice, systems continue to come with separate 'egrep' and
'fgrep' utilities, for backwards compatibility.  This minor node
provides an 'awk' implementation of 'egrep', which supports all of the
POSIX-mandated options.  You invoke it as follows:

     'egrep' [OPTIONS] ''PATTERN'' FILES ...

   The PATTERN is a regular expression.  In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as file name wildcards.  Normally, 'egrep' prints the
lines that matched.  If multiple file names are provided on the command
line, each output line is preceded by the name of the file and a colon.

   The options to 'egrep' are as follows:

'-c'
     Print a count of the lines that matched the pattern, instead of the
     lines themselves.

'-e PATTERN'
     Use PATTERN as the regexp to match.  The purpose of the '-e' option
     is to allow patterns that start with a '-'.

'-i'
     Ignore case distinctions in both the pattern and the input data.

'-l'
     Only print (list) the names of the files that matched, not the
     lines that matched.

'-n'
     Add the line number to each output line.

'-q'
     Be quiet.
     No output is produced and the exit value indicates whether the
     pattern was matched.

'-s'
     Be silent.  Do not print error messages for files that could not be
     opened.

'-v'
     Invert the sense of the test.  'egrep' prints the lines that do
     _not_ match the pattern and exits successfully if the pattern is
     not matched.

'-x'
     Match the entire input line in order to consider the match as
     having succeeded.

   This version uses the 'getopt()' library function (*note Getopt
Function::) and 'gawk''s 'BEGINFILE' and 'ENDFILE' special patterns
(*note BEGINFILE/ENDFILE::).

   The program begins with descriptive comments and then a 'BEGIN' rule
that processes the command-line arguments with 'getopt()'.  The '-i'
(ignore case) option is particularly easy with 'gawk'; we just use the
'IGNORECASE' predefined variable (*note Built-in Variables::):

     # egrep.awk --- simulate egrep in awk
     #
     # Options:
     #    -c    count of lines
     #    -e    argument is pattern
     #    -i    ignore case
     #    -l    print filenames only
     #    -n    add line number to output
     #    -q    quiet - use exit value
     #    -s    silent - don't print errors
     #    -v    invert test, success if no match
     #    -x    the entire line must match
     #
     # Requires getopt library function
     # Uses IGNORECASE, BEGINFILE and ENDFILE
     # Invoke using gawk -f egrep.awk -- options ...

     BEGIN {
         while ((c = getopt(ARGC, ARGV, "ce:ilnqsvx")) != -1) {
             if (c == "c")
                 count_only++
             else if (c == "e")
                 pattern = Optarg
             else if (c == "i")
                 IGNORECASE = 1
             else if (c == "l")
                 filenames_only++
             else if (c == "n")
                 line_numbers++
             else if (c == "q")
                 no_print++
             else if (c == "s")
                 no_errors++
             else if (c == "v")
                 invert++
             else if (c == "x")
                 full_line++
             else
                 usage()
         }

   Note the comment about invocation: Because several of the options
overlap with 'gawk''s, a '--' is needed to tell 'gawk' to stop looking
for options.

   Next comes the code that handles the 'egrep'-specific behavior.
'egrep' uses the first nonoption on the command line if no pattern is
supplied with '-e'.
If the pattern is empty, that means no pattern was supplied, so it's
necessary to print an error message and exit.  The 'awk' command-line
arguments up to 'ARGV[Optind]' are cleared, so that 'awk' won't try to
process them as files.  If no files are specified, the standard input is
used, and if multiple files are specified, we make sure to note this so
that the file names can precede the matched lines in the output:

         if (pattern == "")
             pattern = ARGV[Optind++]

         if (pattern == "")
             usage()

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (Optind >= ARGC) {
             ARGV[1] = "-"
             ARGC = 2
         } else if (ARGC - Optind > 1)
             do_filenames++
     }

   The 'BEGINFILE' rule executes when each new file is processed.  In
this case, it is fairly simple; it initializes a variable 'fcount' to
zero.  'fcount' tracks how many lines in the current file matched the
pattern.

   Here also is where we implement the '-s' option.  We check if 'ERRNO'
has been set, and if '-s' was supplied.  In that case, it's necessary to
move on to the next file.  Otherwise 'gawk' would exit with an error:

     BEGINFILE {
         fcount = 0
         if (ERRNO && no_errors)
             nextfile
     }

   The 'ENDFILE' rule executes after each file has been processed.  It
affects the output only when the user wants a count of the number of
lines that matched.  'no_print' is true only if the exit status is
desired.  'count_only' is true if line counts are desired.  'egrep'
therefore only prints line counts if printing and counting are enabled.
The output format must be adjusted depending upon the number of files to
process.  Finally, 'fcount' is added to 'total', so that we know the
total number of lines that matched the pattern:

     ENDFILE {
         if (! no_print && count_only) {
             if (do_filenames)
                 print FILENAME ":" fcount
             else
                 print fcount
         }

         total += fcount
     }

   The following rule does most of the work of matching lines.  The
variable 'matches' is true (non-zero) if the line matched the pattern.
If the user specified that the entire line must match (with '-x'), the
code checks this condition by looking at the values of 'RSTART' and
'RLENGTH'.  If those indicate that the match is not over the full line,
'matches' is set to zero (false).

   If the user wants lines that did not match, we invert the sense of
'matches' using the '!' operator.  We then increment 'fcount' with the
value of 'matches', which is either one or zero, depending upon a
successful or unsuccessful match.  If the line does not match, the
'next' statement just moves on to the next input line.

   We make a number of additional tests, but only if we are not counting
lines.  First, if the user only wants the exit status ('no_print' is
true), then it is enough to know that _one_ line in this file matched,
and we can skip on to the next file with 'nextfile'.  Similarly, if we
are only printing file names, we can print the file name, and then skip
to the next file with 'nextfile'.  Finally, each line is printed,
preceded if necessary by the file name and a colon, and by the line
number and a colon:

     {
         # match() returns the starting position of the match, so
         # compare against zero to make 'matches' exactly 1 or 0
         matches = (match($0, pattern) != 0)
         if (matches && full_line && (RSTART != 1 || RLENGTH != length()))
             matches = 0

         if (invert)
             matches = ! matches

         fcount += matches    # 1 or 0

         if (! matches)
             next

         if (! count_only) {
             if (no_print)
                 nextfile

             if (filenames_only) {
                 print FILENAME
                 nextfile
             }

             if (do_filenames)
                 if (line_numbers)
                    print FILENAME ":" FNR ":" $0
                 else
                    print FILENAME ":" $0
             else if (line_numbers)
                 print FNR ":" $0    # -n with a single input file
             else
                 print
         }
     }

   The 'END' rule takes care of producing the correct exit status.  If
there are no matches, the exit status is one; otherwise, it is zero:

     END {
         exit (total == 0)
     }

   The 'usage()' function prints a usage message in case of invalid
options, and then exits:

     function usage()
     {
         print("Usage:\tegrep [-cilnqsvx] [-e pat] [files ...]") > "/dev/stderr"
         print("\tegrep [-cilnqsvx] pat [files ...]") > "/dev/stderr"
         exit 1
     }
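
   The full-line test behind '-x' can be tried on its own.  The
following is a minimal sketch rather than part of 'egrep.awk': the shell
function name 'full_line_match' is hypothetical, and the sketch assumes
only that a POSIX 'awk' is installed.  It prints the input lines that
the regexp matches from the first character through the last:

```shell
# full_line_match REGEX
# Hypothetical helper, for illustration only; not part of egrep.awk.
# Reads standard input and prints the lines that REGEX matches in full.
full_line_match() {
    awk -v pattern="$1" '
    {
        # match() sets RSTART and RLENGTH for the leftmost-longest
        # match; the line matches in full only if that match starts
        # at character 1 and is as long as the line itself.
        if (match($0, pattern) && RSTART == 1 && RLENGTH == length($0))
            print
    }'
}

printf 'abc\nxabc\nabcabc\n' | full_line_match 'abc'
```

   Of the three input lines, only 'abc' is printed.  In 'xabc' the match
starts at the second character, and in 'abcabc' the match covers only
the first three of six characters, so both lines fail the test.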