File: gawk.info,  Node: Split Program,  Next: Tee Program,  Prev: Id Program,  Up: Clones

11.2.4 Splitting a Large File into Pieces
-----------------------------------------

The 'split' utility splits large text files into smaller pieces.  Its
usage follows the POSIX standard for 'split':

     'split' ['-l' COUNT] ['-a' SUFFIX-LEN] [FILE [OUTNAME]]
     'split' '-b' N['k'|'m'] ['-a' SUFFIX-LEN] [FILE [OUTNAME]]

   By default, the output files are named 'xaa', 'xab', and so on.  Each
file has 1,000 lines in it, with the likely exception of the last file.

   The 'split' program has evolved over time, and the current POSIX
version is more complicated than the original Unix version.  The options
and what they do are as follows:

'-a' SUFFIX-LEN
     Use SUFFIX-LEN characters for the suffix.  For example, if
     SUFFIX-LEN is four, the output files would range from 'xaaaa' to
     'xzzzz'.

'-b' N['k'|'m']
     Instead of each file containing a specified number of lines, each
     file should have (at most) N bytes.  Supplying a trailing 'k'
     multiplies N by 1,024, yielding kilobytes.  Supplying a trailing
     'm' multiplies N by 1,048,576 (1,024 * 1,024) yielding megabytes.
     (This option is mutually exclusive with '-l').

'-l' COUNT
     Each file should have at most COUNT lines, instead of the default
     1,000.  (This option is mutually exclusive with '-b').

   If supplied, FILE is the input file to read.  Otherwise standard
input is processed.  If supplied, OUTNAME is the leading prefix to use
for file names, instead of 'x'.
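
   Since each suffix position holds one of 26 lowercase letters, a
suffix of SUFFIX-LEN characters allows at most 26 ^ SUFFIX-LEN output
files.  The following shell sketch works out that limit for a few
lengths.  It assumes only that some POSIX 'awk' is on the search path;
the variable name 'capacity' is ours and not part of 'split':

```shell
# Print each suffix length next to the number of suffixes it allows.
capacity=$(awk 'BEGIN {
    for (len = 2; len <= 4; len++)
        print len, 26 ^ len
}')
echo "$capacity"
```

   The default suffix length of two therefore allows 676 output files.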

   In order to use the '-b' option of 'split', 'gawk' should be invoked
with its own '-b' option (*note Options::), or with the environment
variable 'LC_ALL' set to 'C', so that each input byte is treated as a
separate character.(1)
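
   To see why the locale matters, consider a word whose UTF-8 encoding
uses more bytes than it has characters.  This small sketch feeds the
five bytes of "cafe" with an accented final letter to 'awk' with
'LC_ALL' set to 'C'.  It assumes an 'awk' that honors the locale, as
'gawk' does; the variable name 'bytes' is ours:

```shell
# \303\251 is the two-byte UTF-8 encoding of an accented "e".
bytes=$(printf 'caf\303\251\n' | LC_ALL=C awk '{ print length($0) }')
echo "$bytes"
```

   In the 'C' locale 'length()' should report 5, the number of bytes.
In a UTF-8 locale, 'gawk' would instead count four characters.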

   Here is an implementation of 'split' in 'awk'.  It uses the
'getopt()' function presented in *note Getopt Function::.

   The program begins with a standard descriptive comment and then a
'usage()' function describing the options.  The variable 'common' keeps
the function's lines short so that they look nice on the page:

     # split.awk --- do split in awk
     #
     # Requires getopt() library function.

     function usage(     common)
     {
         common = "[-a suffix-len] [file [outname]]"
         printf("usage: split [-l count]  %s\n", common) > "/dev/stderr"
         printf("       split [-b N[k|m]] %s\n", common) > "/dev/stderr"
         exit 1
     }

   Next, in a 'BEGIN' rule we set the default values and parse the
arguments.  After that we initialize the data structures used to cycle
the suffix from 'aa...' to 'zz...'.  Finally we set the name of the
first output file:

     BEGIN {
         # Set defaults:
         Suffix_length = 2
         Line_count = 1000
         Byte_count = 0
         Outfile = "x"

         parse_arguments()

         init_suffix_data()

         Output = (Outfile compute_suffix())
     }

   Parsing the arguments is straightforward.  The program follows our
convention (*note Library Names::) of having important global variables
start with an uppercase letter:

     function parse_arguments(   i, c, l, modifier)
     {
         while ((c = getopt(ARGC, ARGV, "a:b:l:")) != -1) {
             if (c == "a")
                 Suffix_length = Optarg + 0
             else if (c == "b") {
                 Byte_count = Optarg + 0
                 Line_count = 0

                 l = length(Optarg)
                 modifier = substr(Optarg, l, 1)
                 if (modifier == "k")
                     Byte_count *= 1024
                 else if (modifier == "m")
                     Byte_count *= 1024 * 1024
             } else if (c == "l") {
                 Line_count = Optarg + 0
                 Byte_count = 0
             } else
                 usage()
         }

         # Clear out options
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         # Check for filename
         if (ARGV[Optind]) {
             Optind++

             # Check for different prefix
             if (ARGV[Optind]) {
                 Outfile = ARGV[Optind]
                 ARGV[Optind] = ""

                 if (++Optind < ARGC)
                     usage()
             }
         }
     }
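
   The handling of '-b' relies on the way 'awk' converts a string to a
number: only the leading numeric part is used, so '"42m" + 0' is 42.
The following sketch isolates that conversion.  The function
'size_to_bytes()' and the shell variable 'sizes' are hypothetical
helpers for illustration; they are not part of 'split.awk':

```shell
sizes=$(awk '
# size_to_bytes() is a hypothetical helper, not part of split.awk.
function size_to_bytes(arg,    n, modifier)
{
    n = arg + 0                        # numeric prefix: "42m" + 0 is 42
    modifier = substr(arg, length(arg), 1)
    if (modifier == "k")
        n *= 1024
    else if (modifier == "m")
        n *= 1024 * 1024
    return n
}
BEGIN {
    print size_to_bytes("512")
    print size_to_bytes("4k")
    print size_to_bytes("42m")
}')
echo "$sizes"
```

   The three calls print 512, 4096, and 44040192, one per line.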

   Managing the file name suffix is interesting.  Given a suffix of
length three, say, the values go from 'aaa', 'aab', 'aac' and so on, all
the way to 'zzx', 'zzy', and finally 'zzz'.  There are two important
aspects to this:

   * We have to be able to easily generate these suffixes, and in
     particular easily handle "rolling over"; for example, going from
     'abz' to 'aca'.

   * We have to tell when we've finished with the last file, so that if
     we still have more input data we can print an error message and
     exit.  The trick is to handle this _after_ using the last suffix,
     and not when the final suffix is created.

   The computation is handled by 'compute_suffix()'.  This function is
called every time a new file is opened.

   The flow here is messy, because we want to generate 'zzzz' (say), and
use it, and only produce an error after all the file name suffixes have
been used up.  The logical steps are as follows:

  1. Generate the suffix, saving the value in 'result' to return.  To do
     this, the supplementary array 'Suffix_ind' contains one element for
     each letter in the suffix.  Each element ranges from 1 to 26,
     acting as the index into a string containing all the lowercase
     letters of the English alphabet.  It is initialized by
     'init_suffix_data()'.  'result' is built up one letter at a time,
     with one call to 'substr()' per letter.

  2. Prepare the data structures for the next time 'compute_suffix()' is
     called.  To do this, we loop over 'Suffix_ind', _backwards_.  If
     the current element is less than 26, it's incremented and the loop
     breaks ('abq' goes to 'abr').  Otherwise, the element is reset to
     one and we move on to the preceding element ('abz' to 'aca').  Thus, the
     'Suffix_ind' array is always "one step ahead" of the actual file
     name suffix to be returned.

  3. Check if we've gone past the limit of possible file names.  If
     'Reached_last' is true, print a message and exit.  Otherwise, check
     if 'Suffix_ind' describes a suffix where all the letters are 'z'.
     If so, we're about to return the final suffix, and we set
     'Reached_last' to true so that the _next_ call to
     'compute_suffix()' will cause a failure.

   Physically, the steps in the function occur in the order 3, 1, 2:

     function compute_suffix(    i, result, letters)
     {
         # Logical step 3
         if (Reached_last) {
             printf("split: too many files!\n") > "/dev/stderr"
             exit 1
         } else if (on_last_file())
             Reached_last = 1    # fail when wrapping after 'zzz'

         # Logical step 1
         result = ""
         letters = "abcdefghijklmnopqrstuvwxyz"
         for (i = 1; i <= Suffix_length; i++)
             result = result substr(letters, Suffix_ind[i], 1)

         # Logical step 2
         for (i = Suffix_length; i >= 1; i--) {
             if (++Suffix_ind[i] > 26) {
                 Suffix_ind[i] = 1
             } else
                 break
         }

         return result
     }

   The 'Suffix_ind' array and 'Reached_last' are initialized by
'init_suffix_data()':

     function init_suffix_data(  i)
     {
         for (i = 1; i <= Suffix_length; i++)
             Suffix_ind[i] = 1

         Reached_last = 0
     }

   The function 'on_last_file()' returns true if 'Suffix_ind' describes
a suffix where all the letters are 'z' by checking that all the elements
in the array are equal to 26:

     function on_last_file(  i, on_last)
     {
         on_last = 1
         for (i = 1; i <= Suffix_length; i++) {
             on_last = on_last && (Suffix_ind[i] == 26)
         }

         return on_last
     }
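
   One way to check one's understanding of the suffix sequence is to
view it as counting in base 26, with 'a' as the digit zero.  The sketch
below uses that view to compute the suffix of the Nth output file
directly.  This is a different technique from the one in
'compute_suffix()', which steps through the suffixes one at a time; the
function 'suffix()' and the shell variable 'suffixes' are hypothetical
names used only here:

```shell
suffixes=$(awk '
# suffix() is a hypothetical helper: it returns the suffix of the
# N-th output file (counting from zero) by writing N in base 26.
function suffix(n, len,    s, letters)
{
    letters = "abcdefghijklmnopqrstuvwxyz"
    s = ""
    while (len-- > 0) {
        s = substr(letters, n % 26 + 1, 1) s    # least significant first
        n = int(n / 26)
    }
    return s
}
BEGIN {
    print suffix(0, 3), suffix(25, 3), suffix(26, 3), suffix(26 ^ 3 - 1, 3)
}')
echo "$suffixes"
```

   This prints 'aaa aaz aba zzz': the first suffix, the last one before
a rollover, the first one after it, and the final suffix of length
three.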

   The actual work of splitting the input file is done by the next two
rules.  Since splitting by line count and splitting by byte count are
mutually exclusive, we simply use two separate rules, one for when
'Line_count' is greater than zero, and another for when 'Byte_count' is
greater than zero.

   The variable 'tcount' counts how many lines have gone to the current
output file, including the line being processed.  When it exceeds
'Line_count', it's time to close the previous file and switch to a new
one:

     Line_count > 0 {
         if (++tcount > Line_count) {
             close(Output)
             Output = (Outfile compute_suffix())
             tcount = 1
         }
         print > Output
     }
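
   The line-counting logic can be tried on its own with a small input.
The following sketch splits five lines into pieces of two lines each.
To stay self-contained it makes some assumptions that differ from
'split.awk': output files are numbered ('part1', 'part2', ...) instead
of using 'compute_suffix()', they go into a temporary directory made
with 'mktemp -d', and 'Line_count' is set with '-v':

```shell
tmpdir=$(mktemp -d)
printf 'l1\nl2\nl3\nl4\nl5\n' |
awk -v prefix="$tmpdir/part" -v Line_count=2 '
BEGIN {
    nfiles = 1                  # numeric names stand in for compute_suffix()
    Output = prefix nfiles
}
{
    if (++tcount > Line_count) {    # current file is full
        close(Output)
        Output = prefix (++nfiles)
        tcount = 1
    }
    print > Output
}
END { close(Output) }'
# Collect the per-file line counts, followed by the total from wc.
line_counts=$(cd "$tmpdir" && wc -l part1 part2 part3 |
              awk '{ printf "%s%s", sep, $1; sep = " " }')
echo "$line_counts"
rm -rf "$tmpdir"
```

   The counts come out as '2 2 1 5': two full files, a final file with
the one remaining line, and the total of five lines.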

   The rule for handling bytes is more complicated.  Here 'tcount'
counts the bytes written so far to the current output file.  Since lines
most likely vary in length, the 'Byte_count' boundary may be hit in the
middle of an input record.  In that case, 'split' has to write enough of
the first bytes of the input record to finish up 'Byte_count' bytes,
close the file, open a new file, and write the rest of the record to the
new file.  The logic here does all that:

     Byte_count > 0 {
         # `+ 1' is for the final newline
         while (tcount + length($0) + 1 > Byte_count) { # would overflow
             # compute leading bytes
             leading_bytes = Byte_count - tcount

             # write leading bytes
             printf("%s", substr($0, 1, leading_bytes)) > Output

             # close old file, open new file
             close(Output)
             Output = (Outfile compute_suffix())

             # set up first bytes for new file
             $0 = substr($0, leading_bytes + 1)  # trailing bytes
             tcount = 0
         }

         # write full record or trailing bytes
         tcount += length($0) + 1
         print > Output
     }
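
   A single long record makes the byte-oriented logic easy to observe.
The sketch below cuts an eleven-byte input (ten letters and a newline)
into pieces of at most four bytes.  As assumptions of the sketch, not of
'split.awk', output files are numbered ('part1', 'part2', ...) in a
temporary directory made with 'mktemp -d', 'Byte_count' is set with
'-v', and the overflow test is a loop so that a record spanning several
pieces is cut as often as needed:

```shell
tmpdir=$(mktemp -d)
printf 'abcdefghij\n' |
LC_ALL=C awk -v prefix="$tmpdir/part" -v Byte_count=4 '
BEGIN {
    nfiles = 1                  # numeric names stand in for compute_suffix()
    Output = prefix nfiles
}
{
    # Loop (an assumption of this sketch) so that a record longer than
    # several pieces is cut as many times as needed.
    while (tcount + length($0) + 1 > Byte_count) {
        leading_bytes = Byte_count - tcount
        printf("%s", substr($0, 1, leading_bytes)) > Output
        close(Output)
        Output = prefix (++nfiles)
        $0 = substr($0, leading_bytes + 1)
        tcount = 0
    }
    tcount += length($0) + 1    # + 1 for the newline
    print > Output
}
END { close(Output) }'
# Collect the per-file byte counts, followed by the total from wc.
byte_counts=$(cd "$tmpdir" && wc -c part1 part2 part3 |
              awk '{ printf "%s%s", sep, $1; sep = " " }')
echo "$byte_counts"
rm -rf "$tmpdir"
```

   The counts come out as '4 4 3 11': two full pieces, a final piece
holding 'ij' and the newline, and the total of eleven bytes.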

   Finally, the 'END' rule cleans up by closing the last output file:

     END {
         close(Output)
     }

   ---------- Footnotes ----------

   (1) Using '-b' twice requires separating 'gawk''s options from those
of the program.  For example: 'gawk -f getopt.awk -f split.awk -b -- -b
42m large-file.txt split-'.