File: gawk.info,  Node: Split Program,  Next: Tee Program,  Prev: Id Program,  Up: Clones

11.2.4 Splitting a Large File into Pieces
-----------------------------------------

The 'split' utility splits large text files into smaller pieces.  The
usage follows the POSIX standard for 'split' and is as follows:

     'split' ['-l' COUNT] ['-a' SUFFIX-LEN] [FILE [OUTNAME]]
     'split' '-b' N['k'|'m'] ['-a' SUFFIX-LEN] [FILE [OUTNAME]]

   By default, the output files are named 'xaa', 'xab', and so on.  Each
file has 1,000 lines in it, with the likely exception of the last file.

   The 'split' program has evolved over time, and the current POSIX
version is more complicated than the original Unix version.  The options
and what they do are as follows:

'-a' SUFFIX-LEN
     Use SUFFIX-LEN characters for the suffix.  For example, if
     SUFFIX-LEN is four, the output files would range from 'xaaaa' to
     'xzzzz'.

'-b' N['k'|'m']
     Instead of each file containing a specified number of lines, each
     file should have (at most) N bytes.  Supplying a trailing 'k'
     multiplies N by 1,024, yielding kilobytes.  Supplying a trailing
     'm' multiplies N by 1,048,576 (1,024 * 1,024), yielding megabytes.
     (This option is mutually exclusive with '-l'.)

'-l' COUNT
     Each file should have at most COUNT lines, instead of the default
     1,000.  (This option is mutually exclusive with '-b'.)

   If supplied, FILE is the input file to read.  Otherwise standard
input is processed.  If supplied, OUTNAME is the leading prefix to use
for file names, instead of 'x'.

   In order to use the '-b' option, 'gawk' should be invoked with its
'-b' option (*note Options::), or with the environment variable 'LC_ALL'
set to 'C', so that each input byte is treated as a separate
character.(1)

   Here is an implementation of 'split' in 'awk'.  It uses the
'getopt()' function presented in *note Getopt Function::.

   The program begins with a standard descriptive comment and then a
'usage()' function describing the options.
The variable 'common' keeps the function's lines short so that they
look nice on the page:

     # split.awk --- do split in awk
     #
     # Requires getopt() library function.

     function usage(    common)
     {
         common = "[-a suffix-len] [file [outname]]"
         printf("usage: split [-l count]  %s\n", common) > "/dev/stderr"
         printf("       split [-b N[k|m]] %s\n", common) > "/dev/stderr"
         exit 1
     }

   Next, in a 'BEGIN' rule we set the default values and parse the
arguments.  After that we initialize the data structures used to cycle
the suffix from 'aa...' to 'zz...'.  Finally we set the name of the
first output file:

     BEGIN {
         # Set defaults:
         Suffix_length = 2
         Line_count = 1000
         Byte_count = 0
         Outfile = "x"

         parse_arguments()

         init_suffix_data()

         Output = (Outfile compute_suffix())
     }

   Parsing the arguments is straightforward.  The program follows our
convention (*note Library Names::) of having important global variables
start with an uppercase letter:

     function parse_arguments(   i, c, l, modifier)
     {
         while ((c = getopt(ARGC, ARGV, "a:b:l:")) != -1) {
             if (c == "a")
                 Suffix_length = Optarg + 0
             else if (c == "b") {
                 Byte_count = Optarg + 0
                 Line_count = 0

                 l = length(Optarg)
                 modifier = substr(Optarg, l, 1)
                 if (modifier == "k")
                     Byte_count *= 1024
                 else if (modifier == "m")
                     Byte_count *= 1024 * 1024
             } else if (c == "l") {
                 Line_count = Optarg + 0
                 Byte_count = 0
             } else
                 usage()
         }

         # Clear out options
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         # Check for filename
         if (ARGV[Optind]) {
             Optind++

             # Check for different prefix
             if (ARGV[Optind]) {
                 Outfile = ARGV[Optind]
                 ARGV[Optind] = ""

                 if (++Optind < ARGC)
                     usage()
             }
         }
     }

   Managing the file name suffix is interesting.  Given a suffix of
length three, say, the values go from 'aaa', 'aab', 'aac' and so on, all
the way to 'zzx', 'zzy', and finally 'zzz'.  There are two important
aspects to this:

   * We have to be able to easily generate these suffixes, and in
     particular easily handle "rolling over"; for example, going from
     'abz' to 'aca'.

   * We have to tell when we've finished with the last file, so that if
     we still have more input data we can print an error message and
     exit.  The trick is to handle this _after_ using the last suffix,
     and not when the final suffix is created.

   The computation is handled by 'compute_suffix()'.  This function is
called every time a new file is opened.

   The flow here is messy, because we want to generate 'zzzz' (say), and
use it, and only produce an error after all the file name suffixes have
been used up.  The logical steps are as follows:

  1. Generate the suffix, saving the value in 'result' to return.  To do
     this, the supplementary array 'Suffix_ind' contains one element for
     each letter in the suffix.  Each element ranges from 1 to 26,
     acting as the index into a string containing all the lowercase
     letters of the English alphabet.  It is initialized by
     'init_suffix_data()'.  'result' is built up one letter at a time,
     with one 'substr()' call per element.

  2. Prepare the data structures for the next time 'compute_suffix()' is
     called.  To do this, we loop over 'Suffix_ind', _backwards_.  If
     the current element is less than 26, it's incremented and the loop
     breaks ('abq' goes to 'abr').  Otherwise, the element is reset to
     one and we move on to the preceding element ('abz' to 'aca').
     Thus, the 'Suffix_ind' array is always "one step ahead" of the
     actual file name suffix to be returned.

  3. Check if we've gone past the limit of possible file names.  If
     'Reached_last' is true, print a message and exit.  Otherwise, check
     if 'Suffix_ind' describes a suffix where all the letters are 'z'.
     If so, we're about to return the final suffix, and we set
     'Reached_last' to true so that the _next_ call to
     'compute_suffix()' will cause a failure.
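
   As an illustration only, here is a hand trace of successive calls to
'compute_suffix()'.  It assumes the default 'Suffix_length' of two and is
not part of the program:

     Call    Returns   Suffix_ind afterwards   Reached_last
     1st     'aa'      1, 2                    0
     2nd     'ab'      1, 3                    0
     26th    'az'      2, 1                    0
     27th    'ba'      2, 2                    0
     676th   'zz'      1, 1                    1
     677th   (nothing; prints an error message and exits)

   After each call the array already describes the suffix that the next
call will return.  This is why the all-'z' check is made on entry,
before the suffix is generated.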

   Physically, the steps in the function occur in the order 3, 1, 2:

     function compute_suffix(    i, result, letters)
     {
         # Logical step 3
         if (Reached_last) {
             printf("split: too many files!\n") > "/dev/stderr"
             exit 1
         } else if (on_last_file())
             Reached_last = 1    # fail when wrapping after 'zzz'

         # Logical step 1
         result = ""
         letters = "abcdefghijklmnopqrstuvwxyz"
         for (i = 1; i <= Suffix_length; i++)
             result = result substr(letters, Suffix_ind[i], 1)

         # Logical step 2
         for (i = Suffix_length; i >= 1; i--) {
             if (++Suffix_ind[i] > 26) {
                 Suffix_ind[i] = 1
             } else
                 break
         }

         return result
     }

   The 'Suffix_ind' array and 'Reached_last' are initialized by
'init_suffix_data()':

     function init_suffix_data(    i)
     {
         for (i = 1; i <= Suffix_length; i++)
             Suffix_ind[i] = 1

         Reached_last = 0
     }

   The function 'on_last_file()' returns true if 'Suffix_ind' describes
a suffix where all the letters are 'z' by checking that all the elements
in the array are equal to 26:

     function on_last_file(    i, on_last)
     {
         on_last = 1
         for (i = 1; i <= Suffix_length; i++) {
             on_last = on_last && (Suffix_ind[i] == 26)
         }

         return on_last
     }

   The actual work of splitting the input file is done by the next two
rules.  Since splitting by line count and splitting by byte count are
mutually exclusive, we simply use two separate rules, one for when
'Line_count' is greater than zero, and another for when 'Byte_count' is
greater than zero.

   The variable 'tcount' counts how many lines have been processed so
far.  When it exceeds 'Line_count', it's time to close the previous file
and switch to a new one:

     Line_count > 0 {
         if (++tcount > Line_count) {
             close(Output)
             Output = (Outfile compute_suffix())
             tcount = 1
         }
         print > Output
     }

   The rule for handling bytes is more complicated.  Since lines most
likely vary in length, the 'Byte_count' boundary may be hit in the
middle of an input record.
In that case, 'split' has to write enough of the first bytes of the
input record to finish up 'Byte_count' bytes, close the file, open a new
file, and write the rest of the record to the new file.  The logic here
does all that:

     Byte_count > 0 {
         # `+ 1' is for the final newline
         if (tcount + length($0) + 1 > Byte_count) { # would overflow
             # compute leading bytes
             leading_bytes = Byte_count - tcount

             # write leading bytes
             printf("%s", substr($0, 1, leading_bytes)) > Output

             # close old file, open new file
             close(Output)
             Output = (Outfile compute_suffix())

             # set up first bytes for new file
             $0 = substr($0, leading_bytes + 1)  # trailing bytes
             tcount = 0
         }

         # write full record or trailing bytes
         tcount += length($0) + 1
         print > Output
     }

   Finally, the 'END' rule cleans up by closing the last output file:

     END {
         close(Output)
     }

   ---------- Footnotes ----------

   (1) Using '-b' twice requires separating 'gawk''s options from those
of the program.  For example: 'gawk -f getopt.awk -f split.awk -b -- -b
42m large-file.txt split-'.