File: gawk.info, Node: Array Sorting Functions, Prev: Controlling Array Traversal, Up: Array Sorting 12.3.2 Sorting Array Values and Indices with 'gawk' --------------------------------------------------- In most 'awk' implementations, sorting an array requires writing a 'sort()' function. This can be educational for exploring different sorting algorithms, but usually that's not the point of the program. 'gawk' provides the built-in 'asort()' and 'asorti()' functions (*note String Functions::) for sorting arrays. For example: POPULATE THE ARRAY data n = asort(data) for (i = 1; i <= n; i++) DO SOMETHING WITH data[i] After the call to 'asort()', the array 'data' is indexed from 1 to some number N, the total number of elements in 'data'. (This count is 'asort()''s return value.) 'data[1]' <= 'data[2]' <= 'data[3]', and so on. The default comparison is based on the type of the elements (*note Typing and Comparison::). All numeric values come before all string values, which in turn come before all subarrays. An important side effect of calling 'asort()' is that _the array's original indices are irrevocably lost_. As this isn't always desirable, 'asort()' accepts a second argument: POPULATE THE ARRAY source n = asort(source, dest) for (i = 1; i <= n; i++) DO SOMETHING WITH dest[i] In this case, 'gawk' copies the 'source' array into the 'dest' array and then sorts 'dest', destroying its indices. However, the 'source' array is not affected. Often, what's needed is to sort on the values of the _indices_ instead of the values of the elements. To do that, use the 'asorti()' function. The interface and behavior are identical to that of 'asort()', except that the index values are used for sorting and become the values of the result array: { source[$0] = some_func($0) } END { n = asorti(source, dest) for (i = 1; i <= n; i++) { Work with sorted indices directly: DO SOMETHING WITH dest[i] ... Access original array via sorted indices: DO SOMETHING WITH source[dest[i]] } } So far, so good. Now it starts to get interesting. Both 'asort()' and 'asorti()' accept a third string argument to control comparison of array elements. When we introduced 'asort()' and 'asorti()' in *note String Functions::, we ignored this third argument; however, now is the time to describe how this argument affects these two functions. Basically, the third argument specifies how the array is to be sorted. There are two possibilities. As with 'PROCINFO["sorted_in"]', this argument may be one of the predefined names that 'gawk' provides (*note Controlling Scanning::), or it may be the name of a user-defined function (*note Controlling Array Traversal::). In the latter case, _the function can compare elements in any way it chooses_, taking into account just the indices, just the values, or both. This is extremely powerful. Once the array is sorted, 'asort()' takes the _values_ in their final order and uses them to fill in the result array, whereas 'asorti()' takes the _indices_ in their final order and uses them to fill in the result array. NOTE: Copying array indices and elements isn't expensive in terms of memory. Internally, 'gawk' maintains "reference counts" to data. For example, when 'asort()' copies the first array to the second one, there is only one copy of the original array elements' data, even though both arrays use the values. You may use the same array for both the first and second arguments to 'asort()' and 'asorti()'. Doing so only makes sense if you are also supplying the third argument, since 'awk' doesn't provide a way to pass that third argument without also passing the first and second ones. Because 'IGNORECASE' affects string comparisons, the value of 'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'. Note also that the locale's sorting order does _not_ come into play; comparisons are based on character values only.(1) The following example demonstrates the use of a comparison function with 'asort()'. The comparison function, 'case_fold_compare()', maps both values to lowercase in order to compare them ignoring case. # case_fold_compare --- compare as strings, ignoring case function case_fold_compare(i1, v1, i2, v2, l, r) { l = tolower(v1) r = tolower(v2) if (l < r) return -1 else if (l == r) return 0 else return 1 } And here is the test program for it: # Test program BEGIN { Letters = "abcdefghijklmnopqrstuvwxyz" \ "ABCDEFGHIJKLMNOPQRSTUVWXYZ" split(Letters, data, "") asort(data, result, "case_fold_compare") j = length(result) for (i = 1; i <= j; i++) { printf("%s", result[i]) if (i % (j/2) == 0) printf("\n") else printf(" ") } } When run, we get the following: $ gawk -f case_fold_compare.awk -| A a B b c C D d e E F f g G H h i I J j k K l L M m -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z NOTE: "Under the hood," 'gawk' uses the C library 'qsort()' function to manage the sorting. 'qsort()' can call itself recursively. This means that when you write a comparison function, you should be careful to avoid the use of global variables and arrays; use only local variables and arrays that you declare as additional parameters to the comparison function. Otherwise, you are likely to cause unintentional memory corruption in your global arrays and possibly cause 'gawk' itself to fail. ---------- Footnotes ---------- (1) This is true because locale-based comparison occurs only when in POSIX-compatibility mode, and because 'asort()' and 'asorti()' are 'gawk' extensions, they are not available in that case.