manpagez: man pages & more
info gawk
Home | html | info | man

File: gawk.info,  Node: Array Sorting Functions,  Prev: Controlling Array Traversal,  Up: Array Sorting

12.3.2 Sorting Array Values and Indices with 'gawk'
---------------------------------------------------

In most 'awk' implementations, sorting an array requires writing a
'sort()' function.  This can be educational for exploring different
sorting algorithms, but usually that's not the point of the program.
'gawk' provides the built-in 'asort()' and 'asorti()' functions (*note
String Functions::) for sorting arrays.  For example:

     POPULATE THE ARRAY data
     n = asort(data)
     for (i = 1; i <= n; i++)
         DO SOMETHING WITH data[i]

   After the call to 'asort()', the array 'data' is indexed from 1 to
some number N, the total number of elements in 'data'.  (This count is
'asort()''s return value.)  'data[1]' <= 'data[2]' <= 'data[3]', and so
on.  The default comparison is based on the type of the elements (*note
Typing and Comparison::).  All numeric values come before all string
values, which in turn come before all subarrays.

   An important side effect of calling 'asort()' is that _the array's
original indices are irrevocably lost_.  As this isn't always desirable,
'asort()' accepts a second argument:

     POPULATE THE ARRAY source
     n = asort(source, dest)
     for (i = 1; i <= n; i++)
         DO SOMETHING WITH dest[i]

   In this case, 'gawk' copies the 'source' array into the 'dest' array
and then sorts 'dest', destroying its indices.  However, the 'source'
array is not affected.

   Often, what's needed is to sort on the values of the _indices_
instead of the values of the elements.  To do that, use the 'asorti()'
function.  The interface and behavior are identical to that of
'asort()', except that the index values are used for sorting and become
the values of the result array:

     { source[$0] = some_func($0) }

     END {
         n = asorti(source, dest)
         for (i = 1; i <= n; i++) {
             Work with sorted indices directly:
             DO SOMETHING WITH dest[i]
             ...
             Access original array via sorted indices:
             DO SOMETHING WITH source[dest[i]]
         }
     }

   So far, so good.  Now it starts to get interesting.  Both 'asort()'
and 'asorti()' accept a third string argument to control comparison of
array elements.  When we introduced 'asort()' and 'asorti()' in *note
String Functions::, we ignored this third argument; however, now is the
time to describe how this argument affects these two functions.

   Basically, the third argument specifies how the array is to be
sorted.  There are two possibilities.  As with 'PROCINFO["sorted_in"]',
this argument may be one of the predefined names that 'gawk' provides
(*note Controlling Scanning::), or it may be the name of a user-defined
function (*note Controlling Array Traversal::).

   In the latter case, _the function can compare elements in any way it
chooses_, taking into account just the indices, just the values, or
both.  This is extremely powerful.

   Once the array is sorted, 'asort()' takes the _values_ in their final
order and uses them to fill in the result array, whereas 'asorti()'
takes the _indices_ in their final order and uses them to fill in the
result array.

     NOTE: Copying array indices and elements isn't expensive in terms
     of memory.  Internally, 'gawk' maintains "reference counts" to
     data.  For example, when 'asort()' copies the first array to the
     second one, there is only one copy of the original array elements'
     data, even though both arrays use the values.

   You may use the same array for both the first and second arguments to
'asort()' and 'asorti()'.  Doing so only makes sense if you are also
supplying the third argument, since 'awk' doesn't provide a way to pass
that third argument without also passing the first and second ones.

   Because 'IGNORECASE' affects string comparisons, the value of
'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'.
Note also that the locale's sorting order does _not_ come into play;
comparisons are based on character values only.(1)

   The following example demonstrates the use of a comparison function
with 'asort()'.  The comparison function, 'case_fold_compare()', maps
both values to lowercase in order to compare them ignoring case.

     # case_fold_compare --- compare as strings, ignoring case

     function case_fold_compare(i1, v1, i2, v2,    l, r)
     {
         l = tolower(v1)
         r = tolower(v2)

         if (l < r)
             return -1
         else if (l == r)
             return 0
         else
             return 1
     }

   And here is the test program for it:

     # Test program

     BEGIN {
         Letters = "abcdefghijklmnopqrstuvwxyz" \
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
         split(Letters, data, "")

         asort(data, result, "case_fold_compare")

         j = length(result)
         for (i = 1; i <= j; i++) {
             printf("%s", result[i])
             if (i % (j/2) == 0)
                 printf("\n")
             else
                 printf(" ")
         }
     }

   When run, we get the following:

     $ gawk -f case_fold_compare.awk
     -| A a B b c C D d e E F f g G H h i I J j k K l L M m
     -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z

     NOTE: "Under the hood," 'gawk' uses the C library 'qsort()'
     function to manage the sorting.  'qsort()' can call itself
     recursively.  This means that when you write a comparison function,
     you should be careful to avoid the use of global variables and
     arrays; use only local variables and arrays that you declare as
     additional parameters to the comparison function.  Otherwise, you
     are likely to cause unintentional memory corruption in your global
     arrays and possibly cause 'gawk' itself to fail.

   ---------- Footnotes ----------

   (1) This is true because locale-based comparison occurs only when in
POSIX-compatibility mode, and because 'asort()' and 'asorti()' are
'gawk' extensions, they are not available in that case.

© manpagez.com 2000-2025
Individual documents may contain additional copyright information.