File: gawk.info,  Node: String Functions,  Next: I/O Functions,  Prev: Numeric Functions,  Up: Built-in

9.1.4 String-Manipulation Functions
-----------------------------------

The functions in this minor node look at or change the text of one or
more strings.

   'gawk' understands locales (*note Locales::) and does all string
processing in terms of _characters_, not _bytes_.  This distinction is
particularly important to understand for locales where one character may
be represented by multiple bytes.  Thus, for example, 'length()' returns
the number of characters in a string, and not the number of bytes used
to represent those characters.  Similarly, 'index()' works with
character indices, and not byte indices.

     CAUTION: A number of functions deal with indices into strings.  For
     these functions, the first character of a string is at position
     (index) one.  This is different from C and the languages descended
     from it, where the first character is at position zero.  You need
     to remember this when doing index calculations, particularly if you
     are used to C.
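
   The one-based convention can be checked with a minimal sketch such as
the following.  The shell function name 'one_based_demo' and the sample
string are illustrative only.

```shell
# Hypothetical helper: shows that awk string positions start at one.
one_based_demo() {
    awk 'BEGIN {
        s = "gawk"
        # substr(s, 1, 1) is the first character; index() reports 1 for it.
        print substr(s, 1, 1), index(s, "g"), index(s, "k")
    }'
}
one_based_demo    # prints: g 1 4
```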

   In the following list, optional parameters are enclosed in square
brackets ([ ]).  Several functions perform string substitution; the full
discussion is provided in the description of the 'sub()' function, which
comes toward the end, because the list is presented alphabetically.

   Those functions that are specific to 'gawk' are marked with a pound
sign ('#').  They are not available in compatibility mode (*note
Options::):

* Menu:

* Gory Details::                More than you want to know about '\' and
                                '&' with 'sub()', 'gsub()', and
                                'gensub()'.

'asort('SOURCE [',' DEST [',' HOW ] ]') #'
'asorti('SOURCE [',' DEST [',' HOW ] ]') #'
     These two functions are similar in behavior, so they are described
     together.

          NOTE: The following description ignores the third argument,
          HOW, as it requires understanding features that we have not
          discussed yet.  Thus, the discussion here is a deliberate
          simplification.  (We do provide all the details later on; see
          *note Array Sorting Functions:: for the full story.)

     Both functions return the number of elements in the array SOURCE.
     For 'asort()', 'gawk' sorts the values of SOURCE and replaces the
     indices of the sorted values of SOURCE with sequential integers
     starting with one.  If the optional array DEST is specified, then
     SOURCE is duplicated into DEST.  DEST is then sorted, leaving the
     indices of SOURCE unchanged.

     When comparing strings, 'IGNORECASE' affects the sorting (*note
     Array Sorting Functions::).  If the SOURCE array contains subarrays
     as values (*note Arrays of Arrays::), they will come last, after
     all scalar values.  Subarrays are _not_ recursively sorted.

     For example, if the contents of 'a' are as follows:

          a["last"] = "de"
          a["first"] = "sac"
          a["middle"] = "cul"

     A call to 'asort()':

          asort(a)

     results in the following contents of 'a':

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

     The 'asorti()' function works similarly to 'asort()'; however, the
     _indices_ are sorted, instead of the values.  Thus, in the previous
     example, starting with the same initial set of indices and values
     in 'a', calling 'asorti(a)' would yield:

          a[1] = "first"
          a[2] = "last"
          a[3] = "middle"

          NOTE: You may not use either 'SYMTAB' or 'FUNCTAB' as the
          second argument to these functions.  Attempting to do so
          produces a fatal error.  You may use them as the first
          argument, but only if providing a second array to use for the
          actual sorting.

     You are allowed to use the same array for both the SOURCE and DEST
     arguments, but doing so only makes sense if you're also supplying
     the third argument.

'gensub(REGEXP, REPLACEMENT, HOW' [', TARGET']') #'
     Search the target string TARGET for matches of the regular
     expression REGEXP.  If HOW is a string beginning with 'g' or 'G'
     (short for "global"), then replace all matches of REGEXP with
     REPLACEMENT.  Otherwise, treat HOW as a number indicating which
     match of REGEXP to replace.  Treat numeric values less than one as
     if they were one.  If no TARGET is supplied, use '$0'.  Return the
     modified string as the result of the function.  The original target
     string is _not_ changed.

     The returned value is _always_ a string, even if the original
     TARGET was a number or a regexp value.

     'gensub()' is a general substitution function.  Its purpose is to
     provide more features than the standard 'sub()' and 'gsub()'
     functions.

     'gensub()' provides an additional feature that is not available in
     'sub()' or 'gsub()': the ability to specify components of a regexp
     in the replacement text.  This is done by using parentheses in the
     regexp to mark the components and then specifying '\N' in the
     replacement text, where N is a digit from 1 to 9.  For example:

          $ gawk '
          > BEGIN {
          >      a = "abc def"
          >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
          >      print b
          > }'
          -| def abc

     As with 'sub()', you must type two backslashes in order to get one
     into the string.  In the replacement text, the sequence '\0'
     represents the entire matched text, as does the character '&'.

     The following example shows how you can use the third argument to
     control which match of the regexp should be changed:

          $ echo a b c a b c |
          > gawk '{ print gensub(/a/, "AA", 2) }'
          -| a b c AA b c

     In this case, '$0' is the default target string.  'gensub()'
     returns the new string as its result, which is passed directly to
     'print' for printing.

     If the HOW argument is a string that does not begin with 'g' or
     'G', or if it is a number that is less than or equal to zero, only
     one substitution is performed.  If HOW is zero, 'gawk' issues a
     warning message.

     If REGEXP does not match TARGET, 'gensub()''s return value is the
     original unchanged value of TARGET.  Note that, as mentioned above,
     the returned value is a string, even if TARGET was not.

'gsub(REGEXP, REPLACEMENT' [', TARGET']')'
     Search TARGET for _all_ of the longest, leftmost, _nonoverlapping_
     matching substrings it can find and replace them with REPLACEMENT.
     The 'g' in 'gsub()' stands for "global," which means replace
     everywhere.  For example:

          { gsub(/Britain/, "United Kingdom"); print }

     replaces all occurrences of the string 'Britain' with 'United
     Kingdom' for all input records.

     The 'gsub()' function returns the number of substitutions made.  If
     the variable to search and alter (TARGET) is omitted, then the
     entire input record ('$0') is used.  As in 'sub()', the characters
     '&' and '\' are special, and the third argument must be assignable.
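
     A minimal sketch of using the return value, with an explicit
     TARGET variable instead of '$0'.  The name 'gsub_count_demo' and
     the sample text are illustrative.

```shell
# Hypothetical helper: gsub() changes its target and returns a count.
gsub_count_demo() {
    awk 'BEGIN {
        s = "Britain and Britain"
        n = gsub(/Britain/, "United Kingdom", s)
        print n        # number of substitutions made
        print s        # the modified target
    }'
}
gsub_count_demo
```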

'index(IN, FIND)'
     Search the string IN for the first occurrence of the string FIND,
     and return the position in characters where that occurrence begins
     in the string IN.  Consider the following example:

          $ awk 'BEGIN { print index("peanut", "an") }'
          -| 3

     If FIND is not found, 'index()' returns zero.

     With BWK 'awk' and 'gawk', it is a fatal error to use a regexp
     constant for FIND.  Other implementations allow it, simply treating
     the regexp constant as an expression meaning '$0 ~ /regexp/'.
     (d.c.)

'length('[STRING]')'
     Return the number of characters in STRING.  If STRING is a number,
     the length of the digit string representing that number is
     returned.  For example, 'length("abcde")' is five.  By contrast,
     'length(15 * 35)' works out to three.  In this example, 15 * 35 =
     525, and 525 is then converted to the string '"525"', which has
     three characters.

     If no argument is supplied, 'length()' returns the length of '$0'.
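
     The three cases just described can be tried in one line.  This is
     only a sketch; the function name 'length_demo' and the input record
     are illustrative.

```shell
# Hypothetical helper: string argument, numeric argument, and no argument.
length_demo() {
    echo "hello world" |
    awk '{ print length("abcde"), length(15 * 35), length() }'
}
length_demo    # prints: 5 3 11
```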

          NOTE: In older versions of 'awk', the 'length()' function
          could be called without any parentheses.  Doing so is
          considered poor practice, although the 2008 POSIX standard
          explicitly allows it, to support historical practice.  For
          programs to be maximally portable, always supply the
          parentheses.

     If 'length()' is called with a variable that has not been used,
     'gawk' forces the variable to be a scalar.  Other implementations
     of 'awk' leave the variable without a type.  (d.c.)  Consider:

          $ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
          -| 0
          error-> gawk: fatal: attempt to use scalar `x' as array

          $ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
          -| 0

     If '--lint' has been specified on the command line, 'gawk' issues a
     warning about this.

     With 'gawk' and several other 'awk' implementations, when given an
     array argument, the 'length()' function returns the number of
     elements in the array.  (c.e.)  This is less useful than it might
     seem at first, as the array is not guaranteed to be indexed from
     one to the number of elements in it.  If '--lint' is provided on
     the command line (*note Options::), 'gawk' warns that passing an
     array argument is not portable.  If '--posix' is supplied, using an
     array argument is a fatal error (*note Arrays::).

'match(STRING, REGEXP' [', ARRAY']')'
     Search STRING for the longest, leftmost substring matched by the
     regular expression REGEXP and return the character position (index)
     at which that substring begins (one, if it starts at the beginning
     of STRING).  If no match is found, return zero.

     The REGEXP argument may be either a regexp constant ('/'...'/') or
     a string constant ('"'...'"').  In the latter case, the string is
     treated as a regexp to be matched.  *Note Computed Regexps:: for a
     discussion of the difference between the two forms, and the
     implications for writing your program correctly.

     The order of the first two arguments is the opposite of most other
     string functions that work with regular expressions, such as
     'sub()' and 'gsub()'.  It might help to remember that for
     'match()', the order is the same as for the '~' operator: 'STRING ~
     REGEXP'.

     The 'match()' function sets the predefined variable 'RSTART' to the
     index.  It also sets the predefined variable 'RLENGTH' to the
     length in characters of the matched substring.  If no match is
     found, 'RSTART' is set to zero, and 'RLENGTH' to -1.

     For example:

          {
              if ($1 == "FIND")
                  regex = $2
              else {
                  where = match($0, regex)
                  if (where != 0)
                      print "Match of", regex, "found at", where, "in", $0
              }
          }

     This program looks for lines that match the regular expression
     stored in the variable 'regex'.  This regular expression can be
     changed.  If the first word on a line is 'FIND', 'regex' is changed
     to be the second word on that line.  Therefore, if given:

          FIND ru+n
          My program runs
          but not very quickly
          FIND Melvin
          JF+KM
          This line is property of Reality Engineering Co.
          Melvin was here.

     'awk' prints:

          Match of ru+n found at 12 in My program runs
          Match of Melvin found at 1 in Melvin was here.
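
     'RSTART' and 'RLENGTH' are often handed straight to 'substr()' to
     pull out the matched text.  The following is a minimal sketch of
     that idiom; the function name 'match_vars_demo' is made up for
     illustration.

```shell
# Hypothetical helper: use RSTART and RLENGTH after a call to match().
match_vars_demo() {
    awk 'BEGIN {
        s = "My program runs"
        if (match(s, /ru+n/))
            print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        match(s, /xyz/)            # no match this time
        print RSTART, RLENGTH
    }'
}
match_vars_demo
```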

     If ARRAY is present, it is cleared, and then the zeroth element of
     ARRAY is set to the entire portion of STRING matched by REGEXP.  If
     REGEXP contains parentheses, the integer-indexed elements of ARRAY
     are set to contain the portion of STRING matching the corresponding
     parenthesized subexpression.  For example:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >         print arr[1], arr[2] }'
          -| foooo barrrrr

     In addition, multidimensional subscripts are available providing
     the start index and length of each matched subexpression:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >           print arr[1], arr[2]
          >           print arr[1, "start"], arr[1, "length"]
          >           print arr[2, "start"], arr[2, "length"]
          > }'
          -| foooo barrrrr
          -| 1 5
          -| 9 7

     There may not be subscripts for the start and length for every
     parenthesized subexpression, because they may not all have matched
     text; thus, they should be tested for with the 'in' operator (*note
     Reference to Elements::).

     The ARRAY argument to 'match()' is a 'gawk' extension.  In
     compatibility mode (*note Options::), using a third argument is a
     fatal error.

'patsplit(STRING, ARRAY' [', FIELDPAT' [', SEPS' ] ]') #'
     Divide STRING into pieces (or "fields") defined by FIELDPAT and
     store the pieces in ARRAY and the separator strings in the SEPS
     array.  The first piece is stored in 'ARRAY[1]', the second piece
     in 'ARRAY[2]', and so forth.  The third argument, FIELDPAT, is a
     regexp describing the fields in STRING (just as 'FPAT' is a regexp
     describing the fields in input records).  It may be either a regexp
     constant or a string.  If FIELDPAT is omitted, the value of 'FPAT'
     is used.  'patsplit()' returns the number of elements created.
     'SEPS[I]' is the possibly null separator string after 'ARRAY[I]'.
     The possibly null leading separator will be in 'SEPS[0]'.  So a
     non-null STRING with N fields will have N+1 separators.  A null
     STRING has no fields or separators.

     The 'patsplit()' function splits strings into pieces in a manner
     similar to the way input lines are split into fields using 'FPAT'
     (*note Splitting By Content::).

     Before splitting the string, 'patsplit()' deletes any previously
     existing elements in the arrays ARRAY and SEPS.

'split(STRING, ARRAY' [', FIELDSEP' [', SEPS' ] ]')'
     Divide STRING into pieces separated by FIELDSEP and store the
     pieces in ARRAY and the separator strings in the SEPS array.  The
     first piece is stored in 'ARRAY[1]', the second piece in
     'ARRAY[2]', and so forth.  The string value of the third argument,
     FIELDSEP, is a regexp describing where to split STRING (much as
     'FS' can be a regexp describing where to split input records).  If
     FIELDSEP is omitted, the value of 'FS' is used.  'split()' returns
     the number of elements created.  SEPS is a 'gawk' extension, with
     'SEPS[I]' being the separator string between 'ARRAY[I]' and
     'ARRAY[I+1]'.  If FIELDSEP is a single space, then any leading
     whitespace goes into 'SEPS[0]' and any trailing whitespace goes
     into 'SEPS[N]', where N is the return value of 'split()' (i.e., the
     number of elements in ARRAY).

     The 'split()' function splits strings into pieces in the same way
     that input lines are split into fields.  For example:

          split("cul-de-sac", a, "-", seps)

     splits the string '"cul-de-sac"' into three fields using '-' as the
     separator.  It sets the contents of the array 'a' as follows:

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

     and sets the contents of the array 'seps' as follows:

          seps[1] = "-"
          seps[2] = "-"

     The value returned by this call to 'split()' is three.

     As with input field-splitting, when the value of FIELDSEP is '" "',
     leading and trailing whitespace is ignored in values assigned to
     the elements of ARRAY but not in SEPS, and the elements are
     separated by runs of whitespace.  Also, as with input field
     splitting, if FIELDSEP is the null string, each individual
     character in the string is split into its own array element.
     (c.e.)  Additionally, if FIELDSEP is a single-character string,
     that string acts as the separator, even if its value is a regular
     expression metacharacter.
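
     Two of these rules are easy to check with a short sketch.  The
     function name 'split_rules_demo' and the sample strings are
     illustrative.

```shell
# Hypothetical helper: single-character and single-space separators.
split_rules_demo() {
    awk 'BEGIN {
        # A one-character separator is taken literally, even ".".
        n = split("a.b.c", parts, ".")
        print n, parts[1], parts[2], parts[3]
        # A single space splits on runs of whitespace and trims the ends.
        n = split("  cul   de sac ", words, " ")
        print n, words[1], words[3]
    }'
}
split_rules_demo
```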

     Note, however, that 'RS' has no effect on the way 'split()' works.
     Even though 'RS = ""' causes the newline character to also be an
     input field separator, this does not affect how 'split()' splits
     strings.

     Modern implementations of 'awk', including 'gawk', allow the third
     argument to be a regexp constant ('/'...'/') as well as a string.
     (d.c.)  The POSIX standard allows this as well.  *Note Computed
     Regexps:: for a discussion of the difference between using a string
     constant or a regexp constant, and the implications for writing
     your program correctly.

     Before splitting the string, 'split()' deletes any previously
     existing elements in the arrays ARRAY and SEPS.

     If STRING is null, the array has no elements.  (So this is a
     portable way to delete an entire array with one statement.  *Note
     Delete::.)
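
     A minimal sketch of that idiom, counting the elements afterward to
     confirm the array is empty.  The function name 'clear_array_demo'
     is illustrative.

```shell
# Hypothetical helper: splitting the null string empties the array.
clear_array_demo() {
    awk 'BEGIN {
        a["x"] = 1; a["y"] = 2
        split("", a)          # null string, so a ends up with no elements
        n = 0
        for (k in a)
            n++
        print n
    }'
}
clear_array_demo    # prints: 0
```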

     If STRING does not match FIELDSEP at all (but is not null), ARRAY
     has one element only.  The value of that element is the original
     STRING.

     In POSIX mode (*note Options::), the fourth argument is not
     allowed.

'sprintf(FORMAT, EXPRESSION1, ...)'
     Return (without printing) the string that 'printf' would have
     printed out with the same arguments (*note Printf::).  For example:

          pival = sprintf("pi = %.2f (approx.)", 22/7)

     assigns the string 'pi = 3.14 (approx.)' to the variable 'pival'.
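
     Because the result is an ordinary string, it can be stored and
     processed further.  The sketch below builds a fixed-width record
     and measures it; the function name 'sprintf_demo' and the format
     are illustrative.

```shell
# Hypothetical helper: build a formatted string and then inspect it.
sprintf_demo() {
    awk 'BEGIN {
        s = sprintf("%05d|%-6s|%.3f", 42, "ab", 22/7)
        print s
        print length(s)
    }'
}
sprintf_demo
```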

'strtonum(STR) #'
     Examine STR and return its numeric value.  If STR begins with a
     leading '0', 'strtonum()' assumes that STR is an octal number.  If
     STR begins with a leading '0x' or '0X', 'strtonum()' assumes that
     STR is a hexadecimal number.  For example:

          $ echo 0x11 |
          > gawk '{ printf "%d\n", strtonum($1) }'
          -| 17

     Using the 'strtonum()' function is _not_ the same as adding zero to
     a string value; the automatic coercion of strings to numbers works
     only for decimal data, not for octal or hexadecimal.(1)

     Note also that 'strtonum()' uses the current locale's decimal point
     for recognizing numbers (*note Locales::).

'sub(REGEXP, REPLACEMENT' [', TARGET']')'
     Search TARGET, which is treated as a string, for the leftmost,
     longest substring matched by the regular expression REGEXP.  Modify
     the entire string by replacing the matched text with REPLACEMENT.
     The modified string becomes the new value of TARGET.  Return the
     number of substitutions made (zero or one).

     The REGEXP argument may be either a regexp constant ('/'...'/') or
     a string constant ('"'...'"').  In the latter case, the string is
     treated as a regexp to be matched.  *Note Computed Regexps:: for a
     discussion of the difference between the two forms, and the
     implications for writing your program correctly.

     This function is peculiar because TARGET is not simply used to
     compute a value, and not just any expression will do--it must be a
     variable, field, or array element so that 'sub()' can store a
     modified value there.  If this argument is omitted, then the
     default is to use and alter '$0'.(2)  For example:

          str = "water, water, everywhere"
          sub(/at/, "ith", str)

     sets 'str' to 'wither, water, everywhere', by replacing the
     leftmost longest occurrence of 'at' with 'ith'.

     If the special character '&' appears in REPLACEMENT, it stands for
     the precise substring that was matched by REGEXP.  (If the regexp
     can match more than one string, then this precise substring may
     vary.)  For example:

          { sub(/candidate/, "& and his wife"); print }

     changes the first occurrence of 'candidate' to 'candidate and his
     wife' on each input line.  Here is another example:

          $ awk 'BEGIN {
          >         str = "daabaaa"
          >         sub(/a+/, "C&C", str)
          >         print str
          > }'
          -| dCaaCbaaa

     This shows how '&' can represent a nonconstant string and also
     illustrates the "leftmost, longest" rule in regexp matching (*note
     Leftmost Longest::).

     The effect of this special character ('&') can be turned off by
     putting a backslash before it in the string.  As usual, to insert
     one backslash in the string, you must write two backslashes.
     Therefore, write '\\&' in a string constant to include a literal
     '&' in the replacement.  For example, the following shows how to
     replace the first '|' on each line with an '&':

          { sub(/\|/, "\\&"); print }
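
     The two uses of '&' can be compared side by side in a small
     sketch.  The function name 'ampersand_demo' and the sample strings
     are illustrative.

```shell
# Hypothetical helper: escaped "&" is literal, bare "&" is the match.
ampersand_demo() {
    awk 'BEGIN {
        s = "a|b|c"
        n = sub(/\|/, "\\&", s)        # only the first "|" is replaced
        print n, s
        t = "candidate"
        sub(/candidate/, "[&]", t)     # "&" expands to the matched text
        print t
    }'
}
ampersand_demo
```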

     As mentioned, the third argument to 'sub()' must be a variable,
     field, or array element.  Some versions of 'awk' allow the third
     argument to be an expression that is not an lvalue.  In such a
     case, 'sub()' still searches for the pattern and returns zero or
     one, but the result of the substitution (if any) is thrown away
     because there is no place to put it.  Such versions of 'awk' accept
     expressions like the following:

          sub(/USA/, "United States", "the USA and Canada")

     For historical compatibility, 'gawk' accepts such erroneous code.
     However, using any other nonchangeable object as the third
     parameter causes a fatal error and your program will not run.

     Finally, if the REGEXP is not a regexp constant, it is converted
     into a string, and then the value of that string is treated as the
     regexp to match.

'substr(STRING, START' [', LENGTH' ]')'
     Return a LENGTH-character-long substring of STRING, starting at
     character number START.  The first character of a string is
     character number one.(3)  For example, 'substr("washington", 5, 3)'
     returns '"ing"'.

     If LENGTH is not present, 'substr()' returns the whole suffix of
     STRING that begins at character number START.  For example,
     'substr("washington", 5)' returns '"ington"'.  The whole suffix is
     also returned if LENGTH is greater than the number of characters
     remaining in the string, counting from character START.

     If START is less than one, 'substr()' treats it as if it were one.
     (POSIX doesn't specify what to do in this case: BWK 'awk' acts this
     way, and therefore 'gawk' does too.)  If START is greater than the
     number of characters in the string, 'substr()' returns the null
     string.  Similarly, if LENGTH is present but less than or equal to
     zero, the null string is returned.
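
     Several of these boundary cases are gathered in the following
     sketch, which brackets each result so that null strings stay
     visible.  The function name 'substr_edges_demo' is illustrative.

```shell
# Hypothetical helper: boundary cases for START and LENGTH.
substr_edges_demo() {
    awk 'BEGIN {
        s = "washington"
        printf "[%s]\n", substr(s, 5, 100)   # LENGTH past the end
        printf "[%s]\n", substr(s, 0)        # START below one, no LENGTH
        printf "[%s]\n", substr(s, 11)       # START past the end
        printf "[%s]\n", substr(s, 5, 0)     # LENGTH of zero
    }'
}
substr_edges_demo
```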

     The string returned by 'substr()' _cannot_ be assigned.  Thus, it
     is a mistake to attempt to change a portion of a string, as shown
     in the following example:

          string = "abcdef"
          # try to get "abCDEf", won't work
          substr(string, 3, 3) = "CDE"

     It is also a mistake to use 'substr()' as the third argument of
     'sub()' or 'gsub()':

          gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG

     (Some commercial versions of 'awk' treat 'substr()' as assignable,
     but doing so is not portable.)

     If you need to replace bits and pieces of a string, combine
     'substr()' with string concatenation, in the following manner:

          string = "abcdef"
          ...
          string = substr(string, 1, 2) "CDE" substr(string, 6)

'tolower(STRING)'
     Return a copy of STRING, with each uppercase character in the
     string replaced with its corresponding lowercase character.
     Nonalphabetic characters are left unchanged.  For example,
     'tolower("MiXeD cAsE 123")' returns '"mixed case 123"'.

'toupper(STRING)'
     Return a copy of STRING, with each lowercase character in the
     string replaced with its corresponding uppercase character.
     Nonalphabetic characters are left unchanged.  For example,
     'toupper("MiXeD cAsE 123")' returns '"MIXED CASE 123"'.

   At first glance, the 'split()' and 'patsplit()' functions appear to
be mirror images of each other.  But there are differences:

   * 'split()' treats its third argument like 'FS', with all the special
     rules involved for 'FS'.

   * Matching of null strings differs.  This is discussed in *note FS
     versus FPAT::.

                       Matching the Null String

   In 'awk', the '*' operator can match the null string.  This is
particularly important for the 'sub()', 'gsub()', and 'gensub()'
functions.  For example:

     $ echo abc | awk '{ gsub(/m*/, "X"); print }'
     -| XaXbXcX

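Here 'm*' matches the null string before each character and again at
the end of the record, so 'gsub()' inserts an 'X' at each of those four
positions.  Although this makes a certain amount of sense, it can be
surprising.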
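
   One way to avoid null matches is to use '+' in place of '*' when at
least one occurrence is required.  The following sketch compares the
two on the same record; the function name 'null_match_demo' is
illustrative.

```shell
# Hypothetical helper: "*" can match the null string, "+" cannot.
null_match_demo() {
    echo abc | awk '{
        line = $0
        n = gsub(/m*/, "X", line)    # matches the null string at each position
        print n, line
        line = $0
        n = gsub(/m+/, "X", line)    # needs at least one "m", so no match
        print n, line
    }'
}
null_match_demo
```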

   ---------- Footnotes ----------

   (1) Unless you use the '--non-decimal-data' option, which isn't
recommended.  *Note Nondecimal Data:: for more information.

   (2) Note that this means that the record will first be regenerated
using the value of 'OFS' if any fields have been changed, and that the
fields will be updated after the substitution, even if the operation is
a "no-op" such as 'sub(/^/, "")'.

   (3) This is different from C and C++, in which the first character is
number zero.