File: gawk.info,  Node: Bytes vs. Characters,  Next: Using extensions,  Up: Wc Program

11.2.7.1 Modern Character Sets
..............................

In the early days of computing, single bytes were used for storing
characters.  The most common character sets were ASCII and EBCDIC, which
each provided all the English upper- and lowercase letters, the 10
Hindu-Arabic numerals from 0 through 9, and a number of other standard
punctuation and control characters.

   Today, the most popular character set in use is Unicode (of which
ASCII is a pure subset).  Unicode assigns a unique number (called a
"code point") to each of well over one hundred thousand characters,
covering most existing human languages (living and dead).  Unofficial
private-use conventions extend it to a number of nonhuman ones as well
(such as Klingon and J.R.R. Tolkien's elvish languages).

   To save space in files, Unicode code points are "encoded" as
sequences of bytes.  UTF-8 is possibly the most popular of such
"multibyte encodings"; in it, each character takes from one to four
bytes in the file.
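
   As a rough illustration of the one-to-four-byte range in UTF-8, the
following sketch computes how many bytes a given code point occupies
when encoded.  The function name 'utf8_len()' is made up for this
example; it is not part of 'gawk'.  The thresholds are the standard
UTF-8 range boundaries, written in decimal so that any POSIX 'awk'
accepts them.

```awk
# utf8_len --- hypothetical helper: number of bytes UTF-8 needs
# to encode the code point CP (given as a decimal number)
function utf8_len(cp)
{
    if (cp < 128)       # U+0000 through U+007F: plain ASCII
        return 1
    if (cp < 2048)      # U+0080 through U+07FF
        return 2
    if (cp < 65536)     # U+0800 through U+FFFF
        return 3
    return 4            # U+10000 through U+10FFFF
}

BEGIN {
    print utf8_len(65)        # "A", U+0041
    print utf8_len(233)       # e with acute accent, U+00E9
    print utf8_len(8364)      # euro sign, U+20AC
    print utf8_len(128512)    # an emoji, U+1F600
}
```

   Run with any 'awk', this prints 1, 2, 3, and 4, one per line.  ASCII
characters stay at one byte each, which is why plain ASCII files are
also valid UTF-8.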

   The POSIX standard requires that 'awk' operate in terms of
characters, not bytes.  Thus in 'gawk', 'length()', 'substr()',
'split()', 'match()', and the other string functions (*note String
Functions::) all work in terms of characters in the local character set,
and not in terms of bytes.  (Not all 'awk' implementations do so,
though.)
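
   The difference can be observed with a small test, sketched here
under the assumption that the string is UTF-8 encoded.  The two octal
escapes spell out the bytes of a single non-ASCII character, so the
result does not depend on how the program file itself is saved.

```awk
BEGIN {
    # "na\303\257ve" is the word "naive" with a diaeresis on the i.
    # \303\257 are the two UTF-8 bytes of that one character (U+00EF),
    # so the string holds five characters stored in six bytes.
    s = "na\303\257ve"
    print length(s)
}
```

   In 'gawk' running under a UTF-8 locale, this should print 5, the
number of characters.  Under the C locale, or in an 'awk' that counts
bytes, it should print 6 instead.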

   There is no standard, built-in way to distinguish characters from
bytes in an 'awk' program.  For an 'awk' implementation of 'wc', which
needs to make such a distinction, we will have to use an external
extension.