Footnotes
(1)
The 2008 POSIX standard can be found online at http://www.opengroup.org/onlinepubs/9699919799/.
(2)
These commands are available on POSIX-compliant systems, as well as on traditional Unix-based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.
(3)
Often, these systems use gawk for their awk implementation!
(4)
All such differences appear in the index under the entry “differences in awk and gawk.”
(5)
GNU stands for “GNU’s not Unix.”
(6)
The terminology “GNU/Linux” is explained in the Glossary.
(7)
If you use Bash as your shell, you should execute the command ‘set +H’ before running this program interactively, to disable the C shell-style command history, which treats ‘!’ as a special character. We recommend putting this command into your personal startup file.
(8)
Although we generally recommend the use of single quotes around the program text, double quotes are needed here in order to put the single quote into the message.
(9)
The ‘#!’ mechanism works on GNU/Linux systems, BSD-based systems and commercial Unix systems.
(10)
The line beginning with ‘#!’ lists the full file name of an interpreter to run and an optional initial command-line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument in the list is the full file name of the awk program. The rest of the argument list contains either options to awk, or data files, or both.
(11)
The ‘LC_ALL=C’ is needed to produce this traditional-style output from ls.
(12)
The ‘?’ and ‘:’ referred to here belong to the three-operand conditional expression described in Conditional Expressions. Splitting lines after ‘?’ and ‘:’ is a minor gawk extension; if ‘--posix’ is specified (see section Command-Line Options), then this extension is disabled.
(13)
Not recommended.
(14)
Your version of gawk may use a different directory; it will depend upon how gawk was built and installed. The actual directory is the value of ‘$(datadir)’ generated when gawk was configured. You probably don’t need to worry about this, though.
(15)
In other literature, you may see a bracket expression referred to as either a character set, a character class, or a character list.
(16)
Use two backslashes if you’re using a string constant with a regexp operator or function.
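As a minimal sketch of this note (the sample strings are illustrative and not taken from the manual): a string constant is scanned once as a string and again as a regexp, so matching a literal dot takes two backslashes where a regexp constant takes one.

```shell
# A regexp constant needs one backslash to match a literal dot.
const_hit=$(awk 'BEGIN { print ("a.b" ~ /a\.b/) }')

# The same pattern written as a string constant needs two backslashes.
string_hit=$(awk 'BEGIN { print ("a.b" ~ "a\\.b") }')

# With the dot escaped, "axb" does not match.
string_miss=$(awk 'BEGIN { print ("axb" ~ "a\\.b") }')

echo "$const_hit $string_hit $string_miss"   # prints: 1 1 0
```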
(17)
Experienced C and C++ programmers will note that this is possible with something like ‘IGNORECASE = 1 && /foObAr/ { … }’ and ‘IGNORECASE = 0 || /foobar/ { … }’. However, this is somewhat obscure and we don’t recommend it.
(18)
If you don’t understand this, don’t worry about it; it just means that gawk does the right thing.
(19)
At least that we know about.
(20)
In POSIX awk, newlines are not considered whitespace for separating fields.
(21)
The sed utility is a “stream editor.” Its behavior is also defined by the POSIX standard.
(22)
At least, we don’t know of one.
(23)
When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’.
(24)
This is not quite true. RT could be changed if RS is a regular expression.
(25)
The “tty” in ‘/dev/tty’ stands for “Teletype,” a serial terminal.
(26)
The technical terminology is rather morbid. The finished child is called a “zombie,” and cleaning up after it is referred to as “reaping.”
(27)
This is a full 16-bit value as returned by the wait() system call. See the system manual pages for information on how to decode this value.
(28)
The internal representation of all numbers, including integers, uses double precision floating-point numbers. On most modern systems, these are in IEEE 754 standard format.
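A small sketch of one consequence, assuming an awk whose numbers are IEEE 754 doubles (the variable names are illustrative): integers are exact only up to 2^53, so adding 1 to 2^53 rounds back to 2^53.

```shell
# Below the limit, neighboring integers are still distinct.
below_limit=$(awk 'BEGIN { print (2^53 - 1 == 2^53) }')

# Above the limit, 2^53 + 1 is not representable and rounds to 2^53.
above_limit=$(awk 'BEGIN { print (2^53 + 1 == 2^53) }')

echo "$below_limit $above_limit"   # with IEEE 754 doubles: 0 1
```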
(29)
Pathological cases can require up to 752 digits (!), but we doubt that you need to worry about this.
(30)
It happens that Brian Kernighan’s awk, gawk and mawk all “get it right,” but you should not rely on this.
(31)
gawk has followed these rules for many years, and it is gratifying that the POSIX standard is also now correct.
(32)
Technically, string comparison is supposed to behave the same way as if the strings are compared with the C strcoll() function.
(33)
This program has a bug; it prints lines starting with ‘END’. How would you fix it?
(34)
The original version of awk kept reading and ignoring input until the end of the file was seen.
(35)
In POSIX awk, newline does not count as whitespace.
(36)
Some early implementations of Unix awk initialized FILENAME to "-", even if there were data files to be processed. This behavior was incorrect and should not be relied upon in your programs.
(37)
Thanks to Michael Brennan for pointing this out.
(38)
The C version of rand() on many Unix systems is known to produce fairly poor sequences of random numbers. However, nothing requires that an awk implementation use the C rand() to implement the awk version of rand(). In fact, gawk uses the BSD random() function, which is considerably better than rand(), to produce random numbers.
(39)
mawk uses a different seed each time.
(40)
Computer-generated random numbers really are not truly random. They are technically known as “pseudorandom.” This means that while the numbers in a sequence appear to be random, you can in fact generate the same sequence of random numbers over and over again.
(41)
Unless you use the ‘--non-decimal-data’ option, which isn’t recommended. See section Allowing Nondecimal Input Data, for more information.
(42)
Note that this means that the record will first be regenerated using the value of OFS if any fields have been changed, and that the fields will be updated after the substitution, even if the operation is a “no-op” such as ‘sub(/^/, "")’.
(43)
This is different from C and C++, in which the first character is number zero.
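A brief sketch of the one-based numbering, using the standard substr() and index() functions (the sample string is illustrative):

```shell
# Position 1, not 0, names the first character of a string.
first=$(awk 'BEGIN { print substr("hello", 1, 1) }')

# index() likewise reports a match at the start of the string as 1.
pos=$(awk 'BEGIN { print index("hello", "h") }')

echo "$first $pos"   # prints: h 1
```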
(44)
A program is interactive if the standard output is connected to a terminal device. On modern systems, this means your keyboard and screen.
(45)
See section Glossary, especially the entries “Epoch” and “UTC.”
(46)
The GNU date utility can also do many of the things described here. Its use may be preferable for simple time-related operations in shell scripts.
(47)
Occasionally there are minutes in a year with a leap second, which is why the seconds can go up to 60.
(48)
Unfortunately, not every system’s strftime() necessarily supports all of the conversions listed here.
(49)
If you don’t understand any of this, don’t worry about it; these facilities are meant to make it easier to “internationalize” programs. Other internationalization features are described in Internationalization with gawk.
(50)
This is because ISO C leaves the behavior of the C version of strftime() undefined and gawk uses the system’s version of strftime() if it’s there. Typically, the conversion specifier either does not appear in the returned string or appears literally.
(51)
This example shows that 0’s come in on the left side. For gawk, this is always true, but in some languages, it’s possible to have the left side fill with 1’s. Caveat emptor.
(52)
This program won’t actually run, since foo() is undefined.
(53)
For some operating systems, the gawk port doesn’t support GNU gettext. Therefore, these features are not available if you are using one of those operating systems. Sorry.
(54)
Americans use a comma every three decimal places and a period for the decimal point, while many Europeans do exactly the opposite: 1,234.56 versus 1.234,56.
(55)
The xgettext utility that comes with GNU gettext can handle ‘.awk’ files.
(56)
This example is borrowed from the GNU gettext manual.
(57)
This is good fodder for an “Obfuscated awk” contest.
(58)
Perhaps it would be better if it were called “Hippy.” Ah, well.
(59)
When two elements compare as equal, the C qsort() function does not guarantee that they will maintain their original relative order after sorting. Using the string value to provide a unique ordering when the numeric values are equal ensures that gawk behaves consistently across different environments.
(60)
You may also use one of the predefined sorting names that sorts in decreasing order.
(61)
This is true because locale-based comparison occurs only when in POSIX compatibility mode, and since asort() and asorti() are gawk extensions, they are not available in that case.
(62)
This is very different from the same operator in the C shell.
(63)
The effects are not identical. Output of the transformed record will be in all lowercase, while IGNORECASE preserves the original contents of the input record.
(64)
While all the library routines could have been rewritten to use this convention, this was not done, in order to show how our own awk programming style has evolved and to provide some basis for this discussion.
(65)
gawk’s ‘--dump-variables’ command-line option is useful for verifying this.
(66)
This is changing; many systems use Unicode, a very large character set that includes ASCII as a subset. On systems with full Unicode support, a character can occupy up to 32 bits, making simple tests such as used here prohibitively expensive.
(67)
ASCII has been extended in many countries to use the values from 128 to 255 for country-specific characters. If your system uses these extensions, you can simplify _ord_init to loop from 0 to 255.
(68)
It would be nice if awk had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.
(69)
This function was written before gawk acquired the ability to split strings into single characters using "" as the separator. We have left it alone, since using substr() is more portable.
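A minimal sketch of the portable approach mentioned here, walking a string one character at a time with substr(); the sample string and variable names are illustrative and this is not the manual's own function.

```shell
# Collect the characters of a string one at a time, separated by spaces.
chars=$(awk 'BEGIN {
    s = "gawk"
    n = length(s)
    out = ""
    for (i = 1; i <= n; i++) {
        sep = (i > 1) ? " " : ""         # no separator before the first one
        out = out sep substr(s, i, 1)    # substr() positions start at 1
    }
    print out
}')

echo "$chars"   # prints: g a w k
```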
(70)
It is often the case that password information is stored in a network database.
(71)
It also introduces a subtle bug; if a match happens, we output the translated line, not the original.
(72)
This is the traditional usage. The POSIX usage is different, but not relevant for what the program aims to demonstrate.
(73)
wc can’t just use the value of FNR in endfile(). If you examine the code in Noting Data File Boundaries, you will see that FNR has already been reset by the time endfile() is called.
(74)
Since gawk understands multibyte locales, this code counts characters, not bytes.
(75)
On some older systems, tr may require that the lists be written as range expressions enclosed in square brackets (‘[a-z]’) and quoted, to prevent the shell from attempting a file name expansion. This is not a feature.
(76)
This program was written before gawk acquired the ability to split each character in a string into separate array elements.
(77)
“Real world” is defined as “a program actually used to get something done.”
(78)
This program was written before gawk had the gensub() function. Consider how you might use it to simplify the code.
(79)
Fully explaining the sh language is beyond the scope of this book. We provide some minimal explanations, but see a good shell programming book if you wish to understand things in more depth.
(80)
On some very old versions of awk, the test ‘getline junk < t’ can loop forever if the file exists but is empty. Caveat emptor.
(81)
See the standard and its rationale.
(82)
The IA64 architecture is also known as “Itanium.”
(83)
This version is edited slightly for presentation. See ‘extension/filefuncs.c’ in the gawk distribution for the complete version.
(84)
Compiled programs are typically written in lower-level languages such as C, C++, or Ada, and then translated, or compiled, into a form that the computer can execute directly.
(85)
Pathological cases can require up to 752 digits (!), but we doubt that you need to worry about this.
(86)
You asked for it, you got it.