A.7 Regexp Ranges and Locales: A Long Sad Story
This section describes the confusing history of ranges within
regular expressions and their interactions with locales, and how this
affected different versions of gawk.
The original Unix tools that worked with regular expressions defined
character ranges (such as ‘[a-z]’) to match any character between
the first character in the range and the last character in the range,
inclusive. Ordering was based on the numeric value of each character
in the machine’s native character set. Thus, on ASCII-based systems,
‘[a-z]’ matched all the lowercase letters, and only the lowercase
letters, since the numeric values for the letters from ‘a’ through
‘z’ were contiguous. (On an EBCDIC system, the range ‘[a-z]’
includes additional, non-alphabetic characters as well.)
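As a rough illustration of this traditional interpretation, the following sketch forces the "C" locale and collects every printable ASCII character that matches ‘[a-z]’. It assumes a POSIX-compatible awk is installed as awk; the shell variable lower exists only for this illustration.

```shell
# Force the "C" locale so that ranges follow numeric character values.
# Walk the printable ASCII codes (32-126) and keep those matching [a-z].
lower=$(LC_ALL=C awk 'BEGIN {
    for (i = 32; i < 127; i++) {
        c = sprintf("%c", i)      # character whose numeric value is i
        if (c ~ /^[a-z]$/)
            s = s c
    }
    print s
}')
echo "$lower"    # abcdefghijklmnopqrstuvwxyz
```

Only the twenty-six lowercase letters are collected, because their ASCII codes (97 through 122) form one unbroken run.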
Almost all introductory Unix literature explained range expressions as working in this fashion, and in particular, would teach that the “correct” way to match lowercase letters was with ‘[a-z]’, and that ‘[A-Z]’ was the “correct” way to match uppercase letters. And indeed, this was true.
The 1993 POSIX standard introduced the idea of locales (see section Where You Are Makes A Difference). Since many locales include other letters besides the plain twenty-six letters of the American English alphabet, the POSIX standard added character classes (see section Using Bracket Expressions) as a way to match different kinds of characters besides the traditional ones in the ASCII character set.
However, the standard changed the interpretation of range expressions.
In the "C" and "POSIX" locales, a range expression like
‘[a-dx-z]’ is still equivalent to ‘[abcdxyz]’, as in ASCII.
But outside those locales, the ordering was defined to be based on
collation order.
In many locales, ‘A’ and ‘a’ are both less than ‘B’. In other words, these locales sort characters in dictionary order, and ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; instead it might be equivalent to ‘[aBbCcdXxYyz]’, for example.
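The difference between numeric order and collation order can be glimpsed with the sort utility, which also honors the locale. This is only a sketch: the en_US.UTF-8 locale name is an assumption (it may not be installed on a given system), and the order it produces depends on that system's collation tables, so no output is claimed for it. The variable c_order is illustrative.

```shell
# In the "C" locale, sort compares raw byte values, so every
# uppercase ASCII letter sorts before every lowercase one.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "$c_order"    # ABab

# In a dictionary-style locale the two cases may be interleaved instead.
# en_US.UTF-8 is an assumption; if it is missing, sort typically falls
# back to the "C" behavior shown above.
printf '%s\n' b A a B | LC_ALL=en_US.UTF-8 sort | tr -d '\n'; echo
```

If the second command prints the letters with upper- and lowercase interleaved, that is the same ordering a locale-aware regexp matcher would use for a range such as ‘[a-d]’.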
This point needs to be emphasized: Much literature teaches that you should use ‘[a-z]’ to match a lowercase character. But on systems with non-ASCII locales, this also matched all of the uppercase characters except ‘Z’! This was a continuous cause of confusion, even well into the twenty-first century.
To demonstrate these issues, the following example uses the sub()
function, which does text replacement (see section String-Manipulation Functions). Here,
the intent is to remove trailing uppercase characters:
$ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }'
-| something1234a
This output is unexpected, since the ‘bc’ at the end of ‘something1234abc’ should not normally match ‘[A-Z]*’. This result is due to the locale setting (and thus you may not see it on your system).
Similar considerations apply to other ranges. For example, ‘["-/]’ is perfectly valid in ASCII, but is not valid in many Unicode locales, such as ‘en_US.UTF-8’.
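Two ways of expressing the intent of the sub() example independently of the user's locale are sketched below: using a POSIX character class, or forcing the "C" locale for the one command. The sketch assumes an awk that supports POSIX character classes (as gawk does); the variables r1 and r2 are illustrative only.

```shell
# Option 1: a POSIX character class names "uppercase letter" directly,
# so the result does not depend on how the locale orders ranges.
r1=$(echo something1234abc | awk '{ sub(/[[:upper:]]*$/, ""); print }')
echo "$r1"    # something1234abc (no trailing uppercase, nothing removed)

# Option 2: force the "C" locale so that [A-Z] has its traditional meaning.
r2=$(echo something1234ABC | LC_ALL=C awk '{ sub(/[A-Z]*$/, ""); print }')
echo "$r2"    # something1234
```

The character class is usually the more portable choice, since it also covers uppercase letters outside the plain ASCII set in locales that have them.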
Early versions of gawk used regexp matching code that was not
locale aware, so ranges had their traditional interpretation.
When gawk switched to using locale-aware regexp matchers,
the problems began, especially as both GNU/Linux and commercial Unix
vendors started implementing non-ASCII locales and making them
the default. Perhaps the most frequently asked question became something
like “why does ‘[A-Z]’ match lowercase letters?!?”
This situation existed for close to 10 years, if not more, and
the gawk maintainer grew weary of trying to explain that
gawk was being nicely standards-compliant, and that the issue
was in the user’s locale. During the development of version 4.0,
he modified gawk to always treat ranges in the original,
pre-POSIX fashion, unless ‘--posix’ was used (see section Command-Line Options).
Fortunately, shortly before the final release of gawk 4.0,
the maintainer learned that the 2008 standard had changed the
definition of ranges, such that outside the "C" and "POSIX"
locales, the meaning of range expressions was
undefined.(81)
By using this lovely technical term, the standard gives license
to implementors to implement ranges in whatever way they choose.
The gawk maintainer chose to apply the pre-POSIX meaning in all
cases: with the default regexp matching, with ‘--traditional’, and with
‘--posix’. In all cases, gawk remains POSIX compliant.