File: grep.info,  Node: Character Encoding,  Next: Matching Non-ASCII,  Prev: Problematic Expressions,  Up: Regular Expressions

3.8 Character Encoding
======================

The ‘LC_CTYPE’ locale specifies the encoding of characters in patterns
and data, that is, whether text is encoded in UTF-8, ASCII, or some
other encoding.  *Note Environment Variables::.
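
   As a minimal sketch of how the encoding choice affects matching, the
following assumes a POSIX shell and GNU ‘grep’.  It sets ‘LC_ALL’ for a
single command, which takes precedence over ‘LC_CTYPE’ and ‘LANG’.  The
variable name ‘c_locale_chars’ is illustrative only.

```shell
# Show the variables that can influence the encoding (values vary by system).
echo "LC_ALL=${LC_ALL:-} LC_CTYPE=${LC_CTYPE:-} LANG=${LANG:-}"

# Force the byte-oriented C locale for this one grep invocation only.
# The bytes \303\251 are the UTF-8 encoding of U+00E9, but in the C
# locale each byte is its own character, so '.' matches twice and
# -o prints each match on a separate line.
# -a keeps grep in text mode however a given build classifies high bytes.
c_locale_chars=$(printf '\303\251\n' | LC_ALL=C grep -a -o '.' | wc -l | tr -d ' ')
echo "$c_locale_chars"
```

   In a UTF-8 locale the same input would normally count as one
character, but which UTF-8 locales are installed depends on the system.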

   In the ‘C’ or ‘POSIX’ locale, every character is encoded as a single
byte and every byte is a valid character.  In more-complex encodings
such as UTF-8, a sequence of multiple bytes may be needed to represent a
character, and some bytes may be encoding errors that do not contribute
to the representation of any character.  POSIX does not specify the
behavior of ‘grep’ when patterns or input data contain encoding errors
or null characters, so portable scripts should avoid such usage.  As an
extension to POSIX, GNU ‘grep’ treats null characters like any other
character.  However, unless the ‘-a’ (‘--binary-files=text’) option is
used, the presence of null characters in input or of encoding errors in
output causes GNU ‘grep’ to treat the file as binary and suppress
details about matches.  *Note File and Directory Selection::.
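
   The effect of ‘-a’ on input containing a null character can be
sketched as follows, assuming a POSIX shell and GNU ‘grep’.  The variable
names are illustrative, and the wording of the binary-file message, as
well as the stream it is written to, may differ between versions.

```shell
# Input with an embedded NUL byte between "foo" and "bar".
# Without -a, grep treats the input as binary and does not print the
# matching line itself; any diagnostic on stderr is discarded here.
without_a=$(printf 'foo\000bar\n' | LC_ALL=C grep 'foo' 2>/dev/null | tr '\000' '@')

# With -a the matching line is printed; tr replaces the NUL with '@'
# so that it is visible and survives command substitution.
with_a=$(printf 'foo\000bar\n' | LC_ALL=C grep -a 'foo' | tr '\000' '@')

echo "without -a: $without_a"
echo "with -a:    $with_a"
```

   Portable scripts should still avoid such input, since POSIX leaves
the behavior unspecified.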

   Regardless of locale, the 103 characters in the POSIX Portable
Character Set (a subset of ASCII) are always encoded as a single byte,
and the 128 ASCII characters have their usual single-byte encodings on
all but a few unusual platforms, such as those based on EBCDIC.