File: grep.info,  Node: Character Encoding,  Next: Matching Non-ASCII,  Prev: Problematic Expressions,  Up: Regular Expressions

3.8 Character Encoding
======================

The ‘LC_CTYPE’ locale specifies the encoding of characters in patterns
and data, that is, whether text is encoded in UTF-8, ASCII, or some
other encoding.  *Note Environment Variables::.

In the ‘C’ or ‘POSIX’ locale, every character is encoded as a single
byte and every byte is a valid character.  In more-complex encodings
such as UTF-8, a sequence of multiple bytes may be needed to represent a
character, and some bytes may be encoding errors that do not contribute
to the representation of any character.  POSIX does not specify the
behavior of ‘grep’ when patterns or input data contain encoding errors
or null characters, so portable scripts should avoid such usage.  As an
extension to POSIX, GNU ‘grep’ treats null characters like any other
character.  However, unless the ‘-a’ (‘--binary-files=text’) option is
used, the presence of null characters in input or of encoding errors in
output causes GNU ‘grep’ to treat the file as binary and suppress
details about matches.  *Note File and Directory Selection::.

Regardless of locale, the 103 characters in the POSIX Portable
Character Set (a subset of ASCII) are always encoded as a single byte,
and the 128 ASCII characters have their usual single-byte encodings on
all but oddball platforms.
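
The difference between bytes and characters described above can be
illustrated with a short Python sketch.  It is not part of ‘grep’ and
does not reflect how GNU ‘grep’ is implemented.  The helper name
‘describe’ is hypothetical, and decoding as Latin-1 is used only as a
stand-in for the "every byte is a valid character" property of the ‘C’
locale.

```python
# Illustrative sketch only; this is NOT how GNU grep is implemented.
# describe() is a hypothetical helper introduced for this example.

def describe(data: bytes) -> dict:
    """Summarize a byte string under a single-byte and a UTF-8 reading."""
    try:
        utf8_chars = len(data.decode("utf-8"))
        utf8_encoding_error = False
    except UnicodeDecodeError:
        # Some byte does not contribute to any valid UTF-8 character.
        utf8_chars = None
        utf8_encoding_error = True
    return {
        "bytes": len(data),
        # Assumption: Latin-1 maps each byte to exactly one character,
        # which mimics the single-byte behavior of the C locale.
        "single_byte_chars": len(data.decode("latin-1")),
        "utf8_chars": utf8_chars,
        "utf8_encoding_error": utf8_encoding_error,
        # A null byte in input is one trigger for binary-file handling.
        "has_nul": b"\x00" in data,
    }


if __name__ == "__main__":
    samples = {
        "ASCII only": b"grep",
        "UTF-8 e-acute": "\u00e9".encode("utf-8"),  # two bytes, one character
        "stray Latin-1 byte": b"caf\xe9",           # encoding error in UTF-8
        "embedded NUL": b"a\x00b",
    }
    for label, data in samples.items():
        print(label, describe(data))
```

In this sketch the ASCII sample has the same length under both
readings, the UTF-8 sample is two bytes but one character, and the
stray byte is a character under the single-byte reading but an encoding
error under UTF-8.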