6.17.8 Character Encoding of Source Files
Scheme source code files are usually encoded in ASCII or UTF-8, but the
built-in reader can interpret other character encodings as well. When
Guile loads Scheme source code, it uses the file-encoding
procedure (described below) to try to guess the encoding of the file.
In the absence of any hints, UTF-8 is assumed. One way to provide a
hint about the encoding of a source file is to place a coding
declaration in the top 500 characters of the file.
A coding declaration has the form coding: XXXXXX, where XXXXXX is the
name of a character encoding in which the source code file has been
encoded. The coding declaration must appear in a Scheme comment. It
can either be a semicolon-initiated comment, or the first block #!
comment in the file.
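As an illustration, a source file stored in ISO-8859-1 might begin as in the sketch below. The -*- ... -*- wrapping is the Emacs convention and is included here only as an assumption about how such files are commonly written; the part described above is the coding: iso-8859-1 text inside a semicolon comment near the top of the file. The definition that follows is a placeholder.

```scheme
;;; -*- coding: iso-8859-1 -*-
;;; The declaration above sits in a semicolon-initiated comment,
;;; well within the first 500 characters of the file.

;; Placeholder content; a real file would hold ISO-8859-1 text here.
(define greeting "Hello")
(display greeting)
(newline)
```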
The name of the character encoding in the coding declaration is
typically lowercase and contains only letters, numbers, and hyphens,
as recognized by set-port-encoding! (see section set-port-encoding!).
Common examples of character encoding names are utf-8 and iso-8859-1,
as defined by IANA. Thus, the coding declaration is mostly compatible
with Emacs.
However, there are some differences in encoding names recognized by
Emacs and encoding names defined by IANA, the latter being essentially a
subset of the former. For instance, latin-1 is a valid encoding
name for Emacs, but it is not according to the IANA standard, which
Guile follows; instead, you should use iso-8859-1, which is both
understood by Emacs and defined by IANA. (IANA writes it in uppercase,
Emacs wants it in lowercase, and Guile is case-insensitive.)
For source code, only a subset of all possible character encodings can
be interpreted by the built-in source code reader. Only those
character encodings in which ASCII text appears unmodified can be
used. This includes UTF-8 and ISO-8859-1 through ISO-8859-15. The
multi-byte character encodings UTF-16 and UTF-32 may not be used
because they are not compatible with ASCII.
There might be a scenario in which one would want to read non-ASCII
code from a port, such as with the function read, instead of with
load. If the port’s character encoding is the same as the encoding of
the code to be read by the port, no other special handling is
necessary. The port will automatically do the character encoding
conversion. Either setlocale or set-port-encoding! can be used to set
port encodings (see section Ports).
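A minimal sketch of the known-encoding case follows. It writes one string datum to a file as ISO-8859-1 and reads it back through a port set to the same encoding. The file name and the variable names are hypothetical, chosen only for this illustration.

```scheme
;; Hypothetical file name, used only for this illustration.
(define example-file "/tmp/guile-latin1-example.scm")

;; A string with one non-ASCII character (e with acute accent, U+00E9).
(define text (string #\c #\a #\f (integer->char #xE9)))

;; Write the string as a datum, with the port encoding set to ISO-8859-1.
(let ((out (open-output-file example-file)))
  (set-port-encoding! out "ISO-8859-1")
  (write text out)
  (close-port out))

;; Read it back. Because the port's encoding matches the file's,
;; the port converts the bytes to characters on its own.
(define datum
  (let ((in (open-input-file example-file)))
    (set-port-encoding! in "ISO-8859-1")
    (let ((d (read in)))
      (close-port in)
      d)))
```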
Code of unknown character encoding can be read from a port in three
steps. First, the character encoding of the port should be set to
ISO-8859-1 using set-port-encoding!. Then, the procedure
file-encoding, described below, is used to scan for a coding
declaration when reading from the port. As a side effect, it rewinds
the port after its scan is complete. After that, the port’s character
encoding should be set to the encoding returned by file-encoding, if
any, again by using set-port-encoding!. Then the code can be read as
normal.
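The three steps can be sketched as a small helper. The procedure name read-with-declared-encoding is hypothetical, and falling back to UTF-8 when no declaration is found is an assumption of this sketch, chosen to mirror the default described earlier for loading source files.

```scheme
;; Read the first datum from FILENAME, whose encoding is not known
;; in advance. Hypothetical helper, for illustration only.
(define (read-with-declared-encoding filename)
  (let ((port (open-input-file filename)))
    ;; Step 1: start in ISO-8859-1, in which every byte value
    ;; corresponds to some character.
    (set-port-encoding! port "ISO-8859-1")
    ;; Step 2: scan for a coding declaration; file-encoding rewinds
    ;; the port when it is done and returns a string or #f.
    (let ((declared (file-encoding port)))
      ;; Step 3: switch to the declared encoding, if any.
      ;; Assumption of this sketch: use UTF-8 when nothing is declared.
      (set-port-encoding! port (or declared "UTF-8"))
      (let ((datum (read port)))
        (close-port port)
        datum))))
```

Calling (read-with-declared-encoding "some-file.scm") would then return the first expression in the file, decoded according to its coding declaration when one is present.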
Alternatively, one can use the #:guess-encoding keyword argument of
open-file and related procedures. See section File Ports.
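A short sketch of this alternative, assuming a Guile version whose open-file accepts the #:guess-encoding keyword; the helper name read-first-datum/guess is hypothetical.

```scheme
;; Open FILENAME for reading and let open-file look for a coding
;; declaration itself, then read the first datum.
(define (read-first-datum/guess filename)
  (let* ((port (open-file filename "r" #:guess-encoding #t))
         (datum (read port)))
    (close-port port)
    datum))
```

Compared with the three-step approach, this leaves the scan and the encoding switch to open-file, so the caller never handles the encoding name directly.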
- Scheme Procedure: file-encoding port
- C Function: scm_file_encoding (port)
Attempt to scan the first few hundred bytes from the port for hints about its character encoding. Return a string containing the encoding name or #f if the encoding cannot be determined. The port is rewound.
Currently, the only supported method is to look for an Emacs-like character coding declaration (see how Emacs recognizes file encoding in The GNU Emacs Reference Manual). The coding declaration is of the form coding: XXXXX and must appear in a Scheme comment. Additional heuristics may be added in the future.
This document was generated on April 20, 2013 using texi2html 5.0.