info bigloo

6.1.10 Unicode (UCS-2) Strings

UCS-2 strings cannot be read by the standard reader, but UTF-8 strings can. The special syntax for UTF-8 strings is described by the regular expression: #u"([^"]|\")*".

The library functions for Unicode string processing are:

bigloo procedure: ucs2-string? obj

bigloo procedure: make-ucs2-string k
bigloo procedure: make-ucs2-string k char
bigloo procedure: ucs2-string char …

bigloo procedure: ucs2-string-length s-ucs2
bigloo procedure: ucs2-string-ref s-ucs2 k
bigloo procedure: ucs2-string-set! s-ucs2 k char

bigloo procedure: ucs2-string=? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string-ci=? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string<? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string>? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string<=? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string>=? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string-ci<? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string-ci>? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string-ci<=? s-ucs2a s-ucs2b
bigloo procedure: ucs2-string-ci>=? s-ucs2a s-ucs2b

bigloo procedure: subucs2-string s-ucs2 start end
bigloo procedure: ucs2-string-append s-ucs2 …
bigloo procedure: ucs2-string->list s-ucs2
bigloo procedure: list->ucs2-string chars
bigloo procedure: ucs2-string-copy s-ucs2
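
The procedures above mirror the standard string procedures. The following is a minimal sketch of how they might be combined, not a definitive usage pattern. It assumes that string literals accept the \xHH escapes used in the CP1252 table of this section and that print is available as the output procedure; the module name and variable names are hypothetical. The UCS-2 string is obtained with utf8-string->ucs2-string, documented in this section.

```scheme
(module ucs2-basics)

;; "h\xc3\xa9llo" is the UTF-8 encoding of h, e-acute, l, l, o
;; (6 bytes, 5 characters).
(define s (utf8-string->ucs2-string "h\xc3\xa9llo"))

(print (ucs2-string? s))           ;; s is a UCS-2 string
(print (ucs2-string? "hello"))     ;; a plain string is not
(print (ucs2-string-length s))     ;; counted in UCS-2 characters

;; Split the string in two and glue the pieces back together.
(define head (subucs2-string s 0 2))
(define tail (subucs2-string s 2 (ucs2-string-length s)))
(define t (ucs2-string-append head tail))
(print (ucs2-string=? s t))

;; Round trip through a list of UCS-2 characters.
(define u (list->ucs2-string (ucs2-string->list s)))
(print (ucs2-string=? s u))

;; Element access and mutation operate on UCS-2 characters.
;; The copy is modified, s keeps its contents.
(define c (ucs2-string-copy s))
(ucs2-string-set! c 0 (ucs2-string-ref s 4))
(print (ucs2-string=? s c))
```

Mutating the copy rather than s itself keeps the original available for the comparisons.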

bigloo procedure: ucs2-string-fill! s-ucs2 char: Stores char in every element of the given s-ucs2 and returns an unspecified value.

bigloo procedure: ucs2-string-downcase s-ucs2: Builds a newly allocated ucs2-string with lower case letters.

bigloo procedure: ucs2-string-upcase s-ucs2: Builds a newly allocated ucs2-string with upper case letters.

bigloo procedure: ucs2-string-downcase! s-ucs2: Downcases the s-ucs2 argument in place.

bigloo procedure: ucs2-string-upcase! s-ucs2: Upcases the s-ucs2 argument in place.
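
The difference between the allocating and the in-place case conversions can be sketched as follows. This is an illustrative example only: the module and variable names are hypothetical, print is assumed to be available, and the results are converted back with ucs2-string->utf8-string (documented in this section) so that they can be displayed as ordinary strings.

```scheme
(module ucs2-case)

(define s (utf8-string->ucs2-string "Hello"))

;; Allocating variant: a fresh string is returned, s is left untouched.
(define up (ucs2-string-upcase s))

;; In-place variant: work on a copy so that s keeps its contents.
(define low (ucs2-string-copy s))
(ucs2-string-downcase! low)

;; Overwrite every element of another copy with the first character of s.
(define filled (ucs2-string-copy s))
(ucs2-string-fill! filled (ucs2-string-ref s 0))

(print (ucs2-string->utf8-string up))
(print (ucs2-string->utf8-string low))
(print (ucs2-string->utf8-string filled))
(print (ucs2-string->utf8-string s))
```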

bigloo procedure: ucs2-string->utf8-string s-ucs2
bigloo procedure: utf8-string->ucs2-string string: Convert a UCS-2 string to a UTF-8 encoded 8-bit string, and vice versa.
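
As a rough illustration of the two conversions, the sketch below round-trips a short string and compares its size in bytes with its size in UCS-2 characters. The module and variable names are hypothetical, and the \xHH escapes are assumed to behave as in the CP1252 table of this section.

```scheme
(module ucs2-utf8-roundtrip)

;; c, a, f, then e-acute encoded on two bytes in UTF-8.
(define bytes "caf\xc3\xa9")
(define wide (utf8-string->ucs2-string bytes))

(print (string-length bytes))         ;; 5: counted in bytes
(print (ucs2-string-length wide))     ;; 4: counted in characters

;; Converting back is expected to yield the original byte sequence.
(print (string=? bytes (ucs2-string->utf8-string wide)))
```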

bigloo procedure: utf8-string? string [strict #f]

Returns #t if the argument string is a well-formed UTF-8 string, #f otherwise.

If the optional argument strict is #t, half UTF-16 surrogates are rejected. It defaults to #f.
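
A small sketch of the predicate on well-formed and malformed byte sequences follows. The sample strings, module name, and variable names are hypothetical; the example assumes the optional strict argument is passed positionally, as the signature above suggests.

```scheme
(module utf8-check)

(define ok "h\xc3\xa9")        ;; complete two-byte sequence
(define truncated "h\xc3")     ;; lead byte without its continuation byte
(define stray "\x80h")         ;; continuation byte without a lead byte

(print (utf8-string? ok))
(print (utf8-string? ok #t))   ;; same check with strict enabled
(print (utf8-string? truncated))
(print (utf8-string? stray))
```

Checking input with utf8-string? before calling the other UTF-8 procedures avoids the errors they raise on malformed strings.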

bigloo procedure: utf8-string-length string: Returns the number of characters of a UTF-8 string. It raises an error if the string is not a well-formed UTF-8 string (i.e., it does not satisfy the utf8-string? predicate).

bigloo procedure: utf8-string-ref string i: Returns the character (represented as a UTF-8 string) at position i in string.
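
The sketch below contrasts byte-oriented and character-oriented access on the same string. It is only an illustration: the module and variable names are hypothetical, and the \xHH escapes are assumed to behave as in the CP1252 table of this section.

```scheme
(module utf8-index)

;; h, e-acute (two bytes), l, l, o.
(define s "h\xc3\xa9llo")

(print (string-length s))        ;; 6: bytes
(print (utf8-string-length s))   ;; 5: characters

;; The character at index 1 is returned as a string of its own,
;; here the two bytes that encode e-acute.
(define ch (utf8-string-ref s 1))
(print (string-length ch))       ;; 2
```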

library procedure: utf8-substring string start [end]

string must be a string, and start and end must be exact integers satisfying:

  0 <= START <= END <= (string-length STRING)

The optional argument end defaults to (utf8-string-length STRING).

utf8-substring returns a newly allocated string formed from the characters of STRING beginning with index START (inclusive) and ending with index END (exclusive).

If the argument string is not a well-formed UTF-8 string, an error is raised. Otherwise, the result is also a well-formed UTF-8 string.
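
A short sketch of utf8-substring, assuming that start and end are counted in characters, as the default value of end suggests. The module and variable names are hypothetical.

```scheme
(module utf8-sub)

;; h, e-acute (two bytes), l, l, o.
(define s "h\xc3\xa9llo")

;; Characters 1 (inclusive) to 3 (exclusive): e-acute and l.
(define mid (utf8-substring s 1 3))

;; With end omitted, the substring extends to the last character.
(define tail (utf8-substring s 1))

(print (utf8-string-length mid))
(print (utf8-string-length tail))
(print (utf8-string? mid))
```

Because the cut points fall between characters, a multi-byte sequence is never split.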

bigloo procedure: iso-latin->utf8 string
bigloo procedure: iso-latin->utf8! string
bigloo procedure: utf8->iso-latin string
bigloo procedure: utf8->iso-latin! string
bigloo procedure: utf8->iso-latin-15 string
bigloo procedure: utf8->iso-latin-15! string: Convert strings between the ISO Latin and UTF-8 encodings. The functions iso-latin->utf8!, utf8->iso-latin! and utf8->iso-latin-15! may return, as result, the string they receive as argument.
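
The following sketch converts a short ISO Latin-1 string to UTF-8 and back. It is an illustration under assumptions: the module and variable names are hypothetical, and the \xHH escapes are assumed to behave as in the CP1252 table of this section.

```scheme
(module iso-latin-example)

;; "caf\xe9" ends with byte 0xe9, e-acute in ISO Latin-1.
(define latin "caf\xe9")
(define utf8 (iso-latin->utf8 latin))

(print (string-length latin))   ;; 4 bytes
(print (string-length utf8))    ;; 5 bytes: e-acute now takes two
(print (string=? latin (utf8->iso-latin utf8)))

;; The destructive variants may or may not reuse their argument,
;; so keep the returned string rather than the one passed in.
(define buf (string-copy latin))
(define converted (iso-latin->utf8! buf))
(print (string=? converted utf8))
```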

bigloo procedure: cp1252->utf8 string
bigloo procedure: cp1252->utf8! string
bigloo procedure: utf8->cp1252 string
bigloo procedure: utf8->cp1252! string: Convert strings between the CP1252 and UTF-8 encodings. The functions cp1252->utf8! and utf8->cp1252! may return, as result, the string they receive as argument.
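
As an illustration, byte 0x80 is the euro sign in CP1252, and the CP1252 table given in this section maps it to the three-byte UTF-8 sequence \xe2\x82\xac. The sketch below relies on that mapping; the module and variable names are hypothetical.

```scheme
(module cp1252-example)

;; "10 " followed by byte 0x80, the euro sign in CP1252.
(define win "10 \x80")
(define utf8 (cp1252->utf8 win))

(print (string-length win))     ;; 4 bytes
(print (string-length utf8))    ;; the euro sign now takes three bytes
(print (string=? utf8 "10 \xe2\x82\xac"))
(print (string=? win (utf8->cp1252 utf8)))
```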

bigloo procedure: 8bits->utf8 string table

bigloo procedure: 8bits->utf8! string table

bigloo procedure: utf8->8bits string inv-table

bigloo procedure: utf8->8bits! string inv-table

These are the general conversion routines used internally by iso-latin->utf8 and cp1252->utf8. They convert any 8-bit string into its equivalent UTF-8 representation and vice versa.

The argument table should be either #f, which means that the basic (i.e., iso-latin-1) 8bits -> UTF8 conversion is used, or a vector of at most 127 entries containing strings of characters. This table contains the encodings of the 8-bit characters whose codes range from 128 to 255.

The table is not required to be complete. That is, it is not required to give the whole character encoding set. Only the characters that need a non-iso-latin canonical representation must be given. For instance, the CP1252 table can be defined as:

(define cp1252
   '#("\xe2\x82\xac" ;; 0x80
      ""             ;; 0x81
      "\xe2\x80\x9a" ;; 0x82
      "\xc6\x92"     ;; 0x83
      "\xe2\x80\x9e" ;; 0x84
      "\xe2\x80\xa6" ;; 0x85
      "\xe2\x80\xa0" ;; 0x86
      "\xe2\x80\xa1" ;; 0x87
      "\xcb\x86"     ;; 0x88
      "\xe2\x80\xb0" ;; 0x89
      "\xc5\xa0"     ;; 0x8a
      "\xe2\x80\xb9" ;; 0x8b
      "\xc5\x92"     ;; 0x8c
      ""             ;; 0x8d
      "\xc5\xbd"     ;; 0x8e
      ""             ;; 0x8f
      ""             ;; 0x90
      "\xe2\x80\x98" ;; 0x91
      "\xe2\x80\x99" ;; 0x92
      "\xe2\x80\x9c" ;; 0x93
      "\xe2\x80\x9d" ;; 0x94
      "\xe2\x80\xa2" ;; 0x95
      "\xe2\x80\x93" ;; 0x96
      "\xe2\x80\x94" ;; 0x97
      "\xcb\x9c"     ;; 0x98
      "\xe2\x84\xa2" ;; 0x99
      "\xc5\xa1"     ;; 0x9a
      "\xe2\x80\xba" ;; 0x9b
      "\xc5\x93"     ;; 0x9c
      ""             ;; 0x9d
      "\xc5\xbe"     ;; 0x9e
      "\xc5\xb8"))   ;; 0x9f

The argument inv-table is an inverse table that can be built from a table using the function inverse-utf8-table.

procedure: inverse-utf8-table vector: Inverts a UTF-8 table into an object suitable for utf8->8bits and utf8->8bits!.
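
The table-driven routines can be sketched with a deliberately small table. The one-entry table below is hypothetical (it reuses the 0x80 entry of the CP1252 table above), as are the module and variable names. The sketch assumes that bytes not covered by the table fall back to the iso-latin conversion, as the description of incomplete tables suggests.

```scheme
(module custom-table)

;; Hypothetical partial table: only byte 0x80 gets a custom encoding
;; (the euro sign).
(define euro-table '#("\xe2\x82\xac"))   ;; 0x80

;; The inverse table is built once and reused for decoding.
(define euro-inv (inverse-utf8-table euro-table))

(define utf8 (8bits->utf8 "10 \x80" euro-table))
(define back (utf8->8bits utf8 euro-inv))

(print (string=? utf8 "10 \xe2\x82\xac"))
(print (string=? back "10 \x80"))

;; With #f as table, the basic iso-latin-1 conversion is used.
(print (string=? (8bits->utf8 "caf\xe9" #f) (iso-latin->utf8 "caf\xe9")))
```

Building the inverse table ahead of time, rather than on every call, keeps repeated decoding cheap.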

This document was generated on March 31, 2014 using texi2html 5.0.