11.2.6 Additional functions for plural forms
The functions of the gettext family described so far (and all the
catgets functions as well) have one problem in the real world which
has been neglected completely in all existing approaches: the handling
of plural forms.
Looking through Unix source code before the time anybody thought about internationalization (and, sadly, even afterwards) one can often find code similar to the following:
printf ("%d file%s deleted", n, n == 1 ? "" : "s");
After the first complaints from people internationalizing the code,
programmers either avoided formulations like this completely or used
strings like "file(s)". Both look unnatural and should be avoided. The
first attempts to solve the problem correctly looked like this:
if (n == 1)
  printf ("%d file deleted", n);
else
  printf ("%d files deleted", n);
But this does not solve the problem. It helps languages where the
plural form of a noun is not simply constructed by adding an ‘s’, but
that is all. Once again people fell into the trap of believing the
rules their language is using are universal. But the handling of plural
forms differs widely between the language families. For example,
Rafal Maszkowski <rzm@mat.uni.torun.pl> reports:
In Polish we use e.g. plik (file) this way:
1 plik
2,3,4 pliki
5-21 pliko'w
22-24 pliki
25-31 pliko'w
and so on (o’ means 8859-2 oacute which should be rather okreska, similar to aogonek).
There are two things which can differ between languages (and even inside language families):
- The way plural forms are built differs. This is a problem with languages which have many irregularities. German, for instance, is a drastic case. Though English and German are part of the same language family (Germanic), the almost regular forming of plural noun forms (appending an ‘s’) is hardly found in German.
- The number of plural forms differs. This is somewhat surprising for
those who only have experience with Romanic and Germanic languages,
since here the number is the same (there are two).
But other language families have only one form or many forms. More information on this in an extra section.
The consequence of this is that application writers should not try to
solve the problem in their code. This would be localization since it is
only usable for certain, hardcoded language environments. Instead the
extended gettext interface should be used.
Instead of the single key string, these extra functions take two
strings and a numerical argument. The idea behind this is that, using
the numerical argument and the first string as a key, the implementation
can select the right plural form using rules specified by the
translator. The two string arguments are then used to provide a return
value in case no message catalog is found (similar to the normal gettext
behavior). In this case the rules for Germanic languages are used and it
is assumed that the first string argument is the singular form, the
second the plural form.
This has the consequence that programs without language catalogs can
display the correct strings only if the program itself is written using
a Germanic language. This is a limitation, but since the GNU C library
(as well as the GNU gettext package) is written as part of the GNU
project and the coding standards for the GNU project require programs
to be written in English, this solution nevertheless fulfills its
purpose.
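The fallback rule described above can be sketched in a few lines of C. This is only an illustration of the behavior when no message catalog is found, not the library's implementation; the name fallback_ngettext is hypothetical.

```c
/* Hypothetical stand-in for the no-catalog case: apply the Germanic
   rule, where exactly one item takes the singular form and every
   other count takes the plural form.  */
const char *
fallback_ngettext (const char *msgid1, const char *msgid2,
                   unsigned long int n)
{
  return n == 1 ? msgid1 : msgid2;
}
```

Under this rule a count of zero selects the plural string, which suits English ("0 files removed") but not every language.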
- Function: char * ngettext (const char *msgid1, const char *msgid2, unsigned long int n)
The ngettext function is similar to the gettext function as it finds the message catalogs in the same way. But it takes two extra arguments. The msgid1 parameter must contain the singular form of the string to be converted. It is also used as the key for the search in the catalog. The msgid2 parameter is the plural form. The parameter n is used to determine the plural form. If no message catalog is found msgid1 is returned if n == 1, otherwise msgid2.
An example for the use of this function is:
printf (ngettext ("%d file removed", "%d files removed", n), n);
Please note that the numeric value n has to be passed to the printf function as well. It is not sufficient to pass it only to ngettext.
In the English singular case, the number (always 1) can be replaced with "one":
printf (ngettext ("One file removed", "%d files removed", n), n);
This works because the ‘printf’ function discards excess arguments that are not consumed by the format string.
If this function is meant to yield a format string that takes two or more arguments, you can not use it like this:
printf (ngettext ("%d file removed from directory %s", "%d files removed from directory %s", n), n, dir);
because in many languages the translators want to replace the ‘%d’ with an explicit word in the singular case, just like “one” in English, and C format strings cannot consume the second argument but skip the first argument. Instead, you have to reorder the arguments so that ‘n’ comes last:
printf (ngettext ("%2$d file removed from directory %1$s", "%2$d files removed from directory %1$s", n), dir, n);
See C Format Strings for details about this argument reordering syntax.
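To see why putting n last helps, here is a small sketch using snprintf with POSIX positional conversions. The helper name format_removed and the two format strings are hypothetical stand-ins for what a translation might supply; the form is picked with the plain n == 1 rule instead of a catalog lookup, and n is an int to match the %d conversion.

```c
#include <stdio.h>

/* Hypothetical helper: choose a format with the n == 1 rule and
   format it with dir as argument 1 and n as argument 2.  */
int
format_removed (char *buf, size_t size, const char *dir, int n)
{
  /* Stand-ins for translated strings.  The singular spells out "One"
     and therefore never refers to argument 2.  */
  const char *singular = "One file removed from directory %1$s";
  const char *plural = "%2$d files removed from directory %1$s";

  return snprintf (buf, size, n == 1 ? singular : plural, dir, n);
}
```

Because the unused argument is the trailing one, the singular string can simply leave it out, just as in the single-argument "One file removed" case.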
When you know that the value of n is within a given range, you can specify it as a comment directed to the xgettext tool. This information may help translators to use more adequate translations. Like this:
if (days > 7 && days < 14)
  /* xgettext: range: 1..6 */
  printf (ngettext ("one week and one day", "one week and %d days",
                    days - 7),
          days - 7);
It is also possible to use this function when the strings don’t contain a cardinal number:
puts (ngettext ("Delete the selected file?", "Delete the selected files?", n));
In this case the number n is only used to choose the plural form.
- Function: char * dngettext (const char *domain, const char *msgid1, const char *msgid2, unsigned long int n)
The dngettext function is similar to the dgettext function in the way the message catalog is selected. The difference is that it takes two extra parameters to provide the correct plural form. These two parameters are handled in the same way ngettext handles them.
- Function: char * dcngettext (const char *domain, const char *msgid1, const char *msgid2, unsigned long int n, int category)
The dcngettext function is similar to the dcgettext function in the way the message catalog is selected. The difference is that it takes two extra parameters to provide the correct plural form. These two parameters are handled in the same way ngettext handles them.
Now, how do these functions solve the problem of the plural forms? Without the input of linguists (which was not available) it was not possible to determine whether there are only a few different forms in which plural forms are formed or whether the number can increase with every new supported language.
Therefore the solution implemented is to allow the translator to specify the rules of how to select the plural form. Since the formula varies with every language this is the only viable solution except for hardcoding the information in the code (which still would require the possibility of extensions to not prevent the use of new languages).
The information about the plural form selection has to be stored in the
header entry of the PO file (the one with the empty msgid string).
The plural form information looks like this:
Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
The nplurals value must be a decimal number which specifies how
many different plural forms exist for this language. The string
following plural is an expression which uses the C language
syntax. Exceptions are that no negative numbers are allowed, numbers
must be decimal, and the only variable allowed is n. Spaces are
allowed in the expression, but backslash-newlines are not; in the
examples below the backslash-newlines are present for formatting purposes
only. This expression will be evaluated whenever one of the functions
ngettext, dngettext, or dcngettext is called. The numeric value passed
to these functions is then substituted for all uses of the variable n
in the expression. The resulting value must then be greater than or
equal to zero and smaller than the value given as the value of nplurals.
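As a rough sketch of how the two header values fit together, the fragment below reads nplurals from a header line with sscanf and checks that an index computed by a plural expression stays in range. The helper names (parse_nplurals, english_plural, select_index) are made up for this illustration, the plural expression is translated by hand into a C function rather than parsed, and using index 0 for an out-of-range result is an assumption of the sketch.

```c
#include <stdio.h>

/* Read the nplurals value from a Plural-Forms header line.
   Returns 0 if the line does not match.  */
unsigned long int
parse_nplurals (const char *header)
{
  unsigned long int nplurals = 0;

  if (sscanf (header, "Plural-Forms: nplurals=%lu;", &nplurals) != 1)
    return 0;
  return nplurals;
}

/* Hand-written equivalent of "plural=n == 1 ? 0 : 1".  */
unsigned long int
english_plural (unsigned long int n)
{
  return n == 1 ? 0 : 1;
}

/* Evaluate the expression for n and keep the result below nplurals.
   Falling back to index 0 is an assumption made for this sketch.  */
unsigned long int
select_index (unsigned long int (*plural) (unsigned long int),
              unsigned long int nplurals, unsigned long int n)
{
  unsigned long int index = plural (n);

  return index < nplurals ? index : 0;
}
```

The range check matters because a header whose expression can yield a value of nplurals or more would otherwise index past the translations stored for the message.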
The following rules are known at this point. The languages are listed with their families. But this does not necessarily mean the information can be generalized for the whole family (as can be easily seen in the table below).
- Only one form:
Some languages only require one single form. There is no distinction between the singular and plural form. An appropriate header entry would look like this:
Plural-Forms: nplurals=1; plural=0;
Languages with this property include:
- Asian family
Japanese, Vietnamese, Korean
- Two forms, singular used for one only
This is the form used in most existing programs since it is what English is using. A header entry would look like this:
Plural-Forms: nplurals=2; plural=n != 1;
(Note: this uses the feature of C expressions that boolean expressions have the value zero or one.)
Languages with this property include:
- Germanic family
English, German, Dutch, Swedish, Danish, Norwegian, Faroese
- Romanic family
Spanish, Portuguese, Italian
- Slavic family
Bulgarian
- Latin/Greek family
Greek
- Finno-Ugric family
Finnish, Estonian
- Semitic family
Hebrew
- Artificial
Esperanto
Other languages using the same header entry are:
- Finno-Ugric family
Hungarian
- Turkic/Altaic family
Turkish
Hungarian does not appear to have a plural if you look at sentences involving cardinal numbers. For example, “1 apple” is “1 alma”, and “123 apples” is “123 alma”. But when the number is not explicit, the distinction between singular and plural exists: “the apple” is “az alma”, and “the apples” is “az almák”. Since ngettext has to support both types of sentences, it is classified here, under “two forms”.
The same holds for Turkish: “1 apple” is “1 elma”, and “123 apples” is “123 elma”. But when the number is omitted, the distinction between singular and plural exists: “the apple” is “elma”, and “the apples” is “elmalar”.
- Two forms, singular used for zero and one
Exceptional case in the language family. The header entry would be:
Plural-Forms: nplurals=2; plural=n>1;
Languages with this property include:
- Romanic family
Brazilian Portuguese, French
- Three forms, special case for zero
The header entry would be:
Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2;
Languages with this property include:
- Baltic family
Latvian
- Three forms, special cases for one and two
The header entry would be:
Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
Languages with this property include:
- Celtic
Gaeilge (Irish)
- Three forms, special case for numbers ending in 00 or [2-9][0-9]
The header entry would be:
Plural-Forms: nplurals=3; \
    plural=n==1 ? 0 : (n==0 || (n%100 > 0 && n%100 < 20)) ? 1 : 2;
Languages with this property include:
- Romanic family
Romanian
- Three forms, special case for numbers ending in 1[2-9]
The header entry would look like this:
Plural-Forms: nplurals=3; \
    plural=n%10==1 && n%100!=11 ? 0 : \
           n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2;
Languages with this property include:
- Baltic family
Lithuanian
- Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4]
The header entry would look like this:
Plural-Forms: nplurals=3; \
    plural=n%10==1 && n%100!=11 ? 0 : \
           n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
Languages with this property include:
- Slavic family
Russian, Ukrainian, Belarusian, Serbian, Croatian
- Three forms, special cases for 1 and 2, 3, 4
The header entry would look like this:
Plural-Forms: nplurals=3; \
    plural=(n==1) ? 0 : (n>=2 && n<=4) ? 1 : 2;
Languages with this property include:
- Slavic family
Czech, Slovak
- Three forms, special case for one and some numbers ending in 2, 3, or 4
The header entry would look like this:
Plural-Forms: nplurals=3; \
    plural=n==1 ? 0 : \
           n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
Languages with this property include:
- Slavic family
Polish
- Four forms, special case for one and all numbers ending in 02, 03, or 04
The header entry would look like this:
Plural-Forms: nplurals=4; \
    plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3;
Languages with this property include:
- Slavic family
Slovenian
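Since the plural expressions use C syntax, a few of the formulas from the table can be copied into C functions and tried on sample values. The sketch below does this for four of the entries; the function names are made up for this illustration, and the bodies are the header expressions with only whitespace added.

```c
/* Polish: nplurals=3.  */
unsigned long int
plural_polish (unsigned long int n)
{
  return n == 1 ? 0
         : n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20) ? 1
         : 2;
}

/* Russian, Ukrainian, Belarusian, Serbian, Croatian: nplurals=3.  */
unsigned long int
plural_russian (unsigned long int n)
{
  return n % 10 == 1 && n % 100 != 11 ? 0
         : n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20) ? 1
         : 2;
}

/* Latvian: nplurals=3, with a form of its own for zero.  */
unsigned long int
plural_latvian (unsigned long int n)
{
  return n % 10 == 1 && n % 100 != 11 ? 0 : n != 0 ? 1 : 2;
}

/* Slovenian: nplurals=4.  */
unsigned long int
plural_slovenian (unsigned long int n)
{
  return n % 100 == 1 ? 0
         : n % 100 == 2 ? 1
         : n % 100 == 3 || n % 100 == 4 ? 2
         : 3;
}
```

Checked against the Polish example quoted earlier, 1 gives form 0 (plik), 2 to 4 and 22 to 24 give form 1 (pliki), and 5 to 21 and 25 to 31 give form 2. The value 21 shows how close rules still differ: it selects form 0 under the Russian rule but form 2 under the Polish one.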
You might now ask: ngettext handles only numbers n of type
‘unsigned long’. What about larger integer types? What about negative
numbers? What about floating-point numbers?
About larger integer types, such as ‘uintmax_t’ or
‘unsigned long long’: they can be handled by reducing the value to a
range that fits in an ‘unsigned long’. Simply casting the value to
‘unsigned long’ would not do the right thing, since it would treat
ULONG_MAX + 1 like zero, ULONG_MAX + 2 like singular, and the like.
Here you can exploit the fact that all mentioned plural form
formulas eventually become periodic, with a period that is a divisor of 100
(or 1000 or 1000000). So, when you reduce a large value to another one in
the range [1000000, 1999999] that ends in the same 6 decimal digits, you
can assume that it will lead to the same plural form selection. This code
does this:
#include <inttypes.h>
#include <limits.h>
uintmax_t nbytes = ...;
printf (ngettext ("The file has %"PRIuMAX" byte.",
                  "The file has %"PRIuMAX" bytes.",
                  (nbytes > ULONG_MAX
                   ? (nbytes % 1000000) + 1000000
                   : nbytes)),
        nbytes);
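The reduction step can also be isolated in a small function, which makes it easy to try on concrete values. In this sketch the name reduce_for_plural is hypothetical, and the limit is passed as a parameter (an assumption of the sketch; real code would compare against ULONG_MAX) so that the reduction can be exercised even on systems where ‘unsigned long’ is as wide as ‘uintmax_t’.

```c
#include <stdint.h>

/* Hypothetical helper: if n exceeds limit, replace it by a value in
   [1000000, 1999999] that ends in the same 6 decimal digits;
   otherwise return n unchanged.  */
uintmax_t
reduce_for_plural (uintmax_t n, uintmax_t limit)
{
  return n > limit ? n % 1000000 + 1000000 : n;
}
```

With the limit set to 4294967295 (the value of ULONG_MAX where ‘unsigned long’ has 32 bits), the value 4294967296 is reduced to 1967296, whereas a plain cast to a 32-bit type would yield zero.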
Negative and floating-point values usually represent physical entities for
which singular and plural don’t clearly apply. In such cases, there is no
need to use ngettext; a simple gettext call with a form suitable
for all values will do. For example:
printf (gettext ("Time elapsed: %.3f seconds"), num_milliseconds * 0.001);
Even if num_milliseconds happens to be a multiple of 1000, the output
Time elapsed: 1.000 seconds
is acceptable in English, and similarly for other languages.
The translators’ perspective regarding plural forms is explained in Translating plural forms.
This document was generated on June 7, 2014 using texi2html 5.0.