3.2 Floating-point numbers
Not all real numbers can be represented exactly. (There is an easy mathematical proof for this: Only a countable set of numbers can be stored exactly in a computer, even if one assumes that it has unlimited storage. But there are uncountably many real numbers.) So some approximation is needed. CLN implements ordinary floating-point numbers, with mantissa and exponent.
The elementary operations (+, -, *, /, …) only return approximate results. For example, the value of the expression
(cl_F) 0.3 + (cl_F) 0.4
prints as ‘0.70000005’, not as ‘0.7’. Rounding errors like this one are inevitable when computing with floating-point numbers.
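The same effect can be reproduced without CLN. The following is a minimal sketch that uses the built-in C++ type `float`, assuming it is an IEEE single-precision number (the format the single-float type corresponds to). The names `sum_single`, `format8` and `check_sum_single` are illustrative only and are not part of CLN.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Adds 0.3 and 0.4 in single precision.
// Storing the result in a `float` rounds it to 24 mantissa bits.
float sum_single() {
    float a = 0.3f;
    float b = 0.4f;
    float s = a + b;
    return s;
}

// Formats a float with 8 significant decimal digits.
std::string format8(float x) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.8g", static_cast<double>(x));
    return std::string(buf);
}

// The rounded sum differs from the float nearest to 0.7.
void check_sum_single() {
    assert(sum_single() != 0.7f);
    assert(format8(sum_single()) == "0.70000005");
}
```

Calling `check_sum_single()` from `main` should pass on platforms where `float` is IEEE single precision: neither 0.3 nor 0.4 is exactly representable, and the rounded sum is one unit in the last place above the float nearest to 0.7.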
Nevertheless, CLN rounds the floating-point results of the operations +, -, *, /, sqrt according to the “round-to-even” rule: It first computes the exact mathematical result and then returns the floating-point number which is nearest to this. If two floating-point numbers are equally distant from the ideal result, the one with a 0 in its least significant mantissa bit is chosen.
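The tie-breaking part of the rule can be observed with the built-in C++ type `float`, which uses the same rounding rule by default on IEEE platforms. This is a sketch under that assumption, not CLN code; the helper names `half_ulp`, `tie_rounds_down` and `tie_rounds_up` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Half the spacing between adjacent floats in [1, 2): 2^-24.
// Assumes `float` has 24 mantissa bits (IEEE single precision).
float half_ulp() { return std::ldexp(1.0f, -24); }

// 1 + 2^-24 lies exactly between 1 and 1 + 2^-23.
// The mantissa of 1 ends in a 0 bit, so the tie resolves to 1.
float tie_rounds_down() {
    float r = 1.0f + half_ulp();
    return r;
}

// (1 + 2^-23) + 2^-24 lies exactly between 1 + 2^-23 and 1 + 2^-22.
// The mantissa of 1 + 2^-22 ends in a 0 bit, so the tie resolves upward.
float tie_rounds_up() {
    float x = 1.0f + std::ldexp(1.0f, -23);
    float r = x + half_ulp();
    return r;
}

void check_ties() {
    assert(tie_rounds_down() == 1.0f);
    assert(tie_rounds_up() == 1.0f + std::ldexp(1.0f, -22));
}
```

Both ties go to the neighbour whose least significant mantissa bit is 0, so the direction of rounding alternates instead of always going up or always going down.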
Because of these rounding errors, testing floating-point numbers for equality with ‘x == y’ is gambling with random errors. It is better to check for ‘abs(x - y) < epsilon’ for some well-chosen epsilon.
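A tolerance-based comparison might look like the following sketch. It uses the built-in type `double` so that it runs without CLN; the function name `approx_equal` and the tolerance `1e-12` are assumptions chosen for illustration, and a suitable epsilon always depends on the magnitude of the values and the computation at hand.

```cpp
#include <cassert>
#include <cmath>

// Returns true when x and y differ by less than eps.
// `approx_equal` is an illustrative helper, not a CLN function.
bool approx_equal(double x, double y, double eps) {
    return std::fabs(x - y) < eps;
}

void check_approx_equal() {
    double s = 0.1 + 0.2;
    assert(s != 0.3);                     // exact equality fails
    assert(approx_equal(s, 0.3, 1e-12));  // tolerance-based check passes
}
```

The same pattern carries over to CLN's floating-point types, with ‘abs(x - y)’ compared against an epsilon of the matching type.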
Floating point numbers come in four flavors:
- Short floats, type cl_SF. They have 1 sign bit, 8 exponent bits (including the exponent’s sign), and 17 mantissa bits (including the “hidden” bit). They do not require heap allocation.
- Single floats, type cl_FF. They have 1 sign bit, 8 exponent bits (including the exponent’s sign), and 24 mantissa bits (including the “hidden” bit). In CLN, they are represented as IEEE single-precision floating point numbers. This corresponds closely to the C/C++ type ‘float’.
- Double floats, type cl_DF. They have 1 sign bit, 11 exponent bits (including the exponent’s sign), and 53 mantissa bits (including the “hidden” bit). In CLN, they are represented as IEEE double-precision floating point numbers. This corresponds closely to the C/C++ type ‘double’.
- Long floats, type cl_LF. They have 1 sign bit, 32 exponent bits (including the exponent’s sign), and n mantissa bits (including the “hidden” bit), where n >= 64. The precision of a long float is unlimited, but once created, a long float has a fixed precision. (No “lazy recomputation”.)
Of course, computations with long floats are more expensive than those with smaller floating-point formats.
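The bit counts given for single and double floats can be compared with the C++ types ‘float’ and ‘double’ they correspond to. The sketch below queries `std::numeric_limits`; it assumes 8-bit bytes and a packed IEEE layout in which the hidden bit is not stored. The helper names `mantissa_bits` and `exponent_bits` are hypothetical, and nothing here inspects CLN's own types.

```cpp
#include <cassert>
#include <limits>

// Number of mantissa bits of T, counting the hidden bit.
template <typename T>
constexpr int mantissa_bits() {
    return std::numeric_limits<T>::digits;
}

// Number of exponent bits: total bits minus the sign bit
// minus the stored mantissa bits (the hidden bit is not stored).
template <typename T>
constexpr int exponent_bits() {
    return static_cast<int>(sizeof(T) * 8) - 1 - (mantissa_bits<T>() - 1);
}

void check_formats() {
    assert(mantissa_bits<float>() == 24);
    assert(exponent_bits<float>() == 8);
    assert(mantissa_bits<double>() == 53);
    assert(exponent_bits<double>() == 11);
}
```

On a platform where these assertions hold, ‘float’ and ‘double’ have the same field widths as those listed for single and double floats.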
CLN does not implement features like NaNs, denormalized numbers and gradual underflow. If the exponent range of some floating-point type is too limited for your application, choose another floating-point type with larger exponent range.
As a user of CLN, you can forget about the differences between the four floating-point types and just declare all your floating-point variables as being of type cl_F. This has the advantage that when you change the precision of some computation (say, from cl_DF to cl_LF), you don’t have to change the code, only the precision of the initial values. Also, many transcendental functions have been declared as returning a cl_F when the argument is a cl_F, but such declarations are missing for the types cl_SF, cl_FF, cl_DF, cl_LF. (Such declarations would be wrong if the floating point contagion rule happened to change in the future.)
This document was generated on August 27, 2013 using texi2html 5.0.