
3.2 Floating-point numbers

Not all real numbers can be represented exactly. (There is an easy mathematical proof for this: Only a countable set of numbers can be stored exactly in a computer, even if one assumes that it has unlimited storage. But there are uncountably many real numbers.) So some approximation is needed. CLN implements ordinary floating-point numbers, with mantissa and exponent.

The elementary operations (+, -, *, /, …) only return approximate results. For example, the value of the expression (cl_F) 0.3 + (cl_F) 0.4 prints as ‘0.70000005’, not as ‘0.7’. Rounding errors like this one are inevitable when computing with floating-point numbers.

Nevertheless, CLN rounds the floating-point results of the operations +, -, *, /, sqrt according to the “round-to-even” rule: It first computes the exact mathematical result and then returns the floating-point number which is nearest to this. If two floating-point numbers are equally distant from the ideal result, the one with a 0 in its least significant mantissa bit is chosen.
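The 0.3 + 0.4 example above happens to be an exact tie, so it shows the round-to-even rule at work. The following is a minimal sketch, not CLN code: it uses the built-in C++ type ‘float’ as a stand-in for a 24-bit single float and assumes an IEEE 754 platform in the default round-to-nearest mode. The helper names float_bits and demo_round_to_even are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: raw bit pattern of a float.
// Assumes a 32-bit IEEE 754 single-precision 'float'.
std::uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Add 0.3 and 0.4 in single precision. 'volatile' keeps the compiler
// from evaluating the sum in a wider format.
float sum_03_04() {
    volatile float a = 0.3f;   // stored as 10066330 * 2^-25
    volatile float b = 0.4f;   // stored as 13421773 * 2^-25
    float s = a + b;           // exact sum: 23488103 * 2^-25
    return s;
}

void demo_round_to_even() {
    float s = sum_03_04();
    // The exact sum lies halfway between two floats whose 24-bit
    // mantissas are 0xB33333 (odd) and 0xB33334 (even).
    // Round-to-even picks the one ending in a 0 bit.
    assert(float_bits(s) == 0x3F333334u);
    assert((float_bits(s) & 1u) == 0u);
    // The float nearest to 0.7 is the other neighbour, so the sum
    // differs from 0.7f by one unit in the last place.
    assert(float_bits(0.7f) == 0x3F333333u);
    assert(s != 0.7f);
}
```

The chosen value, 0x3F333334, is about 0.70000005 when printed with eight significant digits, which matches the output quoted above.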

Because of these rounding errors, testing floating-point numbers for equality with ‘x == y’ is unreliable. It is better to check ‘abs(x - y) < epsilon’ for some well-chosen epsilon.
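A minimal sketch of such a tolerance check, written with the built-in C++ type ‘double’ rather than a CLN type. The helper name nearly_equal and the tolerance 1e-12 are illustrative assumptions; a suitable epsilon depends on the magnitude of the operands and on the precision in use.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true if x and y differ by less than epsilon.
// This is an absolute tolerance, so epsilon must be chosen with the
// expected magnitude of x and y in mind.
bool nearly_equal(double x, double y, double epsilon) {
    return std::fabs(x - y) < epsilon;
}

void demo_nearly_equal() {
    double sum = 0.1 + 0.2;
    // Exact comparison fails: the rounded sum is not the double
    // nearest to 0.3 (assuming IEEE 754 double arithmetic).
    assert(sum != 0.3);
    // Comparison with a tolerance succeeds.
    assert(nearly_equal(sum, 0.3, 1e-12));
}
```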

Floating-point numbers come in four flavors:

Short floats, type cl_SF. They have 1 sign bit, 8 exponent bits (including the exponent’s sign), and 17 mantissa bits (including the “hidden” bit). They do not require heap allocation.
Single floats, type cl_FF. They have 1 sign bit, 8 exponent bits (including the exponent’s sign), and 24 mantissa bits (including the “hidden” bit). In CLN, they are represented as IEEE single-precision floating-point numbers. This corresponds closely to the C/C++ type ‘float’.
Double floats, type cl_DF. They have 1 sign bit, 11 exponent bits (including the exponent’s sign), and 53 mantissa bits (including the “hidden” bit). In CLN, they are represented as IEEE double-precision floating-point numbers. This corresponds closely to the C/C++ type ‘double’.
Long floats, type cl_LF. They have 1 sign bit, 32 exponent bits (including the exponent’s sign), and n mantissa bits (including the “hidden” bit), where n >= 64. There is no upper limit on the precision that can be chosen for a long float, but once created, a long float has a fixed precision. (No “lazy recomputation”.)
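To relate the mantissa sizes above to decimal precision, the following sketch computes how many decimal digits a binary mantissa of a given width can always represent faithfully, using the same formula as the C++ constant digits10, floor((p - 1) * log10(2)) for p mantissa bits. It is plain standard C++, not CLN code. The function name guaranteed_decimal_digits is hypothetical, and the comparisons with ‘float’ and ‘double’ assume an IEEE 754 platform.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: decimal digits that always survive a
// decimal -> binary -> decimal round trip for a binary format with
// 'mantissa_bits' bits (hidden bit included).
int guaranteed_decimal_digits(int mantissa_bits) {
    return static_cast<int>(
        std::floor((mantissa_bits - 1) * std::log10(2.0)));
}

void demo_decimal_digits() {
    // Mantissa widths listed above for the four flavors.
    assert(guaranteed_decimal_digits(17) == 4);   // short float
    assert(guaranteed_decimal_digits(24) == 6);   // single float
    assert(guaranteed_decimal_digits(53) == 15);  // double float
    assert(guaranteed_decimal_digits(64) == 18);  // smallest long float

    // On an IEEE 754 platform the built-in types have the same
    // mantissa widths as the single and double floats.
    assert(std::numeric_limits<float>::digits == 24);
    assert(std::numeric_limits<double>::digits == 53);
    assert(guaranteed_decimal_digits(24) ==
           std::numeric_limits<float>::digits10);
    assert(guaranteed_decimal_digits(53) ==
           std::numeric_limits<double>::digits10);
}
```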

Of course, computations with long floats are more expensive than those with smaller floating-point formats.

CLN does not implement features like NaNs, denormalized numbers and gradual underflow. If the exponent range of some floating-point type is too limited for your application, choose another floating-point type with larger exponent range.

As a user of CLN, you can forget about the differences between the four floating-point types and just declare all your floating-point variables as being of type cl_F. This has the advantage that when you change the precision of some computation (say, from cl_DF to cl_LF), you don’t have to change the code, only the precision of the initial values. Also, many transcendental functions have been declared as returning a cl_F when the argument is a cl_F, but such declarations are missing for the types cl_SF, cl_FF, cl_DF, cl_LF. (Such declarations would be wrong if the floating-point contagion rule happened to change in the future.)

This document was generated on August 27, 2013 using texi2html 5.0.