float(3) BSD Library Functions Manual float(3)
NAME
float -- description of floating-point types available on OS X and iOS
DESCRIPTION
This page describes the available C floating-point types. For a list of math library functions that operate on these types, see the page on the math library, "man math".
TERMINOLOGY
Floating-point numbers are represented in three parts: a sign, a mantissa (or significand), and an exponent. Given such a representation with sign s (+1 or -1), mantissa m, and exponent e, the corresponding numerical value is s*m*2**e.

Floating-point types differ in the number of bits of accuracy in the mantissa (called the precision), and in the set of available exponents (the exponent range).

Floating-point numbers with the maximum available exponent are reserved operands, denoting an infinity if the significand is precisely zero, and a Not-a-Number, or NaN, otherwise.

Floating-point numbers with the minimum available exponent are zero if the significand is precisely zero, and denormal otherwise. Note that zero is signed: +0 and -0 are distinct floating-point numbers.

Floating-point numbers with exponents other than the maximum and minimum available are called normal numbers.
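The categories above map onto the classification macros in <math.h>. The following is a minimal sketch, not part of the original page; the helper names category() and is_negative() are hypothetical, and C99 calls a denormal number "subnormal".

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: name the category that TERMINOLOGY assigns to x. */
const char *category(double x)
{
    switch (fpclassify(x)) {
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "denormal";   /* C99 name: subnormal */
    case FP_NORMAL:    return "normal";
    case FP_INFINITE:  return "infinity";
    case FP_NAN:       return "NaN";
    default:           return "unknown";    /* implementation-defined extras */
    }
}

/* Hypothetical helper: 1 if the sign of x is negative. Unlike x < 0,
   this also distinguishes -0 from +0. */
int is_negative(double x)
{
    return signbit(x) != 0;
}
```

For example, category(DBL_MIN / 2) names a denormal, because halving the smallest positive normal double leaves the normal range without reaching zero, and is_negative(-0.0) is 1 even though -0.0 == 0.0 compares equal.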
PROPERTIES OF IEEE-754 FLOATING-POINT
Basic arithmetic operations in IEEE-754 floating-point are correctly rounded: this means that the result delivered is the same as the result that would be achieved by computing the exact real-number operation on the operands, then rounding the real-number result to a floating-point value.

Overflow occurs when the value of the exact result is too large in magnitude to be represented in the floating-point type in which the computation is being performed; doing so would require an exponent outside of the exponent range of the type. By default, computations that result in overflow return a signed infinity.

Underflow occurs when the value of the exact result is too small in magnitude to be represented as a normal number in the floating-point type in which the computation is being performed. By default, underflow is gradual, and produces a denormal number or a zero.

All floating-point numbers of a given type are integer multiples of the smallest non-zero floating-point number of that type; however, the converse is not true. This means that, in the default mode, (x-y) = 0 only if x = y.

The sign of zero transforms correctly through multiplication and division, and is preserved by addition of zeros with like signs, but in the default rounding mode x - x yields +0 for every finite floating-point number x. The only operations that reveal the sign of a zero are x/(+-0) and copysign(x,+-0). In particular, comparisons (x > y, x != y, etc.) are not affected by the sign of zero.

The sign of infinity transforms correctly through multiplication and division, and infinities are unaffected by addition or subtraction of any finite floating-point number. But Inf-Inf, Inf*0, and Inf/Inf are, like 0/0 or sqrt(-3), invalid operations that produce NaN.

NaNs are the default results of invalid operations, and they propagate through subsequent arithmetic operations.
If x is a NaN, then x != x is TRUE, and every other comparison predicate (x > y, x == y, x <= y, etc.) evaluates to FALSE, regardless of the value of y. Additionally, predicates that entail an ordered comparison (rather than mere equality or inequality) signal Invalid Operation when one of the arguments is NaN.

IEEE-754 provides five kinds of floating-point exceptions, listed below:

Exception            Default Result
__________________________________________
Invalid Operation    NaN or FALSE
Overflow             +-Infinity
Divide by Zero       +-Infinity
Underflow            Gradual Underflow
Inexact              Rounded Value

NOTE: An exception is not an error unless it is handled incorrectly. What makes a class of exceptions exceptional is that no single default response can be satisfactory in every instance. On the other hand, because a default response will serve most instances of the exception satisfactorily, simply aborting the computation cannot be justified.

For each kind of floating-point exception, IEEE-754 provides a flag that is raised each time its exception is signaled, and remains raised until the program resets it. Programs may test, save, and restore the flags, or a subset thereof.
PRECISION AND EXPONENT RANGE OF SPECIFIC FLOATING-POINT TYPES
On both OS X and iOS, the type float corresponds to IEEE-754 single precision. A single-precision number is represented in 32 bits, and has a precision of 24 significant bits, roughly 7 significant decimal digits. The exponent is encoded in 8 bits, which gives an exponent range from -126 to 127, inclusive.

The header <float.h> defines several useful constants for the float type:

FLT_MANT_DIG - the number of binary digits in the significand of a float.
FLT_MIN_EXP - one more than the smallest exponent available in the float type.
FLT_MAX_EXP - one more than the largest exponent available in the float type.
FLT_DIG - the precision in decimal digits of a float. A decimal value with this many digits, stored as a float, always yields the same value up to this many digits when converted back to decimal notation.
FLT_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal number as a float.
FLT_MAX_10_EXP - the largest n such that 10**n is finite as a float.
FLT_MIN - the smallest positive normal float.
FLT_MAX - the largest finite float.
FLT_EPSILON - the difference between 1.0 and the smallest float bigger than 1.0.

On both OS X and iOS, the type double corresponds to IEEE-754 double precision. A double-precision number is represented in 64 bits, and has a precision of 53 significant bits, roughly 16 significant decimal digits. The exponent is encoded in 11 bits, which gives an exponent range from -1022 to 1023, inclusive.

The header <float.h> defines several useful constants for the double type:

DBL_MANT_DIG - the number of binary digits in the significand of a double.
DBL_MIN_EXP - one more than the smallest exponent available in the double type.
DBL_MAX_EXP - one more than the largest exponent available in the double type.
DBL_DIG - the precision in decimal digits of a double. A decimal value with this many digits, stored as a double, always yields the same value up to this many digits when converted back to decimal notation.
DBL_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal number as a double.
DBL_MAX_10_EXP - the largest n such that 10**n is finite as a double.
DBL_MIN - the smallest positive normal double.
DBL_MAX - the largest finite double.
DBL_EPSILON - the difference between 1.0 and the smallest double bigger than 1.0.

On Intel Macs, the type long double corresponds to IEEE-754 double extended precision. A double extended number is represented in 80 bits, and has a precision of 64 significant bits, roughly 19 significant decimal digits. The exponent is encoded in 15 bits, which gives an exponent range from -16382 to 16383, inclusive.

The header <float.h> defines several useful constants for the long double type:

LDBL_MANT_DIG - the number of binary digits in the significand of a long double.
LDBL_MIN_EXP - one more than the smallest exponent available in the long double type.
LDBL_MAX_EXP - one more than the largest exponent available in the long double type.
LDBL_DIG - the precision in decimal digits of a long double. A decimal value with this many digits, stored as a long double, always yields the same value up to this many digits when converted back to decimal notation.
LDBL_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal number as a long double.
LDBL_MAX_10_EXP - the largest n such that 10**n is finite as a long double.
LDBL_MIN - the smallest positive normal long double.
LDBL_MAX - the largest finite long double.
LDBL_EPSILON - the difference between 1.0 and the smallest long double bigger than 1.0.

On ARM iOS devices, the type long double corresponds to IEEE-754 double precision. Thus, the values of the LDBL_* macros are identical to those of the corresponding DBL_* macros.
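The relationships among these constants can be checked directly. The following is a minimal sketch, not part of the original page; the function names are hypothetical, and the checks assume the default round-to-nearest mode with gradual underflow enabled.

```c
#include <assert.h>
#include <float.h>

/* 1 if float and double have the IEEE-754 single and double precision
   parameters given above: *_MIN_EXP and *_MAX_EXP are each one more
   than the ends of the stated exponent ranges. */
int ieee_parameters_match(void)
{
    return FLT_MANT_DIG == 24 && FLT_MIN_EXP == -125 && FLT_MAX_EXP == 128
        && DBL_MANT_DIG == 53 && DBL_MIN_EXP == -1021 && DBL_MAX_EXP == 1024;
}

/* 1 if adding FLT_EPSILON to 1.0f yields a larger float, while adding
   half of it rounds back to 1.0f. */
int flt_epsilon_is_spacing_at_one(void)
{
    volatile float up   = 1.0f + FLT_EPSILON;
    volatile float same = 1.0f + FLT_EPSILON / 2;
    return up > 1.0f && same == 1.0f;
}

/* 1 if halving FLT_MIN gives a non-zero value below the normal range,
   that is, a denormal produced by gradual underflow. */
int below_flt_min_is_denormal(void)
{
    volatile float half_min = FLT_MIN / 2;
    return half_min > 0.0f && half_min < FLT_MIN;
}

/* 1 where long double has the same parameters as double (as described
   for ARM iOS devices), 0 where it is wider (as on Intel Macs). */
int long_double_is_double(void)
{
    return LDBL_MANT_DIG == DBL_MANT_DIG
        && LDBL_MIN_EXP == DBL_MIN_EXP
        && LDBL_MAX_EXP == DBL_MAX_EXP;
}
```

Testing long_double_is_double() at run time, or the same macro comparison in a preprocessor #if, lets code avoid assuming that long double offers more precision than double.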
SEE ALSO
math(3), complex(3)
STANDARDS
Floating-point arithmetic conforms to the ISO/IEC 9899:2011 standard.

BSD March 28, 2007 BSD
Mac OS X 10.9.1 - Generated Tue Jan 7 19:42:11 CST 2014