You can declare variables as float, double, or long double, depending on the needs of your application. The principal differences between the three types are the significance they can represent, the storage they require, and their range. Table 7.1 shows the relationship between significance and storage requirements.
Table 7.1 Floating-Point Types
Type | Significant Digits | Number of Bytes |
float | 6–7 | 4 |
double | 15–16 | 8 |
long double | 19 | 10 |
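To see how Table 7.1 compares with a particular compiler, the standard header float.h and the sizeof operator can report the same information. The following is a minimal sketch; the helper name print_float_types is hypothetical, and the output depends on the compiler (many current compilers do not store long double in 10 bytes).

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: prints the storage size and the guaranteed number
   of decimal digits for each floating-point type, as reported by the
   compiler in use. Returns the number of rows printed. */
int print_float_types(void)
{
    printf("float       : %2u bytes, %2d digits\n",
           (unsigned)sizeof(float), FLT_DIG);
    printf("double      : %2u bytes, %2d digits\n",
           (unsigned)sizeof(double), DBL_DIG);
    printf("long double : %2u bytes, %2d digits\n",
           (unsigned)sizeof(long double), LDBL_DIG);
    return 3;
}
```

The FLT_DIG, DBL_DIG, and LDBL_DIG macros give the number of decimal digits that are guaranteed to survive a round trip through the type, so they correspond to the lower end of the ranges in the table.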
Floating-point values are represented by a mantissa, which holds the significant digits of the number, and an exponent, which holds the order of magnitude of the number.
Table 7.2 shows the number of bits allocated to the mantissa and the exponent for each floating-point type. The most-significant bit of any float, double, or long double is always the sign bit. If it is 1, the number is considered negative; otherwise, it is considered a positive number.
Table 7.2 Lengths of Exponents and Mantissas
Type | Exponent Length | Mantissa Length |
float | 8 bits | 23 bits |
double | 11 bits | 52 bits |
long double | 15 bits | 64 bits |
Because exponents are stored in an unsigned form, each stored exponent is offset by a bias of roughly half its maximum value. For type float, the bias is 127; for type double, it is 1,023; for type long double, it is 16,383. You can compute the actual exponent value by subtracting the bias value from the stored exponent value.
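The sign bit, biased exponent, and mantissa fields can be inspected directly. The sketch below assumes that type float occupies 32 bits laid out as in Table 7.2; the structure float_fields and the function split_float are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a 4-byte float, assuming the layout in Table 7.2. */
struct float_fields {
    unsigned sign;        /* 1 bit: 1 means negative */
    unsigned stored_exp;  /* 8 bits, biased by 127 */
    int      actual_exp;  /* stored_exp minus the bias */
    uint32_t mantissa;    /* the 23 mantissa bits held in memory */
};

/* Hypothetical helper: splits a float into its sign, exponent, and
   mantissa fields. Assumes float is 32 bits wide. */
struct float_fields split_float(float value)
{
    struct float_fields f;
    uint32_t bits;

    memcpy(&bits, &value, sizeof bits);   /* copy the raw bit pattern */
    f.sign       = (unsigned)(bits >> 31);
    f.stored_exp = (unsigned)((bits >> 23) & 0xFFu);
    f.actual_exp = (int)f.stored_exp - 127;
    f.mantissa   = bits & 0x7FFFFFu;
    return f;
}
```

For example, split_float(1.0f) yields a stored exponent of 127 and an actual exponent of 0, while split_float(0.5f) yields a stored exponent of 126 and an actual exponent of -1.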
The mantissa is stored as a binary fraction greater than or equal to 1 and less than 2. For types float and double, there is an implied leading 1 in the most-significant bit position of the mantissa, so the mantissas are actually 24 and 53 bits long, respectively, even though the most-significant bit is never stored in memory. For type long double, the leading bit is stored explicitly as part of the 64-bit mantissa.
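To illustrate the implied leading 1, the following sketch rebuilds the value of a float from its stored bits. It assumes a 32-bit float laid out as in Table 7.2 and handles only normalized numbers; the function name rebuild_float is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rebuilds the value of a normalized float from its
   stored fields. Assumes a 32-bit float laid out as in Table 7.2; it does
   not handle zero, denormalized numbers, infinities, or NaNs. */
double rebuild_float(float value)
{
    uint32_t bits;
    double mantissa, result;
    int exponent, i;

    memcpy(&bits, &value, sizeof bits);   /* copy the raw bit pattern */

    /* Restore the implied leading 1 as bit 23 to get the full 24-bit
       mantissa, then divide by 2^23 so that 1.0 <= mantissa < 2.0. */
    mantissa = (double)((bits & 0x7FFFFFu) | 0x800000u) / 8388608.0;

    /* Remove the bias from the stored exponent. */
    exponent = (int)((bits >> 23) & 0xFFu) - 127;

    /* Scale the mantissa by 2 raised to the actual exponent. */
    result = mantissa;
    for (i = 0; i < exponent; i++) result *= 2.0;
    for (i = 0; i > exponent; i--) result /= 2.0;

    return (bits >> 31) ? -result : result;
}
```

For a normalized input, the rebuilt value equals the original: rebuild_float(-6.5f) returns -6.5, because -6.5 is stored as a mantissa of 1.625 with an actual exponent of 2.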
Instead of the storage method just described, the floating-point package can store binary floating-point numbers as denormalized numbers. Denormalized numbers are nonzero floating-point numbers with a reserved exponent value in which the most-significant bit of the mantissa is zero. By using denormalized format, the range of a floating-point number can be extended at the cost of precision. You cannot control whether a floating-point number is represented in normalized or denormalized form; the floating-point package determines the representation. The floating-point package never uses denormalized form unless the exponent becomes less than the minimum that can be represented in normalized form.
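A value can be tested for denormalized form at run time. The sketch below assumes a compiler that supports C99 or later, where math.h provides the fpclassify macro and denormalized numbers are called subnormal; the function names is_denormal and below_flt_min are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: returns 1 if value is stored in denormalized
   (subnormal) form, 0 otherwise. Relies on the C99 fpclassify macro. */
int is_denormal(float value)
{
    return fpclassify(value) == FP_SUBNORMAL;
}

/* Hypothetical helper: halves the smallest normalized float. The result
   is too small for normalized form, so it can only be held as a
   denormalized number. */
float below_flt_min(void)
{
    volatile float smallest = FLT_MIN;     /* volatile keeps the division at run time */
    volatile float half = smallest / 2.0f;
    return half;
}
```

Under these assumptions, is_denormal(FLT_MIN) is 0 and is_denormal(below_flt_min()) is 1, provided the program is not built with an option that flushes denormalized results to zero.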
Table 7.3 shows the minimum and maximum values you can store in variables of each floating-point type. The values listed in this table apply only to normalized floating-point numbers; denormalized floating-point numbers have a smaller minimum value. Note that numbers retained in 80x87 registers are always represented in 80-bit normalized form; numbers can be represented in denormalized form only when stored in 32- or 64-bit floating-point variables (type float and type double).
Table 7.3 Range of Floating-Point Types
Type | Minimum Value | Maximum Value |
float | 1.175494351 E–38 | 3.402823466 E+38 |
double | 2.2250738585072014 E–308 | 1.7976931348623158 E+308 |
long double | 3.362103143112093503 E–4932 | 1.189731495357231765 E+4932 |
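The limits in Table 7.3 are also available as macros in the standard header float.h, so a program does not need to hard-code them. The following is a minimal sketch; the helper name print_float_ranges is hypothetical, and the long double values printed depend on how the compiler in use implements that type.

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: prints the smallest normalized positive value and
   the largest finite value of each floating-point type, as reported by
   the compiler in use. Returns the number of rows printed. */
int print_float_ranges(void)
{
    printf("float       : %.9e to %.9e\n", FLT_MIN, FLT_MAX);
    printf("double      : %.16e to %.16e\n", DBL_MIN, DBL_MAX);
    printf("long double : %.18Le to %.18Le\n", LDBL_MIN, LDBL_MAX);
    return 3;
}
```

Comparing against FLT_MAX or DBL_MAX before a conversion is one way to check that a value fits in the narrower type.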
If precision is less of a concern than storage, consider using type float for floating-point variables. Conversely, if precision is the most important criterion, use type long double.
Summary: Microsoft C/C++ observes type-widening rules.
Floating-point variables can be promoted to a type of greater significance (for example, from type float to type double). Promotion often occurs when you perform arithmetic on floating-point variables. This arithmetic is always performed at the precision of the operand that has the highest precision. For example, consider the following type declarations:
float f_short;
double f_long;
long double f_longer;
f_short = f_short * f_long;
In the preceding example, the variable f_short is promoted to type double and multiplied by f_long; then the result is rounded to type float before being assigned to f_short.
In the example below (which uses the declarations from the preceding example), the arithmetic is done in float (32-bit) precision on the variables; the result is then promoted to type long double.
f_longer = f_short * f_short;
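The two assignments above can be wrapped in functions to make the conversions explicit. This is a sketch only: the function names mul_mixed and square_widened are hypothetical, and some compilers evaluate float expressions in a wider intermediate format, in which case the second product may carry more precision than the text describes.

```c
/* Hypothetical helper mirroring f_short = f_short * f_long.
   f_short is promoted to double, the product is computed as a double,
   and the result is rounded back to float on return. */
float mul_mixed(float f_short, double f_long)
{
    return (float)(f_short * f_long);
}

/* Hypothetical helper mirroring f_longer = f_short * f_short.
   Both operands are float, so the product has type float; only the
   finished product is promoted to long double. */
long double square_widened(float f_short)
{
    return f_short * f_short;
}
```

The type chosen for each product can be confirmed with sizeof: for a float f and a double d, sizeof(f * d) equals sizeof(double), while sizeof(f * f) equals sizeof(float).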