INF: Some Causes of Differences in Floating-Point Results

ID Number: Q46749

5.10 6.00 6.00a 6.00ax 7.00

MS-DOS

Summary:

This article discusses some reasons why programs might produce

different floating-point results when compiled with different compiler

options.

The program below produces different results when complied using

cl -AM -FPi prog.c

than when using the following:

cl -AM -FPa prog.c

Part of the reason for the different results is that /FPa and /FPi

generate math routines that don't work the same. /FPi math emulates

the 80x87, to the point of actually converting 8-byte doubles to

10-byte internal format and doing the math in internal format. /FPa

uses an 8-byte format for calculations; therefore, it is less

accurate. This often accounts for differences in results.

More Information:

Also, the second number printed in the /FPi case is smaller than

DBL_MIN, as defined in FLOAT.H. This situation is also correct because

DBL_MIN is the smallest possible NORMALIZED value. (Normalized means

that the high-order bit of the mantissa is a one.)

"Denormals" (numbers where there are zeros in some of the high-order

bits of the mantissa), however, can represent numbers "x" in the

ranges + DBL_MIN > x > 0 and 0 > x > -DBL_MIN. Although this is an

unusual situation, it is not an error. A denormal is less precise than

a normalized number; however, a denormal is still more precise than 0

(zero) (which is the next best representation). By allowing use of

denormal numbers, we make our floating-point result slightly more

accurate. The alternate math library (/FPa) represents denormal

numbers as 0 (zero).

There is a good explanation of floating-point exceptions (including

the denormal exception, which is always masked) in the FPEXCEPT.DOC

file that comes with the C compiler. For more detailed background, see

the Intel "80387 Programmer's Reference Manual."

Another possible cause of differences in floating-point results is the

inclusion or omission of the /Op option. When /Op is omitted, the

compiler may skip storing intermediate results as 64-bit objects in

memory, leaving them instead in the 80-bit registers of the 80x87 (or

emulator package). This increases the speed and accuracy of the

calculation. However, this can decrease the consistency of the

calculations because other intermediate results may have been stored

in 64-bit objects in memory anyway. Including /Op forces all

intermediate results to be stored in memory, giving more consistent

results. This option is often handy in programs involving complicated

floating-point calculations.

The program and its output follow:

Sample Code

-----------

#include <stdio.h> // START OF PROG.C

#include <float.h>

main()

{

double a,d,c,prod1,prod2;

_fpreset();

a=9.5788979e-283;

b=8.050847e-1;

c=9.5588526e-28;

prod1=a*b;

printf("\n product1 = %1.15le \n",prod1);

prod2=c*prod1;

printf("\n product2 = %1.15le \n",prod2);

} // END OF PROG.C

Results

-------

// RESULTS OBTAINED USING CL -AM -FPi PROG.C

product1 = 7.711824142152130e-283

product2 = 7.371619025195353e-310 // This value is less than DBL_MIN

// RESULTS OBTAINED USING CL -AM -FPa PROG.C

product1 = 7.711824142152130e-283

product2 = 0.000000000000000e+000

Additional reference words: 5.00 5.10 6.00 6.00a 6.00ax 7.00