Does Size Matter?

The native size of a CPU is the size of the integer value that the CPU works with most efficiently. On the 386 and later, the preferred size is 32 bits. However, Intel CPUs can also work with 16-bit WORD values and 8-bit BYTE values.
If you're one of those wacky funsters who's read the Intel architecture manual (and stayed awake), you might have noticed that many instructions have two forms. One works on byte-sized operands, while the other works with the native-size operand. On a current chip running a 32-bit operating system, this would be a 32-bit DWORD. When Windows 95 thunks down to 16-bit code, the preferred size is 16 bits.
Now, if the preferred size is 32 bits, how can the CPU work with 16-bit values? The answer is the operand size prefix. This 1-byte value (0x66 if you're curious) precedes the instruction's opcodes and toggles the native data size for that instruction. If you're running in 32-bit mode, and you use the operand size prefix, the instruction will operate on 16-bit WORDs. Likewise, if you're in 16-bit code, but you specify the size prefix, you'll use 32-bit operands.
The trouble with these prefix bytes is that they increase the total code size and potentially slow things down. How much they slow things down depends primarily on the CPU architecture you're running on. To test the potential performance degradation, I wrote the NativeSize program shown in Figure A.
NativeSize doesn't do anything spectacular. My goal was to use both 16-bit WORDs and 32-bit DWORDs in an identical manner to compare the relative times. Because instructions execute so quickly, I had to repeat the operations many times to get a measurable interval. Eventually, I settled on using a pair of nested for loops, with the WORDs and DWORDs acting as the counters. I had to nest the loops because an unsigned 16-bit loop counter can count at most 65,535 iterations, which is too few to time reliably.
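
The nested-loop idea can be sketched roughly as follows. This is not the Figure A source; it is a minimal reconstruction under a few assumptions: the function names (time_word_loops, time_dword_loops), the global volatile counters, and the use of the standard C clock() function for timing are all mine. The volatile qualifier is there only so that a modern optimizing compiler doesn't delete the empty loops outright.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical counters. They are global and volatile so that each
   compare and increment goes through memory instead of being
   optimized away. */
static volatile uint16_t word_outer, word_inner;
static volatile uint32_t dword_outer, dword_inner;

/* Runs outer_passes * 65535 empty iterations using 16-bit counters
   and returns the elapsed processor time in seconds. */
double time_word_loops(uint16_t outer_passes)
{
    clock_t start = clock();
    for (word_outer = 0; word_outer < outer_passes; word_outer++)
        for (word_inner = 0; word_inner < 0xFFFF; word_inner++)
            ;   /* empty body: only the counter traffic is timed */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same loop shape and same iteration count, but with 32-bit counters. */
double time_dword_loops(uint32_t outer_passes)
{
    clock_t start = clock();
    for (dword_outer = 0; dword_outer < outer_passes; dword_outer++)
        for (dword_inner = 0; dword_inner < 0xFFFF; dword_inner++)
            ;   /* empty body: only the counter traffic is timed */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling both functions with the same outer_passes value and dividing the WORD time by the DWORD time gives the kind of ratio discussed in this sidebar. Whether a given compiler actually emits prefixed 16-bit instructions for the first function depends on the compiler, its settings, and the target, so it's worth checking the generated code before trusting the numbers.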
Here's the code the compiler generated for the WORD version of NativeSize's inner loop. Note that the two instructions that reference the counter at address 0x0040AC2C explicitly use "WORD PTR" (and hence have a size prefix). The code for the 32-bit version of the loop (not shown) looks identical, except that it references its counters with "DWORD PTR".


 401051:  CMP WORD PTR [0040AC2C],FFFF
 40105A:  JNB 00401066
 
 40105C:  NOP
 40105D:  INC WORD PTR [0040AC2C]
 401064:  JMP 00401051
 
 401066:
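
To make the prefix concrete, here is a small sketch that spells out the bytes of the INC instruction from the listing, with and without the 0x66 prefix. The listing shows only mnemonics, so the byte values are my reconstruction from the standard IA-32 encoding (opcode FF, ModRM byte 05, then the 32-bit address in little-endian order). They do agree with the listing's addresses: the INC runs from 40105D to 401064, which is 7 bytes. The helper names are hypothetical, and the prefix check is deliberately simplified.

```c
#include <stddef.h>
#include <stdint.h>

#define OPERAND_SIZE_PREFIX 0x66

/* INC DWORD PTR [0040AC2C] as encoded in 32-bit code (6 bytes). */
const uint8_t inc_dword[] = { 0xFF, 0x05, 0x2C, 0xAC, 0x40, 0x00 };

/* INC WORD PTR [0040AC2C]: the same bytes behind the prefix (7 bytes). */
const uint8_t inc_word[] = { 0x66, 0xFF, 0x05, 0x2C, 0xAC, 0x40, 0x00 };

/* Returns 1 if the first byte is the operand size prefix.
   Simplification: a real decoder must also skip any other prefix
   bytes that may come before or after 0x66. */
int has_operand_size_prefix(const uint8_t *code, size_t len)
{
    return len > 0 && code[0] == OPERAND_SIZE_PREFIX;
}

/* The prefix toggles the default operand size for one instruction:
   32 becomes 16 in 32-bit code, and 16 becomes 32 in 16-bit code. */
int effective_operand_bits(int default_bits, const uint8_t *code, size_t len)
{
    if (!has_operand_size_prefix(code, len))
        return default_bits;
    return default_bits == 32 ? 16 : 32;
}
```

With a 32-bit default, effective_operand_bits reports 16 for inc_word and 32 for inc_dword. The only difference between the two encodings is the single extra byte at the front.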
On my Pentium Pro system, the WORD-based loops take around 3.5 times as long as the otherwise identical DWORD loops. I got similar results on a Pentium II chip. These results aren't surprising, since the Pentium Pro was heavily optimized toward 32-bit code. The mainstream computer press gave the Pentium Pro a hard time when it first appeared, because its 16-bit operations weren't significantly faster than a Pentium's. My take: avoid the problem. Run a real 32-bit operating system and real 32-bit apps.
On my ancient, creaky 486 (66MHz), the WORD loops take on average about 1.18 times as long as the DWORD loops. On Pentium CPUs, the WORD and DWORD versions of the loop take the same amount of time. The upshot of all this? The native size effect is real, but its overall effect depends on the CPU architecture.