This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.


June 1998

Under the Hood

Download Jun98hoodcode.exe (1KB)

Matt Pietrek does advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site at http://www.tiac.net/users/mpietrek has a FAQ page and information on previous columns and articles.

Apparently, there's quite a bit more interest in Win32® assembly language than I had originally thought. After the February 1998 issue of MSJ hit the stands, I received quite a bit of positive email and favorable comments from folks at trade shows. Many readers said, "Have you also thought about covering...?"
      My February 1998 column could have been called "Just Enough Assembly Language to Get By." Since it was such a hit, it's time for the sequel: "Just Enough Assembly Language to Get By, Part II." I'll look at additional instructions and instruction sequences that come up often. I'll also describe some of the most common scenarios when an instruction faults, and what to look for.
      Before JMPing into the details, make sure you're at least familiar with the Intel x86 registers and instruction addressing modes. I covered both subjects in my February column. Also note that none of the instructions mentioned in my February column—and none of the ones I'll mention here—require anything more than an 80386 system because the subset of instructions that compilers typically use was standardized at least 12 years ago.

Common Instructions

Instructions INC value, DEC value
Purpose Increments or decrements integer value by 1
Example
INC ESI
INC [EBP-8]
DEC [EAX+4]
      The INC and DEC instructions are used to increment and decrement values kept in memory or registers. As you might imagine, these instructions map precisely to the ++ and -- operators in C++ for standard integer operations.
      You could use the ADD or SUB instructions to achieve the same effect as INC and DEC, although it would be more expensive in terms of size. Since they are so commonly used, the smallest versions of the INC/DEC instructions take only a single byte. Looking at the Intel opcode map, you'll see that there's an opcode for each of the eight general-purpose registers that INC can be used against (EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP). Another eight opcodes are used for the DEC instruction and the same set of registers.
Instructions MUL value, value; DIV value, value
Purpose Multiplication and division
Example
 MUL EAX,EDX
 MUL AL,BYTE PTR [EBP-14h]
 DIV EAX,EBX
      I didn't cover the ADD and SUB instructions in my February column since their operation is straightforward. However, the MUL and DIV instructions have some quirks that make them difficult to read and downright awkward to write. Throughout this column, when I mention (E)AX, I'm referring to AL, AX, or EAX. Likewise, when I mention (E)DX, I'm referring to DX or EDX (for byte-sized operands, AH plays that role instead).
      Both MUL and DIV treat their operands as unsigned values. The operands can't be immediate values (such as 3); rather, they must be in registers or memory. You may have noticed that the destination value (the first argument) always seems to be (E)AX. This is by design. The use of the (E)AX register is an implicit part of the instruction. Beyond the implicit use of (E)AX, the (E)DX register is also silently involved. The high bits of the MUL result end up in (E)DX. Likewise, DIV takes its dividend from the combined (E)DX:(E)AX pair; afterward, (E)DX holds the remainder and (E)AX holds the quotient.
      If you write any assembler code, MUL and DIV get even weirder. The assembler (both MASM and the Visual C++® inline assembler) won't let you specify the (E)AX operand. Thus, if you want the instruction MUL EAX,ECX, you would write MUL ECX—just another example of the intuitive language syntax that's made assembly language wildly popular in recent years.
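To make the implicit register use concrete, here is a minimal C++ sketch that models the 32-bit forms of MUL and DIV. The names RegPair, mul32, and div32 are made up for illustration; they only mimic the arithmetic and are not how any compiler or CPU implements the instructions.

```cpp
#include <cstdint>

// The EDX:EAX register pair as two 32-bit halves.
struct RegPair {
    std::uint32_t edx;
    std::uint32_t eax;
};

// Models MUL src: EDX:EAX = EAX * src, with every value treated as unsigned.
RegPair mul32(std::uint32_t eax, std::uint32_t src) {
    std::uint64_t product = static_cast<std::uint64_t>(eax) * src;
    return { static_cast<std::uint32_t>(product >> 32),   // high bits land in EDX
             static_cast<std::uint32_t>(product) };       // low bits land in EAX
}

// Models DIV src: divides EDX:EAX by src; quotient goes to EAX, remainder to EDX.
// The real instruction faults when src is 0 or the quotient needs more than
// 32 bits. This sketch assumes the caller avoids both cases.
RegPair div32(std::uint32_t edx, std::uint32_t eax, std::uint32_t src) {
    std::uint64_t dividend = (static_cast<std::uint64_t>(edx) << 32) | eax;
    return { static_cast<std::uint32_t>(dividend % src),  // remainder in EDX
             static_cast<std::uint32_t>(dividend / src) };// quotient in EAX
}
```

For example, mul32(0xFFFFFFFF, 2) leaves 1 in the edx field and 0xFFFFFFFE in the eax field, which is why code that cares about overflow inspects EDX after a MUL.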
Instructions IMUL value, value; IDIV value, value
Purpose Signed multiplication and division
Example
 IMUL WORD PTR [EBP+8]
 IMUL EDX,ECX,8
 IDIV EAX,DWORD PTR [EDX]
      The IMUL and IDIV instructions treat the operands as signed values. Contrast this to MUL and DIV, which work on unsigned values. IDIV uses (E)AX as the implicit first operand, just as DIV does. Also, like its DIV counterpart, IDIV only works with register or memory values. IMUL, on the other hand, doesn't fit the general patterns of MUL, DIV, and IDIV. It can work with immediate values and it can have a non-(E)AX register as the destination. There's even a form of the IMUL instruction that takes three operands, a distinction it shares with very few instructions in the Intel opcode set (SHLD and SHRD, described below, are the others you're likely to meet).
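The signed versus unsigned distinction is easiest to see when the same bit pattern is fed to both flavors. The following C++ sketch uses hypothetical helper names (mul_high, imul_high, idiv32) and assumes the 32-bit forms of the instructions; it models only the arithmetic.

```cpp
#include <cstdint>

// High 32 bits (what would land in EDX) of an unsigned, MUL-style multiply.
std::uint32_t mul_high(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) >> 32);
}

// High 32 bits (as a raw bit pattern) of a signed, one-operand IMUL-style multiply.
std::uint32_t imul_high(std::int32_t a, std::int32_t b) {
    std::int64_t product = static_cast<std::int64_t>(a) * b;
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(product) >> 32);
}

struct QuotRem {
    std::int32_t quotient;   // ends up in EAX
    std::int32_t remainder;  // ends up in EDX
};

// Models IDIV on a sign-extended 32-bit dividend. Like IDIV, C++ integer
// division truncates toward zero, so the remainder takes the dividend's sign.
// Assumes divisor != 0 and that the quotient fits in 32 bits.
QuotRem idiv32(std::int32_t dividend, std::int32_t divisor) {
    return { dividend / divisor, dividend % divisor };
}
```

The bit pattern 0xFFFFFFFF is 4,294,967,295 to MUL but -1 to IMUL, so multiplying it by 2 gives a high half of 1 in the unsigned case and 0xFFFFFFFF (the sign extension of -2) in the signed case.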
Instructions PUSHAD, POPAD
Purpose Saves or restores all general-purpose registers via the stack
      PUSHAD pushes EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI onto the stack, in that order; POPAD pops them in the reverse order (skipping the saved ESP value rather than loading it). These instructions are used in situations where many registers may be modified and the programmer wants the code to leave no trace of its execution in the registers. Although interrupt handlers are passé for most programmers, they're a perfect example of where PUSHAD and POPAD come in handy. Besides taking fewer bytes than eight individual PUSH instructions, they also execute faster (five clock cycles on a Pentium).
Instructions PUSHFD, POPFD
Purpose Push or pop the EFLAGS register
      In some cases, it's inconvenient to use the flags set by a prior operation immediately. Alternatively, you may want to make sure that some operation you're about to execute won't change the current flag values. For these situations, PUSHFD and POPFD are the easiest methods to save and restore those bits.
      PUSHFD is one of the atomic components of an interrupt. When an interrupt or an exception occurs, the following code effectively executes:
PUSHFD, PUSH CS, PUSH EIP. 
Following the three pushes, the EIP register changes to the interrupt handler address contained in the appropriate slot in the Interrupt Descriptor Table (IDT). Likewise, the IRETD effectively does a POPFD as part of returning from an interrupt.
Instructions SHL, SHR, SHLD, SHRD
Purpose Shift bits to the left or right
Example
 SHL EBX,3
 SHR EBX,CL
 SHLD EDX,ECX,4
 SHRD ESI,EDI,CL
      The SHL and SHR instructions are logically equivalent to the C++ << and >> operators. Many of you probably recall that bitwise shifting is a quick way to perform multiplication and division by powers of 2. For example, the SHL EBX,3 instruction has the same effect as multiplying EBX by 8 (2^3 == 8). Indeed, if you write C++ code that multiplies or divides an unsigned value by 2, 4, 8, 16, and so on, it will most likely compile to a SHL or SHR instruction.
      When shifting left, the low-order bits are filled with zeroes. The final high-order bit that's "shifted out" is moved to the carry flag (CF). In other words, the carry flag is like a virtual 33rd bit. When shifting right, the high-order bits are filled with zeroes, and the last bit shifted out moves to the carry flag.
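The "virtual 33rd bit" behavior can be sketched in C++ as follows. ShiftResult, shl32, and shr32 are illustrative names, and the sketch assumes shift counts from 1 through 31 so that it stays within what C++ defines for 32-bit shifts.

```cpp
#include <cstdint>

struct ShiftResult {
    std::uint32_t value;  // register contents after the shift
    bool carry;           // last bit shifted out, as the carry flag would hold it
};

// Models SHL reg,count for counts 1..31: zeroes enter at the low end.
ShiftResult shl32(std::uint32_t value, unsigned count) {
    bool carry = ((value >> (32 - count)) & 1u) != 0;
    return { value << count, carry };
}

// Models SHR reg,count for counts 1..31: zeroes enter at the high end.
ShiftResult shr32(std::uint32_t value, unsigned count) {
    bool carry = ((value >> (count - 1)) & 1u) != 0;
    return { value >> count, carry };
}
```

With these helpers, shl32(5, 3) yields 40 with the carry clear, while shl32(0x20000001, 3) yields 8 with the carry set, because bit 29 was the last bit pushed off the top.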
Instruction ADD [EAX],AL
Purpose None
      You may see a lot of this particular instruction, and you'll probably see it repeated. However, ADD [EAX],AL has no special significance. The opcode bytes for this instruction are 00 00. In other words, it's what you'll see if you're viewing a series of data bytes that all contain the value 0. Nothing to see here. You can all go home now.
Instruction CLD
Purpose Clears the direction flag
      In my February 1998 column, I described the string instructions LODSx, SCASx, STOSx, and MOVSx. Each of these instructions uses the ESI or EDI register to point at the memory to be read or written to. These instructions are typically used in conjunction with the REP, REPE, or REPNE prefixes, which cause the string instruction to execute several times until some specific condition is met.
      After each REPx-induced iteration, the CPU changes the ESI or EDI register to point to an adjacent memory location. The direction in which the registers move is given by the direction flag. If the direction flag is clear, ESI or EDI is incremented after each instruction (thus causing the next higher memory location to be referenced in the next iteration). When the direction flag is set, ESI or EDI decrements after each iteration.
      Most of the time it's easiest to work moving forward in memory (toward higher addresses) so that the direction flag is usually clear. However, it's generally not safe to assume that the flag is clear. Thus, you'll often see the CLD instruction somewhere before a string operation such as REP MOVSB.
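Why the direction flag matters shows up most clearly with overlapping copies. The sketch below models REP MOVSB over a std::string standing in for memory; the function name rep_movsb and the use of indexes in place of real ESI/EDI pointers are simplifications made for illustration.

```cpp
#include <cstddef>
#include <string>

// Models REP MOVSB. esi and edi are indexes into `memory`, ecx is the
// repeat count, and direction_flag selects decrement (true) or increment (false).
// Returns the modified copy of memory.
std::string rep_movsb(std::string memory, std::size_t esi, std::size_t edi,
                      std::size_t ecx, bool direction_flag) {
    while (ecx != 0) {
        memory[edi] = memory[esi];              // MOVSB: copy one byte
        if (direction_flag) { --esi; --edi; }   // flag set: walk downward
        else                { ++esi; ++edi; }   // flag clear: walk upward
        --ecx;
    }
    return memory;
}
```

Shifting the first four bytes of "abcdef" up by one position works when the copy starts at the high end with the flag set (giving "aabcdf"), but the same move with the flag clear smears the first byte across the range (giving "aaaaaf"). Code that silently inherits the wrong flag value gets the second result.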
Instructions NOT value, NEG value
Purpose Negation of values
Example
 NOT DWORD PTR [EBP-8]
 NEG EDX
      The NOT instruction does ones-complement negation. That is, it applies the NOT operation to each bit in the operand. An initial value of 0 will become 0xFFFFFFFF after a NOT instruction. The C++ ~ operator is typically implemented via the NOT instruction.
      The NEG instruction does twos-complement negation. (If you're not 100 percent up on ones versus twos-complement negation, don't feel bad. I learned this stuff 10 years ago in college, and I've completely forgotten it!) An easier way to think of the NEG instruction is that it puts a - sign in front of the value. Thus, using NEG on -3 yields 3, while NEG applied to 4 yields -4. To summarize, you can think of NOT as affecting individual bits, while NEG operates on the entire value.
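For readers who want the refresher after all, the relationship between the two instructions fits in a couple of lines of C++. The helper names not32 and neg32 are invented for this sketch.

```cpp
#include <cstdint>

// NOT: flip every bit (ones-complement).
std::uint32_t not32(std::uint32_t value) {
    return ~value;
}

// NEG: twos-complement negation, which is the same as flipping every bit
// and then adding 1.
std::uint32_t neg32(std::uint32_t value) {
    return ~value + 1u;
}
```

Applied to 4, neg32 returns 0xFFFFFFFC, which is the 32-bit pattern for -4; applied to 0xFFFFFFFD (the pattern for -3), it returns 3.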
Instruction NOP
Purpose No operation
      The NOP instruction does nothing and affects nothing. It's a single-byte opcode that executes in one clock cycle and is primarily used to pad code. For example, a compiler might want the beginning of a procedure to start on a 16-byte boundary. The compiler/linker would insert enough NOP instructions between the end of one procedure and the beginning of the next procedure to create the desired alignment.
      If you're confident in your assembler abilities, the NOP instruction can be applied to code in memory or in the executable file. You might know that some instruction you're about to execute will cause a fault in a debugger. If you want to skip that instruction, use the debugger to write enough NOP opcodes (0x90) to eliminate the instruction. This is useful to squash hardcoded INT 3 breakpoint instructions while you're running under the debugger, effectively not stopping at the breakpoint. Really advanced users can implement NOP instructions to obliterate entire regions of code in an executable. (Warning! Harder than it looks.)
      Another advanced use of the NOP instruction is when you want to make it easy to patch or hook into your code. At the beginning of a procedure or block of code, put in enough NOP instructions for the desired goal. Subsequent patching or hooking code can write JMPs, CALLs, or whatever into the NOP area.
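The patching idea can be sketched against an ordinary byte buffer. The function nop_out below is hypothetical and works on a std::vector holding a copy of the code bytes; writing to live code in another process involves debugger or operating system facilities that this sketch deliberately leaves out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint8_t kNopOpcode = 0x90;  // single-byte NOP

// Overwrites `length` bytes starting at `offset` with NOPs. Returns false,
// changing nothing, if the range falls outside the buffer. The caller is
// responsible for making `length` cover whole instructions; blanking only
// part of an instruction leaves garbage behind.
bool nop_out(std::vector<std::uint8_t>& code, std::size_t offset, std::size_t length) {
    if (offset > code.size() || length > code.size() - offset) {
        return false;
    }
    for (std::size_t i = 0; i < length; ++i) {
        code[offset + i] = kNopOpcode;
    }
    return true;
}
```

Given the bytes 55 CC C3 (PUSH EBP, a hardcoded INT 3, RET), calling nop_out(code, 1, 1) turns the breakpoint into a NOP and leaves the neighbors alone.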
Instruction INT 3
Purpose Debugger interrupt
      INT 3 has two uses—one intended by the original CPU designers, the other accidental. The INT 3 instruction is the standard method to suspend a program and transfer control to a debugger. In normal use, programs don't include INT 3 instructions in their code. Rather, when you set a traditional breakpoint with a debugger, it temporarily overwrites the target instruction with an INT 3 instruction. (The LODPRF32 program from my July 1995 column illustrates this.) Note that an INT 3 instruction is the heart of the DebugBreak API for Intel CPUs.
      The other offbeat use of the INT 3 instruction is as a paranoid NOP. In those cases where a NOP would be used for padding (and theoretically never executed), an INT 3 can be used instead. Like NOP, an INT 3 instruction is only a single byte. The key difference is that if a bug crept in and you executed the INT 3 instruction, you'd pop into the debugger. In the same scenario, the CPU would blithely sail through NOP instructions and wreak havoc someplace farther away from the original error.
      The Microsoft® linker uses INT 3s as paranoid NOPs when creating padding for incremental linking. The linker also uses them as padding between procedures it wants to align on a particular memory boundary. Usually this alignment is on a multiple of 16 bytes unless you have the "optimize for size" compiler option set. Figure 1 shows a section of code from CALC.EXE that illustrates INT 3 padding in action.
Instruction LOCK
Purpose This instruction locks the memory bus during the next instruction
Example
LOCK INC DWORD PTR [EDX+04] 
      Technically speaking, LOCK is an instruction prefix rather than an instruction in its own right. In a multiprocessor environment, multiple processors could access the same memory location at the same time. The LOCK prefix ensures that the instruction associated with it will have exclusive access to the destination memory location.
      If you've ever examined the EnterCriticalSection API, you'll have seen that if the critical section isn't currently held, the code essentially just increments a counter. A LOCK prefix is used with an INC instruction to guarantee that one thread won't increment the counter while another thread on another CPU is reading it. You'll also see the LOCK prefix used with multiprocessor synchronization APIs such as InterlockedExchange and InterlockedIncrement.
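In portable C++ (C++11 and later, so well after this column's vintage), the same guarantee is requested through std::atomic. The function below is a hypothetical stand-in that mirrors the return-the-new-value behavior of InterlockedIncrement; on x86, compilers typically implement fetch_add with a LOCK-prefixed instruction, though the exact instruction chosen is up to the compiler.

```cpp
#include <atomic>

// Atomically adds 1 to the counter and returns the new value.
// fetch_add returns the value held before the addition, hence the + 1.
long interlocked_increment(std::atomic<long>& counter) {
    return counter.fetch_add(1) + 1;
}
```

A plain ++ on an ordinary long offers no such guarantee: two CPUs can both read the old value and both write back the same incremented result.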
      A final thought on the LOCK prefix: you may recall a bug on older Pentium CPUs where a particular instruction sequence could cause the CPU to freeze up. (See the February 1998 Editor's Note if you need a refresher.) That instruction sequence isn't a valid sequence, and the LOCK prefix plays a vital role in the ensuing CPU meltdown.

Common Instruction Sequences

Sequence CMP register_X, immediate_value_A
                  JE XXXXXXXX
                  CMP register_X, immediate_value_B
                  JE XXXXXXXX
Purpose C++ switch statement
Example
 CMP EAX,1
 JE  00400248
 CMP EAX,3
 JE  0040026E
 CMP EAX,7
 JE  004002A0
      This sequence (compare and JMP if equal) is the most straightforward encoding of a C++ switch statement that I've seen. It's also very easy to pick out when you encounter it in a debugger. In the example code, the switch statement would look something like this:

 switch ( value )
 {
     case 1: // code for case 1
     case 3: // code for case 3
     case 7: // code for case 7
 }
      The trick to understanding this code sequence is realizing that compiler-generated code for switch statements usually differs from your mental model. The code for all the case comparisons is usually generated in one place. Following the value comparison code are discrete blobs of code that implement the code specified for a particular case. The value comparison code is optimized to quickly figure out just which case blob to jump to.
      By no means is this sequence the only encoding for switch statements. More efficient encodings may involve JMP tables or subtractive countdowns using the zero flag. However, these encodings definitely don't fit into my criteria of "just enough to get by."
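The "comparisons in one place, case blobs afterward" layout can be imitated in C++ with goto. Both functions below are illustrative only, and the return values 100, 300, and 700 are placeholders for whatever each case would really do.

```cpp
// The switch as it appears in source.
int dispatch_switch(int value) {
    switch (value) {
        case 1: return 100;
        case 3: return 300;
        case 7: return 700;
    }
    return 0;  // no case matched
}

// The same logic arranged the way the generated code tends to look:
// a block of compares and jumps, followed by the per-case code.
int dispatch_compare_chain(int value) {
    if (value == 1) goto case_1;   // CMP EAX,1 / JE ...
    if (value == 3) goto case_3;   // CMP EAX,3 / JE ...
    if (value == 7) goto case_7;   // CMP EAX,7 / JE ...
    return 0;                      // fell through every comparison

case_1: return 100;
case_3: return 300;
case_7: return 700;
}
```

Both functions return the same result for every input; the second simply makes the jump targets visible, which is how the sequence reads in a disassembly window.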
Sequence opcode [register+offset]
Purpose Structure member access
Example
 PUSH [EAX+157C]
 MOV  EAX,[ESI+34]
 ADD  [EAX+44],ESI
      Here's a common scenario: you have a pointer to a structure or class instance with which you read, write, or otherwise manipulate some field. In this situation, the compiler typically puts the pointer value into a register. The offset of the specified field within the structure is then added to the register. For instance, consider this structure:

 struct Foo {
     int     i;
     short   j;
     char    k;
 };
If you had a pointer to an instance of this structure and wanted to add 2 to each structure member, the code would look something like this (assuming ESI points to the structure instance):

 ADD DWORD PTR [ESI],2   ;; Foo.i
 ADD WORD PTR  [ESI+4],2 ;; Foo.j
 ADD BYTE PTR  [ESI+6],2 ;; Foo.k
Note that for the first structure field (i), the field offset is 0, so no addition is needed. The i field is 4 bytes long, placing the next field (j) at offset 4. The j field is a short, so it's only two bytes long. The final field (k) is at offset 6, which I arrived at by adding 4 and 2.
      Compilers must place structure fields into memory locations in exactly the same sequence as the structure is declared. Thus, you can usually look at any structure or class definition and figure out the offsets of various fields. Be aware that compilers often place padding between structure fields so that each field starts at some natural boundary (typically 4 or 8 bytes). Using #pragma pack lets you specify the exact padding (or lack thereof) in your structure definitions.
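Rather than working out offsets by hand, you can ask the compiler with the standard offsetof macro. This sketch assumes the usual Win32 sizes (4-byte int, 2-byte short) and default packing; the constant names are made up for the example.

```cpp
#include <cstddef>

struct Foo {
    int   i;
    short j;
    char  k;
};

// Offsets exactly as the compiler lays the structure out.
const std::size_t kOffsetI = offsetof(Foo, i);  // 0 under the stated assumptions
const std::size_t kOffsetJ = offsetof(Foo, j);  // 4
const std::size_t kOffsetK = offsetof(Foo, k);  // 6
```

Under those assumptions, sizeof(Foo) typically comes out as 8 rather than 7, because the compiler adds one byte of padding after k so that the next Foo in an array starts on a 4-byte boundary.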
Sequence MOV value,EAX, many times in a row
Purpose Serial initialization of several variables to the same value
Example
 MOV EAX,0
 MOV [EBP-4],EAX
 MOV [EBP-10],EAX
 MOV [EBP-18],EAX
      When a collection of variables is assigned the same value, the compiler may load the value into a register and copy the register into each of the variables. For example, at the beginning of a function you might initialize several int variables to the value 0. The example code sequence shows one way this might be encoded.
Sequence CMP register_X,01
                  SBB register_X, register_X
                  NEG register_X
Purpose Converts 0 input value to 1, all other values to 0
Example
 CMP EAX,01
 SBB EAX,EAX
 NEG EAX
      In many cases, generated code needs to inspect a value to determine if it's 0. If so, the result of the inspection should be nonzero (typically 1). If the input value is any value other than 0, the result should be 0. Using 0 to mean Boolean FALSE, and everything else being TRUE, this instruction sequence does a logical NOT of the input value.
      The code comprising this instruction sequence certainly isn't intuitive. Its distinctive characteristic is the use of the SBB instruction (integer subtraction with borrow). SBB is rarely used outside of this sequence.
      The first instruction (CMP) sets or clears the carry flag as appropriate. SBB then uses the carry flag as part of its subtraction. Since the two arguments to SBB in this sequence are always the same, the carry flag alone determines the outcome (which is always 0 or -1). The NEG instruction finishes up by changing a -1 to a 1 and leaving 0 values alone.
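Walking through the three instructions one at a time in C++ may help. The function name logical_not_via_sbb is invented for this sketch, and the carry flag is modeled as an ordinary variable.

```cpp
#include <cstdint>

std::uint32_t logical_not_via_sbb(std::uint32_t eax) {
    // CMP EAX,1: subtracts 1 for the flags only. The unsigned subtraction
    // borrows (sets the carry flag) only when EAX is 0.
    std::uint32_t carry = (eax < 1u) ? 1u : 0u;

    // SBB EAX,EAX: EAX = EAX - EAX - carry, so the result is 0 or 0xFFFFFFFF (-1).
    eax = eax - eax - carry;

    // NEG EAX: -1 becomes 1, and 0 stays 0.
    eax = 0u - eax;
    return eax;
}
```

An input of 0 comes out as 1, and any nonzero input (1, 5, 0xFFFFFFFF) comes out as 0, matching the C++ ! operator without a single conditional jump in the generated code.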

Oops! How did I Get Here?
      Let's examine some of the common clues you can look for when something faults and you're rudely popped into the debugger. Think of this as a first aid quick reference. You won't find instructions on surgery here, but the common cuts and scrapes can be dealt with.
      Picture this scenario: everything is working fine until suddenly your program stops in the debugger because of a fault, and none of the code looks familiar. Never fear. The faulting address usually yields some sort of information that steers you toward a resolution.
      One of the more common and easy to find bugs is calling through a NULL function pointer. The signature characteristic of this bug is that the instruction pointer (EIP) is 0 or very close to 0.
      Under Windows NT®, the first 64KB of the address space is off limits, so the fault occurs exactly at address 0. In Windows® 95, it's slightly more tricky. Memory at address 0 is accessible, but it's certainly not code. In this case, the faulting address may or may not be 0. However, the faulting address will almost certainly be just a little bit higher (for example, 0x00000003). When this happens, the CPU miraculously manages to execute one or two "instructions" before it hits something that triggered a fault.
      Regardless of where you faulted, the vital information you need to know is: where were you executing before the NULL function pointer was called? In these situations the stack window may not be helpful, since the calling routine almost certainly won't appear in the stack window. This is a by-product of the way call stacks are walked. (See my May 1997 column for details on stack walking.)
      Luckily, when a NULL pointer call happens, there is a way to see where you came from. A CALL instruction pushes a return address on the stack. If you can find this return address, you can change the code window to display at that location. To find the return address, use the data window to display memory starting at the ESP value. Make sure that the memory is being displayed in the DWORD format. The first DWORD at ESP is most likely the return address. Remember, the return address you obtain will be for the instruction after the bad CALL instruction. You'll need to back up in the code window to see the code that led up to the CALL.
      In Figure 2, I've shown a NULL function pointer fault in the Visual Studio debugger. In the register window, the ESP value is 0x12FF7C. This is the same address that I've set the data pane to display in DWORD format. The left column is the memory address, so the second value on the top line (0x00401009) is the first DWORD at ESP: the return address.

Figure 2 A NULL Pointer Fault

      Incidentally, if the DWORD at ESP doesn't turn out to be a valid return address, it's certainly worth your while to look further up on the stack for values that look like they could be return addresses. If something looks like a valid address, change the code window to display at that address and see if you can make sense of it. If your ESP register is bogus, try looking for return addresses at positive offsets from the EBP register. Remember, this isn't an exact science. You're sifting through the rubble, looking for something that will give you a clue as to where you'll start doing more in-depth investigation.
      Moving away from NULL function pointers, let's say you've faulted in some code that you don't recognize, but the faulting address is nowhere near 0. What's worse, the code looks like garbage. In other words, it doesn't look like the normal instructions you'd see. Instead, you see instructions such as ARPL, AAA, and OUTSB. There are two likely ways your code got there. First, you may have called through a corrupted function pointer. Second, you may have corrupted the return address on the stack. When the RET instruction executed, control transferred to the bogus address.
      In either situation, the underlying problem is valid code addresses that were overwritten with garbage. In this case, your chance of getting a valid return address is lessened. However, you may be able to get an idea of what happened by looking at the faulting address. Try interpreting the fault address as a stream of data—you may find a pattern.
      Figure 3 shows the code for a small Hello World program with a big bug. The szBuffer array is only four characters wide, while the strcpy function copies all 13 bytes of "Hello World!" (12 characters plus the terminating null). This buffer overrun overwrites the part of the stack frame where function main's return address is stored. When I run the program, it correctly prints out "Hello World!" but then faults at address 0x21646C72.
      The faulting address yields a clue if you think of the address as a pattern of bytes. In memory, 0x21646C72 is stored as four sequential bytes: 0x72, 0x6C, 0x64, and 0x21. Note that each of these values is above 0x20, and below 0x80. That happens to be the range of printable ASCII characters. Looking up the four bytes in an ASCII table, you get

 0x72 = 'r'
 0x6C = 'l'
 0x64 = 'd'
 0x21 = '!'
      As you can see, those four bytes form the end of the string "Hello World!" You could then search your code for places where rld! appears. While not a perfect answer, you'll have substantially narrowed down the places to begin an initial search for the problem. Admittedly, this is a contrived example and there are tools available that find these types of memory overwrites. Nonetheless, I've found many obnoxiously difficult bugs only because I noticed a familiar pattern in the corrupted data.
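The byte-by-byte lookup is easy to automate. The helper below, with the made-up name address_as_text, splits a 32-bit address into its four bytes in little-endian memory order and shows a '.' for anything outside the printable ASCII range.

```cpp
#include <cstdint>
#include <string>

// Returns the four bytes of `address`, lowest byte first (the order in which
// an Intel CPU stores them in memory), rendered as text.
std::string address_as_text(std::uint32_t address) {
    std::string text;
    for (int i = 0; i < 4; ++i) {
        unsigned char byte = static_cast<unsigned char>((address >> (8 * i)) & 0xFFu);
        bool printable = (byte >= 0x20 && byte < 0x7F);
        text += printable ? static_cast<char>(byte) : '.';
    }
    return text;
}
```

Feeding it the faulting address 0x21646C72 produces "rld!", while a genuine code address such as 0x00401009 produces mostly dots, which is a quick hint as to whether you're looking at text that landed where an address should be.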

Other Common Causes of Faults
      Common sources of faults are the string instructions shown in Figure 4. Usually string instructions were either given bad data to start with or they operated past their intended range of memory. Remember, these string instructions implicitly use ESI, EDI, or both registers. They're almost always used with a REP, REPE, or REPNE prefix, which causes the instruction to execute multiple times with the registers incrementing or decrementing after each iteration.
      Tracking down the core cause of a fault from one of these string instructions is almost always trivial. Figure 4 shows which registers the instructions use. Regardless of the particular instruction in the group, the registers are pointer values. It's immediately noticeable if a NULL pointer is the culprit. For example, if the faulting instruction is REP STOSB and you see that EDI is 0, you know that the CPU was trying to write using a NULL pointer.
      If the registers in question aren't 0, check if their value is a multiple of 4KB—the size of a page on Intel CPUs. It's entirely possible that the instruction has executed successfully a number of times until the ESI or EDI register pointed to a page of memory that's not accessible. An easy way to know if you're on a page boundary is to look at the bottom three digits of the hex address. If they are 000, you're on a page boundary.
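The "bottom three hex digits" check is a simple mask, sketched here with hypothetical helper names and the 4KB page size mentioned above.

```cpp
#include <cstdint>

const std::uint32_t kPageSize = 0x1000;  // 4KB pages on Intel CPUs

// True when the low 12 bits (the bottom three hex digits) are all zero.
bool is_page_boundary(std::uint32_t address) {
    return (address & (kPageSize - 1)) == 0;
}

// Start address of the page that contains `address`.
std::uint32_t page_base(std::uint32_t address) {
    return address & ~(kPageSize - 1);
}
```

An EDI value of 0x00543000 sits exactly on a page boundary, whereas 0x0012FF7C does not; its page starts at 0x0012F000.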
      You can double-check this invalid memory diagnosis by trying to display memory at the value of ESI or EDI. If the debugger can't see it, your code can't either. I'm assuming you're using an application debugger such as Visual Studio. If you're using a system-level debugger, this may not be true since the memory may only be visible from kernel mode. On the other hand, if you're using a system-level debugger, you probably already know how to track down this kind of problem.
      If you use recursive functions (or just lots of stack space), stack faults might plague you. Unfortunately, the operating system and debugger don't go out of their way to clarify that it's a stack overflow problem. For example, Figure 5 shows a very simple program that recurses until it runs out of stack space and faults. Figure 6 shows the none too helpful fault dialog that results.

Figure 6 An Unhelpful Fault Dialog

      If you select Cancel to debug, the Visual Studio debugger briefly tells you that a stack overflow occurred, but not at the same time as it shows you the faulting instruction. However, there are clues you can infer from the debugger that would indicate a stack overflow. For starters, the ESP register value is probably on a 4KB boundary. Likewise, the faulting instruction is probably a PUSH. There are other ways to cause a stack fault, but most of the time it will look something like what I've described.
      While I'm on the subject of the stack, my final tidbit this month is on problems caused by PUSHing or POPing too much data to or from the stack. When whole programs were written in assembly language, programmers spent a lot of time matching up every PUSH instruction with an equivalent POP or ADD ESP,XX instruction. However, since compilers are so widespread, this tedious process isn't normally necessary.
      Believe it or not, it's still sometimes necessary to verify that what's pushed on the stack eventually gets removed. For example, if the code for calling a __stdcall function places two DWORD values on the stack, the called function should end with a RET 8 instruction. Likewise, if you see a __cdecl function being called with three DWORD parameters, there should be an ADD ESP,0Ch instruction following the call. More importantly, the called function should return with a simple RET instruction. If you're not familiar with __cdecl versus __stdcall functions, see my February 1998 column.
      These kinds of stack parameter mismatch problems can be minimized by following a few simple rules. First, make sure that there's only one prototype for any given function. Put that prototype in a .H file, never in a .C or .CPP file. Finally, make sure that the source file that actually defines the function includes the .H file. If you follow all these steps, you'll get a compiler or linker error rather than a bogus program.
      I've seen programmers cheat by including prototypes for just one or two functions in their code modules. (You know who you are!) These functions have a prototype in a .H file, but the programmer doesn't want to incur the overhead of bringing in a whole .H file for just a few items. Inevitably something changes and the programmer ends up counting PUSHes, POPs, and ADD ESPs because the code crashes.

Wrap-up
      I use "DUMPBIN /DISASM filename.obj" to look at the code generated by the C++ compiler. However, Paul DiLascia (my fellow MSJ columnist) mentioned that Visual C++ has a compiler switch, /FAs, that produces an .ASM file from the input C++ code. The .ASM file that is generated contains all the necessary blood and guts that go along with hardcore assembler programming. Although you may never need to program in assembler, it's always enlightening to see what your tools are doing under the hood.

Have a question about programming in Windows? Send it to Matt at mpietrek@tiac.com.

From the June 1998 issue of Microsoft Systems Journal.