by Mark Russinovich
Reprinted from Windows NT® Magazine
Computers never seem to have enough memory, no matter how much memory is installed. One of the most complex and difficult tasks Windows NT faces is managing the limited physical memory present on a computer. This challenge is heightened by the fact that NT must divide physical memory among many processes that might be running simultaneously, giving each process an appropriate memory share. Further, NT must be able to adjust its behavior within a wide range of memory sizes, from as little as 16MB to as much as 1GB or more.
This month I'll introduce the concept of virtual memory, which is based on hardware-supported memory paging. This column will lay a foundation for understanding how NT defines process address spaces. I'll discuss how NT allocates virtual-memory addresses and the internal data structures that record allocation. I'll also describe two powerful features of NT memory management: memory sharing and copy-on-write. Next month, I'll describe more internal data structures, how NT implements shared memory, and working set tuning.
Virtual Memory
As do almost all modern operating systems (OSs), including Windows 3.x, Windows 95, Windows 98, and UNIX, NT relies on hardware support to provide processes with the illusion that a computer has more memory than it actually has. The mechanism that implements this illusion is known as paged virtual memory. The Memory Manager (or Virtual Memory Manager) executive subsystem is the NT component responsible for NT's paged virtual memory and for exporting functions that other subsystems and applications can use to interact with it. Processes access memory through their virtual memory address space. In NT, the size of a process' address space is defined by the number of bytes that can be described in 32 bits, which is 2^32, or 4GB (64-bit NT, which will be available on Alpha and Intel Merced processors, will have 2^64 bytes of addressability). Thus, barring other resource limitations, a process can theoretically allocate 4GB of code and data. However, most computers today have less than 128MB of physical memory. The 4GB address space of a process is therefore known as virtual memory, because it doesn't directly correspond to physical memory.
Before I describe how a computer's OS and hardware collude to trick processes into thinking they have 4GB of memory, I'll review memory address translation. The memory management unit (MMU) on Alpha and x86 processors manages physical and virtual memory in blocks called pages. On x86 processors, the size of a page is 4KB, whereas Alpha processors maintain 8KB pages. A computer's physical memory pages are known as page frames, which the processor numbers consecutively with page frame numbers (PFNs).
When a process references its virtual memory, it does so with a 32-bit pointer, and the MMU's job is to determine the location in physical memory to which the virtual address maps. The MMU determines this location by dividing the address into three separate indexes, as Figure 1 shows: page directory index, page table index, and page byte offset. The MMU uses each index in a three-step address resolution process.
Figure 1
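As a concrete illustration, here is a minimal C sketch that pulls the three x86 fields out of a 32-bit address. The 10/10/12 bit widths follow from the 4KB page size and 1,024-entry tables discussed here; the sample address is arbitrary.

#include <stdio.h>
#include <stdint.h>

/* x86 4KB paging: 10-bit directory index, 10-bit table index, 12-bit offset. */
#define PDI(va)     (((va) >> 22) & 0x3FF)   /* page directory index */
#define PTI(va)     (((va) >> 12) & 0x3FF)   /* page table index     */
#define OFFSET(va)  ((va) & 0xFFF)           /* byte offset in page  */

int main(void)
{
    uint32_t va = 0x7FFDE123;   /* an arbitrary user-mode virtual address */

    printf("directory index: %u\n", (unsigned)PDI(va));
    printf("table index:     %u\n", (unsigned)PTI(va));
    printf("byte offset:     0x%03X\n", (unsigned)OFFSET(va));
    return 0;
}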
The first step begins with the MMU locating a table known as the page directory. The MMU locates the page directory by reading the contents of a special hardware processor register. On the x86 this register is the Control Register 3 (CR3), and on the Alpha the register is the Page Directory Register (PDR). Each process has its own private page directory, and the address of that directory is stored in the process' control block data structure. Whenever the scheduler switches from one process to another, NT updates the page directory register with the value stored in the control block of the process that takes over the CPU. The first step in the resolution process continues with the MMU using the page directory index from the virtual address to index into the page directory. From the entry at that index, the MMU retrieves the page frame number of another table, the page table. Entries in a page directory are called page directory entries (PDEs) and are 32 bits in size.
In the second step of the resolution process, the MMU uses the page table index from the virtual address to index into the page table it located in the first step. The entry the MMU finds at the page table index identifies a page frame in physical memory. Finally, in the third step, the MMU uses the page byte offset as an index into that physical page and isolates the data the process wants to reference. Entries in a page table are called page table entries (PTEs). Like a PDE, a PTE is 32 bits wide.
You might wonder why the MMU goes through this three-step gyration. It does so to save memory. Consider a one-step translation in which a virtual address is divided into two components, one locating a PTE in a page table and the other serving as a page offset. To map a 4GB address space, the page table would have to consist of 1,048,576 entries. Four bytes (32 bits equals four 8-bit bytes) per entry would mean that the page table would require 4MB of physical memory to map each process' address space. With a two-level scheme such as the one the x86 and Alpha use, only a process' page directory must be fully defined—page tables are defined only as necessary. Thus, if the majority of a process' 4GB address space is unallocated, a significant saving in memory results because page tables are not then allocated to define the unused spaces.
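The following simplified C model (purely illustrative; its structures are not NT's actual formats) shows how a two-level walk skips unallocated regions: directory slots for untouched 4MB regions stay NULL, so no page table is ever built for them.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical software model of a two-level map: 1,024 directory slots,
   each pointing to a 1,024-entry page table, or NULL if that 4MB region
   of the address space has never been touched. */
typedef struct { uint32_t pfn; int valid; } pte_t;
typedef pte_t *page_dir_t[1024];

/* Returns the page frame number for va, or -1 if the mapping is undefined. */
static int lookup(page_dir_t dir, uint32_t va)
{
    pte_t *table = dir[va >> 22];           /* step 1: page directory        */
    if (table == NULL)
        return -1;                          /* whole 4MB region undefined    */

    pte_t pte = table[(va >> 12) & 0x3FF];  /* step 2: page table            */
    if (!pte.valid)
        return -1;                          /* page fault in real hardware   */

    return (int)pte.pfn;                    /* step 3: add the byte offset   */
}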
Nevertheless, the three-step translation process would cause a system's performance to be unbearably poor if it occurred on every memory access. Therefore, x86 and Alpha processors include an address translation cache such as the one Figure 2 shows, which stores the most recent virtual-page-to-physical-page translations. When a process makes a memory reference, the MMU takes the virtual page number and simultaneously compares it with the virtual page number of every translation pair stored in the cache (this type of simultaneous-compare memory is known as associative memory). If there's a match, the MMU can bypass the page directory and page table lookups because it has already obtained the page frame number from the simultaneous compare. Address translation caches are known as Translation Look-aside Buffers (TLBs) or Translation Buffers (TBs). One of the costs associated with the scheduler switching from one thread to another is that, if the newly scheduled thread belongs to a different process, the mappings that belong to the old process must be flushed from the TLB. The three-step translation is then required to refill the TLB with mapping pairs for the new process.
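A toy software model makes the TLB's job clear. Real TLBs compare all entries in parallel in hardware, but the effect is the same as this illustrative C lookup:

#include <stdint.h>

/* Toy model of a TLB: a handful of {virtual page, page frame} pairs. */
typedef struct { uint32_t vpn; uint32_t pfn; int valid; } tlb_entry_t;

static int tlb_lookup(tlb_entry_t tlb[], int entries, uint32_t va, uint32_t *pfn)
{
    uint32_t vpn = va >> 12;                /* virtual page number (4KB pages) */
    for (int i = 0; i < entries; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;              /* hit: skip the two-level walk    */
            return 1;
        }
    }
    return 0;                               /* miss: do the full translation   */
}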
Figure 2
Paging
What I've described so far is the address translation that occurs when a process references a valid virtual memory address. A process can also make several types of invalid memory references. I'll review error cases first, then I'll discuss situations in which NT considers an invalid memory reference legal and correct—this type of reference is necessary to implement true virtual memory.
Not many processes today require 4GB of address space. Therefore, the address map of most processes is almost entirely empty or undefined. When a process references undefined parts of its virtual memory map, the MMU detects the undefined space by finding, in the first step of translation, a PDE marked invalid in the page directory, or by finding, in the second step of translation, a PTE marked invalid in a page table. PDEs and PTEs contain enough space, in addition to the page frame numbers they store, to keep several bits that serve as bookkeeping information. One such bit is the valid bit, which is set only when translation through the PDE or PTE is legal. If this bit is not set (i.e., it's off), the MMU stops translation and raises a processor exception called a page fault.
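On the x86, for example, these bookkeeping bits occupy the low end of each entry while the upper 20 bits hold the page frame number. The following sketch checks the flags of a 32-bit entry; the bit positions shown are the standard x86 ones, and the entry value is invented purely for illustration.

#include <stdint.h>
#include <stdio.h>

/* Low-order bookkeeping bits of an x86 PDE/PTE (4KB pages). */
#define PTE_VALID      0x001u   /* "present" bit: translation is legal  */
#define PTE_WRITABLE   0x002u   /* page can be written                  */
#define PTE_USER       0x004u   /* page is accessible from user mode    */

int main(void)
{
    uint32_t pte = 0x00123007u;       /* made-up entry: PFN 0x123, flags set */

    if (!(pte & PTE_VALID))
        printf("invalid -> MMU raises a page fault\n");
    else
        printf("PFN 0x%05X, %s, %s\n",
               (unsigned)(pte >> 12),
               (pte & PTE_WRITABLE) ? "writable" : "read-only",
               (pte & PTE_USER)     ? "user"     : "kernel-only");
    return 0;
}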
Besides setting up the contents of PDEs and PTEs to define address spaces, NT must respond to page faults and react to them appropriately. When the MMU invokes NT's page-fault handling code, the Memory Manager must check whether the reference that raised the exception is to an undefined address. If it is, the Memory Manager raises an access violation (which usually results in the termination of the process) if the processor was executing in user mode, or a blue screen if it was executing in kernel mode. (See my column "Inside the Blue Screen," December 1997, for more information about the difference between user mode and kernel mode.)
Because the MMU raises a page fault when a process references an invalid PDE or PTE, paged virtual memory can rely on a nifty trick. To support the illusion that an application process has access to more data and code than physical memory can hold, NT's Memory Manager will move parts of the application to a file on disk called a paging file. NT marks as invalid the PTEs that would otherwise map the pages in the process' address space that correspond to the paged-out data. Thus, when a process tries to reference a paged-out part of its virtual memory, the MMU generates a page fault. The Memory Manager's page-fault handler then looks into its internal data structures and discovers that the reference that triggered the page fault was not to an undefined address but rather to an address whose data is stored temporarily in the paging file. The Memory Manager then makes room in physical memory for the page the process is requesting. This operation often means that another page of data from the current process or from another process is sent out to the paging file (a page-out operation). Once the Memory Manager has created space in physical memory, it reads the requested page in from the paging file (a page-in operation).
After the page-in operation completes, NT updates the page table of the process that raised the page fault, and the page table points at the new page frame. The processor instruction that caused the memory reference restarts, and on the second time through, the translation succeeds without a page fault and accesses the requested physical data. For both page-in and page-out operations, the Memory Manager works with a disk driver to perform the I/O.
Thus, using invalid bits to its advantage, the Memory Manager makes a computer appear to have a total amount of memory equal to the size of physical memory plus the sizes of all the paging files. You can create up to 16 paging files in NT, placing each on a separate logical drive.
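A program can see this commit picture for itself with the Win32 GlobalMemoryStatus call, which reports physical memory, the commit capacity backed by the paging files, and the size of the caller's virtual address space. A minimal sketch:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORYSTATUS ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatus(&ms);        /* fills in the structure; no return value */

    printf("physical memory:     %lu KB\n", (unsigned long)(ms.dwTotalPhys / 1024));
    printf("commit limit:        %lu KB\n", (unsigned long)(ms.dwTotalPageFile / 1024));
    printf("process addr space:  %lu KB\n", (unsigned long)(ms.dwTotalVirtual / 1024));
    return 0;
}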
Process Address Spaces
Now that you understand the mechanics of paging, let's talk about process address spaces and how the Memory Manager defines and keeps track of them. As I've stated, each process has a 4GB virtual address space. This space is divided into two areas: the lower addresses map to data and code that are private to the process, and the upper addresses map to system data and code that are common to all processes, as Figure 3 shows. When the scheduler changes execution from one process to another, NT reloads the processor's page directory register so that the process-private portion of the mappings is updated. The system mappings are kept global through common page tables that every process' page directory points to.
Figure 3
The permissions that apply to pages in each region also reflect the split in mappings. The PTEs of x86 and Alpha processors contain a permissions bit. This bit specifies whether a page is accessible from a program executing in user mode. If a process refers to a page that is not accessible from user mode, the MMU generates a page fault, and the Memory Manager generates an access violation for the process (which is caught by Dr. Watson). As the Memory Manager sets up address spaces, it marks all system PTEs as accessible only from kernel mode. Thus, the Executive subsystems, device drivers, hardware abstraction layer (HAL), and Kernel can reference system memory, but user applications cannot touch system memory. This restriction protects key OS data structures and makes the system secure.
Figure 3 shows that on pre-Service Pack 3 (SP3) systems, the boundary between the lower address space (user mode) and the upper address space (kernel mode) is at 2GB. On SP3 and NT 4.0, Enterprise Edition, the boundary can be moved with a switch on a boot.ini line (/3GB) so that user programs have access to 3GB of address space and the system keeps the remaining 1GB. This change enables memory-intensive applications such as database servers to directly address more memory, thus relying less on shuffling data to and from disk in a manner similar to paging.
Memory allocation in NT is a two-step process—virtual memory addresses are reserved first and committed second. Reservation is simply a way for a process to ask the Memory Manager to set aside a block of virtual memory pages to satisfy its later memory requests. The Memory Manager makes no changes to the process at the time of a reservation, because a reservation does not use actual memory. When a process wants to use addresses it has reserved, it must commit them; access to reserved but uncommitted memory typically results in an access violation. When a process wants to commit addresses, the Memory Manager ensures that the process has enough memory quota to do so. The Memory Manager also checks that there is enough commit memory (physical memory plus the size of all the paging files) for the commit request. There are many cases in which an application will want to reserve a large block of its address space for a particular purpose (keeping data in a contiguous block makes the data easy to manage) but might not want to use all of the space. The application can also specify both reservation and commit in a single request to the Memory Manager, and this is the way most applications allocate memory.
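In Win32, both steps go through VirtualAlloc (with VirtualFree releasing the region). Below is a minimal sketch of the reserve-then-commit pattern; the region sizes are arbitrary and most error checking is omitted.

#include <windows.h>

int main(void)
{
    SIZE_T size = 64 * 1024 * 1024;     /* reserve 64MB of address space */

    /* Step 1: reserve. No physical memory or paging-file space is used yet. */
    char *base = (char *)VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL)
        return 1;

    /* Touching reserved-but-uncommitted memory here would raise an access
       violation.  Step 2: commit just the first 64KB before using it. */
    if (VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE) == NULL)
        return 1;

    base[0] = 1;                        /* legal: this page is committed */

    /* Reserve and commit can also be combined in one call:
       VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE); */

    VirtualFree(base, 0, MEM_RELEASE);  /* give the whole region back */
    return 0;
}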
One example of the Memory Manager's use of the single-request functionality is in the management of a thread's call stack. The Memory Manager reserves a range of memory (usually 1MB) in a process' address space for each thread created in a process. However, the Memory Manager actually commits only one page of each stack. As a thread uses the stack and touches reserved pages, the Memory Manager commits the reserved pages, increasing the size of the stack.
When a process reserves memory, it must specify the amount of memory, and it can also request a starting virtual address and specific protections to be placed on the memory. NT uses bits in a PTE that the MMU defines to indicate whether a page can be written to or read from and whether code can be executed in the page. NT sets these bits according to the protections that the API parameters specify. If the PTE bits specify a page as read-only and a process attempts to write to the page, the MMU generates a page fault, and NT's page-fault handler will flag an access violation for the process. The system uses this protection to easily detect errant programs that try to do things such as modify their own code image.
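You can watch this protection machinery from user mode by marking a committed page read-only with VirtualProtect and catching the resulting access violation with structured exception handling. A minimal sketch, with most error checking omitted:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Commit one read/write page, then change its protection to read-only. */
    char *page = (char *)VirtualAlloc(NULL, 4096,
                                      MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    DWORD old;
    if (page == NULL)
        return 1;
    page[0] = 'x';
    VirtualProtect(page, 4096, PAGE_READONLY, &old);

    __try {
        page[0] = 'y';                   /* write to a read-only page...       */
    }
    __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                  ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
        printf("access violation caught\n");  /* ...the MMU faulted, NT flagged it */
    }

    VirtualFree(page, 0, MEM_RELEASE);
    return 0;
}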
NT keeps track of the reserved or committed address ranges in a process' address space by using a tree data structure, such as the example Figure 4 shows. Each node in the tree represents a range of pages that have identical characteristics with respect to protection and commit state information. This tree structure is a binary splay tree, which means that the depth of the tree (the maximum number of links from the root to any leaf, or bottom, node) is kept to a minimum. The tree's nodes are called Virtual Address Descriptors (VADs), and the process' process control block stores the root of the tree. When a process makes a memory reservation or commit request, the Memory Manager can quickly determine where there are free spaces in the process address map, and whether the request overlaps memory that is already reserved or committed.
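A process can observe much the same region-by-region picture that the VAD tree records by walking its own address space with the Win32 VirtualQuery call. The following minimal sketch prints each region's base address, size, and state:

#include <windows.h>
#include <stdio.h>

/* Walk the process' user-mode address space, printing one line per region
   of pages that share a state. */
int main(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    char *addr = NULL;

    while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        printf("%p  %8lu KB  %s\n",
               mbi.BaseAddress,
               (unsigned long)(mbi.RegionSize / 1024),
               mbi.State == MEM_FREE    ? "free"     :
               mbi.State == MEM_RESERVE ? "reserved" : "committed");
        addr = (char *)mbi.BaseAddress + mbi.RegionSize;
    }
    return 0;
}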
Figure 4
File Mapping and Memory Sharing
A powerful feature of the NT Memory Manager is that it lets programs map files into their virtual address space. This mapping is accomplished in two steps. In the first step, the program creates an object, called a Section Object, to describe a file that the Memory Manager can map. The Section Object holds information about the name of a file, its size, and what portions of it are mapped. The Win32 function CreateFileMapping results in a Section Object's creation.
In the second step, the program maps all or part of the file into the process address space. The process invokes the Win32 function MapViewOfFile to do this mapping, specifying the start offset in the file at which to begin mapping and the number of bytes to map. The process specifies the protection (e.g., read-only, read/write) on the view at this time. After the view is mapped, the process can read from and write to the file simply by reading and writing the view's addresses in its address map. The Memory Manager transparently works with the file systems to ensure that all processes accessing the file see any updates the original process makes and that the updates are eventually flushed to disk.
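In Win32 terms, the two steps look roughly like the sketch below; the file path is only an example, and error checking is omitted for brevity.

#include <windows.h>

int main(void)
{
    /* Open an existing file; the path is just an example. */
    HANDLE file = CreateFile("C:\\temp\\example.dat", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    /* Step 1: create the section (Section Object) describing the file. */
    HANDLE section = CreateFileMapping(file, NULL, PAGE_READWRITE, 0, 0, NULL);

    /* Step 2: map a view of the whole file into the address space. */
    char *view = (char *)MapViewOfFile(section, FILE_MAP_WRITE, 0, 0, 0);

    view[0] = '!';               /* ordinary memory access = file I/O          */
    FlushViewOfFile(view, 0);    /* ask that dirty pages be written to disk    */

    UnmapViewOfFile(view);
    CloseHandle(section);
    CloseHandle(file);
    return 0;
}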
The file-mapping capability makes file I/O extremely straightforward, and NT uses it extensively. When an application launches, the Process Manager maps a view of the application's image into the address space of the application's process. The Process Manager transfers control to the entry point of the image in the address space, and as the application executes, any pages referenced for the first time generate page faults. The page faults result in the Memory Manager reading the application's pages from the file on disk.
NT uses a variation of file mapping to share memory. In this variation, the program creates the section without specifying a file, and the Memory Manager backs views of the section with the paging file instead. The data in the view is no different from regular virtual memory that a process reserves and commits. However, NT can give sections names in the Object Manager namespace so that two or more processes can open the same section, map views into their own address spaces, and then communicate with one another through this shared memory.
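A minimal sketch of paging-file-backed shared memory follows; the section name is purely illustrative, and error checking is omitted. Passing INVALID_HANDLE_VALUE in place of a file handle is what tells CreateFileMapping to back the section with the paging file.

#include <windows.h>
#include <string.h>

int main(void)
{
    /* Create a 4KB paging-file-backed section with a name other processes can open. */
    HANDLE section = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                       PAGE_READWRITE, 0, 4096, "MySharedSection");
    char *view = (char *)MapViewOfFile(section, FILE_MAP_WRITE, 0, 0, 0);

    strcpy(view, "hello from process 1");   /* visible to anyone who maps the section */

    /* A second process would use:
         HANDLE s = OpenFileMapping(FILE_MAP_READ, FALSE, "MySharedSection");
         char *v  = (char *)MapViewOfFile(s, FILE_MAP_READ, 0, 0, 0);        */

    UnmapViewOfFile(view);
    CloseHandle(section);
    return 0;
}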
Copy-on-Write
The Memory Manager implements an important optimization of memory sharing called copy-on-write. There are several common scenarios in which a process might want to use the same data as another process but keep any modifications it makes private to itself. For example, if two or more processes start the same application and one process modifies the data in the image, other processes should not see that modification. The obvious way to accomplish this kind of private modification is to load multiple copies of the application into memory; however, this strategy wastes space. Instead, the Memory Manager marks the physical pages containing the image read-only and notes in an internal data structure (which I'll describe next month) that the page is a copy-on-write page. When a process tries to modify the copy-on-write page, that action will generate a page fault (because the page is read-only). The Memory Manager will see that the process referenced a copy-on-write page and will copy the contents of the copy-on-write page to another page that is private to the process that made the reference. Then, when the faulting instruction restarts, the process can modify its private copy of the page. Figure 5 demonstrates this procedure. In the figure, Process 1 and Process 2 share three copy-on-write pages. If Process 1 writes to one of the pages, it will get its own private copy of the page, and Process 2 will retain the original of the page.
Figure 5
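A process can also ask for copy-on-write explicitly when it maps a file, by requesting a FILE_MAP_COPY view. In the following minimal sketch (illustrative path, error checking omitted), the first write to the view triggers the fault-and-copy sequence just described, and the file on disk is left untouched.

#include <windows.h>

int main(void)
{
    /* Open a file read-only; the path is only an example. */
    HANDLE file = CreateFile("C:\\temp\\example.dat", GENERIC_READ, FILE_SHARE_READ,
                             NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE section = CreateFileMapping(file, NULL, PAGE_WRITECOPY, 0, 0, NULL);

    /* FILE_MAP_COPY asks for a copy-on-write view: the pages are shared with
       other readers until this process writes to them. */
    char *view = (char *)MapViewOfFile(section, FILE_MAP_COPY, 0, 0, 0);

    char c = view[0];   /* read: still the shared, read-only page               */
    view[0] = c + 1;    /* first write: page fault, a private copy is created;  */
                        /* the file on disk and other processes are unchanged   */

    UnmapViewOfFile(view);
    CloseHandle(section);
    CloseHandle(file);
    return 0;
}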
NT uses the copy-on-write functionality mostly for executable images. When programs start, the Process Manager maps copy-on-write images. The POSIX subsystem also uses copy-on-write for handling the POSIX fork operation. When a POSIX process forks, the Process Manager creates a child process that will inherit the address space of the parent process. Instead of making a copy of all the parent's memory, the Memory Manager just marks the address-space pages of both the parent and child processes as copy-on-write. When either process makes a modification to its memory, it receives a private copy of the page it wants to modify, and the original of the page will then be private to the other process.
Stay Tuned
Next month, I'll dig deeper into the Memory Manager to describe the internal data structures it uses to keep track of pages. I'll present the details of how the Memory Manager implements shared memory, and I'll give you an overview of the way working sets—the amount of physical memory assigned to each process—are tuned as the system runs.
ABOUT THE AUTHOR
Mark Russinovich holds a Ph.D. in computer engineering from Carnegie Mellon University. He is coauthor of many popular utilities for Windows NT, including NTFSDOS, Filemon, Regmon, and Recover. You can reach him at mark@sysinternals.com or http://www.sysinternals.com.