Inside Win2K Scalability Enhancements, Part 1

Learn how this OS takes advantage of physical memory

With Windows NT firmly entrenched in the low- to midrange server market, Microsoft has set Windows 2000's (Win2K's) sights on large servers with enterprise-class workloads. The current industry characterization of large servers is four or more processors and multiple gigabytes of physical memory. Variants of the UNIX OS dominate such systems because UNIX has matured throughout the 1990s in performance, reliability, manageability, and availability, all aspects crucial to large-server computing. To be an attractive alternative to UNIX on large systems, Win2K must excel in all these areas.

The laundry list of features and enhancements that make Win2K an improvement over NT 4.0 addresses the shortcomings that have prevented NT 4.0 from penetrating the large-server market. In this two-column series, I highlight the new features and fine-tuning that Microsoft introduces in Win2K to make it a scalable OS. This month, I describe new features and optimizations that let Win2K better exploit large amounts of physical memory. Next month, I'll continue with a look at optimizations that improve Win2K's processor use in multiprocessor systems.

Memory Scalability
Large servers' workloads are typically memory-intensive. The workloads' data set comprises multiple gigabytes and challenges the physical memory present on large servers. For example, you might use a large server to run a database server that manages a multigigabyte corporatewide database or several department databases with a cumulative size of several gigabytes. Other examples of large-memory workloads are enterprise resource planning (ERP), scientific, or financial analysis applications with multiple gigabytes of input data. Mass storage devices have higher latency than main memory by several orders of magnitude, as well as lower throughput, so building a server that can store all or most of a workload's data sets in main memory is important.

Consider a database query that a user runs against a 6GB database. If the server has only 1GB of physical memory, the database application must read the entire contents of the database from disk into memory during the query's processing. With a disk throughput of about 10MBps, the query would take approximately 10 minutes. But if the same server has 8GB of memory, the database application can cache the entire database in memory and will require no disk access for a query. If memory throughput is on the order of 1GBps, the query will take only seconds. The difference in query response times is the difference between a disk-bound server and a server that is suited for the workload.
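The arithmetic behind this comparison is easy to check. The following Python sketch uses only the round figures from the example above (6GB of data, about 10MBps of disk throughput, about 1GBps of memory throughput); the function name is illustrative, and the model ignores seek time, caching, and query processing overhead.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def scan_seconds(data_bytes, bytes_per_second):
    """Time to stream data_bytes once at a sustained throughput."""
    return data_bytes / bytes_per_second

disk_time = scan_seconds(6 * GB, 10 * MB)    # disk-bound server
memory_time = scan_seconds(6 * GB, 1 * GB)   # database fully cached in memory

print(f"disk:   {disk_time / 60:.1f} minutes")
print(f"memory: {memory_time:.1f} seconds")
```

The two results differ by a factor of roughly 100, which is simply the ratio of the assumed throughputs.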

Most of the time, server applications don't require access to a workload's entire data set. Instead, the applications cache the most frequently accessed portions of the data set in memory, leaving the infrequently accessed portions on disk. An example of a caching application is a Web server, which loads frequently accessed files into a memory cache for fast delivery of the files. In general, the more file data the Web server caches in memory, the less frequently the Web server has to fetch files from disk.

It seems obvious that the more memory a server has, the larger the workload the server can efficiently run. However, simply adding more memory to a server doesn't necessarily result in a server application scaling to take advantage of the memory. Efficient memory scaling has two requirements. First, an OS must be able to use the memory that might exist. Second, the OS must let server applications directly access the memory.

Most 64-bit OSs have no problem meeting either requirement. Such OSs are typically able to match 64-bit hardware in the amount of physical memory they can address. Similarly, 64-bit applications have almost 2^64 bytes of virtual memory at their disposal, so the amount of memory that they can directly access exceeds the amount that computer systems of the foreseeable future will support. In contrast, 32-bit OSs have several shortcomings that necessitate special features before the OSs can support large amounts of memory.

The first shortcoming in 32-bit systems is that 32-bit computer hardware design, specifically the Intel x86 line, has historically supported at most 2^32 bytes (4GB) of physical memory. The second shortcoming in 32-bit architectures exists because a 32-bit reference size affects applications' virtual memory address limits. A 32-bit memory address implies a virtual address space of at most 4GB. Most OSs, including Win2K, NT, and UNIX, divide applications' virtual address space into two regions: a region that is private to each application, and a region that maps the memory that the OS, device drivers, and file system cache occupy. Figure 1 illustrates these regions. This division lets the OS and device drivers directly access application memory for efficient data transfers between applications and the OS. If the system were to give each application an entire 4GB address space and the OS a separate address space, then system calls, including file I/O, would require a relatively costly transfer of data from application address spaces to the system's address space, and vice versa.

In NT 4.0, the application-to-OS division is in the middle of the 4GB address space, such that applications have 2GB of private memory and the system assigns itself the remaining 2GB. On the x86 version of NT Server, Enterprise Edition (NTS/E), Win2K Advanced Server (Win2K AS), and Win2K Datacenter Server (Datacenter), an administrator can enable the /3GB boot switch, which moves the division so that applications have 3GB of private memory and the system has 1GB.
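To make the two layouts concrete, the following Python sketch computes where the application-to-OS boundary falls in a 4GB address space for the default split and for the /3GB split. It is a simplified model: it treats each region as one contiguous range and ignores any reserved or no-access areas that a real system keeps near the boundary. The function name is illustrative.

```python
GB = 1024 ** 3

def address_space_split(user_gb):
    """Return ((user_start, user_end), (system_start, system_end)),
    inclusive, for a 32-bit (4GB) virtual address space."""
    boundary = user_gb * GB
    user = (0x00000000, boundary - 1)
    system = (boundary, 2 ** 32 - 1)
    return user, system

for label, user_gb in (("default", 2), ("/3GB", 3)):
    user, system = address_space_split(user_gb)
    print(f"{label:8} user {user[0]:#010x}-{user[1]:#010x}  "
          f"system {system[0]:#010x}-{system[1]:#010x}")
```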

An application's private address space size places an upper limit on the amount of in-memory data that an application can directly manipulate. For example, on a 32-bit computer with 4GB of physical memory and a 3GB/1GB application-to-OS virtual-address split, a database server can manage at most 3GB of database data without having to read from the disk. The performance picture is complicated if the OS serves any disk reads that the application must perform from a file-system cache; in this scenario, the application might avoid actual disk I/O. Therefore, even with 3GB of private virtual memory, applications on NT 4.0 can, under some circumstances, directly and indirectly access almost 4GB of physical memory (assuming that the system has that much memory). However, applications are at the OS's mercy as to what data the system caches in memory beyond the 3GB to which it has direct access. In addition, if a server application is sharing the computer with other active applications, the OS must also divide the physical memory among those applications.

Physical Memory on the Alpha
Microsoft offers Win2K for both the x86 and Alpha processors. The Alpha is a 64-bit processor that Digital Equipment (now Compaq) originally developed. Recent implementations of the Alpha (i.e., the EV5 and EV6 generations) support at least 8GB of physical memory. Different members of the Alpha line represent physical addresses with varying numbers of bits, which determines how much physical memory each processor can support. For example, the 21164PC processor, which is part of the EV5 generation, implements 33-bit physical addresses, but the newer 21264 processor uses 43-bit physical addresses. Thus, the 21164PC can use as much as 2^33 bytes (8GB) of physical memory, whereas the 21264 can use 2^43 bytes (8TB).

All 32-bit Alpha versions of NT, including Win2K, NT 4.0, and NT 3.1, store 35-bit physical addresses in their internal-memory-management data structures. A 35-bit representation limits physical memory support to 32GB. Therefore, although certain versions of the Alpha processor can support more memory, the 32-bit version of Win2K can use a maximum of only 32GB.
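The relationship between address width and memory limit is a power of two, and the usable limit is the smaller of what the processor and the OS can each address. The short Python sketch below reproduces the figures quoted above; the helper names are illustrative.

```python
def max_physical_bytes(address_bits):
    """Bytes addressable with a physical address of the given width."""
    return 2 ** address_bits

def human(n):
    """Format a power-of-two byte count with the largest whole unit."""
    for unit, size in (("TB", 2 ** 40), ("GB", 2 ** 30), ("MB", 2 ** 20)):
        if n >= size:
            return f"{n // size}{unit}"
    return f"{n} bytes"

cpu_limit = max_physical_bytes(43)   # 21264 processor: 43-bit physical addresses
os_limit = max_physical_bytes(35)    # 35-bit addresses in the OS data structures
usable = min(cpu_limit, os_limit)    # the tighter of the two limits applies

print(human(max_physical_bytes(33)), human(cpu_limit), human(usable))
```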

Breaking the 4GB Physical Barrier on the x86
Running Win2K or NT forces an Alpha processor to support memory sizes smaller than its native capabilities allow. In contrast, the x86's original design supports a maximum of only 4GB of internal and external memory. For the x86 processor design to support more than 4GB of memory, changes to its design were necessary. Because Intel realized that a 4GB limit would hamper the x86 processor's growth in enterprise computing, the company added new operating modes to the x86. Intel released the Pentium Pro with a new mode called Physical Address Extension (PAE), and the company introduced 36-bit Page Size Extension (PSE36) in the Pentium II processor.

In its traditional operating mode, the x86 implements a two-level paging architecture to translate virtual addresses (which an OS and its applications use) into physical addresses (which memory hardware uses). The x86 memory management unit (MMU) divides virtual addresses into three fields, as Figure 2, page 54, shows. The CR3 special processor register anchors the page directory data structure, and the first field of a virtual address serves as an index to the directory. The MMU extracts a 4-byte address, or page directory entry (PDE), from the page directory at the appropriate index to locate a page table. The second field of the virtual address identifies the target entry in the page table. Page table entries (PTEs) are 4 bytes (32 bits) in size. A PTE contains a 20-bit address of a physical page, and because a page is 4096 (2^12) bytes on the x86, the x86 has a maximum of 2^(20+12) bytes, or 4GB, of physical memory. The last field of a virtual address denotes the offset into the page that the PTE specifies.
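The two-level walk can be sketched in a few lines of Python. The field widths used here (10 bits of directory index, 10 bits of table index, 12 bits of page offset) are the standard x86 layout for 4KB pages. The dictionaries that stand in for the page directory and page tables are a toy model chosen for illustration, not how the hardware stores these structures, and the sample address and page number are arbitrary.

```python
def split_virtual_address(va):
    """Split a 32-bit virtual address into the three fields the x86 MMU
    uses in standard (non-PAE) paging with 4KB pages."""
    directory_index = (va >> 22) & 0x3FF   # top 10 bits
    table_index = (va >> 12) & 0x3FF       # middle 10 bits
    page_offset = va & 0xFFF               # low 12 bits
    return directory_index, table_index, page_offset

def translate(va, page_directory):
    """Toy two-level walk: dictionaries stand in for in-memory tables."""
    directory_index, table_index, page_offset = split_virtual_address(va)
    page_table = page_directory[directory_index]   # the PDE locates a page table
    page_number = page_table[table_index]          # the PTE holds a 20-bit page address
    return (page_number << 12) | page_offset

# One mapped page: virtual 0x80401000 maps to physical page 0xABCDE.
page_directory = {513: {1: 0xABCDE}}
physical = translate(0x80401234, page_directory)
print(hex(physical))
```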

PSE36 lets an OS direct the MMU to perform one-level address translation on select PDEs, a process that Figure 3 illustrates. An OS enables the translation by marking a PDE as page-size extended, so that the MMU uses the physical address in the PDE as the final page address, rather than as a page table's address. In addition, the MMU uses 14 bits for the page address, instead of a PDE's standard 20 bits, but the system interprets the pages as 4MB (2^22 bytes) in size. This alteration results in 36-bit physical addresses (14 bits of PDE address plus 22 bits of page size), which is large enough to reference 64GB of data. PSE36 memory's drawback is that the large page size (4MB vs. the standard 4KB) makes the page inefficient for general-purpose use. Early in 1998, Intel developed a special device driver, the Intel PSE36 Driver, to give applications using PSE36 an interface to memory above 4GB. The driver runs only under NTS/E and lets a maximum of one application use memory above the 4GB boundary as a type of RAM disk. All application and OS memory is below the 4GB boundary, so when an application wants to write to PSE36 memory, the application notifies the PSE36 Driver, which must copy the application's buffer to the specified location above 4GB. Figure 4 illustrates this memory-write process.
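As a rough illustration of the 14 + 22 = 36-bit arithmetic, the Python sketch below forms a physical address from a 14-bit large-page number and the low 22 bits of a virtual address. It follows the simplified description above and does not model how the bits are actually laid out inside a PDE; the function name is illustrative.

```python
LARGE_PAGE_BITS = 22                  # 4MB pages
LARGE_PAGE_SIZE = 2 ** LARGE_PAGE_BITS
PAGE_NUMBER_BITS = 14                 # page address bits taken from the PDE

def pse36_physical_address(page_number, va):
    """Combine a 14-bit large-page number with a 22-bit offset."""
    if not 0 <= page_number < 2 ** PAGE_NUMBER_BITS:
        raise ValueError("page number must fit in 14 bits")
    return (page_number << LARGE_PAGE_BITS) | (va & (LARGE_PAGE_SIZE - 1))

highest = pse36_physical_address(2 ** PAGE_NUMBER_BITS - 1, 0xFFFFFFFF)
print(hex(highest), (highest + 1) // 2 ** 30, "GB addressable")
```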

The additional physical memory that PSE36 makes available to a server application typically enhances the performance of the application, which would otherwise perform disk I/O. Unfortunately, the copy operations that result when the PSE36 Driver transfers data to and from memory above 4GB can hurt overall performance. Therefore, Microsoft hasn't promoted PSE36 but chose instead to use the x86's PAE mode to implement large-memory support.

When the x86 executes in PAE mode, the MMU divides virtual addresses into four fields, as Figure 5 shows. The MMU still implements page directories and page tables, but a third level, the page directory pointer table, exists above them. PAE mode can address more memory than the standard translation mode not only because of the extra level of translation but also because PDEs and PTEs are 8 bytes, rather than 4. The system represents physical page addresses internally with 24 bits, which, combined with the 12-bit page offset, gives the x86 the ability to support a maximum of 2^(24+12) bytes, or 64GB, of memory. PAE's advantage over PSE36 is dramatic: An OS can use all the physical memory as general-purpose memory, so copy operations to access memory above 4GB aren't necessary.
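The four-field split can be sketched the same way as the standard-mode split. The widths used below (2 bits for the page directory pointer table, 9 bits each for the page directory and page table, 12 bits of offset) are the standard PAE layout for 4KB pages; the function name and the sample address are illustrative.

```python
def split_pae_virtual_address(va):
    """Split a 32-bit virtual address into the four PAE-mode fields."""
    pointer_index = (va >> 30) & 0x3       # page directory pointer table: 2 bits
    directory_index = (va >> 21) & 0x1FF   # page directory: 9 bits
    table_index = (va >> 12) & 0x1FF       # page table: 9 bits
    page_offset = va & 0xFFF               # offset within the 4KB page
    return pointer_index, directory_index, table_index, page_offset

# 8-byte entries halve the entries per 4KB table: 512 instead of 1024.
entries_per_table = 4096 // 8

# A 24-bit page address plus a 12-bit offset yields a 36-bit physical address.
max_physical = 2 ** (24 + 12)

print(split_pae_virtual_address(0x80401234), entries_per_table,
      max_physical // 2 ** 30, "GB")
```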

Because PAE mode is either on or off and has a different virtual-to-physical translation model than standard x86 mode, vendors must modify x86 OSs to use PAE mode. Microsoft developed a Win2K kernel version that implements PAE memory translation on the x86; if a system is PAE-capable and has more than 4GB of memory, the boot loader, NT Loader (NTLDR), loads the PAE kernel. Thus, rather than load the ntoskrnl.exe image as the kernel, NTLDR loads ntkrnlpa.exe. (Uniprocessor and multiprocessor versions of PAE and non-PAE kernels exist.) After the kernel loads, Win2K Professional (Win2K Pro) and Win2K Server restrict memory usage to 4GB. Win2K AS and Datacenter can use the additional memory above 4GB: In Win2K AS, the PAE kernel will use at most 8GB of physical memory; in Datacenter, the kernel will use the maximum 64GB, if that much memory is present. Contrast this memory usage with the Alpha versions of Win2K, all of which use up to 32GB of memory. Table 1 summarizes Win2K's physical memory support on both x86 and Alpha hardware.

Very Large Memory
Although an Alpha 21264 processor running in 32-bit mode lets Win2K manage up to 32GB of physical memory, applications are still stuck with a 4GB virtual address space by default. Internally, the Alpha represents all virtual addresses as 64-bit values; however, the Alpha uses the sign-extension technique to translate 32-bit addresses into 64-bit addresses. When Win2K or NT creates a 4GB address space for an application, the application uses the 2GB at the bottom of the 64-bit address range and the 2GB at the top of the 64-bit address range, as Figure 6, page 57, shows. (MMUs interpret Alpha virtual addresses as 43-bit or 48-bit sign-extended values, depending on processor execution mode, so the Alpha isn't a true 64-bit processor with respect to virtual addresses.)
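Sign extension explains why the 4GB address space ends up as two separate 2GB pieces. The following Python sketch copies bit 31 of a 32-bit address into the upper 32 bits of a 64-bit value; the function name is illustrative.

```python
def sign_extend_32_to_64(address32):
    """Widen a 32-bit address to 64 bits by replicating bit 31."""
    address32 &= 0xFFFFFFFF
    if address32 & 0x80000000:                  # bit 31 set: upper 2GB
        return address32 | 0xFFFFFFFF00000000   # fill the upper 32 bits with ones
    return address32                            # bit 31 clear: upper bits stay zero

for address in (0x00000000, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
    print(f"{address:#010x} -> {sign_extend_32_to_64(address):#018x}")
```

Addresses in the lower 2GB keep their values at the bottom of the 64-bit range, while addresses in the upper 2GB land in the last 2GB of the 64-bit range.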

Early in Win2K's development, Microsoft saw an opportunity to extend Alpha applications' addressing capabilities. The company introduced the very large memory (VLM) API, whereby the Win2K kernel lets an application create up to 28GB more virtual memory in its private address space, for a total of 30GB.

Some restrictions exist regarding the VLM an application allocates. First, virtual VLM translates directly to physical memory. Thus, if an application allocates 2GB of VLM, the application is allocating 2GB of physical memory for its exclusive use. The data that an application stores in VLM resides in physical memory that the system never pages out to a paging file on disk in the way that the data and code in standard virtual memory can be paged out. A system must have a minimum of 128MB of memory for the system to enable the VLM API, and the API is available only on Win2K's Alpha version. A final restriction is that applications can use the virtual addresses they obtain via the VLM API only with other VLM APIs. This restriction exists because the virtual addresses that the VLM API returns are 64 bits wide, but most standard Win32 APIs take 32-bit parameters. An important point regarding VLM is that the physical memory limits that Win2K imposes on itself affect the total amount of VLM that applications can allocate.

Address Windowing Extensions
During Win2K's development, Intel introduced the 450NX chipset that lets x86 processors use PAE to break the 4GB physical memory boundary, and Microsoft implemented a portable API that the company aims at systems with large memory. Applications use the Address Windowing Extensions (AWE) API to allocate physical memory for their exclusive use and to gain access to all or part of the physical memory that the applications allocate through a window in their address space.

To use the AWE API to allocate physical memory, an application calls the Win32 function AllocateUserPhysicalPages. Then, the application uses the standard Win32 API VirtualAlloc to create a window in the private portion of the application's 4GB address space. VirtualAlloc accepts the MEM_PHYSICAL flag, which signals to the Win2K kernel that the application is creating a physical memory window. After it allocates physical memory and creates the window, the application can map portions of physical memory into the window. For example, if an application creates a 256MB window in its address space and allocates 4GB of physical memory (on a system with at least 4GB of physical memory), the application can use the MapUserPhysicalPages or MapUserPhysicalPagesScatter Win32 APIs to access any portion of the physical memory by mapping the memory into the 256MB window. The size of the application's window determines the maximum amount of physical memory that the application can access with a given mapping. Figure 7 shows an AWE window with a physical memory mapping.
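The windowing idea can be modeled without calling any Win32 function. The Python sketch below is a toy simulation, not the AWE API: the class and method names are hypothetical, and each method only notes which step of the sequence described above it stands in for. It shows how a small window can reach any part of a larger set of physical pages by remapping.

```python
PAGE_SIZE = 4096

class AweWindowModel:
    """Toy model of a small virtual window onto a larger pool of pages."""

    def __init__(self, physical_pages, window_pages):
        # Stands in for allocating physical pages for exclusive use.
        self.physical = [bytearray(PAGE_SIZE) for _ in range(physical_pages)]
        # Stands in for reserving the window; every slot starts unmapped.
        self.window = [None] * window_pages

    def map_pages(self, first_slot, page_numbers):
        """Stands in for mapping physical pages into the window."""
        if first_slot < 0 or first_slot + len(page_numbers) > len(self.window):
            raise ValueError("mapping does not fit in the window")
        if any(not 0 <= p < len(self.physical) for p in page_numbers):
            raise ValueError("page was never allocated")
        for i, page in enumerate(page_numbers):
            self.window[first_slot + i] = page

    def _locate(self, window_offset, length):
        slot, offset = divmod(window_offset, PAGE_SIZE)
        if offset + length > PAGE_SIZE:
            raise ValueError("access crosses a page boundary")
        page = self.window[slot]
        if page is None:
            raise RuntimeError("window slot is not mapped")
        return self.physical[page], offset

    def write(self, window_offset, data):
        page, offset = self._locate(window_offset, len(data))
        page[offset:offset + len(data)] = data

    def read(self, window_offset, length):
        page, offset = self._locate(window_offset, length)
        return bytes(page[offset:offset + length])

model = AweWindowModel(physical_pages=16, window_pages=4)
model.map_pages(0, [12, 13])      # window slots 0-1 show pages 12-13
model.write(0, b"row")            # lands in physical page 12
model.map_pages(0, [2, 3])        # remap the same slots to other pages
hidden = model.read(0, 3)         # page 2 is still zero-filled
model.map_pages(0, [12, 13])      # map the original pages back in
visible = model.read(0, 3)        # the earlier data is still there
```

In this model, as in the description above, the window size caps how much of the allocated memory is reachable at any one moment, while the total amount allocated can be much larger.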

The AWE API exists on all Win2K versions and is enabled regardless of how much physical memory a system has; however, AWE is most effective on systems with at least 2GB of physical memory. Because applications have only 2GB or 3GB (depending on whether the /3GB boot switch is enabled) of private virtual memory, the AWE API gives applications a mechanism to directly control more memory than their address space would otherwise dictate. For example, on a Win2K AS system with 8GB of physical memory, a database server application can use AWE to implement almost 8GB of memory as a database cache, to which the server has direct access through its AWE windows.

AWE provides two major benefits in addition to the direct access to huge amounts of physical memory that the API enables. First, all 32-bit and 64-bit platforms uniformly support AWE; second, you can use AWE-allocated memory with nearly all the Win32 APIs.

Improving SMP Memory Performance
In addition to kernel enhancements that let the OS and applications take advantage of large-memory systems, Win2K has several memory-related performance enhancements for operation on multiprocessors. NT 4.0 introduced the lookaside lists feature. A lookaside list is a pool of fixed-size kernel memory buffers that the Win2K kernel and device drivers create as private memory caches to serve specific purposes.

When an application executes a file-system operation, such as a file read, the I/O Manager must allocate a buffer to serve as the I/O request packet (IRP) that describes the request. The I/O Manager hands the IRP to the file-system driver responsible for managing the file that the read targets. When the file system finishes servicing the read, the I/O Manager must free the buffer it used to store the IRP. Without lookaside lists, the I/O Manager must frequently allocate and free memory buffers that store IRPs. To improve performance, Win2K's I/O Manager creates an IRP lookaside list. In a situation in which the I/O Manager would usually free an IRP buffer back to the general memory pool, the I/O Manager instead stores the buffer on its IRP lookaside list. Then, when it needs to allocate a buffer to serve as an IRP, the I/O Manager checks the lookaside list. If the lookaside list stores at least one freed buffer, the I/O Manager doesn't need to call upon the general kernel buffer manager. The kernel tunes the number of freed buffers that lookaside lists store according to how often a device driver or a kernel subsystem such as the I/O Manager allocates from the list; the more frequent the allocations, the more buffers the kernel allows on a list. When a list reaches the size limit that its usage patterns determine, the kernel frees buffers from the list back to the general memory pool.
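The allocate-and-recycle pattern described above can be sketched as a small Python class. This is a conceptual model only: the class name, the fixed max_depth limit, and the counter are assumptions made for illustration, and the sketch omits the kernel's dynamic tuning of list depth.

```python
class LookasideList:
    """Toy pool of fixed-size buffers that recycles freed buffers."""

    def __init__(self, buffer_size, max_depth):
        self.buffer_size = buffer_size
        self.max_depth = max_depth      # fixed here; the kernel tunes this value
        self.free = []                  # freed buffers kept for reuse
        self.pool_allocations = 0       # calls that fell through to the general pool

    def allocate(self):
        if self.free:                   # fast path: reuse a freed buffer
            return self.free.pop()
        self.pool_allocations += 1      # slow path: ask the general allocator
        return bytearray(self.buffer_size)

    def release(self, buffer):
        if len(self.free) < self.max_depth:
            self.free.append(buffer)    # keep the buffer for the next request
        # Otherwise the buffer is dropped, which models returning it to the pool.

irp_list = LookasideList(buffer_size=256, max_depth=4)
first = irp_list.allocate()             # list is empty, so this uses the pool
irp_list.release(first)
second = irp_list.allocate()            # served from the lookaside list
print(second is first, irp_list.pool_allocations)
```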

Win2K adds a twist to the lookaside list performance optimization. On a multiprocessor, the cache coherency mechanism must keep data that the system modifies in the data cache of one processor synchronized with copies of the data that the other processors might cache. Cache coherency adds overhead to a multiprocessor's execution because the cache coherency algorithms need to use the multiprocessor's data bus; this use prevents processors from accomplishing useful work. In NT 4.0, all processors share the kernel's IRP lookaside list, which means that updating the lookaside list can cause the cache coherency mechanism to degrade performance. In addition, having one lookaside list means that processors have to synchronize their access to the list using spinlocks, and spinlocks also cause overhead on the multiprocessor bus and slow down a CPU's processing. Win2K creates separate IRP lookaside lists for each processor to avoid these performance degradations.
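The per-processor variation can be modeled by giving each processor its own free list, so that an allocation or release touches only the list that belongs to the processor doing the work. The Python sketch below is a conceptual model with hypothetical names; it cannot demonstrate cache coherency traffic or spinlock contention, only the data layout that avoids sharing one list.

```python
class PerProcessorLookaside:
    """Toy model: one private free list per processor, so no shared lock."""

    def __init__(self, processors, buffer_size):
        self.buffer_size = buffer_size
        self.lists = [[] for _ in range(processors)]

    def allocate(self, cpu):
        free = self.lists[cpu]          # only this processor's list is touched
        return free.pop() if free else bytearray(self.buffer_size)

    def release(self, cpu, buffer):
        self.lists[cpu].append(buffer)  # return the buffer to the local list

lookaside = PerProcessorLookaside(processors=4, buffer_size=256)
buffer = lookaside.allocate(cpu=0)
lookaside.release(cpu=0, buffer=buffer)

other = lookaside.allocate(cpu=1)       # CPU 1's list is empty: fresh buffer
reused = lookaside.allocate(cpu=0)      # CPU 0 gets its own buffer back
print(other is buffer, reused is buffer)
```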

A system duplicates roughly 10 kernel lookaside lists across processors. In addition to the I/O Manager with its IRP lookaside list, the Win2K Object Manager and Cache Manager are two other subsystems that use this technique. The kernel's general buffer manager also uses this optimization when the manager creates per-processor lookaside lists for storing 32-byte buffers. When a device driver or kernel subsystem bypasses a lookaside list for a 32-byte or smaller allocation request, the kernel's general buffer manager checks its lookaside list for available buffers.

In addition to per-processor lookaside lists, Microsoft has made several other, more subtle optimizations to the Win2K Memory Manager to enhance memory scaling on multiprocessors. For example, Win2K has improved working-set tuning, a mechanism for keeping frequently accessed application data in physical memory.

Pool Size and Cache Size
NT 4.0 implements nonpaged pool for nonpaged memory. Device drivers and the OS store data structures that must stay in physical memory and not be paged out to disk in nonpaged pool. The Memory Manager bases the pool's size on several parameters, including how much physical memory is present (the pool's maximum size in NT 4.0 is 128MB). The Microsoft TCP/IP driver, which must allocate nonpaged memory for every TCP/IP connection that is active on the computer, relies heavily on nonpaged pool. The size of nonpaged memory can therefore limit active TCP/IP connections. The TCP/IP driver and other drivers that run enterprise-class Web server workloads can push the NT 4.0 nonpaged pool's 128MB limit, so Microsoft raised the maximum size of nonpaged pool in Win2K to 256MB. In the process, Microsoft also packed the data structures that manage nonpaged pool more tightly to save space.

Finally, in NT 4.0, the Cache Manager can use a maximum of 512MB of the virtual address space that the Memory Manager assigns to the system. Win2K's Cache Manager raises that maximum to 960MB. This increase lets Win2K's Cache Manager more efficiently manage larger numbers of cached files because the Cache Manager doesn't need to perform as much remapping of physical memory into the cache's virtual memory. However, a larger amount of virtual memory for the cache has no effect on the number of disk I/Os that the Cache Manager performs. A common misconception is that the Win2K and NT file-system caches efficiently use physical memory only up to the virtual size of the cache. In reality, the Win2K and NT Cache Managers efficiently use all the physical memory you plug into a system.

Multiprocessor Scaling
The changes in Win2K that let its kernel and applications use more physical memory than NT 4.0 supports extend Win2K's capabilities for handling large-server and enterprise-class data sets. Further, performance optimizations related to memory sharing among processors let Win2K run more efficiently on multiprocessors, another characteristic that will help Win2K further penetrate enterprise computing. Next month, I'll continue examining Win2K scalability enhancements and highlight Win2K features and optimizations that help the kernel and applications more effectively use multiprocessors.

Editor's Note: As this article went to press, Microsoft announced that it will no longer support Alpha for Win2K.

Windows NT Magazine
Copyright Duke Communications Intl, Inc. All rights reserved.