Alex Stockton
Microsoft Corporation
November 1999
Summary: Describes the techniques used in Microsoft's Visual C++® implementation of Nile.com, an online bookstore used in Doculabs' @Bench application server performance benchmark. (15 printed pages)
Overview
Server Techniques
Database Techniques
Development Techniques
Resources
About the Author
Overview

Nile.com is a sample e-commerce application developed by Doculabs* to demonstrate application server performance. As the name implies, the sample is for an online bookstore.
This article describes Microsoft's implementation of Nile.com, which has also been implemented by several other companies in order to demonstrate their application server performance. Built with Visual C++ and a number of other Microsoft products and technologies described in this article, Microsoft's version of Nile.com won a PC Week Labs comparison of competing implementations in July 1999 (see http://www.zdnet.com/pcweek/stories/news/0,4153,409380,00.html).
The Microsoft® Visual C++ implementation of the Nile.com online bookstore was developed as an ISAPI extension DLL hosted by Internet Information Server (IIS) 4.0. The application's data was held in SQL Server™ 7.0 and accessed through the ODBC API. Load balancing and fault tolerance were provided by Microsoft Windows® Load Balancing Server (WLBS).
In this article, I'll describe the techniques used to implement the Nile.com application and, more importantly, show you a range of principles that you can use to achieve great performance in your own server applications. I'll concentrate more on summarizing the reusable knowledge gained from developing the application than on the details of our particular implementation.
Note Before reading this article, I recommend you read George Reilly's excellent article "Server Performance and Scalability Killers," which outlines the top ten ways of destroying server performance and how to avoid them.
*Doculabs is an information advisory company specializing in intranet, extranet and document technologies.
Server Techniques

An ISAPI extension is simply a Windows DLL that exposes three entry points, described in the following table.
Entry point | Description
GetExtensionVersion | Called by IIS as soon as it loads the extension DLL. An implementation of this function will return the extension's version information to IIS and perform any other initialization required for that particular ISAPI extension.
HttpExtensionProc | Called by IIS for each client request. Implementations of this function generate the HTML (or perform other actions) necessary to satisfy the client's request.
TerminateExtension | Called by IIS just before it unloads the extension DLL. A typical implementation of this function performs any cleanup necessary for that particular ISAPI extension.
A client sends a request to an ISAPI extension by submitting a URL (such as the one shown next) to the Web server:
http://www.nile.com/nile.dll?viewcart
(protocol: http; server: www.nile.com; item: nile.dll; query string: viewcart)
The part of the URL before the question mark indicates a path to the ISAPI extension that should be invoked. When IIS receives a request destined for a particular ISAPI extension, it calls HttpExtensionProc on the relevant DLL (after ensuring that it's loaded and initialized) and passes a pointer to a structure (an extension control block) holding all the details of the client's request.
ISAPI extensions use the extension control block to access items like the query string (the part of a URL following the question mark), cookies, and form data, as well as functions exposed by IIS itself. The extension control block contains all the information that an ISAPI extension needs to understand the client's request and return the results. It's the responsibility of each extension to interpret that request appropriately for the particular application.
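To make this concrete, here is a minimal sketch of an ISAPI extension (not the Nile.com code) that reads the query string from the extension control block and echoes it back to the client. Error handling is omitted, and the three entry points would need to be exported from the DLL (with a .def file, for example).

#include <windows.h>
#include <httpext.h>   // ISAPI extension definitions
#include <stdio.h>

BOOL WINAPI GetExtensionVersion(HSE_VERSION_INFO* pVer)
{
    // Report the ISAPI version the DLL was built against and a description.
    pVer->dwExtensionVersion = MAKELONG(HSE_VERSION_MINOR, HSE_VERSION_MAJOR);
    lstrcpynA(pVer->lpszExtensionDesc, "Sample extension", HSE_MAX_EXT_DLL_NAME_LEN);
    return TRUE;
}

DWORD WINAPI HttpExtensionProc(EXTENSION_CONTROL_BLOCK* pECB)
{
    // Build a trivial page from the query string in the extension control block.
    char page[512];
    int len = _snprintf(page, sizeof(page),
                        "<html><body>You asked for: %s</body></html>",
                        pECB->lpszQueryString);
    if (len < 0) len = sizeof(page) - 1;   // page was truncated

    // Send the status line and headers, then the body of the page.
    char headers[] = "Content-Type: text/html\r\n\r\n";
    pECB->ServerSupportFunction(pECB->ConnID, HSE_REQ_SEND_RESPONSE_HEADER,
                                (LPVOID)"200 OK", NULL, (LPDWORD)headers);
    DWORD cb = len;
    pECB->WriteClient(pECB->ConnID, page, &cb, 0);
    return HSE_STATUS_SUCCESS;
}

BOOL WINAPI TerminateExtension(DWORD dwFlags)
{
    return TRUE;   // nothing to clean up in this sketch
}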
ISAPI extensions can run in the IIS process, taking advantage of the performance benefits associated with being close to the controlling application. IIS also allows ISAPI extensions to be run in their own process to get the robustness and security benefits that confers. For best performance, you should run your ISAPI extensions in the same process as IIS if you are sure that your code is stable. ISAPI extensions are loaded on first use and unloaded when the Web service is stopped.
IIS uses an I/O thread pool to process all incoming requests. Requests for static files (.htm, .jpg, and so on) are satisfied immediately, while requests for dynamic content are dispatched to the appropriate ISAPI extension DLL. If the work done by the ISAPI extension DLL blocks the IIS thread passing the request for any length of time, performance will suffer—blocking IIS threads reduces the ability of the Web server to handle further requests. You can avoid this problem in your ISAPI extensions by implementing a thread pool of your own.
When requests come in, they are queued up, allowing the IIS threads to return from HttpExtensionProc extremely quickly. The worker threads created by your ISAPI extension remove the requests from the queue and perform the necessary processing on each request to return the desired results to the client.
Writing an efficient thread pool for Windows NT® 4.0 is relatively easy using the API for manipulating I/O completion ports. You can use CreateIoCompletionPort to create the port used to queue requests, PostQueuedCompletionStatus to post requests to the queue or to post a signal that indicates the pool should shut down, and GetQueuedCompletionStatus to pull items off the queue.
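The following sketch shows how those three calls can fit together. The Request structure, the ProcessRequest handler, and the pool-management functions are placeholders rather than the Nile.com implementation, and a zero completion key is assumed to be the shutdown signal.

#include <windows.h>

struct Request;                      // whatever your extension queues per request
void ProcessRequest(Request* pReq);  // hypothetical per-request handler

static HANDLE g_hPort;               // the completion port acting as the work queue

// Worker thread: block on the queue and process requests until told to shut down.
DWORD WINAPI WorkerProc(LPVOID)
{
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED* pOverlapped;
    for (;;)
    {
        GetQueuedCompletionStatus(g_hPort, &bytes, &key, &pOverlapped, INFINITE);
        if (key == 0)                // zero key == shutdown signal
            break;
        ProcessRequest(reinterpret_cast<Request*>(key));
    }
    return 0;
}

void StartPool(DWORD cThreads)
{
    // Create a completion port with no associated file handle; it is used
    // purely as a thread-safe work queue.
    g_hPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, cThreads);
    for (DWORD i = 0; i < cThreads; ++i)
        CloseHandle(CreateThread(NULL, 0, WorkerProc, NULL, 0, NULL));
    // Thread handles are closed for brevity; a real pool would keep them so
    // that it could wait for the workers to exit at shutdown.
}

// Called from HttpExtensionProc: queue the request and return immediately.
void QueueRequest(Request* pReq)
{
    PostQueuedCompletionStatus(g_hPort, 0, reinterpret_cast<ULONG_PTR>(pReq), NULL);
}

// Post one shutdown signal per worker thread.
void StopPool(DWORD cThreads)
{
    for (DWORD i = 0; i < cThreads; ++i)
        PostQueuedCompletionStatus(g_hPort, 0, 0, NULL);
}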
Note For a great discussion of the benefits of I/O completion ports, as well as an explanation of some essential threading concepts, and tips for writing server applications, see "Writing Scalable Applications for Windows NT" by John Vert.
In the future, you may not have to write your own thread pool. Windows 2000 provides a thread pool implementation through QueueUserWorkItem and related functions.
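For example, a request handler could be handed to the system-supplied pool with a single call. The callback and context in this sketch are placeholders.

#include <windows.h>

// Placeholder work callback; a real extension would process a queued request here.
DWORD WINAPI HandleRequest(LPVOID pContext)
{
    // ... generate and send the page described by pContext ...
    return 0;
}

// Queue the work item to the system thread pool (Windows 2000 and later).
void QueueToSystemPool(LPVOID pContext)
{
    QueueUserWorkItem(HandleRequest, pContext, WT_EXECUTEDEFAULT);
}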
Populating a thread pool with an appropriate number of threads requires experimentation under real-world conditions. The hardware, the amount of processing required for each request, and the user load are all factors that interact to produce a situation requiring a particular thread count. Too many threads will cause cycles to be wasted in context switching; too few will cause response times to rise above acceptable levels.
In our implementation of Nile.com, we found acceptable performance at 24 concurrent threads per processor.
The process of building HTML pages (once the necessary information has been obtained from the database) involves relatively straightforward string manipulations. The four major components of this activity are:

- Allocating a buffer to hold the page
- Converting data to text
- Encoding text as HTML
- Concatenating strings into the page buffer

Each of these steps, while simple, provides an opportunity for optimization. Some of the items you should consider are discussed next.
You need to allocate some memory (a buffer) to store the page that you'll be transmitting to the client. For best performance, you should allocate your page buffer, along with any temporary buffers that you may need, on the stack if possible. Stack allocations are incredibly fast (much faster than using the heap), so use stack memory in preference to heap memory wherever it makes sense to do so.
Note that the amount of available stack memory is more limited than heap memory—each thread's default stack size is only 1 megabyte (MB) whereas the size of the heap is really only limited by hardware. However, 1 MB is more than enough for building the text of a typical HTML page. The source for standard HTML pages is unlikely to exceed 50 kilobytes (KB) in size and the extra overhead of a typical thread's stack variables is not going to push this figure up to 1 MB. It is possible to set a thread's stack size to a larger value if you have control over the thread's creation, so this shouldn't be an issue even for larger pages or complex applications.
C++ provides intrinsic support only for static stack allocations. In other words, when you allocate memory on the stack using C++ array syntax, you need to know at compile time how large the array will be. One advantage of this seeming limitation is that it discourages unnecessary dynamic calculations. Suppose you have a buffer that might need to be 10, 20, or 50 bytes large in different situations. In most cases, the run-time cost of calculating the minimum size of the buffer is much larger than any cost associated with overallocating, so just statically allocate a 50-byte buffer and avoid the run-time calculation. You'll get good performance if you analyze your data and allocate a static stack buffer that's large enough for the biggest page that you'll generate.
Note that you can dynamically allocate stack memory from C++ by using the CRT function _alloca. You get the benefit of dynamic allocation without the overhead of using the heap. This can be a great performance enhancement in certain situations, but there are a couple of things to watch out for. First, the memory allocated by _alloca is freed only when the function from which it was called returns. This means that you have to be careful about calling _alloca in a loop—you don't want to run out of stack memory. Second, you shouldn't call _alloca directly from a C++ or SEH exception handler, because _alloca works on the stack frame and exception handlers have their own memory frame.
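Here is a small sketch of the kind of situation where _alloca can help: making a scratch copy of a string whose length is known only at run time. The function is hypothetical, and the allocation is released automatically when it returns.

#include <malloc.h>    // _alloca
#include <string.h>

// Make a stack-based scratch copy of the query string so it can be
// tokenized in place without touching the heap.
void ParseQueryString(const char* pszQuery)
{
    size_t cb = strlen(pszQuery) + 1;
    char* pszCopy = static_cast<char*>(_alloca(cb));   // dynamic, but on the stack
    memcpy(pszCopy, pszQuery, cb);
    // ... tokenize pszCopy in place ...
}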
However you decide to allocate memory, do ensure that you don't overflow the stack. If you use a string class like MFC's CString or the Standard Library's std::string, you can feel more confident that you won't overflow the buffer, but you face the potential problem of run-time checks and multiple heap allocations as the string increases in size. These string classes may provide useful functionality to your application, but in many cases they are overkill for the simple process of building an HTML page.
When converting data to text, you need a buffer to hold the results of the conversion. You also need a buffer to hold the whole page that will be transmitted to the client. Wherever possible, these buffers should be the same. You can usually convert the data directly into the right position in the page buffer. Don't waste cycles by making the conversion to a temporary buffer and copying the data to the final destination. Streams are useful for this type of conversion.
Encoding algorithms (such as those that convert plain text to HTML format) typically examine every character in a string, so they're expensive. If you know that your data doesn't need to be encoded, don't pass it to an encoding function. For example, numeric data or literal strings shouldn't need to be passed to an HTML encoding algorithm. Numeric data will never contain the characters that HTML requires to be converted and literal strings can be converted ahead of time to avoid doing unnecessary work at run time. For best performance, tailor your encoding algorithm to replace only those elements that you know have a chance of appearing in your data. If you know that your data doesn't include certain elements that usually need encoding, don't search for those elements. Avoid making multiple passes over the text that you're encoding.
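A minimal sketch of such a tailored encoder follows. It assumes that only &, <, and > can occur in the data, makes a single pass, and writes directly into a caller-supplied buffer, which the caller must ensure is large enough for the worst case.

#include <string.h>

// Single-pass HTML encoding of src into dest, replacing only &, < and >.
// Returns a pointer to the terminating null so the caller can keep appending.
char* HtmlEncode(char* dest, const char* src)
{
    for (; *src; ++src)
    {
        switch (*src)
        {
        case '&': memcpy(dest, "&amp;", 5); dest += 5; break;
        case '<': memcpy(dest, "&lt;", 4);  dest += 4; break;
        case '>': memcpy(dest, "&gt;", 4);  dest += 4; break;
        default:  *dest++ = *src;           break;
        }
    }
    *dest = '\0';
    return dest;
}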
If you care about performance, never call strcat (or related functions) more than once on the same destination string. strcat has to navigate the entire length of the destination string to find the terminating null character before copying the source string. Instead, you can maintain a pointer to the last character written and continue writing at that point using strcpy. If you already know the length of a string, you can use that information for further optimization with memcpy. Note that the length of any string literal is known at compile time, so you should never calculate the length of such strings at run time.
The HTML pages for Nile.com were built up in a statically sized stack-allocated character buffer—a data member of the thread class. We defined a simple class to manage writing to the page buffer. This class simply provided a couple of template functions for efficiently writing different data types to the end of the string in the buffer. One template function optimized writes for literal strings. The static portions of the pages were compiled directly into the ISAPI DLL and used the literal string optimization for best performance.
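The sketch below is a simplified reconstruction of that idea rather than the actual Nile.com class. It assumes a 64-KB page buffer and shows how sizing a template on the array length lets literal strings be appended without any run-time strlen or strcat.

#include <string.h>
#include <stdio.h>

const size_t PAGE_BUFFER_SIZE = 64 * 1024;   // assumed large enough for any page

class PageWriter
{
public:
    PageWriter() : m_pEnd(m_page) {}

    // Literal strings: the length is known at compile time, so no strlen call.
    template <size_t N>
    void Write(const char (&literal)[N])
    {
        memcpy(m_pEnd, literal, N - 1);      // N includes the terminating null
        m_pEnd += N - 1;
    }

    // Run-time strings of known length: append with a single memcpy.
    void Write(const char* psz, size_t cch)
    {
        memcpy(m_pEnd, psz, cch);
        m_pEnd += cch;
    }

    // Numbers: convert directly into the page buffer, with no temporary copy.
    void Write(unsigned long n)
    {
        m_pEnd += sprintf(m_pEnd, "%lu", n);
    }

    const char* Text() const   { return m_page; }
    size_t      Length() const { return m_pEnd - m_page; }

private:
    char  m_page[PAGE_BUFFER_SIZE];   // statically sized page buffer
    char* m_pEnd;                     // next write position (no strcat needed)
};

A worker thread could then build a page with calls such as writer.Write("<html><body>"), writer.Write(title, titleLength), and writer.Write(quantity), always appending at the current end of the buffer.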
There are two main choices when transmitting HTML pages to the client from an ISAPI extension—send the data synchronously or transmit it asynchronously.
The advantage of synchronous calls is the low overhead and ease of programming. The disadvantage is that synchronous calls can tie up threads that may need to perform other work.
Asynchronous calls keep the calling threads free to continue with other tasks, but may cause their own inefficiencies in resource management as they increase the complexity of the application. Freeing a thread to continue with its work may not be helpful if the thread is prevented from using a resource by an outstanding asynchronous call or needs to create a new resource to avoid having to wait.
Note If you're interested in performance, you should never return a page through multiple synchronous calls. Each synchronous call corresponds to a round trip across the network. For best performance, you should minimize trips across the network.
In our implementation of Nile.com, most of the data was transmitted to the clients synchronously using a single call to WriteClient for each page. A small number of static pages were returned by asynchronous calls to TransmitFile. All I/O was handled by our pool of worker threads.
We decided early on that the ISAPI DLL created as part of our implementation of Nile.com would be stateless. This had a great simplifying effect on the code and offered huge performance and scalability benefits. Maintaining state and synchronizing its access efficiently across thousands of clients is a hard problem to solve, so rather than write code ourselves, we used the best tool for the job—an enterprise-level database server. Microsoft SQL Server 7.0 was designed to handle exactly this task.
The global data, such as the book details, user information, and orders, was already in the database according to a predefined schema outlined in the @Bench specification. The location of the customer's shopping cart was up to us, so we chose to store this data directly in the database as well.
Note We didn't attempt to implement complex caching schemes for the dynamic data read from or written to the database—we had a fast database server and a good network connection, so we chose against that option without testing. A well-implemented caching scheme might have boosted performance or allowed the database to serve more clients. A badly implemented caching scheme might have damaged performance or killed scalability.
Increasing thread contention is one quick way for a caching scheme to kill performance, so if you're implementing or using a cache, think about providing one cache per thread. Test your application to find out whether any caching scheme that you implement really does have the benefits you expect.
There was a small amount of application state that we passed to the clients to manage. The most important state held on each client was a session ID, which we stored as a cookie. The session ID was used to identify a particular user and shopping cart and was generated when the user logged on to the site.
Without this session ID being passed transparently behind the scenes, we would have had to require users to provide us with their user names and passwords every time they wanted to access some personal information. Storing the ID as a cookie means we can ask for this information just once but continue to identify users without further action on their part.
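One way to return such a cookie from an ISAPI extension is to include a Set-Cookie line in the response headers. The sketch below is an illustration only; the cookie name and header layout are assumptions, not the Nile.com code.

#include <windows.h>
#include <httpext.h>
#include <stdio.h>

// Send the response headers, including a session ID cookie.
void SendHeadersWithSession(EXTENSION_CONTROL_BLOCK* pECB, unsigned long sessionId)
{
    char headers[256];
    _snprintf(headers, sizeof(headers),
              "Content-Type: text/html\r\n"
              "Set-Cookie: SessionID=%lu; path=/\r\n"
              "\r\n",
              sessionId);
    pECB->ServerSupportFunction(pECB->ConnID, HSE_REQ_SEND_RESPONSE_HEADER,
                                (LPVOID)"200 OK", NULL, (LPDWORD)headers);
}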
The only other state involved in the application was required when the user performed a search. In this situation, we'd return the first set of results (up to 10 books) and provide links to return the next 10 books that satisfied the search criteria. If the user navigated forward through the results, we also supplied a link to navigate back through the result set.
Figure 1. Query results
The query string for the links that we wrote to these dynamically generated pages contained all the necessary information to repeat the search and return the next portion of the result set without requiring any state to be stored in the ISAPI extension.
One of the major benefits of this stateless model is that we knew that we wouldn't lose any information in the event of a failure. All important state was stored atomically to the database. The small amount of state stored on each client was easily replaceable.
Note A stateless ISAPI implementation also allows for a great deal of flexibility when load balancing across multiple machines. Clients can have their requests handled by any machine without requiring any special consideration to be given to where the last request was handled. WLBS can handle both stateless and stateful applications (by changing the "affinity" setting), but if you need to implement application-level load balancing logic, the statefulness of your server application will affect the ways in which you can do that.
Database Techniques

Optimizing the database was hugely important to the success of our implementation in the performance tests. There are a large number of database optimizations that can be made for any application. Here are just some of the questions that we asked ourselves during the development of Nile.com:

- Which data access technology should we use?
- Should SQL statements be embedded in the application or held as stored procedures in the database?
- Should statements be prepared or executed directly?
- Should stored procedures return information through result sets or out parameters?
- Which cursor types should we use?
- How should the tables be indexed?
- How should queries be structured to match the application's data?
- How should database connections be managed?

The decisions we made are described next.
We considered both ODBC and OLE DB as our data access technologies. While we felt we could get similar performance from each technology, we ended up choosing ODBC because we knew we would need to do a great deal of bulk, sequential row fetching. OLE DB would have done OK, but we suspected that it would have introduced more network round trips because it does a two-phase fetch (you first get a handle to a row and then the data). Based on our final configuration, we felt either technology would have worked well.
To work with ODBC, we created a set of simple wrapper classes to manage the ODBC environment, connection, and statement handles. Each class exposed only the functionality required for our application, which resulted in about eight methods per class.
All SQL statements in our implementation were held as stored procedures in the database. Using stored procedures simplifies database management and improves performance; SQL Server can precompile stored procedures to improve execution times.
One particular benefit of using stored procedures is that it provides an important encapsulation and abstraction layer. The successful interaction between the database schema and the stored procedures is vital to achieving great performance. Indexes on the database should be tuned to match particular procedures and their performance requirements. It makes sense to keep these highly dependent items physically connected and centrally managed by storing the procedures in the database.
Note that it's important to bind the parameters used by your stored procedures to variables within your code rather than passing a text string containing the parameters' values. In other words, the statement you use to call the stored procedure should look like this:
"{ CALL sp_Whatever(?) }"
The value of the parameter denoted by the question mark should be passed by a variable. It shouldn't look like this:
"{ CALL sp_Whatever(3) }"
The reason you should avoid passing parameter values as part of the text string is twofold: it keeps type conversions to a minimum and keeps them on the client machines, and it enables SQL Server's caching engine to recognize that a statement has been seen before and avoid compiling it again. Both effects reduce the load on the database server's processors.
Note ODBC provides the SQLBindParameter function to allow you to bind parameters to variables in your code. OLE DB and ADO provide similar functionality with similar performance benefits.
Note the use of the CALL syntax here. For best performance, you should use this same syntax to call stored procedures in your own code. The SQL Server ODBC driver is optimized to use a remote procedure call (RPC) mechanism to call the procedure when the ODBC CALL syntax is used, which is more efficient than the mechanism used to send a Transact-SQL Execute statement to the server. This syntax allows multiple SQL statements to be batched, minimizing network round trips.
Note For more information, see "Batching Stored Procedure Calls."
In ODBC applications, it is possible to execute an SQL statement directly using SQLExecDirect, or to prepare a SQL statement and execute it subsequently using SQLPrepare and SQLExecute. Prepared execution removes the overhead associated with repeatedly parsing and compiling an SQL statement, so it is an important optimization technique in some situations. However, it is not appropriate when calling stored procedures. Applications targeting SQL Server should only use direct execution to call a stored procedure because preparing a statement requires an extra round trip across the network and the building of an execution plan that simply calls the stored procedure execution plan. This is unnecessary overhead. In our Nile.com implementation, all SQL statements called stored procedures, so all of them were executed directly using SQLExecDirect.
Note ADO and OLE DB provide similar functionality to ODBC—statements can be prepared or executed directly. The guidelines presented here apply to those technologies also.
Stored procedures can return information through result sets or through out parameters. We implemented the stored procedures to return information through out parameters wherever it was practical to do so to avoid the overhead of creating a result set where it wasn't needed.
If you decide to use out parameters in your own code, note that the values returned by out parameters are not guaranteed to be set until after all results returned by the procedure have been fetched and SQLMoreResults has returned SQL_NO_DATA.
Here you can see some code that returns an integer through an out parameter:
SQLUINTEGER dwID = 0;    // receives the value of the out parameter
SQLLEN      cbID = 0;    // length/indicator buffer for the parameter
// Bind parameter 1 as an output parameter mapped to dwID.
SQLRETURN ret = SQLBindParameter(hStatement, 1, SQL_PARAM_OUTPUT,
                                 SQL_C_ULONG, SQL_INTEGER, 0, 0,
                                 &dwID, 0, &cbID);
// Execute the stored procedure directly, using the ODBC CALL syntax.
ret = SQLExecDirect( hStatement,
                     (SQLCHAR*) "{ CALL sp_GetID(?) }",
                     SQL_NTS);
// Fetch remaining results so the out parameter value is guaranteed to be set.
ret = SQLMoreResults(hStatement);
SQLBindParameter binds the parameter to the variable dwID; SQLExecDirect executes the stored procedure; and SQLMoreResults ensures that the value of the out parameter is present in the bound variable.
ODBC provides a couple of ways of retrieving the data held in a result set. In our implementation of Nile.com, we used SQLBindCol and SQL_ATTR_ROW_ARRAY_SIZE to perform row-wise binding, efficiently returning all the results that we needed by attaching the result set to an array of variables within our code.
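The sketch below illustrates that fetching pattern using row-wise binding. The column layout (a title and a price), the BookRow structure, and the stored procedure sp_SearchBooks are assumptions made for the example; error handling is omitted.

#include <windows.h>
#include <sql.h>
#include <sqlext.h>

// One row of the assumed result set: a book title and its price.
struct BookRow
{
    SQLCHAR   title[101];
    SQLLEN    cbTitle;
    SQLDOUBLE price;
    SQLLEN    cbPrice;
};

void FetchBooks(SQLHSTMT hStmt)
{
    const SQLULEN ROWS_PER_FETCH = 10;
    BookRow rows[ROWS_PER_FETCH];
    SQLULEN rowsFetched = 0;

    // Describe the row layout so that ODBC fills the array one row at a time.
    SQLSetStmtAttr(hStmt, SQL_ATTR_ROW_BIND_TYPE,    (SQLPOINTER)sizeof(BookRow), 0);
    SQLSetStmtAttr(hStmt, SQL_ATTR_ROW_ARRAY_SIZE,   (SQLPOINTER)ROWS_PER_FETCH, 0);
    SQLSetStmtAttr(hStmt, SQL_ATTR_ROWS_FETCHED_PTR, &rowsFetched, 0);

    // Bind the first row's members; ODBC derives the rest from the row size.
    SQLBindCol(hStmt, 1, SQL_C_CHAR,   rows[0].title,  sizeof(rows[0].title), &rows[0].cbTitle);
    SQLBindCol(hStmt, 2, SQL_C_DOUBLE, &rows[0].price, 0,                     &rows[0].cbPrice);

    SQLExecDirect(hStmt, (SQLCHAR*)"{ CALL sp_SearchBooks }", SQL_NTS);

    while (SQL_SUCCEEDED(SQLFetch(hStmt)))
    {
        for (SQLULEN i = 0; i < rowsFetched; ++i)
        {
            // ... write rows[i].title and rows[i].price into the page buffer ...
        }
    }
}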
Read-only, forward-only cursors as implemented by the SQL Server 7.0 ODBC driver are highly optimized for speed and should be used wherever possible. They begin pushing the entire rowset down even before data is fetched. These cursor settings are the default.
Read-only, forward-only cursors may also be appropriate even when there is some requirement for navigating through a result set. In the case of the book search results, where the results were shown 10 records at a time, we actually found it was slightly more efficient to bind all the results of the search and then use code in our ISAPI DLL to skip ahead to the interesting records than it was to try to limit the result set on the server or use a different cursor type to perform the navigation.
Applying the right indexes to your tables is one of the most important optimization techniques available to database developers. There can be order of magnitude differences in query execution times between badly indexed tables and correctly indexed tables, so it's important to get it right.
Here are a few simple rules that you can follow to create a covering index—an index that will provide best performance for a particular SELECT statement:

- First, add the columns used in the ORDER BY clause.
- Next, add the columns used in the WHERE clause.
- Finally, add the remaining columns referenced in the SELECT list.
Note The order of columns in an index is important, so add the columns in the sequence indicated. Where there are multiple columns of each type, add them in the order used in the query.
As an example, suppose you have the following query:
SELECT column1, column3, column4 FROM table1 WHERE column2 = value ORDER BY column5
You could create an index for this query like this:
CREATE INDEX index1 ON table1 (column5, column2, column1, column3, column4)
You can see that the column used to order records appears first, the column used to filter records in the WHERE clause appears next in the index, and subsequent columns appear in the index in the same order as they did in the query.
Note that applying lots of indexes to tables that change frequently can be the wrong thing to do. Good indexing helps improve the execution time of SELECT statements, but too many indexes can really hurt queries that INSERT or UPDATE records. When records are changed, indexes must also be updated. The more indexes that must be updated, the slower the query becomes.
The only way to determine whether an index really helps the performance of your application is to test it under real-world conditions. However, you can get close to the optimal indexing scheme more quickly by following the guidelines above and by using SQL Server 7.0's Index Tuning Wizard to help you tune your indexes based on workload data generated during full load testing of your database. If you ensure that the data you make available to the Index Tuning Wizard is a realistic representation of the environment you expect your application to run in, you can get good answers to questions you have about the best index for a table that is both read and updated frequently.
In our implementation of Nile.com, we were able to provide a covering index optimized for each of the stored procedures used in the application. This more than doubled overall performance.
Just as it's important to match your indexes to your queries, it's also important to optimize your queries to match the data being used by your application. It's vital to develop an understanding of how the data in your database actually relates to the data used by your application in any particular situation.
When you're designing your queries, don't think solely in terms of the data in the database that you need for your application, think also about the records in the database that aren't being used by your application. Look at ways of excluding that data from your result sets as early as possible to avoid the weight of that unused data. In particular, make sure that you exclude unused data before performing expensive operations such as joins or ordering clauses, and avoid any query that will result in a table scan.
You can use the SQL Server Query Analyzer to provide you with important feedback on your queries. Use this tool to find out where the query processor is spending its time so that you can eliminate those bottlenecks wherever possible. Avoid joining tables and changing the order of your results if you can. Think creatively about ways of eliminating work or moving the load between machines. Do your application's users really need access to 260,000 search results or will the top 50 matches do? Does your middle tier or front end have the capacity to do work that's currently being done at the database?
Opening and maintaining an ODBC connection has a cost associated with it. It takes some time to establish a connection, and each open connection uses memory on the database server and the client. High performance server applications have to be able to scale up to thousands of users, so you can't open a connection for each client. Similarly, a large Web site needs to handle thousands of requests per second, so opening and closing a connection on each request isn't a practical alternative. For best performance, connections need to be limited in number and long-lived.
The easiest way for us to handle this requirement in our implementation of Nile.com was for each thread in the pool to manage its own connection to the database. The lifetime of the thread defined the lifetime of the connection. The implementation was incredibly easy—each thread in the pool was represented by an instance of a class, each database connection was a data member of that class. Initialization of the database connection was carried out in the constructor of the class representing the worker thread.
One benefit of this approach is that there's no contention between the worker threads for a limited number of connections, which would have been the case if the ODBC connections had been pooled independently of the worker threads. By giving each thread responsibility for its own database connection, we successfully avoided any expensive and complex synchronization that other techniques would have introduced.
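In outline, the arrangement looked something like the sketch below. The class name, connection string, and member layout are placeholders rather than the actual Nile.com code.

#include <windows.h>
#include <sql.h>
#include <sqlext.h>

// One instance per worker thread: the database connection is acquired in the
// constructor and held for the life of the thread, so no synchronization is
// needed to share connections.
class WorkerThread
{
public:
    explicit WorkerThread(SQLHENV hEnv)
    {
        SQLAllocHandle(SQL_HANDLE_DBC, hEnv, &m_hConnection);
        SQLDriverConnect(m_hConnection, NULL,
                         (SQLCHAR*)"DSN=Nile;UID=web;PWD=secret;",  // placeholder connection string
                         SQL_NTS, NULL, 0, NULL, SQL_DRIVER_NOPROMPT);
    }

    ~WorkerThread()
    {
        SQLDisconnect(m_hConnection);
        SQLFreeHandle(SQL_HANDLE_DBC, m_hConnection);
    }

private:
    SQLHDBC m_hConnection;      // owned exclusively by this thread
    char    m_page[64 * 1024];  // per-thread page buffer (see the page-building discussion earlier)
};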
Microsoft supplies built-in pooling for ODBC (connection pooling) or OLE DB (resource pooling) to help you manage your database connections. Using these pools makes sense if the number of clients that might need access to a particular type of connection is much greater than the number of available connections, or if clients that need access to the database connections are constantly being created or destroyed. Neither of these situations applied to our implementation of Nile.com—the number of clients (threads) was fixed at a level that was also appropriate for the number of database connections, and each client (thread) was long-lived so it didn't need to spend time reacquiring the connection when the connection was needed.
Development Techniques

By default, IIS 4.0 caches ISAPI extension DLLs. This is great on a production Web site because it means good performance, but not so great on a test server because it prevents you from replacing your DLLs freely during development. There are four main ways of getting around this problem:

- Stop and restart the Web service whenever you need to replace the DLL.
- Stop and restart the Web site that uses the DLL.
- Disable ISAPI caching on the server.
- Write a loader DLL that loads and unloads the real ISAPI extension on your behalf.
During the development of Nile.com, we used the last option. We wrote a simple loader DLL in the form of an ISAPI extension that would load another ISAPI DLL and unload it after 2.5 seconds of inactivity. This enabled us to quickly and easily rebuild and redeploy the ISAPI extension without having to stop and restart the Web service or the Web site, and without having to endure the terrible performance imposed by disabling ISAPI caching.
Note that we didn't use the loader DLL during performance tuning, only during initial development and debugging. The performance of the loader was good for simple functional tests, but not appropriate for a high-load, high-performance Web site.
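A much-simplified sketch of the forwarding part of such a loader is shown below. The inactivity timer, synchronization, and unloading logic are omitted, the path to the target DLL is a placeholder, and the entry points would need to be exported from the loader DLL.

#include <windows.h>
#include <httpext.h>

typedef BOOL  (WINAPI* PFN_GETEXTENSIONVERSION)(HSE_VERSION_INFO*);
typedef DWORD (WINAPI* PFN_HTTPEXTENSIONPROC)(EXTENSION_CONTROL_BLOCK*);

static HMODULE               g_hTarget;
static PFN_HTTPEXTENSIONPROC g_pfnProc;

// Load (or reload) the real extension on demand. The real loader paired this
// with a timer that freed the library after a short period of inactivity, so
// a rebuilt DLL could be picked up without restarting the Web service.
static void EnsureLoaded()
{
    if (!g_hTarget)
    {
        g_hTarget = LoadLibraryA("C:\\nile\\nile.dll");   // placeholder path
        g_pfnProc = (PFN_HTTPEXTENSIONPROC)GetProcAddress(g_hTarget, "HttpExtensionProc");

        HSE_VERSION_INFO ver;
        PFN_GETEXTENSIONVERSION pfnVer =
            (PFN_GETEXTENSIONVERSION)GetProcAddress(g_hTarget, "GetExtensionVersion");
        pfnVer(&ver);   // let the target initialize itself
    }
}

BOOL WINAPI GetExtensionVersion(HSE_VERSION_INFO* pVer)
{
    pVer->dwExtensionVersion = MAKELONG(HSE_VERSION_MINOR, HSE_VERSION_MAJOR);
    lstrcpynA(pVer->lpszExtensionDesc, "Loader", HSE_MAX_EXT_DLL_NAME_LEN);
    return TRUE;
}

// Forward every request to the target DLL, loading it first if necessary.
DWORD WINAPI HttpExtensionProc(EXTENSION_CONTROL_BLOCK* pECB)
{
    EnsureLoaded();
    return g_pfnProc(pECB);
}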
About the Author

Alex Stockton worked on the Microsoft implementation of Nile.com that won the Doculabs/PC Week application server shootout. Alex is a programmer/writer in the Visual C++ group.
--------------------------------------------
This document is provided for informational purposes only and Microsoft Corporation makes no warranties, either express or implied, in this document. The entire risk of the use or the results of the use of this document remains with the user. Companies, names, and data used in examples herein are fictitious unless otherwise noted. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Microsoft Corporation.
Copyright 1999 Microsoft Corporation. All rights reserved.
Microsoft, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.