April 1999

| New Windows 2000 Pooling Functions Greatly Simplify Thread Management | 
| With Windows 2000, Microsoft has added new thread pooling functions to make thread creation, destruction, and general management easier. In many, but not all cases, this new general-purpose thread pool fits the bill and can save you countless hours of development time. | 
| This article assumes you're familiar with C++, Win32 | 
| Code for this article: ThreadPool.exe (2KB) 
 
Jeffrey Richter wrote Advanced Windows, Third Edition (Microsoft Press, 1997) and Windows 95: A Developer's Guide (M&T Books, 1995). Jeff is a consultant and teaches Win32 programming courses (www.solsem.com). He can be reached at www.jeffreyRichter.com. 
 | 
| By now, everybody knows that creating multithreaded applications is very difficult. The way I see it, there are two big obstacles: synchronizing thread access to resources and managing the creation and destruction of threads. For synchronizing resource access, Windows® offers many primitives to help you: events, semaphores, mutexes, critical sections, and so on. These are all fairly easy to use as they are. If the system could automatically protect shared resources, it would make things even easier. Unfortunately, there's a ways to go before Windows can offer this protection in a manner that makes everybody happy.

Likewise, everybody has their own opinions on how to manage the creation and destruction of threads. I have created several different implementations of thread pools over the past years. Each implementation was fine-tuned for the particular scenario that I was addressing at the time. With Windows 2000, Microsoft has added new thread pooling functions to make thread creation, destruction, and general management easier. This new general-purpose thread pool is definitely not the right thing for every situation, but in many cases it does fit the bill and can save you countless hours of development time.

The new thread pooling functions attempt to address four scenarios:

- Queuing execution of asynchronous functions
- Calling functions at periodic timer intervals
- Calling functions when single kernel objects become signaled
- Calling functions when asynchronous I/O requests complete
To accomplish all of these tasks, the thread pool actually consists of four separate components. Figure 1 shows the four components and the rules that govern their behavior. When a process initializes, it obviously doesn't get any of the overhead associated with these components. However, as soon as one of the new thread pooling functions is called, some of these components are created for the process and some stay around until the process terminates. As you can see, the overhead of using the thread pool is not trivial; there are quite a few threads and internal data structures that will become part of your process. So it is very important that you carefully consider what the thread pool will and won't do for you; don't just blindly start using these functions.

OK, enough with the disclaimer junk. Let's move on to what this stuff does and how to use it.

Scenario 1: Queuing Execution of Asynchronous Functions

Let's say that you have a server process with a main thread that waits for a client's request. Upon receipt of this request, the main thread spawns a separate thread responsible for handling the client's request. This allows your application's main thread to cycle around and wait for another client's request. What I have just described is a very typical implementation of a client/server application. This scenario is very straightforward to implement, but here's how you could implement it using the new thread pool functions. When the server process's main thread receives the client's request, it could call this function: | 
|  | 
| When you call this function, it simply queues a "work item" to a thread in the thread pool and returns immediately. A work item simply means that a function is called (identified by the pfnCallback parameter) and that it's passed a single parameter, pvContext. Eventually, some thread in the pool will process the work item, causing your function to be called. The callback function that you write must have the following prototype: | 
|  | 
| Even though you must prototype this function as returning a DWORD, the return value is actually ignored. The thing to notice here is that you never called CreateThread yourself. A thread pool was created for your process automatically and some thread within the pool called your function. Also, this pool thread will not be destroyed immediately after processing the client's request. Instead, it goes back into the pool so that it is ready to handle any other queued work items. Your application may be much more efficient now because you are not creating and destroying threads for every single client request. Also, because the threads are bound to a completion port, the number of concurrently runnable threads is limited to 2 times the number of CPUs. This reduces thread context switches.

Personally, I think that this is a little scary; there are threads in my process that are doing things on their own. I'm used to creating threads myself and managing them as I see fit. Don't get me wrong; I think that the thread pooling functions are great and I plan to use them for lots of my own projects. Just be careful and think about what the system should be doing for you. Don't just work with it blindly.

What's happening under the covers is that QueueUserWorkItem checks the number of threads that are in the non-I/O component and, depending on the load (the number of queued work items), it may add another thread to this component. Now, QueueUserWorkItem performs the equivalent of calling PostQueuedCompletionStatus, passing your work item information to an I/O completion port. Ultimately, a thread waiting on the completion port extracts your message (because it calls GetQueuedCompletionStatus) and calls your function. When your function returns, the thread calls GetQueuedCompletionStatus again, waiting for another work item.

It is expected that the thread pool will frequently be used to handle asynchronous I/O requests. 
An asynchronous I/O request means that a thread queues an I/O request to a device driver. While the device driver is busy performing the I/O, the thread that queued the request is not blocked and may continue executing other things. Asynchronous I/O is the secret to creating high-performance, scalable applications because it allows a single thread to handle requests from various clients as they come in; the thread doesn't have to handle each client's request serially and doesn't have to block, waiting for I/O requests to complete. However, Windows has a restriction that is placed on asynchronous I/O requests: if a thread issues an asynchronous I/O request to a device driver and then the thread terminates, the I/O request is lost and no thread will be notified when the I/O request actually completes. A well-designed thread pool allows the number of threads in it to expand and shrink depending on the needs of its clients. So, if a thread issues an asynchronous I/O request and the thread dies because the pool is shrinking, the I/O request dies too. This is usually not desired and some solution is required. If you want to queue a work item that will issue an asynchronous I/O request, you cannot have the work item posted to the thread pool's non-I/O component. Instead, you must queue the work item to the I/O component of the thread pool. The I/O component consists of a set of threads that never die if they have any pending I/O requests, and for this reason it should only be used for executing code that will have pending asynchronous I/O requests. To queue a work item for the I/O component, you still call QueueUserWorkItem, but for the Flags parameter you must pass WT_EXECUTEINIOTHREAD. Normally, you'll just pass WT_EXECUTEDEFAULT (defined as 0) for the Flags parameter, which causes the work item to be posted to the non-I/O component's threads. Another feature of a well-designed thread pool is that it always tries to keep threads available to handle requests. 
If a pool contains four threads and 100 work items get queued, only four work items can be handled at a time. This may not be a problem if a work item takes only a few milliseconds to execute, but if your work items require much more time, you will stop handling requests in a timely fashion. Certainly, the system can't be smart enough to anticipate what your work item functions are going to do, but if you know that a work item may take a long time to execute, you should call QueueUserWorkItem, passing it the WT_EXECUTELONGFUNCTION flag. This flag helps the thread pool decide whether it should add a new thread into the pool. When you use the WT_EXECUTELONGFUNCTION flag, it forces the thread pool to always create a new thread if all of the threads in the pool are busy. So if you queue 10,000 work items (with the WT_EXECUTELONGFUNCTION flag) at the same time, then 10,000 threads get added to the thread pool. If you don't want 10,000 threads created, you must space out the calls to QueueUserWorkItem so that some work items get a chance to complete.

The thread pool can't place a maximum limit on the number of threads in the pool or starvation/deadlock could occur. Imagine queuing 10,000 work items that all block on an event that will be signaled by item 10,001. If there were a maximum of 10,000 threads, work item 10,001 couldn't be executed and all 10,000 threads would be blocked forever.

While using the thread pool functions, you must always be looking for potential deadlock situations. Of course, you must be careful if your own work item functions block on critical sections, semaphores, mutexes, and so on; this makes deadlock much more possible. Always be aware of which component's thread (I/O, non-I/O, wait, or timer) is executing your code.

Also, be very careful if your work item functions are in DLLs that may be unloaded dynamically. A thread that calls a function in an unloaded DLL will generate an access violation. 
To ensure that you do not unload a DLL with queued work items, you must reference count your queued work items. Increment a counter before you call QueueUserWorkItem and decrement the counter as your work item function completes. Only if the reference count is 0 is it safe to unload the DLL.

Scenario 2: Calling Functions at Periodic Timer Intervals

Sometimes applications need to perform certain tasks at certain times. Windows NT® 4.0 introduced a waitable timer kernel object that made it easy to get a time-based notification. Many programmers create a waitable timer object for each time-based task that the application needs to perform. This is not only unnecessary, it is quite wasteful of system resources. It is possible to create a single waitable timer, set it to the next due time, reset the timer for the next time, and so on. Granted, the code to accomplish this is a little tricky. But you don't have to do it. Instead, you can let the new thread pool functions manage this for you. To schedule a work item to be executed at a certain time, you first create a timer queue by calling: | 
|  | 
| A timer queue is a way for you to organize a set of timers. For example, imagine a single executable file that hosts several services. Each of these services may require timers to fire to help them maintain their state such as when a client is no longer responding, when to gather and update some statistical information, and so on. It would be inefficient to have a waitable timer and dedicated thread for each service. Instead, each service could have its own timer queue (a lightweight resource) and share the timer component's thread and waitable timer object. In addition, when a service terminates, it can just delete its timer queue, which deletes all the timers created in it. Once you have an existing timer queue, you can create timers in it with: | 
|  | 
| For the second parameter, pass the handle of the timer queue in which you want to create this timer. If you have just a few timers that you're creating, you can simply pass NULL for the TimerQueue parameter and avoid the call to CreateTimerQueue altogether. Passing NULL here tells the function to use a default timer queue and simplifies your coding effort. The pfnCallback and pvContext parameters indicate what function should be called and what should be passed to that function when the time comes due. The DueTime parameter indicates how many milliseconds should pass before the function is called the first time. (A value of 0 causes the function to be called as soon as possible, making this function similar to QueueUserWorkItem.) The Period parameter indicates how many milliseconds should pass before the function is called in the future. Passing 0 for the Period makes this a one-shot timer, causing the work item to be queued only once. The handle of the new timer is returned via the function's phNewTimer parameter. The worker callback function must have the following prototype: | 
|  | 
| When this function is called, the TimerOrWaitFired parameter will always be TRUE, indicating that the timer had fired.

Now, let's talk about CreateTimerQueueTimer's Flags parameter. This parameter tells the function how to queue the work item when the time comes due. You can use WT_EXECUTEDEFAULT if you want a non-I/O component thread to process the work item, or WT_EXECUTEINIOTHREAD if you want to wait on an asynchronous I/O request at a certain time. You can also use WT_EXECUTELONGFUNCTION if you think that your work item will require a long time to execute.

There is another flag, WT_EXECUTEINTIMERTHREAD, that requires a bit more explaining. From Figure 1, you see that there is a timer component to the thread pool. This component is responsible for creating the single waitable timer kernel object and for managing its due time. This component always consists of just a single thread. When you call CreateTimerQueueTimer, you are causing the timer component's thread to wake up, add your timer to a queue of timers, and reset the waitable timer kernel object. The timer component's thread then goes into an alertable sleep, waiting for the waitable timer to queue an Asynchronous Procedure Call (APC) to it. After the waitable timer queues the APC, the thread wakes, updates the timer queue, resets the waitable timer, and then decides what to do with the work item that should now be executed. Then the thread checks for the WT_EXECUTEDEFAULT, WT_EXECUTEINIOTHREAD, WT_EXECUTELONGFUNCTION, and WT_EXECUTEINTIMERTHREAD flags.

It should be obvious what the WT_EXECUTEINTIMERTHREAD flag does: it causes the timer component's thread to execute the work item. While this makes execution of the work item more efficient, it is very dangerous! If the work item function blocks for a long time, the timer component's thread can't do anything else. 
Note that the waitable timer may still be queuing APC entries to the thread, but these work items won't be handled until the currently executing function returns. If you are going to execute code using the timer thread, the code should execute quickly and should not block. The WT_EXECUTEINIOTHREAD and WT_EXECUTEINTIMERTHREAD flags are mutually exclusive. If you don't pass either flag (or use the WT_EXECUTEDEFAULT flag), then the work item will be queued to the non-I/O component's threads. Also, the WT_EXECUTELONGFUNCTION flag is ignored if the WT_EXECUTEINTIMERTHREAD flag is specified. When you no longer want a timer to fire, you must delete the timer by calling: | 
|  | 
| You must call this function even for one-shot timers that have fired. The TimerQueue parameter indicates which queue the timer is in. The Timer parameter identifies the timer you want to delete. The handle was returned by an earlier call to CreateTimerQueueTimer. The last parameter allows you to know when there are no outstanding work items queued because of this timer. If you pass INVALID_HANDLE_VALUE for the CompletionEvent parameter, DeleteTimerQueueTimer will not return until all queued work items for this timer have executed completely. Think about what this means; if you do a blocking delete of a timer during its own work item processing, you will create a deadlock situation, right? You are waiting for the work item to finish processing, but you are halting its processing while waiting for it to finish! You can only do a blocking delete of a timer if you are not the thread processing the timer's work item. Also, if you are executing via the timer component's thread, you should not attempt a blocking delete of any timer or deadlock will occur. Attempting to delete a timer queues an APC notification to the timer component's thread. If this thread is waiting for a timer to be deleted, it can't also be deleting the timer, so deadlock occurs. Instead of passing INVALID_HANDLE_VALUE for the CompletionEvent parameter, you can pass NULL. This tells the function that you want the timer deleted as soon as possible. In this case, DeleteTimerQueueTimer returns immediately, but you will not know when all of this timer's queued work items have completed processing. Finally, you can pass the handle of an event kernel object as the CompletionEvent parameter. When you do this, DeleteTimerQueueTimer again returns immediately and the timer component's thread will set the event after all of the timer's queued work items have completed processing. 
Make sure that before calling DeleteTimerQueueTimer, the event is not signaled or your code will think that the queued work items have executed before they really have. Once you have created a timer, you may want to alter its due time or period by calling: | 
|  | 
| Here, you pass the handle of a timer queue and the handle of an existing timer that you want to modify. You can change the timer's DueTime and Period. Note that attempting to change a one-shot timer that has already fired has no effect. Also note that you can freely call this function without having to worry about deadlock. When you no longer have a need for a set of timers, you can delete a timer queue by calling: | 
|  | 
| This function takes the handle of an existing timer queue and deletes all of the timers in it so that you don't have to call DeleteTimerQueueTimer explicitly for every timer in the queue. The CompletionEvent parameter has the same semantics as it does for the DeleteTimerQueueTimer function. This means that the same deadlock possibilities exist. Be careful.

Before I move on to the next scenario, let me point out a couple of additional notes. First, the timer component of the thread pool creates the waitable timer so that it queues APC entries versus signaling the object. This means that the operating system is queuing APC entries continuously and that timer events are never lost. So setting a periodic timer guarantees that your work item is queued at every interval. If you create a periodic timer that fires every 10 seconds, your callback function will get called every 10 seconds. Be aware that this will happen using multiple threads and you may have to synchronize portions of your work item function. If you don't like this behavior and you'd prefer that your work items be queued 10 seconds after each one executes, you should create one-shot timers at the end of your work item function. Alternatively, you can create a single timer with a high timeout value and call ChangeTimerQueueTimer at the end of the work item function.

The code shown in Figure 2 demonstrates how to implement a message box that automatically closes itself if the user doesn't respond within a certain amount of time.

Scenario 3: Calling Functions when Single Kernel Objects Become Signaled

Microsoft has discovered that there are many applications that spawn threads simply to wait for a kernel object to become signaled. Once the object is signaled, the thread posts some sort of notification to another thread and then loops back, waiting for the object to signal again. Some developers even write code where they have several threads, each one waiting on a single object. 
If you have many objects that you're waiting on, this is incredibly wasteful of system resources. Sure, there is a lot less overhead in creating threads versus creating processes, but threads are not free. Each thread has a stack, and there are a lot of CPU instructions required to create and destroy threads. You should always try to minimize this. If you want to register a work item to be executed when a kernel object is signaled, you can use another new thread pooling function: | 
|  | 
| Calling this function communicates your parameters to the wait component of the thread pool. Here, you are telling this component that you want a work item queued when the kernel object (identified by hObject) is signaled. You can also pass a timeout value so that the work item is queued in a certain amount of time even if the kernel object does not become signaled. Timeout values of 0 and INFINITE are legal here. Basically, this function works like the well-known WaitForSingleObject function. After registering a wait, this function returns a handle (via the phNewWaitObject parameter) identifying the wait. Internally, the wait component uses WaitForMultipleObjects to wait for the registered objects, and is bound by any limitations that already exist for this function. One such limitation is the inability to wait for a single handle multiple times. So if you want to register a single object multiple times, you must call DuplicateHandle and register the original handle and the duplicated handle individually. Of course, WaitForMultipleObjects is waiting for any one of the objects to be signaled versus waiting for all of the objects to be signaled. For those of you familiar with WaitForMultipleObjects, you know that it can only wait on at most 64 (MAXIMUM_WAIT_OBJECTS) objects at one time. So, what happens if you register more than 64 objects with RegisterWaitForSingleObject? The answer is that the wait component adds another thread that also calls WaitForMultipleObjects. In reality, every 63 objects require that another thread be added to this component because the threads need to also wait on a waitable timer object, which controls the timeouts. When the work item is ready to be executed it is, by default, queued to the non-I/O component's threads. One of those threads will eventually wake and can call your function, which must have the following prototype: | 
|  | 
| The TimerOrWaitFired parameter will be TRUE if the wait timed out and FALSE if the object became signaled while waiting for it.

For RegisterWaitForSingleObject's dwFlags parameter, you can pass WT_EXECUTEINWAITTHREAD, which causes one of the wait component's threads to execute the work item function itself. This is more efficient because the work item doesn't have to be queued to the non-I/O component, but is dangerous because the wait component's thread that is executing your work item function can't be waiting for other objects to be signaled. You should only use this flag if your work item function executes very quickly.

You can also pass the WT_EXECUTEINIOTHREAD flag if your work item is going to have a pending asynchronous I/O request. The WT_EXECUTELONGFUNCTION flag can also be used to tell the thread pool your function may take a long time to execute and that it should consider adding a new thread to the pool. This flag can only be used if the work item is being posted to the non-I/O or I/O components; you should not execute a long function using a wait component's thread.

The last flag that you should be aware of is WT_EXECUTEONLYONCE. Say that you register a wait on a process kernel object. Once that process object becomes signaled, it stays signaled. This will cause the wait component to queue work items continuously. For a process object, you probably do not want this behavior and you can prevent it using the WT_EXECUTEONLYONCE flag. This flag tells the wait component to stop waiting on the object after its work item has executed once.

In contrast to WT_EXECUTEONLYONCE, let's say that you are waiting on an autoreset event kernel object. Once this object becomes signaled, the object is reset to its non-signaled state and its work item is queued. At this point, the object is still registered and the wait component waits again for the object to be signaled or for the timeout (which got reset) to expire. 
When you no longer want the wait component to wait on your registered object, you must unregister it. This is true even for waits that were registered with the WT_EXECUTEONLYONCE flag and have queued work items. You unregister a wait by calling: | 
|  | 
| The first parameter indicates a registered wait (as returned from RegisterWaitForSingleObject), and the second parameter indicates how you want to be notified when all queued work items for the registered wait have completed execution. Just like the DeleteTimerQueueTimer function, you can pass NULL (if you don't want a notification), INVALID_HANDLE_VALUE (to block the call until all queued work items have executed), or the handle of an event object (which gets signaled when the queued work items have executed). For a non-blocking call, if there are no queued work items, UnregisterWaitEx returns TRUE; otherwise, if some queued work items exist, it returns FALSE and GetLastError returns STATUS_PENDING.

Again, you must be careful when passing INVALID_HANDLE_VALUE to UnregisterWaitEx to avoid deadlock. A work item function shouldn't block itself while attempting to unregister the wait that caused the work item to execute. This is the equivalent of saying: suspend my execution until I'm done executing, which is deadlock. However, UnregisterWaitEx is designed to avoid deadlocking if the work item is being executed by a wait component's thread and you are unregistering the wait whose work item you're currently executing.

One more thing, do not close the kernel object's handle until unregistering the wait has completed. Closing the handle before the wait is unregistered makes the handle invalid and the wait component's thread will then call WaitForMultipleObjects internally, passing an invalid handle. WaitForMultipleObjects will now always fail immediately and the entire wait component will no longer function properly.

Finally, you should not call PulseEvent to signal a registered event object. If you do, it is very likely that the wait component's thread will be busy doing something and the pulse will be missed. This problem should not be new to you; PulseEvent exhibits this problem with almost all threading architectures. 

Scenario 4: Calling Functions when Asynchronous I/O Requests Complete

The last scenario is the very common one where your server application has some pending asynchronous I/O requests. When these requests complete, you want to have a pool of threads ready to process the completed I/O requests. This is the architecture that I/O completion ports were originally designed for. If you were managing your own thread pool, you would create an I/O completion port and create a pool of threads that wait on this port. You would also open a bunch of I/O devices and associate their handles with the completion port. As asynchronous I/O requests complete, the device drivers queue the work items to the completion port. This is a great architecture that allows for a few threads to handle several work items efficiently, and it's fantastic that the thread pooling functions have this built in, saving you a lot of time and effort developing this yourself.

To take advantage of this architecture, all you have to do is open your device and associate it with the non-I/O component of the thread pool. Remember the non-I/O component's threads all wait on an I/O completion port. To associate a device with this component, you call: | 
|  | 
| Internally, this function calls CreateIoCompletionPort, passing it the FileHandle and the handle of the internal completion port. Calling this function also guarantees that there is always at least one thread in the non-I/O component. The CompletionKey associated with this device will be the address of the overlapped completion routine. This way, whenever I/O to this device completes, the non-I/O component knows which function to call so that it can process the completed I/O request. The completion routine must have the following prototype: | 
|  | 
| Notice that you do not pass an OVERLAPPED structure to BindIoCompletionCallback. The OVERLAPPED structure is passed to functions like ReadFile and WriteFile. The system keeps track of this overlapped structure internally with the pending I/O request. When the request completes, the system places the address of the structure in the completion port so that it can then be passed to your OverlappedCompletionRoutine. Also, because the address of the completion routine is the completion key, to get additional context information into the OverlappedCompletionRoutine function you should use the traditional trick of placing the context information at the end of the OVERLAPPED structure. You should also be aware that closing a device causes all of its pending I/O requests to complete immediately with an error code. Be prepared to handle this in your callback function. If, after closing the device, you want to make sure that no callbacks are being executed, you must do reference counting yourself in your application. In other words, increment a counter every time an I/O request is pending and decrement the counter each time an I/O request completes. Currently, there are no special flags that you can pass to BindIoCompletionCallback, so always pass 0. I feel that there is one flag that should be here: WT_EXECUTEINIOTHREAD. If an I/O request completes, this gets queued to a non-I/O component thread. It is likely that in your OverlappedCompletionRoutine function you'll have another asynchronous I/O request pending. But remember that if a thread that issues I/O requests terminates, the I/O requests are destroyed too. Also, remember that the threads in the non-I/O component are created and destroyed depending on the workload. If the workload is low, it is possible that a thread in this component could terminate with outstanding I/O requests pending. 
If BindIoCompletionCallback supported the WT_EXECUTEINIOTHREAD flag, then a thread waiting on the completion port would wake and post the result to an I/O component thread. Since these threads never die while they have pending I/O requests, you could issue I/O requests without the fear of them being destroyed.

Well, while the WT_EXECUTEINIOTHREAD flag would be nice, you can easily emulate the behavior I just described. All you have to do is call QueueUserWorkItem in your OverlappedCompletionRoutine function, passing the WT_EXECUTEINIOTHREAD flag and whatever data you need (probably at least the overlapped structure). This is all that the thread pooling functions would do for you anyway. | 
| For related information see: About Processes and Threads at http://msdn.microsoft.com/library/psdk/winbase/prothred_0n03.htm. | 
| From the April 1999 issue of Microsoft Systems Journal 
 |