Pooled Threads

Improve Scalability With New Thread Pool APIs

Robert Saccone

Code download available at: VistaThreadPools2007_10.exe (175 KB)

Portions of this article are based on a prerelease version of Windows Server 2008. Details contained herein are subject to change.

This article discusses:

  • Thread pool functionality
  • Legacy thread pool limitations
  • New thread pool APIs
  • Thread pool objects

This article uses the following technologies:
Windows Vista, Windows Server 2008

Contents

Thread Pool Overview
Thread Pool Objects
Callback Environment Objects
Work Objects
Wait Objects
Timer Objects
I/O Completion Objects
Simplifying Cleanup with Cleanup Groups
Callback Instance APIs
Try It Out with the Sample Applications

With the release of Windows Vista® and the upcoming release of Windows Server® 2008, Microsoft has enriched the Windows® platform with so much new technology for developers of managed applications that it's easy to overlook advancements that the native Windows developer can benefit from. The thread pool component that has been part of the platform since the release of Windows 2000, for example, has undergone a complete rearchitecture. The new implementation brings with it a new thread pool API that should make it much easier for developers to write correct code. The legacy APIs are still supported so that legacy applications can continue to run, but, as you'll see, there are many benefits to be gained from moving to the new APIs.

Thread Pool Overview

Before diving into the new APIs, let's review the type of functionality that the thread pool component provides. In a nutshell, the thread pool allows functions to be called at timed intervals, asynchronously, when a kernel object has been signaled, and when asynchronous I/O requests complete. Each time one of these callback functions should be invoked, a request to invoke a callback function is added to the end of a first-in-first-out (FIFO) queue that the thread pool maintains. Worker threads internally managed by the thread pool remove items from the queue and invoke the callback functions. Since the thread pool manages the lifetimes of the worker threads, the application developer does not have to explicitly manage thread lifetimes. This also means that the application should never terminate a thread pool worker thread using the TerminateThread function or by calling the ExitThread function from a function that is executing on a worker thread.
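
The queue-and-dispatch behavior described above can be pictured with a small single-threaded model. This is only a sketch of the FIFO idea, not the thread pool's implementation: CallbackQueue, queue_push, queue_dispatch_one, and record_id are hypothetical names invented for illustration, and a real pool dispatches from many worker threads at once.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 16

typedef void (*Callback)(void *context);

typedef struct {
    Callback fn;      /* function to invoke */
    void *context;    /* application data handed to the function */
} Request;

typedef struct {
    Request items[QUEUE_CAPACITY];
    size_t head;      /* index of the oldest request */
    size_t count;     /* number of queued requests */
} CallbackQueue;

/* Adds a request at the tail; returns 0 if the queue is full. */
int queue_push(CallbackQueue *q, Callback fn, void *context) {
    assert(q != NULL && fn != NULL);
    if (q->count == QUEUE_CAPACITY) return 0;
    size_t tail = (q->head + q->count) % QUEUE_CAPACITY;
    q->items[tail].fn = fn;
    q->items[tail].context = context;
    ++q->count;
    return 1;
}

/* Plays the role of one worker thread: removes the oldest request
   and invokes its callback. Returns 0 when the queue is empty. */
int queue_dispatch_one(CallbackQueue *q) {
    assert(q != NULL);
    if (q->count == 0) return 0;
    Request r = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    --q->count;
    r.fn(r.context);
    return 1;
}

/* Example callback: appends the integer it is given to a log. */
int g_log[QUEUE_CAPACITY];
size_t g_log_len;

void record_id(void *context) {
    assert(g_log_len < QUEUE_CAPACITY);
    g_log[g_log_len++] = *(int *)context;
}
```

Requests pushed in the order 1, 2, 3 are dispatched in that same order, which is the property the later discussion of single-queue delays relies on.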

Applications often create dedicated threads that spend a great deal of time waiting for an event or periodically awakening to poll for some state. The thread pool makes this more efficient. For a server class application that is trying to process numerous client requests, the thread pool enables the preferred server concurrency model for the Windows platform, which is to process as many requests concurrently as there are processors in the system. One alternative model I've seen developers try is to dedicate one thread for each client being serviced, but this model does not scale well with a large number of clients. Not only do the large number of threads consume significant resources, but the cost of context switching between them becomes significant and impacts the overall quality of service each client receives.

The new thread pool component addresses a number of limitations of the legacy thread pool. For example, the new thread pool lets you create multiple thread pools per process, while the old model allowed only one. This lets you isolate the tasks the application performs by criteria of your choosing. For example, suppose you're writing a distributed server application that is going to use the thread pool to process asynchronous I/O completions on network connections. The network connections are of two types: those from client applications and those from other instances of the server application running on different computers. Let's further assume that you have to support a large number of simultaneous client connections per server and that the number of client requests is much greater than the number of server-to-server requests, but the server-to-server connections have a higher priority.

When both client and server network connections are processed on a single thread pool, there is a single queue to which all the completion notifications are queued. This means that an I/O completion from a server connection will not be processed until all I/O completions in front of it in the queue are processed, including those from any clients. This can result in significant delays in processing server connections when the system is under heavy client load. Dedicating a thread pool to client connections and another to server connections allows for isolation because each thread pool has its own queue and set of worker threads. The scheduler in the operating system will ensure that the processors are shared fairly between the threads in both pools. However, it's important to use multiple thread pools judiciously as creating too many per process can hurt performance and reduce throughput.

The legacy thread pool used two types of worker threads: I/O and non-I/O, which at times caused confusion. If the developer didn't understand the distinction, the result could be limited performance or incorrect behavior. Moreover, having two distinct thread groupings meant that the thread pool implementation itself was less efficient since it couldn't share threads between callback functions of different types. In the new thread pool, this distinction is removed and all worker threads are the same.

In the legacy API, QueueUserWorkItem was used to queue a function to be called asynchronously on a thread pool worker thread. The legacy API provided no way for the application to determine when the worker thread had finished executing the callback function. If the callback function was located in a DLL, it was virtually impossible to guarantee that it was safe to unload the DLL. This meant that unloading a DLL occasionally caused the hosting process to crash. There was also no way to cancel callbacks that were in the thread pool's queue awaiting execution, so there was no choice but to wait for all requests to drain, which could introduce significant delays.

Finally, the legacy API does not separate resource allocation from resource use. Since these resource allocations can and do fail, the legacy API makes it hard to develop systems that have reliability guarantees. The new thread pool API makes the separation between resource allocation and usage distinct, so that once a resource has been successfully allocated, there is virtually no possibility of failure for well-written code when those resources are put to use.

The new thread pool API is object-based, where each type of object has a set of functions for creation, cleanup, and modifying properties. Figure 1 summarizes the object types exposed by the API. It's important to understand that all the objects created by the thread pool, and all the worker threads managed by it, become part of the application process using them.

Figure 1 Thread Pool Object Types

Object Type            Description
TP_POOL                Pool of threads used to execute callbacks.
TP_TIMER               Invoke a callback function at a due time.
TP_WAIT                Invoke a callback function when a kernel object is signaled or the wait times out.
TP_WORK                Invoke a callback function asynchronously.
TP_IO                  Invoke a callback function when an asynchronous I/O completes.
TP_CLEANUP_GROUP       Track one or more thread pool callback objects.
TP_CALLBACK_ENVIRON    Bind a thread pool to its callback objects, and optionally a cleanup group.

Thread Pool Objects

The first step in preparing a thread pool for use is to create one using the CreateThreadpool function, which is shown in Figure 2. The function takes a reserved parameter that must be NULL. If the function succeeds, it returns a PTP_POOL representing the newly allocated thread pool. If the function fails, it returns NULL, and GetLastError can be used to get extended error information.

Figure 2 Thread Pool Object API

PTP_POOL WINAPI CreateThreadpool(PVOID reserved); 
BOOL WINAPI SetThreadpoolThreadMinimum(PTP_POOL ptpp, DWORD cthrdMin); 
VOID WINAPI SetThreadpoolThreadMaximum(PTP_POOL ptpp, DWORD cthrdMost); 
VOID WINAPI CloseThreadpool(PTP_POOL ptpp);

Once the thread pool is created, it is possible to control the minimum and maximum number of threads the pool will manage using two new APIs. By default, the thread pool minimum is 0 and the maximum is 500. These numbers were chosen for backward compatibility with the legacy thread pool as some applications using the legacy thread pool required large numbers of threads. These defaults could change in future versions of Windows as any application that uses such a large number of threads is likely to perform poorly due to excessive context switching.

There are some interesting features to be aware of with these APIs. Notice that SetThreadpoolThreadMinimum returns a BOOL while SetThreadpoolThreadMaximum returns nothing. The reason is that if setting the minimum number of threads requires SetThreadpoolThreadMinimum to increase the number of worker threads in the pool to meet the new minimum, and a failure occurs while allocating the threads, this failure can be reported back to the caller. Setting the maximum number of threads in the pool doesn't cause any resource allocations because all the function does is set an upper bound on how many threads the thread pool can create. The thread pool will vary the number of threads from the minimum to the maximum based upon the actual workload of the thread pool. Unfortunately, SetThreadpoolThreadMinimum has a bug in Windows Vista that manifests when it needs to increase the number of worker threads in the pool: the function returns without immediately creating the additional threads. This bug will be fixed in Windows Vista Service Pack 1 and isn't an issue in Windows Server 2008. For Windows Vista today, however (barring the issue of worker-thread-creation failure), once enough work has been queued to cause the thread pool to create the specified minimum number of threads, the thread pool will honor the minimum and maintain at least that many threads in the pool.

Setting the minimum and the maximum number of threads to the same value creates a pool of persistent threads that will never terminate until the pool is closed. This is handy for using a thread pool with a function, such as RegNotifyChangeKeyValue, that must be called from a persistent thread. For more information on RegNotifyChangeKeyValue see msdn2.microsoft.com/ms724892.

Note that calling SetThreadpoolThreadMinimum with a minimum greater than the current maximum not only causes a new minimum to be set but also causes the maximum to be set to that new minimum. Calling SetThreadpoolThreadMaximum with a maximum that is less than the minimum causes the minimum to be set to the new maximum. The thread count parameter is specified as a DWORD in both of these functions, but the entire range of a DWORD is not available for the thread count. Internally, the value is treated as a LONG and is validated to be greater than or equal to 0.
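
The way the two limits pull on each other can be captured in a small model. This is a sketch of the rules as described above, not the operating system's code: PoolLimits, pool_default_limits, pool_set_min, and pool_set_max are hypothetical names, and the model reports a negative count with a false return value, whereas the real functions raise a structured exception for invalid arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pool's thread limits. */
typedef struct {
    long min_threads;
    long max_threads;
} PoolLimits;

/* Defaults quoted in the article: minimum 0, maximum 500. */
PoolLimits pool_default_limits(void) {
    PoolLimits p = { 0, 500 };
    return p;
}

/* Models SetThreadpoolThreadMinimum: a minimum above the current
   maximum raises the maximum to match. */
bool pool_set_min(PoolLimits *p, long count) {
    assert(p != NULL);
    if (count < 0) return false;   /* outside the valid LONG range */
    p->min_threads = count;
    if (p->max_threads < count) p->max_threads = count;
    return true;
}

/* Models SetThreadpoolThreadMaximum: a maximum below the current
   minimum lowers the minimum to match. */
bool pool_set_max(PoolLimits *p, long count) {
    assert(p != NULL);
    if (count < 0) return false;   /* outside the valid LONG range */
    p->max_threads = count;
    if (p->min_threads > count) p->min_threads = count;
    return true;
}
```

Starting from a maximum of 8, setting the minimum to 16 leaves both limits at 16; setting the maximum to 4 afterward leaves both at 4. Whichever limit is set last wins.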

This is an appropriate time to explain the error-reporting philosophy of the thread pool APIs. If an operation is one that can legitimately be expected to fail, the result is reported by a return code from the function being called. However, errors that are exceptional are reported using structured exceptions. For example, for the functions SetThreadpoolThreadMinimum and SetThreadpoolThreadMaximum, passing an invalid value for the number of worker threads or a NULL thread pool pointer causes a structured exception to be raised. However, a failure to create a new worker thread during a call to SetThreadpoolThreadMinimum is reported to the caller as FALSE because a resource allocation failure is an error that should be expected. It may be tempting to wrap structured exception handlers around calls to the thread pool APIs so that an application will not terminate due to the unhandled exception, but this is a bad idea as catching these exceptions and continuing to execute the program only serves to hide a problem in the application itself. This only makes it harder to diagnose the cause of the problem when the application ultimately fails.

In order to use these APIs properly, it helps to have an understanding of how the concurrency model of the thread pool is implemented. This model is based on the number of available processors in the system. For example, if the system has two physical processors with two cores per processor, optimally you should have only four threads that are runnable most of the time. This keeps context-switching overhead to a minimum. However, if the application has only one callback outstanding most of the time, sizing the thread pool to only one thread is perfectly reasonable. Going back to the two-processor, two-core example, assuming that the application has a dozen outstanding callbacks, there should be at least four worker threads in the pool to allow each core to process one item. However, in order to maintain optimal concurrency, it may be necessary for the thread pool to contain more than four worker threads. To understand why, consider a case in which each of the four cores is already busy executing a callback function and there is also a callback pending in the thread pool's queue. What happens if one of those worker threads executing callbacks blocks? There are three possibilities:

  • If the thread pool has an available thread in the pool, it will dispatch another thread to remove the next item from the queue and invoke the pending callback function.
  • If the thread pool has no other available worker threads and the number of worker threads already created is less than the maximum thread count, after a short delay it will create a new worker thread that will be dispatched to execute the pending callback function.
  • If the thread pool has no other available worker threads, and the number of worker threads the thread pool has already created has reached the maximum thread count, no additional threads will be created and the pending callback item will remain in the queue until a previously dispatched worker finishes executing its callback function and returns to the thread pool to determine if there is another item to be executed.
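
The three cases above amount to a short decision procedure, sketched here under stated assumptions. The BlockedAction enumeration and the on_worker_blocked function are hypothetical names used only to restate the list; the thread pool's internal logic is not exposed and may weigh other factors.

```c
#include <assert.h>

typedef enum {
    DISPATCH_IDLE_WORKER,  /* an available worker takes the pending callback */
    CREATE_NEW_WORKER,     /* after a short delay, a new worker is created */
    LEAVE_QUEUED           /* the item waits until a busy worker returns */
} BlockedAction;

/* idle_workers:  workers already created that are not running a callback
   total_workers: workers the pool has created so far
   max_workers:   the pool's maximum thread count */
BlockedAction on_worker_blocked(long idle_workers, long total_workers,
                                long max_workers) {
    assert(idle_workers >= 0 && idle_workers <= total_workers);
    if (idle_workers > 0) return DISPATCH_IDLE_WORKER;
    if (total_workers < max_workers) return CREATE_NEW_WORKER;
    return LEAVE_QUEUED;
}
```

With four busy workers, none idle, and a maximum of four, the function returns LEAVE_QUEUED. In that case the blocked worker's core sits unused while work is pending.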

For this reason, if a callback function might block while executing, the maximum size of the thread pool must be larger than the number of available processors in the system in order to realize the maximum concurrency that the hardware supports. Ideally, a callback function should never block. A callback function that blocks not only decreases concurrency but also decreases the level of worker thread reuse. Because of this, it's definitely worth making the effort to eliminate or minimize the length of time a callback function blocks. You can reduce that time if you understand what the callback is blocking on. For example, if the callback is performing synchronous I/O, consider changing it to asynchronous I/O that completes on the thread pool instead. And look to minimize any synchronization that must occur between callback functions because lock contention can cause blocking.

If most of the callback functions don't block, the thread pool can be sized to contain fewer threads in excess of the number of available processors. If most of the callback functions are going to end up blocking, the thread pool should be sized to contain many more worker threads than the number of available processors. In this case, most threads will end up in a wait state. As callback functions complete, in order to maximize overall throughput the thread pool will hold back worker threads, reducing the number of runnable threads to a number that's optimal for the processor configuration. The point is that you want to size the thread pool so that when there are tasks to run and sufficient processor bandwidth available, the pool can create more threads to do work. However, if you find that your application actually uses an inordinately large number of threads, like the 500 in the default setting, or if there is excessive context switching cost, the application is not designed properly and you need to examine your code using the performance monitor and a code profiler to determine where improvements can be made.
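
One way to turn this sizing advice into a starting number is a back-of-the-envelope estimate. The formula below is an assumption made for illustration, not a rule taken from the thread pool: if callbacks spend a given percentage of their time blocked, each thread is runnable only for the remainder, so keeping every processor busy takes roughly processors / (1 - blocked fraction) threads. The function name estimate_max_threads is hypothetical, and any result should be validated with the performance monitor and a profiler.

```c
#include <assert.h>

/* Rough estimate of a maximum thread count that keeps `processors`
   busy when callbacks are blocked `blocked_percent` of the time.
   Illustrative assumption only; measure before relying on it. */
long estimate_max_threads(long processors, long blocked_percent) {
    assert(processors > 0);
    assert(blocked_percent >= 0 && blocked_percent < 100);
    long runnable_percent = 100 - blocked_percent;
    /* Integer ceiling, so the estimate never drops below `processors`. */
    return (processors * 100 + runnable_percent - 1) / runnable_percent;
}
```

For the four-core example, callbacks that never block give an estimate of 4 threads, callbacks blocked half the time give 8, and callbacks blocked 75 percent of the time give 16.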

Once you're finished with the thread pool, it should be closed using the CloseThreadpool function. The thread pool is closed immediately if there are no outstanding callback objects that are bound to the thread pool. If there are, then the thread pool is released asynchronously when those outstanding objects are freed.

Callback Environment Objects

Now that you understand how to create a thread pool, the next step is to look at what a callback environment object is and how it should be used. A callback environment object is used to bind a thread pool instance to the instances of thread pool callback objects your application creates to actually do work on the thread pool. A callback environment object also allows you to attach a cleanup group object, which makes the cleanup of the thread pool callback objects simpler. Cleanup groups will be fully discussed later on in the article. The callback environment APIs are shown in Figure 3.

Figure 3 Callback Environment APIs

VOID InitializeThreadpoolEnvironment(PTP_CALLBACK_ENVIRON pcbe); 
VOID DestroyThreadpoolEnvironment(PTP_CALLBACK_ENVIRON pcbe); 
VOID SetThreadpoolCallbackPool(PTP_CALLBACK_ENVIRON pcbe, PTP_POOL ptpp); 
VOID SetThreadpoolCallbackLibrary(PTP_CALLBACK_ENVIRON pcbe, PVOID mod); 
VOID SetThreadpoolCallbackRunsLong(PTP_CALLBACK_ENVIRON pcbe); 
VOID SetThreadpoolCallbackCleanupGroup(PTP_CALLBACK_ENVIRON pcbe, PTP_CLEANUP_GROUP ptpcg, 
  PTP_CLEANUP_GROUP_CANCEL_CALLBACK pfng);

The first step in creating a callback environment is to declare a TP_CALLBACK_ENVIRON structure in static storage or on the stack, or to allocate one from the heap. The next step is to initialize it using the InitializeThreadpoolEnvironment function, which takes a pointer to TP_CALLBACK_ENVIRON. Finally, to associate a thread pool with the callback environment, use SetThreadpoolCallbackPool. If you don't associate a thread pool with the callback environment or if NULL is specified in the call to SetThreadpoolCallbackPool, the process's default thread pool will be used. As you'll see, the callback environment is eventually used to create the various thread pool callback object instances. Once your application has finished modifying the properties of the callback environment and creating all the instances of the callback objects it needs, the callback environment should be destroyed using the DestroyThreadpoolEnvironment function.

The SetThreadpoolCallbackRunsLong API is used to give a hint to the thread pool that the callback functions associated with this environment may not return quickly. You'll see later how this affects the result returned from calling the API function CallbackMayRunLong.

I mentioned earlier that the legacy thread pool lacked support for an application that needed to determine when it was safe to unload a DLL containing a callback function that was to be executed on the thread pool. The new thread pool API provides the SetThreadpoolCallbackLibrary function to guarantee that a library remains loaded as long as a callback function inside a DLL is still executing. Essentially, what happens is that before the callback function is invoked, the operating system's loader lock (which is always held when a DLL is in the process of being loaded or unloaded) is acquired and the DLL's reference count is incremented; the loader lock is then released. After the callback completes execution, the loader lock is acquired again to decrement the reference count. This makes it impossible for the DLL to be unloaded while a callback function is executing. Note that while callbacks are pending on the thread pool's queue, the reference count on the DLL remains unchanged. This means that a DLL can be unloaded with callbacks pending. However, it is the job of the DllMain function that you write to make sure that you handle the DLL_PROCESS_DETACH event and cancel all pending callbacks. I'll get to this in a bit.

Work Objects

A work object is used to cause the thread pool to invoke a callback function asynchronously on a worker thread. The CreateThreadpoolWork function, shown in Figure 4 along with all the other work object related APIs, is used to create a work object.

Figure 4 Work Object APIs

PTP_WORK WINAPI CreateThreadpoolWork(PTP_WORK_CALLBACK pfnwk, 
  PVOID Context, PTP_CALLBACK_ENVIRON pcbe); 
VOID WINAPI SubmitThreadpoolWork(PTP_WORK pwk); 
VOID CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PTP_WORK Work); 
VOID WINAPI WaitForThreadpoolWorkCallbacks(PTP_WORK pwk, BOOL fCancelPendingCallbacks); 
VOID WINAPI CloseThreadpoolWork(PTP_WORK pwk); 
BOOL WINAPI TrySubmitThreadpoolCallback(PTP_SIMPLE_CALLBACK pfns, 
  PVOID pv, PTP_CALLBACK_ENVIRON pcbe); 
VOID CALLBACK SimpleCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context);

The first parameter, pfnwk, is a pointer to the callback function to be executed by a worker thread. The second parameter, Context, is typed as PVOID and can be used to supply any application-specific data that the callback function needs. Finally, the last parameter is a pointer to the TP_CALLBACK_ENVIRON. The work object will be bound to the thread pool and to the cleanup group, if any, that has been associated with the callback environment.

After the work object has been created, it can be queued to the thread pool using the SubmitThreadpoolWork function. Eventually a worker thread will remove the work object from the queue and invoke its associated callback function. Each call to SubmitThreadpoolWork will generate one call to the work item's callback function.

The WorkCallback function for a work object takes three parameters. The first, Instance, is used to identify the specific instance of the callback function's execution and remains valid only for the execution of the callback function. The Context parameter is the pointer that was supplied as the second parameter to CreateThreadpoolWork. The final parameter, Work, is the instance of the work object for which the callback function is being invoked.

It's important to understand that a callback function can be invoked by any worker thread in the thread pool. The rule of thumb to follow is that the callback function shouldn't make any assumptions about the worker thread on which it will execute and should leave the worker thread in the same state it was in prior to the callback function invocation. For example, if the callback function is going to use COM, then it must call CoInitializeEx each time it is invoked. The callback function should also call CoUninitialize before returning since a worker thread can be reused to dispatch callback functions for a number of different tasks, or it might even be terminated when it is returned to the thread pool. This prevents resource leaks and avoids leaving behind state information that could adversely affect the execution of the next callback function, which may have different requirements for the execution environment of the thread.

Application Verifier is a runtime verification tool for unmanaged code that assists in finding subtle programming errors that may be difficult to identify in normal application testing. It has been enhanced to assist in finding programming errors related to the thread pool. Among the errors it can detect are unbalanced CoInitializeEx and CoUninitialize calls, thread priority and affinity changes that haven't been reverted before returning the worker thread to the thread pool, impersonation that hasn't been reverted, and orphaned critical sections. For a comprehensive list, consult the documentation installed with Application Verifier, which can be downloaded from microsoft.com/downloads/details.aspx?familyid=bd02c19c-1250-433c-8c1b-2619bd93b3a2.

The WaitForThreadpoolWorkCallbacks function blocks the calling thread until all outstanding callback functions for a work object have completed execution. The second parameter controls whether pending callbacks (callbacks that have been queued to the thread pool's queue but have not yet been dispatched to a worker thread for execution) should be allowed to execute or be canceled. Be careful with this API—using it inside of the work callback function can cause a deadlock.

Rounding out the API set relating to work objects is the CloseThreadpoolWork function. The work object is freed immediately if there are no outstanding callbacks; otherwise the work object will be freed asynchronously once outstanding callbacks complete. This also means that any pending callbacks waiting for execution on the thread pool's queue will be canceled.

The remaining function in Figure 4, TrySubmitThreadpoolCallback, offers the equivalent of creating a work object, submitting it to the thread pool, and ensuring that the work object will be closed once the callback specified by the pfns parameter has completed executing. The signature of the callback function must conform to the signature of the SimpleCallback in Figure 4. It is slightly different from the callback associated with a work object. Since the thread pool takes care of allocating and releasing the work object internally, the callback function is only passed the callback function instance pointer and the application defined context specified in the Context parameter of TrySubmitThreadpoolCallback. Since TrySubmitThreadpoolCallback allocates resources to do its work, there is a possibility it could fail, which is why it returns a BOOL return code.

Wait Objects

Wait objects are used to invoke a callback function once a kernel object has become signaled or when the specified wait period times out. The CreateThreadpoolWait function is used to create a wait object. Its parameters follow the same pattern as CreateThreadpoolWork with the exception that the pointer to the callback function supplied in the first parameter must match the signature of the WaitCallback function, as shown in Figure 5.

Figure 5 Wait Object APIs

PTP_WAIT WINAPI CreateThreadpoolWait(PTP_WAIT_CALLBACK pfnwa, 
  PVOID pv, PTP_CALLBACK_ENVIRON pcbe); 
VOID WINAPI SetThreadpoolWait(PTP_WAIT pwa, HANDLE h, PFILETIME pftTimeout); 
VOID CALLBACK WaitCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, 
  PTP_WAIT Wait, TP_WAIT_RESULT WaitResult); 
VOID WINAPI WaitForThreadpoolWaitCallbacks(PTP_WAIT pwa, BOOL fCancelPendingCallbacks); 
VOID WINAPI CloseThreadpoolWait(PTP_WAIT pwa);

After creating a wait object, the next step is to cause the thread pool to wait for a kernel object to become signaled. The SetThreadpoolWait function is used to do just that. The first parameter, pwa, is a pointer to the wait object instance being set. The second parameter is the HANDLE to the kernel object to wait on. The final parameter, pftTimeout, is a pointer to a FILETIME structure that indicates the amount of time the thread pool should wait for the kernel object to be signaled. The amount of time to wait can be expressed in absolute or relative terms and is specified in 100-nanosecond units. Passing a positive value indicates that the timeout is an absolute time since 1/1/1601. Passing a negative value indicates a time relative to the current time at the call of the function. Passing a 0 indicates that the wait times out immediately. Finally, passing NULL indicates an infinite wait that never times out.
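
Building a relative timeout is mostly a matter of unit conversion and sign, as the following sketch shows. The helper names relative_timeout_ms and timeout_units are hypothetical. On Windows the code uses the real FILETIME from windows.h; elsewhere it falls back to a stand-in with the same two 32-bit fields, which is an assumption added only so the sketch builds on other platforms.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in mirroring the Windows FILETIME layout (non-Windows builds only). */
typedef uint32_t DWORD;
typedef struct {
    DWORD dwLowDateTime;
    DWORD dwHighDateTime;
} FILETIME;
#endif

/* Reassembles the signed 64-bit count of 100-nanosecond units. */
int64_t timeout_units(const FILETIME *ft) {
    uint64_t bits = ((uint64_t)ft->dwHighDateTime << 32) | ft->dwLowDateTime;
    return (int64_t)bits;
}

/* Builds a timeout of `milliseconds` relative to the time of the call:
   a negative count of 100-nanosecond units, split across the two
   FILETIME halves. Zero yields a wait that times out immediately. */
FILETIME relative_timeout_ms(uint32_t milliseconds) {
    /* 1 ms = 10,000 units of 100 ns; the negative sign means "relative". */
    int64_t units = -(int64_t)milliseconds * 10000;
    uint64_t bits = (uint64_t)units;   /* two's-complement bit pattern */
    FILETIME ft;
    ft.dwLowDateTime = (DWORD)(bits & 0xFFFFFFFFu);
    ft.dwHighDateTime = (DWORD)(bits >> 32);
    assert(timeout_units(&ft) == units);   /* round-trip sanity check */
    return ft;
}
```

A one-second timeout becomes -10,000,000 units. The address of the resulting structure is what would be passed as pftTimeout.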

Once the kernel object becomes signaled or the wait times out, the thread pool will invoke the wait object's WaitCallback function. The first two parameters are exactly the same as in the WorkCallback function described earlier. The Wait parameter indicates which wait object the callback is being invoked for and the WaitResult is used to indicate the reason for the invocation. WaitResult will be WAIT_ABANDONED_0, WAIT_OBJECT_0, or WAIT_TIMEOUT. If WaitResult is set to WAIT_OBJECT_0, it means that the kernel object became signaled and the wait was satisfied.

If WaitResult is set to WAIT_TIMEOUT, it means the wait is unsatisfied and the timeout interval specified in the call to SetThreadpoolWait has elapsed. A result of WAIT_ABANDONED_0 indicates that the specified object is a mutex and that it was not released by the thread that owned the mutex object before it terminated. However, using a mutex with a wait object should be avoided because the worker thread that invokes the WaitCallback function is not the thread that performed the wait on the mutex. The thread pool uses a different type of thread, called a waiter thread, to actually perform the wait on the kernel object. It is the waiter thread that actually owns the mutex. Waiter threads are not exposed to applications, so there is no way to release the mutex once the waiter thread acquires ownership of it.

After the WaitCallback function has been called for a wait object, SetThreadpoolWait must be called again in order to reuse the wait object and have the thread pool wait again for a kernel object to become signaled. Note that when calling SetThreadpoolWait again to reuse a wait object, you have the option to once again specify any kernel object handle to wait on. You don't have to use the handle that was specified in the first call to SetThreadpoolWait if you want to wait on a different kernel object. Finally, the remaining wait object APIs, WaitForThreadpoolWaitCallbacks and CloseThreadpoolWait, behave in exactly the same manner as their work object counterparts.

Timer Objects

Timer objects are used to invoke a callback function when the timer object reaches its due time. They are created using the CreateThreadpoolTimer function, which is shown in Figure 6 along with the other timer object-related APIs. Once again the parameters follow the same pattern as the CreateThreadpoolWork function (in Figure 4), with the exception that the pointer to the callback function supplied in the first parameter must match the signature of the TimerCallback function, as shown in Figure 6.

Figure 6 Timer Object APIs

PTP_TIMER WINAPI CreateThreadpoolTimer(PTP_TIMER_CALLBACK pfnti, 
  PVOID pv, PTP_CALLBACK_ENVIRON pcbe); 
VOID WINAPI SetThreadpoolTimer(PTP_TIMER pti, PFILETIME pftDueTime, 
  DWORD msPeriod, DWORD msWindowLength);
VOID CALLBACK TimerCallback(PTP_CALLBACK_INSTANCE Instance, 
  PVOID Context, PTP_TIMER Timer); 
VOID WINAPI WaitForThreadpoolTimerCallbacks(PTP_TIMER pti, BOOL fCancelPendingCallbacks); 
BOOL WINAPI IsThreadpoolTimerSet(PTP_TIMER pti); 
VOID WINAPI CloseThreadpoolTimer(PTP_TIMER pti);

After creating a timer object, the next step is to set it using the SetThreadpoolTimer function. The pftDueTime parameter is used to set the time when the timer should initially come due. This time is expressed in the same way as the timeout used in the SetThreadpoolWait function described above. The msPeriod parameter, expressed in milliseconds, makes the timer fire periodically. Once the initial time interval expressed by pftDueTime has expired and caused a callback to be queued to the thread pool, each subsequent expiration of the interval defined by msPeriod will cause another callback to be queued.

The msWindowLength parameter specifies a time window, expressed in milliseconds, during which the thread pool may delay expiration of the timer. This parameter promotes efficiency when you're using a large number of timers and the expiration time doesn't have to be exact. The window is a fudge factor of sorts that allows the system to coalesce all the timer expirations that fall into the window together so that they can be batched. This is more efficient than waking up a thread, expiring one timer, sleeping, waking up a thread, expiring another timer, sleeping, and so forth. The delay will occur only if there aren't any timers in the window that absolutely must be expired. To better understand how msWindowLength might be used, consider a server app that has a large number of incoming client connections and that in order to reduce resource usage, connections that haven't been active for five minutes should be closed. In this case, specifying a non-zero window length for the inactivity timers, which may result in keeping a connection around slightly past its due time, may be acceptable.
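
The benefit of a window can be seen in a simplified model that counts how many times a thread has to wake up to expire a set of timers. This is a sketch under stated assumptions, not the system's coalescing algorithm: each timer is assumed to be allowed to expire anywhere between its due time and its due time plus its window, and the TimerSpec type and count_wakeups function are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    long due_ms;     /* earliest acceptable expiration time */
    long window_ms;  /* tolerated delay, in the spirit of msWindowLength */
} TimerSpec;

/* Orders timers by the latest moment at which each may expire. */
static int by_deadline(const void *a, const void *b) {
    const TimerSpec *x = a, *y = b;
    long dx = x->due_ms + x->window_ms;
    long dy = y->due_ms + y->window_ms;
    return (dx > dy) - (dx < dy);
}

/* Counts the wakeups needed to expire every timer inside its window.
   Greedy rule: wake at the earliest outstanding deadline and expire
   every timer that is already due by then. Sorts `timers` in place. */
size_t count_wakeups(TimerSpec *timers, size_t n) {
    size_t wakeups = 0;
    long last_wake = 0;
    qsort(timers, n, sizeof *timers, by_deadline);
    for (size_t i = 0; i < n; ++i) {
        assert(timers[i].window_ms >= 0);
        if (wakeups == 0 || timers[i].due_ms > last_wake) {
            last_wake = timers[i].due_ms + timers[i].window_ms;
            ++wakeups;
        }
    }
    return wakeups;
}
```

In this model, three timers due 10 milliseconds apart with no window need three wakeups, while the same timers with a 50-millisecond window need only one. A timer with a zero window still expires exactly on time, because its deadline forces a wakeup.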

SetThreadpoolTimer can also be used to set a new due time, period, and window length for a timer that was previously set. If the pftDueTime parameter is NULL, the timer object stops queuing new callbacks to the TimerCallback function, but callbacks that have already been queued will still be executed. Thus you can cancel a timer without closing the timer object, which leaves the object available for reuse.

Once the timer object comes due, a request to invoke the callback will be queued to the thread pool. A worker thread will pick up the request and invoke the TimerCallback function supplied in the call to CreateThreadpoolTimer. The parameters to the callback function are the same as those described for a work object callback function, with the exception that the third parameter is a PTP_TIMER instead of a PTP_WORK.

Of the remaining timer API functions in Figure 5, the IsThreadpoolTimerSet function, as its name implies, returns TRUE if the timer has been set and FALSE otherwise. The remaining two, WaitForThreadpoolTimerCallbacks and CloseThreadpoolTimer, behave exactly as their work object counterparts do.

I/O Completion Objects

The final type of callback object supported by the thread pool is the I/O completion object. Figure 7 lists all the APIs relating to I/O completion objects. An I/O completion object is used to bind a file handle to the thread pool so that asynchronous I/O completion notifications are queued to the thread pool for processing by worker threads. The parameters for the CreateThreadpoolIo function follow a similar pattern to the other create functions with the addition of the HANDLE parameter, which must be opened for overlapped I/O completion.

Figure 7 I/O Completion Object APIs

PTP_IO WINAPI CreateThreadpoolIo(HANDLE fl, PTP_WIN32_IO_CALLBACK pfnio, 
  PVOID pv, PTP_CALLBACK_ENVIRON pcbe); 
VOID WINAPI StartThreadpoolIo(PTP_IO pio); 
VOID CALLBACK IoCompletionCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, 
  PVOID Overlapped, ULONG IoResult, ULONG_PTR NumberOfBytesTransferred, PTP_IO Io); 
VOID WINAPI WaitForThreadpoolIoCallbacks(PTP_IO pio, BOOL fCancelPendingCallbacks); 
VOID WINAPI CancelThreadpoolIo(PTP_IO pio); 
VOID WINAPI CloseThreadpoolIo(PTP_IO pio);

As shown in Figure 7, the first parameter to CreateThreadpoolIo is the HANDLE to receive completion notifications, while parameters two, three, and four represent respectively the pointer to the callback function to be invoked, an optional application-specific context, and a pointer to the callback environment. CreateThreadpoolIo returns a non-NULL pointer if successful and a NULL pointer otherwise.

In order to cause I/O completion notifications to be processed by the thread pool, the StartThreadpoolIo function must be called prior to issuing each asynchronous I/O request on the handle. Forgetting to do this has serious consequences: the thread pool will ignore the I/O completion when it occurs, and memory corruption will result. You must call a related function, CancelThreadpoolIo, when the call to initiate an asynchronous I/O operation returns with a failure other than ERROR_IO_PENDING. Accidentally omitting the call to CancelThreadpoolIo will cause the thread pool to leak memory.

Once an I/O completion occurs, a worker thread will call the IoCompletionCallback associated with the I/O completion object. The signature for the function follows the same convention as all the other callbacks: the first two parameters to the function are a pointer to the callback instance and the context pointer supplied to the create function. The third parameter, Overlapped, is the pointer to the overlapped structure supplied when initiating the asynchronous I/O operation. The fourth is IoResult, which contains the result of the operation. IoResult will contain NO_ERROR if the operation completed successfully; otherwise it will contain a system error code (see msdn2.microsoft.com/ms681381). The fifth parameter, NumberOfBytesTransferred, contains the number of bytes transferred during the I/O operation while the sixth parameter is a pointer to the I/O completion object itself.

Once again, the remaining functions in Figure 7, WaitForThreadpoolIoCallbacks and CloseThreadpoolIo, behave just like their work object counterparts.

Simplifying Cleanup with Cleanup Groups

Now let's see how cleanup groups can help simplify the process of cleaning up the thread pool callback objects an application creates. By associating a cleanup group with a thread pool callback environment, all thread pool callback objects created using that callback environment will be tracked by the cleanup group. Once your application is finished using callback objects, all it has to do to ensure that each one is closed is to make one function call instead of one close call for each callback object. In fact, if a callback object is associated with a cleanup group, the close API for it shouldn't be called.

The first step in using a cleanup group is to create a cleanup group object using the CreateThreadpoolCleanupGroup function, which is shown in Figure 8.

Figure 8 Cleanup Group APIs

PTP_CLEANUP_GROUP WINAPI CreateThreadpoolCleanupGroup(void); 
VOID WINAPI CloseThreadpoolCleanupGroupMembers(PTP_CLEANUP_GROUP ptpcg, 
  BOOL fCancelPendingCallbacks, PVOID pvCleanupContext); 
VOID CALLBACK CleanupGroupCancelCallback(PVOID ObjectContext, PVOID CleanupContext); 
VOID WINAPI CloseThreadpoolCleanupGroup(PTP_CLEANUP_GROUP ptpcg);

The next step is to associate the cleanup group with the callback environment it is to be used with. The SetThreadpoolCallbackCleanupGroup function from Figure 3 sets up this association.

The first parameter to the function is a pointer to the callback environment. The second parameter is the PTP_CLEANUP_GROUP, and the third parameter is a pointer to a callback function that will be invoked when the CloseThreadpoolCleanupGroupMembers function is called.

Once an application has finished using the thread pool callback objects that are being tracked by the cleanup group, all it has to do is call CloseThreadpoolCleanupGroupMembers, which will block until all callback functions currently executing complete. If fCancelPendingCallbacks is TRUE, callbacks that have been queued to the thread pool that have not yet started executing will also be canceled. If fCancelPendingCallbacks is FALSE, CloseThreadpoolCleanupGroupMembers will not return until all the pending callback functions are dispatched to a worker thread and complete execution. The pvCleanupContext parameter is used to pass application data to the optional cancel cleanup callback function that was specified in the call to SetThreadpoolCallbackCleanupGroup. The cancel callback function will be invoked once for each thread pool callback object being cleaned up. The cancel callback's signature must match the CleanupGroupCancelCallback function shown in Figure 8. The first parameter, ObjectContext, is the optional data that was supplied to the creation function for the thread pool callback object that is being cleaned up. The second parameter, CleanupContext, is the optional data that was supplied by the caller to the CloseThreadpoolCleanupGroupMembers function. After calling CloseThreadpoolCleanupGroupMembers, the CloseThreadpoolCleanupGroup function is used to close the cleanup group and free any resources associated with it. It's important that CloseThreadpoolCleanupGroup not be called while the cleanup group has members because doing so may cause resources to leak.

Callback Instance APIs

The final set of APIs to examine is shown in Figure 9. They are all meant to be used from the worker thread that is executing the callback object's callback function, as they all require the PTP_CALLBACK_INSTANCE that is passed to the callback function. The first function, CallbackMayRunLong, is used to tell the thread pool that the callback function wants to run for an extended period of time. The thread pool tracks the number of worker threads that are executing long-running callbacks. Recall that earlier, we used the SetThreadpoolCallbackRunsLong function to tell the thread pool that callback functions associated with the supplied callback environment are all long-running callbacks.

Figure 9 Callback Instance APIs

BOOL WINAPI CallbackMayRunLong( PTP_CALLBACK_INSTANCE Instance); 
VOID WINAPI DisassociateCurrentThreadFromCallback( PTP_CALLBACK_INSTANCE Instance); 
VOID WINAPI SetEventWhenCallbackReturns( PTP_CALLBACK_INSTANCE Instance, HANDLE evt); 
VOID WINAPI ReleaseSemaphoreWhenCallbackReturns( PTP_CALLBACK_INSTANCE Instance, HANDLE sem, DWORD crel); 
VOID WINAPI LeaveCriticalSectionWhenCallbackReturns( PTP_CALLBACK_INSTANCE Instance, PCRITICAL_SECTION pcs); 
VOID WINAPI ReleaseMutexWhenCallbackReturns( PTP_CALLBACK_INSTANCE Instance, HANDLE mut); 
VOID WINAPI FreeLibraryWhenCallbackReturns( PTP_CALLBACK_INSTANCE Instance, HMODULE mod);

A TRUE result from CallbackMayRunLong indicates that the thread pool has worker threads available for processing long-running callbacks. In considering worker thread availability, the thread pool only considers the current set of worker threads that exist at the time the function is called. A FALSE return code indicates that all available worker threads are busy executing long-running callbacks already. In this case, the callback function should return as quickly as possible and delay the long-running work to a later time if it wants to keep a worker thread available to execute short-running callbacks.

The next function in Figure 9, DisassociateCurrentThreadFromCallback, breaks the association between the currently executing callback function and the object that initiated the callback. The current thread will no longer count as executing a callback on behalf of the object. For example, if the callback function for a work object calls DisassociateCurrentThreadFromCallback, it can then call WaitForThreadpoolWorkCallbacks using the pointer to the work object that was passed into the callback function without risking a deadlock. However, DisassociateCurrentThreadFromCallback does retain the association of the currently executing callback to the cleanup group of the callback environment so that if another thread has called CloseThreadpoolCleanupGroupMembers, the function will wait for the thread executing the callback function to return to the thread pool. This ensures that DLLs will not be unloaded while there are still threads executing code within them. One thing to note is that if you call DisassociateCurrentThreadFromCallback and plan on reusing the object from within the callback function (say, by calling a function such as SubmitThreadpoolWork for a work object), the reuse must be synchronized with any calls to CloseThreadpoolCleanupGroupMembers, because trying to reuse the object once CloseThreadpoolCleanupGroupMembers has started to execute can cause an exception to be thrown inside the callback function.

The next set of functions coordinates the completion of the execution of a callback function with a synchronization object. The SetEventWhenCallbackReturns function is used to put an event object into the signaled state when the current callback completes. Similarly the functions ReleaseSemaphoreWhenCallbackReturns, ReleaseMutexWhenCallbackReturns, and LeaveCriticalSectionWhenCallbackReturns are all designed to release different types of lock objects when the current callback function completes. These functions can help reduce programming errors by ensuring that no matter how the callback function returns, the specified lock will be released. Hopefully in a future release of Windows new functions will be added to support Slim Reader/Writer Locks as well.

The FreeLibraryWhenCallbackReturns function can be used to make the thread pool invoke FreeLibrary on the passed module handle when the specified callback instance completes execution. It's up to the application to ensure that all outstanding callbacks have completed execution and that all pending callbacks on the thread pool's queue have been canceled before the callback function returns.

Earlier, the SetThreadpoolCallbackLibrary function was described as a way to prevent a DLL from being unloaded prematurely while a callback function is still executing code inside the DLL. The cost of this insurance is fairly high because of the overhead involved in acquiring and releasing the loader lock before and after the invocation of the callback function. Note also that there is a single loader lock in the process and it may be heavily contended, which means that relying on the thread pool's mechanism to ensure a DLL doesn't unload prematurely may have a negative impact on the performance of the application. Depending upon the application scenario, it may be more efficient to build an application-specific mechanism that provides for safe unloading of a DLL using a combination of the functions described in this section and Windows synchronization primitives.

Try It Out with the Sample Applications

As you've seen, the new thread pool component includes many improvements to help you easily write applications that are both highly reliable and scalable. Key among these improvements is the ability to host more than one thread pool per process, where each thread pool has its own separate set of characteristics so you can partition processes by type of work performed.

I hope this overview of the new thread pool has whetted your appetite to explore how using it can benefit your applications. To get you started, I've included two sample applications with this article. The first, ThreadPoolDemo, lets you experiment with work, wait, and timer objects so you can explore how they work. ThreadPoolDemo will run on either the process's default thread pool or on a custom thread pool, allowing the minimum and maximum thread counts to be specified through command line parameters.

The work object demonstration, which is executed by specifying the -Work command-line option, creates a work object. By using the supplied count with the command-line parameter -I, it will submit the work item to the thread pool the number of times specified. You can use the -B option to define how long the callback function should block, and the -E option to specify how long the callback function should execute. By default, the program will run until all the submitted work items have been executed. Specifying either -CC or -CW on the command line causes the application to immediately call CloseThreadpoolCleanupGroupMembers after submitting the work item the specified number of times. The difference between the two options is that -CC cancels all callbacks that have not yet started while -CW waits for them to execute before returning from the CloseThreadpoolCleanupGroupMembers call.

You can also experiment with timer objects. You can configure multiple timer objects where each can have its own due time and an optional period and window. Using multiple timers with different window sizes allows you to see the effect of the system coalescing timer expirations together.

Finally, the wait-object demonstration lets you define one or more event specifications for the thread pool to wait on. Each event specification can include an optional due time to indicate when the event is to be signaled, as well as an optional timeout interval for defining when the wait should expire. These options let you experiment with combinations of events that become signaled, time out from waits, or never become signaled and never time out.

For details on command-line parameters for each of the callback object types, type one of the following in a command prompt: "ThreadPoolDemo -Work ?", "ThreadPoolDemo -Timer ?", or "ThreadPoolDemo -Wait ?".

The second sample application included with this article is called CopyFile. Essentially it is an update of the Windows SDK file-copying example that demonstrates how a completion port works. CopyFile will copy a source file to a destination file using the thread pool and demonstrates how I/O completion objects work. Typing "CopyFile -Usage" at the command line displays a complete description of the program parameters in the console window.

I'd like to thank Rob Earhart, Eric Li, and Sandeep Ranade for answering my questions and providing insightful feedback; and I'd like to thank Rob Shewan for reviewing the content and providing me with valuable feedback.

Robert Saccone is a Principal Architect in the Forefront Server Security group. His areas of interest are large-scale software design, distributed systems, and operating systems implementations. You can contact him at rsaccone@msn.com.