Tips for Improving Time-Critical Code

项目
09/15/2006

Writing fast, time-critical code requires that you carefully analyze all aspects of your code and how it interacts with the system. This section will suggest alternatives to some of the more obvious coding techniques when fast execution times are required.

One of the key factors in writing fast code is to "know your code." If you are using a library, make sure you have the source code (or have a thorough understanding of how the library works where it pertains to performance-critical code) which will enable you to make informed decisions.

In addition to the suggestions provided in this section there are tools like the profiler and Windows NT performance monitor (perfmon.exe) that may assist you in analyzing and maximizing your code's performance.

This section offers suggestions to ensure that the time-critical portions of your code perform satisfactorily.

Cache Hits and Page Faults
Sorting and Searching
MFC and Class Libraries
Shared Libraries
Heaps
Threads
Small Working Set

Summary of Improving Time-Critical Sections of Code

The following points summarize where you should concentrate your efforts to make your code run as efficiently as possible:

Know which parts of your program have to be fast.
Know the size and speed of your code.
Know the cost of new features.
Know the minimum work needed to accomplish the job.

Cache Hits and Page Faults

Missed cache hits, on both the internal and external cache, as well as page faults (going to secondary storage for program instructions and data) slow the performance of a program.

It is estimated that a missed CPU cache hit results in a loss of 10–20 clock cycles. A missed external cache hit results in a loss of 20–40 clock cycles. A page fault results in about 1 million lost instructions. Therefore, it is in the best interest of program execution to write code that will reduce the number of missed cache hits and page faults.

One reason for slow programs is that they take more page faults or miss the cache more often than necessary. To avoid this, it’s important to use data structures with good locality of reference, which means keeping related things together. Sometimes a data structure which looks great turns out to be horrible because of poor locality of reference, and sometimes the reverse is true. Here are two examples:

Dynamically allocated linked lists can reduce program performance because when you search for an item or when you traverse a list to the end, each skipped link could miss the cache or cause a page fault. A list implementation based on simple arrays might actually be much faster because of better caching and fewer page faults—even allowing for the fact that the array would be harder to grow, it still might be faster.
Hash tables that use dynamically allocated linked lists can degrade performance. By extension, hash tables that use dynamically allocated linked lists to store their contents might perform substantially worse than you might expect. In fact, in the final analysis, a simple linear search through an array might actually be faster (depending on the circumstances). Array-based hash tables (so-called “closed hashing”) is an often-overlooked implementation which frequently has superior performance.

Sorting and Searching

Sorting is an inherently time consuming operation (compared to other typical operations you might have to do). The best way to speed up code that is slow due to sorting is to find a way to not sort at all.

Perhaps the sort can be deferred to a non-performance–critical time.
Perhaps only part of the data truly needs to be sorted.
Perhaps information was available at some other time that would have made the sort easier to do later.

You might be able to build the list in sorted order. But this might slow you down overall because inserting each new element in order could require a more complicated data structure with possibly worse locality of reference. There are no easy answers; try several approaches and measure the differences.

If you find that you must sort there are a few things to remember:

Use a stock sort; writing your own is a recipe for bugs.
Anything you can do before the sort that simplifies what you need to do during the sort is probably a good thing. If a sort runs in O(n log n) and you can make a one-time pass over your data to simplify the comparisons such that the sort can now run in O(n), you’re well ahead*.*
Think about the locality of reference of the sort algorithm you’re using versus the data that you expect it to run on.

There are fewer alternatives to searching than for sorting if you want to improve performance. If you need to do a search and if the search is time-critical, a binary search or hash table lookup is almost always the right answer. But again keep locality in mind. A linear search through a small array might be faster than a binary search through a data structure with a lot of pointers, which could cause page faults or cache misses.

MFC and Class Libraries

The Microsoft Foundation Class library (MFC) provides a framework that can greatly simplify writing code. Other class libraries are available for the same purpose. The following discussion is specific to MFC but is valid for similar idioms in other libraries.

When writing time-critical code, you should be aware of the overhead implied in using some of the classes. Examine the MFC code that your time-critical section is using to see if it meets your performance requirements. The following is a list of some performance issues you should be aware of in MFC.

CString The memory for a is dynamically allocated, which means it calls the C run-time library which in turn could call the operating system. So don’t use a CString simply to store a constant string; use const char *. Often a simple char array on the stack is the right answer.

The CString class simplifies the writing of code that does string manipulation. Keep in mind, however, that any operation you perform with a CString object could imply a significant amount of overhead that could affect your performance. Step through the code that is generated to see exactly what is executed. It may be faster to use the run-time library string functions to have more complete control over what is executed. Generally speaking a CString is as efficient as any other dynamically allocated string, which means you bear the overhead of the dynamic allocation and release of memory.
CArray and CList The and classes provide functionality to your MFC program that is not available to regular arrays. You may not need CArray if you know the specific limits necessary for the array; in that case you can just use a global fixed array. If you use CArray, you should use CArray::SetSize to establish its size and specify the number of elements by which to grow whenever a reallocation is necessary. If you do not use SetSize, adding elements could cause your array to be frequently reallocated and copied, which is inefficient and can fragment memory. You should also be aware that if you insert an item into an array, CArray will move subsequent items in memory. Before moving, CArray may need to grow the array. Both of these actions could cause cache misses or page faults. This is one of the things that is not obvious if you just use the CArray class members, and again you should step through the code that MFC uses to see if you can write something more specific to your scenario that would improve your code's performance. For example, you could provide CArray specializations for specific types, since it is a template.

CList is a doubly-linked list, which means that element insertion is fast at the head, tail, and at a known POSITION in the list. A sequential search is necessary to look up an element by value or index. This search can be slow if the list is long. If your code just requires a singly-linked list, then you may want to reconsider using CList, whose overhead includes updating an additional pointer for all operations as well as the memory for that pointer. While the additional memory is not great, it is another opportunity for missed cache hits or page faults.
IsKindOf This function can generate many calls and potentially visit a lot of memory in different data areas (bad locality of reference), which you can observe as you step through the MFC source code. IsKindOf is useful for a debug build (in an ASSERT call, for example) but you should consider whether it can be avoided in a release build.
PreTranslateMessage function You should use PreTranslateMessage in cases where a particular tree of windows needs different keyboard accelerators. You can also use PreTranslateMessage where you must insert message handling into the message pump. PreTranslateMessage is the approved way of altering the way MFC dispatches messages. You should be careful to override PreTranslateMessage only at the level needed. For example, it is not necessary to override CMainFrame::PreTranslateMessage if all you are interested in is messages going to children of a particular view. In that case, just override PreTranslateMessage for the view class.

You should not circumvent the normal dispatch path by using PreTranslateMessage to handle any message sent to any window. For this you should use window procedures and MFC message maps.
OnIdle function Idle events can occur at times you do not expect, such as between WM_KEYDOWN and WM_KEYUP events. Timers may be a more efficient way to trigger your desired behavior. Also, you should not force OnIdle to be called repetitively by generating false messages or by always returning TRUE from an override of OnIdle because this will never allow your thread to sleep. Again, a timer or a separate thread might be more appropriate.

Shared Libraries

Code reuse is desirable. However, if you are going to use someone else’s code, you should make sure you know exactly what it does in those cases where performance is critical to you. The best way to understand this is by stepping through the source code or by measuring with tools such as the profiler, PView, or Windows NT Performance monitor.

Heaps

Use multiple heaps with discretion. Additional heaps are created with and and let you manage and then dispose of a related set of allocations. You don’t want to commit too much memory. If you’re using multiple heaps, pay special attention to the amount of memory that is initially committed.

Instead of multiple heaps, you may want to use helper functions to wrap the default heap. You can write helper functions to interface between your code and the default heap. Helper functions could facilitate custom allocation strategies that could improve the performance of your application. For example, if you perform many small allocations frequently, you may want to localize these allocations to one part of the default heap. You can allocate a large block of memory and then use a helper function to suballocate from that block. If you do this, you will not have additional heaps with unused memory because the allocation is coming out of the default heap.

On the other hand, using the default heap might get you less locality of reference than you desire. You really have to measure the effects of moving objects from heap to heap with tools such as the profiler, PView, Spy++, or Windows NT Performance monitor.

Measure your heaps so you can account for every allocation on the heap. Use the C run-time debug heap routines to checkpoint and dump your heap. You could then read the output into Excel and use pivot tables to view the results. Note the total number and size of allocations as well as the distribution of the allocations. You may want to compare these versus the size of working sets. You may also be interested in clustering of related-sized objects.

You can also use the Windows NT performance counters to monitor memory usage.

Threads

When you need a background task, threads may not be your best solution. Effective idle handling of the event may be a better solution. It’s easier to understand the locality of reference of your program with one thread than with two or more.

A good rule of thumb is that you should only use a thread if an operating system notification that you need to block on is at the root of the background work. Threads are best in such a case because it is not practical to block a main thread on an event.

Threads also present communication problems. You will have to manage the communication link between your threads, whether it be with a list of messages or by allocating and using shared memory. Managing your communication link will likely require synchronization to avoid race conditions and deadlock problems. Such complexity can easily turn into bugs and performance problems.

For addititonal information, see Idle Loop Processing and Multithreading Topics.

Small Working Set

Smaller working sets mean fewer page faults and more cache hits. This directly translates to into performance gains. We’ve mentioned locality of reference throughout this document; a small working set can mean good locality of reference. The process working set is the closest metric the operating system directly provides for telling you if you have good or bad locality of reference.

See ****for information on how to set the upper and lower limits of the working set. See for information on how to obtain the minimum and maximum working set sizes. You can use or to view the size of the working set.

通过