Performance Tidbits

(Some additional remarks on this posting can be found here -- feel free to continue comments on that chain)

Here are a few things that I often look for when reviewing code or APIs for performance issues.  None of these are absolutes but they’re little things that seem to come up fairly often.  I've covered many of these before.

In no particular order:

Delegates:   Are you using delegates when you could just be using polymorphism?  Delegates let you arrange for any method on any object to be called.  With an interface or a virtual method you get a fixed method on any object, which is often good enough.  Delegates cost you on a per-instance basis; virtual methods cost on a per-class basis.
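To make the per-instance versus per-class distinction concrete, here is a minimal sketch (the Handler, DelegateWidget, and IHandler names are made up for illustration): the delegate form stores a delegate object in every instance, while the interface form pays only for an entry in the type’s method table.

    using System;

    // Hypothetical types for illustration only.
    delegate void Handler(string item);

    class DelegateWidget
    {
        private Handler handler;                          // a delegate object per instance

        public DelegateWidget(Handler h) { handler = h; }
        public void Process(string item) { handler(item); }
    }

    // The interface/virtual version: the dispatch information lives in the
    // method table, so the cost is per class rather than per instance.
    interface IHandler
    {
        void Handle(string item);
    }

    class LoggingHandler : IHandler
    {
        public void Handle(string item) { Console.WriteLine(item); }
    }

If every instance would have been wired to the same method anyway, the interface form gives the same flexibility without the per-instance field.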

Virtual Methods:   Are you using virtual methods when direct calls would do?  Many times people go with virtual methods to allow for future extensibility.  Extensibility is a good thing but it does come at a price – make sure your full extensibility story is worked out and that your use of virtual functions is actually going to get you to where you need to be.  For instance, sometimes people think through the call site issues but then don’t consider how the “extended” objects are going to be created.  Later they realize that (most of) the virtual functions didn’t help at all and they needed an entirely different model to get the “extended” objects into the system.

Sealing: Sealing can be a way of limiting the polymorphism of your class to just those sites where polymorphism is needed.  If you fully control the type, sealing can be a great thing for performance because it enables direct calls and inlining.
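As a minimal sketch (ParserBase and Parser are hypothetical names): once the most-derived type is sealed, nothing can override its methods, so calls made through a reference of the sealed type are candidates for direct calls and inlining.

    // Hypothetical types for illustration only.
    public class ParserBase
    {
        public virtual int Parse(string s) { return s.Length; }
    }

    public sealed class Parser : ParserBase
    {
        // No further overrides are possible, so calls through a Parser-typed
        // reference can be made direct (and possibly inlined) by the JIT.
        public override int Parse(string s) { return s.Length * 2; }
    }

Call sites typed as ParserBase still pay for the flexibility they ask for; call sites typed as the sealed Parser do not.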

Type and API Inflation:   In the managed world there seems to be a tendency to add more classes with more members, each of which does less.  This can be a great design principle but it isn’t always appropriate.  Consider your unit-of-work carefully and make sure the APIs are chunky enough to do their job well (see below).  Don’t forget that each class and function has a static overhead associated with it; all things being equal, fewer classes with fewer functions give better performance.  Make sure that you aren’t adding the kitchen sink to your classes just because it fits.

API Chunkiness:   Wrong-sized APIs often translate into wrong-sized transactions to an underlying database or memory store.  Carefully consider issues like the unit of work (how big the transactions are) and isolation even if you aren’t talking to a database.  It’s often good to think about in-memory structures as though they were a database that needed to be accessed concurrently, even if they aren’t.
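As a sketch of what “chunky enough” means (the customer-store names below are hypothetical), compare a chatty shape that turns one logical read into three round trips with a chunky shape that does the whole unit of work in a single call:

    // Hypothetical interfaces for illustration only.

    // Chatty: one call (and potentially one transaction or round trip) per field.
    interface IChattyCustomerStore
    {
        string GetName(int customerId);
        string GetAddress(int customerId);
        string GetPhone(int customerId);
    }

    // Chunky: the unit of work is one call that returns the whole record.
    class CustomerRecord
    {
        public string Name;
        public string Address;
        public string Phone;
    }

    interface IChunkyCustomerStore
    {
        CustomerRecord GetCustomer(int customerId);
    }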

Concurrency:   Don’t use a more complex concurrency model than is necessary.  Often the very simplest model is all that is needed; complex sharing rules or low-level synchronization frequently ends up hurting much more than it helps.  Put synchronization at the layer of your implementation that best understands the “transaction” and have none at higher or lower levels if you can possibly avoid it.  Steer clear of complicated synchronization methods, especially those that require specific knowledge of the strength or weakness of the processor’s memory model.
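One way to picture “synchronization at the layer that understands the transaction” is sketched below (Inventory and its members are hypothetical): a single lock is taken around the whole logical operation, and the collections underneath carry no locking of their own.

    using System.Collections.Generic;

    // Hypothetical type for illustration only.
    class Inventory
    {
        private readonly object gate = new object();
        private readonly Dictionary<string, int> stock = new Dictionary<string, int>();
        private readonly List<string> auditLog = new List<string>();

        // The "transaction" here is: check the stock, decrement it, and record
        // who took it.  One lock around the whole thing; no locking below.
        public bool TryReserve(string sku, string who)
        {
            lock (gate)
            {
                int count;
                if (!stock.TryGetValue(sku, out count) || count == 0)
                    return false;

                stock[sku] = count - 1;
                auditLog.Add(who + " reserved " + sku);
                return true;
            }
        }
    }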

Fewer DLLs:   All things being equal you tend to get better performance out of fewer, larger DLLs than you do out of lots of smaller DLLs.  This does break down at some point, especially if a DLL eagerly initializes parts of itself.  Where more DLLs tend to win is in superior patching opportunities, and in cases where you can avoid loading many or most of the DLLs at all in the common scenarios.  Consolidate DLLs where it makes sense to do so.

Late Bound Semantics: Simply put, if you don’t need Reflection then don’t use it.  However hard we work on reflective access to types and members, it will never be as fast or as economical as early-bound access.  If you are using Reflection for extensibility, consider patterns where only the extended cases pay the cost of Reflection.
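Here is a rough sketch of the “only the extended cases pay” pattern (the formatter names are hypothetical): the common path is entirely early bound, and a plug-in pays one reflective creation up front, after which it is called through an ordinary interface.

    using System;

    // Hypothetical types for illustration only.
    interface IFormatter
    {
        string Format(string input);
    }

    class UpperCaseFormatter : IFormatter              // built-in, early bound
    {
        public string Format(string input) { return input.ToUpper(); }
    }

    class FormatterFactory
    {
        public static IFormatter Create(string extensionTypeName)
        {
            // Common case: no Reflection at all.
            if (extensionTypeName == null)
                return new UpperCaseFormatter();

            // Extended case: one reflective creation (assumes the name resolves),
            // then every later call goes through the early-bound interface.
            Type t = Type.GetType(extensionTypeName);
            return (IFormatter)Activator.CreateInstance(t);
        }
    }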

Fewer Pointers:   I’d love to see more arrays of primitives and fewer forests of pointers.  Pointer-rich data structures generally do less well on modern processors.  Pretend you have to pay me one picopenny per pointer. How big would the check be by the end of the year?
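To picture the difference (the Sample types are hypothetical): an array of a small value type is one contiguous block with no pointers at all, while an array of references to small objects is a thousand pointers into a thousand separate allocations.

    // Hypothetical types for illustration only.
    struct SampleValue              // value type: stored inline in the array
    {
        public double Time;
        public double Amplitude;
    }

    class SampleObject              // reference type: one allocation per element
    {
        public double Time;
        public double Amplitude;
    }

    class PointerDemo
    {
        static void Main()
        {
            // One contiguous allocation, zero pointers per element.
            SampleValue[] flat = new SampleValue[1000];
            flat[0].Time = 1.0;

            // 1000 pointers in the array plus 1000 separately allocated objects.
            SampleObject[] forest = new SampleObject[1000];
            for (int i = 0; i < forest.Length; i++)
                forest[i] = new SampleObject();
        }
    }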

Cache with Policy:   Make sure that any caches you build have a well-understood policy for removing/aging items and don’t just grow forever.  A cache without a proper policy isn’t a cache, it’s a memory leak.  Weak-pointer-based caches look cute on paper but often still suffer from bad policy.
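A minimal sketch of what “with policy” means (BoundedCache is a made-up name, and least-recently-added eviction was chosen only for brevity): the important parts are the hard cap and the eviction step, not the particular policy.

    using System.Collections.Generic;

    // Hypothetical type for illustration only: a tiny cache whose one
    // redeeming feature is that it has a policy -- a hard cap plus
    // least-recently-added eviction.  Without those it would just be a leak.
    class BoundedCache<TKey, TValue>
    {
        private readonly int capacity;
        private readonly Dictionary<TKey, TValue> map = new Dictionary<TKey, TValue>();
        private readonly Queue<TKey> order = new Queue<TKey>();

        public BoundedCache(int capacity) { this.capacity = capacity; }

        public void Add(TKey key, TValue value)
        {
            if (map.ContainsKey(key))
            {
                map[key] = value;                // update in place, already tracked
                return;
            }
            if (map.Count >= capacity)           // the policy: evict the oldest entry
            {
                TKey oldest = order.Dequeue();
                map.Remove(oldest);
            }
            map.Add(key, value);
            order.Enqueue(key);
        }

        public bool TryGet(TKey key, out TValue value)
        {
            return map.TryGetValue(key, out value);
        }
    }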

For more information have a look at the Performance and Scalability PAG  -- chapter 5 particularly targets managed code.

Comments

  • Anonymous
    August 24, 2004
    The problem with this is ease of use.

    While C#/.NET 2.0 introduces Anonymous methods, it doesn't do the same for Anonymous classes, which Java has had for some years now.

    So, while in Java you can do this:

    abstract class SomeClass
    {
        abstract public void DoX();
    }

    SomeClass instance = new SomeClass()
    {
        public void DoX()
        {
            //implementation
        }
    };
    instance.DoX();


    In C# you have to do this:


    // StrategyDelegate needs to be declared somewhere, e.g.:
    delegate void StrategyDelegate();

    class SomeClass
    {
        public void DoX(StrategyDelegate sd)
        {
            sd();
        }
    }

    SomeClass instance = new SomeClass();
    instance.DoX(delegate
    {
        //implementation
    });
  • Anonymous
    August 24, 2004
    Hi Rico,

    > Don’t forget that each class and function has a static overhead associated with it; all things being equal, fewer classes with fewer functions give better performance

    I have been wondering about this actually,
    is there any chance you could expand on this point?

    Specifically, which parts of the CLR degrade in performance based on the number of classes I define in my application?

    Thanks,

    Sam
  • Anonymous
    August 24, 2004
    It may be that there are lingering non-linearities in some of the CLR class management algorithms but that isn't something I see people hit at all.

    What I was referring to is the fact that there is a space cost associated with each class -- metadata to load, method tables, method descriptions per method and so forth. The situation is somewhat better if you ngen but nonetheless there is definitely a space cost there. Reducing the overhead is always a good thing.
  • Anonymous
    August 24, 2004
    Just for fun, try counting the number of these performance "issues" that apply to System.Windows.Forms. And I'm in no way knocking the SWF designers... I know that UI doesn't need to be that performant, but it does have a big effect on memory usage.

    1) Delegates...Lots of delegates...unless you inherit from most of the controls and provide any custom functionality you need in overridden methods. It's a fairly clean design, but it also leads to more pointers.

    2) Virtual Methods...lots of these too. Even the core methods that handle the message loop are virtual. Of course, this makes it easier to provide customized behavior. But I should also mention that many properties are virtual as well.

    3) Sealing...Not a big problem, since sealing most of these classes would really reduce extensibility. And there aren't too many sealed classes I want unsealed here, except for a few big ones, such as the Common Dialogs.

    4) Type and API Inflation...lots of classes, but I don't know that I'd say too many. I wish the superseded classes could be moved to another library (such as the old MainMenu, etc.), but that would break compatibility. Besides, the library is NGen'd.

    5) API Chunkiness...I'm not sure how this applies.

    6) Concurrency...Not much of an issue for the forms library, since you're not supposed to use the controls outside of their message-loop thread. Invoke is provided, and most synchronization is left up to the consumer.

    7) Fewer DLLs...Not a problem. It's one huge DLL. It uses System, System.Drawing, System.Data and maybe a few others, but those need to be separate for obvious reasons.

    8) Late Bound Semantics...Quite a bit here. Both data binding and automatic localization are late bound. The first could be solved safely with delegates to properties (which I don't think is going to ever happen), while the second could be solved by doing your own localization code.

    9) Fewer Pointers...LOTS of pointers in SWF. Not only does each Control maintain an object collection of all its child controls, but the designer adds ANOTHER pointer for each designed control, along with all of the parent pointers. Luckily Whidbey lets you design controls without a member.

    10) Cache with Policy...Not sure if this is a problem in SWF. I'd have to load it up in Reflector to see, but that's a lot of code to skim.

    And one more...Control deriving from MarshalByRefObject seems like a bit of an oversight. If I want to communicate with my app across domains, I will use a more abstract version of my app, not the form itself. To make controls and forms perform well, the bulk of their code really needs to be moved to external classes, which leads to Type Inflation.
  • Anonymous
    August 24, 2004
    Thanks for the advice!

    I recently did some performance tests on an area you didn't mention: calculations. In my case, I was wondering what the actual performance differences are between doing calculations using Doubles and simulating fixed-point operations with Longs, and also what the effect of having instances of Double.NaN in a data stream would be. Background: I am implementing some signal processing algorithms. These tests were of course simple, and not too many conclusions should be drawn from them, but I found a few surprises. (A rough sketch of the fixed-point simulation and the NaN guard I mean is at the end of this comment.)

    - As soon as you do some multiplications (in addition to additions/compares), floating point is faster. This is more or less expected.

    - When checking algorithms that do nothing but simple operations (addition, subtraction, compare), longs are faster, but the margin by which they are faster was surprising: it is highly dependent on the processor! For the one Intel CPU based machine I tested it on, floating point arithmetic was about 1.5 times slower than long integer arithmetic, while for the AMD processor the difference was negligible...

    - Beware of NaNs: they severely slow down processing. I realize NaNs are not encountered in many programming scenarios, but in the signal processing scenarios I was analyzing they are used (to indicate missing data). In a processing pipeline that only uses (floating point) additions, subtractions and compares, they slowed down the test by a factor of 100 when I used only NaNs as inputs (which is an unrealistic scenario, but still...)

    - Explicitly testing for NaNs (putting an 'if(Double.IsNaN(x)){}else{}' block around your core calculation code) barely improves the performance in the case where all processed samples actually are NaNs, and degrades performance in other cases quite significantly. It seems that the 'Double.IsNaN()' method itself has a similar influence on the pipeline as actually having NaNs in the data...
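    Roughly, the shape of what I mean by simulating fixed-point with longs, and by the explicit NaN guard, is this (a simplified sketch; the 16.16 scale factor and the names are arbitrary):

    using System;

    // Simplified sketch; scaling and names are arbitrary.
    class FixedVsFloat
    {
        const long Scale = 1 << 16;                 // 16.16 fixed point

        // Fixed-point simulation with longs: add/compare are plain integer ops,
        // multiply needs a rescale afterwards.
        static long FixedAdd(long a, long b) { return a + b; }
        static long FixedMul(long a, long b) { return (a * b) / Scale; }

        // The explicit NaN guard around the core floating-point calculation.
        static double GuardedAdd(double x, double y)
        {
            if (Double.IsNaN(x) || Double.IsNaN(y))
                return Double.NaN;                  // skip the real work
            else
                return x + y;
        }
    }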
  • Anonymous
    August 24, 2004
    A note on sealing:

    This only improves performance if you have virtual methods on that class or its parents, and you're calling them. It helps because a call to a virtual method that could not possibly be overridden can be changed into a direct call or inlined by the JIT.

    I still recommend sealing as much as you can, though. You can always unseal it later without breaking anything.
  • Anonymous
    August 25, 2004
    Hi Rico,

    Thanks for the info.

    <Quote>
    What I was referring to is the fact that there is a space cost associated with each class -- metadata to load, method tables, method descriptions per method and so forth. The situation is somewhat better if you ngen but nonetheless there is definitely a space cost there. Reducing the overhead is always a good thing.
    </Quote>

    It would be interesting to know how much space a simple class (say, one with a single non-virtual method) takes in the CLR with all its associated data. Can you think of a good way to measure this?

    Thanks,

    Sam
  • Anonymous
    August 25, 2004
    Write a little program generator that makes something like this:

    using System;

    class Test
    {
        public static void Main(String[] args)
        {
            Foo1.f();
            Foo2.f();
            // ...
            Console.ReadLine(); // pause to allow measurement
        }

        class Foo1 { static public void f() {} }
        class Foo2 { static public void f() {} }
        // ...
    }

    Make it for 500 Foos and measure the size with your favorite tool. Then make it for 1000 Foos and measure the size again; divide the delta by 500 for the per-class size. (A sketch of one possible generator is at the end of this comment.)

    Consider various different measures, such as size of the assembly, working set size of the process, working set size using ngen, etc. etc.

    Might make a nice Quiz #5
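
    One possible shape for the generator itself (just a sketch; the file naming and the lack of error handling are arbitrary choices):

    using System;
    using System.IO;

    // Sketch only: usage is "GenerateTest 500", which writes Test500.cs.
    class GenerateTest
    {
        static void Main(string[] args)
        {
            int n = int.Parse(args[0]);
            using (StreamWriter w = new StreamWriter("Test" + n + ".cs"))
            {
                w.WriteLine("using System;");
                w.WriteLine("class Test");
                w.WriteLine("{");
                w.WriteLine("    public static void Main(String[] args)");
                w.WriteLine("    {");
                for (int i = 1; i <= n; i++)
                    w.WriteLine("        Foo" + i + ".f();");
                w.WriteLine("        Console.ReadLine(); // pause to allow measurement");
                w.WriteLine("    }");
                for (int i = 1; i <= n; i++)
                    w.WriteLine("    class Foo" + i + " { static public void f() {} }");
                w.WriteLine("}");
            }
        }
    }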
  • Anonymous
    August 25, 2004
    Luc

    The performance implication of evenly splitting code into 50% adds and 50% multiplies (or any other ratio) is processor-architecture specific, just so you know. I BELIEVE it's the Intel chips that like an even split better (they also do exceedingly well at vectorized SSE2 but have problems with branchy and scalar-y code). Obviously the biggest place where you see a nice even split like that between adds and multiplies is in matrix multiply.

    AMD chips GENERALLY have better performance on code that wasn't written with low-level performance optimizations in mind, but Intel chips are generally better if you've really killed yourself to maximize cache coherency and use of vectorized instructions.

    However, you probably want to worry just as much (if not more) about how you can:
    1. maximize parallelism in your algorithms
    2. distribute this parallel load best among multiple cores, chips, and computers.

    Most signal processing work is pretty EP-ish (embarrassingly parallel), and .NET offers some different ways besides the old MPI of exploiting this.
  • Anonymous
    August 28, 2004
    Rico,

    I thought I would give your suggestion a shot last night.

    If there is one thing I have learned from the experience, it's that accurately measuring the memory usage of a .NET application is quite a tricky business, considering working sets and so forth.

    The best I have come up with so far is "somewhere around the 200 bytes mark", but I have little confidence in that figure.

    Have you any suggestions on how I might go about getting an accurate memory reading?

    Sam
  • Anonymous
    September 01, 2004
    Rico

    Any thoughts on the fact that Java's advanced JIT has been inlining virtual method calls for many years?
    http://java.sun.com/products/hotspot/whitepaper.html

  • Anonymous
    September 03, 2004
    I haven't read the whitepaper -- it's usually a very bad idea for me to look at Sun's Intellectual Property in any way as sadly it's more likely to put me in a bad position than a good one. But let me talk about dynamic inlining -- it's not a new idea anyway.

    There are cases where “inlining” a virtual function call works out well (i.e. guess the class it probably is, put in a test for that class, and then either run the inlined code or else make the call if the class turns out to be wrong; there's a sketch of that shape at the end of this list). But:

    1) You have to know what the call is probably going to be

    2) You have to be willing to eat the extra code size

    3) What's being inlined needs to be big enough that adding an extra test won't slow it down percentage-wise by much, yet not so big that inlining is moot (keeping in mind processor caches are a fixed size, so copying the code to lots of places isn't such a great idea -- by default bigger is slower)

    4) It's best if you know not only which function to inline but also which of the many call sites is the important one to inline -- less bloat that way.

    5) Even if all of those things work out it's still not as good as not having the test and the fallback virtual dispatch at all and just inlining the code. So still seal if you can :)
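
    Written out by hand in C# (with hypothetical Shape/Circle types; a real JIT emits this shape in generated machine code, not in source), the guarded form above looks roughly like this:

    using System;

    // Hypothetical types for illustration only.
    class Shape
    {
        public virtual double Area() { return 0.0; }
    }

    sealed class Circle : Shape
    {
        public double Radius;
        public override double Area() { return Math.PI * Radius * Radius; }
    }

    class Consumer
    {
        static double TotalArea(Shape[] shapes)
        {
            double total = 0.0;
            foreach (Shape s in shapes)
            {
                Circle c = s as Circle;     // guess the likely class and test for it
                if (c != null)
                    total += Math.PI * c.Radius * c.Radius;   // inlined body for the common case
                else
                    total += s.Area();      // fallback: the ordinary virtual call
            }
            return total;
        }
    }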

  • Anonymous
    September 05, 2004
    Rico-

    The Sun inliner doesn't put a test into the code. It just inserts the virtual method's body straight in. So it ends up being the exact same native code (and performance) as sealing. They have a speculative optimizer/deoptimizer pair to compute when devirtualizing a method in this way is safe.
  • Anonymous
    September 13, 2004
    Hey Rico

    Got anything to add in response to Nicholas' comments?

    Is this perhaps the reason why .NET methods were made non-virtual by default, while Java methods are virtual by default?

    M.
  • Anonymous
    September 13, 2004
    I wasn't party to either decision so it's hard for me to say why the default is what it is. I surely can't speak to my competitor's choice. For C# that's really a question for Anders though, not me.

    I don't think we'd make such a choice on the basis of what our inliner happens to do at any given moment. They probably thought it best to align with C++ on the matter.

  • Anonymous
    September 17, 2004
    M-

    It probably wasn't an impact on the decision in Java. It was several years in before JIT compilers became a regular feature, and several more before these kinds of aggressive optimizations were being made. Of course, the idea of a JIT is not recent so Sun may have had it in mind, although I personally don't think that was the case. You might be able to find more information about that design decision by asking Gosling, or examining the prototype Oak language.

    The reverse statement also might be true: the decision to make methods virtual by default may have spurred Sun to invest more resources into their VM and optimize that case.

    As for C#, I think Rico says it well.