Concurrency, Part 10 - How do you know if you've got a scalability issue?

Well, the concurrency series is finally running down (phew, it's a lot longer than I expected it to be)...

Today's article is about how to tell whether you've got a scalability problem.

First, a general principle: all non-trivial, long-lived applications have scalability problems.  It's possible that the scalability issues don't matter to your application.  For example, if your application is Microsoft Word (or mIRC, or Firefox, or just about any other application that interacts with the user), then scalability isn't likely to be an issue for your application - the reality is that the user isn't going to try to make your application faster by throwing more resources at it.

As I wrote the previous paragraph, I realized that it describes the heart of scalability issues - if the user of your application feels it's necessary to throw more resources at your application, then your application needs to worry about scalability.  It doesn't matter if the resources being thrown at your application are disk drives, memory, CPUs, GPUs, blades, or entire computers: if the user decides that your system is bottlenecked on a resource, they're going to try to throw more of that resource at your application to make it run faster.  And that means that your application needs to be prepared to handle it.

Normally, these issues only matter for server applications living in data farms, but we're starting to see the "throw more hardware at it" idea trickle down into the home space.  As usual, the gaming community is leading the way - the Alienware SLI machines are a great example of this - to improve your 3D graphics performance, simply throw more GPUs at the problem.

I'm not going to go into diagnosing bottlenecks in general; there are loads of resources available on the web for it (my first Google hit on Microsoft.com was this web cast from 2003).

But for diagnosing CPU bottlenecks related to concurrency issues, there's actually a relatively straightforward way of determining if you've got a scalability issue associated with your locks.  And that's to look at the "Context Switches/sec" perfmon counter.  There's an article on how to measure this in the Windows 2000 resource kit here, so I won't go into the details, but in a nutshell, you start the perfmon application, select all the threads in your application, and look at the context switches/sec for each thread.

You've got a scalability problem related to your locks if the context switches/second is somewhere above 2000 or so.
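
To make the rule of thumb concrete, here is a minimal Python sketch (not from the original measurements) that turns two readings of a thread's cumulative context-switch count into a rate and compares it with the 2000-per-second guide. The `Sample` type and the function names are hypothetical, and how you obtain the readings (perfmon or otherwise) is deliberately left out.

```python
from dataclasses import dataclass

# Rule-of-thumb threshold from the article; a rough guide, not a hard limit.
SUSPECT_SWITCHES_PER_SEC = 2000.0


@dataclass
class Sample:
    """One reading of a thread's cumulative context-switch count (hypothetical shape)."""
    timestamp_sec: float
    context_switches: int


def switches_per_sec(first: Sample, second: Sample) -> float:
    """Average context-switch rate between two readings of the same thread."""
    elapsed = second.timestamp_sec - first.timestamp_sec
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return (second.context_switches - first.context_switches) / elapsed


def looks_contended(first: Sample, second: Sample,
                    threshold: float = SUSPECT_SWITCHES_PER_SEC) -> bool:
    """True when the rate exceeds the threshold and the thread deserves a closer look."""
    return switches_per_sec(first, second) > threshold
```

A thread that goes from 1000 to 7000 switches over two seconds averages 3000 per second, which is above the guide and worth investigating.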

And that means you need to dig into your code to find the "hot" critical sections.  The good news is that it's not usually too hard to detect which critical section is "hot" - hook a debugger up to your application, start your stress and put a breakpoint in the ntdll!RtlEnterCriticalSection routine.  You'll get a crazy number of hits, but if you look at your call stacks, then the "hot" critical section will start to show up.  It sounds tedious (and it is somewhat) but it is surprisingly effective.   There are other techniques for detecting the "hot" critical sections in your process but they are not guaranteed to work on all releases of Windows (and will make Raymond Chen very, very upset if you use them).
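
If you own the locking code, another way to see which lock is hot is to count contention yourself. The sketch below is an analogy in Python rather than anything Win32-specific: `CountingLock` and `hottest` are hypothetical names, and the idea is simply to try a non-blocking acquire first and record every time that fails, on the assumption that only acquisitions that have to wait are interesting.

```python
import threading


class CountingLock:
    """A lock wrapper (hypothetical) that counts how often acquirers had to wait."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.acquisitions = 0  # total successful acquisitions
        self.contended = 0     # acquisitions that found the lock already held

    def __enter__(self):
        # Fast path: take the lock without waiting if nobody holds it.
        waited = not self._lock.acquire(blocking=False)
        if waited:
            # Slow path: someone else holds it, so block until it is free.
            self._lock.acquire()
        # We hold the lock here, so updating the counters is race-free.
        self.acquisitions += 1
        if waited:
            self.contended += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # do not swallow exceptions


def hottest(locks):
    """Return the locks ordered from most to least contended."""
    return sorted(locks, key=lambda lock: lock.contended, reverse=True)
```

Wrap each suspect lock, run your stress test, and look at the first few entries of `hottest(...)`; a lock whose `contended` count is a large fraction of its `acquisitions` is a likely candidate.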

Sometimes, your CPU bottleneck is simply that you're doing too much work on a single thread - if it simply takes too much time to calculate something, then you need to start seeing if it's possible to parallelize your code - you're back in the realm of making your code go faster and out of the realm of concurrent programming.  Another option that you might have is the OpenMP language extensions for C and C++ that allow the compiler to start parallelizing your code for you.
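
As a rough illustration of what parallelizing a computation looks like, the sketch below splits a sum of squares into independent chunks, computes each chunk on a worker, and combines the partial results. This is the same shape as a parallel loop with a reduction. It is written in Python with hypothetical names; note that in standard CPython builds, threads will not speed up pure-Python arithmetic like this because of the global interpreter lock, so treat it as a picture of the structure, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(values):
    """The per-chunk work: independent of every other chunk."""
    return sum(v * v for v in values)


def split(values, parts):
    """Cut a list into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum_of_squares(values, workers=4):
    """Compute each chunk on a worker thread, then combine the partial sums."""
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunks)
        # Combining the partial results is the only step that needs them all.
        return sum(partials)
```

The work parallelizes cleanly here only because no chunk depends on another; if the chunks shared mutable state, you would be back to locks and the contention problems described above.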

But even if you do all that and ensure that your code is bottleneck free, you still can have scalability issues.  That's for tomorrow.

Edit: Fixed spelling mistakes.

Comments

  • Anonymous
    March 03, 2005
    The comment has been removed
  • Anonymous
    March 03, 2005
    A valid point. But I doubt that anyone's going to believe that adding more disk drives to their machine is going to speed up Firefox.

    Similarly, they're not likely to add more network adapters or more CPUs.

    On the CPU issue, there's an easy check: Look at your CPU usage when you're running Firefox (or mIRC, or Word, or whatever). If it's at 100% in the application's process, then there might be an issue, but if it isn't at 100% it's not likely there's a huge problem.

    If your application interacts with a user, then (with the exception of games), the application is going to be spending 99% of its time waiting on the user.
  • Anonymous
    March 03, 2005
    I was just wondering about how many context switches is too many. On an idle Win2k3 server I see system-wide switches at about 500 or 600/sec. Under load that skyrockets to 6000+, which imho is way too many, but that number is the total number of context switches. Your estimate of 2000 being too many - is that per thread? Or in the whole system?
  • Anonymous
    March 03, 2005
    The comment has been removed
  • Anonymous
    March 03, 2005
    This definitely has been a great series. I know .NET isn't your area of expertise, but would you have any idea how, in a managed app, we can go in and look at what is locking what? I am just beginning to think this is an issue with one of my earlier .NET apps, as it is now growing huge and is pretty much asynchronous, so something in there is taking some time.

    I had another question I was going to ask you about this as well; I thought of it late last night. How does multithreading affect multiple devices, like when dealing with hardware? I recently bought a Zen Micro, and I love that it just plugs into Windows Media Player and syncs and everything. Last night I bought a new CD and went to rip it through WMP and then sync it to my device. I noticed that WMP only rips one track at a time. While ripping, it is playing the song as well, and I can use the UI too, so there are a few threads going on there. But why is it only ripping one track at a time? Is that because the CD-ROM can only be read from a single thread at a time? If so, how are you playing: are you playing what was already written to disk? Or is the CD-ROM actually allowing multiple reads from multiple threads, and if so, is this really a good idea if I am playing track 1 while recording track 17?

    My question isn't Windows Media Player specific; you can answer however you wish. I am more curious about devices and threading to them, but last night that is what made me think of it.
  • Anonymous
    March 03, 2005
    Jeff,
    Your question dovetails quite neatly into tomorrow's article; I'll try to make sure that I discuss it.
  • Anonymous
    March 03, 2005
    Maybe your series is so long because you keep cc'ing things in the title?
  • Anonymous
    March 03, 2005
    CC'ing?
  • Anonymous
    March 03, 2005
    "CC'ing?"

    A subtle dig at the typo in the title?

    Great series though, very interesting stuff.
  • Anonymous
    March 03, 2005
    Jeff Parker:
    More than one thread can read from a CD-ROM at the same time. The ReadFile API doesn't care much about the device type. But the CD-ROM device itself has only one laser beam for reading data. Also, the physical layout of a CD is optimized for playing audio files, so you've got one long spiral of data. Andrew Tanenbaum's "Modern Operating Systems"[1] gives a very detailed discussion of how a CD-ROM works :). Knowing this, you can see that WMP would just hurt performance if it tried to rip more songs at the same time. You can try this and see it for yourself too: insert a CD-ROM with at least two big files on it and try to copy both at the same time to a HDD. It should be quicker to do it sequentially rather than in parallel.

    Great articles, Larry! Looking forward to reading the next one :)

    [1] http://www.prenhall.com/divisions/esm/app/author_tanenbaum/custom/mos2e/
  • Anonymous
    March 03, 2005
    > First, a general principal

    Palese! ^_^

    3/3/2005 11:36 AM Anonymous Coward

    > I would hazard a guess that Word is doing a
    > lot in the background.

    It is indeed, but a lot of it isn't in your list and isn't anything I've guessed. One time I had Word 2000 displaying a document, just sitting there doing nothing with a portion of the document sitting there on the screen, no animations or funny stuff like that, using 99% of the CPU time. (Windows Task Manager was surely using the other 1% to display its green rectangle.) Eventually I left it sitting there on one computer and scrolled it occasionally, and used a different computer to type a translation. After about 8 hours it had used about 7 hours 59 minutes of CPU time. I have a feeling that throwing CPUs at it wouldn't have helped.
  • Anonymous
    March 03, 2005
    Ingrid, good catch, I missed that one.

    Larry, you made another principal/principle mistake too ;-)
  • Anonymous
    March 03, 2005
    The comment has been removed
  • Anonymous
    March 03, 2005
    Unless you've set the processor affinity (I KNEW there was another API set I missed yesterday), then it'll bounce from one CPU to another.
  • Anonymous
    March 04, 2005
    I would have thought that NT would by default try to maintain some form of affinity per thread anyway? Obviously it won't try very hard unless you set the processor affinity, but I would have thought that the scheduler would at least give it a go.
  • Anonymous
    March 04, 2005
    Andrew, NT tries, but it's not totally successful. You can see that if you have an MP machine - start up a single threaded CPU bound task and look at the task manager - both CPUs will be 50% utilized.
  • Anonymous
    March 04, 2005
    Sysinternals/Mark Russinovich gives some details about how the NT scheduler works with thread affinity here [1]. So, even if you don't explicitly set the thread affinity (hard affinity) the scheduler will set one (soft).

    For more details check out the "Inside NT Scheduler" article series there :)

    [1] http://www.sysinternals.com/publ.shtml#scheduler
  • Anonymous
    March 04, 2005
    Concurency is still spelled wrong; it should be concurrency (two 'r's).

    Please delete this message unless you feel it adds some value to your comments.

    But please use a spell checker!
    http://spellbound.sourceforge.net/ (for Firefox).
    http://www.iespell.com (for IE)
    Both are free.

    For us non-native speakers, spelling mistakes can cause bigger problems than normal. Kindly take the time to check.

    But fantastic writings, please keep blogging regularly.
  • Anonymous
    March 05, 2005
    > put a breakpoint in the ntdll!RtlEnterCriticalSection routine

    If you want to catch contention in a debugger (perhaps to look at the call stack) then you should use ntdll!RtlpWaitForCriticalSection instead. Typically, the vast majority of EnterCriticalSection calls don't block and thus don't cause a context switch. It's only the blocking calls that you should be concerned with.

    But if you just want to find out which critical sections are causing context switches, the easiest way is to use the "!locks -v" command in windbg/cdb and look at the ContentionCount values.
  • Anonymous
    March 06, 2005
    On a real SMP system, even if the looping thread bounces from one CPU to another, it can't completely hose CPUs that it's not executing on. If other threads are executing poor code that tries spinlocks for ages then those other threads can hose other CPUs. If the looping thread eats a lot of bandwidth to memory or I/O then other CPUs will be slowed down by it.

    3/4/2005 8:53 AM Larry Osterman

    > start up a single threaded CPU bound task
    > and look at the task manager - both CPUs
    > will be 50% utilized.

    I guess that's because of the number of other threads that also get scheduled for short times? When it's time to resume the hogging thread, the CPU previously used by the hogging thread might be executing something else at that instant?

    On an HT machine (not SMP I know), doing a VB6 compile, one pseudo-CPU is around 70% used and the other is around 30% used, for an average of 51%. The 51% is pretty constant because other threads are nearly idle.