Careful with that axe, part one: Should I specify a timeout?

(This is part one of a two-part series on the dangers of aborting a thread. Part two is here.)

The other day, six years ago, I was talking a bit about how to decide whether to keep waiting for a bus or to give up and walk. It led to quite an interesting discussion on the old JoS forum. But what if the choice isn’t “wait for a bit, then give up” but rather “wait for a bit, and then take an axe to the thread”? A pattern I occasionally see is something like this: I’ve got a worker thread that I started up, I ask it to shut down, and then I wait for it to do so. If it doesn’t shut down soon, I take an axe to it:

this.running = false;
if (!workerThread.Join(timeout))
    workerThread.Abort();

Is this a good idea?

It depends on just how badly the worker thread behaves and what it is doing when it is misbehaving.

If you can guarantee that the work is short in duration, for whatever 'short' means to you, then you don't need a timeout. If you cannot guarantee that, then I would suggest first rewriting the code so that you can guarantee that; life becomes much easier if you know that the code will terminate quickly when you ask it to.
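
As a rough illustration of what "terminates quickly when you ask it to" might look like, here is a minimal sketch of a cooperative worker. It assumes the work can be split into short units; the class name CooperativeWorker, the method DoOneShortUnitOfWork and the 10 ms sleep standing in for real work are all hypothetical. It uses CancellationTokenSource from .NET 4 in place of the bare running flag.

```csharp
using System;
using System.Threading;

// Hypothetical worker: it does its work in short units and checks for a
// shutdown request between units, so the time needed to stop is bounded
// by the duration of one unit.
class CooperativeWorker
{
    private readonly CancellationTokenSource cts = new CancellationTokenSource();
    private readonly Thread workerThread;
    private int unitsCompleted;

    public CooperativeWorker()
    {
        workerThread = new Thread(Run);
    }

    public int UnitsCompleted => Volatile.Read(ref unitsCompleted);

    public void Start() => workerThread.Start();

    private void Run()
    {
        // The loop exits on its own once a stop has been requested.
        while (!cts.IsCancellationRequested)
        {
            DoOneShortUnitOfWork();
            Interlocked.Increment(ref unitsCompleted);
        }
    }

    // Stand-in for real work; assumed to be short and to always return.
    private static void DoOneShortUnitOfWork()
    {
        Thread.Sleep(10);
    }

    // Requests shutdown and waits for the thread to finish.
    // Returns false if the thread has not finished within the timeout;
    // it deliberately does not abort the thread in that case.
    public bool Stop(TimeSpan timeout)
    {
        cts.Cancel();
        return workerThread.Join(timeout);
    }
}
```

A caller would create the worker, call Start(), and later call Stop(TimeSpan.FromSeconds(1)). If every unit of work really is short, Stop should return true well within the timeout, and the question of what to do when it returns false should never come up.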

If you cannot, then what's the right thing to do? The assumption of this scenario is that the worker is ill-behaved and does not terminate in a timely manner when asked to. So now we've got to ask ourselves "is the scenario that the worker is slow by design, buggy, or hostile?"

In the first option, the worker is simply doing something that takes a long time and, for whatever reason, cannot be interrupted. What's the right thing to do here? I have no idea. This is a terrible situation to be in. Presumably the worker is not shutting down quickly because doing so is dangerous or impossible. In that case, what are you going to do when the timeout times out? You've got something that is dangerous or impossible to shut down, and it's not shutting down in a timely manner. Your choices seem to be:

(1) do nothing
(2) wait longer
(3) do something impossible. Preferably before breakfast.
(4) do something dangerous

Choice one is identical to not waiting at all; if that’s what you’re going to do then why wait in the first place? Choice two just changes the timeout to a different value; this is question begging. By assumption we're not waiting forever. Choice three is impossible. That leaves “do something dangerous”. Which seems… dangerous.

Knowing what the right thing to do in order to minimize harm to user data depends upon the exact circumstances that are causing the danger; analyze it carefully, understand all the scenarios, and figure out the right thing to do. There’s no slam-dunk easy solution here; it will depend entirely on the real code running.

Now suppose the worker is supposed to be able to shut down quickly, but does not because it has a bug. Obviously, if you can, fix the bug. If you cannot fix the bug -- perhaps it is in code you do not own -- then again, you are in a terrible fix. You have to understand what the consequences are of not waiting for already-buggy-and-therefore-unpredictable code to finish before disposing of the resources that you know it is using right now on another thread. And you have to know what the consequences are of terminating a thread while a buggy worker thread is still busy doing heaven only knows what to operating system state.

If the code is hostile and is actively resisting being shut down then you have already lost. You cannot halt the thread by normal means, and you cannot even reliably thread abort it. There is no guarantee whatsoever that aborting a hostile thread actually terminates it; the owner of the hostile code that you have foolishly started running in your process could be doing all of its work in a finally block or other constrained region which prevents thread abort exceptions.

The best thing to do is to never get into this situation in the first place; if you have code that you think is hostile, either do not run it at all, or run it in its own process, and terminate the process, not the thread when things go badly.
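
A minimal sketch of the "own process" approach, under some assumptions: the untrusted code is packaged as a separate executable, and it is acceptable to lose whatever it was doing when it is killed. The names IsolatedRunner and RunWithTimeout are hypothetical; the calls on System.Diagnostics.Process are the standard ones.

```csharp
using System;
using System.Diagnostics;

static class IsolatedRunner
{
    // Runs a program in its own process. Returns true if it exited by
    // itself within the timeout; otherwise kills the whole process and
    // returns false.
    public static bool RunWithTimeout(string fileName, string arguments, TimeSpan timeout)
    {
        var startInfo = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            if (process.WaitForExit((int)timeout.TotalMilliseconds))
                return true;

            try
            {
                // Terminate the process, not just one of its threads.
                process.Kill();
            }
            catch (InvalidOperationException)
            {
                // The process exited between the timeout and the Kill call.
            }

            process.WaitForExit();
            return false;
        }
    }
}
```

When the process goes away, the operating system reclaims its memory and handles, which is what makes this axe less dangerous than the thread-sized one. It does nothing for state the process has already written elsewhere, such as a half-written file.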

In short, there's no good answer to the question "what do I do if it takes too long?" You are in a terrible situation if that happens and there is no easy answer. Best to work hard to ensure you don't get into it in the first place; only run cooperative, benign, safe code that always shuts itself down cleanly and rapidly when asked. Careful with that axe, Eugene.

Next time, what about exceptions?

Comments

  • Anonymous
    February 21, 2010
    I never realized that a finally block prohibited thread abort exceptions. I think I understand the reasoning behind it (don't die without cleaning up), but is malicious C# really as easy as try {    throw new Exception("MWAH HA HA!!"); } finally {   MaliciousFunction(); } If so, what mechanisms can terminate a malicious .Net process that is executing in a finally block?  What would happen in this situation if MaliciousFunction() ran unmanaged code?

  • Anonymous
    February 22, 2010
    I would add one extra option to that list: (5) ask another entity (a human user, perhaps)

  • Anonymous
    February 22, 2010
    Some commandments of multi-threading. All threads shall terminate quickly when asked to. Thou shall only use one primitive for multi-threading. Thou shall treat every thread as if it is a feature. Thou shall not share data incautiously. Thou shall not thrash the L2 cache.

  • Anonymous
    February 22, 2010
    I really hate this situation. This is often the case when working with someone else's interface. You call into some interface whose implementation you have no control over, and it decides to never return. Ugh. I wish some developers would be honest with their naming. Instead of calling it FetchValue(...), they should call it FetchValue_OrHangForeverWithoutGivingTheDeveloperAnyWayToCancelThisStupidCall(...).

  • Anonymous
    February 22, 2010
    Was that a Hitch-hiker's reference in #3?

  • Anonymous
    February 22, 2010
    Mike: no, Alice in Wonderland.

  • Anonymous
    February 22, 2010
    "Presumably the worker is not shutting down quickly because doing so is dangerous or impossible." That's an unwarranted assumption.  The worker might not be shutting down simply because it's doing some long-running computation, and the worker code wasn't peppered with frequent checks for cancellation requests.  Moreover, doing so makes the worker code harder to read, is rather laborious, and possibly very difficult since you need to balance performance with quick cancellation.  Thread.Abort is the only practical option here, and it's also safe if the code is working only on managed memory with no outside dependencies. However, I hear F# (or was it the new parallel library in .NET 4?) has a nice trick where parallelized loops do automatic cancellation checks at each iteration, so you don't have to manually add those checks to the worker code.  That should take care of some cases that are amenable to this kind of parallelization.

  • Anonymous
    February 23, 2010
    Funny, I just got back from The Australian Pink Floyd. They didn't play that, but it was very good! What about threads that themselves use timeouts? I use worker threads to monitor devices over the internet. The worker threads send commands with timeouts over sockets. When stopping a worker thread I wait for the longest timeout, but sometimes the thread does not respond in a timely fashion. So I'm using the axe. I didn't find another option.

  • Anonymous
    February 25, 2010
    The great thing about managed code is that you can safely take the axe to any thread that doesn't manipulate shared state. Back in the olden days terminating a thread leaked whatever memory it had allocated, leaving files locked, sockets open, and so forth, meaning you couldn't rely on axing threads as a strategy, so you had to pepper threads with stopping points. Managed code lets you avoid that, so it is reasonable to use thread termination as a way of dealing with things like users canceling a long-running calculation.

  • Anonymous
    February 26, 2010
    I have a similar situation here at work. I'm designing a system where we make a "cross-environment method call", which is, of course, asynchronous. The caller, A, delegates this task to my process orchestrator, B, telling B that a process § should be completed (or, B(§)). Now, since the process § is parameter-based and can call as many different programs as the user wants, I have no clue how long § should take to complete. In fact, any program P can enter an infinite loop. Now, since A is making a cross-environment call, it deals with atomicity using compensation methods. So if the call to B(§) fails or times out, compensation should be done for every step up to B(§), but not for B(§) itself, since it failed. But let's say that what really takes long in B(§) is a database commit. The commit is the last step B executes, and B can no longer expire, even if the timeout has already occurred. By the time B finishes committing, A has already begun compensating other steps in its own process. This is a liability and, right now, I can only think of dealing with it by fine-tuning the timeouts... which is no good at all.