What's wrong with this code, part 21, a psychic debugging example
Over the weekend, one of the developers in my group sent me some mail - he was seeing one of the registers in his code getting corrupted across a procedure call. He was quite surprised to see this, and asked me for any suggestions.
With the help of the info he gave me, I was able to figure out what had gone wrong with his code, and I realized that it'd make a great "what's wrong with this code" example.
There are three parts to the code associated with this "what's wrong". The first is an interface definition:
class IPsychicInterface
{
public:
virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};
Next, you have a tiny test application:
int _tmain(int argc, _TCHAR* argv[]){ register int value1 = 1; IPsychicInterface *psychicInterface = GetPsychicInterface(); register int value2 = 2;
psychicInterface->DoSomeOperation(argc, argv);
assert(value1 == 1);
assert(value2 == 2);
return 0;
}
The failure happened when the caller returned from psychicInterface->DoSomeOperation - upon the return, the ESI register, which is supposed to be preserved got trashed. Further debugging showed that the reason that ESI was trashed was that the stack was imbalanced after the call to DoSomeOperation.
There's one more piece of information that I was given that let me immediately realize the root cause of the problem.
I know that if I include that information, what went wrong should be blindingly obvious, so I'm going to be mean and ask you to tell me what the one additional piece of information was. The reason that the other developer in my group didn't find it was simply because he was looking at too much data - if I had pointed out that one piece of additional data, he'd have instantly figured it out too.
So the "answer" to this part of the "What's wrong" problem is "What single additional piece of information was I given that made this problem simple to solve?"
Comments
Anonymous
September 24, 2007
The additional piece of information may be it works if the developer makes the definition 'register int value=1' global. In most of the compilers the stores to local register variables are clobbered because the assemble code in the calling function may use this same register(the availability of the register depends on dataflow analysis of the compiler so local registers are reused) for some other purpose. one suggestion would be to find which registers are saved and restored during a function call smilar to %EBP ,%ESP (which are the local frame and stack pointers), don't store any thing to %EBP,%ESP your stack will be corrupted :)Anonymous
September 24, 2007
Does the calling convention on the implementation of DoSomeOperation() match the interface definition?Anonymous
September 24, 2007
Although slightly off topic, Vamsi's comment reminded me of a Michael Abrash article at http://www.ddj.com/184405848 - "Many people don't realize that under Windows it's safe to use ESP as a general-purpose register because the OS switches to a system stack when fielding interrupts and context switching. (Note, though, that the code is no longer multithread-safe unless thread-local storage is used because ESP must be stored somewhere other than the stack so it can be restored at the end of the routine.)" Health warning - messing around with esp is dangerous, and has to be treated with extreme care - however in certain situations (such as what Michael was facing) it may be worthwhile. It's tempting to try this - does anyone have any experience with this approach? Is the behaviour the same across versions of Windows?Anonymous
September 24, 2007
The comment has been removedAnonymous
September 24, 2007
He told you that the class that implemented the interface was exported from a DLL. That changed the calling convention from __cdecl to __stdcall.Anonymous
September 24, 2007
The comment has been removedAnonymous
September 24, 2007
I'm probably just shooting in the dark on this, but the lack of a prototype for GetPsychicInterface() bothers me. What if this is a 64-bit system, and the compiler defaults the return type to "int"? 'Course that wouldn't be the problem here, 'cause you'd very likely crash on the call to DoSomeOperation--unless you just got lucky, or unlucky depending on your point of view.Anonymous
September 24, 2007
I'd guess that the piece of information is that GetPsychicInterface() is defined in a different DLL than the application. Then you would go on to assume that the app was using wide chars and the DLL was not, or vice versa,Anonymous
September 24, 2007
"What single additional piece of information was I given that made this problem simple to solve?" The threat model? :)Anonymous
September 24, 2007
dewb: DoSomeOperation uses the thiscall calling convention as do all C++ methods.Anonymous
September 24, 2007
Adam: Good answer, unfortunately not right - this isn't a threat modeling problem :). Ben: This all happens in a single executable. It's not a calling convention mismatch, because the calling convention is specified by the fact that this is a pure virtual method on a function. Several people are close to the root cause, but I'll go back to my original question: What was the one piece of information that's necessary to confirm the diagnosis.Anonymous
September 24, 2007
The information that would tip me off is that it only happens in an optimized build or when frame pointer omission is enabled. I'm surprised this didn't expose itself in a debug build, because the enhanced stack checking added in VS2002 will ordinarily flag errors like this.Anonymous
September 24, 2007
I'd guess that the IPsychicInterface implementation returned BOOL rather than bool. (Although this requires a C-style cast too and missing the actual inheritance from the implementation.)Anonymous
September 24, 2007
James, that wouldn't cause a stack overrun (the compiler still returns 4 bytes of data when returning a bool, it's just that only the low 8 bits are used). Phaeron: The problem occurs on both optimized and non-optimized builds. And FPO doesn't change the outcome. There was a debug/retail mismatch (the OS was retail, and the binary was debug), but the problem ocurred even with the retail bits.Anonymous
September 24, 2007
My guess would be that multiple inheritance is used and there's an incorrect cast, which causes DoSomeOperation() to call a different method (with a different signature) than actually intended.Anonymous
September 24, 2007
This is not necessarily the problem here, but using _TCHAR in an interface definition like this can be risky. Even if _tmain() and the object implementation are in the same executable, they could be compiled with differing ANSI/Unicode modes. But again, I don't think that's the problem.Anonymous
September 24, 2007
The name IPsychicInterface multiply self-documents itself as being an interface, so obviously there is multiple inheritance. I'm not sure what piece of information I'd ask for though. I'd ask to see the code of GetPsychicInterface, and the object that GetPsychicInterface gets the interface interface of -- which means I fail my psychics exam.Anonymous
September 24, 2007
My guess is that the missing piece is "It runs fine on x86 but it fails on x64"?Anonymous
September 24, 2007
I'd place a wild guess at the _TCHAR being the root cause: one part of the code being ANSI and the other being Unicode.Anonymous
September 24, 2007
"the stack was imbalanced after the call to DoSomeOperation" is the clue that others have focused on, and I will too. It sounds like a calling convention mismatch. So, the single piece of missing information is "have you had your morning coffee yet?" or "what is the default calling convention". I'd guess that the CC used was __cdecl and the code that implemented the interface was declared as __stdapi. As a side note, the code reads like unmanaged C++ (TCHAR)...if this were managed code it never would have happened.Anonymous
September 24, 2007
DoSomeOperation had an errant inline assembly routine that left the stack imbalanced for certain values of argc/argv :)Anonymous
September 24, 2007
Was the program compiled with ANSI or UNICODE defined? Was the interface compiled with ANSI or UNICODE defined?Anonymous
September 24, 2007
Different Unicode settings resulting in TCHAR #defined as char in one module, and wchar in another?Anonymous
September 25, 2007
You're passing TCHARs to DoSomeOperation. Was this an ANSI build?Anonymous
September 25, 2007
Laura: Actually it failed on x86 as well. Charlie: You're right, but _TCHAR is just what I used in my example - the real version didn't do that. GetPsychicInterface isn't relevant to the problem.Anonymous
September 25, 2007
Mo: _TCHAR won't cause a stack imbalance.Anonymous
September 25, 2007
Most of you are dancing around the right answer (calling convention mismatch). But what's the one piece of information that I needed to exactly diagnose the problem?Anonymous
September 25, 2007
Is it that this is managed C++ and DoSomeOperation calls a .NET delegate improperly?Anonymous
September 25, 2007
My "education" guess is this. "static_cast didn't work so we restored to a "C" style cast". Why? With a thiscall, the stack must be cleaned up by the callee. So since the routines are defined properly in the interface, what would cause the callee to improperly clean the stack? Well, as far as the callee is concerned, the stack was cleaned properly. Why? Because due to the "C" style cast, the wrong object was returned resulting in the wrong vtable being used. This would cause the wrong routine to fail. Sadly, that is the best I can do.Anonymous
September 25, 2007
The comment has been removedAnonymous
September 25, 2007
The comment has been removedAnonymous
September 25, 2007
Is the correct function actually being called?Anonymous
September 25, 2007
The comment has been removedAnonymous
September 25, 2007
(since this moderated I have seen up through nksingh's post and Larry's response. I also looked up thiscall on wikipedia) My guess... The function sig. changed from var args so both the caller and callee cleaned up the stack. The server was not recompiled. I'm still not happy though since one of the early debugging steps is 'rebuild all' and that should be before 'ask Larry'. Anyway that is my guess for today.Anonymous
September 25, 2007
Larry, that cast would matter with multiple inheritance and forward declarations. Thus the part of "static_cast wouldn't work". C casting a between forward declarations would just result in the lhs being the same as the rhs pointer. But with multiple inheritance, any casting of a base class that isn't the first class usually results in a different lhs pointer (very compiler specific). But I can't think of another way the wrong routine would be called. ponder... ponder... ponder... I am assuming the code in question is all being built by the same compiler. C++ isn't very inter-operable between DLLs and different compilers.Anonymous
September 25, 2007
Does IPsychicInterface have operators overloaded that we should know about? -> ? Just a wild guess...Anonymous
September 25, 2007
slightly off topic isn't register deprecated in C? I believe all compilers threat it like normal declaration and will probably not put the variable in register?Anonymous
September 25, 2007
As I mentioned yesterday , one of the other developers in my group had hit a sticky problem, and he askedAnonymous
September 25, 2007
As I mentioned yesterday , one of the other developers in my group had hit a sticky problem, and he askedAnonymous
September 25, 2007
Larry, are the classes involved have multiple inheritance and a virtual base?Anonymous
September 25, 2007
One usually cannot cast from a pointer to derived class to a virtual base. (unless forced by c style cast). the call usually goes via a thunk which would expect 2 arguments (vtable and offset). In this case only one was provided.Anonymous
September 25, 2007
What the file extention was. e.g. .cpp or .cAnonymous
September 25, 2007
.cpp, not that it matters.Anonymous
September 25, 2007
Were you given compiler output that included a warning about inconsistent linkage or something like that?Anonymous
September 25, 2007
Nope, there were no warnings at all. Tomorrow I'll post both the .cpp files associated with the bug.Anonymous
September 25, 2007
Second guess: The one piece of information is “if I change the order of value1/value2/psychicInterface, it works”. (Where “works” is “seems to work”).Anonymous
September 26, 2007
So for the past couple of posts , I've been walking through a psychic debugging experience I had overAnonymous
September 26, 2007
So for the past couple of posts , I've been walking through a psychic debugging experience I had