What's wrong with this code, part 21 - A Psychic Debugging Example - The answers.

So for the past couple of posts, I've been walking through a psychic debugging experience I had over the weekend.

As I presented the problem, there were three pieces of information needed to debug the problem.

An interface:

class IPsychicInterface
{
public:
    virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};

A test application:

int _tmain(int argc, _TCHAR* argv[]){    register int value1 = 1;    IPsychicInterface *psychicInterface = GetPsychicInterface();    register int value2 = 2;

    psychicInterface->DoSomeOperation(argc, argv);
    assert(value1 == 1);
    assert(value2 == 2);
    return 0;
}

and some assembly language code:

0040106B  pop         edi 
0040106C  pop         ebx 
0040106D  mov         al,1
0040106F  pop         esi 
00401070  ret         0Ch 

As I mentioned in my last post, the problem was tracked down to a stack imbalance when calling the DoSomeOperation method, and when I saw the postamble for DoSomeOperation, I quickly realized the answer to the problem.

There are essentially 4 separate calling conventions supported by Microsoft's compilers - it turns out that you can figure out several of them from just looking at the code.  For the stdcall and thiscall calling conventions, input parameters are passed onto the routine, and the callee is responsible for cleaning up the stack (this contrasts with the cdecl calling convention where the caller is responsible for cleaning the stack).  From the postamble, we know that this function is either a stdcall or a thiscall function, since the "ret" instruction adjusts the stack.

I've already stated that this is x86 code, and the RET 0CH indicates that the routine pops off 12 bytes of values off the stack.  This is clearly a problem, because the DoSomeOperation routine only takes two parameters (which would take 8 bytes). The RET 0CH implies that the implementation of DoSomeOperation took 3 parameters!

 

This implies that we're dealing with a violation of the one definition rule (ODR).  The One Definition Rule is a part of the C++ standard (section 3.2) which states: "No translation unit shall contain more than one definition of any variable, function, class type, enumeration type or template.".  In other words, when you declare a function in separate object files, you need to make sure that they all use the same definitions of structures. 

Most commonly ODR violations this show up when you change a header file but don't rebuild all the source files that depend on that file - there's a ton of work that's been done to automatically manage dependencies to avoid this particular issue.

And if you look at the source code for the PsychicInterface logic, you'll see the problem immediately:

class IPsychicInterface {public:    virtual bool DoSomeOperation(int argc, _TCHAR *argv[], _TCHAR *envp[]) = 0;}; bool CPsychicInterface::DoSomeOperation(int argc, _TCHAR *argv[], _TCHAR *envp[]){    int count = argc;    while (count--)    {        printf("%S", argv[count]);    }    return true;}

The PsychicInterface code has it's own private definition of IPsychicInterface which doesn't match the definition in the test application.

Obviously this is an utterly contrived example.  The real problem was much more complicated than this - the violation was in an export from a DLL, and involved external components, which made this more complicated.  In many ways, it was similar to the problem that Raymond talked about here (except in this case, we're in a position to fix the code involved).

Comments

  • Anonymous
    September 26, 2007
    PingBack from http://www.artofbam.com/wordpress/?p=3163

  • Anonymous
    September 26, 2007
    And the answer to the bonus question, why didn't the compiler catch it?" is related to the fact that it is a DLL export?

  • Anonymous
    September 26, 2007
    Doh, forgot about that one.  Thanks Andrew. The reason the compiler didn't catch it was because of ODR - there were two conflicting definitions of the interface. In the real-world example, the problem was that the export from the Dll lost the name decoration that would have caught the problem at link time - it was still an ODR issue, but made more subtle because there were multiple translation units. Raymond's version of the problem had people simply casting away the warning ("Hey, it looks right to me, must be a bug in the compiler").

  • Anonymous
    September 26, 2007
    > The One Definition Rule is a part of the C++ standard (section > 3.2) which states: "No translation unit shall contain more than > one definition of any variable, function, class type, > enumeration type or template.". Now you've scrambled things by adding that red herring.  You quoted a rule which is talking about a single translation unit, which a compiler can very easily diagnose because the compiler translates one translation unit and can easily detect multiple definitions in one translation unit.  This was irrelevant to the user's bug that you've been discussing. > In other words, when you declare a function in separate > object files, you need to make sure that they all use the > same definitions of structures. Those are completely separate words because it's a completely separate concept.  Separate object files, yes.  From separate compilations.  Compilations of separate translation units.  The compiler doesn't get to diagnose this because the compiler doesn't see both translation units at the same time.  This was the user's bug. Since you made two conflicting definitions of the One Definition Rule in one paragraph, it was very easy to diagnose your violation of the one one definition rule definition rule.

  • Anonymous
    September 26, 2007
    > Raymond's version of the problem had people simply casting > away the warning Very sadly, some Microsoft Press books by very famous authors make this exact mistake.  They encourage readers to learn to make the same mistake.  One involves device drivers and one involves COM.  Sigh.

  • Anonymous
    September 27, 2007
    It seems like the One Definition Rule is a more specific version of Don't Repeat Yourself. They repeated themselves when making the DoSomeOperation interface, then changed one copy.

  • Anonymous
    September 27, 2007
    OK, I'm a dunce.  Isn't popping of 12 bytes correct?  8 for the two parameters and 4 more for the this pointer? I thought thiscall was just like stdcall except that the object pointer was passed as an extra parameter.

  • Anonymous
    September 27, 2007
    Adrian: Actually, the thiscall calling convention passes in the "this" pointer in the ECX register.

  • Anonymous
    September 29, 2007
    The comment has been removed

  • Anonymous
    October 01, 2007
    In __fastcall, the callee is also expected to remove any arguments from the stack (as in __stdcall) if there were in fact any stack based arguments.