The case of amazingly morphing intrinsic function

I was in a code review earlier this week and someone pointed out the InterlockedOr, unlike the other InterlockedXxx operations, does not return the previous value. I found that hard to believe. I pulled up the MSDN docs and pointed this out

Return Value

InterlockedOr returns the original value stored in the variable pointed to by Destination.

So, we set out to find out which one of us was right. It turns out that both of us are right, it just depends on how you call InterlockedOr! We created the following test application to see what was happening

     #include <windows.h>
    #include <stdio.h>
    
    extern "C"
    LONG
    _InterlockedOr (
        IN OUT LONG volatile *Target,
        IN LONG Set
        );
    #pragma intrinsic(_InterlockedOr)
    
    int _cdecl main(int argc, char *argv[])
    {
        LONG old, cur = 0xF;
    
        old = _InterlockedOr(&cur, 0xF0);
        _InterlockedOr(&cur, 0xF0);
    
        return 0;
    }

And then looked at the assembly that was generated

     >   16:     old = _InterlockedOr(&cur, 0xF0);
    0:000> u
    test!main+0x10 [d:\work\size\main.cpp @ 16]:
    00e611d0 b9f0000000      mov     ecx,0F0h
    00e611d5 8d55f8          lea     edx,[ebp-8]
    00e611d8 8b02            mov     eax,dword ptr [edx]
    00e611da 8bf0            mov     esi,eax
    00e611dc 0bf1            or      esi,ecx
    00e611de f00fb132        lock cmpxchg dword ptr [edx],esi
    00e611e2 75f6            jne     test!main+0x1a (00e611da)
    00e611e4 8945fc          mov     dword ptr [ebp-4],eax
    0:000> p

    >   17:     _InterlockedOr(&cur, 0xF0);
    0:000> u
    test!main+0x27 [d:\work\size\main.cpp @ 17]:
    00e611e7 b9f0000000      mov     ecx,0F0h
    00e611ec 8d55f8          lea     edx,[ebp-8]
    00e611ef f0090a          lock or dword ptr [edx],ecx  

Completely different code is generated if capture the return value vs ignoring it! Wow, I didn't expect the compiler to do that.

When you capture the return value, the InterlockedOr intrinsic generates the standard pattern used to generate an Interlocked operation that is not supported by hardware (capture the old value, interlock compare and exchange for the new value, repeat if the old value from compare and exchange is not the captured old value). When you ignore the return value, a lock or (which does not set eax to the previous value) generated without a loop. If you compare the two implementations, the first implementation (the loop) can be a lot slower than the second both in terms of actually looping and in terms of prediction misses since there is a branch which could be mispredicted. If this API in your hot path, this might be very good information to know and change your usage of it to squeeze a little more cycles out of it. After testing this out, InterlockedAnd also exhibits the same behavior and, going out on a limb, I would assume the other Interlocked intrinsic also generate code the same way depending on if the return value is captured or not.

Comments

  • Anonymous
    November 11, 2006
    They do! I often have to use GCC and I'm trying to make the code as compatible/portable between compilers as possible, so we've had to build our own header with the intrinsics implemented in gcc's own atomic language (which is based of Intel's official specs). Unfortunately there's no way to make them morphing like this, so the slower loop path must always be implemented.

  • Anonymous
    November 11, 2006
    This was actually the subject of a cl 13.x compiler bug: http://kernelmustard.com/2006/08/02/a-definitive-answer-to-the-interlocked-code-generation-questions/ http://blogs.msdn.com/mlacey/archive/2006/08/01/686098.aspx Checking the return value of these functions was broken. Seems to be fixed in 14.x, but a lot of people still build with 3790.1830 or previous.

  • Anonymous
    November 11, 2006
    There's also this one ;-) http://www.osronline.com/showThread.cfm?link=97910 subtle nudge to d

  • Anonymous
    August 10, 2007
    Did you notice that the assembly code was incorrect? The WDK 6001 compiler still generates wrong code. And the error matches erroneous InterlockedOr_Inline, InterlockedAnd_Inline, InterlockedXor_Inline functions. The loop should return before mov eax,[edx], otherwise, if [edx] changes before cmpxchg (causing the branch), the loop will never end. Such mistakes (as in *_Inline functions in WINBASE.H) are easily caught by a code review, which obviously was not done for such an important function. Note that InterlockedAnd64_Inline function was obviously written by a more experienced programmer. It's cleaner and it's correct!

  • Anonymous
    August 10, 2007
    Even though InterlockedAnd64_Inline is correct, the code generated by x64 compiler for _InterlockedAnd64 is incorrect (same error).

  • Anonymous
    August 28, 2007
    alegr, when you get a chance, you should take a look at the documentation for cmpxchg, which states (from the AMD manual): "If the two values are equal the instruction copies the value in the second operand to the first operand and sets the ZF flag in the rFLAGS register to 1.  Otherwise, it copies the value in the first operand to the AL, AX, EAX, or RAX register and clears the ZF flag to 0." In particular, consider how the second sentence in the preceding paragraph applies to the code in question.