Debugging a crash – An example

Debugging and troubleshooting applications is a major part of my job, but these days that work rarely involves a debugger like WinDBG. I enjoy working with debuggers, so when I get time I read books and work through sample code to brush up my debugging skills. I have read the book Advanced Windows Debugging a couple of times, and I recently picked one of its examples to debug a crash. The book takes a slightly different approach to the example problem: it gives us the liberty of source code plus private symbols. Unfortunately, in support, a memory dump gives us only the state of the process at the time of the crash, and we usually do not have access to third-party source and symbols. So why not blog about how I worked out the cause of the crash from a support perspective?

The sample is plain and simple. The program takes a single string argument. I am instructed to run it under a debugger (WinDBG) and pass a long string as the command-line argument, which should crash the application. The task is to determine why the program crashed.

The moment the program starts running, it breaks into the debugger upon hitting the initial breakpoint. Thereafter, I just type ‘g’ and hit enter to let the program run. As indicated, the program immediately crashes due to an Access Violation and breaks into the debugger.

0:000> g
(114c.19b4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=000cfefc ebx=00000000 ecx=000cff3a edx=00502a36 esi=00000001 edi=0100367c
eip=010012a7 esp=000cff44 ebp=000c0000 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
05Program!wmain+0x17:
010012a7 8b550c mov edx,dword ptr [ebp+0Ch] ss:002b:000c000c=????????

From this, I can tell a few things about why we crashed. The faulting instruction tried to read a DWORD from the address formed by adding an offset of 0Ch to the value in the EBP register, so that it could store that value in the EDX register. The read failed because the resulting address, 000c000c, is not accessible (the debugger shows its contents as ????????).
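As a quick sanity check, the faulting address can be recomputed from the register dump. This is only a small sketch: the helper name effective_address is made up for illustration, and the values are copied from the exception output above.

```python
# Register value copied from the exception output above.
EBP = 0x000C0000
DISPLACEMENT = 0x0C  # the "+0Ch" in: mov edx,dword ptr [ebp+0Ch]


def effective_address(base: int, displacement: int) -> int:
    """Return the 32-bit address that an x86 [base+disp] operand refers to."""
    return (base + displacement) & 0xFFFFFFFF


fault_address = effective_address(EBP, DISPLACEMENT)
print(f"{fault_address:08x}")  # the address the debugger reports as ????????
```

The result matches the ss:002b:000c000c operand shown next to the faulting instruction.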

OK, so why is this address bad? What is the type of address we attempted to de-reference?

0:000> !address 000c000c
TEB 7efdd000 in range 7efdb000 7efde000
ProcessParametrs 00251788 in range 00250000 00256000
Environment 00250810 in range 00250000 00256000
00090000 : 00090000 - 0003c000
Type 00020000 MEM_PRIVATE
State 00002000 MEM_RESERVE
Usage RegionUsageStack
Pid.Tid 114c.19b4
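One way to read the region line of this output is sketched below. It assumes the line "00090000 : 00090000 - 0003c000" means allocation base, region base and region size, which is how I understand this older !address format; treat that reading as an assumption. The helper in_region is a hypothetical name.

```python
# Values copied from the !address output and the register dump above.
REGION_BASE = 0x00090000    # assumed meaning: start of the reported region
REGION_SIZE = 0x0003C000    # assumed meaning: size of the reported region
FAULT_ADDRESS = 0x000C000C  # ebp+0Ch, the address we failed to read
ESP = 0x000CFF44            # current stack pointer


def in_region(address: int, base: int, size: int) -> bool:
    """True if address lies inside the half-open range [base, base + size)."""
    return base <= address < base + size


region_end = REGION_BASE + REGION_SIZE
print(f"region: {REGION_BASE:08x} - {region_end:08x}")
print("fault address inside region:", in_region(FAULT_ADDRESS, REGION_BASE, REGION_SIZE))
print("esp inside region:", in_region(ESP, REGION_BASE, REGION_SIZE))
```

Under that reading, the faulting address sits inside the MEM_RESERVE part of the stack allocation, while the live stack pointer sits above it.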

Alright, so we know this address falls in the range used for the stack of the thread with the id 19b4. Note that the state of the region is MEM_RESERVE, that is, reserved but not committed. If I list the threads in this process, here is what I get.

0:000> ~
. 0 Id: 114c.19b4 Suspend: 1 Teb: 7efdd000 Unfrozen

So there is only one thread in this process and this address is on the stack space allocated for the crashing thread. Is this address valid?

0:000> dd 000c000c l4
000c000c ???????? ???????? ???????? ????????

Obviously not! The key thing to note here is that this invalid address is an offset from the EBP register, which our source code never manipulates directly; the compiler generates the code that maintains it. However, it is a key register in managing access to parameters and local variables for a call frame. An understanding of how the stack works will help debug this better. Here’s a diagram.

[Figures: “Stack Layout” and “Overflow into saved EBP”]

Thus, it seems that we are crashing due to a buffer overrun that corrupts the stack, specifically the saved EBP value. Let’s begin by looking at the call stack:

0:000> kv
ChildEBP RetAddr Args to Child
000cff44 01001467 00000002 00502940 00503ab0 05Program!wmain+0x17 (FPO: [Non-Fpo]) (CONV: cdecl)
000cff88 76fa3677 7efde000 000cffd4 77ea9d72 05Program!__wmainCRTStartup+0x102 (FPO: [Non-Fpo]) (CONV: cdecl)
000cff94 77ea9d72 7efde000 7697df09 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo]) (CONV: fastcall)
000cffd4 77ea9d45 010015a5 7efde000 00000000 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo]) (CONV: stdcall)
000cffec 00000000 010015a5 7efde000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo]) (CONV: stdcall)

Now, we need to go back in time to prove our theory. Where do we begin? One approach is to disassemble backwards from the current failure spot to find the code we executed just before we crashed. Thus:

0:000> ub . l5
05Program!wmain+0x9 :
01001299 7523 jne 05Program!wmain+0x2e (010012be)
0100129b 8b450c mov eax,dword ptr [ebp+0Ch]
0100129e 8b4804 mov ecx,dword ptr [eax+4]
010012a1 51 push ecx
010012a2 e839000000 call 05Program!HelperFunction (010012e0)

ub means unassemble backwards and the period (.) means from the current EIP value. l5 is the length: unassemble 5 instructions. So the above command lists the 5 instructions that precede the current value of the EIP register. The last one, the call to HelperFunction, is the code that we executed just before we failed.

Similarly, u means unassemble forward from the current value of EIP. In the listing below, the first instruction (at 010012a7) is the one that failed:

0:000> u . l3
05Program!wmain+0x17 :
010012a7 8b550c mov edx,dword ptr [ebp+0Ch]
010012aa 8b4204 mov eax,dword ptr [edx+4]
010012ad 50 push eax

Alright, so now we know that just before we failed, we called another function named HelperFunction at address 010012e0. Three instructions before that call (at 0100129b), wmain reads a value at ebp+0Ch. We didn’t crash there, so we can be sure that the EBP register had the correct value when executing that instruction. Further, when a function that uses a frame pointer is called, the call instruction saves the return address and the function’s prologue saves the caller’s EBP, so that both can be restored upon returning. In this case, HelperFunction saves the EBP of wmain. Thus it stands to reason that while executing HelperFunction we somehow corrupted the stack and overwrote the saved EBP value. That results in an access violation once we are back in wmain, because the EBP register now points to a location that is not correct for wmain.

Next, we investigate further by disassembling the whole of HelperFunction using the uf command:

0:000> uf 010012e0
05Program!HelperFunction :
010012e0 8bff mov edi,edi
010012e2 55 push ebp
010012e3 8bec mov ebp,esp
010012e5 83ec3c sub esp,3Ch
010012e8 8b4508 mov eax,dword ptr [ebp+8]
010012eb 50 push eax
010012ec 8d4dc4 lea ecx,[ebp-3Ch]
010012ef 51 push ecx
010012f0 ff1574100001 call dword ptr [05Program!_imp__wcscpy (01001074)]
010012f6 83c408 add esp,8
010012f9 8be5 mov esp,ebp
010012fb 5d pop ebp
010012fc c20400 ret 4

Alright, so now we can see a few more things.

  • First we see the standard function prologue: save the current EBP (which belongs to wmain) and copy the stack pointer into EBP.
  • Allocate space for local variables. In this case we are allocating 60 bytes (3C hex, from sub esp,3Ch).
  • Then we load the first parameter passed to HelperFunction (the address of the source string) into the EAX register.
  • Then push that address on the stack.
  • LEA stands for load effective address. We load the starting address of the space allocated for the local variable into the ECX register and push it on the stack.
  • We then call the imported function wcscpy.

The _imp__ prefix means the call goes through the import table into another module; in this case, the C runtime library. The function copies a string from a source to a destination through pointers. See the MSDN documentation. So essentially, HelperFunction passes pointers to the source and the destination to the wcscpy function.

So now our theory is getting a bit stronger. The wcscpy function copies WCHAR (wide-character Unicode) strings, which are terminated with a NULL character. Further, from experience, a developer may know that wcscpy does not check bounds. We know that the space allocated on the stack was 60 bytes, which means the code declared a WCHAR string 30 characters long. Why? Since the API works on WCHAR types, each character needs 2 bytes, so 30 characters take 60 bytes.
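The size arithmetic can be sketched as follows. It assumes a 2-byte WCHAR, as on Windows, and input limited to characters that occupy a single UTF-16 code unit. The function name bytes_written_by_wcscpy is hypothetical and only models how many bytes an unchecked wide-string copy writes, including the terminator.

```python
WCHAR_SIZE = 2       # bytes per WCHAR on Windows
BUFFER_BYTES = 0x3C  # from "sub esp,3Ch" in HelperFunction

capacity_chars = BUFFER_BYTES // WCHAR_SIZE  # WCHAR slots in the local buffer
max_text_chars = capacity_chars - 1          # one slot is needed for the terminator


def bytes_written_by_wcscpy(text: str) -> int:
    """Bytes an unchecked wide copy writes: every character plus a 2-byte NUL.

    Assumes each character fits in one UTF-16 code unit.
    """
    return (len(text) + 1) * WCHAR_SIZE


for sample in ("x" * 29, "x" * 30):  # made-up inputs, not the real argument
    written = bytes_written_by_wcscpy(sample)
    overflow = max(0, written - BUFFER_BYTES)
    print(len(sample), "chars ->", written, "bytes written, overflow:", overflow)
```

A 29-character string fills the buffer exactly once the terminator is counted; anything longer writes past the end.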

So, if we end up writing more than 60 bytes, we will overflow into the saved EBP of the wmain function, and perhaps the return address and more, depending on how large the copied string is. Now our goal is to find out the length of the offending string and how much it has overflowed. Unfortunately, we are past the execution of HelperFunction, where all of this happened, and we have a corrupt value in the EBP register. The stack has technically “unwound” since we returned from HelperFunction. That means the stack pointer has moved back, but the values should still be in memory. So we can poke around a bit to get to that string value.

The problem is to find the starting address on the stack where the allocation for this string begins. To do that, we need to visualize everything that “existed” on the stack before we ended up in the current state. So what gets stored on the stack?

  • Parameters
  • Return Address
  • Saved EBP
  • Local Variables

Given this fact, we can reconstruct the stack to its previous state by examining the code in HelperFunction. A good knowledge of assembly language is a plus when doing this. I don’t know much assembly. I just use my favourite search engine (Bing) to figure out what each instruction does, and I also refer to the Intel x86 instruction set reference (available on the Intel site) that my colleagues shared.

To reconstruct the stack, we must keep in mind the current state of the stack and understand the execution path. As I mentioned before, we are back in the wmain function with a bad EBP register. Where is ESP register pointing to? At the top of the stack.

If we execute a “dps esp” command now, we can see this…

0:000> dps @esp
000cff44 000cff88
000cff48 01001467 05Program!__wmainCRTStartup+0x102

… which means the stack pointer is pointing to a saved EBP. In this case, it is the one for the __wmainCRTStartup function. So our stack pointer is still intact. If we look back at the disassembly for HelperFunction, the last few instructions are as follows:

010012f0 ff1574100001 call dword ptr [05Program!_imp__wcscpy (01001074)]
010012f6 83c408 add esp,8
010012f9 8be5 mov esp,ebp
010012fb 5d pop ebp
010012fc c20400 ret 4

So what does this do? After we return from wcscpy, we clean up the two parameters passed to it (add esp,8). It is also important to remember that wcscpy preserves EBP, so on return the EBP register again points to the correct location for HelperFunction. We then copy the value in the EBP register to the ESP register. Why? Before going back to wmain, we need to restore its EBP. This instruction moves the stack pointer to the slot holding the saved EBP so that the next instruction (pop ebp) can copy the saved EBP value for wmain (now corrupt) from the stack into the EBP register. The pop instruction also moves the stack pointer by 4 bytes. Thus, the stack pointer now points to the return address within wmain.

Returning from HelperFunction: This is the last step. The RET instruction pops the return address from the stack into the EIP register, which moves the stack pointer by 4 bytes, and the operand 4 moves it another 4 bytes to discard the parameter that wmain pushed. This gives control back to the calling function. Only the stack pointer and the instruction pointer are modified by a subroutine return.
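The epilogue can be traced with a toy model of the three instructions. This is a sketch, not an emulator: the stack values are taken from the dps listing of the stack later in this post, HelperFunction's frame pointer (000cff38) is inferred from the position of the Saved EBP slot in that listing, and the function name run_epilogue is made up.

```python
# Stack slots around HelperFunction's frame (address -> DWORD value).
stack = {
    0x000CFF38: 0x000C0000,  # saved EBP of wmain, already corrupted
    0x000CFF3C: 0x010012A7,  # return address: 05Program!wmain+0x17
    0x000CFF40: 0x005029F8,  # the parameter pushed by wmain
}


def run_epilogue(regs: dict, memory: dict) -> dict:
    """Apply 'mov esp,ebp', 'pop ebp' and 'ret 4' to a copy of regs."""
    regs = dict(regs)
    regs["esp"] = regs["ebp"]          # mov esp,ebp: drop the locals
    regs["ebp"] = memory[regs["esp"]]  # pop ebp: load the saved EBP ...
    regs["esp"] += 4                   # ... and step over that slot
    regs["eip"] = memory[regs["esp"]]  # ret: pop the return address ...
    regs["esp"] += 4
    regs["esp"] += 4                   # ... and "4" discards the parameter
    return regs


# State just after 'add esp,8': EBP is HelperFunction's frame pointer.
before = {"ebp": 0x000CFF38, "esp": 0x000CFEFC, "eip": 0x010012F9}
after = run_epilogue(before, stack)
print({name: f"{value:08x}" for name, value in after.items()})
```

The resulting EBP, ESP and EIP agree with the register dump at the time of the access violation.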

Now that we are back in wmain, the next instruction (mov edx,dword ptr [ebp+0Ch]) is executed, but it fails because reading from ebp+0Ch with the corrupt EBP results in an access violation.

So to reconstruct the stack and get to the local variable that overflowed, we factor in the following:

  • Parameter passed to the HelperFunction (4 bytes)
  • The space required for saving of return address + EBP for the wmain method (4 bytes each)
  • The local variable of 60 bytes.

All of this adds up to 72 bytes.

NOTE: I am not factoring in what happens on the stack when the 05Program!_imp__wcscpy function executes, since the local variables and register values saved by that function live at lower addresses than the area we need to look up. Further, that function releases its own stack space before returning. Under the CDECL calling convention, which wcscpy uses, the caller is responsible for removing the parameters from the stack; that is the add esp,8 after the call.

So now, for the math. Since the stack grows toward lower addresses, we need to subtract 72 bytes from the current stack pointer.

0:000> ?esp-0n72
Evaluate expression: 851708 = 000cfefc
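The same arithmetic can be checked outside the debugger. The sketch below simply adds up the items listed above and subtracts the total from the ESP value in the register dump; the dictionary keys are descriptive labels, not names from the program.

```python
ESP = 0x000CFF44  # stack pointer at the time of the crash

# Everything that sat between the buffer and the current stack pointer.
frame_items = {
    "parameter to HelperFunction": 4,
    "return address into wmain": 4,
    "saved EBP of wmain": 4,
    "local buffer (sub esp,3Ch)": 0x3C,
}

total_bytes = sum(frame_items.values())
buffer_start = ESP - total_bytes

print(total_bytes)                              # bytes to step back
print(f"{buffer_start} = {buffer_start:08x}")   # decimal and hex, like '?'
```

Both values agree with what the ? expression evaluator printed.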

So now, we can use the dps command to dump the DWORDs and symbols as follows:

0:000> dps 000cfefc
000cfefc 00520041 //
000cff00 00610065 //
000cff04 006c006c //
000cff08 004c0079 //
000cff0c 00720061 //
000cff10 00650067 //
000cff14 00420044 //
000cff18 006f0043 // local variables
000cff1c 006e006e //
000cff20 00630065 //
000cff24 00690074 //
000cff28 006e006f //
000cff2c 00740053 //
000cff30 00690072 //
000cff34 0067006e //
000cff38 000c0000 //Saved EBP
000cff3c 010012a7 05Program!wmain+0x17 //return address
000cff40 005029f8
000cff44 000cff88
000cff48 01001467 05Program!__wmainCRTStartup+0x102
000cff4c 00000002
000cff50 00502940
000cff54 00503ab0
000cff58 98f01d46
000cff5c 00000000
000cff60 00000000
000cff64 7efde000
000cff68 000cff74
000cff6c 00000000
000cff70 000cff58
000cff74 8a754a32
000cff78 000cffc4

The values at the start of this range look like text characters represented as hexadecimal values. For example, if we take the value 00520041 and split it from right to left into groups of 4 hexadecimal digits (16 bits), we get 0041 and 0052. We can then run the .formats command:

0:000> .formats 0041
Evaluate expression:
Hex: 00000041
Decimal: 65
Octal: 00000000101
Binary: 00000000 00000000 00000000 01000001
Chars: ...A
Time: Thu Jan 01 05:31:05 1970
Float: low 9.10844e-044 high 0
Double: 3.21143e-322

0:000> .formats 0052
Evaluate expression:
Hex: 00000052
Decimal: 82
Octal: 00000000122
Binary: 00000000 00000000 00000000 01010010
Chars: ...R
Time: Thu Jan 01 05:31:22 1970
Float: low 1.14906e-043 high 0
Double: 4.05134e-322

From the Chars lines above, we can see that the first 2 values are characters as we thought, since it was a string copy operation. So let’s dump the entire string. We can use the du command since it is a Unicode string:

0:000> du 000cfefc
000cfefc "AReallyLargeDBConnectionString"
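For readers who want to see what du does with those DWORDs, here is a small sketch that packs the values from the dps listing back into little-endian bytes and decodes them as UTF-16. It assumes the listing is complete and in address order; the variable names are mine.

```python
import struct

# The first 16 DWORDs from the dps listing, in address order from 000cfefc.
DWORDS = [
    0x00520041, 0x00610065, 0x006C006C, 0x004C0079,
    0x00720061, 0x00650067, 0x00420044, 0x006F0043,
    0x006E006E, 0x00630065, 0x00690074, 0x006E006F,
    0x00740053, 0x00690072, 0x0067006E,
    0x000C0000,  # the slot annotated as Saved EBP
]

# x86 stores each DWORD little-endian, so the low 16 bits come first in memory.
raw = struct.pack(f"<{len(DWORDS)}I", *DWORDS)

# Decode as UTF-16LE and stop at the first NUL, as a wide-string dump would.
decoded = raw.decode("utf-16-le")
text = decoded.split("\x00", 1)[0]

print(text, len(text))
```

Note that the NUL that ends the string is found in the 16th DWORD, not in the first 15.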

Alright, so now we have the offending string. It is 30 characters long. We allocated 60 bytes, and since this is a WCHAR string where each character needs 2 bytes, the 30 characters appear to fit. So then why are we access violating? Let’s look at the dps command output again, specifically the following lines.

000cff34 0067006e
000cff38 000c0000

The first line holds the last 2 characters of the string, “ng”. The low-order 2 bytes (0000) of the second line are where we terminate the string. So we overflow into the saved EBP slot on the stack by 2 bytes when we terminate the string, thus corrupting the saved EBP value. When we return from the function, this value is loaded into the EBP register. So now when we try to access locals and parameters of the wmain function, we end up looking in the wrong memory location, since those accesses are relative to the address in the EBP register. In short, we have a stack corruption caused by a buffer overrun, and that is causing this crash.
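The overrun can be reproduced with a toy model of the frame: a 60-byte buffer followed by the saved EBP and the return address. This is a sketch under assumptions: the original saved EBP is taken to be 000cff44 (the ChildEBP that kv reports for wmain), and unchecked_copy is a made-up stand-in for the behaviour of wcscpy, not its implementation.

```python
import struct

BUFFER_BYTES = 0x3C          # local buffer in HelperFunction
SAVED_EBP = 0x000CFF44       # assumed: wmain's frame pointer before the copy
RETURN_ADDRESS = 0x010012A7  # 05Program!wmain+0x17


def unchecked_copy(frame: bytearray, text: str) -> int:
    """Write text as UTF-16LE plus a 2-byte NUL at offset 0, without bounds checks."""
    data = text.encode("utf-16-le") + b"\x00\x00"
    frame[0:len(data)] = data
    return len(data)


# Layout from low to high addresses: buffer, saved EBP, return address.
frame = bytearray(BUFFER_BYTES) + struct.pack("<II", SAVED_EBP, RETURN_ADDRESS)

written = unchecked_copy(frame, "AReallyLargeDBConnectionString")
saved_ebp_after, return_after = struct.unpack_from("<II", frame, BUFFER_BYTES)

print(written, "bytes written into a", BUFFER_BYTES, "byte buffer")
print(f"saved EBP: {SAVED_EBP:08x} -> {saved_ebp_after:08x}")
print(f"return address: {return_after:08x}")
```

Only the terminator spills over, which is why the return address survives and the program gets back into wmain before it crashes.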

SOLUTION: Most debugging books mention this: use only as much as you allocate, and always check that what you are writing fits into the allocated space. Since we allocated space for 30 characters, we can write only 29 characters and must leave one character (2 bytes) for the string terminator. The developer must validate the input string length to ensure it fits in the space allocated, or report an error to handle the situation gracefully and avoid stack corruption. Even better, follow the recommendation/security note from the MSDN documentation for this function, which is:

Security Note Because strcpy does not check for sufficient space in strDestination before copying strSource, it is a potential cause of buffer overruns. Consider using strncpy instead.

Comments

  • Anonymous
    April 10, 2011
    Good walk through Sudeep... The recommended practice is to use the _s versions of the functions which are secure. In your case wcscpy_s msdn.microsoft.com/.../td1esda9.aspx