Exchange 2010: AD Recipient Cache and Locking Threads in EdgeTransport - Or How We Figured Out Something Wasn't Right With Mail Stuck in Submission Queue
A while ago we had an issue wherein SCOM fired an alert for the message queue length of the Submission queue. When we went to take a look, the messages were stuck in Submission and weren't really going anywhere. We took a dump of EdgeTransport.exe, paused the Transport service, and the queues eventually flushed.
Opened WinDbg against the dump, and here's what was going on: every time a flood of email came in, Transport would see that the local recipient cache for a group was out of date. Threads would spin up (typically one per message) to update the recipient cache. Since this is an AD call, a lock is used. The lock is owned by the thread until it's released, which means we have to wait for the LDAP call to complete before the next thread can enter the critical section. Repeat ad infinitum. What complicated matters is that the refresh was for the same group of sixteen-thousand-plus members, every time.
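To make the shape of that pattern concrete, here's a minimal C# sketch of a membership cache that refreshes under a lock. The class and method names are hypothetical (this is not Exchange's actual AutoRefreshCache code); it only illustrates why every thread that notices the stale cache ends up queued behind one slow directory call:

using System;
using System.Collections.Generic;

// Hypothetical illustration of the pattern, not the real AutoRefreshCache.
class GroupMembershipCache
{
    private readonly object _syncRoot = new object();
    private readonly Dictionary<Guid, HashSet<Guid>> _membersByGroup =
        new Dictionary<Guid, HashSet<Guid>>();
    private DateTime _lastRefreshUtc = DateTime.MinValue;

    public bool IsMemberOf(Guid groupId, Guid recipientId)
    {
        // Every thread that finds the cache stale tries to refresh it itself.
        if (DateTime.UtcNow - _lastRefreshUtc > TimeSpan.FromMinutes(5))
        {
            lock (_syncRoot) // a contended Monitor shows up as AwareLock::Contention in the dump
            {
                // The expensive LDAP query runs while the lock is held, so every
                // other thread that hit the same stale check queues up right here.
                _membersByGroup[groupId] = QueryGroupMembersFromAd(groupId); // 16,000+ entries
                _lastRefreshUtc = DateTime.UtcNow;
            }
        }

        lock (_syncRoot)
        {
            HashSet<Guid> members;
            return _membersByGroup.TryGetValue(groupId, out members)
                && members.Contains(recipientId);
        }
    }

    private static HashSet<Guid> QueryGroupMembersFromAd(Guid groupId)
    {
        // Stand-in for the paged (&(memberOf=<GUID=...>)(!(msExchCU=*))) search shown later.
        return new HashSet<Guid>();
    }
}

With the cache shaped like this, the lock serializes the refreshes, but it does nothing to stop the next thread from re-running the exact same refresh for the same group once it finally acquires the lock, which matches what the dump shows below.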
In the dump, this is how I went down the rabbit hole. First, I checked for any locks that might be blocking threads in the process:
!locks
CritSec Wldap32!SelectLock1+0 at 000007fefdd5ada0
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 2a18
EntryCount 0
ContentionCount 2
*** Locked
I then checked another thread to see whether it was aware of the contention (it was):
00 00000000`2b8fd4f8 000007fe`fd991430 ntdll!ZwWaitForMultipleObjects+0xa
01 00000000`2b8fd500 00000000`77731723 KERNELBASE!WaitForMultipleObjectsEx+0xe8
02 00000000`2b8fd600 000007fe`f830d0a5 kernel32!WaitForMultipleObjectsExImplementation+0xb3
03 00000000`2b8fd690 000007fe`f8311aa9 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
04 00000000`2b8fd730 000007fe`f8344e95 mscorwks!Thread::DoAppropriateAptStateWait+0x41
05 00000000`2b8fd790 000007fe`f8427d94 mscorwks!Thread::DoAppropriateWaitWorker+0x191
06 00000000`2b8fd890 000007fe`f83ee48a mscorwks!Thread::DoAppropriateWait+0x5c
07 00000000`2b8fd900 000007fe`f82f077d mscorwks!CLREvent::WaitEx+0xbe
08 00000000`2b8fd9b0 000007fe`f843e6ea mscorwks!AwareLock::EnterEpilog+0xc9
09 00000000`2b8fda80 000007fe`f83336db mscorwks!AwareLock::Enter+0x72
0a 00000000`2b8fdab0 000007fe`f88b308f mscorwks!AwareLock::Contention+0x1fb
0b 00000000`2b8fdb80 000007fe`f2ff2beb mscorwks!JITutil_MonContention+0xdf
0c 00000000`2b8fdd30 000007fe`f2ff318f Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].CreateAndCache(System.__Canon, System.__Canon, Boolean)+0x16cb
0d 00000000`2b8fddf0 000007fe`f324dd16 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].GetValue(System.__Canon, System.__Canon)+0x192f
0e 00000000`2b8fdeb0 000007fe`f324e2a0 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.IsMemberOfResolver`1[[System.__Canon, mscorlib]].IsMemberOf(Microsoft.Exchange.Data.Directory.Recipient.ADRecipientSession, System.Guid, System.Guid)+0x25b6a6
0f 00000000`2b8fe090 000007fe`f0940c15 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.IsMemberOfResolver`1[[System.__Canon, mscorlib]].IsMemberOf(Microsoft.Exchange.Data.Directory.Recipient.ADRecipientSession, Microsoft.Exchange.Data.Directory.ADObjectId, System.__Canon)+0x25bfd0
10 00000000`2b8fe180 000007fe`e0fc7267 Microsoft_Exchange_Transport_ni!Microsoft.Exchange.Transport.RecipientAPI.AddressBookImpl.IsMemberOf(Microsoft.Exchange.Data.Transport.RoutingAddress, Microsoft.Exchange.Data.Transport.RoutingAddress)+0x135
11 00000000`2b8fe200 000007fe`f245c46a Microsoft_Exchange_MessagingPolicies_Rules_ni!Microsoft.Exchange.MessagingPolicies.Rules.MembershipChecker.Equals(System.String, System.String)+0x47
12 00000000`2b8fe250 000007fe`e0fd5584 Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.RuleUtils.CompareStringValues(System.Object, System.Object, Microsoft.Exchange.MessagingPolicies.Rules.IStringComparer)+0x7a
13 00000000`2b8fe2a0 000007fe`f2459efe Microsoft_Exchange_MessagingPolicies_Rules_ni!Microsoft.Exchange.MessagingPolicies.Rules.IsMemberOfPredicate.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x74
14 00000000`2b8fe2e0 000007fe`f2459efe Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.OrCondition.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x6e
15 00000000`2b8fe360 000007fe`f246a157 Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.OrCondition.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x6e
16 00000000`2b8fe3e0 000007fe`e0fd839f Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluator.Run()+0x1b7
O.k., so if we're updating the cache, how many threads are doing it? Oh, I see…:
0:086> !findstack Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3 0
Thread 023, 2 frame(s) match
Thread 024, 2 frame(s) match
Thread 025, 2 frame(s) match
Thread 034, 2 frame(s) match
Thread 035, 2 frame(s) match
Thread 036, 2 frame(s) match
Thread 037, 2 frame(s) match
Thread 038, 2 frame(s) match
Thread 039, 2 frame(s) match
Thread 040, 2 frame(s) match
Thread 050, 2 frame(s) match
Thread 051, 2 frame(s) match
Thread 052, 2 frame(s) match
Thread 053, 2 frame(s) match
Thread 054, 2 frame(s) match
Thread 055, 2 frame(s) match
Thread 056, 2 frame(s) match
Thread 057, 2 frame(s) match
Thread 058, 2 frame(s) match
Thread 059, 2 frame(s) match
Thread 060, 2 frame(s) match
Thread 061, 2 frame(s) match
Thread 062, 2 frame(s) match
Thread 063, 2 frame(s) match
Thread 064, 2 frame(s) match
Thread 065, 2 frame(s) match
Thread 066, 2 frame(s) match
Thread 067, 2 frame(s) match
Thread 068, 2 frame(s) match
Thread 069, 2 frame(s) match
Thread 070, 2 frame(s) match
Thread 071, 2 frame(s) match
Thread 072, 2 frame(s) match
Thread 073, 2 frame(s) match
Thread 074, 2 frame(s) match
Thread 075, 2 frame(s) match
Thread 076, 2 frame(s) match
Thread 077, 2 frame(s) match
Thread 078, 2 frame(s) match
Thread 079, 2 frame(s) match
Thread 080, 2 frame(s) match
Thread 081, 2 frame(s) match
Thread 082, 2 frame(s) match
Thread 083, 2 frame(s) match
Thread 084, 2 frame(s) match
Thread 085, 2 frame(s) match
Thread 086, 2 frame(s) match
Thread 087, 2 frame(s) match
That's 48 threads sitting in the same cache-refresh code. The thread that actually owns the lock (OS thread 2a18) has an LDAP call in flight:
00 00000000`5443cfc8 000007fe`fce91e35 ntdll!ZwWaitForSingleObject+0xa
01 00000000`5443cfd0 000007fe`fce91f96 mswsock!SockWaitForSingleObject+0xe145
02 00000000`5443d050 000007fe`ff1f4efc mswsock!WSPSelect+0xac26
03 00000000`5443d1f0 000007fe`ff1f4e7d ws2_32!DPROVIDER::WSPSelect+0x3c
04 00000000`5443d230 000007fe`fdd13a19 ws2_32!select+0x75a
05 00000000`5443d330 000007fe`fdd199b5 Wldap32!DrainWinsock+0x417
06 00000000`5443d3c0 000007fe`fdd136ec Wldap32!LdapWaitForResponseFromServer+0x646
07 00000000`5443d480 000007fe`fdd1d967 Wldap32!ldap_result_with_error+0x279
08 00000000`5443d520 000007fe`f84abec7 Wldap32!ldap_result+0x74
*** WARNING: Unable to verify checksum for System.DirectoryServices.Protocols.ni.dll
09 00000000`5443d580 000007fe`f1c453ff mscorwks!DoNDirectCall__PatchGetThreadCall+0x7b
0a 00000000`5443d630 000007fe`f1c5c61b System_DirectoryServices_Protocols_ni!DomainBoundILStubClass.IL_STUB(IntPtr, Int32, Int32, System.DirectoryServices.Protocols.LDAP_TIMEVAL, IntPtr ByRef)+0x8f
0b 00000000`5443d720 000007fe`f1c59f79 System_DirectoryServices_Protocols_ni!System.DirectoryServices.Protocols.LdapConnection.ConstructResponse(Int32, System.DirectoryServices.Protocols.LdapOperation, System.DirectoryServices.Protocols.ResultAll, System.TimeSpan, Boolean)+0x2cb
It looks like we're performing this query at the time of the dump:
000000005443d538 00000000e0188bf8 System.String "(&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))"
0:086> !do 00000000e0188bf8
Name: System.String
MethodTable: 000007fef7747d90
EEClass: 000007fef734e560
Size: 170(0xaa) bytes
GC Generation: 2
(C:\Windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: (&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))
Fields:
MT Field Offset Type VT Attr Value Name
000007fef774f000 4000096 8 System.Int32 1 instance 73 m_arrayLength
000007fef774f000 4000097 c System.Int32 1 instance 72 m_stringLength
000007fef77497d8 4000098 10 System.Char 1 instance 28 m_firstChar
000007fef7747d90 4000099 20 System.String 0 shared static Empty
>> Domain:Value 00000000004e5e50:000000015fff0370 <<
000007fef7749688 400009a 28 System.Char[] 0 shared static WhitespaceChars
>> Domain:Value 00000000004e5e50:000000015fff0b90 <<
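For anyone who wants to see the shape of that call outside of Transport, here's a rough standalone sketch using System.DirectoryServices.Protocols (the same managed LDAP API that appears in the stack above). The domain controller name and search base are placeholders, the filter is copied from the string object in the dump, and this is not Exchange's actual code:

using System;
using System.DirectoryServices.Protocols;

class MemberOfQuery
{
    static void Main()
    {
        // Placeholder DC and search base; substitute values for your own forest.
        using (LdapConnection connection = new LdapConnection("dc01.contoso.com"))
        {
            connection.SessionOptions.ProtocolVersion = 3;
            connection.Bind(); // current process credentials

            // Same filter the dump shows: expand the group by finding every object
            // whose memberOf points at the group's GUID, excluding msExchCU objects.
            SearchRequest request = new SearchRequest(
                "DC=contoso,DC=com",
                "(&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))",
                SearchScope.Subtree,
                "displayName", "mail");

            SearchResponse response = (SearchResponse)connection.SendRequest(request);
            Console.WriteLine("{0} entries returned", response.Entries.Count);
        }
    }
}

Note that a single SearchRequest like this is limited by the server's page size; the ADPagedReader you'll see referenced below is what lets Transport walk all sixteen-thousand-plus members, and a standalone version would need a PageResultRequestControl to do the same.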
O.k., how many requests are referencing this query string?:
0:086> !refs 00000000e0188bf8
This command will work faster with an SOSEX heap index. To build an index, run !bhi.
Objects referenced by 00000000e0188bf8 (System.String):
NONE
Objects referencing 00000000e0188bf8 (System.String):
follow 00000000c0393040 80 System.Object[]
follow 00000000c0393108 96 System.Object[]
follow 00000000c04a1928 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c04a2960 72 System.Object[]
follow 00000000c0689800 80 System.Object[]
follow 00000000c06898c8 96 System.Object[]
follow 00000000c07c3e68 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c07c4de0 72 System.Object[]
follow 00000000c08ba200 80 System.Object[]
follow 00000000c08ba2c8 96 System.Object[]
follow 00000000c0976870 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c09777e8 72 System.Object[]
follow 00000000c0fbaf58 80 System.Object[]
follow 00000000c0fbb020 96 System.Object[]
follow 00000000c103fa40 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c1046fd0 72 System.Object[]
follow 00000000c113a868 80 System.Object[]
follow 00000000c113a930 96 System.Object[]
follow 00000000c11e75a0 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c11e8518 72 System.Object[]
follow 00000000c13b1068 80 System.Object[]
follow 00000000c13b1130 96 System.Object[]
follow 00000000c1547260 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c15481d8 72 System.Object[]
follow 00000000c1669338 80 System.Object[]
follow 00000000c1669400 96 System.Object[]
follow 00000000c175b778 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c175c6f0 72 System.Object[]
follow 00000000c1904c80 80 System.Object[]
follow 00000000c1904d48 96 System.Object[]
follow 00000000c1acba70 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c1acc9e8 72 System.Object[]
follow 00000000c1d03f08 80 System.Object[]
follow 00000000c1d03fd0 96 System.Object[]
follow 00000000c1f8f398 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c1f90310 72 System.Object[]
follow 00000000c21bd2f0 80 System.Object[]
follow 00000000c21bd3b8 96 System.Object[]
follow 00000000c243d8a0 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c243e818 72 System.Object[]
follow 00000000e0188510 176 Microsoft.Exchange.Data.Directory.ADPagedReader`1[[Microsoft.Exchange.Data.Directory.ADRawEntry, Microsoft.Exchange.Data.Directory]]
follow 00000000f096aea8 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000001106ce9f8 80 System.DirectoryServices.Protocols.SearchRequest
CLR Thread 0x49
stack:000000005443d538
stack:000000005443d5d0
stack:000000005443d970
O.k., how many threads do we have waiting on this lock?:
0:086> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
148 000000004ef62350 95 1 000000004ef6a070 2a18 86 0000000080074a08 System.Object
Waiting threads: 23 24 25 34 35 36 37 38 39 40 50 51 52 53 54 55 56 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 87
-----------------------------
Total 1548
CCW 0
RCW 0
ComClassFactory 0
Free 0
In the words of Strongbad: That is not a very small number. That is a very BIG number!
For those of you wondering what this means in common, everyday production use of Exchange:
Mail would come in, destined for recipient 'x'. Transport would see a rule with the IsMemberOf condition. Transport would then see that the local recipient cache for the group it needed to check membership against was out of date, and it would dispatch a thread to update it. Since those threads aren't aware of one another's work, this happened for many of the recipients sitting in the Submission queue. The thread making the LDAP call owned the lock as it entered the critical section, and the other threads had to wait for the lock to be released before they could make their own AD call. Since only one thread in the process can own the lock at a time, the refreshes could only complete one by one. This is also why the mail couldn't be categorized: a thread can't evaluate IsMemberOf without a current cache, and the lock kept every thread waiting its turn to refresh it.
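To put a rough, purely hypothetical number on it: if a single refresh of a sixteen-thousand-member group takes on the order of ten seconds (an assumption for illustration, not a value measured from this dump), then the roughly 48 threads we saw queued behind the lock serialize to something like 48 × 10 s ≈ 8 minutes before the last of them can finish categorizing its message. That's more than enough to trip a queue-length monitor during a flood of mail.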
For those of you wondering whether you'll run into this issue: this condition has reproduced in only a single environment, and it has recurred only twice, so far, over the span of a year. Rest assured that even though the perceivable affected footprint of this issue is small, it is being investigated by the Transport developers.