The Case of the Phantom Spid: Troubleshooting and Resolving an Orphaned DTC Transaction

Several weeks ago one of our customers encountered an issue in which an index maintenance job that had always completed successfully, began failing with each nightly run. While troubleshooting the issue, the customer ran the DBCC OPENTRAN command against the affected database and discovered that another session, which had been running for several days, might be blocking the index maintenance job and causing it to fail. After repeated attempts at killing the long running session failed, the customer contacted our team and asked for our assistance in resolving the issue.

The first thing we asked the customer to do was to run the DBCC OPENTRAN command again against the affected database to confirm that the long running session was still active. As expected, the long running session (spid 178) was still active and had a transaction start time of several days in the past that coincided with the time the index maintenance job began failing.

We then asked the customer to run a query against the sys.dm_exec_sessions DMV to get the login time for spid 178. Surprisingly, the login time for spid 178 was only seconds in the past, and the login time was changing with each run of the sys.dm_exec_sessions query. This made it clear that the spid 178 returned by the DBCC OPENTRAN command was not the same spid 178 as the ones returned by the sys.dm_exec_sessions query.

So what was the origin of this phantom spid? To answer this question, we had to find a record of phantom spid 178 or its associated transaction somewhere on the server. We started our search by querying the sys.dm_tran_active_transactions and sys.dm_tran_session_transactions DMVs, but didn’t have any luck finding records related to phantom spid 178. But when we queried the sys.dm_tran_database_transactions DMV, we found a record in the affected database with a database_transaction_begin_time that matched the start time of phantom spid 178’s transaction exactly. Feeling confident that we had found phantom spid 178’s transaction record, we saved the transaction_id of the record.

Next, we queried the sys.syslockinfo compatibility view for any records with a req_transactionid matching the transaction id that we saved from the previous query. This paid off as the query returned several records matching the saved transaction id. But the req_spid value of these records was -2 and not 178! Was this the wrong transaction? No. SQL Server uses a req_spid value of -2 to designate an orphaned DTC transaction.

So how could we terminate a spid that didn’t exist? The answer is we couldn’t. But what we could do was terminate the unit of work (UOW) associated with the orphaned transaction using the KILL command. Luckily a DTC transaction’s UOW is found in the req_transactionUOW column of the sys.syslockinfo compatibility view. And so our final step in resolving this customer issue was to execute the KILL command against the UOW that we retrieved from the sys.syslockinfo records associated with the offending transaction id. As a result, the phantom spid released the objects it had locked and was never heard from again.

Chris Miller, SQL Server Dedicated Support Engineer

Comments

  • Anonymous
    June 24, 2009
    Did they also blame MSDTC for this "falure", or did they manage to identify the non-committing or non-rollbacking application and fix it ?

  • Anonymous
    August 09, 2009
    The comment has been removed

  • Anonymous
    June 29, 2014
    Good stuff to know.Thanks for the way to get rid of phantom.