Troubleshooting SQL Server I/O requests taking longer than 15 seconds - I/O stalls & Disk latency

SQL Server has encountered x occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Drive:\MSSQL\MSSQL.1\MSSQL\Data\xyz.mdf] in database [database]. The OS file handle is 0x00000000. The offset of the latest long I/O is: 0x00000000000000

Do you see this error message printed in your error logs? This is indication of Disk latency or stuck I/O or stalled I/O.

What this means in Practical terms?

Sakthi (SQL Server) wanted a book (data) from 3rd floor (database file) in the library. Librarian (Windows & Disk Subsystem) went to pick that book (data) from 3rd floor (database file) to deliver to Sakthi (SQL Server) but the librarian (Windows & Disk Subsystem) did not return back since 15 seconds. It could be because librarian was very slow in walking down the stairs on his/her way back (stalled) OR librarian's legs were caught between the staircase steps (stuck) :)

Each scheduler has a list of pending I/Os that are checked to determine whether they have been completed every time there is a context switch. The count is incremented when the request is inserted. This count is decremented when the request is completed. If there is a IO pending for more than 15 seconds in this list, it is added up to the count of I/O pending for more than 15 seconds and printed in SQL Server error log every 5 mins. (Refer support article given an end of this article for more details on Recording and Reporting of this message)

Following query shows the number of pending I/Os that are waiting to be completed for the entire SQL Server instance:

SELECT SUM(pending_disk_io_count)AS [Number of pending I/Os] FROMsys.dm_os_schedulers 

Following query gives details about the stalled I/O count reported by the first query.

SELECT  * FROMsys.dm_io_pending_io_requests

Note the io_pending column in the output which indicates whether the I/O request is pending or has been completed by Windows. An I/O request can still be pending even when Windows has completed the request, but SQL Server has not yet performed a context switch in which it would process the I/O request and remove it from this list. Is not nullable.

So if the io_pending column says 1, then I/O is not completed at Operating system level which inturn means I/O is not completed by the disk subsystems and if this column value is 0 then SQL Server is yet to process the I/O request which means say if SQL Server posted an I/O to read top 100 Employee ID and if this top 100 Employee ID is not in the memory, a request is posted to read the data from the disk. The SPID which requested this data will be moved to SUSPENDED queue (Read this article for details on SPID status) and will wait for I/O to be completed. Once Windows updates SQL Server that the posted I/O is completed, SQL Server will move the SPID which posted this I/O from SUSPENDED queue to RUNNABLE queue and then once it is RUNNING, it reads the top 100 Employee ID data loaded in memory to process it.

Following query shows the splitup of stalled I/O per each database files in the SQL Server instance. The values are cumulative since SQL Server started. You can use sample_ms column output to compare the output between two instances of the output and find which file is the cause of the stalled I/O. (sample_ms is the number of milliseconds since the computer was started)

SELECT DB_NAME(database_id)AS [Database],[file_id], io_stall_read_ms],[io_stall_write_ms],[io_stall] FROMsys.dm_io_virtual_file_stats(NULL,NULL)  

Now once you are able to locate the file or the drive which stalls the IO, time to use Performance monitor:

Once you launch performance monitor, you will get Resource Monitor giving a overall picture of the Disk usage of the Operating system like the screen below. You can sort the processes based on Read, Write and Response time. You can also see the total Disk bandwidth usage (Ex: 348 KB/sec as per screen below) and how is the usage trending (as plotted in the chart below)

Resrc_Mntr

If you notice any other process throttling the Disk, analyze the process to see whether you can stop it or move it etc…

If you notice that it is SQL Server responsible for I/O load in the server, move to “Performance Monitor” and monitor counters below.

Logical Disk –> Drive(s) experiencing I/O stall –> Avg. Disk sec/Transfer

Average Disk Sec/Transfer on a SQL Server machine is typically less than 15ms. If this value climbs, it may indicate that the I/O subsystem is not keeping up with I/O demand. (Source: 833 Error Details). And if you want to gauge I/O perfectly, here are the values for reference:

Any number Less than 10 ms - very good
Between 10 - 20 ms - okay
Between 20 - 50 ms - slow, needs attention
Greater than 50 ms – Serious I/O bottleneck (Source: MSDN)

If this value is above the recommended value, check whether SQL Server is doing extra I/O than it is supposed to be. For ex: A query was re-written adding a new column and index it was using earlier to seek the data does not have this column will end up in bookmark lookup which might bloat the I/O if executed multiple times. There are some sample scripts available at https://technet.microsoft.com/en-us/library/bb838723(office.12).aspx to find I/O bound queries and batches/requests that are generating the most I/O.

Refer the KB article below for details troubleshooting guide on Stuck and Stalled I/O: https://support.microsoft.com/kb/897284/en-us

You should always check the time when this message is reported and check whether you have a maintenance job running at that point. It could be a Full Backup, Update Statistics, Rebuild/Reorganize indexes running at that point which does a heavy I/O and because of which SQL Server experiences stalled I/O

Further reads:
https://technet.microsoft.com/en-in/library/cc966414(en-us).aspx - Physical Database Storage Design
https://technet.microsoft.com/en-us/library/aa997558(EXCHG.65).aspx - Ruling Out Disk-Bound Problems
https://www.microsoft.com/downloads/en/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en - SQLIO Disk Subsystem Benchmark Tool

https://msdn.microsoft.com/en-us/library/aa365199(v=VS.85).aspx - I/O Concepts (Windows)

Comments

  • Anonymous
    December 18, 2011
    Good read..!!

  • Anonymous
    May 20, 2012
    Nice explanation with the live example.. Thanks Sakthi...

  • Anonymous
    June 11, 2012
    Nice post

  • Anonymous
    October 20, 2013
    clearly explained...thanks for this post

  • Anonymous
    December 01, 2013
    pretty cool ..the link is powerful

  • Anonymous
    January 29, 2014
    Long last did I read such an excellent article !

  • Anonymous
    April 14, 2014
    Very good article with examples. Thanks Sakthi.

  • Anonymous
    May 19, 2014
    Nice read.......

  • Anonymous
    November 12, 2014
    Good explanation

  • Anonymous
    January 28, 2015
    Well explained !!

  • Anonymous
    May 23, 2015
    Good

  • Anonymous
    October 20, 2015
    The query SELECT DB_NAME(database_id) AS [Database],[file_id], io_stall_read_ms],[io_stall_write_ms],[io_stall] FROM sys.dm_io_virtual_file_stats(NULL,NULL) is wrong, the correct one is this: SELECT DB_NAME(database_id) AS [Database],[file_id], [io_stall_read_ms],[io_stall_write_ms],[io_stall] FROM sys.dm_io_virtual_file_stats(NULL,NULL)