The Case of Anti-Virus filter drive interference with File Stream Restore
"Denzil and I were working on this issue for a customer and Denzil has been gracious enough to write-up a blog for all of us." – Bob Dorr
From Denzil:
I recently worked with a customer on a Database restore issue where the database being restored had 2TB of File stream data. The restore in this case would just not complete successfully and would fail with the error below.
10 percent processed.
20 percent processed.
30 percent processed.
40 percent processed.
Msg 3634, Level 16, State 1, Line 1
The operating system returned the error '32(The process cannot access the file because it is being used by another process.)' while attempting 'OpenFile' on 'F:\SQLData11\DataFiles\535cc368-de43-4f03-9a64-f5506a3f532e\547fc3ed-da9f-44e0-9044-12babdb7cde8\00013562-0006edbb-0037'.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
Subsequent restore attempts would fail with the same error though on "different" files and at a different point in the restore cycle.
Given that this was "not" the same file or the same point of the restore on various attempts my thoughts immediately went to some filter driver under the covers wreaking some havoc. I ran an a command to see what filter drivers were loaded (trimmed output below.)
C:\>fltmc instances
Filter Volume Name Altitude Instance Name
-------------------- ----------------------- ------------ ---------------------- -----
BHDrvx64 F:\SQLData11 365100 BHDrvx64 0
eeCtrl F:\SQLData11 329010 eeCtrl 0
SRTSP F:\SQLData11 329000 SRTSP 0
SymEFA F:\SQLData11 260600 SymEFA 0
RsFx0105 \Device\Mup 41001.05 RsFx0105 MiniFilter Instance 0
SymEFA = Symantec extended file attributes driver
SRTSP = Symantec Endpoint protection
RsFx0105 = SQL Server File Stream filter driver.
In discussing this with the customer, Anti-virus exclusions were controlled by GPO so he had put in a request to exclude the respective folders, yet the issue still continued.
In order to do my due diligence, the other question was whether we "released" the file handle after we created it, and whether someone else grabbed it? So we (Venu, Bob and I) did take a look at the code and this can be the case. On SQL Server 2008 R2 when we call the CreateFile API and we hardcode the shareAccess parameter to 0 which is exclusive access while we have it open to prevent secondary access.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
If this parameter is zero and CreateFile succeeds, the file or device cannot be shared and cannot be opened again until the handle to the file or device is closed. For more information, see the Remarks section.
Once the file is created, we release the EX latch and can close the file handle, on the file, but sqlservr.exe continues to hold the lock on the file itself during the restore process. Once the restore operation is completed, we no longer hold an exclusive lock to the file.
We can reopen file handles during the Recovery process so the other thought was perhaps it was a transaction affected by recovery and GC and potentially some race condition but in this case we know that the restore was failing prior to that as it didn't reach 100% so that could be ruled out as well.
Getting a dump at the failure time showed me the same Restore Stack but different dumps showed multiple different files in question so it wasn't a particular Log record sequence per say causing this.
sqlservr!ex_raise
sqlservr!HandleOSError
sqlservr!FileHandleCache::OpenFile
sqlservr!FileHandleCache::ProbeForFileHandle
sqlservr!FileHandleCache::GetFileHandle
sqlservr!RestoreCopyContext::RestoreFilesystemData
BackupIoRequest::StartDatabaseScatteredWrite
Given now that it was unlikely it was SQL Server, I concentrated more on the Filter driver theory. I tried to capture Process monitor, but given the time it took and amount of files touched, Process monitor was not all that useful. I couldn't filter on a specific folder as it failed on different folders and there were 10 + mount points involved.
However from Process monitor while the restore was going on, I looked at the stack for some I/O operations (not ones that failed by any means) and I still saw fltmgr.sys sitting there for an OpenFile Call on a file in the filestream directory
fltmgr.sys + 0x2765 |
0xfffffa6001009765 |
C:\Windows\system32\drivers\fltmgr.sys |
fltmgr.sys + 0x424c |
0xfffffa600100b24c |
C:\Windows\system32\drivers\fltmgr.sys |
fltmgr.sys + 0x1f256 |
0xfffffa6001026256 |
C:\Windows\system32\drivers\fltmgr.sys |
ntoskrnl.exe + 0x2c8949 |
0xfffff80002918949 |
C:\Windows\system32\ntoskrnl.exe |
ntoskrnl.exe + 0x2c0e42 |
0xfffff80002910e42 |
C:\Windows\system32\ntoskrnl.exe |
ntoskrnl.exe + 0x2c19d5 |
0xfffff800029119d5 |
C:\Windows\system32\ntoskrnl.exe |
ntoskrnl.exe + 0x2c6fb7 |
0xfffff80002916fb7 |
C:\Windows\system32\ntoskrnl.exe |
ntoskrnl.exe + 0x2b61a8 |
0xfffff800029061a8 |
C:\Windows\system32\ntoskrnl.exe |
ntoskrnl.exe + 0x57573 |
0xfffff800026a7573 |
C:\Windows\system32\ntoskrnl.exe |
ntdll.dll + 0x471aa |
0x77b371aa |
C:\Windows\System32\ntdll.dll ZwOpenFile |
kernel32.dll + 0x10d48 |
0x779d0d48 |
C:\Windows\system32\kernel32.dll |
kernel32.dll + 0x10a7c |
0x779d0a7c |
GetVolumeNameForRoot |
_____SQL______Process______Available + 0x695c7e |
0x1a080fe |
GetVolumeDeviceNameAndMountPoint |
_____SQL______Process______Available + 0x6d6898 |
0x1a48d18 |
|
_____SQL______Process______Available + 0x6d714a |
0x1a495ca |
sqlservr!CFsaShareFilter::RsFxControlContainerOwnership |
Also looking at some other Symantec related issues, I found an article not necessarily to do with any SQL restores but the fact that this was a possibility – again this has to do with a specific issue on a specific build, but am illustrating that Filter drivers can cause some unexpected behaviors.
As far as Anti-virus exclusions go, we actually have guidance in the article below: https://support.microsoft.com/kb/309422
And also in our File stream best practices article: https://msdn.microsoft.com/en-us/library/dd206979(v=SQL.105).aspx
When you set up FILESTREAM storage volumes, consider the following guidelines:
•Turn off short file names on FILESTREAM computer systems. Short file names take significantly longer to create. To disable short file names, use the Windows fsutil utility.
•Regularly defragment FILESTREAM computer systems.
•Use 64-KB NTFS clusters. Compressed volumes must be set to 4-KB NTFS clusters.
•Disable indexing on FILESTREAM volumes and set disablelastaccess to set disablelastaccess, use the Windows fsutil utility.
•Disable antivirus scanning of FILESTREAM volumes when it is not unnecessary. If antivirus scanning is necessary, avoid setting policies that will automatically delete offending files.
•Set up and tune the RAID level for fault tolerance and the performance that is required by an application.
Looking at another run of "fltmc instances" command output and still saw the Anti-virus components on the list for those mount points. Given we "thought" we had put an exclusion in for the whole drive, and it was showing up, it was time to look at this closer
- Excluded the drives where the data was being stored – Restore still failed
- Stopped the AV Services - Restore still failed
- Uninstalled Anti-virus – Restore now succeeded
Voila once we uninstalled AV on this machine, the restore succeeded. The customer is broaching this this with the AV vendor to figure out more of the root cause.
Denzil Ribeiro – Senior PFE
Comments
Anonymous
April 02, 2013
We've also seen issues with AV network filter drivers. This is true even when the source and destination were on the same server. Recommendation that we ultimately got from Premier Support was to pull the AV install and scan from remote.Anonymous
October 08, 2015
The comment has been removed