Enhance AlwaysOn Failover Policy to Test SQL Server Database Data and Log Drives

In SQL Server 2012 and 2014, AlwaysOn health diagnostics detect the health of the SQL Server process in several ways. However, no health detection is performed on the accessibility or viability of the databases defined in AlwaysOn availability groups. If the disk hosting the availability group database or log files is lost, AlwaysOn health does not detect this event, and application runtime errors accessing the database will ensue. Loss of the drive or errors accessing the drive that host your availability database data and log files will affect access to your production data.

As per the Flexible Failover Policy for Automatic Failover of an Availability Group (SQL Server)

Damaged databases and suspect databases are not detected by any failure-condition level. Therefore, a database that is damaged or suspect (whether due to a hardware failure, data corruption, or other issue) never triggers an automatic failover.

NOTE SQL Server 2016 enhances the AlwaysOn health diagnostics with database health detection. If AlwaysOn availability group database health detection has been selected for your availability group and an availability database transitions out of the ONLINE state (sys.databases.state_desc), the entire availability group will failover automatically if configured to do so. For more information see MSDN topic 'CREATE AVAILABILITY GROUP' and section on ‘DB_FAILOVER’

CREATE AVAILABILITY GROUP (Transact-SQL)

You can enhance the availability of your databases by detecting disk health. Add a generic script resource to your availability group resource group that does basic read or write tests against the drives hosting your availability group database data and log files. The following describes how to add a generic script resource as a dependency to your availability group resource to enhance AlwaysOn health detection with a basic disk health check.

Use a generic script resource to do basic health check on SQL Server data and log drives

Here is the high level description of the implementation of the generic script resource to detect availability group database drive health.

Add a generic script clustered resource to the availability group resource group. Make the availability group resource dependent on the generic resource script. That way, if the script resource reports IsAlive failure, Windows Cluster will attempt to restart or failover the availability group resource if configured to do so.

The generic script clustered resource IsAlive test creates a text file in the specified Data drive location and a text file in the specified Log drive location. If the file exists, the script will overwrite it.

Attached to this blog is a zipped file, GenericScript_SQLIsAlive.zip, containing

sqlisalive.vbs <- The generic script written in Visual Basic Script and implements Windows Cluster IsAlive

Add_SQLIsAliveScript.ps1 <- The PowerShell script which adds the generic script resource to your availability group resource group and sets the availability group dependency on your generic script resource

readme.txt <- Step by step instructions for implementing the generic script resource and additional instructions on how to test the script.

Implement the generic script resource

I Configure the generic script, sqlisalive.vbs

Data and Log Drive Paths Currently, the generic script is configured to test the following drive and paths: c:\temp\data and c:\temp\log. For testing purposes, create these paths on the local drive of each replica (primary and automatic failover partner secondary). Later, you can change them to the appropriate drive and paths where your database data and log files live.

    DataDriveFile="c:\temp\data\ScriptFileData.txt"
    LogDriveFile="c:\temp\log\ScriptFileLog.txt"

II Configure and execute the Powershell script to deploy the generic script to your availability group

NOTE This generic script only implements IsAlive which runs every 60 seconds.

1 Ensure your availability group has two replicas configured for automatic failover.

2 Copy the generic script file to an identical local storage location like 'C:\temp\sqlisalive.vbs' on both servers whose replicas are configured for automatic failover.

3 Create the paths for the health check, c:\temp\data and c:\temp\log.

4 The Add_SQLIsAliveScript.ps1 PowerShell script adds the generic script resource to your availability group and adds a dependency to your availability group resource, on the generic script resource. In the Add_SQLIsAliveScript.ps1, change the following variables.

     Set $ag to your availability group name
     Set $listener to your availability group listener name. If your availabiltiy group does not have a listener, set $listener to ""
     Set $scriptfilepath to the path and file name of your sqlisalive.vbs script

5 On the server hosting the primary replica, run the PowerShell script Add_SQLIsAliveScript.ps1 to add the generic script resource to your availability group resource group.

6 Launch Failover Cluster Manager and review the availability group resource group to confirm addition of the generic script resource to the availability group resource group. The generic script should appear and come online in the availability group resource group under the Resources tab.

7 Confirm that the dependency has been created in the availability group resource, on the generic script resource.

NOTES

The attached readme.txt file has instructions on how to test the script resource to ensure that it can failover your availability group resource.

Diagnose failure detection by generic script resource

Generate the cluster log for the node hosting the primary replica and search for 'Data Drive Create File' or 'Log Drive Create File' - to locate success or failure report of the generic script resource IsAlive:

00001b04.00002924::2015/12/07-17:16:41.798 INFO  [RES] Generic Script <sqlisalive>: Entering IsAlive
00001b04.00002924::2015/12/07-17:16:41.801 INFO  [RES] Generic Script <sqlisalive>: Data Drive Create File Succeeded
00001b04.00002924::2015/12/07-17:16:41.801 INFO  [RES] Generic Script <sqlisalive>: Log Drive Create File Succeeded

Or for example, you set the \Data folder to read only:

00001b04.00002924::2015/12/07-17:17:41.801 INFO  [RES] Generic Script <sqlisalive>: Entering IsAlive
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Data Drive Create File Succeeded
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Data Drive Create File Failed
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Permission denied
00001b04.00002924::2015/12/07-17:17:41.804 ERR   [RES] Generic Script <sqlisalive>: 'IsAlive' script entry point returned FALSE.'
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Return value of 'IsAlive' script entry point caused HRESULT to be set to 0x00000001.
00001b04.00002298::2015/12/07-17:17:41.804 WARN  [RHS] Resource sqlisalive IsAlive has indicated failure.

GenericScript_SQLIsAlive.zip

Comments

  • Anonymous
    July 19, 2016
    Thanks a lot for your helpful article. But one thing isn't clear for me: Is this script also usable with SQL Server 2012/14 or only with SQL Server 2016?
    • Anonymous
      July 28, 2016
      In SQL Server 2012 and 2014, the AlwaysOn Availability Group in automatic failover mode will not initiate a failover at a database level failure. For example if a data or log disk failed for a user database, or if a user database had corruption and caused a state change from online to some other state (such as suspect), the database cluster would not fail over the availability group automatically.In SQL Server 2016, we added an additional configuration setting called DB_FAILOVER that is part of the create availability group or alter availability group command, that will initiate a failover if a database on the primary replica is in any state other than online. This means that a database level failure will induce an automatic failover if the AG is configured for automatic failover and the databases were in a synchronized state at the time of failure.The method of adding this script is a way that you can have the cluster perform an automatic failover in the event of a disk failure. The VB script writes a file to the path specified in the script. If that write fails, then it means that disk is considered failed and the script fails. The failing script causes the script resource in the cluster to fail. If the dependencies on the script resource are configured properly, this would make the cluster induce a failover due to disk/VB Script failure.Since this script does not connect to or utilize SQL Server in any way, it could be implemented for SQL 2012, 2014, or 2016. The benefit in 2016 is not as great if the DB_FAILOVER parameter is set to on since a database level failure would induce an automatic failover.
      • Anonymous
        October 09, 2016
        Thanks Mr. Carter...This will definitely help us...Thanks for the nice Article with Explanation
  • Anonymous
    January 17, 2017
    Hi, does this script (maybe with small changes) can be used to force a failover even when I'm using an asynchronous option between 2 SQL Server 2012? or do I still needa synchronous mode?