VSS backup of AlwaysOn Secondaries

Hi Everyone,

Today I’m going to highlight one of the changes brought by SQL Server 2012 SP2, which is the way we handle VSS Backup requests on AlwaysOn Secondary Databases.

Until now, any request for a FULL database backup (VSS_BT_FULL) thru VSS against a DB that is an AlwaysOn secondary was failing by design. Our VSS Writer SQLWriter would return FAILED_AT_PREPARE_SNAPSHOT (0x800423f4 - VSS_E_WRITERERROR_NONRETRYABLE).

A copy-only VSS backup (VSS_BT_COPY) would work.

The rationale being the following: a FULL backup is actually updating the target DB (reset of differential bitmap mainly), which is not possible when the DB is read only. Furthermore, because of the failover possibilities introduced by AlwaysOn, the favored option was to use Native SQL Servers backup that could rely on SQL Server variable backup location (https://msdn.microsoft.com/en-us/library/hh245119.aspx) if needed, and be ‘alwayson –aware’.

So that could be the end of the story: against an AlwaysOn Secondary DB, either use Copy_only VSS backups or use native backups.

But of course that wouldn’t make for a very interesting blog post…

Enters HyperV…

Consider the following scenario:

Large Windows HyperV Servers, hosting many HyperV virtual Servers, some of them SQL Servers in Always On architecture.

In short: a Private Cloud.

In this context, the IT usually takes care of the infrastructure at host level, and lets users deal with whatever happens within the VMs. One of the key tasks of IT is to manage backups (eg. for disaster recovery at datacenter level, or to provide restore of single VMs).

And the mainstream way to do that is to take VSS backups of the Host Disk Volumes. Microsoft System Center DPM will do exactly that.

But VSS backups are all about taking backups that are consistent: in ‘standalone’ SQL Server context you may already know all the logic SQLWriter implements to make sure that IO against the Databases that are backed up are frozen during the snapshot operation. So, back to our HyperV context, collecting a point-in-time image of a VHD without bothering with what happens within the VM would be defeating that very purpose right?

So what happens is the following: the VSS backup is propagated to Guest VMs thru HyperV integration services. That propagation hardcodes the backup type to VSS_BT_FULL, and therefore all guest VMs are initiating a VSS backup/snapshot in their own context. The purpose is to make sure that all applications are quiesced within all running VMs at the time we take the snapshot at the host level. This will enable us to generate a consistent backup of running VMs.

But let’s now put this in the context where one of the VMs is running an AlwaysOn secondary DB: you guessed it, it’s not going to work:

clip_image002

The important thing to know here is that the error returned by SQLWriter in VM3 will actually bubble up all the way to the initial VSS backup command at Host level, and will make it fail as a whole.

So we ended up in a situation where the IT infrastructure folks would see their Host backups failing from time to time for an unknown reason, depending on whether one or more of the VM present on the Host Drive being backup up had a secondary AlwaysOn DB! It could be that the AlwaysOn AG spans different HyperV hosts and therefore that the presence of a Secondary DB on a given Host is not something static over time.

Because of the complexity of the whole call chain, and because infrastructure IT operators may not have any visibility (or understanding) of the VM content, you can imagine what kind of troubleshooting challenges this would offer… And even when the situation is understood, well, what do we do? If host level backup must be somehow manually synchronized to the applicative state of Guest VMs, the Private Cloud scenario becomes quite more complicated all of a sudden.

This is the reason why SQL Server 2012 SP2 ships a code change for SQLWriter that will implement the following:

clip_image004

As you can see, SQLWriter now detects this specific situation and changes the backup type to VSS_BT_COPYONLY. This will only happen for VSS_BT_FULL backups against AlwaysOn secondary Dbs. VSS_BT_FULL backups against primary DB will happen without change.

In this case, the VSS backup will now successfully complete in VM3 and the host-level backup success state will no longer be tied to guest VM’s AlwaysOn activity. Private Cloud scenario unlocked!

Important note: the fact that VSS backup of AlwaysOn secondaries now works does not make it the preferred solution to backup SQL Server AlwaysOn architectures. The main purpose of the SP2 change is to avoid a situation where a single SQL Server in a VM fails a complete host-level backup operation that encompassing dozens of VMs.

The resulting backup for the VM hosting SQL should be considered a Disaster Recovery one, where AlwaysOn will be removed at once at restore time, not as a way to rebuild a subset of the nodes for an AlwaysOn Availability group. And for regular databases within the VM, that backup is as good any regular VSS one.

Finally, SQL Server 2012 SP2 only contains a partial fix for this issue. Servers running case-sensitive sort orders will require SQL Server 2012 SP2 Cumulative Update 2.

HTH,

Guillaume Fourrat
SQL Server Escalation Engineer
Microsoft France

Comments

  • Anonymous
    August 13, 2014
    Hey guys - great blog post, but for heaven's sake - please call the feature what it is - AVAILABILITY GROUPS. AlwaysOn is not a feature and it confuses the heck out of people. AlwaysOn is a marketing term and covers FCIs and AGs.

  • Anonymous
    August 19, 2014
    Thanks for the update.  Why not always use a COPY_ONLY in response to VSS backup request?  Do system differentials make use of the database backup chain and need to reset the differential reference?  Who should control the SQL Server backup chain – operations or DBA?   We are a county government.  We support all sorts of applications with a great deal of diversity when it comes to security and SLA.  The system recovery tools now available are great.  No more building a new system and restoring all databases one at a time.  However, a full system recovery is not the entire SLA.   DBA is often called upon to restore just part of a database to a point in time.  DBA can’t use a full system recovery without losing too much data or affecting unrelated applications.  I’m a bit nervous now that the system backup can break our backup chain – and the SLA.  Can we use the system backup to restore a single database to a point in time?   Should DBA have the right to control the response to a VSS backup request?  I really want it to be COPY_ONLY by default so that my backup chain remains valid.  Are SQL Server backups to disk and backup chains going to be a thing of the past - replaced by system level backups?  If not, perhaps there needs to be two backup chains - one for system backups and one for SQL Server?   connect.microsoft.com/.../protect-us-from-veeam I’m also curious about SAN replication.  Does it use VSS?  COPY_ONLY?  

  • Anonymous
    November 04, 2014
    @Allan Thanks for the feedback Allan, I'm guilty as charged. Correction pending.

  • Anonymous
    November 04, 2014
    The comment has been removed

  • Anonymous
    August 04, 2015
    The comment has been removed