SQL Server VM Disaster Recovery between AZURE and AMAZON
A few days ago, an article by former Microsoft employee Michael Washam (http://michaelwasham.com) caught my attention:
Connecting Clouds – Creating a site-to-site Network with Amazon Web Services and Windows Azure
http://michaelwasham.com/2013/09/03/connecting-clouds-site-to-site-aws-azure
Wow! Today we cannot (yet! :-) ) have an Azure Virtual Network/VPN crossing more than one Azure datacenter, but we can have a Virtual Network/VPN spanning two different Cloud providers… Awesome!
My mind immediately went to the implications for new high availability and disaster recovery scenarios, such as building a solution that is not tied to a single Cloud Provider. Working with partners on several Azure projects, I have heard this kind of request several times: they want to ensure that at least Disaster Recovery (DR), and maybe also High Availability (HA), can be achieved even if a single Cloud Provider fails completely.
Reading the article, the procedure is pretty simple:
- Create a Virtual Private Cloud (VPC) on Amazon;
- Create a Virtual Network (VNET) on Azure with a Gateway;
- Deploy a Linux VM in Amazon VPC to host OpenSwan VPN software and configure parameters to connect to the Azure VNET Gateway;
NOTE: OpenSwan is a complete IPsec implementation for Linux; for more information, see this link: https://www.openswan.org/projects/openswan
The overall configuration process is simple, but there are some caveats:
- Even if OpenSwan, configured as in the article, seems to satisfy all the technical requirements for an Azure Virtual Network Gateway connection, it’s not officially supported by Microsoft; obviously, you will not be able to open a Support Case with Microsoft complaining that OpenSwan is not working;
- For a list of Azure gateway requirements and supported VPN devices, see the links below:
http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx#BKMK_VPNGateway
http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx#bkmk_VPNDevice
- While in Azure the VPN endpoint is highly available, since it is backed by TWO distinct (and hidden) Azure VMs, the architecture described in the article above has a single point of failure in the OpenSwan server: I don’t know if that piece of software supports some kind of HA, but you should definitely investigate and evaluate;
But wait a moment: why do I have to use OpenSwan and Linux in the Amazon VPC, since it’s not officially supported by Azure? You can use a Windows Server 2012 VM with its RRAS feature and that’s it! It’s officially supported, as you can read in the link below:
http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx#bkmk_VPNDevice
IMPORTANT: To the best of my knowledge, there is no way to make Windows Server 2012 RRAS highly available, so in this case too the proposed solution is more suitable for DR purposes than for HA.
OK, now that you know the whole story, which HA/DR scenarios can we build? Since I’m still a SQL Server guy, let me focus on SQL Server (in Azure IaaS VMs) for the sake of simplicity.
The starting point is provided in the white-paper below, where you can find all the possible HA/DR scenarios, without considering what we are discussing in this blog post:
High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines
http://msdn.microsoft.com/en-us/library/jj870962.aspx
Specifically, I’m interested in using SQL Server 2012 AlwaysOn Availability Groups (AG) to implement a DR scenario between AMAZON and AZURE, like the one below:
Here are my considerations:
- Since all AG nodes must be part of the same Windows Cluster, Active Directory connectivity is required, also by the node in AMAZON: in the picture above, I placed a Domain Controller on the AMAZON VPC too; for high-availability and performance reasons, it’s highly recommended to place at least one Domain Controller per Cloud provider;
- Please note that all 3 nodes are part of the same Windows Cluster: the quorum model used is “Node Majority” since we have an odd number of nodes;
- As on-premises, the quorum votes should be adjusted for the node in the secondary DR site, AMAZON in my example picture above; for details, see the section “Quorum Model and Node Votes” in the white-paper mentioned at the end of this post;
- The SQL Server AG replica node in AMAZON should be configured for asynchronous replication (which allows data loss) and therefore for manual rather than automatic failover, due to the network latency; if you require zero data loss, you can switch it to a synchronous replica, but be sure to test the performance impact carefully;
- The two nodes on the Azure side should be configured for synchronous replication and automatic failover;
- Be aware of the costs: here you are paying for Gateway traffic on the Azure side; obviously, there are additional costs also on the AMAZON side;
- Be aware of the bandwidth: I don’t know about AMAZON, but on the AZURE side there is a limit of approximately 60MB/sec, due to the fact that the Azure VMs used to implement the VPN Gateway are “SMALL” sized;
- Finally, I used AZURE as the primary cloud provider and AMAZON as the secondary; obviously you can do the converse, but I prefer to assume AZURE will have higher availability :-) ;
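To make the replica settings from the considerations above easier to review, here is a minimal Python sketch that writes the three-node layout down as data and checks one rule: a replica set for automatic failover must also use synchronous commit. The node names (SQL-AZ-1, SQL-AZ-2, SQL-AWS-1), the Replica class and the validate function are hypothetical and exist only for this illustration; nothing here talks to SQL Server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str           # hypothetical node name, not taken from the article
    site: str           # "AZURE" or "AMAZON"
    commit_mode: str    # "SYNCHRONOUS" or "ASYNCHRONOUS"
    failover_mode: str  # "AUTOMATIC" or "MANUAL"


def validate(replicas):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    for r in replicas:
        # Automatic failover only makes sense together with synchronous commit.
        if r.failover_mode == "AUTOMATIC" and r.commit_mode != "SYNCHRONOUS":
            problems.append(f"{r.name}: automatic failover requires synchronous commit")
    return problems


# The layout described in the considerations above: two synchronous replicas
# in AZURE, one asynchronous replica with manual failover in AMAZON.
LAYOUT = [
    Replica("SQL-AZ-1", "AZURE", "SYNCHRONOUS", "AUTOMATIC"),
    Replica("SQL-AZ-2", "AZURE", "SYNCHRONOUS", "AUTOMATIC"),
    Replica("SQL-AWS-1", "AMAZON", "ASYNCHRONOUS", "MANUAL"),
]

if __name__ == "__main__":
    for line in validate(LAYOUT) or ["layout is consistent"]:
        print(line)
```

Changing the AMAZON replica to automatic failover while leaving it asynchronous makes validate report a problem, which is the combination the considerations above warn against.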
Now, what will happen in case of a complete AZURE or AMAZON failure?
In the scenario proposed in the picture, in case of a complete AMAZON failure, the AZURE side of the architecture will not be affected at all and SQL Server will remain up and available. Conversely, in case of a complete AZURE failure, the Windows Cluster will not have the necessary quorum to remain online, so it will shut down and SQL Server will not be available. This is expected in a DR scenario: manual intervention will be required to force the surviving AMAZON-side node to start and the SQL Server AG to perform a forced failover (with potential data loss).
If you are interested in the recovery steps at the Windows Server Cluster and SQL Server 2012 AG levels, look at the white-paper below (section “Recovering from a Disaster”):
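The quorum reasoning above can be checked with a few lines of Python. This is only a simplified sketch of the “Node Majority” rule (the cluster stays online while more than half of the votes are reachable), assuming one vote per node; it ignores witnesses and any vote adjustment, and the node names and the has_quorum function are hypothetical.

```python
def has_quorum(votes, online):
    """Node Majority: online only if strictly more than half of all votes are reachable."""
    total = sum(votes.values())
    alive = sum(vote for node, vote in votes.items() if node in online)
    return alive * 2 > total


# One vote per node (assumption): two nodes in AZURE, one in AMAZON.
VOTES = {"SQL-AZ-1": 1, "SQL-AZ-2": 1, "SQL-AWS-1": 1}
AZURE_NODES = {"SQL-AZ-1", "SQL-AZ-2"}
AMAZON_NODES = {"SQL-AWS-1"}

if __name__ == "__main__":
    # Complete AMAZON failure: only the AZURE nodes are still reachable.
    print("AMAZON down -> cluster online:", has_quorum(VOTES, AZURE_NODES))
    # Complete AZURE failure: only the AMAZON node is still reachable.
    print("AZURE down -> cluster online:", has_quorum(VOTES, AMAZON_NODES))
```

With these assumptions the first check is True (2 of 3 votes) and the second is False (1 of 3 votes), which is why the surviving AMAZON node has to be forced online manually.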
AlwaysOn Architecture Guide: Building a High Availability and Disaster Recovery Solution by Using Failover Cluster Instances and Availability Groups
http://msdn.microsoft.com/en-us/library/jj215886.aspx
That’s all folks… I would like to know your opinion and, if you have any, your experience implementing this kind of scenario.
Regards.
Comments
- Anonymous
January 31, 2014
Did you try to implement this with 2012 RRAS? I do not think it's possible, as RRAS must have a non-NAT interface and AWS/Elastic IPs are always a 1:1 NAT.