Use case definition
To support this worked example, the fictitious firm "Contoso" will be used with an Azure Data Platform based upon Microsoft Reference Architectures.
Data Service - Component View
Contoso has implemented the following foundational Azure architecture, which is a subset of the Enterprise Landing Zone design.
The numbers in the following descriptions correspond to the preceding diagram.
Contoso's Azure Foundations - Workflow
- Enterprise enrollment - Contoso's top parent enterprise enrollment within Azure reflecting its commercial agreement with Microsoft, its organizational account structure and available Azure subscriptions. It provides the billing foundation for subscriptions and how the digital estate is administered.
- Identity and access management - The components required to provide identity, authentication, resource access and authorization services across Contoso's Azure estate.
- Management group and subscription organization - A scalable group hierarchy aligned to the data platform's core capabilities, allowing operationalization at scale using centrally managed security and governance where workloads have clear separation. Management groups provide a governance scope above subscriptions.
- Management subscription - A dedicated subscription for the various management-level functions required to support the data platform.
- Connectivity subscription - A dedicated subscription for the connectivity functions of the data platform enabling it to identify named services, determine secure routing and communication across and between internal and external services.
- Landing zone subscription - One-to-many subscriptions for Azure native, online applications, internal and external facing workloads and resources.
- DevOps platform - The DevOps platform that supports the entire Azure estate. This platform contains the code base source control repository and CI/CD pipelines enabling automated deployments of infrastructure as code (IaC).
Note
Many customers still retain a large infrastructure as a service (IaaS) footprint. To provide recovery capabilities across IaaS, the key component to be added is Azure Site Recovery. Site Recovery will orchestrate and automate the replication of Azure VMs between regions, on-premises virtual machines and physical servers to Azure, and on-premises machines to a secondary datacenter.
Within this foundational structure, Contoso has implemented the following elements to support its enterprise business intelligence needs, aligned to the guidance in Analytics end-to-end with Azure Synapse.
Contoso's Data Platform - Workflow
The workflow is read left to right, following the flow of data:
- Data sources - The sources or types of data that the data platform can consume from.
- Ingest - The Platform's capability to ingest data from various sources of varying structure and speed. This design reflects a Lambda architecture.
- Store - The capability to securely store data at scale that has been ingested onto the platform.
- Process - The Platform's capability to process data, making it "fit for purpose" for downstream processes like cleansing, standardizing and modeling. The pre-processing of data typically ensures that it's in a "position and a condition, ready for use."
- Enrich - The capability to enhance data processed on the platform via statistical, machine learning, or other modeling techniques or prebuilt Azure AI services.
- Serve - The Platform's capability to shape and present data for downstream consumption.
- Data consumers - The individuals, applications or downstream processes that consume data from the platform's various serving touchpoints.
- Discover and govern - The Platform's capabilities to govern the data it contains and ensure it's indexed, discoverable/searchable, well-described, with full lineage, and is transparent to its end users and consuming processes.
- Platform - The foundation upon which the platform is built, that is, Contoso's Azure foundations as described above.
Note
For many customers, the conceptual level of the Data Platform reference architecture used will align, but the physical implementation may vary. For example, ELT (extract, load, transform) processes may be performed through Azure Data Factory, and data modeling by Azure SQL server. To address this concern, the Stateful vs stateless components section below will provide guidance.
For the Data Platform, Contoso has selected the lowest recommended production service tiers for all components and has chosen to adopt a "Redeploy on disaster" disaster recovery (DR) strategy based upon an operating cost-minimization approach.
The following sections will provide a baseline understanding of the DR process and levers available to customers to uplift this posture.
Azure service and component view
The following tables present a breakdown of each Azure service and component used across the Contoso – Data platform, with options for DR uplift.
Note
The sections below are organized by stateful vs stateless services.
Stateful foundational components
Microsoft Entra ID including role entitlements
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Premium P1
- DR uplift options: N/A, Microsoft Entra resiliency is part of its software as a service (SaaS) offering.
- Notes
Azure Key Vault
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Recovery Services Vault
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Default (geo-redundant storage (GRS))
- DR uplift options: Enabling Cross Region Restore allows data to be restored in the secondary, paired region.
- Notes
- While locally redundant storage (LRS) and zone-redundant storage (ZRS) are available, selecting either one requires changing the configuration from the default setting.
Azure DevOps
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: DevOps Services
- DR uplift options: DevOps service and data resiliency is part of its SaaS offering.
- Notes
- Azure DevOps Server, the on-premises offering, will remain the customer's responsibility for disaster recovery.
- If third-party services (SonarCloud, JFrog Artifactory, Jenkins build servers, for example) are used, they'll remain the customer's responsibility for recovery from a disaster.
- If IaaS VMs are used within the DevOps toolchain, they'll remain the customer's responsibility for recovery from a disaster.
Stateless foundational components
Subscriptions
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Management Groups
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Azure Monitor
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Cost Management
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Microsoft Defender for Cloud
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Azure DNS
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Single Zone - Public
- DR uplift options: N/A, DNS is highly available by design.
Network Watcher
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: N/A, Covered as part of the Azure service.
Virtual Networks, including Subnets, user-defined route (UDR) & network security groups (NSG)
- Component recovery responsibility: Contoso
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: N/A
- DR uplift options: VNETs can be replicated into the secondary, paired region.
Azure Firewall
- Component recovery responsibility: Contoso
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Standard
- DR uplift options: Azure Firewall is highly available by design and can be created with Availability Zones for increased availability.
Azure DDoS
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: DDoS Network Protection
- DR uplift options: N/A, Covered as part of the Azure service.
ExpressRoute Circuit
- Component recovery responsibility: Contoso, connectivity partner and Microsoft
- Workload/configuration recovery responsibility: Connectivity partner and Microsoft
- Contoso SKU selection: Standard
- DR uplift options:
- ExpressRoute can be uplifted to use private peering, delivering a geo-redundant service.
- ExpressRoute also has high availability (HA) designs available.
- Site-to-Site VPN connection can be used as a backup for ExpressRoute.
- Notes
- ExpressRoute has built-in redundancy, with each circuit consisting of two connections to two Microsoft Enterprise Edge routers (MSEEs) at an ExpressRoute location from the connectivity provider's or client's network edge.
- An ExpressRoute Premium circuit enables access to all Azure regions globally.
VPN Gateway
- Component recovery responsibility: Contoso
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Single Zone - VpnGw1
- DR uplift options: A VPN gateway can be deployed into an Availability Zone with the VpnGw#AZ SKUs to provide a zone redundant service.
Azure Load Balancer
- Component recovery responsibility: Contoso
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Standard
- DR uplift options:
- A load balancer can be configured for Zone redundancy within a region with availability zones. If so, the data path will survive as long as one zone within the region remains healthy.
- Depending on the primary region, a cross-region load balancer can be deployed for a highly available, cross regional deployment.
- Notes
- Azure Traffic Manager is a DNS-based traffic load balancer. This service supports the distribution of traffic for public-facing applications across the global Azure regions. This solution will provide protection from a regional outage within a high availability design.
Stateful data platform-specific services
Storage Account: Azure Data Lake Gen2
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: LRS
- DR uplift options: Storage Accounts have a broad range of data redundancy options from primary region redundancy up to secondary region redundancy.
- Notes
- GRS is recommended to uplift redundancy, providing a copy of the data in the paired region.
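The redundancy range mentioned above can be written down as a small lookup. This is a minimal sketch, not an Azure API: the `REDUNDANCY` table and the `uplift_options` helper are hypothetical names, and the table only records whether each option spreads copies across availability zones and whether it keeps a copy in a secondary region.

```python
# Hypothetical lookup of storage redundancy options and their scope.
REDUNDANCY = {
    "LRS": {"zones": False, "secondary_region": False},   # copies within one datacenter
    "ZRS": {"zones": True, "secondary_region": False},    # copies across zones in the primary region
    "GRS": {"zones": False, "secondary_region": True},    # LRS plus a copy in the secondary region
    "GZRS": {"zones": True, "secondary_region": True},    # ZRS plus a copy in the secondary region
}


def uplift_options(current: str) -> list[str]:
    """Return the options, other than `current`, that keep a copy in a secondary region."""
    if current not in REDUNDANCY:
        raise ValueError(f"unknown redundancy option: {current}")
    return [
        name
        for name, scope in REDUNDANCY.items()
        if scope["secondary_region"] and name != current
    ]


# Contoso's current selection is LRS.
print(uplift_options("LRS"))
```

For Contoso's LRS selection, the helper lists the geo-redundant options, which matches the GRS recommendation in the notes.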
Azure Event Hubs
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Standard
- DR uplift options: An event hub namespace can be created with availability zones enabled. This resiliency can be extended to cover a full region outage with Geo-disaster recovery.
- Notes
- By design, Event Hubs geo-disaster recovery doesn't replicate data, so there are several considerations to keep in mind for failover and failback.
Azure IoT Hubs
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Standard
- DR uplift options:
- IoT Hub Resiliency can be uplifted by a cross regional HA implementation.
- Microsoft provides the following guidance for HA/DR options.
- Notes
- IoT Hub provides Microsoft-Initiated Failover and Manual Failover by replicating data to the paired region for each IoT hub.
- IoT Hub provides Intra-Region HA and will automatically use an availability zone if created in a predefined set of Azure regions.
Azure Stream Analytics
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Standard
- DR uplift options: While Azure Stream Analytics is a fully managed platform as a service (PaaS) offering, it doesn't provide automatic geo-failover. Geo-redundancy can be achieved by deploying identical Stream Analytics jobs in multiple Azure regions.
Azure Machine Learning
- Component recovery responsibility: Contoso and Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: General Purpose, D Series instances
- DR uplift options:
- Azure Machine Learning depends on multiple Azure services, some of which are provisioned in the customer's subscription. As such, the customer remains responsible for the high-availability configuration of these services.
- Resiliency can be uplifted via a multi-regional deployment.
- Notes
- Azure Machine Learning itself doesn't provide automatic failover or disaster recovery.
Power BI
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Power BI Pro
- DR uplift options: N/A, Power BI's resiliency is part of its SaaS offering.
- Notes
- Power BI resides in the Office 365 tenancy, not that of Azure.
- Power BI uses Azure Availability Zones to protect Power BI reports, applications and data from datacenter failures.
- In the case of regional failure, Power BI will fail over to a new region, usually in the same geographical location, as noted in the Microsoft Trust Center.
Azure Cosmos DB
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Single Region Write with Periodic backup
- DR uplift options:
- Single-region accounts may lose availability following a regional outage. Resiliency can be uplifted by configuring a single write region with at least one additional (read) region and enabling service-managed failover.
- It's recommended that Azure Cosmos DB accounts used for production workloads enable automatic failover. Without this configuration, the account loses write availability for the entire duration of the write region outage, because manual failover won't succeed due to lack of region connectivity.
- Notes
- To protect against data loss in a region, Azure Cosmos DB provides two different backup modes - Periodic and Continuous.
- Regional failovers are detected and handled in the Azure Cosmos DB client. They don't require any changes from the application.
- The following guidance describes the impact of a region outage based upon the Cosmos DB configuration.
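The Cosmos DB failover behavior described above can be summarized as a small decision function. This is a minimal sketch that models only the statements in this section. The `CosmosAccount` type, its fields, and the returned wording are hypothetical and aren't part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CosmosAccount:
    """Hypothetical description of an account's DR-relevant settings."""
    regions: int                     # total regions the account is replicated to
    service_managed_failover: bool   # automatic (service-managed) failover enabled


def write_availability_during_outage(account: CosmosAccount) -> str:
    """Summarize expected write availability if the write region goes down."""
    if account.regions < 2:
        # A single-region account has no other region to fail over to.
        return "may be unavailable until the region recovers"
    if account.service_managed_failover:
        # The service promotes another region to become the write region.
        return "restored after service-managed failover"
    # Manual failover can't succeed without connectivity to the failed region.
    return "writes unavailable for the duration of the outage"


contoso_current = CosmosAccount(regions=1, service_managed_failover=False)
contoso_uplifted = CosmosAccount(regions=2, service_managed_failover=True)

print(write_availability_during_outage(contoso_current))
print(write_availability_during_outage(contoso_uplifted))
```

The two sample accounts contrast Contoso's current single-region selection with the uplifted configuration of one write region, one read region, and service-managed failover.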
Azure Data Share
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: N/A
- DR uplift options: Azure Data Share resiliency can be uplifted by an HA deployment into a secondary region.
Microsoft Purview
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: N/A
- DR uplift options: N/A
- Notes
- As of October 2024, Microsoft Purview doesn't support automated business continuity and disaster recovery (BCDR). Until that support is added, the customer is responsible for all backup and restore activities.
Stateless data platform-specific services
Azure Synapse: Pipelines
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Compute Optimized Gen2
- DR uplift options: N/A, Synapse resiliency is part of its SaaS offering using the automatic failover feature.
- Notes
- If Self-Hosted Data Pipelines are used, they'll remain the customer's responsibility for recovery from a disaster.
Azure Synapse: Data Explorer Pools
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Compute Optimized, Small (4 cores)
- DR uplift options: N/A, Synapse resiliency is part of its SaaS offering.
- Notes
- Availability Zones are enabled by default for Synapse Data Explorer where available.
Azure Synapse: Spark Pools
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Compute Optimized, Small (4 cores)
- DR uplift options: N/A, Synapse resiliency is part of its SaaS offering.
- Notes
- Currently, Azure Synapse Analytics only supports disaster recovery for dedicated SQL pools and doesn't support it for Apache Spark pools.
Azure Synapse: Serverless and Dedicated SQL Pools
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Contoso
- Contoso SKU selection: Compute Optimized Gen2
- DR uplift options: N/A, Synapse resiliency is part of its SaaS offering.
- Notes
- Azure Synapse Analytics automatically takes snapshots throughout the day to create restore points that are available for seven days.
- Azure Synapse Analytics performs a standard geo-backup once per day to a paired datacenter. The recovery point objective (RPO) for a geo-restore is 24 hours.
- If Self-Hosted Data Pipelines are used, they'll remain the customer's responsibility for recovery from a disaster.
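The seven-day restore point retention and the 24-hour geo-restore RPO noted above lend themselves to simple date arithmetic. The following is a minimal sketch. The function names and the sample timestamps are illustrative assumptions, and only the two durations come from the notes.

```python
from datetime import datetime, timedelta, timezone

RESTORE_POINT_RETENTION = timedelta(days=7)   # restore points are available for seven days
GEO_RESTORE_RPO = timedelta(hours=24)         # RPO for a geo-restore


def restore_point_available(taken_at: datetime, now: datetime) -> bool:
    """True if an automatic restore point is still inside the retention window."""
    return timedelta(0) <= now - taken_at <= RESTORE_POINT_RETENTION


def oldest_expected_geo_backup(outage_at: datetime) -> datetime:
    """Worst case: the geo-backup used for recovery is a full RPO older than the outage."""
    return outage_at - GEO_RESTORE_RPO


outage = datetime(2024, 10, 15, 9, 0, tzinfo=timezone.utc)  # illustrative timestamp

# Data written after this point may need to be replayed after a geo-restore.
print(oldest_expected_geo_backup(outage))
print(restore_point_available(outage - timedelta(days=3), outage))
print(restore_point_available(outage - timedelta(days=8), outage))
```

The worst-case timestamp gives the point from which ingestion would need to be replayed after a geo-restore.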
Azure AI services (formerly Cognitive Services)
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Pay As You Go
- DR uplift options: N/A, the APIs for AI services are hosted by Microsoft-managed data centers.
- Notes
- If AI services have been deployed via customer-deployed Docker containers, recovery remains the responsibility of the customer.
Azure AI Search (formerly Cognitive Search)
- Component recovery responsibility: Microsoft
- Workload/configuration recovery responsibility: Microsoft
- Contoso SKU selection: Standard S1
- DR uplift options:
- AI Search can be raised to an HA design by using replicas across availability zones and regions.
- Multiple services in separate regions can extend the resiliency further.
- Notes
- In AI Search, business continuity (and disaster recovery) is achieved through multiple AI Search services.
- There's no built-in mechanism for disaster recovery. If continuous service is required during a catastrophic failure, the recommendation is to have a second service in a different region and to implement a geo-replication strategy to ensure indexes are fully redundant across all services.
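The component breakdown above follows a fixed shape per service, so it can be captured as a small inventory and queried. This is a minimal sketch: the `Component` type, the `INVENTORY` list, and the `customer_owned` helper are hypothetical names, and only a handful of rows are copied from the breakdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    stateful: bool
    component_recovery: str   # who recovers the service itself
    workload_recovery: str    # who recovers the workload/configuration
    sku: str


# A few rows copied from the breakdown above; extend with the remaining services.
INVENTORY = [
    Component("Azure Key Vault", True, "Microsoft", "Microsoft", "N/A"),
    Component("Azure Monitor", False, "Microsoft", "Microsoft", "N/A"),
    Component("Azure Firewall", False, "Contoso", "Contoso", "Standard"),
    Component("VPN Gateway", False, "Contoso", "Contoso", "Single Zone - VpnGw1"),
    Component("Storage Account: Azure Data Lake Gen2", True, "Microsoft", "Contoso", "LRS"),
    Component("Azure Cosmos DB", True, "Microsoft", "Microsoft",
              "Single Region Write with Periodic backup"),
]


def customer_owned(inventory, customer="Contoso"):
    """Return components where the customer owns any part of recovery."""
    return [
        c for c in inventory
        if customer in c.component_recovery or customer in c.workload_recovery
    ]


for c in customer_owned(INVENTORY):
    kind = "stateful" if c.stateful else "stateless"
    print(f"{c.name} ({kind}): component={c.component_recovery}, workload={c.workload_recovery}")
```

Filtering on the responsibility fields gives a quick list of the services that a DR runbook owned by Contoso has to cover.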
Stateful vs stateless components
The speed of innovation across the Microsoft product suite and Azure, in particular, means the component set that we've used for this worked example will quickly evolve. To future-proof against providing stale guidance and extend this guidance to components not explicitly covered in this document, the section below provides some instruction based upon the coarse-grain classification of state.
A component/service can be described as stateful if it's designed to remember preceding events or user interactions. Stateless means there's no record of previous interactions, and each interaction request has to be handled based entirely on information that comes with it.
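The distinction can be illustrated with a few lines of Python. This is a generic sketch, not tied to any Azure service, and the function and class names are invented for the example.

```python
def to_celsius(fahrenheit: float) -> float:
    """Stateless: the result depends only on the information in the request."""
    return (fahrenheit - 32) * 5 / 9


class RunningAverage:
    """Stateful: each call depends on what earlier calls left behind."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def add(self, value: float) -> float:
        self.count += 1
        self.total += value
        return self.total / self.count


# A fresh copy of the stateless function behaves identically after redeployment.
print(to_celsius(212.0))

# The stateful object remembers the first value when it handles the second.
before = RunningAverage()
before.add(10.0)
print(before.add(20.0))

# A redeployed instance has lost that history, so the same request gives a different answer.
after_redeploy = RunningAverage()
print(after_redeploy.add(20.0))
```

The redeployed `RunningAverage` returns a different result for the same input, which is why stateful components need their data restored as part of recovery and stateless ones don't.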
For a DR scenario that calls for redeployment:
- Components/services that are "stateless", like Azure Functions and Azure Data Factory pipelines, can be redeployed from source control with at least a smoke test to validate availability before being introduced into the broader system.
- Components/services that are "stateful", like Azure SQL Database and storage accounts, require more attention.
- When procuring the component, a key decision will be selecting the data redundancy feature. This decision typically focuses on a trade-off between availability and durability with operating costs.
- Datastores will also need a data backup strategy. The data redundancy functionality of the underlying storage mitigates this risk for some designs, while others, like SQL databases will need a separate backup process.
- If necessary, the component can be redeployed from source control, with its configuration validated via a smoke test.
- A redeployed datastore must have its dataset rehydrated. Rehydration can be accomplished through data redundancy (when available) or a backup dataset. When rehydration has been completed, it must be validated for accuracy and completeness.
- Depending on the nature of the backup process, the backup datasets may require validation before being applied. Backup process corruption or errors may result in an earlier backup being used in place of the latest version available.
- Any delta between the component date/timestamp and the current date should be addressed by reexecuting or replaying the data ingestion processes from that point forward.
- Once the component's dataset is up to date, it can be introduced into the broader system.
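The redeployment steps above can be sketched as an ordered checklist plus a calculation of the replay window. This is a minimal illustration under stated assumptions: the step names, `recovery_steps`, and `replay_window` are hypothetical, and a real runbook would call deployment pipelines and validation jobs rather than return strings.

```python
from datetime import datetime, timezone


def recovery_steps(stateful: bool) -> list[str]:
    """Return the ordered redeploy-on-disaster steps for a component."""
    steps = ["redeploy from source control", "smoke test"]
    if stateful:
        # Stateful components also need their data restored and brought up to date.
        steps += [
            "validate backup dataset",
            "rehydrate dataset",
            "validate accuracy and completeness",
            "replay ingestion for the delta",
        ]
    steps.append("introduce into the broader system")
    return steps


def replay_window(dataset_timestamp: datetime, now: datetime) -> tuple[datetime, datetime]:
    """Return the period whose ingestion must be re-executed or replayed."""
    if dataset_timestamp > now:
        raise ValueError("dataset timestamp is in the future")
    return dataset_timestamp, now


print(recovery_steps(stateful=False))
print(recovery_steps(stateful=True))

# Illustrative timestamps: the restored dataset is current as of `restored`.
restored = datetime(2024, 10, 14, 9, 0, tzinfo=timezone.utc)
now = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
start, end = replay_window(restored, now)
print(end - start)
```

Keeping the stateless path as a strict subset of the stateful path mirrors the guidance: both are redeployed and smoke-tested, and only stateful components add rehydration, validation, and replay.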
Other key services
This section contains high availability (HA) and DR guidance for other key Azure data components and services.
- Azure Databricks - DR guidance can be found in the product documentation.
- Azure Analysis Services - HA guidance can be found in the product documentation.
- Azure Database for MySQL
- Flexible Server HA guidance can be found in the product documentation.
- Single Server HA guidance can be found in the product documentation.
- SQL
- SQL on Azure VMs guidance can be found in the product documentation.
- Azure SQL and Azure SQL Managed Instance guidance can be found in the product documentation.
Next steps
Now that you've learned about the scenario's architecture, you can learn about the scenario details.