Theoretical Foundation: Configure Azure Site Recovery for Azure Resources
1. Initial Intuition
In previous topics, you learned to protect data with Azure Backup: creating vaults, policies, and performing backups and restores. Backup solves the problem of data loss (someone deleted a file, a disk corrupted, a database was accidentally altered).
Azure Site Recovery (ASR) solves a different and more serious problem: what if the entire Azure region becomes unavailable? An earthquake, a catastrophic datacenter failure, a prolonged power outage. In this scenario, having backups in the same region doesn't help, as the vault would also be inaccessible.
The analogy: Azure Backup is like a safe inside your office where you keep copies of documents. Azure Site Recovery is like having a complete and operational office in another city, ready to function immediately if your main headquarters is destroyed. It's not just about data; it's about entire running infrastructure.
ASR continuously replicates your VMs to a secondary region. If the primary region fails, you execute a failover and your VMs come up in the secondary region within minutes. When the primary region recovers, you execute a failback to return to the original state.
2. Context
ASR is the BCDR (Business Continuity and Disaster Recovery) component of Azure. While Backup focuses on RPO and data retention, ASR focuses on:
- RTO (Recovery Time Objective): how long for infrastructure to be operational after a disaster
- RPO (Recovery Point Objective): how much data loss is acceptable (in ASR, RPO for Azure VMs is approximately 60 seconds)
ASR is configured in a Recovery Services vault (not in a Backup vault). This is important: the same vault type serves both Azure Backup and ASR, but they are separate functionalities within it.
ASR for Azure resources (VM to VM, region to region) is different from ASR for on-premises resources. The focus of this topic is exclusively Azure VM to Azure VM (Azure-to-Azure replication), which is the scenario required in AZ-104.
3. Building the Concepts
3.1 Fundamental Terminology
Before proceeding, you need to master the specific ASR terms.
Source Region: where your VMs are running normally. Example: Brazil South.
Target Region: where VMs will be replicated and where failover will occur. Example: East US 2.
Replication: continuous process of copying disk changes from the source VM to the target region. It is incremental and happens in the background without impacting the VM.
Cache Storage Account: storage account automatically created in the source region. Disk changes are first sent to this cache before being transferred to the target region. Acts as a buffer to ensure no changes are lost.
Recovery Point: point in time captured during replication. There are two types:
- Crash-consistent: captured automatically every 5 minutes. Equivalent to the VM state as if it had been abruptly shut down. Adequate for most workloads.
- App-consistent: captured with configurable frequency (default: every 4 hours). Uses VSS to ensure application consistency (databases, services). Safer for transactional workloads.
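To make the two recovery point types concrete, here is a minimal Python sketch that picks the recovery point to fail over to for a given failure time. `RecoveryPoint`, `pick_recovery_point` and `data_loss_window` are hypothetical names for illustration, not part of any Azure SDK. The timestamps are invented, and the sketch considers only stored recovery points.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool  # False means crash-consistent


def pick_recovery_point(
    points: List[RecoveryPoint],
    failure_time: datetime,
    require_app_consistent: bool = False,
) -> Optional[RecoveryPoint]:
    """Return the newest recovery point taken at or before failure_time."""
    candidates = [
        p for p in points
        if p.taken_at <= failure_time
        and (p.app_consistent or not require_app_consistent)
    ]
    return max(candidates, key=lambda p: p.taken_at, default=None)


def data_loss_window(point: RecoveryPoint, failure_time: datetime) -> timedelta:
    """Changes written after the recovery point are lost on failover."""
    return failure_time - point.taken_at


start = datetime(2024, 1, 1, 8, 0)
# One app-consistent point at 08:00, then crash-consistent points every 5 minutes.
points = [RecoveryPoint(start, app_consistent=True)]
points += [
    RecoveryPoint(start + timedelta(minutes=5 * i), app_consistent=False)
    for i in range(1, 13)
]
failure = start + timedelta(minutes=47)

latest = pick_recovery_point(points, failure)
latest_app = pick_recovery_point(points, failure, require_app_consistent=True)
print(data_loss_window(latest, failure))      # 0:02:00
print(data_loss_window(latest_app, failure))  # 0:47:00
```

The trade-off is visible in the two results: the newest crash-consistent point loses less data, while the app-consistent point is older but safer for transactional workloads.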
Failover: process of activating replicated VMs in the target region. Can be:
- Test Failover: creates VMs in the target region in an isolated network, without affecting replication or production. For DR testing.
- Planned Failover: controlled failover, with no data loss. Used for planned region maintenance.
- Unplanned Failover (Failover): triggered when the primary region fails. May have data loss equivalent to RPO.
Failback: process of returning operations to the source region after a failover, when the primary region recovers.
Reprotect: after a failover, VMs are running in the target region. To enable failback, you must "reprotect" the VMs, reversing the replication direction (target becomes temporary source).
3.2 Recovery Plans
A Recovery Plan is an orchestrated sequence of failover for multiple VMs. Instead of executing failover individually on each VM, you create a plan that:
- Defines the startup order of VMs (databases before application servers)
- Groups VMs that should start simultaneously
- Includes manual actions (e.g., notify team) or automated scripts between steps
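The ordering idea behind a Recovery Plan can be sketched in a few lines of Python. This is an illustrative model only: `Group` and `run_plan` are hypothetical names, the VM names are invented, and actions are represented as plain functions that return a description instead of real scripts.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Group:
    name: str
    vms: List[str]
    pre_actions: List[Callable[[], str]] = field(default_factory=list)
    post_actions: List[Callable[[], str]] = field(default_factory=list)


def run_plan(groups: List[Group]) -> List[str]:
    """Walk the groups in order and return a log of the steps taken."""
    log = []
    for group in groups:
        for action in group.pre_actions:
            log.append(f"pre[{group.name}]: {action()}")
        # VMs in one group start together; sorting only keeps the log deterministic.
        log.append(f"start[{group.name}]: {', '.join(sorted(group.vms))}")
        for action in group.post_actions:
            log.append(f"post[{group.name}]: {action()}")
    return log


plan = [
    Group(
        "1-database",
        ["vm-sql-02", "vm-sql-01"],
        post_actions=[lambda: "script: check that the database accepts connections"],
    ),
    Group(
        "2-app",
        ["vm-app-01"],
        pre_actions=[lambda: "manual: notify team"],
    ),
]

for line in run_plan(plan):
    print(line)
```

Databases land in the first group so that application servers in the second group only start once their dependency is up.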
3.3 Resources automatically created in target region
When you enable replication for a VM, ASR automatically creates (or allows you to configure) the following resources in the target region:
| Resource | Default Behavior | Configurable |
|---|---|---|
| Resource Group | Creates new with "-asr" suffix | Yes |
| Virtual Network | Creates new mapped from source | Yes (Network Mapping) |
| Subnet | Replicates subnet structure | Yes |
| Storage Account (cache) | Creates in source for cache | Partially |
| Managed Disks | Creates replicated disks in target | Yes (disk type) |
| VM (replica) | Created only at failover | Yes (size, configurations) |
| Availability Set / Zones | Configures in target | Yes |
3.4 Network Mapping
Network Mapping is the configuration that defines how virtual networks from the source region map to networks in the target region. This ensures that after failover, VMs in the target region connect to the correct networks.
Without Network Mapping, failover VMs are connected to a generic default network. With Network Mapping, you ensure correct connectivity with other resources, VPNs, and ExpressRoutes in the target region.
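Conceptually, Network Mapping is a lookup table from source networks to target networks. The Python sketch below models it that way and reports VMs whose network has no mapping. All names are hypothetical and shortened for readability; real mappings reference full ARM resource IDs.

```python
from typing import Dict, Optional

# Hypothetical, shortened network names (assumption: one target VNet per source VNet).
NETWORK_MAPPING: Dict[str, str] = {
    "vnet-prod-brazilsouth": "vnet-prod-eastus2",
    "vnet-data-brazilsouth": "vnet-data-eastus2",
}


def target_network(source_vnet: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the mapped target VNet, or None when no mapping exists."""
    return mapping.get(source_vnet)


def unmapped_vms(vm_networks: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """VMs whose source VNet has no mapping and need attention before a failover."""
    return {vm: vnet for vm, vnet in vm_networks.items() if vnet not in mapping}


vm_networks = {
    "vm-app-01": "vnet-prod-brazilsouth",
    "vm-sql-01": "vnet-data-brazilsouth",
    "vm-legacy-01": "vnet-legacy-brazilsouth",
}

print(target_network("vnet-prod-brazilsouth", NETWORK_MAPPING))
print(unmapped_vms(vm_networks, NETWORK_MAPPING))
```

Running a check like this before a disaster surfaces the VMs that would otherwise need manual network reconfiguration during failover.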
4. Structural View
5. Practical Operation
Complete ASR lifecycle
Detailed steps to enable replication
1. Prerequisite: create the Recovery Services Vault in the target region
This is a critical and often confused point: the vault for ASR cannot be in the source region, and the practical rule is to create it in the TARGET region. The logic is that if the source region fails, a vault in the source would also be unavailable, while a vault in the target region remains accessible to orchestrate failover.
```bash
# The vault MUST be in the target region
az recovery-services vault create \
  --resource-group rg-asr-eastus2 \
  --name rsv-asr-eastus2 \
  --location eastus2
```
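The placement rule can also be checked in a script before anything is deployed. The following Python sketch is a hypothetical pre-flight check, not an Azure API: `check_vault_placement` and its messages are invented for illustration.

```python
def normalize(region: str) -> str:
    """Make 'Brazil South' and 'brazilsouth' compare equal."""
    return region.replace(" ", "").lower()


def check_vault_placement(vault_region: str, source_region: str, target_region: str) -> str:
    """Classify a vault location relative to the replication pair."""
    vault, source, target = map(normalize, (vault_region, source_region, target_region))
    if vault == source:
        return "invalid: vault would be lost together with the source region"
    if vault == target:
        return "ok: vault is in the target region"
    return "review: vault is outside the source region but not in the target region"


print(check_vault_placement("eastus2", "Brazil South", "East US 2"))
print(check_vault_placement("brazilsouth", "Brazil South", "East US 2"))
```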
2. Enable replication for a VM
In the portal:
- Access the Recovery Services Vault (in the target region)
- Click "Site Recovery" > "Enable replication"
- Configure:
  - Source: Azure, source region, Resource Group and VM
  - Target: region, Resource Group, VNet, subnet, disk type
  - Replication settings: replication policy
- Confirm and wait for initial synchronization
Initial synchronization can take hours depending on disk size. During this period, status is "Enabling replication" and then "Synchronizing".
3. Verify replication health
After initial synchronization, status changes to Protected. Monitor:
- RPO: time since last recovery point. Should be close to 0-60 seconds
- Replication health: Critical, Warning, or Healthy
- Last recovery point: the most recent available recovery point
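As a rough illustration of how the RPO reading relates to the last recovery point, the Python sketch below derives the current RPO and classifies it. The thresholds are assumptions chosen for the example, not the values ASR uses internally, and `current_rpo` and `rpo_status` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def current_rpo(last_recovery_point: datetime, now: datetime) -> timedelta:
    """RPO as described above: time elapsed since the last recovery point."""
    return now - last_recovery_point


def rpo_status(
    rpo: timedelta,
    warning_after: timedelta = timedelta(minutes=15),   # assumed threshold
    critical_after: timedelta = timedelta(minutes=60),  # assumed threshold
) -> str:
    """Map an RPO value to the three health labels shown in the portal."""
    if rpo >= critical_after:
        return "Critical"
    if rpo >= warning_after:
        return "Warning"
    return "Healthy"


now = datetime(2024, 1, 1, 12, 0, 0)
for last_point in (
    now - timedelta(seconds=45),
    now - timedelta(minutes=20),
    now - timedelta(hours=2),
):
    rpo = current_rpo(last_point, now)
    print(rpo, rpo_status(rpo))
```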
Test Failover: how and when to execute
Test Failover creates VMs in the target region in an isolated network specified by you, without interrupting replication and without affecting production. It's the most important DR operation to validate that ASR is configured correctly.
Important Test Failover behaviors:
- VMs created in test failover are not the production replica; they are temporary VMs created specifically for testing
- Replication continues normally during test failover
- You must clean up the test failover after validation, which removes test VMs and frees temporary resources
- If you don't clean up, test VMs remain consuming cost
Failover: sequence of events
The sequence is: Failover, validation of the VMs in the target region, Commit, Reprotect and, once the primary region recovers, Failback.
Commit is a critical step: after validating that the VMs in the target region are working, you confirm the failover with Commit. After Commit you can no longer switch to a different recovery point, and the environment is ready for the Reprotect/Failback process.
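The order of operations can be modeled as a small state machine, which makes it clear why Reprotect is not available before Commit. This is a simplified Python sketch. The state names are hypothetical labels, not the status strings shown by Azure, and failback is modeled as a failover in the reverse direction that is committed and reprotected again.

```python
TRANSITIONS = {
    ("Protected", "failover"): "FailedOver",
    ("FailedOver", "commit"): "Committed",
    ("Committed", "reprotect"): "ProtectedReverse",
    ("ProtectedReverse", "failback"): "FailedBack",
    ("FailedBack", "commit"): "FailedBackCommitted",
    ("FailedBackCommitted", "reprotect"): "Protected",
}


def apply(state: str, operation: str) -> str:
    """Return the next state, or raise if the operation is out of order."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"'{operation}' is not allowed in state '{state}'") from None


def run(operations, state: str = "Protected") -> str:
    """Apply a list of operations in order and return the final state."""
    for operation in operations:
        state = apply(state, operation)
    return state


full_cycle = ["failover", "commit", "reprotect", "failback", "commit", "reprotect"]
print(run(full_cycle))

try:
    run(["failover", "reprotect"])  # Commit was skipped
except ValueError as error:
    print(error)
```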
6. Implementation Methods
6.1 Azure Portal
When to use: initial setup, failover/failback operations in emergency situations where familiarity with the portal is essential.
Enabling replication in the portal:
- Access the vault in the target region
- In Site Recovery, click "Enable replication"
- Fill Source: "Azure", source region, Resource Group, VM
- Fill Target: target region, Resource Group, VNet, subnet, storage
- Configure Replication Policy
- Review and enable
Limitation: not scalable for many VMs; use PowerShell or CLI to enable batch replication.
6.2 Azure PowerShell
When to use: enable replication on multiple VMs, automation, infrastructure pipelines.
```powershell
# Set vault context (the vault lives in the target region)
$vault = Get-AzRecoveryServicesVault `
    -ResourceGroupName "rg-asr-eastus2" `
    -Name "rsv-asr-eastus2"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault

# Get the source region fabric. It must already exist in the vault: the portal
# creates it on first enablement (typically named "asr-a2a-default-<region>"),
# or you create it yourself with New-AzRecoveryServicesAsrFabric.
$primaryFabric = Get-AzRecoveryServicesAsrFabric `
    -Name "asr-a2a-default-brazilsouth"

# Get source Protection Container
$primaryContainer = Get-AzRecoveryServicesAsrProtectionContainer `
    -Fabric $primaryFabric

# Get replication policy
$replicationPolicy = Get-AzRecoveryServicesAsrPolicy `
    -Name "24-hour-retention-policy"

# Get target fabric and container
$recoveryFabric = Get-AzRecoveryServicesAsrFabric `
    -Name "asr-a2a-default-eastus2"
$recoveryContainer = Get-AzRecoveryServicesAsrProtectionContainer `
    -Fabric $recoveryFabric

# Get the existing container mapping (it links the source container,
# the target container and the replication policy)
$containerMapping = Get-AzRecoveryServicesAsrProtectionContainerMapping `
    -ProtectionContainer $primaryContainer `
    -Name "mapping-brazilsouth-to-eastus2"

# Get VM to replicate
$vm = Get-AzVM `
    -ResourceGroupName "rg-app-prod" `
    -Name "vm-producao-01"

# Configure disk details for replication (OS disk only in this example)
$diskConfig = New-AzRecoveryServicesAsrAzureToAzureDiskReplicationConfig `
    -ManagedDisk `
    -LogStorageAccountId "/subscriptions/.../storageAccounts/cacheaccount" `
    -DiskId $vm.StorageProfile.OsDisk.ManagedDisk.Id `
    -RecoveryResourceGroupId "/subscriptions/.../resourceGroups/rg-asr-eastus2" `
    -RecoveryReplicaDiskAccountType "Premium_LRS" `
    -RecoveryTargetDiskAccountType "Premium_LRS"

# Enable replication
New-AzRecoveryServicesAsrReplicationProtectedItem `
    -AzureToAzure `
    -AzureVmId $vm.Id `
    -Name "replication-vm-producao-01" `
    -ProtectionContainerMapping $containerMapping `
    -AzureToAzureDiskReplicationConfiguration $diskConfig `
    -RecoveryResourceGroupId "/subscriptions/.../resourceGroups/rg-asr-eastus2" `
    -RecoveryAzureNetworkId "/subscriptions/.../virtualNetworks/vnet-destino"
```
6.3 Azure CLI
When to use: automation scripts, simple pipelines, status checks.
```bash
# List replicated items (the fabric and the container have different names)
az site-recovery replication-protected-item list \
  --resource-group rg-asr-eastus2 \
  --vault-name rsv-asr-eastus2 \
  --fabric-name asr-a2a-default-brazilsouth \
  --protection-container-name asr-a2a-default-brazilsouth-container \
  --output table

# Check replication health of an item
az site-recovery replication-protected-item show \
  --resource-group rg-asr-eastus2 \
  --vault-name rsv-asr-eastus2 \
  --fabric-name asr-a2a-default-brazilsouth \
  --protection-container-name asr-a2a-default-brazilsouth-container \
  --replicated-protected-item-name replication-vm-producao-01
```
ASR CLI is less complete than PowerShell for complex operations like enabling replication with detailed disk configurations. For failover and monitoring operations, CLI is sufficient.
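The fabric, container and item arguments used by the CLI mirror the hierarchy of the underlying ARM resource: vault, then fabric, then protection container, then protected item. The Python sketch below composes such a resource ID to make that nesting visible. The subscription ID and the fabric and container names are placeholders, so list the real names in your own vault before relying on them.

```python
def protected_item_id(
    subscription_id: str,
    resource_group: str,
    vault: str,
    fabric: str,
    container: str,
    item: str,
) -> str:
    """Compose the ARM resource ID of a replicated item from its parents."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.RecoveryServices/vaults/{vault}"
        f"/replicationFabrics/{fabric}"
        f"/replicationProtectionContainers/{container}"
        f"/replicationProtectedItems/{item}"
    )


item_id = protected_item_id(
    subscription_id="00000000-0000-0000-0000-000000000000",  # placeholder
    resource_group="rg-asr-eastus2",
    vault="rsv-asr-eastus2",
    fabric="asr-a2a-default-brazilsouth",               # placeholder name
    container="asr-a2a-default-brazilsouth-container",  # placeholder name
    item="replication-vm-producao-01",
)
print(item_id)
```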
6.4 ARM Template / Terraform
When to use: IaC for complete DR infrastructure, environments where all configuration needs to be versioned.
ASR configuration via ARM/Terraform is complex because it involves multiple interdependent resources (fabrics, containers, mappings, protected items). For AZ-104, knowledge of portal and PowerShell is sufficient. In real production environments, the Terraform azurerm provider has resources for ASR, such as azurerm_site_recovery_replication_policy and azurerm_site_recovery_replicated_vm.
7. Control and Security
RBAC for ASR
| Role | Capabilities |
|---|---|
| Site Recovery Contributor | Manage ASR completely, except creating vaults |
| Site Recovery Operator | Execute failover and failback; cannot modify configurations |
| Site Recovery Reader | Read-only access to replication status |
Network considerations for replication
ASR needs outbound connectivity from the source VM to ASR and Azure Storage endpoints. In environments with restrictive Network Security Groups (NSGs) or User Defined Routes (UDRs) forcing traffic through firewall, you need to:
- Allow outbound traffic to the Service Tags AzureSiteRecovery and Storage in NSGs
- Or configure Private Endpoints for the vault, eliminating public internet traffic
- Verify that proxies or firewalls are not blocking necessary endpoints
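A quick way to reason about the NSG requirement is to check a rule list for the two Service Tags, AzureSiteRecovery and Storage. The Python sketch below does that on a simplified rule model. The dictionary layout is hypothetical, rule priorities and ports are ignored, and `REQUIRED_TAGS` covers only the two tags named in the text, so a real deployment may need further tags.

```python
REQUIRED_TAGS = {"AzureSiteRecovery", "Storage"}  # assumption: only the tags named above


def missing_service_tags(outbound_rules, required=REQUIRED_TAGS):
    """Return the required Service Tags that no outbound Allow rule covers."""
    allowed = {
        rule["destination"]
        for rule in outbound_rules
        if rule["direction"] == "Outbound" and rule["access"] == "Allow"
    }
    return set(required) - allowed


rules = [
    {"name": "allow-asr", "direction": "Outbound", "access": "Allow", "destination": "AzureSiteRecovery"},
    {"name": "deny-internet", "direction": "Outbound", "access": "Deny", "destination": "Internet"},
]

print(sorted(missing_service_tags(rules)))  # ['Storage']
```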
Replication Policy: security settings
The Replication Policy defines:
- Recovery Point Retention: how long recovery points are kept (default: 24 hours, maximum: 15 days)
- App-consistent snapshot frequency: frequency of application-consistent snapshots (default: 4 hours)
- Multi-VM consistency: allows VMs in a group to be failed over together with the same recovery point (disabled by default; enabling causes slight performance impact)
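The policy settings above can be captured as a small data structure with the documented defaults and the 15-day retention ceiling. This is a hypothetical Python model for reasoning about the values. It is not the object used by the Azure SDK or PowerShell module.

```python
from dataclasses import dataclass
from typing import List

MAX_RETENTION_HOURS = 15 * 24  # 15 days, the maximum stated above


@dataclass(frozen=True)
class ReplicationPolicy:
    name: str
    retention_hours: int = 24                # default: 24 hours
    app_consistent_frequency_hours: int = 4  # default: every 4 hours
    multi_vm_consistency: bool = False       # disabled by default

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the policy is acceptable."""
        problems = []
        if not 0 < self.retention_hours <= MAX_RETENTION_HOURS:
            problems.append(
                f"retention must be between 1 and {MAX_RETENTION_HOURS} hours (15 days)"
            )
        return problems


print(ReplicationPolicy("default-policy").validate())
print(ReplicationPolicy("too-long", retention_hours=30 * 24).validate())
```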
8. Decision Making
ASR vs Backup: when to use each
| Situation | Solution | Reason |
|---|---|---|
| Accidentally deleted file | Azure Backup | ASR doesn't protect individual data |
| Database corrupted by wrong query | Azure Backup | Need to restore to previous point |
| Entire primary region unavailable | Azure Site Recovery | Backup in same region would also be unavailable |
| RTO < 30 minutes for critical VM | Azure Site Recovery | Backup has RTO of hours; ASR has RTO of minutes |
| RPO < 1 hour | Azure Site Recovery | Daily backup has 24h RPO; ASR has ~60s RPO |
| Planned regional maintenance | Azure Site Recovery | Planned Failover with no data loss |
| 7-year retention compliance | Azure Backup | ASR keeps recovery points for at most 15 days |
Attention: ASR and Backup are not mutually exclusive. For critical workloads, use both: ASR to ensure operational continuity in regional disasters and Backup for protection against data loss or corruption.
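The decision table can be condensed into a short function. The Python sketch below encodes the thresholds from the table (RTO under 30 minutes, RPO under 1 hour, retention beyond 15 days). `recommend` is a hypothetical helper meant as a study aid, not a sizing tool.

```python
def recommend(
    region_outage: bool = False,
    needs_point_in_time_restore: bool = False,
    rto_minutes=None,
    rpo_minutes=None,
    retention_days: int = 0,
) -> set:
    """Return the set of services the decision table points to."""
    services = set()
    # Deleted files, corrupted data and long retention are Backup scenarios.
    if needs_point_in_time_restore or retention_days > 15:
        services.add("Azure Backup")
    # Regional failure and tight RTO/RPO targets are Site Recovery scenarios.
    if (
        region_outage
        or (rto_minutes is not None and rto_minutes < 30)
        or (rpo_minutes is not None and rpo_minutes < 60)
    ):
        services.add("Azure Site Recovery")
    return services


print(sorted(recommend(region_outage=True)))
print(sorted(recommend(retention_days=7 * 365)))
print(sorted(recommend(rto_minutes=15, retention_days=365)))
```

The last call returns both services, which matches the point that critical workloads usually need ASR and Backup together.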
Where to create the vault for ASR
| Scenario | Vault location | Reason |
|---|---|---|
| ASR for Azure VMs (A2A) | DESTINATION region | Vault accessible even if source fails |
| Azure VM Backup | Same region as VM | Vault close to protected data |
9. Best Practices
Vault in destination region, always: the ASR vault must be in the destination region. This is the most important and most misunderstood rule of ASR.
Execute Test Failover regularly: at least once per quarter for each critical item. An untested failover is a failover of unknown reliability. Document the actual RTO measured in tests.
Separate critical VMs in Recovery Plans: don't execute failover manually VM by VM in a crisis situation. Recovery Plans ensure boot order and reduce human errors under pressure.
Configure Network Mapping: without this, failover VMs may not connect to the correct networks, requiring manual reconfiguration during a disaster, increasing RTO.
Monitor RPO continuously: an RPO that gradually grows indicates replication problems. Configure alerts when RPO exceeds 60 minutes, for example.
Consider Multi-VM Consistency only when necessary: enabling Multi-VM Consistency adds replication overhead. Use only for VMs that genuinely need consistency between them (e.g., database cluster).
Properly size the cache storage account: the cache storage account needs sufficient capacity to absorb write spikes from the source VM. Use Standard_LRS as cache account type.
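For the RPO monitoring practice above, a gradually growing RPO can be detected with a simple trend check over recent readings. The Python sketch below is one possible heuristic. The sample values are invented, the window size is an assumption, and `rpo_is_growing` is a hypothetical helper.

```python
from typing import Sequence


def rpo_is_growing(samples_seconds: Sequence[float], min_samples: int = 4) -> bool:
    """True when the last min_samples readings are strictly increasing."""
    recent = list(samples_seconds)[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough data to call it a trend
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))


steady = [40, 55, 38, 60]
degrading = [30, 45, 40, 60, 90, 150, 400]

print(rpo_is_growing(steady))     # False
print(rpo_is_growing(degrading))  # True
```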
10. Common Errors
Error: creating the vault in the source region
- Why it happens: the operator creates the vault where their VMs are, by analogy with Azure Backup.
- How to avoid: memorize the rule: ASR vault in the DESTINATION region. Always.
Error: never executing Test Failover
- Why it happens: Test Failover seems optional and creates extra work (cleanup after test).
- How to avoid: include Test Failover in the maintenance calendar. Without testing, you have no guarantee that failover will work when you need it most.
Error: forgetting to Commit after successful failover
- Why it happens: the operator does failover, validates VMs, and forgets to commit.
- How to avoid: include Commit as a mandatory step in the DR runbook. Without Commit, the failover isn't finalized and Reprotect can't be started.
Error: trying to use ASR for long-term backup
- Why it happens: confusing disaster recovery with data backup.
- How to avoid: remember that ASR only maintains 15 days of recovery points. For long retention, use Azure Backup. Use both together for complete protection.
Error: not configuring Network Mapping
- Why it happens: it's an additional configuration that seems optional in the enablement flow.
- How to avoid: configure Network Mapping immediately after creating the vault. Without it, failover may create VMs without adequate network connectivity.
Error: enabling Multi-VM Consistency for all VMs by default
- Why it happens: the operator thinks more consistency is always better.
- How to avoid: Multi-VM Consistency should only be enabled for VMs in the same cluster or that share real-time data dependency. For independent VMs, the overhead isn't justified.
11. Operation and Maintenance
Daily monitoring
In the portal, access the vault and go to Site Recovery. Check:
- Replicated Items: status of each replicated VM (Healthy, Warning, Critical)
- Recovery Plans: DR plan integrity
- Jobs: failures in the last 24h
- RPO: time since last recovery point for each critical VM
Replication health states
| State | Meaning | Required action |
|---|---|---|
| Healthy | Replication working; RPO within expected | None |
| Warning | Minor issue; slightly elevated RPO or isolated event | Investigate, but not critical |
| Critical | Replication interrupted or very high RPO | Immediate action required |
| Synchronizing | Initial sync or re-sync in progress | Wait for completion |
Important ASR limits for Azure VMs
| Limit | Value |
|---|---|
| Protected VMs per vault | 5000 |
| Disks per replicated VM | Up to 100 disks |
| Maximum replicated disk size | 32 TB |
| Recovery point retention | Up to 15 days |
| Minimum guaranteed RPO | ~60 seconds |
| Maximum replication throughput per VM | No fixed limit; limited by network and disk |
12. Integration and Automation
Integration with Azure Automation for automated DR
The most advanced pattern is to integrate Recovery Plans with Azure Automation Runbooks to create a fully automated failover process: the plan invokes runbooks as scripted actions between its steps.
Integration with Azure Traffic Manager / Front Door
After a failover, VMs are in the destination region, but DNS and load balancing may still be pointing to the source region. Integrate ASR with:
- Azure Traffic Manager: configure with priority or automatic failover to redirect traffic to the destination region after a failover
- Azure Front Door: offers automatic global failover based on health probes
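Priority-based routing, as used for redirecting traffic after a failover, boils down to picking the healthy endpoint with the lowest priority number. The Python sketch below illustrates that selection rule only. `pick_endpoint` is a hypothetical function, and real health probing is replaced by a boolean flag.

```python
from typing import List, Optional, Tuple

Endpoint = Tuple[str, int, bool]  # (name, priority, healthy)


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[str]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [endpoint for endpoint in endpoints if endpoint[2]]
    if not healthy:
        return None
    return min(healthy, key=lambda endpoint: endpoint[1])[0]


normal = [("app-brazilsouth", 1, True), ("app-eastus2", 2, True)]
after_disaster = [("app-brazilsouth", 1, False), ("app-eastus2", 2, True)]

print(pick_endpoint(normal))          # app-brazilsouth
print(pick_endpoint(after_disaster))  # app-eastus2
```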
Executing failover via PowerShell in Recovery Plan
```powershell
# Get the Recovery Plan (assumes the vault context was already set
# with Set-AzRecoveryServicesAsrVaultContext)
$rp = Get-AzRecoveryServicesAsrRecoveryPlan `
    -Name "rp-producao-completo"

# Execute Test Failover into an isolated network
$job = Start-AzRecoveryServicesAsrTestFailoverJob `
    -RecoveryPlan $rp `
    -Direction PrimaryToRecovery `
    -AzureVMNetworkId "/subscriptions/.../virtualNetworks/vnet-teste-isolado"

# Wait for completion by polling the job until it leaves the running states
while ($job.State -in @("NotStarted", "InProgress")) {
    Start-Sleep -Seconds 30
    $job = Get-AzRecoveryServicesAsrJob -Job $job
}

# Clean up Test Failover
Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
    -RecoveryPlan $rp `
    -Comment "Quarterly DR test completed successfully"
```
13. Final Summary
What it is: Azure Site Recovery is Azure's disaster recovery service that continuously replicates VMs from a source region to a destination region, enabling failover in minutes with RPO of approximately 60 seconds.
Essential points:
- The ASR vault must be in the destination region, not the source
- The RPO of ASR for Azure VMs is approximately 60 seconds (crash-consistent every 5 min, app-consistent configurable)
- There are three types of failover: Test (non-destructive, isolated network), Planned (no data loss) and Unplanned (emergency)
- After failover, it's mandatory to Commit before starting Reprotect for failback
- Recovery Plans orchestrate multi-VM failover with boot order and automated actions
- ASR and Azure Backup are complementary, not alternatives; use both for critical workloads
Critical differences:
| Aspect | Azure Backup | Azure Site Recovery |
|---|---|---|
| Objective | Data protection | Operational continuity |
| RPO | Hours to days | ~60 seconds |
| RTO | Hours | Minutes |
| Retention | Up to 99 years | Up to 15 days |
| Vault location | Same region as VM | DESTINATION region |
| Test capability | Restore in sandbox | Test Failover (no impact) |
| Protection scope | Files, VMs, databases | Entire VMs |
What needs to be remembered for AZ-104:
- ASR vault always in the destination region
- Test Failover doesn't affect production and doesn't interrupt replication
- Commit is mandatory after failover to enable Reprotect
- Recovery Plans define boot order and group VMs
- Multi-VM Consistency adds overhead; use only when necessary
- Network Mapping ensures correct VM connectivity after failover
- ASR doesn't replace Backup: use both together for complete protection