Theoretical Foundation: Perform a Failover to a Secondary Region by Using Site Recovery
1. Initial Intuition
In the previous topic, you learned to configure Azure Site Recovery: create the vault in the target region, enable replication, define policies, and structure Recovery Plans. All this configuration exists for a single moment: when you need to trigger the failover.
Failover is the act of activating your infrastructure in the target region. Think of the following scenario: you have an emergency generator installed in the building. Configuring ASR is like installing and testing the generator. Executing the failover is like turning on the generator when the main power goes out.
But there's an important subtlety: there are different situations that require different types of failover. Turning on the generator for a scheduled test is different from activating it during an emergency blackout. In ASR, this distinction translates into three types of failover, each with distinct behavior, impact, and prerequisites.
In practice, failover serves to:
- Recover operations when the primary region is unavailable (emergency)
- Migrate workloads in a planned manner to another region (maintenance)
- Validate that the DR process works as expected (testing)
2. Context
Failover is the most critical operation in the ASR lifecycle. All the configuration work, continuous replication, and monitoring converges to ensure that, at this moment, everything works correctly.
Failover is not an isolated operation. It's part of a complete cycle that includes the previous state (active replication) and the subsequent state (Commit, Reprotect, and Failback). Understanding the complete cycle is what differentiates an operator who knows how to execute a failover from one who knows how to manage the end-to-end DR process.
3. Building the Concepts
3.1 The three types of failover in detail
Test Failover
Test Failover creates temporary VMs in the target region connected to an isolated test network that you specify. It's completely non-destructive:
- Replication continues normally during and after the test
- Production VMs at the source are not affected
- Test VMs are temporary and should be cleaned up after validation
- Does not change the replication status of the protected item
Test Failover is the most important operation in the DR process because it's the only way to ensure that the actual failover will work when needed.
Planned Failover
Planned Failover (also called Migrate) is executed when you know in advance that the primary region will need to be evacuated. Characteristics:
- Synchronizes the latest changes from source to target before creating VMs
- Zero RPO: no data loss, as all changes are synchronized before cutover
- Requires source VMs to be shut down at the time of failover
- Used for planned region maintenance, definitive migration, or infrastructure updates
Unplanned Failover (simply "Failover")
The standard Failover is the emergency failover, triggered when the primary region is unavailable or when a quick decision is needed. Characteristics:
- Can be executed even with the source inaccessible
- Uses the selected recovery point (there may be data loss equivalent to RPO)
- The source doesn't need to be accessible nor VMs shut down
- This is the failover you'll execute in a real disaster scenario
3.2 Recovery point selection
In Failover (unplanned), you need to choose which recovery point to use. The options are:
Latest (automatically selected): uses the most recently processed recovery point. Minimizes data loss. It's the default and recommended option for most emergency scenarios.
Latest app-consistent: uses the most recent application-consistent recovery point. May be older than "Latest," but ensures consistency for databases and transactional applications.
Latest crash-consistent: uses the most recent crash-consistent point. Faster to process.
Custom: allows manual selection of a specific point. Useful when you know data was corrupted after a certain time and want to restore to a specific previous state.
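The four options above can be sketched as a small selection routine. This is a minimal bash illustration, not ASR's API: the recovery point list, its "timestamp type" layout, and the function name `select_recovery_point` are all assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical recovery point list: "<UTC timestamp> <consistency type>".
# The layout is an assumption for illustration, not an ASR output format.
RECOVERY_POINTS='2025-01-10T10:00:00Z app
2025-01-10T10:05:00Z crash
2025-01-10T10:10:00Z crash
2025-01-10T11:00:00Z app
2025-01-10T11:05:00Z crash'

# select_recovery_point <strategy> [cutoff]
#   latest        newest point of any type
#   latest-app    newest application-consistent point
#   latest-crash  newest crash-consistent point
#   custom        newest point strictly older than <cutoff>
#                 (for example, the time the corruption began)
select_recovery_point() {
  local strategy=$1 cutoff=${2:-} best="" ts type
  while read -r ts type; do
    [ -z "$ts" ] && continue
    case $strategy in
      latest) ;;
      latest-app)   [ "$type" = app ]   || continue ;;
      latest-crash) [ "$type" = crash ] || continue ;;
      custom)       [[ $ts < $cutoff ]] || continue ;;
      *) echo "unknown strategy: $strategy" >&2; return 1 ;;
    esac
    # Same-format ISO 8601 UTC timestamps sort chronologically as plain strings.
    if [[ -z $best || $ts > $best ]]; then best=$ts; fi
  done <<< "$RECOVERY_POINTS"
  [ -n "$best" ] && echo "$best"
}

select_recovery_point latest
select_recovery_point latest-app
select_recovery_point custom 2025-01-10T10:30:00Z
```

With the sample data, the app-consistent choice is five minutes older than the latest point, which is the trade-off described above: slightly more data loss in exchange for transactional consistency.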
3.3 The Commit process
After failover (planned or unplanned), VMs are running in the target region, but ASR still maintains the association with the original recovery point. Commit finalizes the failover:
- Marks the failover as permanent
- Discards old recovery points associated with the source
- Frees the item to enter the Reprotect process
Without Commit, you can cancel the failover and return to the previous state (if the source is still accessible). After Commit, this is no longer possible.
3.4 Reprotect and Failback
After Commit, VMs are running in the target region but without active protection. To reestablish protection and enable return to the source region, you execute Reprotect.
Reprotect reverses the replication direction: the target region becomes the new source, and the source region (when it recovers) becomes the new target. After Reprotect is active and synchronization is complete, you can execute Failback to return VMs to the original region.
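The cycle described in sections 3.3 and 3.4 can be modeled as a small state machine. The sketch below is a simplified bash model: the state names (`Protected`, `FailedOver`, `Committed`) and the function `next_state` are illustrative labels chosen for this example, not the exact status strings ASR displays.

```bash
#!/usr/bin/env bash
# next_state <state> <action>: prints the resulting state, or fails if the
# action is not allowed from that state.
next_state() {
  case "$1:$2" in
    Protected:failover)  echo FailedOver ;;  # VMs start in the other region
    FailedOver:cancel)   echo Protected ;;   # only possible before Commit
    FailedOver:commit)   echo Committed ;;   # failover becomes permanent
    Committed:reprotect) echo Protected ;;   # replication now runs in reverse
    *) echo "invalid: '$2' from state '$1'" >&2; return 1 ;;
  esac
}

# Walk one full cycle: failover, commit, reprotect, then the same three
# steps again as the failback.
state=Protected
direction="source->target"
for action in failover commit reprotect failover commit reprotect; do
  state=$(next_state "$state" "$action") || break
  if [ "$action" = reprotect ]; then
    # Reprotect swaps the roles of the two regions.
    if [ "$direction" = "source->target" ]; then
      direction="target->source"
    else
      direction="source->target"
    fi
  fi
  echo "$action -> $state ($direction)"
done
```

The model makes two rules explicit: `cancel` is only reachable before `commit`, and `reprotect` is only reachable after it. A failback is the same three transitions run while the direction is reversed.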
4. Structural View
5. Practical Operation
Complete flow of an emergency failover
Phase 1: Detection and decision
The primary region (e.g., Brazil South) shows severe degradation. You confirm in the Azure portal that the region is experiencing issues, then decide to trigger failover.
Phase 2: Vault access
Since the vault is in the target region (e.g., East US 2), it's accessible even with source region problems. Access the vault via portal URL or PowerShell/CLI.
Phase 3: Failover execution
- In the vault, access Site Recovery > Replicated Items
- Select the VM or execute via Recovery Plan (recommended for multiple VMs)
- Click Failover
- Select the recovery point
- Check "Shut down machine before beginning failover" only if the source is still accessible. If the source is inaccessible, leave it unchecked.
- Confirm
Phase 4: Monitoring
Failover creates VMs in the target region. The process includes:
- Creating managed disks from replicas
- VM provisioning with predefined size and configurations
- Operating system boot
Typical failover time for an Azure VM is 15 to 30 minutes, depending on disk size and configurations.
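Because the job runs for many minutes, monitoring scripts usually poll the job state with a timeout instead of blocking indefinitely. The sketch below shows that pattern in bash. The function `wait_for_job` and the stand-in `fake_job_status` are hypothetical; in a real script the status command would wrap an actual job query, and the state names are assumed to be NotStarted, InProgress, Succeeded, and Failed.

```bash
#!/usr/bin/env bash
# wait_for_job <status-command> [interval-seconds] [max-polls]
# Calls <status-command> repeatedly until it prints something other than
# NotStarted or InProgress, then prints the final state.
# Returns 0 on Succeeded, 1 on any other final state, 2 on timeout.
wait_for_job() {
  local status_cmd=$1 interval=${2:-30} max_polls=${3:-120} state i
  for ((i = 1; i <= max_polls; i++)); do
    state=$("$status_cmd")
    case $state in
      NotStarted|InProgress) sleep "$interval" ;;
      Succeeded) echo "$state"; return 0 ;;
      *) echo "$state"; return 1 ;;
    esac
  done
  echo "TimedOut"
  return 2
}

# Stand-in for a real status query: reports InProgress twice, then Succeeded.
# A counter file is used because $(...) runs the function in a subshell.
counter_file=$(mktemp)
echo 0 > "$counter_file"
fake_job_status() {
  local n
  n=$(($(cat "$counter_file") + 1))
  echo "$n" > "$counter_file"
  if [ "$n" -le 2 ]; then echo InProgress; else echo Succeeded; fi
}

wait_for_job fake_job_status 0 10
rm -f "$counter_file"
```

The defaults (30 seconds, 120 polls) give a one-hour ceiling, which leaves headroom over the 15 to 30 minutes mentioned above. Tune both values to your measured failover times.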
Phase 5: Post-failover validation
Before Commit, verify:
- VM started correctly
- Critical services are running
- Network connectivity is functioning
- Critical data is intact
Phase 6: Commit
After positive validation, execute Commit to finalize the failover. If validation fails, you can still cancel the failover and return to the previous state (only if the source is still accessible).
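Phases 5 and 6 amount to a gate: Commit runs only if every validation check passes. A minimal bash sketch of that gate follows. The four `check_*` functions are hypothetical placeholders with hard-coded results; real checks would probe the VM, its services, the network, and the data.

```bash
#!/usr/bin/env bash
# run_checks <check-command>...
# Runs every check, reports each result, and returns non-zero if any failed.
run_checks() {
  local failed=0 check
  for check in "$@"; do
    if "$check"; then
      echo "PASS $check"
    else
      echo "FAIL $check"
      failed=1
    fi
  done
  return "$failed"
}

# Hypothetical checks: replace the bodies with real probes.
check_vm_started()        { true; }
check_services_running()  { true; }
check_network_reachable() { true; }
check_data_intact()       { false; }  # simulates a failed data check

if run_checks check_vm_started check_services_running \
              check_network_reachable check_data_intact; then
  decision=commit
else
  # Do not commit: investigate, or cancel while that is still possible.
  decision=investigate
fi
echo "decision: $decision"
```

All checks run even after the first failure, so the operator sees the complete picture before deciding what to do.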
Non-obvious behaviors
Failover doesn't update DNS automatically: after failover, VMs are running in the target region with different IPs. DNS, Load Balancer, and any other routing service still point to the source. You need to update manually (or via automation) to redirect traffic to target VMs.
"Shut down machine before beginning failover" is not mandatory: this option tries to shut down the source VM before starting failover to ensure consistency. If the source is inaccessible, unchecking is necessary. If the source is accessible, checking reduces the possibility of split-brain (two instances of the same system running simultaneously).
Failover can be cancelled before Commit: while Commit hasn't been executed, failover can be reverted. After Commit, the process is irreversible for that cycle.
After Commit, source VM is not automatically deleted: the VM in the source region (if it still exists) remains in the state it was in. You need to manage decommissioning manually.
Test Failover creates resources that generate cost: VMs created by Test Failover consume cost until cleaned up. If you execute a Test Failover and forget to clean up, you'll continue being charged.
6. Implementation Methods
6.1 Azure Portal
When to use: emergency failovers where speed and visual familiarity are critical; situations where less technical operators need to execute the process.
Executing Test Failover in the portal:
- Access the vault in the target region
- Site Recovery > Replicated Items
- Select the item > click Test Failover
- Select:
- Recovery Point: Latest (recommended)
- Azure virtual network: select the isolated test network
- Click OK
- Monitor in Jobs
- After validation, click Cleanup test failover
- Add notes about the test result and confirm cleanup
Executing Failover (unplanned) in the portal:
- Access the vault in the target region
- Site Recovery > Replicated Items
- Select the item > click Failover
- Read and confirm the impact warning
- Select the recovery point
- Configure "Shut down machines before beginning failover" according to the scenario
- Click OK
- Monitor the job until completion
- Validate VMs in the target region
- Click Commit to finalize
6.2 Azure PowerShell
When to use: failover automation in DR pipelines, failover of multiple items in sequence, integration with runbooks.
Test Failover via PowerShell:
# Configure vault context
$vault = Get-AzRecoveryServicesVault `
    -ResourceGroupName "rg-asr-eastus2" `
    -Name "rsv-asr-eastus2"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault

# Get replicated item: fabric -> protection container -> protected item.
# The fabric and container names assume the defaults ASR generates for
# Brazil South; adjust them to match your vault.
$fabric = Get-AzRecoveryServicesAsrFabric -Name "asr-a2a-default-brazilsouth"
$container = Get-AzRecoveryServicesAsrProtectionContainer `
    -Fabric $fabric `
    -Name "asr-a2a-default-brazilsouth-container"
$replicationItem = Get-AzRecoveryServicesAsrReplicationProtectedItem `
    -ProtectionContainer $container `
    -Name "replication-vm-producao-01"

# Execute Test Failover into the isolated test network
$testFailoverJob = Start-AzRecoveryServicesAsrTestFailoverJob `
    -ReplicationProtectedItem $replicationItem `
    -Direction PrimaryToRecovery `
    -AzureVMNetworkId "/subscriptions/{sub}/resourceGroups/rg-asr/providers/Microsoft.Network/virtualNetworks/vnet-teste-isolado"

# Wait for completion: poll until the job leaves NotStarted/InProgress
while ($testFailoverJob.State -in "NotStarted", "InProgress") {
    Start-Sleep -Seconds 30
    $testFailoverJob = Get-AzRecoveryServicesAsrJob -Job $testFailoverJob
}
Write-Output "Test Failover completed. Status: $($testFailoverJob.State)"

# Cleanup Test Failover after validation
$cleanupJob = Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
    -ReplicationProtectedItem $replicationItem `
    -Comment "Q1 2025 DR Test - Successfully validated"
while ($cleanupJob.State -in "NotStarted", "InProgress") {
    Start-Sleep -Seconds 30
    $cleanupJob = Get-AzRecoveryServicesAsrJob -Job $cleanupJob
}
Failover (unplanned) via PowerShell:
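# Execute Failover with latest recovery point.
# -PerformSourceSideAction is a switch: pass :$false (or omit it) when the
# source is inaccessible.
$failoverJob = Start-AzRecoveryServicesAsrUnplannedFailoverJob `
    -ReplicationProtectedItem $replicationItem `
    -Direction PrimaryToRecovery `
    -PerformSourceSideAction:$false
# Wait for completion: poll until the job leaves NotStarted/InProgress
while ($failoverJob.State -in "NotStarted", "InProgress") {
    Start-Sleep -Seconds 30
    $failoverJob = Get-AzRecoveryServicesAsrJob -Job $failoverJob
}
Write-Output "Failover completed. Status: $($failoverJob.State)"
# After validation, execute Commit
$commitJob = Start-AzRecoveryServicesAsrCommitFailoverJob `
    -ReplicationProtectedItem $replicationItem
while ($commitJob.State -in "NotStarted", "InProgress") {
    Start-Sleep -Seconds 30
    $commitJob = Get-AzRecoveryServicesAsrJob -Job $commitJob
}
Write-Output "Commit completed. Status: $($commitJob.State)"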
Reprotect after failover:
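# After Commit, start Reprotect (reverses replication direction).
# For Azure-to-Azure replication the cmdlet needs the reverse container
# mapping, a cache storage account in the region where the VM now runs, and
# the resource group that will receive the VM in the original region.
# $reverseMapping, $cacheStorageAccountId and $sourceResourceGroupId are
# placeholders that you must populate for your environment.
$reprotectJob = Update-AzRecoveryServicesAsrProtectionDirection `
    -AzureToAzure `
    -ReplicationProtectedItem $replicationItem `
    -ProtectionContainerMapping $reverseMapping `
    -LogStorageAccountId $cacheStorageAccountId `
    -RecoveryResourceGroupId $sourceResourceGroupId
Write-Output "Reprotect started. Job state: $($reprotectJob.State)"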
6.3 Azure CLI
When to use: status checks and simple operations in bash scripts.
# List replicated items with status.
# The fabric and container names assume the ASR defaults for Brazil South.
az site-recovery protected-item list \
    --resource-group rg-asr-eastus2 \
    --vault-name rsv-asr-eastus2 \
    --fabric-name asr-a2a-default-brazilsouth \
    --protection-container asr-a2a-default-brazilsouth-container \
    --output table
# Check jobs in progress
az site-recovery job list \
    --resource-group rg-asr-eastus2 \
    --vault-name rsv-asr-eastus2 \
    --filter "status eq 'InProgress'" \
    --output table
For complex operations such as failover with specific parameters, PowerShell is the recommended tool. The Site Recovery CLI commands cover fewer advanced features.
6.4 Recovery Plans via Portal
For multiple VMs, Recovery Plan is the recommended method. In the portal:
Executing failover via Recovery Plan:
- Vault > Site Recovery > Recovery Plans
- Select the Recovery Plan
- Click Failover
- Select the recovery point (Latest recommended)
- The plan executes groups in sequence, respecting the configured order
- Monitor each group in Jobs
- After completion and validation, execute Commit on the entire Recovery Plan
Critical advantage: Recovery Plan ensures the database starts before application servers, which start before web servers. Without this, application servers might try to connect to a database that's not yet available, causing cascading failures.
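The ordering guarantee can be illustrated with a small simulation: groups run strictly in sequence, while the VMs inside one group start in parallel. This is a bash sketch of the idea only. The group names, VM names, and the `start_vm` stand-in are made up, and it does not call ASR.

```bash
#!/usr/bin/env bash
# Hypothetical three-tier plan, one entry per group: "<group>:<vm> <vm> ...".
PLAN_GROUPS=(
  "group1:sql-01 sql-02"
  "group2:app-01 app-02"
  "group3:web-01"
)

# Stand-in for the real work of booting a VM and confirming it is healthy.
start_vm() { echo "started $1"; }

run_plan() {
  local entry group vm pid
  local -a vm_list pids
  for entry in "${PLAN_GROUPS[@]}"; do
    group=${entry%%:*}
    read -r -a vm_list <<< "${entry#*:}"
    echo "== $group =="
    pids=()
    for vm in "${vm_list[@]}"; do
      start_vm "$vm" &      # VMs inside one group start in parallel
      pids+=("$!")
    done
    for pid in "${pids[@]}"; do
      # The next group starts only after every VM in this group succeeded.
      wait "$pid" || { echo "$group failed, stopping plan" >&2; return 1; }
    done
  done
}

run_plan
```

Stopping at the first failed group is a design choice in this sketch: it prevents application servers from starting against a database tier that never came up.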
7. Control and Security
RBAC for failover operations
| Operation | Minimum Role |
|---|---|
| Execute Test Failover | Site Recovery Operator |
| Execute Failover (unplanned) | Site Recovery Operator |
| Execute Planned Failover | Site Recovery Operator |
| Execute Commit | Site Recovery Operator |
| Execute Reprotect | Site Recovery Contributor |
| Execute Failback | Site Recovery Operator |
| Create/modify Recovery Plans | Site Recovery Contributor |
Protection against accidental failover
To prevent accidental failover execution in production, consider:
- Vault locks: a CanNotDelete lock doesn't prevent failover operations, but a ReadOnly lock prevents any modification, including failover. Use carefully; ReadOnly locks on ASR vaults can prevent DR operations in emergencies.
- Multi-User Authorization (MUA): configure Resource Guard to require approval from a second administrator before critical operations like failover
- Audit alerts: configure Azure Monitor to alert on any failover operation initiated
Data consistency in failover
When selecting the recovery point, consider the workload type:
| Workload | Recommended Point | Reason |
|---|---|---|
| Database servers | Latest app-consistent | Ensures transactional integrity |
| Web servers | Latest | Minimizes downtime |
| File servers | Latest crash-consistent | Fast recovery, files don't need app consistency |
| Domain controllers | Latest app-consistent | Critical for AD consistency |
| Static web server | Latest (crash-consistent) | Stateless; any point is acceptable |
| SQL database | Latest app-consistent | Requires transactional consistency |
| File server | Latest | Files can be verified after restore |
| Application with message queue | Latest app-consistent | Queue state needs to be consistent |
8. Decision Making
Which type of failover to use
| Situation | Failover Type | Reason |
|---|---|---|
| Quarterly DR validation | Test Failover | Doesn't affect production; allows complete validation |
| Scheduled region maintenance | Planned Failover | Zero RPO; shuts down source VMs first |
| Primary region inaccessible due to disaster | Unplanned Failover | Source inaccessible; maximum urgency |
| Definitive migration to another region | Planned Failover | Controlled, no data loss |
| Suspected recent data corruption | Custom Recovery Point | Allows selecting point prior to corruption |
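The failover-type rows of the table can be encoded as a small decision helper. The bash function below is a sketch under stated assumptions: the name `choose_failover_type` and its input vocabulary (`test`, `maintenance`, `migration`, `disaster`, `yes`/`no`) are invented for this example.

```bash
#!/usr/bin/env bash
# choose_failover_type <purpose> <source_accessible>
#   purpose:           test | maintenance | migration | disaster
#   source_accessible: yes | no
choose_failover_type() {
  local purpose=$1 source_accessible=$2
  case $purpose in
    test)
      echo "Test Failover" ;;
    maintenance|migration)
      if [ "$source_accessible" = yes ]; then
        echo "Planned Failover"
      else
        # Planned Failover needs the source to synchronize the final changes.
        echo "Unplanned Failover"
      fi ;;
    disaster)
      echo "Unplanned Failover" ;;
    *)
      echo "unknown purpose: $purpose" >&2
      return 1 ;;
  esac
}

choose_failover_type test yes
choose_failover_type maintenance yes
choose_failover_type disaster no
```

The data corruption row is not a failover type but a recovery point choice, so it is left out of the helper.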
Individual failover vs Recovery Plan
| Scenario | Approach | Reason |
|---|---|---|
| Single independent VM | Individual failover | Simpler; no dependencies |
| Application stack with dependencies | Recovery Plan | Ensures boot order |
| Multiple VMs with no dependencies between them | Individual failover in parallel | Faster than sequential Recovery Plan |
| Complete production environment | Recovery Plan | Orchestration, automatic actions and auditing |
Immediate commit vs validation before Commit
| Situation | Approach | Reason |
|---|---|---|
| Critical emergency, no time for validation | Immediate commit after failover | Releases reverse replication as soon as possible |
| Planned failover with maintenance window | Validate extensively before Commit | Possibility to cancel if something fails |
| Test Failover (doesn't reach Commit) | Cleanup without Commit | Test Failover doesn't use Commit |
9. Best Practices
Document the DR runbook: create a step-by-step document for failover, including approval contacts, action sequence, validation criteria, and rollback procedures. In an emergency, no one should need to improvise.
Measure and record actual RTO: execute timed Test Failovers and record the total time for each step (VM creation, service startup, validation, DNS update). Use this data to communicate actual RTO to the business.
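As a worked example of the RTO arithmetic, the bash sketch below sums per-step durations from a timed test and compares the total with a target. The step durations and the 60-minute target are made-up sample values, and `measured_rto` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Step durations in minutes from a timed Test Failover (sample values only).
steps=("vm_creation:18" "service_startup:7" "validation:10" "dns_update:5")
target_rto_minutes=60   # hypothetical business target

# measured_rto <name:minutes>...: prints the sum of the minutes.
measured_rto() {
  local total=0 entry
  for entry in "$@"; do
    total=$((total + ${entry#*:}))
  done
  echo "$total"
}

total=$(measured_rto "${steps[@]}")
margin=$((target_rto_minutes - total))
echo "measured RTO: ${total} min, target: ${target_rto_minutes} min, margin: ${margin} min"
```

With these sample values the steps add up to 18 + 7 + 10 + 5 = 40 minutes, leaving a 20-minute margin against the target. A negative margin would mean the communicated RTO is not being met.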
Update DNS as part of failover: configure the DNS update process (via Traffic Manager, Azure Front Door or manual registration) as an explicit step in the Recovery Plan or runbook. DNS not updated is one of the most common causes of inflated RTO.
Never use the same production network in Test Failover: always specify an isolated test network (without routing to production) to avoid IP conflicts and accidental communication with production systems.
Execute Reprotect immediately after Commit: after an emergency failover, the new destination region is left without DR protection. Execute Reprotect as quickly as possible to reestablish resilience.
Validate Recovery Plan with every infrastructure change: whenever a new VM is added to the environment or network topology is changed, review and update Recovery Plans.
10. Common Errors
Error: executing Test Failover on production network
Why it happens: the operator selects the production VNet for convenience, without realizing the impact.
How to avoid: create a VNet dedicated to DR tests, without peering with production, and use it exclusively for Test Failovers.
Error: forgetting to clean up Test Failover
Why it happens: the test is executed and validated, and the operator moves to other tasks without cleaning up.
How to avoid: include cleanup in the test runbook as a mandatory step. Configure alerts for VMs with test tags that have been running for more than 24 hours.
Error: not updating DNS after failover
Why it happens: the operator considers failover complete when VMs are running, without realizing traffic is still going to the source.
How to avoid: include DNS update as an explicit step in the runbook and/or in the Recovery Plan with automated action.
Error: doing Commit before validating VMs
Why it happens: pressure for speed in emergency situations.
How to avoid: even in emergencies, basic validation (VM started? Main service responding?) takes less than 5 minutes and can save hours of rework if something is wrong.
Error: not executing Reprotect after failover
Why it happens: the operator is focused on reestablishing service and doesn't think about protecting the new state.
How to avoid: include Reprotect as the last mandatory step of the DR runbook. After a failover, you don't have DR until Reprotect is active.
Error: selecting wrong recovery point
Why it happens: in emergency situations, the operator accepts the default without evaluating the scenario.
How to avoid: for transactional workloads, always prefer the app-consistent point. For data corruption scenarios, evaluate Custom to select a point prior to corruption.
11. Operation and Maintenance
Post-failover monitoring
After a failover, monitoring changes perspective. You start monitoring VMs in the destination region as if they were the new production environment:
| What to monitor | Where to check | Frequency |
|---|---|---|
| Reprotect status after failover | Site Recovery > Replicated Items | Every hour until Protected |
| RPO after Reprotect | Replicated Items > RPO | Continuous |
| VM health in new region | Azure Monitor / VM Insights | Continuous |
| Failed ASR jobs | Site Recovery > Jobs | Daily |
| Replication alerts | Azure Monitor Alerts | Immediate |
Detailed Failback process
Failback is the return to the source region after Reprotect is active. The process is identical to failover, but in reverse direction:
- Verify that Reprotect is in Protected state and RPO is healthy
- Execute a Planned Failover (RecoveryToPrimary) to return without data loss
- Wait for VM creation in the source region
- Validate functionality at source
- Execute Commit
- Execute Reprotect again (now reestablishing source → destination replication)
- Verify that state returned to normal with active replication
Operational limits relevant to failover
| Limit | Value |
|---|---|
| Maximum time between consecutive Test Failovers | No formal limit |
| Maximum number of VMs in a Recovery Plan | 100 VMs per plan |
| Number of groups in a Recovery Plan | 3 default groups; extensible with Automation |
| Recovery point retention time | 15 days |
| Number of script actions per Recovery Plan | No documented limit |
12. Integration and Automation
Recovery Plan with Azure Automation Runbook
The Recovery Plan can execute Azure Automation Runbooks as steps between VM groups. This allows automating actions like:
- Update DNS records in Azure DNS or Traffic Manager
- Scale network resources (e.g., increase VPN gateway capacity)
- Notify teams via Microsoft Teams or email
- Execute service validation scripts
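As an illustration of the DNS action, the bash sketch below repoints an A record in an Azure DNS zone with `az network dns record-set a add-record` and `remove-record`. It defaults to a dry run that only prints the commands. The resource group, zone, record name, and IP addresses are placeholders, and the `run` and `repoint_a_record` helpers are hypothetical names for this example.

```bash
#!/usr/bin/env bash
# Prints each command instead of executing it unless DRY_RUN=0 is set,
# so the script can be reviewed safely before a real failover.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# repoint_a_record <resource-group> <zone> <record-set> <old-ip> <new-ip>
repoint_a_record() {
  local rg=$1 zone=$2 name=$3 old_ip=$4 new_ip=$5
  # Add the target-region address first so the record set is never empty.
  run az network dns record-set a add-record \
    --resource-group "$rg" --zone-name "$zone" \
    --record-set-name "$name" --ipv4-address "$new_ip" || return 1
  # Then remove the source-region address.
  run az network dns record-set a remove-record \
    --resource-group "$rg" --zone-name "$zone" \
    --record-set-name "$name" --ipv4-address "$old_ip"
}

# All names and addresses below are placeholders.
repoint_a_record rg-dns contoso.example app 203.0.113.10 198.51.100.10
```

Clients keep resolving the old address until the record's TTL expires, so a short TTL on records involved in DR reduces the time this step adds to the RTO.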
Integration with Azure Monitor for automatic failover
For fully automated DR scenarios (without human intervention), it's possible to create a flow that detects regional failure and triggers failover automatically:
- Azure Monitor detects critical regional failure metrics
- Action Group triggers a Logic App or Automation Runbook
- The Runbook executes failover via PowerShell (as demonstrated in previous code)
- The Runbook updates DNS and notifies the team after completion
Warning: fully automatic failover without human approval is suitable only for workloads where RTO is so critical that any human delay is unacceptable. For most scenarios, minimal human approval is recommended to avoid unnecessary failovers due to false positives.
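One common way to reduce false positives, shown here only as a sketch, is to require several consecutive failed health probes before the automation acts. The bash function below assumes probe results arrive as the words `ok` and `fail`; the function name and the threshold of three are arbitrary choices for illustration, not ASR or Azure Monitor settings.

```bash
#!/usr/bin/env bash
# should_trigger_failover <threshold> <probe-result>...
# Probe results are given oldest first, each "ok" or "fail".
# Succeeds only if the most recent <threshold> probes all failed.
should_trigger_failover() {
  local threshold=$1; shift
  local streak=0 result
  for result in "$@"; do
    if [ "$result" = fail ]; then
      streak=$((streak + 1))
    else
      streak=0   # any success resets the streak
    fi
  done
  [ "$streak" -ge "$threshold" ]
}

if should_trigger_failover 3 ok fail fail fail; then
  echo "sustained outage: request failover approval"
fi
if ! should_trigger_failover 3 fail fail ok fail; then
  echo "isolated failures: keep monitoring"
fi
```

In the first call the last three probes failed, so the condition holds. In the second call a success interrupted the streak, so the automation keeps waiting.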
13. Final Summary
What it is: process of activating replicated infrastructure in the destination region when the source region fails or is evacuated in a planned manner, with three distinct execution modalities.
Essential points:
- There are three types of failover: Test (non-destructive, isolated network), Planned (zero RPO, VMs shut down at source) and Unplanned (emergency, may have data loss)
- Commit finalizes failover and is mandatory before Reprotect; after Commit, failover cannot be reverted
- Reprotect reverses replication direction and is necessary to enable Failback
- Test Failover uses an isolated network and should be cleaned up after validation
- Failover doesn't update DNS automatically; this step must be explicitly planned
- Recovery Plans ensure boot order and allow automatic actions between groups
Critical differences between failover types:
| Aspect | Test Failover | Planned Failover | Unplanned Failover |
|---|---|---|---|
| Production impact | None | Source VMs shut down | Depends on scenario |
| Data loss | None | Zero (synchronizes first) | Equivalent to RPO |
| Source needs to be accessible | No | Yes | No |
| Requires Commit | No (uses cleanup) | Yes | Yes |
| Typical use | DR tests | Planned maintenance | Actual disaster |
| Replication during operation | Continues normally | Pauses for synchronization | Stops after execution |
What needs to be remembered for AZ-104:
- Test Failover doesn't affect production and doesn't require Commit; uses cleanup after validation
- Planned Failover requires source VMs shut down and ensures zero RPO
- After Commit, Reprotect reverses direction: destination starts replicating to source
- DNS is not automatically updated by ASR; plan this step
- Recovery Plans allow orchestrating failover of multiple VMs with order and automation
- ASR vault must be in the destination region (reinforcing previous concept)