Theoretical Foundation: Perform a Failover to a Secondary Region by Using Site Recovery


1. Initial Intuition​

In the previous topic, you learned to configure Azure Site Recovery: create the vault in the target region, enable replication, define policies, and structure Recovery Plans. All this configuration exists for a single moment: when you need to trigger the failover.

Failover is the act of activating your infrastructure in the target region. Think of the following scenario: you have an emergency generator installed in the building. Configuring ASR is like installing and testing the generator. Executing the failover is like turning on the generator when the main power goes out.

But there's an important subtlety: there are different situations that require different types of failover. Turning on the generator for a scheduled test is different from activating it during an emergency blackout. In ASR, this distinction translates into three types of failover, each with distinct behavior, impact, and prerequisites.

In practice, failover serves to:

  • Recover operations when the primary region is unavailable (emergency)
  • Migrate workloads in a planned manner to another region (maintenance)
  • Validate that the DR process works as expected (testing)

2. Context​

Failover is the most critical operation in the ASR lifecycle. All the configuration work, continuous replication, and monitoring converge to ensure that, at this moment, everything works correctly.

Failover is not an isolated operation. It's part of a complete cycle that includes the previous state (active replication) and the subsequent state (Commit, Reprotect, and Failback). Understanding the complete cycle is what differentiates an operator who knows how to execute a failover from one who knows how to manage the end-to-end DR process.


3. Building the Concepts​

3.1 The three types of failover in detail​

Test Failover

Test Failover creates temporary VMs in the target region connected to an isolated test network that you specify. It's completely non-destructive:

  • Replication continues normally during and after the test
  • Production VMs at the source are not affected
  • Test VMs are temporary and should be cleaned up after validation
  • Does not change the replication status of the protected item

Test Failover is the most important operation in the DR process because it's the only way to verify, without affecting production, that the actual failover will work when needed.

Planned Failover

Planned Failover (in migration scenarios, also referred to as Migrate) is executed when you know in advance that the primary region will need to be evacuated. Characteristics:

  • Synchronizes the latest changes from source to target before creating VMs
  • Zero RPO: no data loss, as all changes are synchronized before cutover
  • Requires source VMs to be shut down at the time of failover
  • Used for planned region maintenance, definitive migration, or infrastructure updates

Unplanned Failover (simply "Failover")

The standard Failover is the emergency failover, triggered when the primary region is unavailable or when a quick decision is needed. Characteristics:

  • Can be executed even with the source inaccessible
  • Uses the selected recovery point (there may be data loss equivalent to RPO)
  • The source doesn't need to be accessible nor VMs shut down
  • This is the failover you'll execute in a real disaster scenario
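
To make "data loss equivalent to RPO" concrete, the sketch below computes the window of lost writes as the gap between the selected recovery point and the moment the source stopped accepting writes. The function name and the use of Unix epoch seconds are assumptions made for illustration; ASR does not report these values in this form.

```bash
# Hypothetical helper: estimates the data-loss window of an unplanned failover.
# Both arguments are Unix epoch seconds (an assumption made for this sketch).
data_loss_seconds() {
  local recovery_point_epoch="$1"   # when the selected recovery point was taken
  local failure_epoch="$2"          # when the source stopped accepting writes
  if [ "$recovery_point_epoch" -gt "$failure_epoch" ]; then
    echo "recovery point is newer than the failure time" >&2
    return 1
  fi
  echo $(( failure_epoch - recovery_point_epoch ))
}
```

For example, `data_loss_seconds 1700000000 1700000240` prints `240`: writes from the four minutes before the failure are not in the recovered VM.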

3.2 Recovery point selection​

In Failover (unplanned), you need to choose which recovery point to use. The options are:

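Latest (automatically selected): first processes all the data already sent to the Site Recovery service, then fails over to the resulting recovery point. This gives the lowest RPO (minimal data loss) at the cost of some extra processing time. It's the default and recommended option for most emergency scenarios. The portal's separate Latest processed option skips that step and uses the most recently processed point, trading some RPO for a faster failover.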

Latest app-consistent: uses the most recent application-consistent recovery point. May be older than "Latest," but ensures consistency for databases and transactional applications.

Latest crash-consistent: uses the most recent crash-consistent point. Faster to process.

Custom: allows manual selection of a specific point. Useful when you know data was corrupted after a certain time and want to restore to a specific previous state.
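
The selection guidance above can be sketched as a small decision function. This is only an illustration of the reasoning: the function name and the yes/no arguments are hypothetical, and the real choice is made in the failover dialog or through the failover command's parameters.

```bash
# Hypothetical helper that encodes the selection guidance above.
# Arguments ("yes"/"no"): is the workload transactional? is data corruption suspected?
choose_recovery_point() {
  local transactional="$1"
  local corruption_suspected="$2"
  if [ "$corruption_suspected" = "yes" ]; then
    echo "Custom"                 # pick a point taken before the corruption
  elif [ "$transactional" = "yes" ]; then
    echo "Latest app-consistent"  # databases and other transactional applications
  else
    echo "Latest"                 # default: minimizes data loss
  fi
}
```

Corruption is checked first because a newer point, however consistent, would simply restore the corrupted data.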

3.3 The Commit process​

After failover (planned or unplanned), VMs are running in the target region, but ASR still maintains the association with the original recovery point. Commit finalizes the failover:

  • Marks the failover as permanent
  • Discards old recovery points associated with the source
  • Frees the item to enter the Reprotect process

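Until you Commit, ASR keeps the recovery points, so you can still select a different recovery point or cancel the failover and return to the previous state (if the source is still accessible). After Commit, the recovery points are discarded and this is no longer possible.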


3.4 Reprotect and Failback​

After Commit, VMs are running in the target region but without active protection. To reestablish protection and enable return to the source region, you execute Reprotect.

Reprotect reverses the replication direction: the target region becomes the new source, and the source region (when it recovers) becomes the new target. After Reprotect is active and synchronization is complete, you can execute Failback to return VMs to the original region.
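
One way to picture the cycle from 3.3 and 3.4 is as a small state machine in which each state allows only certain next operations. The sketch below is a mental model under assumed state labels (`protected`, `failed-over`, `committed`); they are not the status strings ASR reports.

```bash
# Hypothetical state labels for one protected item; ASR's own status strings differ.
# Prints the operations that move the item forward from the given state.
allowed_operations() {
  case "$1" in
    protected)   echo "test-failover failover" ;;  # replication is healthy
    failed-over) echo "commit" ;;                  # validate the VMs, then commit
    committed)   echo "reprotect" ;;               # no DR protection until this runs
    *)           echo "unknown state: $1" >&2; return 1 ;;
  esac
}
```

Once Reprotect finishes synchronizing, the item is `protected` again with the direction reversed, so Failback is just another pass through the same states.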

4. Structural View​

[Diagram: structural view of the failover cycle; not reproduced in this text version]

5. Practical Operation​

Complete flow of an emergency failover​

Phase 1: Detection and decision

The primary region (e.g., Brazil South) shows severe degradation. You check the Azure portal, confirm that the region is experiencing issues, and decide to trigger failover.

Phase 2: Vault access

Since the vault is in the target region (e.g., East US 2), it's accessible even with source region problems. Access the vault via portal URL or PowerShell/CLI.

Phase 3: Failover execution

  1. In the vault, access Site Recovery > Replicated Items
  2. Select the VM or execute via Recovery Plan (recommended for multiple VMs)
  3. Click Failover
  4. Select the recovery point
  5. Check "Shut down machine before beginning failover" only if the source is still accessible. If the source is inaccessible, leave it unchecked.
  6. Confirm

Phase 4: Monitoring

Failover creates VMs in the target region. The process includes:

  • Creating managed disks from replicas
  • VM provisioning with predefined size and configurations
  • Operating system boot

Typical failover time for an Azure VM is 15 to 30 minutes, depending on disk size and configurations.

Phase 5: Post-failover validation

Before Commit, verify:

  • VM started correctly
  • Critical services are running
  • Network connectivity is functioning
  • Critical data is intact

Phase 6: Commit

After positive validation, execute Commit to finalize the failover. If validation fails, you can still cancel the failover and return to the previous state (only if the source is still accessible).


Non-obvious behaviors​

Failover doesn't update DNS automatically: after failover, VMs are running in the target region with different IPs. DNS, Load Balancer, and any other routing service still point to the source. You need to update manually (or via automation) to redirect traffic to target VMs.

"Shut down machine before beginning failover" is not mandatory: this option tries to shut down the source VM before starting failover to ensure consistency. If the source is inaccessible, unchecking is necessary. If the source is accessible, checking reduces the possibility of split-brain (two instances of the same system running simultaneously).

Failover can be cancelled before Commit: while Commit hasn't been executed, failover can be reverted. After Commit, the process is irreversible for that cycle.

After Commit, source VM is not automatically deleted: the VM in the source region (if it still exists) remains in the state it was in. You need to manage decommissioning manually.

Test Failover creates resources that generate cost: VMs created by Test Failover consume cost until cleaned up. If you execute a Test Failover and forget to clean up, you'll continue being charged.
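
The last behavior above is easy to guard against with a simple age check. The sketch below flags a Test Failover that has been running longer than a limit; the function name, the epoch-second inputs, and the 24-hour default are assumptions for illustration, and how you obtain the start time (job history, tags, a ticket) is up to your process.

```bash
# Hypothetical check: has a Test Failover been left running longer than allowed?
# Arguments: start time and current time (Unix epoch seconds), optional limit in hours.
# Exit status is 0 when the test is overdue, 1 otherwise.
test_failover_overdue() {
  local started_epoch="$1"
  local now_epoch="$2"
  local max_hours="${3:-24}"   # 24 hours is an assumed default for this sketch
  local elapsed_hours=$(( (now_epoch - started_epoch) / 3600 ))
  if [ "$elapsed_hours" -ge "$max_hours" ]; then
    echo "overdue: running for ${elapsed_hours}h, clean up the test failover"
    return 0
  fi
  echo "ok: running for ${elapsed_hours}h"
  return 1
}
```

A scheduled job could call it as `test_failover_overdue "$start" "$(date +%s)" && notify_team`, where `notify_team` stands in for whatever alerting you already use.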


6. Implementation Methods​

6.1 Azure Portal​

When to use: emergency failovers where speed and visual familiarity are critical; situations where less technical operators need to execute the process.

Executing Test Failover in the portal:

  1. Access the vault in the target region
  2. Site Recovery > Replicated Items
  3. Select the item > click Test Failover
  4. Select:
    • Recovery Point: Latest (recommended)
    • Azure virtual network: select the isolated test network
  5. Click OK
  6. Monitor in Jobs
  7. After validation, click Cleanup test failover
  8. Add notes about the test result and confirm cleanup

Executing Failover (unplanned) in the portal:

  1. Access the vault in the target region
  2. Site Recovery > Replicated Items
  3. Select the item > click Failover
  4. Read and confirm the impact warning
  5. Select the recovery point
  6. Configure "Shut down machines before beginning failover" according to the scenario
  7. Click OK
  8. Monitor the job until completion
  9. Validate VMs in the target region
  10. Click Commit to finalize

6.2 Azure PowerShell​

When to use: failover automation in DR pipelines, failover of multiple items in sequence, integration with runbooks.

Test Failover via PowerShell:

# Configure vault context
$vault = Get-AzRecoveryServicesVault `
    -ResourceGroupName "rg-asr-eastus2" `
    -Name "rsv-asr-eastus2"

Set-AzRecoveryServicesAsrVaultContext -Vault $vault

# Get replicated item (the fabric is named after the source region; the container adds "-container")
$fabric = Get-AzRecoveryServicesAsrFabric -Name "asr-a2a-default-brazilsouth"
$container = Get-AzRecoveryServicesAsrProtectionContainer `
    -Fabric $fabric `
    -Name "asr-a2a-default-brazilsouth-container"
$replicationItem = Get-AzRecoveryServicesAsrReplicationProtectedItem `
    -ProtectionContainer $container `
    -Name "replication-vm-producao-01"

# Execute Test Failover
$testFailoverJob = Start-AzRecoveryServicesAsrTestFailoverJob `
    -ReplicationProtectedItem $replicationItem `
    -Direction PrimaryToRecovery `
    -AzureVMNetworkId "/subscriptions/{sub}/resourceGroups/rg-asr/providers/Microsoft.Network/virtualNetworks/vnet-teste-isolado"

# Wait for completion: poll the job until it leaves NotStarted/InProgress
do {
    Start-Sleep -Seconds 30
    $testFailoverJob = Get-AzRecoveryServicesAsrJob -Job $testFailoverJob
} while ($testFailoverJob.State -in @("NotStarted", "InProgress"))
Write-Output "Test Failover completed. Status: $($testFailoverJob.State)"

# Cleanup Test Failover after validation
$cleanupJob = Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
    -ReplicationProtectedItem $replicationItem `
    -Comment "Q1 2025 DR Test - Successfully validated"

do {
    Start-Sleep -Seconds 30
    $cleanupJob = Get-AzRecoveryServicesAsrJob -Job $cleanupJob
} while ($cleanupJob.State -in @("NotStarted", "InProgress"))

Failover (unplanned) via PowerShell:

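# Execute Failover with latest recovery point.
# -PerformSourceSideAction is a switch: add it only when the source is still
# accessible and should be shut down first; omit it when the source is inaccessible.
$failoverJob = Start-AzRecoveryServicesAsrUnplannedFailoverJob `
    -ReplicationProtectedItem $replicationItem `
    -Direction PrimaryToRecovery

# Wait for completion: poll the job until it leaves NotStarted/InProgress
do {
    Start-Sleep -Seconds 30
    $failoverJob = Get-AzRecoveryServicesAsrJob -Job $failoverJob
} while ($failoverJob.State -in @("NotStarted", "InProgress"))
Write-Output "Failover completed. Status: $($failoverJob.State)"

# After validation, execute Commit
$commitJob = Start-AzRecoveryServicesAsrCommitFailoverJob `
    -ReplicationProtectedItem $replicationItem

do {
    Start-Sleep -Seconds 30
    $commitJob = Get-AzRecoveryServicesAsrJob -Job $commitJob
} while ($commitJob.State -in @("NotStarted", "InProgress"))
Write-Output "Commit completed. Status: $($commitJob.State)"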

Reprotect after failover:

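# After Commit, start Reprotect (reverses replication direction).
# Placeholders you must set beforehand: $reverseContainerMapping (container mapping
# for target -> source), $cacheStorageAccountId (cache storage account in the region
# where the VM now runs), $sourceResourceGroupId (resource group in the original region).
$reprotectJob = Update-AzRecoveryServicesAsrProtectionDirection `
    -AzureToAzure `
    -ReplicationProtectedItem $replicationItem `
    -ProtectionContainerMapping $reverseContainerMapping `
    -LogStorageAccountId $cacheStorageAccountId `
    -RecoveryResourceGroupId $sourceResourceGroupId

do {
    Start-Sleep -Seconds 30
    $reprotectJob = Get-AzRecoveryServicesAsrJob -Job $reprotectJob
} while ($reprotectJob.State -in @("NotStarted", "InProgress"))
Write-Output "Reprotect job finished. Status: $($reprotectJob.State)"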

6.3 Azure CLI​

When to use: status checks and simple operations in bash scripts.

# List replicated items with status
# (requires the site-recovery CLI extension; the fabric is named after the source region)
az site-recovery protected-item list \
    --resource-group rg-asr-eastus2 \
    --vault-name rsv-asr-eastus2 \
    --fabric-name asr-a2a-default-brazilsouth \
    --protection-container-name asr-a2a-default-brazilsouth-container \
    --output table

# Check jobs in progress
az site-recovery job list \
--resource-group rg-asr-eastus2 \
--vault-name rsv-asr-eastus2 \
--filter "status eq 'InProgress'" \
--output table

For complex operations like failover with specific parameters, PowerShell offers more complete coverage and is the recommended tool. The ASR CLI covers fewer advanced features.


6.4 Recovery Plans via Portal​

For multiple VMs, Recovery Plan is the recommended method. In the portal:

Executing failover via Recovery Plan:

  1. Vault > Site Recovery > Recovery Plans
  2. Select the Recovery Plan
  3. Click Failover
  4. Select the recovery point (Latest recommended)
  5. The plan executes groups in sequence, respecting the configured order
  6. Monitor each group in Jobs
  7. After completion and validation, execute Commit on the entire Recovery Plan

Critical advantage: Recovery Plan ensures the database starts before application servers, which start before web servers. Without this, application servers might try to connect to a database that's not yet available, causing cascading failures.
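
The ordering guarantee can be sketched as a loop over groups in which a group starts only after the previous one has finished. This is a simplified model: `failover_vm` is a hypothetical placeholder that only prints a message, the VM names are made up, and the sketch walks the VMs of a group one by one, whereas ASR handles the machines inside a group in parallel.

```bash
# Hypothetical placeholder for the real failover trigger of a single VM.
failover_vm() {
  echo "  failing over $1"
}

# Each argument is one group: a space-separated list of VM names.
# Groups run strictly in order; a failure stops the plan before later groups start.
run_recovery_plan() {
  local group_number=1
  local group vm
  for group in "$@"; do
    echo "Group $group_number:"
    for vm in $group; do
      failover_vm "$vm" || return 1
    done
    group_number=$(( group_number + 1 ))
  done
}
```

Calling `run_recovery_plan "sql-01" "app-01 app-02" "web-01"` processes the database first, then both application servers, then the web server.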


7. Control and Security​

RBAC for failover operations​

| Operation | Minimum Role |
| --- | --- |
| Execute Test Failover | Site Recovery Operator |
| Execute Failover (unplanned) | Site Recovery Operator |
| Execute Planned Failover | Site Recovery Operator |
| Execute Commit | Site Recovery Operator |
| Execute Reprotect | Site Recovery Contributor |
| Execute Failback | Site Recovery Operator |
| Create/modify Recovery Plans | Site Recovery Contributor |

Protection against accidental failover​

To prevent accidental failover execution in production, consider:

  • Vault locks: a CanNotDelete lock doesn't prevent failover operations, but a ReadOnly lock prevents any modification, including failover. Use carefully; ReadOnly locks on ASR vaults can prevent DR operations in emergencies.
  • Multi-User Authorization (MUA): configure Resource Guard to require approval from a second administrator before critical operations like failover
  • Audit alerts: configure Azure Monitor to alert on any failover operation initiated

Data consistency in failover​

When selecting the recovery point, consider the workload type:

| Workload | Recommended Point | Reason |
| --- | --- | --- |
| Database servers | Latest app-consistent | Ensures transactional integrity |
| Web servers | Latest | Minimizes downtime |
| File servers | Latest crash-consistent | Fast recovery, files don't need app consistency |
| Domain controllers | Latest app-consistent | Critical for AD consistency |
| Static web server | Latest (crash-consistent) | Stateless; any point is acceptable |
| SQL database | Latest app-consistent | Requires transactional consistency |
| File server | Latest | Files can be verified after restore |
| Application with message queue | Latest app-consistent | Queue state needs to be consistent |

8. Decision Making​

Which type of failover to use​

| Situation | Failover Type | Reason |
| --- | --- | --- |
| Quarterly DR validation | Test Failover | Doesn't affect production; allows complete validation |
| Scheduled region maintenance | Planned Failover | Zero RPO; shuts down source VMs first |
| Primary region inaccessible due to disaster | Unplanned Failover | Source inaccessible; maximum urgency |
| Definitive migration to another region | Planned Failover | Controlled, no data loss |
| Suspected recent data corruption | Custom Recovery Point | Allows selecting point prior to corruption |

Individual failover vs Recovery Plan​

| Scenario | Approach | Reason |
| --- | --- | --- |
| Single independent VM | Individual failover | Simpler; no dependencies |
| Application stack with dependencies | Recovery Plan | Ensures boot order |
| Multiple VMs with no dependencies between them | Individual failover in parallel | Faster than sequential Recovery Plan |
| Complete production environment | Recovery Plan | Orchestration, automatic actions and auditing |

Immediate commit vs validation before Commit​

| Situation | Approach | Reason |
| --- | --- | --- |
| Critical emergency, no time for validation | Immediate commit after failover | Releases reverse replication as soon as possible |
| Planned failover with maintenance window | Validate extensively before Commit | Possibility to cancel if something fails |
| Test Failover (doesn't reach Commit) | Cleanup without Commit | Test Failover doesn't use Commit |

9. Best Practices​

Document the DR runbook: create a step-by-step document for failover, including approval contacts, action sequence, validation criteria, and rollback procedures. In an emergency, no one should need to improvise.

Measure and record actual RTO: execute timed Test Failovers and record the total time for each step (VM creation, service startup, validation, DNS update). Use this data to communicate actual RTO to the business.

Update DNS as part of failover: configure the DNS update process (via Traffic Manager, Azure Front Door or manual registration) as an explicit step in the Recovery Plan or runbook. DNS not updated is one of the most common causes of inflated RTO.

Never use the same production network in Test Failover: always specify an isolated test network (without routing to production) to avoid IP conflicts and accidental communication with production systems.

Execute Reprotect immediately after Commit: after an emergency failover, the new destination region is left without DR protection. Execute Reprotect as quickly as possible to reestablish resilience.

Validate Recovery Plan with every infrastructure change: whenever a new VM is added to the environment or network topology is changed, review and update Recovery Plans.
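
The "Measure and record actual RTO" practice above comes down to adding up timed steps and comparing the total with the target agreed with the business. A minimal sketch, assuming durations are recorded in seconds and using a hypothetical function name:

```bash
# Hypothetical helper: sums per-step durations (seconds) from a timed Test Failover
# and compares the total with a target RTO.
# First argument is the target in seconds; the remaining arguments are step durations.
measure_rto() {
  local target_seconds="$1"
  shift
  local total=0
  local step_seconds
  for step_seconds in "$@"; do
    total=$(( total + step_seconds ))
  done
  echo "measured RTO: ${total}s (target: ${target_seconds}s)"
  [ "$total" -le "$target_seconds" ]   # exit status 0 only when the target is met
}
```

With illustrative timings for VM creation, service startup, validation, and DNS update, `measure_rto 3600 1200 300 600 180` prints `measured RTO: 2280s (target: 3600s)` and succeeds, because 38 minutes fits inside the one-hour target.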


10. Common Errors​

Error: executing Test Failover on production network Why it happens: the operator selects the production VNet for convenience, without realizing the impact. How to avoid: create a VNet dedicated to DR tests, without peering with production, and use it exclusively for Test Failovers.

Error: forgetting to clean up Test Failover Why it happens: the test is executed, validated and the operator moves to other tasks without cleaning up. How to avoid: include cleanup in the test runbook as a mandatory step. Configure alerts for VMs with test tags that have been running for more than 24 hours.

Error: not updating DNS after failover Why it happens: the operator considers failover complete when VMs are running, without realizing traffic is still going to the source. How to avoid: include DNS update as an explicit step in the runbook and/or in the Recovery Plan with automated action.

Error: doing Commit before validating VMs Why it happens: pressure for speed in emergency situations. How to avoid: even in emergencies, basic validation (VM started? Main service responding?) takes less than 5 minutes and can save hours of rework if something is wrong.

Error: not executing Reprotect after failover Why it happens: the operator is focused on reestablishing service and doesn't think about protecting the new state. How to avoid: include Reprotect as the last mandatory step of the DR runbook. After a failover, you don't have DR until Reprotect is active.

Error: selecting wrong recovery point Why it happens: in emergency situations, the operator accepts the default without evaluating the scenario. How to avoid: for transactional workloads, always prefer the app-consistent point. For data corruption scenarios, evaluate Custom to select a point prior to corruption.


11. Operation and Maintenance​

Post-failover monitoring​

After a failover, monitoring changes perspective. You start monitoring VMs in the destination region as if they were the new production environment:

| What to monitor | Where to check | Frequency |
| --- | --- | --- |
| Reprotect status after failover | Site Recovery > Replicated Items | Every hour until Protected |
| RPO after Reprotect | Replicated Items > RPO | Continuous |
| VM health in new region | Azure Monitor / VM Insights | Continuous |
| Failed ASR jobs | Site Recovery > Jobs | Daily |
| Replication alerts | Azure Monitor Alerts | Immediate |

Detailed Failback process​

Failback is the return to the source region after Reprotect is active. The process is identical to failover, but in reverse direction:

  1. Verify that Reprotect is in Protected state and RPO is healthy
  2. Execute a Planned Failover (RecoveryToPrimary) to return without data loss
  3. Wait for VM creation in the source region
  4. Validate functionality at source
  5. Execute Commit
  6. Execute Reprotect again (now reestablishing source→destination replication)
  7. Verify that state returned to normal with active replication
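
Step 1 of the list above is a precondition worth checking explicitly before anything else runs. The sketch below expresses it as a gate; the function name, the argument order, and the 15-minute default threshold are assumptions for illustration, and the state and RPO values would come from the Replicated Items view or your monitoring.

```bash
# Hypothetical pre-check before failback.
# Arguments: the item's protection state, its current RPO in seconds,
# and optionally the maximum acceptable RPO in seconds.
ready_for_failback() {
  local protection_state="$1"
  local rpo_seconds="$2"
  local max_rpo_seconds="${3:-900}"   # 15 minutes is an assumed threshold
  if [ "$protection_state" != "Protected" ]; then
    echo "not ready: Reprotect state is '$protection_state'"
    return 1
  fi
  if [ "$rpo_seconds" -gt "$max_rpo_seconds" ]; then
    echo "not ready: RPO ${rpo_seconds}s exceeds ${max_rpo_seconds}s"
    return 1
  fi
  echo "ready: run the failover back to the source region"
}
```

Failing over back while synchronization is still in progress, or while RPO is high, would turn a planned return into one with avoidable data loss.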

Operational limits relevant to failover​

LimitValue
Maximum time between consecutive Test FailoversNo formal limit
Maximum number of VMs in a Recovery Plan100 VMs per plan
Number of groups in a Recovery Plan3 default groups; extensible with Automation
Recovery point retention time15 days
Number of script actions per Recovery PlanNo documented limit

12. Integration and Automation​

Recovery Plan with Azure Automation Runbook​

The Recovery Plan can execute Azure Automation Runbooks as steps between VM groups. This allows automating actions like:

  • Update DNS records in Azure DNS or Traffic Manager
  • Scale network resources (e.g., increase VPN gateway capacity)
  • Notify teams via Microsoft Teams or email
  • Execute service validation scripts
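
As an illustration of the first action in the list, the sketch below repoints an Azure DNS A record from the source VM's IP to the target VM's IP using the Azure CLI. The function name and all argument values are placeholders. It is written in bash for consistency with the CLI examples in this document; an Automation Runbook would typically perform the same two steps in PowerShell.

```bash
# Sketch: repoint an Azure DNS A record from the source VM IP to the target VM IP.
# Arguments: resource group of the DNS zone, zone name, record name, old IP, new IP.
repoint_dns_a_record() {
  local resource_group="$1"
  local zone="$2"
  local record="$3"
  local old_ip="$4"
  local new_ip="$5"
  # Add the new address first so the record set is never left empty.
  az network dns record-set a add-record \
    --resource-group "$resource_group" --zone-name "$zone" \
    --record-set-name "$record" --ipv4-address "$new_ip" || return 1
  # Then remove the address that still points at the source region.
  az network dns record-set a remove-record \
    --resource-group "$resource_group" --zone-name "$zone" \
    --record-set-name "$record" --ipv4-address "$old_ip"
}
```

Clients keep resolving the old address until the record's TTL expires, so a short TTL on DR-relevant records reduces the effective RTO.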

Integration with Azure Monitor for automatic failover​

For fully automated DR scenarios (without human intervention), it's possible to create a flow that detects regional failure and triggers failover automatically:

  1. Azure Monitor detects critical regional failure metrics
  2. Action Group triggers a Logic App or Automation Runbook
  3. The Runbook executes failover via PowerShell (as demonstrated in previous code)
  4. The Runbook updates DNS and notifies the team after completion

Warning: fully automatic failover without human approval is suitable only for workloads where RTO is so critical that any human delay is unacceptable. For most scenarios, minimal human approval is recommended to avoid unnecessary failovers due to false positives.
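
One common way to reduce false positives, shown here as an assumption rather than something ASR or Azure Monitor provides in this form, is to require several consecutive failed health probes before the trigger fires. The function name and the "ok"/"fail" probe values are hypothetical.

```bash
# Hypothetical debounce: trigger failover only after N consecutive failed probes.
# Arguments: required consecutive failures, then probe results in time order ("ok"/"fail").
# Exit status is 0 when failover should be triggered.
should_trigger_failover() {
  local required="$1"
  shift
  local consecutive=0
  local result
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      consecutive=$(( consecutive + 1 ))
    else
      consecutive=0            # any healthy probe resets the streak
    fi
  done
  [ "$consecutive" -ge "$required" ]   # the streak must still be ongoing at the latest probe
}
```

`should_trigger_failover 3 ok fail fail fail` succeeds, while `should_trigger_failover 3 fail fail ok fail` does not, because the healthy probe in between suggests a transient blip rather than a regional outage.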


13. Final Summary​

What it is: process of activating replicated infrastructure in the destination region when the source region fails or is evacuated in a planned manner, with three distinct execution modalities.

Essential points:

  • There are three types of failover: Test (non-destructive, isolated network), Planned (zero RPO, VMs shut down at source) and Unplanned (emergency, may have data loss)
  • Commit finalizes failover and is mandatory before Reprotect; after Commit, failover cannot be reverted
  • Reprotect reverses replication direction and is necessary to enable Failback
  • Test Failover uses an isolated network and should be cleaned up after validation
  • Failover doesn't update DNS automatically; this step must be explicitly planned
  • Recovery Plans ensure boot order and allow automatic actions between groups

Critical differences between failover types:

| Aspect | Test Failover | Planned Failover | Unplanned Failover |
| --- | --- | --- | --- |
| Production impact | None | Source VMs shut down | Depends on scenario |
| Data loss | None | Zero (synchronizes first) | Equivalent to RPO |
| Source needs to be accessible | No | Yes | No |
| Requires Commit | No (uses cleanup) | Yes | Yes |
| Typical use | DR tests | Planned maintenance | Actual disaster |
| Replication during operation | Continues normally | Pauses for synchronization | Stops after execution |

What needs to be remembered for AZ-104:

  • Test Failover doesn't affect production and doesn't require Commit; uses cleanup after validation
  • Planned Failover requires source VMs shut down and ensures zero RPO
  • After Commit, Reprotect reverses direction: destination starts replicating to source
  • DNS is not automatically updated by ASR; plan this step
  • Recovery Plans allow orchestrating failover of multiple VMs with order and automation
  • ASR vault must be in the destination region (reinforcing previous concept)