Theoretical Foundation: Perform a Failover to a Secondary Region by Using Site Recovery

1. Initial Intuition

In the previous topic, you learned to configure Azure Site Recovery: create the vault in the target region, enable replication, define policies, and structure Recovery Plans. All this configuration exists for a single moment: when you need to trigger the failover.

Failover is the act of activating your infrastructure in the target region. Consider this analogy: a building has an emergency generator installed. Configuring ASR is like installing and testing the generator. Executing the failover is like turning on the generator when the main power goes out.

But there's an important subtlety: there are different situations that require different types of failover. Turning on the generator for a scheduled test is different from activating it during an emergency blackout. In ASR, this distinction translates into three types of failover, each with distinct behavior, impact, and prerequisites.

In practice, failover serves to:

Recover operations when the primary region is unavailable (emergency)
Migrate workloads in a planned manner to another region (maintenance)
Validate that the DR process works as expected (testing)

2. Context

Failover is the most critical operation in the ASR lifecycle. All the configuration work, continuous replication, and monitoring converges to ensure that, at this moment, everything works correctly.

Failover is not an isolated operation. It's part of a complete cycle that includes the previous state (active replication) and the subsequent state (Commit, Reprotect, and Failback). Understanding the complete cycle is what differentiates an operator who knows how to execute a failover from one who knows how to manage the end-to-end DR process.

3. Building the Concepts

3.1 The three types of failover in detail

Test Failover

Test Failover creates temporary VMs in the target region connected to an isolated test network that you specify. It's completely non-destructive:

Replication continues normally during and after the test
Production VMs at the source are not affected
Test VMs are temporary and should be cleaned up after validation
Does not change the replication status of the protected item

Test Failover is the most important routine operation in the DR process because it's the only way to verify, without touching production, that the actual failover will work when needed.

Planned Failover

Planned Failover is executed when you know in advance that the primary region will need to be evacuated. Characteristics:

Synchronizes the latest changes from source to target before creating VMs
Zero RPO: no data loss, as all changes are synchronized before cutover
Requires source VMs to be shut down at the time of failover
Used for planned region maintenance, definitive migration, or infrastructure updates

Unplanned Failover (simply "Failover")

The standard Failover is the emergency failover, triggered when the primary region is unavailable or when a quick decision is needed. Characteristics:

Can be executed even with the source inaccessible
Uses the selected recovery point (there may be data loss equivalent to RPO)
The source doesn't need to be accessible nor VMs shut down
This is the failover you'll execute in a real disaster scenario

3.2 Recovery point selection

In Failover (unplanned), you need to choose which recovery point to use. The options are:

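Latest (lowest RPO): first processes all data that has already been sent to Site Recovery, then fails over to the resulting recovery point. Minimizes data loss, at the cost of some extra processing time. It's the recommended option for most emergency scenarios.

Latest app-consistent: uses the most recent application-consistent recovery point. May be older than "Latest," but ensures consistency for databases and transactional applications.

Latest crash-consistent (shown in the portal as "Latest processed"): uses the most recent recovery point that Site Recovery has already processed, which is typically crash-consistent. No time is spent processing pending data, so failover starts faster.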

Custom: allows manual selection of a specific point. Useful when you know data was corrupted after a certain time and want to restore to a specific previous state.
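
A Custom selection can also be scripted. The sketch below shows one possible way to pick the newest retained recovery point taken before a known corruption time. `Select-RecoveryPointBefore` is a hypothetical helper, not an ASR cmdlet, and the sample objects only imitate the `RecoveryPointTime` property found on the output of `Get-AzRecoveryServicesAsrRecoveryPoint`; the names and timestamps are made up for illustration.

```powershell
# Hypothetical helper: returns the newest recovery point strictly older than $Before.
function Select-RecoveryPointBefore {
  param(
    [Parameter(Mandatory)] [object[]] $RecoveryPoints,
    [Parameter(Mandatory)] [datetime] $Before
  )
  $RecoveryPoints |
    Where-Object { $_.RecoveryPointTime -lt $Before } |
    Sort-Object -Property RecoveryPointTime -Descending |
    Select-Object -First 1
}

# Sample data standing in for Get-AzRecoveryServicesAsrRecoveryPoint output.
$points = @(
  [pscustomobject]@{ Name = 'rp-0900'; RecoveryPointTime = [datetime]'2025-03-14T09:00:00' }
  [pscustomobject]@{ Name = 'rp-1000'; RecoveryPointTime = [datetime]'2025-03-14T10:00:00' }
  [pscustomobject]@{ Name = 'rp-1100'; RecoveryPointTime = [datetime]'2025-03-14T11:00:00' }
)

# Corruption was noticed at 10:30, so the 10:00 point is the newest clean one.
$chosen = Select-RecoveryPointBefore -RecoveryPoints $points -Before ([datetime]'2025-03-14T10:30:00')
$chosen.Name
```

Against a live vault, the input would come from `Get-AzRecoveryServicesAsrRecoveryPoint -ReplicationProtectedItem $replicationItem`, and the chosen point could be passed to the `-RecoveryPoint` parameter of `Start-AzRecoveryServicesAsrUnplannedFailoverJob`.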

3.3 The Commit process

After failover (planned or unplanned), VMs are running in the target region, but ASR still maintains the association with the original recovery point. Commit finalizes the failover:

Marks the failover as permanent
Deletes the recovery points retained for the item
Frees the item to enter the Reprotect process

Until you Commit, you can use Change recovery point to fail over again to a different recovery point if validation reveals a problem. After Commit, the recovery points are deleted and this is no longer possible. Returning to the source region is a separate step, done through Reprotect and Failback.

3.4 Reprotect and Failback

After Commit, VMs are running in the target region but without active protection. To reestablish protection and enable return to the source region, you execute Reprotect.

Reprotect reverses the replication direction: the target region becomes the new source, and the source region (when it recovers) becomes the new target. After Reprotect is active and synchronization is complete, you can execute Failback to return VMs to the original region.

4. Structural View

[Diagram: structural view of the failover process]

5. Practical Operation

Complete flow of an emergency failover

Phase 1: Detection and decision

The primary region (e.g., Brazil South) shows severe degradation. You confirm in the Azure portal that the region is experiencing an incident, then decide to trigger failover.

Phase 2: Vault access

Since the vault is in the target region (e.g., East US 2), it's accessible even with source region problems. Access the vault via portal URL or PowerShell/CLI.

Phase 3: Failover execution

In the vault, access Site Recovery > Replicated Items
Select the VM or execute via Recovery Plan (recommended for multiple VMs)
Click Failover
Select the recovery point
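Check "Shut down machine before beginning failover" if the source is still accessible, so Site Recovery attempts to shut down the source VM first. If the source is inaccessible, leave it unchecked, since the shutdown attempt cannot succeed.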
Confirm

Phase 4: Monitoring

Failover creates VMs in the target region. The process includes:

Creating managed disks from replicas
VM provisioning with predefined size and configurations
Operating system boot

Typical failover time for an Azure VM is 15 to 30 minutes, depending on disk size and configurations.

Phase 5: Post-failover validation

Before Commit, verify:

VM started correctly
Critical services are running
Network connectivity is functioning
Critical data is intact

Phase 6: Commit

After positive validation, execute Commit to finalize the failover. If validation fails, you can still use Change recovery point to try a different recovery point, as long as Commit has not been executed.
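
The validation in Phase 5 can be expressed as a simple gate in front of Commit. The following is a minimal sketch, not an ASR feature: `Test-FailoverReadiness` is a hypothetical helper, and the three checks are placeholders that a real runbook would replace with probes against the failed-over VM.

```powershell
# Hypothetical helper: runs each named check and reports whether all of them passed.
function Test-FailoverReadiness {
  param([Parameter(Mandatory)] [System.Collections.IDictionary] $Checks)
  $results = foreach ($name in $Checks.Keys) {
    $check = $Checks[$name]
    $passed = $false
    # A check that throws is treated as failed.
    try { $passed = [bool](& $check) } catch { $passed = $false }
    [pscustomobject]@{ Check = $name; Passed = $passed }
  }
  [pscustomobject]@{
    ReadyToCommit = -not ($results | Where-Object { -not $_.Passed })
    Results       = $results
  }
}

# Placeholder checks; the last one simulates a failed validation.
$checks = [ordered]@{
  'VM started'       = { $true }
  'Service responds' = { $true }
  'Data spot check'  = { $false }
}

$report = Test-FailoverReadiness -Checks $checks
$report.Results | Format-Table -AutoSize
"Ready to commit: $($report.ReadyToCommit)"
```

Keeping the checks in a named, ordered table makes the validation criteria explicit, so the Commit decision does not depend on what the operator remembers under pressure.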

Non-obvious behaviors

Failover doesn't update DNS automatically: after failover, VMs are running in the target region with different IPs. DNS, Load Balancer, and any other routing service still point to the source. You need to update them manually (or via automation) to redirect traffic to the target VMs.

"Shut down machine before beginning failover" is not mandatory: this option makes Site Recovery attempt to shut down the source VM before starting failover to ensure consistency. If the source is inaccessible, leave it unchecked, since the attempt cannot succeed. If the source is accessible, checking it reduces the possibility of split-brain (two instances of the same system running simultaneously).

The recovery point can be changed before Commit: while Commit hasn't been executed, you can use Change recovery point to fail over again to a different point. After Commit, the recovery points are deleted and the choice is final for that cycle.

After Commit, source VM is not automatically deleted: the VM in the source region (if it still exists) remains in the state it was in. You need to manage decommissioning manually.

Test Failover creates resources that generate cost: VMs created by Test Failover consume cost until cleaned up. If you execute a Test Failover and forget to clean up, you'll continue being charged.
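
The DNS behavior described above can be handled by a script. The following is a minimal sketch, assuming the application is published through an A record in an Azure DNS zone. The resource group, zone, record name, and IP address are hypothetical placeholders. It uses the Az.Dns cmdlets `Get-AzDnsRecordSet` and `Set-AzDnsRecordSet`; environments that route through Traffic Manager or Azure Front Door would need a different approach.

```powershell
# Hypothetical values; replace with your own zone and the failed-over endpoint's IP.
$resourceGroup = "rg-dns"
$zoneName      = "contoso.com"
$recordName    = "app"
$newIpAddress  = "203.0.113.10"

# Fetch the existing A record set that still points at the source region.
$recordSet = Get-AzDnsRecordSet `
  -ResourceGroupName $resourceGroup `
  -ZoneName $zoneName `
  -Name $recordName `
  -RecordType A

# Point the first record at the target-region address and save the change.
$recordSet.Records[0].Ipv4Address = $newIpAddress
Set-AzDnsRecordSet -RecordSet $recordSet
```

Clients keep resolving the old address until the record's TTL expires, so the TTL counts toward the effective RTO.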

6. Implementation Methods

6.1 Azure Portal

When to use: emergency failovers where speed and visual familiarity are critical; situations where less technical operators need to execute the process.

Executing Test Failover in the portal:

Access the vault in the target region
Site Recovery > Replicated Items
Select the item > click Test Failover
Select:
- Recovery Point: Latest (recommended)
- Azure virtual network: select the isolated test network
Click OK
Monitor in Jobs
After validation, click Cleanup test failover
Add notes about the test result and confirm cleanup

Executing Failover (unplanned) in the portal:

Access the vault in the target region
Site Recovery > Replicated Items
Select the item > click Failover
Read and confirm the impact warning
Select the recovery point
Configure "Shut down machines before beginning failover" according to the scenario
Click OK
Monitor the job until completion
Validate VMs in the target region
Click Commit to finalize

6.2 Azure PowerShell

When to use: failover automation in DR pipelines, failover of multiple items in sequence, integration with runbooks.

Test Failover via PowerShell:

# Configure vault context
$vault = Get-AzRecoveryServicesVault `
  -ResourceGroupName "rg-asr-eastus2" `
  -Name "rsv-asr-eastus2"

Set-AzRecoveryServicesAsrVaultContext -Vault $vault

# Helper: poll a Site Recovery job until it leaves its running states.
# (Az.RecoveryServices has no dedicated wait cmdlet for ASR jobs.)
function Wait-AsrJob {
  param([Parameter(Mandatory)] $Job)
  while ($Job.State -in @("NotStarted", "InProgress")) {
    Start-Sleep -Seconds 30
    $Job = Get-AzRecoveryServicesAsrJob -Job $Job
  }
  return $Job
}

# Get replicated item (fabric and container names depend on your environment)
$fabric = Get-AzRecoveryServicesAsrFabric -Name "asr-a2a-default-brazilsouth"
$container = Get-AzRecoveryServicesAsrProtectionContainer `
  -Fabric $fabric `
  -Name "asr-a2a-default-brazilsouth-container"
$replicationItem = Get-AzRecoveryServicesAsrReplicationProtectedItem `
  -ProtectionContainer $container `
  -Name "replication-vm-producao-01"

# Execute Test Failover
$testFailoverJob = Start-AzRecoveryServicesAsrTestFailoverJob `
  -ReplicationProtectedItem $replicationItem `
  -Direction PrimaryToRecovery `
  -AzureVMNetworkId "/subscriptions/{sub}/resourceGroups/rg-asr/providers/Microsoft.Network/virtualNetworks/vnet-teste-isolado"

# Wait for completion
$testFailoverJob = Wait-AsrJob -Job $testFailoverJob
Write-Output "Test Failover finished. Status: $($testFailoverJob.State)"

# Cleanup Test Failover after validation
$cleanupJob = Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
  -ReplicationProtectedItem $replicationItem `
  -Comment "Q1 2025 DR Test - Successfully validated"

$cleanupJob = Wait-AsrJob -Job $cleanupJob

Failover (unplanned) via PowerShell:

# Execute Failover with the latest recovery point.
# -PerformSourceSideAction is a switch: add it only when the source is
# still accessible and should be shut down first; omit it otherwise.
$failoverJob = Start-AzRecoveryServicesAsrUnplannedFailoverJob `
  -ReplicationProtectedItem $replicationItem `
  -Direction PrimaryToRecovery

# Wait for completion
$failoverJob = Wait-AsrJob -Job $failoverJob
Write-Output "Failover finished. Status: $($failoverJob.State)"

# After validation, execute Commit
$commitJob = Start-AzRecoveryServicesAsrCommitFailoverJob `
  -ReplicationProtectedItem $replicationItem

$commitJob = Wait-AsrJob -Job $commitJob
Write-Output "Commit finished. Status: $($commitJob.State)"

Reprotect after failover:

# After Commit, start Reprotect (reverses replication direction).
# Assumed to exist already (not created in this snippet):
#   $reverseMapping        - container mapping from the recovery container back to the primary container
#   $cacheStorageAccountId - cache storage account in the region where the VM now runs
#   $sourceResourceGroupId - resource group in the original region to replicate into
$reprotectJob = Update-AzRecoveryServicesAsrProtectionDirection `
  -AzureToAzure `
  -ReplicationProtectedItem $replicationItem `
  -ProtectionContainerMapping $reverseMapping `
  -LogStorageAccountId $cacheStorageAccountId `
  -RecoveryResourceGroupId $sourceResourceGroupId

$reprotectJob = Wait-AsrJob -Job $reprotectJob
Write-Output "Reprotect job finished. Status: $($reprotectJob.State)"

6.3 Azure CLI

When to use: status checks and simple operations in bash scripts.

# Requires the site-recovery CLI extension: az extension add --name site-recovery
# List replicated items with status
az site-recovery protected-item list \
  --resource-group rg-asr-eastus2 \
  --vault-name rsv-asr-eastus2 \
  --fabric-name asr-a2a-default-brazilsouth \
  --protection-container asr-a2a-default-brazilsouth-container \
  --output table

# Check jobs in progress
az site-recovery job list \
  --resource-group rg-asr-eastus2 \
  --vault-name rsv-asr-eastus2 \
  --filter "status eq 'InProgress'" \
  --output table

For complex operations such as failover with specific parameters, PowerShell offers more complete coverage and is the recommended choice. The ASR CLI extension covers fewer advanced features.

6.4 Recovery Plans via Portal

For multiple VMs, Recovery Plan is the recommended method. In the portal:

Executing failover via Recovery Plan:

Vault > Site Recovery > Recovery Plans
Select the Recovery Plan
Click Failover
Select the recovery point (Latest recommended)
The plan executes groups in sequence, respecting the configured order
Monitor each group in Jobs
After completion and validation, execute Commit on the entire Recovery Plan

Critical advantage: Recovery Plan ensures the database starts before application servers, which start before web servers. Without this, application servers might try to connect to a database that's not yet available, causing cascading failures.
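
The same Recovery Plan failover can be triggered from PowerShell. The following is a minimal sketch, assuming the vault context has already been set with `Set-AzRecoveryServicesAsrVaultContext`; the plan name `rp-prod-stack` is a hypothetical placeholder.

```powershell
# "rp-prod-stack" is a hypothetical Recovery Plan name.
$recoveryPlan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-prod-stack"

# Fail over every VM in the plan, group by group, in the configured order.
$job = Start-AzRecoveryServicesAsrUnplannedFailoverJob `
  -RecoveryPlan $recoveryPlan `
  -Direction PrimaryToRecovery

# Poll until the job leaves its running states.
while ($job.State -in @("NotStarted", "InProgress")) {
  Start-Sleep -Seconds 30
  $job = Get-AzRecoveryServicesAsrJob -Job $job
}
Write-Output "Recovery Plan failover finished with state: $($job.State)"

# Only after validating the workloads in the target region:
# Start-AzRecoveryServicesAsrCommitFailoverJob -RecoveryPlan $recoveryPlan
```

The Commit call is left commented out on purpose, so that validation stays an explicit step between failover and Commit.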

7. Control and Security

RBAC for failover operations

Operation	Minimum Role
Execute Test Failover	Site Recovery Operator
Execute Failover (unplanned)	Site Recovery Operator
Execute Planned Failover	Site Recovery Operator
Execute Commit	Site Recovery Operator
Execute Reprotect	Site Recovery Contributor
Execute Failback	Site Recovery Operator
Create/modify Recovery Plans	Site Recovery Contributor

Protection against accidental failover

To prevent accidental failover execution in production, consider:

Vault locks: a CanNotDelete lock doesn't prevent failover operations, but a ReadOnly lock prevents any modification, including failover. Use carefully; ReadOnly locks on ASR vaults can prevent DR operations in emergencies.
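Multi-User Authorization (MUA): Resource Guard adds a second approver for critical vault operations. Confirm in the current documentation which operations it protects before relying on it for failover, since its protected operations have centered on Azure Backup settings.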
Audit alerts: configure Azure Monitor to alert on any failover operation initiated

Data consistency in failover

When selecting the recovery point, consider the workload type:

Workload	Recommended Point	Reason
Database servers (for example, SQL)	Latest app-consistent	Ensures transactional integrity
Domain controllers	Latest app-consistent	Critical for AD consistency
Application with message queue	Latest app-consistent	Queue state needs to be consistent
Stateless web servers	Latest	No state to protect; any recent point is acceptable
File servers	Latest or Latest crash-consistent	Files don't need app consistency and can be verified after restore

8. Decision Making

Which type of failover to use

Situation	Failover Type	Reason
Quarterly DR validation	Test Failover	Doesn't affect production; allows complete validation
Scheduled region maintenance	Planned Failover	Zero RPO; shuts down source VMs first
Primary region inaccessible due to disaster	Unplanned Failover	Source inaccessible; maximum urgency
Definitive migration to another region	Planned Failover	Controlled, no data loss
Suspected recent data corruption	Custom Recovery Point	Allows selecting point prior to corruption
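
The table above can be read as a short decision procedure. The sketch below encodes it as a hypothetical helper, `Get-FailoverType`, whose parameter names and evaluation order are assumptions made for illustration rather than an ASR API.

```powershell
# Hypothetical helper that mirrors the decision table; not part of any Azure module.
function Get-FailoverType {
  param(
    [switch] $IsDrill,              # DR validation exercise
    [switch] $SuspectedCorruption,  # recent data is believed to be corrupted
    [switch] $IsPlannedEvent        # scheduled maintenance or definitive migration
  )
  if ($IsDrill)             { return 'Test Failover' }
  if ($SuspectedCorruption) { return 'Failover with a Custom recovery point' }
  if ($IsPlannedEvent)      { return 'Planned Failover' }
  return 'Unplanned Failover'
}

Get-FailoverType -IsDrill
Get-FailoverType -IsPlannedEvent
Get-FailoverType -SuspectedCorruption
Get-FailoverType
```

Checking the drill case first keeps a test from ever turning into a real failover, which is the costliest mistake among these options.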

Individual failover vs Recovery Plan

Scenario	Approach	Reason
Single independent VM	Individual failover	Simpler; no dependencies
Application stack with dependencies	Recovery Plan	Ensures boot order
Multiple VMs with no dependencies between them	Individual failover in parallel	Faster than sequential Recovery Plan
Complete production environment	Recovery Plan	Orchestration, automatic actions and auditing

Immediate commit vs validation before Commit

Situation	Approach	Reason
Critical emergency, minimal time for validation	Commit right after a basic smoke check	Releases reverse replication as soon as possible
Planned failover with maintenance window	Validate extensively before Commit	Recovery point can still be changed if something fails
Test Failover (doesn't reach Commit)	Cleanup without Commit	Test Failover doesn't use Commit

9. Best Practices

Document the DR runbook: create a step-by-step document for failover, including approval contacts, action sequence, validation criteria, and rollback procedures. In an emergency, no one should need to improvise.

Measure and record actual RTO: execute timed Test Failovers and record the total time for each step (VM creation, service startup, validation, DNS update). Use this data to communicate actual RTO to the business.

Update DNS as part of failover: configure the DNS update process (via Traffic Manager, Azure Front Door or manual registration) as an explicit step in the Recovery Plan or runbook. DNS not updated is one of the most common causes of inflated RTO.

Never use the same production network in Test Failover: always specify an isolated test network (without routing to production) to avoid IP conflicts and accidental communication with production systems.

Execute Reprotect immediately after Commit: after an emergency failover, the new destination region is left without DR protection. Execute Reprotect as quickly as possible to reestablish resilience.

Validate Recovery Plan with every infrastructure change: whenever a new VM is added to the environment or network topology is changed, review and update Recovery Plans.
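
Measuring actual RTO, as recommended above, is easier when each step of the drill is timed by the script that runs it. The following is a minimal sketch: `Measure-DrStep` is a hypothetical helper, and the `Start-Sleep` calls stand in for the real work of each step.

```powershell
# Hypothetical helper: runs one DR step and records how long it took.
function Measure-DrStep {
  param(
    [Parameter(Mandatory)] [string] $Name,
    [Parameter(Mandatory)] [scriptblock] $Action
  )
  $timer = [System.Diagnostics.Stopwatch]::StartNew()
  & $Action | Out-Null
  $timer.Stop()
  [pscustomobject]@{ Step = $Name; Seconds = [math]::Round($timer.Elapsed.TotalSeconds, 1) }
}

# Placeholder actions; a real drill would start the failover job, update DNS, and so on.
$timings = @(
  Measure-DrStep -Name 'VM creation' -Action { Start-Sleep -Milliseconds 200 }
  Measure-DrStep -Name 'DNS update'  -Action { Start-Sleep -Milliseconds 100 }
)

$totalSeconds = ($timings | Measure-Object -Property Seconds -Sum).Sum
$timings | Format-Table -AutoSize
"Measured RTO: $totalSeconds s"
```

Recording per-step timings rather than a single total shows which step dominates the RTO and is worth automating first.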

10. Common Errors

Error: executing Test Failover on production network Why it happens: the operator selects the production VNet for convenience, without realizing the impact. How to avoid: create a VNet dedicated to DR tests, without peering with production, and use it exclusively for Test Failovers.

Error: forgetting to clean up Test Failover Why it happens: the test is executed, validated and the operator moves to other tasks without cleaning up. How to avoid: include cleanup in the test runbook as a mandatory step. Configure alerts for VMs with test tags that have been running for more than 24 hours.

Error: not updating DNS after failover Why it happens: the operator considers failover complete when VMs are running, without realizing traffic is still going to the source. How to avoid: include DNS update as an explicit step in the runbook and/or in the Recovery Plan with automated action.

Error: doing Commit before validating VMs Why it happens: pressure for speed in emergency situations. How to avoid: even in emergencies, basic validation (VM started? Main service responding?) takes less than 5 minutes and can save hours of rework if something is wrong.

Error: not executing Reprotect after failover Why it happens: the operator is focused on reestablishing service and doesn't think about protecting the new state. How to avoid: include Reprotect as the last mandatory step of the DR runbook. After a failover, you don't have DR until Reprotect is active.

Error: selecting wrong recovery point Why it happens: in emergency situations, the operator accepts the default without evaluating the scenario. How to avoid: for transactional workloads, always prefer the app-consistent point. For data corruption scenarios, evaluate Custom to select a point prior to corruption.
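
The forgotten-cleanup error can be caught with a periodic check. The sketch below is one possible approach, assuming test VMs can be recognized by a `-test` name suffix (adjust this to the tag or naming rule your environment uses). `Get-StaleTestVm` is a hypothetical helper, and the sample objects stand in for a VM inventory that exposes `Name` and `TimeCreated`.

```powershell
# Hypothetical helper: returns test VMs that have existed longer than $MaxAgeHours.
function Get-StaleTestVm {
  param(
    [Parameter(Mandatory)] [object[]] $Vms,
    [datetime] $Now = (Get-Date),
    [int] $MaxAgeHours = 24
  )
  $Vms | Where-Object {
    $_.Name -like '*-test' -and ($Now - $_.TimeCreated).TotalHours -gt $MaxAgeHours
  }
}

# Sample inventory; the names and timestamps are made up for illustration.
$now = [datetime]'2025-03-15T12:00:00'
$vms = @(
  [pscustomobject]@{ Name = 'vm-web-01-test'; TimeCreated = [datetime]'2025-03-13T09:00:00' }
  [pscustomobject]@{ Name = 'vm-db-01-test';  TimeCreated = [datetime]'2025-03-15T08:00:00' }
  [pscustomobject]@{ Name = 'vm-web-01';      TimeCreated = [datetime]'2025-01-01T00:00:00' }
)

# Only the test VM created more than 24 hours before $now is reported.
$stale = Get-StaleTestVm -Vms $vms -Now $now
$stale.Name
```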

11. Operation and Maintenance

Post-failover monitoring

After a failover, monitoring changes perspective. You start monitoring VMs in the destination region as if they were the new production environment:

What to monitor	Where to check	Frequency
Reprotect status after failover	Site Recovery > Replicated Items	Every hour until Protected
RPO after Reprotect	Replicated Items > RPO	Continuous
VM health in new region	Azure Monitor / VM Insights	Continuous
Failed ASR jobs	Site Recovery > Jobs	Daily
Replication alerts	Azure Monitor Alerts	Immediate

Detailed Failback process

Failback is the return to the source region after Reprotect is active. The process is identical to failover, but in reverse direction:

Verify that Reprotect is in Protected state and RPO is healthy
Execute a Planned Failover (RecoveryToPrimary) to return without data loss
Wait for VM creation in the source region
Validate functionality at source
Execute Commit
Execute Reprotect again (now reestablishing source→destination replication)
Verify that state returned to normal with active replication

Operational limits relevant to failover

Limit	Value
Maximum time between consecutive Test Failovers	No formal limit
Maximum number of VMs in a Recovery Plan	100 VMs per plan
Number of groups in a Recovery Plan	3 default groups; extensible with Automation
Recovery point retention time	Up to 15 days (set in the replication policy)
Number of script actions per Recovery Plan	No documented limit

12. Integration and Automation

Recovery Plan with Azure Automation Runbook

The Recovery Plan can execute Azure Automation Runbooks as steps between VM groups. This allows automating actions like:

Update DNS records in Azure DNS or Traffic Manager
Scale network resources (e.g., increase VPN gateway capacity)
Notify teams via Microsoft Teams or email
Execute service validation scripts

Integration with Azure Monitor for automatic failover

For fully automated DR scenarios (without human intervention), it's possible to create a flow that detects regional failure and triggers failover automatically:

Azure Monitor detects critical regional failure metrics
Action Group triggers a Logic App or Automation Runbook
The Runbook executes failover via PowerShell (as demonstrated in previous code)
The Runbook updates DNS and notifies the team after completion

Warning: fully automatic failover without human approval is suitable only for workloads where RTO is so critical that any human delay is unacceptable. For most scenarios, minimal human approval is recommended to avoid unnecessary failovers due to false positives.
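
One common way to reduce false positives in an automated flow is to require several consecutive failed health probes before triggering anything. The sketch below illustrates that idea only. `Test-ShouldTriggerFailover` is a hypothetical helper and the threshold is an arbitrary example, not a recommended value.

```powershell
# Hypothetical guard: trigger only when the most recent N probes all failed.
function Test-ShouldTriggerFailover {
  param(
    [Parameter(Mandatory)] [bool[]] $ProbeResults,   # oldest first; $true means healthy
    [int] $RequiredConsecutiveFailures = 5
  )
  # Not enough history yet to make a decision.
  if ($ProbeResults.Count -lt $RequiredConsecutiveFailures) { return $false }
  $recent = $ProbeResults | Select-Object -Last $RequiredConsecutiveFailures
  -not ($recent -contains $true)
}

# A single healthy probe inside the window blocks the trigger.
Test-ShouldTriggerFailover -ProbeResults @($false, $true, $false) -RequiredConsecutiveFailures 3

# Three consecutive failures at the end of the history allow it.
Test-ShouldTriggerFailover -ProbeResults @($true, $false, $false, $false) -RequiredConsecutiveFailures 3
```

A runbook could call a guard like this before invoking the failover cmdlets, and still route the final decision to a human approver for workloads that can tolerate the delay.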

13. Final Summary

What it is: process of activating replicated infrastructure in the destination region when the source region fails or is evacuated in a planned manner, with three distinct execution modalities.

Essential points:

There are three types of failover: Test (non-destructive, isolated network), Planned (zero RPO, VMs shut down at source) and Unplanned (emergency, may have data loss)
Commit finalizes failover and is mandatory before Reprotect; after Commit, the recovery points are deleted and the recovery point can no longer be changed
Reprotect reverses replication direction and is necessary to enable Failback
Test Failover uses an isolated network and should be cleaned up after validation
Failover doesn't update DNS automatically; this step must be explicitly planned
Recovery Plans ensure boot order and allow automatic actions between groups

Critical differences between failover types:

Aspect	Test Failover	Planned Failover	Unplanned Failover
Production impact	None	Source VMs shut down	Depends on scenario
Data loss	None	Zero (synchronizes first)	Equivalent to RPO
Source needs to be accessible	No	Yes	No
Requires Commit	No (uses cleanup)	Yes	Yes
Typical use	DR tests	Planned maintenance	Actual disaster
Replication during operation	Continues normally	Pauses for synchronization	Stops after execution

What needs to be remembered for AZ-104:

Test Failover doesn't affect production and doesn't require Commit; uses cleanup after validation
Planned Failover requires source VMs shut down and ensures zero RPO
After Commit, Reprotect reverses direction: destination starts replicating to source
DNS is not automatically updated by ASR; plan this step
Recovery Plans allow orchestrating failover of multiple VMs with order and automation
ASR vault must be in the destination region (reinforcing previous concept)
