Troubleshooting Lab: Perform a Failover to a Secondary Region by Using Site Recovery
Diagnostic Scenarios
Scenario 1: Root Cause
An administrator attempts to initiate an unplanned failover of a VM named prod-app-vm to the secondary region after an outage in the primary region. The Recovery Services Vault is configured in the secondary region. The VM has an operating system disk and two data disks.
In the Azure portal, when the administrator clicks "Failover", the wizard displays the following message and does not allow the operation to proceed:
Error: The virtual machine 'prod-app-vm' is not in a protected state.
Replication health: Critical
Last successful recovery point: 72 hours ago
Pending actions: None
The administrator verifies that the source VM is still running in the primary region and that the Recovery Services Vault exists and is accessible. The cache storage account used by replication is in the same region as the source VM and was created six months ago. The administrator also notices that the subscription reached the available core limit in the secondary region three days ago, but the issue was resolved yesterday with an approved quota increase.
What is the root cause preventing the failover?
A) The core quota limit in the secondary region is still active, preventing resource creation during failover.
B) Replication is in a critical state, possibly due to a failure in data transfer to the cache storage account, so no recent valid recovery points exist.
C) The Recovery Services Vault is in the wrong region; it should be in the source region, not the secondary.
D) The source VM is running, and Site Recovery requires it to be shut down before initiating an unplanned failover.
Scenario 2: Action Decision
The operations team successfully executed an unplanned failover of the entire application tier to the secondary region. The failover was completed 40 minutes ago. Systems are operational and users have been redirected. The root cause of the failure in the primary region has been identified: a firmware update on a set of physical hosts caused temporary unavailability. Microsoft confirmed that the primary region will return to normal in approximately two hours.
The team wants to ensure that, as soon as the primary region is restored, they can return workloads to it with active replication protection.
The identified cause is that the failover was executed but has not yet been committed.
What is the correct action to take at this moment?
A) Execute failback immediately to the primary region while it is still in the recovery process, taking advantage of the pending failover state to revert without needing to commit.
B) Execute the "Commit" of the failover now to finalize the VM state in the secondary region and then wait for primary region recovery to execute "Re-protect".
C) Execute "Re-protect" immediately, without performing the commit, to anticipate the reverse replication configuration while the primary region recovers.
D) Discard the failover ("Discard Failover") to maintain the original state and wait for complete primary region return before any action.
Scenario 3: Root Cause
A VM named db-primary-vm has been replicating successfully to the secondary region for three months. The administrator performed a successful test failover the previous week. Today, when the actual (unplanned) failover is executed, the VM comes up in the secondary region, but the application cannot connect to the database.
The administrator verifies the following:
# Output from az vm list-ip-addresses command in secondary region
{
  "virtualMachine": {
    "name": "db-primary-vm",
    "network": {
      "privateIpAddresses": ["10.1.0.5"]
    }
  }
}
The original VM configuration in the primary region was:
| Property | Primary Region | Secondary Region (observed) |
|---|---|---|
| Private IP | 10.0.0.5 | 10.1.0.5 |
| VM name | db-primary-vm | db-primary-vm |
| Virtual network | vnet-primary | vnet-secondary |
| Security group | nsg-db | nsg-db |
The application connection string uses the fixed IP address 10.0.0.5 in the configuration file. The administrator notes that the NSG applied to the VM in the secondary region is the same one used in the primary and that port 1433 is allowed.
What is the root cause of the connectivity failure?
A) The NSG inherited from the primary region is blocking inbound connections in the secondary virtual network due to route rule differences.
B) The VM's IP address in the secondary region is different from the IP configured in the application connection string, which points to the original IP from the primary region.
C) The secondary virtual network is not peered with the primary network, preventing the application from reaching the database.
D) The test failover from the previous week corrupted the replication state, causing the original IP not to be preserved in the actual failover.
Scenario 4: Diagnostic Sequence
An administrator receives an alert indicating that the replication health of a group of VMs is in a warning state in the Recovery Services Vault. No failover has been executed. The VMs continue running in the primary region.
The available investigation steps are listed out of order:
1. Verify whether the cache storage account has available space and is accessible.
2. Consult the vault's "Replication Health" dashboard to identify which specific VMs are in a warning state.
3. Check the affected VM's replication events in the vault to identify specific error messages.
4. Confirm that the mobility agent installed on the VM is a version compatible with the current vault configuration.
5. Verify outbound connectivity from the source VM to the Site Recovery URLs required by Microsoft.
What is the correct investigation sequence?
A) 1 → 5 → 2 → 3 → 4
B) 2 → 3 → 5 → 1 → 4
C) 4 → 2 → 3 → 1 → 5
D) 2 → 1 → 5 → 4 → 3
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive clue is in the portal output: Replication health: Critical and Last successful recovery point: 72 hours ago. This indicates that replication stopped transferring data to the cache storage account, making recovery points obsolete. Site Recovery blocks failover when there is no valid and recent recovery point, as there is no guarantee of data consistency.
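The reasoning above can be sketched as a small readiness check. This is only an illustration of the diagnostic logic, not Site Recovery's actual implementation: the `ReplicationStatus` type, the `failover_blockers` function, and the 24-hour threshold are all hypothetical names and values introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical threshold: the source does not state the age at which a
# recovery point is considered too old; 24 hours is an assumption.
MAX_RECOVERY_POINT_AGE_HOURS = 24


@dataclass
class ReplicationStatus:
    health: str                           # "Normal", "Warning", or "Critical"
    last_recovery_point_age_hours: float  # hours since the last good point


def failover_blockers(status: ReplicationStatus) -> list[str]:
    """Return the reasons a failover should not proceed (empty if none)."""
    blockers = []
    if status.health == "Critical":
        blockers.append("replication health is Critical")
    if status.last_recovery_point_age_hours > MAX_RECOVERY_POINT_AGE_HOURS:
        blockers.append(
            f"latest recovery point is {status.last_recovery_point_age_hours:g} hours old"
        )
    return blockers


# Values taken from the Scenario 1 portal output.
scenario_1 = ReplicationStatus(health="Critical", last_recovery_point_age_hours=72)
for reason in failover_blockers(scenario_1):
    print(reason)
```

Note that the quota event does not appear as an input at all: it was resolved before the failover attempt, so it has no bearing on the decision.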
The information about the core quota limit (alternative A) is deliberately irrelevant: the statement confirms that the problem was resolved the previous day. Focusing on it is a classic anchoring error, in which the diagnosis fixates on the most recent negative event.
Alternative C is incorrect: the Recovery Services Vault must not be in the source region, because it has to remain reachable during a primary-region outage, so placing it in the secondary region (the replication destination) is correct. Alternative D reflects a frequent misconception: an unplanned failover does not require the source VM to be shut down beforehand, unlike a planned failover.
The most dangerous distractor is A, because the quota problem was real and recent, leading the administrator to act on quota infrastructure instead of investigating the replication state.
Answer Key: Scenario 2
Answer: B
The statement explicitly declares that the commit has not yet been executed. The commit is the step that formally finalizes the failover, consolidates the VM state in the secondary region, and discards previous recovery points. Without it, the failover remains in a pending state and "Re-protect" cannot be initiated.
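The required ordering (failover, then commit, then re-protect) can be modeled as a small state machine. This is a sketch of the sequence described in this explanation only; the state names, the `TRANSITIONS` table, and the `apply` function are hypothetical and do not correspond to any Site Recovery API.

```python
from enum import Enum


class State(Enum):
    PROTECTED = "protected"                  # replicating primary -> secondary
    FAILOVER_PENDING = "failover pending"    # failed over, commit not yet done
    COMMITTED = "committed"                  # failover finalized in secondary
    REVERSE_PROTECTED = "reverse protected"  # replicating secondary -> primary


# Allowed transitions, following the order described in the explanation.
TRANSITIONS = {
    (State.PROTECTED, "failover"): State.FAILOVER_PENDING,
    (State.FAILOVER_PENDING, "commit"): State.COMMITTED,
    (State.COMMITTED, "re-protect"): State.REVERSE_PROTECTED,
}


def apply(state: State, action: str) -> State:
    """Return the next state, or raise ValueError if the action is out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"'{action}' is not allowed while {state.value}") from None


# The team's current position in Scenario 2, followed by the correct next steps.
state = State.FAILOVER_PENDING
state = apply(state, "commit")
state = apply(state, "re-protect")
print(state.value)
```

In this model, calling `apply(State.FAILOVER_PENDING, "re-protect")` raises a `ValueError`, which mirrors the out-of-order attempt described in alternative C.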
Alternative C represents the most common error in this context: trying to execute "Re-protect" without first performing the commit. Site Recovery does not allow this operation out of order; it will result in an error.
Alternative A is dangerous because the primary region is still unavailable. Executing failback to a region that is still recovering is a high-risk decision that can result in data loss or force a second unplanned failover.
Alternative D would discard all work already performed and return the VMs to the state prior to failover, which directly conflicts with the team's declared objective of maintaining operations in the secondary region and preparing reverse replication.
Answer Key: Scenario 3
Answer: B
The cause is directly visible in the comparison table: the VM's private IP in the secondary region is 10.1.0.5, while the application connection string points to 10.0.0.5, which was the IP in the primary region. Site Recovery does not guarantee IP address preservation when the destination subnet has a different addressing block.
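The address mismatch can be checked with Python's standard `ipaddress` module. The two host addresses come from the scenario; the `/24` subnet prefixes and the `can_retain` helper are assumptions introduced for illustration, since the source does not state the address ranges of vnet-primary or vnet-secondary.

```python
import ipaddress

# Hypothetical subnet prefixes: the source gives only the two host addresses,
# so the /24 ranges below are assumptions chosen to contain them.
primary_subnet = ipaddress.ip_network("10.0.0.0/24")
secondary_subnet = ipaddress.ip_network("10.1.0.0/24")

connection_string_ip = ipaddress.ip_address("10.0.0.5")  # from the app config file
failed_over_vm_ip = ipaddress.ip_address("10.1.0.5")     # from the CLI output


def can_retain(ip, target_subnet):
    """An address can be kept after failover only if the target subnet contains it."""
    return ip in target_subnet


print("Original IP fits in secondary subnet:",
      can_retain(connection_string_ip, secondary_subnet))
print("Connection string matches failed-over VM:",
      connection_string_ip == failed_over_vm_ip)
```

Under these assumed ranges, the original address falls outside the secondary subnet, so the failed-over VM necessarily receives a different address and the hard-coded connection string stops resolving to the database.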
The information about the test failover from the previous week (alternative D) is irrelevant and was included deliberately. A successful test failover does not change the IP that will be assigned in the actual failover and does not interfere with the destination network's IP assignment logic.
Alternative A is a plausible distractor, as NSGs can cause connectivity failures, but the statement confirms that port 1433 is allowed. Alternative C describes a real problem in other contexts, but there is no indication in the statement that the application tries to reach the primary network; it simply uses a fixed IP that no longer exists at that address.
The impact of not identifying this cause is critical: the administrator may spend hours investigating NSGs and network peering while the real cause is the static IP configuration in the application.
Answer Key: Scenario 4
Answer: B
The correct sequence is 2 → 3 → 5 → 1 → 4, which follows the diagnostic logic from general to specific:
- Step 2 identifies which VMs are affected before any technical investigation. Without this, all following steps are blind investigation.
- Step 3 examines the affected VM's events to obtain the specific error message, which guides the following steps.
- Step 5 verifies outbound connectivity, which is the most frequent cause of replication interruption in corporate environments with a restrictive proxy or firewall.
- Step 1 validates the cache storage account, which is checked after ruling out connectivity issues.
- Step 4 checks the mobility agent version. This is the most specific and time-consuming check, and the least likely cause of a sudden warning state in an environment that was working correctly.
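The ordering above can be expressed as a short driver that runs the checks in sequence and stops at the first one that reports a cause. This is a sketch only: `STEPS`, `SEQUENCE`, and `investigate` are hypothetical names, and the lambda checks are stand-ins for real inspections of the vault, network, and VM.

```python
# The five checks, keyed by the step numbers used in the question.
STEPS = {
    1: "Verify cache storage account space and accessibility",
    2: "Identify the affected VMs on the Replication Health dashboard",
    3: "Read the affected VM's replication events for error messages",
    4: "Confirm the mobility agent version is compatible",
    5: "Verify outbound connectivity to the Site Recovery URLs",
}

# Order from the answer key: general to specific.
SEQUENCE = [2, 3, 5, 1, 4]


def investigate(checks):
    """Run checks in SEQUENCE order and stop at the first that reports a cause.

    `checks` maps a step number to a callable that returns a cause string,
    or None when that step finds nothing conclusive.
    Returns (step, cause), or None if no step identifies a cause.
    """
    for step in SEQUENCE:
        cause = checks[step]()
        if cause is not None:
            return step, cause
    return None


# Simulated run: steps 2 and 3 only gather context, step 5 finds the problem.
simulated = {
    1: lambda: None,
    2: lambda: None,
    3: lambda: None,
    4: lambda: None,
    5: lambda: "outbound traffic to Site Recovery URLs is blocked",
}
result = investigate(simulated)
if result is not None:
    step, cause = result
    print(f"Step {step} ({STEPS[step]}): {cause}")
```

Because the loop returns as soon as a cause is found, the lower-priority checks (steps 1 and 4 in this simulated run) are never executed, which is the time saving the ordering is meant to deliver.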
Alternative C is the most dangerous distractor because it starts with agent version, which is a valid but low-priority check when the environment was working recently. This leads the administrator to a time-consuming and potentially unnecessary verification before investigating more likely causes.
Troubleshooting Tree: Perform a Failover to a Secondary Region by Using Site Recovery
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
To use this tree when facing a real problem, start at the root node and answer each question based on what you observe in the portal or logs. Follow the branch corresponding to your answer without skipping steps. Each path ends with an identified cause followed by a concrete action. Resisting the impulse to act before reaching a red node is the core discipline this tree reinforces.
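The traversal rule described above can be sketched as a small tree walker. The `Node` type and `walk` function are hypothetical, and the sample nodes are illustrative ones loosely based on Scenario 1; they are not a reproduction of the tree itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # "symptom", "question", "cause", or "action"
    text: str
    # Questions branch on the answer ("yes"/"no"); other nodes use "" for
    # their single unconditional successor. Action nodes have no branches.
    branches: dict = field(default_factory=dict)


def walk(root, answers):
    """Follow the tree from the root and return the text of each visited node."""
    remaining = iter(answers)
    node, path = root, [root.text]
    while node.branches:
        # Only question nodes consume an answer; no step is skipped.
        key = next(remaining) if node.kind == "question" else ""
        node = node.branches[key]
        path.append(node.text)
    return path


# Illustrative nodes loosely based on Scenario 1; not the document's actual tree.
restore = Node("action", "Investigate and restore replication before failing over")
stale = Node("cause", "Replication stopped; no recent recovery point", {"": restore})
other = Node("action", "Continue with the remaining branches of the full tree")
health = Node("question", "Is replication health Critical?", {"yes": stale, "no": other})
root = Node("symptom", "Failover wizard does not allow proceeding", {"": health})

for visited in walk(root, ["yes"]):
    print(visited)
```

The walker reaches an action node only after passing through the cause that precedes it, which matches the discipline of not acting before a cause has been identified.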