Troubleshooting Lab: Configure Azure Site Recovery for Azure resources

Diagnostic Scenarios

Scenario 1: Root Cause

An infrastructure team enabled replication of 12 Azure VMs from the East US region to West US using Azure Site Recovery (ASR). The VMs are distributed across three resource groups. The cache storage account was created in the source region with LRS redundancy. The Recovery Services vault is in the West US region.

After 72 hours, the responsible engineer observes that 4 of the 12 VMs show the following status in the portal:

VM: vm-prod-db-02
Replication status: Warning
Health: Replication hasn't generated any recovery point
in the last 24 hours.
Last app-consistent recovery point: 71h ago
Last crash-consistent recovery point: 23h ago

The other 8 VMs are healthy, with RPO within expected ranges. The engineer verifies that the 4 problematic VMs are all of size Standard_D8s_v3 with 2 TB Premium SSD disks each. The database team reports that these VMs have intensive write workloads, with frequent spikes above 80 MB/s per disk. The network between regions shows no alerts, and the ASR service status is normal in the Azure health dashboard.

What is the root cause of the problem observed on these 4 VMs?

A) The Recovery Services Vault is in the target region, which prevents generating app-consistent recovery points for VMs with Premium SSD disks.

B) The data churn rate of the VMs exceeds the limits supported by ASR, causing the cache storage account to be unable to process and transfer data at the speed necessary to generate new recovery points.

C) The cache storage account with LRS redundancy is causing throttling on write operations, as the LRS level doesn't support the throughput required by 2 TB Premium SSD disks.

D) The mobility agent installed on these 4 VMs is outdated and cannot capture disk changes above 50 MB/s, blocking recovery point creation.


Scenario 2: Action Decision

The operations team identified that the cause of replication failure in a group of VMs is the lack of adequate permissions on the Recovery Services Vault's managed identity over the cache storage account. The vault doesn't have the Storage Blob Data Contributor role assigned on this account.

The environment has the following constraints:

  • The incident is open with severity 2, requiring resolution within 2 hours
  • The affected VMs are production systems and cannot be restarted or have replication disabled at this time
  • The engineer has Contributor permission at the resource group level where the vault is located, but doesn't have Owner or User Access Administrator permission on the subscription or storage account resource group
  • A colleague with Owner permission on the subscription is available and online

What is the correct action to take at this moment?

A) Disable and re-enable replication of the affected VMs to force ASR to automatically recreate permissions on the storage account.

B) Request the colleague with Owner permission to assign the Storage Blob Data Contributor role to the vault's managed identity on the storage account, and wait for propagation before validating replication status.

C) Create a new cache storage account in the resource group where the engineer has Contributor permission and redirect replication to it.

D) Temporarily elevate their own permissions to Owner on the subscription via Microsoft Entra Privileged Identity Management (PIM) and perform the role assignment directly.


Scenario 3: Root Cause

An operations engineer executes a test failover of a critical VM to validate the recovery plan before an internal audit. The test failover completes without errors in the portal and the engineer marks the test as successful. Two days later, he notices the following in the ASR portal:

VM: vm-app-frontend-01
Replication status: Critical
Detail: Replication is suspended.
Test failover was not completed correctly.
Execute 'Clean up test failover' to resume replication.

The engineer confirms that the test VM has been shut down in the secondary region since yesterday. The isolated virtual network used in the test still exists. The source VM is functioning normally in production. The vault's activity logs show no cleanup operations recorded after the test failover.

The engineer states that he clicked "Stop test VM" in the portal after validation.

What is the root cause of the observed problem?

A) Clicking "Stop test VM" doesn't equate to executing the test failover cleanup operation in ASR. The mandatory Clean up test failover operation wasn't executed, and without it replication remains suspended.

B) The test VM was manually shut down in the secondary region, which corrupted ASR's internal state and suspended replication as a protection mechanism.

C) The isolated virtual network used in the test still exists, and ASR interprets this as a resource conflict that prevents replication resumption.

D) The test failover generated a consistency snapshot that wasn't automatically released, and ASR suspended replication to avoid inconsistency between the retained snapshot and new data from the source VM.


Scenario 4: Diagnostic Sequence

An administrator receives an alert reporting that an Azure VM protected by ASR shows a critical replication status. The administrator has never investigated this type of failure before and needs to follow a logical diagnostic sequence.

The available steps are:

  1. Check vault health events in the portal and identify the specific error code associated with the VM
  2. Confirm whether the source VM is powered on and accessible in the primary region
  3. Check if the cache storage account is available and without capacity or throttling alerts
  4. Access the ASR service health dashboard in Azure Status to rule out global service failure
  5. Analyze mobility agent logs inside the source VM to check for disk capture errors

What is the correct diagnostic sequence?

A) 2 → 4 → 1 → 3 → 5

B) 4 → 2 → 1 → 3 → 5

C) 1 → 3 → 2 → 5 → 4

D) 3 → 1 → 4 → 2 → 5


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The central clue is the combination of two data points: the 4 affected VMs are exactly those with intensive write workloads and spikes above 80 MB/s per disk, while the 8 healthy VMs don't have this profile. ASR has documented churn (data change rate) limits per disk and per VM. When these limits are consistently exceeded, the service cannot process and transfer the data accumulating in the cache storage account fast enough, so recovery points grow progressively older until the replication health degrades.

The irrelevant information in this scenario is the LRS redundancy of the cache storage account. The redundancy level doesn't affect cache throughput in the ASR context, so it isn't the variable that determines whether the churn can be absorbed. Alternative C is designed to draw the reader toward this deliberately inserted detail.

Alternative D is the most dangerous distractor: an outdated mobility agent is a real cause of ASR problems, but its typical symptom is installation failure or connectivity errors, not gradual degradation correlated with disk write rate.

Acting on D would lead the engineer to update agents without solving the real problem, losing critical hours while RPO keeps degrading.
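
The churn reasoning above can be sketched as a simple comparison of observed write rates against per-disk and per-VM limits. This is a minimal illustration, not an ASR tool: the two limit values are assumptions to be confirmed against the current ASR support matrix, `vm-web-01` is a hypothetical healthy VM, and the scenario's "80 MB/s per disk" is modeled as a single disk.

```python
from dataclasses import dataclass
from typing import List

# Assumed limits for illustration only; confirm against the ASR support matrix.
ASSUMED_DISK_LIMIT_MBPS = 20.0
ASSUMED_VM_LIMIT_MBPS = 54.0


@dataclass
class VmChurn:
    name: str
    disk_churn_mbps: List[float]  # sustained write rate per disk, in MB/s


def churn_violations(vm, disk_limit=ASSUMED_DISK_LIMIT_MBPS, vm_limit=ASSUMED_VM_LIMIT_MBPS):
    """Return the reasons why a VM exceeds the assumed churn limits (empty if none)."""
    reasons = []
    for index, rate in enumerate(vm.disk_churn_mbps):
        if rate > disk_limit:
            reasons.append(f"disk {index}: {rate} MB/s exceeds {disk_limit} MB/s per disk")
    total = sum(vm.disk_churn_mbps)
    if total > vm_limit:
        reasons.append(f"VM total: {total} MB/s exceeds {vm_limit} MB/s per VM")
    return reasons


db_vm = VmChurn("vm-prod-db-02", [80.0])   # write spikes taken from the scenario
web_vm = VmChurn("vm-web-01", [3.0, 2.5])  # hypothetical low-churn VM

for vm in (db_vm, web_vm):
    print(vm.name, churn_violations(vm) or "within assumed limits")
```

With these assumed values, only the database VM is flagged, which matches the pattern of 4 degraded and 8 healthy VMs in the scenario.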


Answer Key: Scenario 2

Answer: B

The cause is already identified in the statement: missing Storage Blob Data Contributor role on the vault's managed identity over the storage account. The critical constraint is that the engineer doesn't have permission to assign roles on this account. The correct action is to engage the colleague with Owner permission to perform the assignment, which is a simple operation targeted at the correct resource.

Alternative A is technically incorrect: disabling and re-enabling replication doesn't recreate managed identity permissions; this behavior doesn't exist in ASR. Additionally, the statement's constraint explicitly prohibits this action in production.

Alternative C ignores that redirecting replication to a new storage account requires disabling and re-enabling replication, which directly violates the stated constraint.

Alternative D may seem valid if the environment has PIM configured for Owner, but the statement doesn't mention such an eligible assignment for the engineer, and the quickest, most proportionate action is to engage the colleague who is already available. Using PIM to escalate one's own permissions when someone with the required access is online is a poor governance decision in this context.
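
The permission constraint behind this answer can be sketched as a scope-inheritance check. The snippet is a simplified model, not the Azure authorization engine: it only considers the built-in Owner and User Access Administrator roles (both include `Microsoft.Authorization/roleAssignments/write`), ignores custom roles and deny assignments, and all subscription, resource group, and storage account names are hypothetical.

```python
# Built-in roles considered here that are allowed to create role assignments.
ROLES_THAT_CAN_ASSIGN = {"Owner", "User Access Administrator"}


def can_assign_roles(held_roles_by_scope, target_scope):
    """True if a role held at the target scope or at a parent scope permits role assignment."""
    target = target_scope.lower()
    for scope, roles in held_roles_by_scope.items():
        parent = scope.lower().rstrip("/")
        applies = target == parent or target.startswith(parent + "/")
        if applies and ROLES_THAT_CAN_ASSIGN & set(roles):
            return True
    return False


# Hypothetical scopes for illustration.
cache_account = (
    "/subscriptions/sub-1/resourceGroups/rg-storage"
    "/providers/Microsoft.Storage/storageAccounts/cachesa"
)
engineer = {"/subscriptions/sub-1/resourceGroups/rg-vault": ["Contributor"]}
colleague = {"/subscriptions/sub-1": ["Owner"]}

print("engineer can assign:", can_assign_roles(engineer, cache_account))
print("colleague can assign:", can_assign_roles(colleague, cache_account))
```

Contributor at the vault's resource group neither covers the storage account's scope nor includes the right to write role assignments, so only the colleague's subscription-level Owner role qualifies.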


Answer Key: Scenario 3

Answer: A

The definitive clue is in the vault's activity logs: no cleanup operation was recorded after the test failover. ASR requires the administrator to explicitly execute the Clean up test failover operation to end the test cycle and resume normal replication. This operation is distinct and separate from any action taken on the test VM itself, such as shutting it down or deleting it.

The irrelevant information is the existence of the isolated virtual network. It doesn't interfere with replication state; ASR doesn't monitor the presence or absence of test VNets as criteria to suspend or resume replication.

Alternative B reverses causality: shutting down the test VM doesn't suspend replication. Replication was already suspended because the test failover was started and never formally ended.

The most dangerous distractor is C: it's plausible to imagine a resource conflict, but ASR doesn't use VNet existence as a suspension trigger. Acting based on C would lead the administrator to delete the virtual network without solving the problem, since the cause is operational, not infrastructural.
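
The log check that settles this scenario can be sketched as follows: find the latest test failover and see whether any cleanup operation was recorded after it. This is only an illustration on in-memory records; the operation names and the event shape are assumptions, not the actual activity log schema.

```python
from datetime import datetime


def cleanup_missing(events, failover_op="Test failover", cleanup_op="Test failover cleanup"):
    """Return True if the most recent test failover has no later cleanup operation."""
    failover_times = [e["time"] for e in events if e["operation"] == failover_op]
    if not failover_times:
        return False  # no test failover recorded, so nothing to clean up
    last_failover = max(failover_times)
    return not any(
        e["operation"] == cleanup_op and e["time"] > last_failover for e in events
    )


# Hypothetical log entries mirroring the scenario: a test failover, then a VM stop.
events = [
    {"operation": "Test failover", "time": datetime(2024, 5, 1, 10, 0)},
    {"operation": "Stop virtual machine", "time": datetime(2024, 5, 2, 9, 30)},
]

print("cleanup missing:", cleanup_missing(events))
```

Stopping the test VM appears in the log as a different operation, so it never satisfies the cleanup condition.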


Answer Key: Scenario 4

Answer: B

The correct sequence is: 4 → 2 → 1 → 3 → 5

The correct diagnostic reasoning goes from general to specific, and from quickest to verify to most costly:

Step 4 rules out a global ASR service failure before any local investigation. If the service has an active incident, all subsequent investigation is premature.

Step 2 confirms whether the source VM is operational. Without this, any replication investigation has no foundation.

Step 1 accesses the specific error code from the vault, which directs the next steps with precision instead of investigating blindly.

Step 3 checks the cache storage account, which is one of the most common and identifiable failure points in the portal.

Step 5 is the most invasive and time-consuming, requiring internal VM access, and should be the last resort when previous steps haven't identified the cause.

Alternative A starts by checking the source VM before ruling out global failure, which can waste time. Alternative C starts with vault events without first ruling out global failure or checking if the VM is active, losing the broader context. Alternative D starts with the storage account, which is a specific component, before any scope triage.
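
The general-to-specific ordering can be sketched as a short-circuiting runner that executes the checks in the order 4 → 2 → 1 → 3 → 5 and stops at the first one that yields a finding. The probes are stand-in lambdas with hypothetical results; in practice each would query the portal, the vault, or the VM.

```python
from typing import Callable, List, Optional, Tuple

# A check is a label plus a probe that returns a finding, or None if nothing is wrong.
Check = Tuple[str, Callable[[], Optional[str]]]


def first_finding(checks: List[Check]) -> Optional[Tuple[str, str]]:
    """Run checks in order and return (label, finding) for the first one that reports a problem."""
    for label, probe in checks:
        finding = probe()
        if finding is not None:
            return label, finding
    return None


# Ordered from broadest and cheapest to narrowest and most invasive.
ordered_checks: List[Check] = [
    ("4. Azure Status / ASR service health", lambda: None),
    ("2. Source VM powered on and accessible", lambda: None),
    ("1. Vault health events and error code", lambda: None),
    ("3. Cache storage account availability", lambda: "throttling alert on cache account"),
    ("5. Mobility agent logs inside the source VM", lambda: "not reached in this run"),
]

print(first_finding(ordered_checks))
```

Because the runner stops at the first finding, the invasive in-guest log analysis only runs when every broader check comes back clean.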


Troubleshooting Tree: Configure Azure Site Recovery for Azure resources

Legend:

  • Dark blue: initial symptom or entry point
  • Blue: objective diagnostic question
  • Red: identified cause
  • Green: recommended action or resolution
  • Orange: intermediate validation or verification

To use this tree when facing a real problem, start from the root node and answer each question based on what is directly observable in the portal or VM. Always follow the path that corresponds to the actual state of the environment, without skipping questions. Red nodes indicate when the cause is identified; from them, follow to the corresponding recommended action and validate the result before closing the incident.
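
The traversal described above can be sketched with a small tree whose node kinds mirror the legend. The branch shown is a hypothetical fragment built from Scenario 3, not a transcription of the actual tree, and the `Node` type and `walk` function are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    kind: str  # "symptom", "question", "validation", "cause", or "action"
    text: str
    branches: Dict[str, "Node"] = field(default_factory=dict)


def walk(root: Node, answers: Dict[str, str]) -> List[Node]:
    """Follow the tree from the root, answering each question, and return the visited path."""
    path = [root]
    node = root
    while node.branches:
        # Questions and validations branch on an observed answer; other nodes have one successor.
        key = answers[node.text] if node.kind in ("question", "validation") else "next"
        node = node.branches[key]
        path.append(node)
    return path


question = "Is a cleanup operation recorded in the vault activity logs?"
tree = Node("symptom", "Replication suspended after a test failover", {
    "next": Node("question", question, {
        "no": Node("cause", "Clean up test failover was never executed", {
            "next": Node("action", "Run Clean up test failover, then validate replication status"),
        }),
        "yes": Node("action", "Check vault health events for a different error code"),
    }),
})

for step in walk(tree, {question: "no"}):
    print(f"[{step.kind}] {step.text}")
```

Each question is answered only from what was observed, and the walk ends at an action node whose result still has to be validated before the incident is closed.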