Troubleshooting Lab: Create and configure a backup policy

Diagnostic Scenarios

Scenario 1 — Root Cause

An operations team reports that production VMs in an Azure subscription have not been generating recovery points for 48 hours. The alert arrived in the morning, but the last successful backup recorded in the vault dates back two days.

The administrator accesses the Recovery Services Vault and observes the following in the portal:

Vault: vault-prod-brazilsouth
Region: Brazil South
Replication type: Geo-Redundant Storage (GRS)
Backup Items: 12 associated VMs
Job status (last 48h):

  VM-APP-01   | Failed | Error: UserErrorVmNotInDesiredState
  VM-APP-02   | Failed | Error: UserErrorVmNotInDesiredState
  VM-DB-01    | Failed | Error: UserErrorVmNotInDesiredState
  VM-DB-02    | Failed | Error: UserErrorVmNotInDesiredState
  (pattern repeated for all 12 VMs)

The administrator notes that all VMs are listed as Running in the Azure portal. The vault was created 6 months ago and worked normally until two days ago. The infrastructure team reports that on the same date, a ReadOnly resource lock was applied to the resource group containing the VMs as part of a governance process.

The vault redundancy was changed from LRS to GRS three weeks before the incident, with no impact on backups at that time.

What is the root cause of the backup job failures?

A) The vault redundancy change from LRS to GRS corrupted the VM association metadata, requiring policy reconfiguration.

B) The ReadOnly lock on the VM resource group prevents the Azure Backup agent from writing the necessary snapshots and metadata to the VMs during job execution.

C) The UserErrorVmNotInDesiredState error indicates that the VMs are internally in a deallocated state, despite appearing as Running in the portal, which is a known synchronization delay.

D) The vault in the Brazil South region experienced a regional service degradation that affected all backup jobs simultaneously in the last 48 hours.

Scenario 2 — Action Decision

The cause of a failure has been identified: the backup policy associated with a set of critical production VMs was configured with daily retention of 1 day instead of the 30 days required by the company's security policy. Recovery points generated in the last 15 days were automatically deleted according to the incorrect policy.

The environment has the following restrictions:

- VMs are in active production and cannot be shut down or restarted
- Soft Delete is enabled in the vault with a 14-day window
- The administrator has the Backup Contributor role in the subscription
- A change committee must approve retention policy changes, but emergency approval can be obtained in 2 hours
- The next scheduled backup occurs in 4 hours

What is the correct action to take at this moment?

A) Immediately disassociate the VMs from the incorrect policy and create a new policy with 30-day retention, applying it to the VMs before the next scheduled backup, without waiting for committee approval as this is a correction, not a change.

B) Request emergency approval from the change committee, edit the existing policy to correct retention to 30 days, and confirm the change before the next scheduled backup.

C) Trigger Soft Delete to recover points deleted in the last 14 days and, in parallel, edit the retention policy without committee approval.

D) Wait for the next scheduled backup without changing the policy, to ensure the generated point is not immediately deleted, and submit the change through the normal committee process.

Scenario 3 — Root Cause

An administrator tries to configure backup for a SQL Server database hosted on an Azure VM. When accessing the vault and clicking Backup, they select SQL in Azure VM as the workload type and try to discover databases on the VM VM-SQL-PROD-01.

The discovery operation returns the following error:

Discovery failed for VM: VM-SQL-PROD-01
Error code: UserErrorSqlInstanceNotFound
Message: No SQL Server instances were found on the virtual machine.
         Ensure the SQL Server service is running and the VM agent is healthy.

The administrator verifies:

# Executed via Run Command on the VM
> Get-Service -Name MSSQLSERVER

Status   Name               DisplayName
------   ----               -----------
Running  MSSQLSERVER        SQL Server (MSSQLSERVER)

> Get-Service -Name WindowsAzureGuestAgent

Status   Name                     DisplayName
------   ----                     -----------
Running  WindowsAzureGuestAgent   Windows Azure Guest Agent

The VM was migrated from on-premises to Azure 30 days ago using Azure Migrate. The data disk that stores the MDF and LDF files is a 1 TB Premium SSD managed disk. The SQL Server instance listens on custom port 1455 instead of the default port 1433. The VM is in a VNet with an NSG that allows unrestricted outbound traffic.

What is the root cause of the discovery failure?

A) Azure Migrate does not properly prepare the VM for Azure Backup integration, requiring VM agent reinstallation after migration.

B) The VNet NSG is blocking communication between the vault and the VM on the discovery port, as unrestricted outbound traffic does not guarantee the necessary inbound traffic return.

C) The AzureBackupWindowsWorkload extension was not installed on the VM, as the discovery process installs it, and it failed before completing installation due to a connectivity issue with the extensions endpoint.

D) Using custom port 1455 prevents the Azure Backup discovery mechanism from locating the SQL Server instance, as discovery depends on enumeration via default port 1433 or via WMI/Windows registry, not direct TCP connection.

Scenario 4 — Diagnostic Sequence

An administrator receives the following alert at 07:15:

Alert: Backup job failed
Vault: vault-corp-eastus2
VM: VM-WEB-PROD-07
Error: ExtensionSnapshotFailedNoNetwork
Failure time: 02:35 UTC
Last successful backup: 26 hours ago

The administrator has the following investigation steps available, presented out of order:

1. Check the vault job history to identify if the failure is isolated to this VM or affects other VMs in the same vault
2. Inspect the VMSnapshot extension logs within the VM to get details of network errors logged at job time
3. Confirm if the previous day's recovery point is intact and available for restoration if needed
4. Check if there are NSG or UDR rules applied to the VM's subnet that might be blocking traffic necessary for the snapshot service
5. Confirm if the VMSnapshot extension is installed and has Provisioning succeeded status on the VM

Which diagnostic sequence represents the most effective approach?

A) 3 → 1 → 5 → 2 → 4

B) 1 → 5 → 4 → 2 → 3

C) 5 → 2 → 4 → 1 → 3

D) 1 → 3 → 5 → 4 → 2

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

Explanation:

A ReadOnly lock on a resource group blocks write operations on the resources it contains. That includes the operations Azure Backup must perform against the VM during snapshot creation: metadata writing, VSS coordination, and fabric communication. The backup service raises UserErrorVmNotInDesiredState when it cannot bring the VM into the state required for a snapshot, which is what happens when those write operations are blocked.
The definitive clue in the scenario is the exact temporal coincidence: the ReadOnly lock was applied on the same date as the last successful backup, and all 12 VMs failed simultaneously with the same error code.
The information about the LRS to GRS change is purposely irrelevant: it occurred three weeks earlier with no impact, and vault redundancy does not affect job execution.
Alternative C is the most dangerous distractor: the error code suggests a VM state issue, and a less experienced administrator might pursue a non-existent VM problem instead of examining governance restrictions applied to the resource group.
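The temporal correlation used in this explanation can be checked mechanically. The following is a minimal Python sketch, assuming the job history has already been exported into plain dictionaries. The field names, the dates, and the `change_explains_failures` helper are hypothetical illustrations, not part of any Azure SDK.

```python
from datetime import datetime

ERROR = "UserErrorVmNotInDesiredState"

# Illustrative job records; the dates and field names are assumptions.
JOBS = [
    {"vm": "VM-APP-01", "status": "Completed", "error": None, "time": datetime(2024, 5, 1, 2, 0)},
    {"vm": "VM-DB-01", "status": "Completed", "error": None, "time": datetime(2024, 5, 1, 2, 0)},
    {"vm": "VM-APP-01", "status": "Failed", "error": ERROR, "time": datetime(2024, 5, 2, 2, 0)},
    {"vm": "VM-DB-01", "status": "Failed", "error": ERROR, "time": datetime(2024, 5, 2, 2, 0)},
    {"vm": "VM-APP-01", "status": "Failed", "error": ERROR, "time": datetime(2024, 5, 3, 2, 0)},
    {"vm": "VM-DB-01", "status": "Failed", "error": ERROR, "time": datetime(2024, 5, 3, 2, 0)},
]


def change_explains_failures(jobs, change_time):
    """Return True when a change is a plausible trigger for the failures.

    The change qualifies only if every failure happened after it, no job
    succeeded after it, and all failures share a single error code.
    """
    failures = [j for j in jobs if j["status"] == "Failed"]
    if not failures:
        return False
    every_failure_after = all(j["time"] > change_time for j in failures)
    no_success_after = not any(
        j["status"] == "Completed" and j["time"] > change_time for j in jobs
    )
    single_error_code = len({j["error"] for j in failures}) == 1
    return every_failure_after and no_success_after and single_error_code


# Hypothetical timestamps for the two changes mentioned in the scenario.
lock_applied = datetime(2024, 5, 1, 12, 0)
redundancy_changed = datetime(2024, 4, 10, 9, 0)

print(change_explains_failures(JOBS, lock_applied))        # True
print(change_explains_failures(JOBS, redundancy_changed))  # False
```

The redundancy change is rejected because backups kept succeeding after it, which mirrors the reasoning that makes it a distractor.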

Answer Key — Scenario 2

Answer: B

Explanation:

The scenario presents an explicit and non-negotiable restriction: retention policy changes require change committee approval, with emergency approval available in 2 hours. The next backup occurs in 4 hours. There is sufficient time to follow the correct process before the next point is generated and potentially retained for the incorrect period.
Alternative A describes a technically sound fix but ignores a critical process restriction. In regulated corporate environments, bypassing the change process, even with good intentions, can create auditable non-compliance issues.
Alternative C demonstrates a misunderstanding of Soft Delete: this feature retains backup data for 14 days after an operator deletes it, and it does not recover points that the configured retention policy deleted automatically. The points from the last 15 days are permanently lost.
Alternative D is the most dangerous: waiting without correction ensures the next generated point will be deleted in 1 day, perpetuating the risk exposure.
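The timing argument and the effect of the retention value can be expressed as simple arithmetic. The sketch below is a simplified model, assuming one recovery point per day and that a point survives while its age in days is below the retention value. The function names and the reference date are hypothetical, and the model does not reproduce the exact pruning rules of Azure Backup.

```python
from datetime import date, timedelta


def approval_fits(approval_hours, hours_until_next_backup):
    """True when approval can be obtained before the next scheduled backup."""
    return approval_hours < hours_until_next_backup


def surviving_points(point_dates, retention_days, today):
    """Simplified model: keep points whose age in days is below the retention."""
    return [d for d in point_dates if (today - d).days < retention_days]


today = date(2024, 5, 15)  # illustrative date, not taken from the scenario
daily_points = [today - timedelta(days=i) for i in range(15)]

print(approval_fits(2, 4))                              # True
print(len(surviving_points(daily_points, 1, today)))    # 1
print(len(surviving_points(daily_points, 30, today)))   # 15
```

With the 2-hour emergency approval and a backup due in 4 hours, the process-compliant path in alternative B fits inside the available window.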

Answer Key — Scenario 3

Answer: D

Explanation:

The Azure Backup discovery mechanism for SQL in Azure VM does not open a direct TCP connection to SQL Server on its listening port. It locates the SQL instances installed on the VM through WMI and Windows registry enumeration. However, registering and authenticating the backup service with the SQL Server instance during the discovery and protection phase depends on the default port 1433.
The clues in the scenario confirming this diagnosis: SQL service is Running, VM agent is Running, NSG allows unrestricted outbound, and the only atypical documented element is custom port 1455.
The information about the Premium SSD 1 TB disk is purposely irrelevant: data disk type and size do not affect the SQL instance discovery mechanism.
Alternative A is the distractor based on VM origin (Azure Migrate migration), which the administrator could unnecessarily pursue. Azure Migrate does not compromise the VM agent, as confirmed by the service status.
Acting based on alternative A would lead to unnecessary reinstallations and wasted time without resolving the real cause.

Answer Key — Scenario 4

Answer: B

Explanation:

The correct sequence follows the logic of scope → infrastructure → detail → impact:
- Step 1: Checking if the problem is isolated or generalized determines incident scope immediately and influences all following steps. A generalized problem points to the vault or network; an isolated problem points to the specific VM.
- Step 5: Confirming the extension is installed and healthy is the prerequisite for any deeper VM diagnosis.
- Step 4: With the extension confirmed as healthy, investigating NSG and UDR is the next logical step given the ExtensionSnapshotFailedNoNetwork error code, which explicitly points to network failure.
- Step 2: Inspecting extension logs within the VM provides technical detail that confirms or refines the network diagnosis.
- Step 3: Confirming the integrity of the last recovery point is an impact and contingency check, not a diagnostic step. It should be the last step.
Alternative A places contingency verification (step 3) as the first action, representing a priority inversion: before checking if there's data to recover, it's necessary to understand and resolve the failure cause.
Alternative C starts with the VM extension without first checking scope, which may lead to detailed diagnosis of a problem affecting the entire vault, wasting time.
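The scope → infrastructure → detail → impact ordering can be sketched as a sort over tagged steps. In this Python sketch, the step numbers follow the scenario, while the phase tags and the tie-breaker inside the infrastructure phase (extension health before network rules) are this sketch's reading of the explanation, not an official classification.

```python
# Phase order taken from the explanation above.
PHASE_RANK = {"scope": 0, "infrastructure": 1, "detail": 2, "impact": 3}

# step number -> (phase, order within phase, short description)
# The phase tags and within-phase order are assumptions made for illustration.
STEPS = {
    1: ("scope", 0, "Check vault job history for other failing VMs"),
    2: ("detail", 0, "Inspect VMSnapshot extension logs inside the VM"),
    3: ("impact", 0, "Confirm the previous recovery point is intact"),
    4: ("infrastructure", 1, "Check NSG and UDR rules on the VM subnet"),
    5: ("infrastructure", 0, "Confirm the VMSnapshot extension is healthy"),
}


def diagnostic_order(steps):
    """Sort step numbers by phase rank, then by order within the phase."""
    return sorted(steps, key=lambda n: (PHASE_RANK[steps[n][0]], steps[n][1]))


order = diagnostic_order(STEPS)
print(" -> ".join(str(n) for n in order))  # 1 -> 5 -> 4 -> 2 -> 3
```

Tagging each step with a phase makes the priority inversion in alternative A visible: step 3 carries the highest phase rank, so it cannot come first.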

Troubleshooting Tree: Create and configure a backup policy

Color Legend:

Color	Node Type
Dark blue	Initial symptom, entry point
Medium blue	Decision diagnostic question
Red	Identified cause
Green	Recommended action or resolution
Orange	Intermediate verification or validation

To use this tree when facing a real problem, start at the root node by identifying whether the problem is a failing job, recovery points with unexpected behavior, an item that does not appear as protectable, or a vault that cannot be deleted. Follow the diagnostic questions, answering objectively based on what you can observe in the portal or logs. Each branch progressively eliminates hypotheses until it reaches a named cause and a concrete action. Never skip an intermediate verification question: they exist to prevent you from applying the correct action to the wrong cause.
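The traversal described above can be sketched as a small data structure. This is a minimal Python sketch, not a transcription of the tree itself: the questions, causes, and actions are illustrative branches loosely based on Scenarios 1 and 4, and the `Question`, `Leaf`, and `walk` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    cause: str   # named cause (red node)
    action: str  # recommended action (green node)


@dataclass
class Question:
    text: str    # diagnostic question (medium blue node)
    yes: "Node"
    no: "Node"


Node = Union[Leaf, Question]

# Illustrative branches only; the real tree has more entry points and checks.
TREE = Question(
    "Do all VMs in the vault fail with the same error?",
    yes=Question(
        "Was a resource lock applied on the date the failures began?",
        yes=Leaf("Lock on the VM resource group", "Review the lock with the governance team"),
        no=Leaf("Vault-wide or network-wide issue", "Check vault health and shared network rules"),
    ),
    no=Question(
        "Does the VMSnapshot extension show Provisioning succeeded?",
        yes=Leaf("Network rule blocking snapshot traffic", "Review NSG and UDR rules on the subnet"),
        no=Leaf("Unhealthy backup extension", "Investigate the extension state on the VM"),
    ),
)


def walk(node, answers):
    """Follow the tree using observed answers; refuse to skip a question."""
    path = []
    while isinstance(node, Question):
        if node.text not in answers:
            raise KeyError(f"No observation recorded for: {node.text}")
        answer = answers[node.text]
        path.append((node.text, answer))
        node = node.yes if answer else node.no
    return node, path


observations = {
    "Do all VMs in the vault fail with the same error?": True,
    "Was a resource lock applied on the date the failures began?": True,
}
leaf, path = walk(TREE, observations)
print(leaf.cause)  # Lock on the VM resource group
```

Raising an error on a missing observation enforces the rule against skipping intermediate verification questions.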
