Troubleshooting Lab: Create and configure a backup policy
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that production VMs in an Azure subscription have not been generating recovery points for 48 hours. The alert arrived in the morning, but the last successful backup recorded in the vault dates back two days.
The administrator accesses the Recovery Services Vault and observes the following in the portal:
Vault: vault-prod-brazilsouth
Region: Brazil South
Replication type: Geo-Redundant Storage (GRS)
Backup Items: 12 associated VMs
Job status (last 48h):
VM-APP-01 | Failed | Error: UserErrorVmNotInDesiredState
VM-APP-02 | Failed | Error: UserErrorVmNotInDesiredState
VM-DB-01 | Failed | Error: UserErrorVmNotInDesiredState
VM-DB-02 | Failed | Error: UserErrorVmNotInDesiredState
(pattern repeated for all 12 VMs)
The administrator notes that all VMs are listed as Running in the Azure portal. The vault was created 6 months ago and worked normally until two days ago. The infrastructure team reports that on the same date, a ReadOnly resource lock was applied to the resource group containing the VMs as part of a governance process.
The vault redundancy was changed from LRS to GRS three weeks before the incident, with no impact on backups at that time.
What is the root cause of the backup job failures?
A) The vault redundancy change from LRS to GRS corrupted the VM association metadata, requiring policy reconfiguration.
B) The ReadOnly lock on the VM resource group prevents the Azure Backup service from performing the write operations it needs on the VMs (snapshots and related metadata) during job execution.
C) The UserErrorVmNotInDesiredState error indicates that the VMs are internally in a deallocated state, despite appearing as Running in the portal, which is a known synchronization delay.
D) The vault in the Brazil South region experienced a regional service degradation that affected all backup jobs simultaneously in the last 48 hours.
Scenario 2: Action Decision
The cause of a failure has been identified: the backup policy associated with a set of critical production VMs was configured with daily retention of 1 day instead of the 30 days required by the company's security policy. Recovery points generated in the last 15 days were automatically deleted according to the incorrect policy.
The environment has the following restrictions:
- VMs are in active production and cannot be shut down or restarted
- Soft Delete is enabled in the vault with a 14-day window
- The administrator has the Backup Contributor role in the subscription
- A change committee must approve retention policy changes, but emergency approval can be obtained in 2 hours
- The next scheduled backup occurs in 4 hours
What is the correct action to take at this moment?
A) Immediately disassociate the VMs from the incorrect policy and create a new policy with 30-day retention, applying it to the VMs before the next scheduled backup, without waiting for committee approval as this is a correction, not a change.
B) Request emergency approval from the change committee, edit the existing policy to correct retention to 30 days, and confirm the change before the next scheduled backup.
C) Use Soft Delete to recover the points deleted in the last 14 days and, in parallel, edit the retention policy without committee approval.
D) Wait for the next scheduled backup without changing the policy, to ensure the generated point is not immediately deleted, and submit the change through the normal committee process.
Scenario 3: Root Cause
An administrator tries to configure backup for a SQL Server database hosted on an Azure VM. When accessing the vault and clicking Backup, they select SQL in Azure VM as the workload type and try to discover databases on the VM VM-SQL-PROD-01.
The discovery operation returns the following error:
Discovery failed for VM: VM-SQL-PROD-01
Error code: UserErrorSqlInstanceNotFound
Message: No SQL Server instances were found on the virtual machine.
Ensure the SQL Server service is running and the VM agent is healthy.
The administrator verifies:
# Executed via Run Command on the VM
> Get-Service -Name MSSQLSERVER
Status   Name                    DisplayName
------   ----                    -----------
Running  MSSQLSERVER             SQL Server (MSSQLSERVER)
> Get-Service -Name WindowsAzureGuestAgent
Status   Name                    DisplayName
------   ----                    -----------
Running  WindowsAzureGuestAgent  Windows Azure Guest Agent
The VM was migrated from on-premises to Azure 30 days ago using Azure Migrate. The data disk where the MDF and LDF files are stored is a Premium SSD managed disk with 1 TB. The SQL Server instance uses a custom port: 1455, different from the default port 1433. The VM is in a VNet with an NSG that allows unrestricted outbound traffic.
What is the root cause of the discovery failure?
A) Azure Migrate does not properly prepare the VM for Azure Backup integration, requiring VM agent reinstallation after migration.
B) The VNet NSG is blocking communication between the vault and the VM on the discovery port, as unrestricted outbound traffic does not guarantee the necessary inbound traffic return.
C) The AzureBackupWindowsWorkload extension was not installed on the VM, as the discovery process installs it, and it failed before completing installation due to a connectivity issue with the extensions endpoint.
D) Using custom port 1455 prevents the Azure Backup discovery mechanism from locating the SQL Server instance, as discovery depends on enumeration via default port 1433 or via WMI/Windows registry, not direct TCP connection.
Scenario 4: Diagnostic Sequence
An administrator receives the following alert at 07:15:
Alert: Backup job failed
Vault: vault-corp-eastus2
VM: VM-WEB-PROD-07
Error: ExtensionSnapshotFailedNoNetwork
Failure time: 02:35 UTC
Last successful backup: 26 hours ago
The administrator has the following investigation steps available, presented out of order:
1. Check the vault job history to identify whether the failure is isolated to this VM or affects other VMs in the same vault
2. Inspect the VMSnapshot extension logs within the VM to get details of the network errors logged at job time
3. Confirm whether the previous day's recovery point is intact and available for restoration if needed
4. Check whether NSG or UDR rules applied to the VM's subnet might be blocking traffic required by the snapshot service
5. Confirm whether the VMSnapshot extension is installed and has Provisioning succeeded status on the VM
Which diagnostic sequence represents the most effective approach?
A) 3 → 1 → 5 → 2 → 4
B) 1 → 5 → 4 → 2 → 3
C) 5 → 2 → 4 → 1 → 3
D) 1 → 3 → 5 → 4 → 2
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
Explanation:
- A ReadOnly lock on a resource group blocks write operations on the resources it contains, including the operations that Azure Backup needs to perform on the VM during snapshot creation: metadata writing, VSS coordination, and fabric communication. The UserErrorVmNotInDesiredState error is generated when the backup service cannot put the VM in the state required for the snapshot, which occurs when write operations are blocked.
- The definitive clue in the scenario is the exact temporal coincidence: the ReadOnly lock was applied on the same date as the last successful backup, and all 12 VMs failed simultaneously with the same error code.
- The information about the LRS to GRS change is purposely irrelevant: it occurred three weeks earlier with no impact, and vault redundancy does not affect job execution.
- Alternative C is the most dangerous distractor: the error code suggests a VM state issue, and a less experienced administrator might pursue a non-existent VM problem instead of examining governance restrictions applied to the resource group.
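The temporal correlation used in this diagnosis can be checked mechanically. The following is a minimal Python sketch, not an Azure API call: the job records, the change log, and all dates are hypothetical in-memory data modeled on the scenario, and `changes_in_failure_window` and `shared_error_codes` are illustrative helper names.

```python
from datetime import date

FAILED_CODE = "UserErrorVmNotInDesiredState"

# Hypothetical job history modeled on the scenario; dates are invented.
jobs = [
    {"vm": "VM-APP-01", "day": date(2024, 5, 1), "status": "Completed", "error": None},
    {"vm": "VM-APP-01", "day": date(2024, 5, 2), "status": "Failed", "error": FAILED_CODE},
    {"vm": "VM-APP-01", "day": date(2024, 5, 3), "status": "Failed", "error": FAILED_CODE},
    {"vm": "VM-DB-01", "day": date(2024, 5, 1), "status": "Completed", "error": None},
    {"vm": "VM-DB-01", "day": date(2024, 5, 2), "status": "Failed", "error": FAILED_CODE},
    {"vm": "VM-DB-01", "day": date(2024, 5, 3), "status": "Failed", "error": FAILED_CODE},
]

# Hypothetical change log for the environment.
changes = [
    {"day": date(2024, 4, 10), "what": "Vault redundancy changed from LRS to GRS"},
    {"day": date(2024, 5, 1), "what": "ReadOnly lock applied to the VM resource group"},
]


def changes_in_failure_window(jobs, changes):
    """Return changes made between the last success and the first failure."""
    failed_days = [j["day"] for j in jobs if j["status"] == "Failed"]
    if not failed_days:
        return []
    onset = min(failed_days)
    ok_days = [j["day"] for j in jobs
               if j["status"] == "Completed" and j["day"] <= onset]
    last_ok = max(ok_days) if ok_days else onset
    return [c["what"] for c in changes if last_ok <= c["day"] <= onset]


def shared_error_codes(jobs):
    """Return the distinct error codes across all failed jobs."""
    return {j["error"] for j in jobs if j["status"] == "Failed"}


print(changes_in_failure_window(jobs, changes))
print(shared_error_codes(jobs))
```

With this sample data, only the lock falls inside the window between the last success and the first failure, and a single error code is shared by every failed job. The redundancy change three weeks earlier is filtered out, which mirrors why it is treated as irrelevant.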
Answer Key: Scenario 2
Answer: B
Explanation:
- The scenario presents an explicit and non-negotiable restriction: retention policy changes require change committee approval, with emergency approval available in 2 hours. The next backup occurs in 4 hours. There is sufficient time to follow the correct process before the next point is generated and potentially retained for the incorrect period.
- Alternative A describes a technically valid fix, but it ignores a critical process restriction. In regulated corporate environments, bypassing the change process, even with good intentions, can create auditable non-compliance issues.
- Alternative C demonstrates a misunderstanding of Soft Delete: the feature retains backup data for 14 days after an operator deletes it, and does not recover points that were removed automatically by the configured retention policy. The points from the last 15 days are permanently lost.
- Alternative D is the most dangerous: waiting without correction ensures the next generated point will be deleted in 1 day, perpetuating the risk exposure.
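The timing and retention arithmetic behind this answer can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how Azure Backup prunes points internally: a point is treated as expired as soon as it is older than the retention period, and the clock time, the daily schedule, and the helper names `surviving_points` and `approval_fits` are all hypothetical.

```python
from datetime import datetime, timedelta


def surviving_points(point_times, now, retention_days):
    """Return the recovery points still inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in point_times if t >= cutoff]


def approval_fits(approval_hours, hours_to_next_backup):
    """True if approval can be obtained before the next scheduled backup."""
    return approval_hours < hours_to_next_backup


# Illustrative "current" time; the next daily backup is 4 hours away,
# so the most recent backup ran 20 hours ago.
now = datetime(2024, 5, 16, 22, 0)
points = [now - timedelta(days=d, hours=20) for d in range(15)]

kept_with_1_day = surviving_points(points, now, retention_days=1)
kept_with_30_days = surviving_points(points, now, retention_days=30)

print(len(kept_with_1_day), len(kept_with_30_days))
print(approval_fits(approval_hours=2, hours_to_next_backup=4))
```

In this model the 30-day policy would have kept every point from the last 15 days, while the 1-day policy keeps at most the latest one. The second check shows that the 2-hour emergency approval fits inside the 4 hours remaining before the next backup.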
Answer Key: Scenario 3
Answer: D
Explanation:
- The Azure Backup discovery mechanism for SQL in Azure VM does not open a direct TCP connection to SQL Server through the listening port. It uses WMI and Windows registry enumeration to locate the SQL instances installed on the VM. However, the registration and authentication of the backup service with the SQL Server instance depends on default port 1433 to establish the necessary communication during the discovery and protection phase.
- The clues in the scenario confirming this diagnosis: SQL service is Running, VM agent is Running, NSG allows unrestricted outbound, and the only atypical documented element is custom port 1455.
- The information about the Premium SSD 1 TB disk is purposely irrelevant: data disk type and size do not affect the SQL instance discovery mechanism.
- Alternative A is the distractor based on VM origin (Azure Migrate migration), which the administrator could unnecessarily pursue. Azure Migrate does not compromise the VM agent, as confirmed by the service status.
- Acting based on alternative A would lead to unnecessary reinstallations and wasted time without resolving the real cause.
Answer Key: Scenario 4
Answer: B
Explanation:
- The correct sequence follows the logic of scope → infrastructure → detail → impact:
  - Step 1: Checking if the problem is isolated or generalized determines incident scope immediately and influences all following steps. A generalized problem points to the vault or network; an isolated problem points to the specific VM.
  - Step 5: Confirming the extension is installed and healthy is the prerequisite for any deeper VM diagnosis.
  - Step 4: With the extension confirmed as healthy, investigating NSG and UDR is the next logical step given the ExtensionSnapshotFailedNoNetwork error code, which explicitly points to network failure.
  - Step 2: Inspecting extension logs within the VM provides technical detail that confirms or refines the network diagnosis.
  - Step 3: Confirming the integrity of the last recovery point is an impact and contingency check, not a diagnostic step. It should be the last step.
- Alternative A places contingency verification (step 3) as the first action, representing a priority inversion: before checking if there's data to recover, it's necessary to understand and resolve the failure cause.
- Alternative C starts with the VM extension without first checking scope, which may lead to detailed diagnosis of a problem affecting the entire vault, wasting time.
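The scope check in step 1 can be expressed as a small classification over the latest job result per VM. The sketch below is a minimal Python illustration on hypothetical data: only VM-WEB-PROD-07 and its error code come from the scenario, while the other VM names, the category labels, and the `classify_scope` helper are assumptions made for the example.

```python
def classify_scope(latest_jobs):
    """Classify a failure from the latest job per VM.

    latest_jobs maps a VM name to a (status, error_code) tuple.
    """
    failed = {vm: err for vm, (status, err) in latest_jobs.items()
              if status == "Failed"}
    if not failed:
        return "no failures"
    if len(failed) == 1:
        return "isolated"
    if len(set(failed.values())) == 1:
        return "widespread, same error"
    return "widespread, mixed errors"


# Hypothetical vault job history: one failing VM, two healthy neighbours.
isolated_case = {
    "VM-WEB-PROD-07": ("Failed", "ExtensionSnapshotFailedNoNetwork"),
    "VM-WEB-PROD-08": ("Completed", None),
    "VM-WEB-PROD-09": ("Completed", None),
}

# Hypothetical alternative: every VM fails with the same code.
widespread_case = {
    vm: ("Failed", "ExtensionSnapshotFailedNoNetwork")
    for vm in isolated_case
}

print(classify_scope(isolated_case))
print(classify_scope(widespread_case))
```

An "isolated" result justifies continuing with the VM-level checks (steps 5, 4, and 2), while a widespread result with a single error code suggests looking at the vault or shared network configuration first.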
Troubleshooting Tree: Create and configure a backup policy
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom, entry point |
| Medium blue | Decision diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |
To use this tree when facing a real problem, start at the root node by identifying the symptom: a failing job, recovery points behaving unexpectedly, an item that does not appear as protectable, or a vault that cannot be deleted. Answer each diagnostic question objectively, based on what you can observe in the portal or logs. Each branch progressively eliminates hypotheses until it reaches a named cause and a concrete action. Never skip an intermediate verification question: they exist to prevent you from applying the correct action to the wrong cause.