Troubleshooting Lab: Perform Backup and Restore Operations by Using Azure Backup
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that backups for a critical VM stopped running 6 days ago. The VM continues running normally, with no reported network incidents or disk failures. The Recovery Services Vault is in the same region as the VM, with Active status in the portal.
The administrator checks the job history in the vault and finds the following:
Backup Job History: Last 7 days
| Date | Status | Duration | Error Code |
|---|---|---|---|
| 2026-03-20 | Completed | 00:42:11 | - |
| 2026-03-21 | Failed | 00:00:03 | UserErrorVmNotInDesiredState |
| 2026-03-22 | Failed | 00:00:03 | UserErrorVmNotInDesiredState |
| 2026-03-23 | Failed | 00:00:03 | UserErrorVmNotInDesiredState |
| 2026-03-24 | Failed | 00:00:03 | UserErrorVmNotInDesiredState |
| 2026-03-25 | Failed | 00:00:03 | UserErrorVmNotInDesiredState |
| 2026-03-26 | Failed | 00:00:03 | UserErrorVmNotInDesiredState |
The administrator also confirms that:
- The backup policy is configured for daily execution at 02:00 UTC
- The vault has GRS replication enabled
- The VM has a 2 TB data disk added recently
- The subscription has not reached any storage quota limits
What is the root cause of the backup failures?
A) The 2 TB data disk added to the VM exceeds the limit supported by Azure Backup for individual disks, blocking job execution.
B) The VM is in a state that prevents backup execution, such as Deallocated or Stopped (deallocated), despite the operating system appearing accessible through other means.
C) The GRS replication enabled on the vault is causing conflicts with backup execution during the geographic replication window.
D) The backup policy was corrupted after adding the data disk, requiring recreation of the association between the VM and the vault.
Scenario 2: Action Decision
The cause of a problem has been identified: an administrator accidentally deleted the backup item for a production VM directly in the Recovery Services Vault. The deletion occurred 9 days ago. The vault has the Soft Delete feature enabled.
The environment has the following relevant characteristics:
- The original VM is still running and healthy
- The last successful backup before deletion was performed 11 days ago
- The organization requires the maximum RPO for this VM to be 24 hours
- A new backup was configured immediately after discovering the incident, 2 hours ago
- The security team is monitoring the case and awaiting confirmation that historical data has been preserved
Given that the cause has been identified and the environment is described above, what is the correct action to take at this moment?
A) Immediately start VM restoration from the recovery point available in soft delete state, overwriting the current production VM.
B) Reactivate the backup item in soft delete state to recover access to historical recovery points, without interrupting the production VM.
C) Wait for the next scheduled backup to complete before taking any action, as the original VM is healthy and there is no active data loss.
D) Permanently delete the item in soft delete and recreate protection from scratch, as the data is 11 days old and violates the 24-hour RPO anyway.
Scenario 3: Root Cause
An administrator attempts to perform individual file restoration from a Linux VM using Azure Backup's File Recovery functionality. The selected recovery point is 3 days old. The mount script was successfully generated in the portal and executed on the target VM.
The script output on the target VM is as follows:
```
$ sudo bash ILRscript.sh
Connection to recovery point established.
Mounting recovery volumes...
ERROR: Mount failed for volume /dev/sdc1
Filesystem type: XFS
Kernel module: xfs - not loaded
Please ensure the required filesystem modules are available on this machine.
Recovery point connection will expire in: 11:47:32
```
The administrator verifies that:
- The target VM is Ubuntu 22.04 with kernel 5.15
- The source VM where the backup was taken is Red Hat Enterprise Linux 8.6
- The network between VMs and the vault shows no blocks
- The target VM's resource group is in the same region as the vault
- The source VM's OS disk uses XFS file system
What is the root cause of the mount failure?
A) The mount script expired during execution due to network latency between the target VM and the vault, preventing iSCSI volume establishment.
B) The kernel module for the XFS file system is not loaded on the target VM, which uses a different distribution than the source VM.
C) Individual file restoration is not supported between VMs of different Linux distributions, requiring full disk restoration.
D) The 3-day-old recovery point is not available for File Recovery because only points created within the last 24 hours support iSCSI mounting.
Scenario 4: Diagnostic Sequence
An administrator receives an alert indicating that a Windows VM backup failed with the error GuestAgentSnapshotTaskTimedOut. The VM is used as a production application server and backups were working normally until 48 hours ago. No infrastructure changes were recorded in the team's changelog.
The available investigation steps are:
[P] Check the status and version of Azure VM Agent installed on the VM
[Q] Confirm if the backup policy was modified in the last 48 hours
[R] Check for pending or stuck snapshots on the VM disk via portal
[S] Analyze Windows Event Viewer logs on the VM for guest agent errors
[T] Verify VM connectivity with Azure Backup endpoints (service URLs)
What is the correct investigation sequence for this error?
A) T, Q, P, S, R
B) Q, T, R, P, S
C) P, T, S, R, Q
D) R, P, T, S, Q
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The error code UserErrorVmNotInDesiredState directly indicates that the VM is not in the operational state expected by Azure Backup at job execution time. This error occurs when the VM is in Deallocated or Stopped (deallocated) state, because application-consistent backup in Azure Backup requires the guest agent to be running to coordinate consistent snapshots.
The decisive clue in the scenario is that every failed job lasts only 3 seconds. This indicates immediate rejection before any snapshot attempt, which is characteristic of a VM state check performed before execution.
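The 3-second pattern can also be checked mechanically. The following is a minimal sketch, assuming the job history has been exported as simple records; the `JobRecord` type, the `is_fast_fail` helper, and the 10-second threshold are illustrative assumptions, not part of any Azure SDK.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class JobRecord:
    date: str
    status: str
    duration: timedelta
    error_code: Optional[str]


def parse_duration(text: str) -> timedelta:
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def is_fast_fail(job: JobRecord, threshold: timedelta = timedelta(seconds=10)) -> bool:
    """A failed job that ends almost immediately suggests a pre-check rejection.

    The 10-second threshold is an arbitrary value chosen for illustration.
    """
    return job.status == "Failed" and job.duration <= threshold


# Rows copied from the scenario's job history.
FAILED_DATES = ["2026-03-21", "2026-03-22", "2026-03-23",
                "2026-03-24", "2026-03-25", "2026-03-26"]
HISTORY = [JobRecord("2026-03-20", "Completed", parse_duration("00:42:11"), None)]
HISTORY += [
    JobRecord(day, "Failed", parse_duration("00:00:03"), "UserErrorVmNotInDesiredState")
    for day in FAILED_DATES
]

fast_fails = [job for job in HISTORY if is_fast_fail(job)]
distinct_errors = {job.error_code for job in fast_fails}
print(len(fast_fails), distinct_errors)
```

Six fast failures sharing one error code point to a single persistent condition rather than to intermittent infrastructure problems.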
The irrelevant information is the 2 TB data disk added recently. Although it's a prominent technical detail, Azure Backup supports disks up to 32 TB and the error is unrelated to disk size. Including this detail tests the tendency to correlate the most recent change with the failure, without evidence.
GRS replication (alternative C) is a vault configuration that does not interfere with backup job execution. Alternative D is implausible because the association between VM and vault is not affected by adding disks.
Acting based on alternative A would lead the administrator to remove the data disk or reconfigure disk policies without solving the real problem, prolonging the exposure window without backup.
Answer Key: Scenario 2
Answer: B
The backup item has been in soft delete state for 9 days. The default soft delete retention period is 14 days, so the historical data is still accessible and within the retention window. The correct action is to reactivate (undelete) the item, which restores access to all recovery points without any data loss and without impact on the running production VM.
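The retention arithmetic is simple enough to sketch. The snippet below assumes the 14-day retention period stated above; the function name and the sample date are hypothetical and exist only for illustration.

```python
from datetime import date, timedelta

SOFT_DELETE_RETENTION_DAYS = 14  # retention period stated in the explanation


def days_left_in_soft_delete(
    deleted_on: date,
    today: date,
    retention_days: int = SOFT_DELETE_RETENTION_DAYS,
) -> int:
    """Return how many days remain before soft-deleted data is purged (0 if expired)."""
    expires_on = deleted_on + timedelta(days=retention_days)
    return max((expires_on - today).days, 0)


today = date(2026, 3, 26)               # hypothetical "today" for illustration
deleted_on = today - timedelta(days=9)  # the scenario: deleted 9 days ago
remaining = days_left_in_soft_delete(deleted_on, today)
print(remaining)
```

With a deletion 9 days ago, 5 days remain, which is why undeleting now is both possible and time-sensitive.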
Alternative A is incorrect in context: the original VM is healthy, and overwriting production with an 11-day-old recovery point would cause data loss from the last 11 days without any operational need.
Alternative C ignores the RPO constraint: even though the VM is healthy today, the organization requires backup history with a 24-hour RPO. Waiting for the next backup neither resolves the historical gap nor confirms to the security team that previous data was preserved.
Alternative D is the most dangerous distractor. The reasoning that "the data is 11 days old and violates the RPO anyway" is false because the 24-hour RPO refers to the objective of future backup frequency, not a criterion for discarding existing historical data. Permanently deleting an item in soft delete is irreversible.
Answer Key: Scenario 3
Answer: B
The script output is explicit: Kernel module: xfs - not loaded. The volume to be mounted uses XFS file system, which is native to distributions like Red Hat Enterprise Linux, but is not loaded by default in Ubuntu 22.04. The mount script successfully established the connection with the recovery point but failed when trying to mount the volume locally due to the missing kernel module.
The direct solution is to load the module with sudo modprobe xfs and run the script again within the 12-hour validity period.
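Before rerunning the script, it can help to confirm that the kernel now recognizes XFS. On Linux, `/proc/filesystems` lists the filesystem types currently registered with the running kernel. The sketch below parses text in that format; the helper names and the sample contents are assumptions made for illustration.

```python
from pathlib import Path


def filesystem_registered(proc_filesystems_text: str, fstype: str) -> bool:
    """Return True if fstype appears in text formatted like /proc/filesystems."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Each line is either "<fstype>" or "nodev <fstype>"; the name is the last field.
        if fields and fields[-1] == fstype:
            return True
    return False


def xfs_available(path: str = "/proc/filesystems") -> bool:
    """Check the live kernel list; only meaningful when run on a Linux machine."""
    return filesystem_registered(Path(path).read_text(), "xfs")


# Illustrative samples, not captured from a real machine.
SAMPLE_BEFORE = "nodev\tsysfs\nnodev\ttmpfs\n\text4\n"
SAMPLE_AFTER = SAMPLE_BEFORE + "\txfs\n"

print(filesystem_registered(SAMPLE_BEFORE, "xfs"))
print(filesystem_registered(SAMPLE_AFTER, "xfs"))
```

On the target VM, calling `xfs_available()` after `sudo modprobe xfs` should return `True` if the module loaded successfully.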
Alternative A is directly refuted by the log: Connection to recovery point established confirms that the iSCSI connection was successful. The timer showing 11h47m remaining also rules out expiration as the cause.
Alternative C is a common conceptual mistake: File Recovery supports restoration between different Linux distributions, as long as the file system is accessible on the target VM.
Alternative D is false: the availability window for File Recovery is not limited to 24 hours. The 12-hour validity applies to the mount script after it is generated, not to the recovery point itself.
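The timer in the script output shows how little of that window had been used. The following sketch assumes the countdown starts at the full 12 hours when the script is generated; the function names are hypothetical.

```python
from datetime import timedelta

SCRIPT_VALIDITY = timedelta(hours=12)  # validity period stated in the explanation


def parse_hms(text: str) -> timedelta:
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def elapsed_since_generation(remaining_text: str) -> timedelta:
    """Time already used, given the remaining time printed by the script."""
    return SCRIPT_VALIDITY - parse_hms(remaining_text)


elapsed = elapsed_since_generation("11:47:32")
print(elapsed)  # 0:12:28
```

Roughly twelve and a half minutes had elapsed, so expiration cannot explain the mount failure.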
The irrelevant information is the location of the target VM's resource group in the same region as the vault, which is a File Recovery support requirement already met and unrelated to the observed failure.
Answer Key: Scenario 4
Answer: C
The correct sequence is P, T, S, R, Q, which follows the progressive diagnostic logic for the GuestAgentSnapshotTaskTimedOut error:
- P: Checking Azure VM Agent status and version is the first step because the error explicitly names the guest agent as the involved component. An outdated, stopped, or corrupted agent is the most frequent cause of this error.
- T: If the agent is healthy, the next step is to verify connectivity with Azure Backup endpoints, as the agent depends on communication with specific URLs to coordinate snapshots.
- S: With connectivity confirmed, analyzing Event Viewer logs deepens the diagnosis within the VM, looking for specific agent or VSS errors.
- R: Checking for pending or stuck snapshots is an intermediate validation step after logs, as accumulated snapshots can block new attempts.
- Q: Checking backup policy modifications is the last step because the scenario already states that no infrastructure changes were recorded. This makes the hypothesis lower priority, but it is still worth checking so it can be formally dismissed.
The sequence in alternative A (T, Q, P, S, R) errs by checking connectivity and policy before the agent, ignoring the direct clue the error name provides. Alternative B starts with policy, which the scenario already signaled as unlikely. Alternative D starts with stuck snapshots, which are a possible consequence, not an initial root cause.
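The ordering logic can be expressed as a small runner that executes checks in sequence and stops at the first one that reports a problem. This is a sketch under stated assumptions: each check is a hypothetical callable returning `True` when the component looks healthy, and the simulated results are invented for illustration.

```python
from typing import Callable, Dict, List, Optional

# Order from the answer key: agent, connectivity, event logs, snapshots, policy.
SEQUENCE: List[str] = ["P", "T", "S", "R", "Q"]

DESCRIPTIONS: Dict[str, str] = {
    "P": "Azure VM Agent status and version",
    "T": "Connectivity with Azure Backup endpoints",
    "S": "Windows Event Viewer logs for guest agent errors",
    "R": "Pending or stuck snapshots on the VM disk",
    "Q": "Backup policy modifications in the last 48 hours",
}


def first_failing_step(
    checks: Dict[str, Callable[[], bool]],
    order: List[str] = SEQUENCE,
) -> Optional[str]:
    """Run checks in order; return the label of the first unhealthy one, or None."""
    for label in order:
        if not checks[label]():
            return label
    return None


# Hypothetical outcome: agent and connectivity are fine, event logs show errors.
simulated = {
    "P": lambda: True,
    "T": lambda: True,
    "S": lambda: False,
    "R": lambda: True,
    "Q": lambda: True,
}

culprit = first_failing_step(simulated)
print(culprit, "-", DESCRIPTIONS[culprit])
```

Stopping at the first failing check keeps the investigation focused on the most likely component before moving to less probable hypotheses.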
Troubleshooting Tree: Perform Backup and Restore Operations by Using Azure Backup
Color Legend:
| Color | Node Type |
|---|---|
| Dark Blue | Initial symptom (root) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Validation or intermediate verification |
To use this tree when facing a real problem, start at the root node by identifying whether the failure is in a backup job or a restore operation. From there, follow the yes/no questions, answering only with what you can observe directly in the portal or logs. Orange nodes indicate that you need to confirm a state before proceeding. When you reach a red node, you have found the cause; the green node that follows it gives the corresponding corrective action.
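The traversal described above can be sketched as a small data structure. The node contents below are hypothetical examples loosely based on Scenarios 1 and 3, not a transcription of the actual tree, and the `Node` and `walk` names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    kind: str  # "symptom", "question", "validation", "cause", or "action"
    text: str
    branches: Dict[str, "Node"] = field(default_factory=dict)


def walk(root: Node, answers: List[str]) -> List[Node]:
    """Follow observed answers from the root and return the visited path."""
    path = [root]
    node = root
    pending = iter(answers)
    while node.branches:
        if len(node.branches) == 1:
            # Single outgoing edge (for example cause -> action): follow it automatically.
            node = next(iter(node.branches.values()))
        else:
            answer = next(pending, None)
            if answer is None:
                break  # no more observations; stop at the current node
            node = node.branches[answer]
        path.append(node)
    return path


# Hypothetical fragment of a tree, for illustration only.
fix_state = Node("action", "Bring the VM to a running state and rerun the backup")
vm_state = Node("cause", "VM is in a state that prevents backup", {"next": fix_state})
check_agent = Node("validation", "Confirm the VM agent status")
backup_q = Node(
    "question",
    "Does the job fail within seconds with UserErrorVmNotInDesiredState?",
    {"yes": vm_state, "no": check_agent},
)
load_module = Node("action", "Load the module with sudo modprobe xfs and rerun the script")
missing_module = Node("cause", "Filesystem kernel module not loaded", {"next": load_module})
restore_q = Node(
    "question",
    "Does the mount script report a missing filesystem module?",
    {"yes": missing_module, "no": check_agent},
)
root = Node("symptom", "Operation failed", {"backup": backup_q, "restore": restore_q})

kinds = [node.kind for node in walk(root, ["backup", "yes"])]
print(kinds)
```

The path ends at an action node only after passing through a cause node, mirroring the red-then-green progression in the legend.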