Troubleshooting Lab: Configure and interpret reports and alerts for backups
Diagnostic Scenarios
Scenario 1 - Root Cause
An administrator configures Backup Reports for a Recovery Services vault created two months ago. The vault's diagnostic settings are active and pointing to a Log Analytics Workspace in the same region. The workspace retention plan is set to 90 days.
When accessing Azure Backup Center and navigating to the reports section, the administrator selects the correct workspace in the filter and sets the time range to the last 60 days. The dashboard correctly displays job and policy data, but the Backup Instances tab is completely empty, with no items listed.
The administrator verifies in the portal that the vault protects 14 virtual machines, all with successful backups in the last 7 days. The vault uses default backup policies and has not undergone any recent changes. The subscription is active and has no quota alerts.
Additional information collected by the administrator:
Workspace: law-backup-prod-eastus
Vault: rsv-prod-eastus
Diagnostics settings enabled on: 2024-01-15
Current date: 2024-03-20
Configured tables:
- CoreAzureBackup
- AddonAzureBackupJobs
- AddonAzureBackupPolicy
- AddonAzureBackupStorage
What is the root cause of the missing data in the Backup Instances tab?
A) The workspace is configured for 90-day retention, which limits the display of instances to that period, hiding older records.
B) The AddonAzureBackupProtectedInstance table was not included in the vault's diagnostic settings.
C) The 60-day time range filter exceeds the supported limit by Backup Reports for the instances tab, which is 30 days.
D) Azure Backup Center requires explicit read permission at the workspace level to display protected instance data, and this permission is missing.
Scenario 2 - Action Decision
The operations team received a critical alert at 02:17 indicating backup failure of a production virtual machine running a critical SQL database. Initial investigation confirmed that the cause is an outdated backup agent installed on the VM, incompatible with the current version of the Azure Backup extension.
The environment has the following restrictions at the moment:
| Restriction | Detail |
|---|---|
| Maintenance window | Ends at 04:00 (1h43 remaining) |
| Production impact | VM is active and processing transactions |
| Last successful backup | 22:00 the previous day |
| Available permission | Administrator with Contributor access on the subscription |
| Change policy | Agent updates require CAB approval |
What is the correct action to take at this moment?
A) Immediately update the backup agent on the VM, since the administrator has Contributor permission and the maintenance window is still open.
B) Execute an on-demand backup via Azure portal to cover the period since the last successful backup, without changing the agent, and log the occurrence for treatment at the next CAB.
C) Suspend VM protection in the vault, manually remove the backup extension via SSH and reinstall it in the latest version within the maintenance window.
D) Escalate to the emergency CAB requesting approval to update the agent now, waiting for response before any action.
Scenario 3 - Root Cause
An administrator reports that backup alerts configured in Azure Monitor stopped being sent to the linked action group. The alert rule has Enabled status in the portal and shows no visible error messages.
The administrator collects the following information:
Alert rule : alert-backup-unhealthy
Signal : Backup Health Events
Condition : Health Status = Unhealthy
Action group : ag-ops-email
Rule status : Enabled
Last trigger : 8 days ago
When querying the action group directly, the administrator verifies:
Action group : ag-ops-email
Status : Enabled
Configured action : Email
Address : ops-team@empresa.com
Last use : 8 days ago
The administrator also observes that, in the last 8 days, three VMs presented backup failures confirmed by the Recovery Services vault portal, and the failures are recorded in the workspace logs.
The vault underwent a resource group migration 10 days ago, performed by the infrastructure team for resource reorganization.
What is the root cause of the problem?
A) The action group has an invalid email address after the resource group migration, since linked resources lose notification settings during movements.
B) The alert rule lost the link with the Recovery Services vault after migration between resource groups, now monitoring a non-existent or incorrect scope.
C) The Backup Health Events signal has a trigger limit of once every 7 days by design, which explains the absence of alerts after the last trigger 8 days ago.
D) The Log Analytics Workspace stopped receiving events from the vault after migration, since diagnostic settings are automatically unlinked during resource group movements.
Scenario 4 - Diagnostic Sequence
An administrator receives the following complaint from the business team:
"The Backup Reports dashboard displays no data. We configured everything correctly last week."
The administrator needs to investigate the problem systematically. The available steps for diagnosis are:
1. Verify if the vault's diagnostic settings are enabled and pointing to the correct workspace.
2. Confirm in the Recovery Services vault portal if backups were executed in the period selected by the filter.
3. Query the workspace tables via KQL to verify if there is ingested data.
4. Verify if the correct workspace is selected in the Backup Reports filter.
5. Confirm which diagnostic tables were enabled in the vault settings.
What is the correct diagnostic sequence?
A) 2 → 1 → 5 → 3 → 4
B) 4 → 1 → 5 → 3 → 2
C) 1 → 3 → 4 → 5 → 2
D) 4 → 2 → 1 → 5 → 3
Answer Key and Explanations
Answer Key - Scenario 1
Answer: B
The list of tables configured in diagnostic settings does not include AddonAzureBackupProtectedInstance, which is exactly the table responsible for feeding the Backup Instances tab in Backup Reports. Each report tab depends on a specific table: jobs depend on AddonAzureBackupJobs, policies on AddonAzureBackupPolicy, and protected instances on AddonAzureBackupProtectedInstance. The absence of this table in the configuration makes only the instances tab empty, while the others display data normally.
The 90-day retention detail (alternative A) is a deliberate red herring. Retention controls how long data is kept, not which data is collected. Alternative C is false; there is no 30-day limit for the instances tab. Alternative D is a plausible distractor, but the statement describes no symptom of a permission error, and the administrator can view the other tabs of the same report without any problem.
The most dangerous distractor is alternative D: an administrator who assigns additional permissions without checking diagnostic settings will waste time and not solve the problem.
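The tab-to-table dependency described in this explanation can be checked mechanically. The following is a minimal sketch, assuming the mapping stated above; `TAB_TABLES` and `empty_tabs` are hypothetical names introduced for illustration and are not part of any Azure SDK.

```python
# Hypothetical mapping of Backup Reports tabs to the diagnostic tables that
# feed them, taken from the explanation above (not from an official API).
TAB_TABLES = {
    "Jobs": "AddonAzureBackupJobs",
    "Policies": "AddonAzureBackupPolicy",
    "Backup Instances": "AddonAzureBackupProtectedInstance",
}

# Tables listed in the scenario's diagnostic settings.
configured_tables = {
    "CoreAzureBackup",
    "AddonAzureBackupJobs",
    "AddonAzureBackupPolicy",
    "AddonAzureBackupStorage",
}


def empty_tabs(configured):
    """Return the tabs whose backing table is not being collected."""
    return sorted(
        tab for tab, table in TAB_TABLES.items() if table not in configured
    )


print(empty_tabs(configured_tables))  # ['Backup Instances']
```

Adding `AddonAzureBackupProtectedInstance` to the configured set makes the function return an empty list, which mirrors the fix for this scenario.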
Answer Key - Scenario 2
Answer: B
The cause is already identified (outdated agent), but the critical restriction that determines the correct action is the change policy: agent updates require CAB approval. Acting without approval, even with technical permission available (alternative A), violates the governance process and exposes the administrator to responsibility for an unauthorized change in production.
The correct action is to execute an on-demand backup, which neither changes configuration nor requires CAB approval. It covers the period since the last successful backup while the definitive fix follows the correct process. This protects the RPO without violating procedural restrictions.
Alternative C combines two problems: suspending protection and manually removing the extension are destructive, unauthorized actions with a real risk of production interruption. Alternative D ignores that an immediate lower-risk action (the on-demand backup) is available and can run while the approval process proceeds in parallel; waiting wastes the available protection window.
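The two time figures that drive this decision, the remaining maintenance window and the period without a backup, follow from simple arithmetic. The sketch below assumes illustrative calendar dates, since the scenario gives only clock times; the variable names are hypothetical.

```python
from datetime import datetime, timedelta

# Calendar dates are assumptions for illustration; the scenario states only
# the clock times (22:00 the previous day, 02:17, 04:00).
last_backup = datetime(2024, 3, 19, 22, 0)
alert_time = datetime(2024, 3, 20, 2, 17)
window_end = datetime(2024, 3, 20, 4, 0)

# Period with no recovery point at the moment the alert fired.
unprotected_gap = alert_time - last_backup

# Time left in the maintenance window when the alert fired.
window_remaining = window_end - alert_time

print(unprotected_gap)   # 4:17:00
print(window_remaining)  # 1:43:00
```

The 1h43 figure matches the restriction table, and the 4h17 gap is the exposure that the on-demand backup is meant to close.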
Answer Key - Scenario 3
Answer: B
The decisive clue in the statement is the vault migration between resource groups 10 days ago, around the time the alerts stopped. Azure Monitor alert rules are configured with a scope that points to a resource by its Resource ID. Because the resource group name is part of the Resource ID, moving a resource between resource groups changes its ID. The alert rule then monitors an ID that no longer exists and becomes silently inoperative, without displaying visible errors in the portal.
The detail that the action group is enabled and was last used 8 days ago is a red herring included to divert the diagnosis. The action group works correctly; the problem is in the rule scope.
Alternative D is the most dangerous distractor: it's technically true that diagnostic settings can be affected by movements, but the statement informs that backup failures are recorded in workspace logs, which confirms that data ingestion continues working. This eliminates alternative D. Alternative C describes non-existent behavior in the platform.
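To see why a resource group move invalidates the rule scope, it helps to look at how the Resource ID is composed. The sketch below builds the ID from its parts and compares it with the scope stored in a rule. The subscription ID, resource group names, and vault name are placeholders, and `vault_resource_id` and `scope_is_stale` are hypothetical helpers, not Azure SDK functions.

```python
def vault_resource_id(subscription_id, resource_group, vault_name):
    """Build the Resource ID of a Recovery Services vault from its parts."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.RecoveryServices/vaults/{vault_name}"
    )


def scope_is_stale(rule_scopes, current_vault_id):
    """Return True when no scope of the alert rule matches the vault's current ID."""
    # Resource IDs are treated as case-insensitive, so compare in lower case.
    wanted = current_vault_id.lower()
    return all(scope.lower() != wanted for scope in rule_scopes)


# Placeholder values for illustration only.
SUB = "00000000-0000-0000-0000-000000000000"
old_id = vault_resource_id(SUB, "rg-backup-old", "rsv-example")
new_id = vault_resource_id(SUB, "rg-backup-new", "rsv-example")

print(scope_is_stale([old_id], new_id))  # True: rule still points at the old ID
print(scope_is_stale([new_id], new_id))  # False: scope was updated after the move
```

In practice the comparison would use the scope read from the alert rule and the vault's current ID as shown in the portal; the fix is to update the rule scope to the new ID.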
Answer Key - Scenario 4
Answer: B
The correct sequence is 4 → 1 → 5 → 3 → 2, as it follows the logic of progressive diagnosis from the most superficial check to the deepest:
Step 4 first: verifying that the correct workspace is selected in the filter eliminates the simplest and most common error without requiring any additional technical access.
Step 1 next: confirming that diagnostic settings are enabled and pointing to the right workspace validates the data source.
Step 5: verifying which tables were enabled in the settings is more specific than confirming that the settings exist, and should come before querying the data.
Step 3: querying the tables via KQL empirically confirms whether data is being ingested, validating the previous steps.
Step 2 last: confirming that backups were executed in the period only makes sense after ruling out all configuration and ingestion problems. If there's no data in the report but there's data in the workspace, the problem is filtering; if there's no data in the workspace, the problem is ingestion.
Alternative A makes the classic mistake of going directly to backup verification before validating the data collection infrastructure. Alternative C starts with the technical source before eliminating simple causes.
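The progressive order above can be expressed as a small checklist runner that stops at the first failing check. This is a minimal sketch: the step descriptions paraphrase the list in the scenario, and `first_failing_step` is a hypothetical helper whose input would come from manual checks in the portal.

```python
# Order taken from the answer key: most superficial check first.
DIAGNOSTIC_ORDER = [4, 1, 5, 3, 2]

STEPS = {
    1: "Diagnostic settings enabled and pointing to the correct workspace",
    2: "Backups executed in the period selected by the filter",
    3: "Workspace tables contain ingested data (KQL query)",
    4: "Correct workspace selected in the Backup Reports filter",
    5: "Required diagnostic tables enabled in the vault settings",
}


def first_failing_step(results):
    """Walk the checks in diagnostic order and return the first that failed.

    `results` maps step number to True (check passed) or False (check failed).
    Returns a (step, description) tuple, or None when every check passed.
    """
    for step in DIAGNOSTIC_ORDER:
        if not results[step]:
            return step, STEPS[step]
    return None


# Example: the workspace filter and settings are fine, but a table is missing.
observed = {4: True, 1: True, 5: False, 3: False, 2: True}
print(first_failing_step(observed))
```

Here the runner reports step 5 rather than step 3, because a missing table explains the empty KQL result and should be fixed first.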
Troubleshooting Tree: Configure and interpret reports and alerts for backups
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start at the root node and answer each question based on what is observable in the portal or logs. Follow the branch corresponding to the answer until reaching a red node of identified cause, then advance to the green action node. After executing the action, go through the orange validation node to confirm the problem was resolved before closing the incident.