Troubleshooting Lab: Interpret metrics in Azure Monitor
Diagnostic Scenariosβ
Scenario 1 β Root Causeβ
A company's operations team receives a complaint: the monitoring dashboard shows that the Available Memory Bytes metric from a production Windows VM (vm-prod-app01, Standard D4s v3, East US 2 region) has simply disappeared from the graphs in Metrics Explorer. The metric was visible until three days ago.
The responsible engineer verifies the following:
VM Status : Running
OS : Windows Server 2022
Azure Monitor Agent: Installed (version 1.22.0)
Diagnostic Settings: Enabled (destination to Log Analytics Workspace)
Workspace Region : West US
Last heartbeat : 4 days ago
The team mentions that two days ago, they applied a corporate security policy update that blocked outbound communication on port 443 to some endpoints. They also report that the VM underwent a resize from D2s to D4s during the same period, an operation that restarted the machine.
What is the root cause of the metric disappearance?
A) The VM resize from D2s to D4s restarted the agent and corrupted its local configuration, requiring reinstallation.
B) The security policy blocked the Azure Monitor Agent communication with ingestion endpoints, interrupting the sending of guest metrics.
C) The Available Memory Bytes metric is a platform metric that was discontinued for D-series VMs after the September 2023 version.
D) The destination of the Log Analytics Workspace in a different region from the VM causes replication latency exceeding 72 hours, making the metric appear absent.
Scenario 2 β Action Decisionβ
The SRE team identified the cause of a recurring false positive alert: an alert rule configured for the Transactions metric of a Storage Account uses 5-minute granularity with 1-minute evaluation frequency and Total aggregation. The overlapping windows are inflating the evaluated values.
The Storage Account is used by a critical financial application that operates 24 hours a day, 7 days a week, without scheduled maintenance windows. The current alert rule is notifying the team via PagerDuty every 3 to 5 minutes, generating alert fatigue. The engineer has Monitoring Contributor permission on the Resource Group.
What is the correct action to take at this moment?
A) Delete the existing alert rule and recreate from scratch with aligned granularity and evaluation frequency (both at 5 minutes), without interrupting the application.
B) Increase the alert condition threshold to a value high enough to absorb the inflation caused by overlapping windows, maintaining the current configuration.
C) Temporarily disable the alert rule, wait for a maintenance window to apply necessary corrections and re-enable afterwards.
D) Edit the existing alert rule changing only the evaluation frequency to 5 minutes, aligning it with the granularity, without deleting or recreating the rule.
Scenario 3 β Root Causeβ
A developer opens a ticket reporting that they created an alert in Azure Monitor to monitor the Http5xxErrors metric of an Azure App Service, but the alert never fired, even with error 500s visible in the application logs.
The on-call engineer accesses the portal and finds the following:
Resource : App Service (app-api-prod)
Metric : Http5xxErrors
Aggregation : Average
Granularity : 5 minutes
Condition : Average > 0
Frequency : Every 5 minutes
Rule status : Enabled
Action Group : ag-oncall (email + webhook)
The engineer also observes that the App Service is on a Standard S2 tier App Service Plan and that Application Insights is enabled and connected. The Application Insights logs clearly show HTTP 500 type exceptions occurring every 10 to 15 minutes, with durations of less than 30 seconds each.
What is the root cause of the alert never having fired?
A) Application Insights captures the errors before the Http5xxErrors metric is recorded, preventing Azure Monitor from counting them.
B) The Average aggregation diluted the point errors within the 5-minute window: if only 1 or 2 requests failed out of hundreds, the average remains close to zero and doesn't cross the > 0 threshold.
C) The Standard S2 tier of the App Service Plan doesn't support emitting the Http5xxErrors metric to Azure Monitor; this feature requires the Premium tier.
D) The Action Group ag-oncall has the webhook configured incorrectly, and the alert fires internally, but the notification is never delivered.
Scenario 4 β Diagnostic Sequenceβ
An engineer receives the following report: "CPU alerts from multiple VMs stopped firing at the same time. The VMs continue to function normally."
Below are the available investigation steps, presented out of order:
[P] Verify if alert rules are enabled and without active suppression
[Q] Confirm if CPU metrics are being collected and visible in Metrics Explorer
[R] Check alert firing history in the last 24 hours in the "Alert history" tab
[S] Check if the Action Group associated with the rules has valid endpoints and no delivery errors
[T] Confirm if there was a recent change in alert rules or configured thresholds
What is the correct diagnostic sequence?
A) T, P, Q, R, S
B) Q, P, T, R, S
C) R, Q, P, T, S
D) P, Q, R, T, S
Answer Key and Explanationsβ
Answer Key β Scenario 1β
Answer: B
The decisive clue is in Last heartbeat: 4 days ago, which coincides exactly with the moment the security policy was applied. The Azure Monitor Agent depends on outbound HTTPS communication (port 443) with specific Microsoft endpoints to send guest metrics, including Available Memory Bytes. Blocking this communication silently interrupts the data flow, making the metric disappear without visible errors on the VM.
The information about the VM resize is purposely irrelevant: reboots don't corrupt the agent configuration, and the fact that the VM is in Running state rules out availability problems. Alternative C is false because Available Memory Bytes continues to be a supported guest metric. Alternative D describes a behavior that doesn't exist: region differences between VM and workspace cause minimal latency, not data absence.
The most dangerous distractor is A, as the resize and reboot are concrete events close in time, which leads the engineer to attribute the cause to the most visible event, ignoring the network policy applied simultaneously.
Answer Key β Scenario 2β
Answer: D
The cause has already been identified (overlapping windows due to frequency lower than granularity) and the surgical correction is to align the evaluation frequency with the granularity, both at 5 minutes. Alternative D solves the problem without deleting the rule, without interrupting monitoring and without requiring a maintenance window, respecting all scenario constraints.
Alternative A would also fix the problem technically, but deleting and recreating a rule implies a period of coverage absence, which is unacceptable for a critical financial application without maintenance windows. Alternative B is a workaround that masks the problem without fixing it, compromising alert sensitivity. Alternative C directly violates the constraint of no maintenance window.
The engineer has Monitoring Contributor permission, sufficient to edit alert rules, making alternative D viable without permission escalation.
Answer Key β Scenario 3β
Answer: B
The Average aggregation is the critical point. The Http5xxErrors metric counts the number of errors in an interval. When aggregation is Average, Azure Monitor calculates the average of points collected within the granularity window. If the application processes hundreds of requests with 1 or 2 failures, the average errors per collection period results in a decimal value very close to zero, never consistently crossing the > 0 threshold.
The correct solution would be to use Total or Count aggregation, which would accumulate errors and cross the threshold even for point failures.
The information about Application Insights is purposely irrelevant: Application Insights and App Service platform metrics are independent sources; one doesn't interfere with the other. Alternative D is the most dangerous distractor because it directs investigation toward the Action Group, which is a separate component from alert firing. If the alert never fired (and didn't "fire but not notify"), the problem is in the firing condition, not in delivery.
Answer Key β Scenario 4β
Answer: B
The correct sequence is: Q, P, T, R, S.
Progressive diagnostic reasoning should start with the most fundamental layer: verify if data exists (Q). If metrics aren't visible in Metrics Explorer, the problem is in collection, not rules. Next, confirm if rules are active and without suppression (P), then investigate if there were recent configuration changes (T), check firing history to understand if alerts evaluated conditions (R) and, lastly, verify if the problem is in delivery via Action Group (S).
Alternative A starts with recent changes before confirming if data exists, which skips the most basic layer. Alternative C starts with alert history, but if metrics aren't being collected, the history will be empty and the information will be inconclusive without the context of Q first. Alternative D starts with rules without confirming if there's data to evaluate.
The most common error in this type of diagnosis is going straight to rule configuration without first validating that telemetry is flowing.
Troubleshooting Tree: Interpret metrics in Azure Monitorβ
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Medium blue | Diagnostic question (decision) |
| Orange | Intermediate verification or validation |
| Red | Identified cause |
| Green | Recommended action or resolution |
To use this tree when facing a real problem, start with the root node describing the observed symptom and follow the branches by answering each question based on what you can verify directly in the portal or via command. Each answer eliminates a set of hypotheses and narrows the path to the cause. Never skip a verification level to go straight to an assumed cause: the tree is designed so that the order of questions filters the most common distractors before reaching the real cause.