Troubleshooting Lab: Query and Analyze Logs in Azure Monitor
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team monitors a set of Windows virtual machines connected to a Log Analytics Workspace. The Azure Monitor Agent (AMA) was installed on all VMs three weeks ago and the Data Collection Rules (DCRs) were correctly associated. The workspace has 90 days of retention configured.
During an incident investigation, an analyst executes the query below and finds no events of type 4625 (authentication failure) for a specific VM named vm-prod-03, although the security team confirms that failed login attempts occurred on this machine in the last two hours:
SecurityEvent
| where Computer == "vm-prod-03"
| where EventID == 4625
| where TimeGenerated > ago(2h)
The analyst verifies that other events from the same VM appear normally in the Heartbeat table. The VM is powered on and responsive. The workspace shows no ingestion alerts. The network team reports no recent firewall changes.
What is the root cause of the absence of 4625 events in the query?
A) The 90-day retention does not cover security events of type 4625, which require separate extended retention configuration.
B) The Data Collection Rule associated with the VM does not collect the Windows Event Log Security channel with a level or EventID filter that covers event 4625.
C) The Azure Monitor Agent is failing silently on vm-prod-03; the Heartbeat table uses a different communication channel and does not reflect the state of event collection.
D) The query uses Computer == "vm-prod-03" with case-sensitive comparison, and the actual name registered in the SecurityEvent table uses different capitalization.
Scenario 2: Action Decision
The cause has been identified: a log-based Alert Rule in Azure Monitor is triggering duplicate alerts at each evaluation cycle because the underlying query does not use the appropriate deduplication operator. The query counts all occurrences of a recurring event, and each alert execution returns an increasing number of results above the threshold, generating repeated notifications to the operations team.
The context is as follows:
- The alert is in production and monitors critical errors from a financial application
- The operations team is being flooded with notifications and has already ignored two real incidents in the last 24 hours due to alert fatigue
- You have permission to edit the query and alert configuration
- A maintenance window is scheduled in 72 hours
- Completely disabling the alert removes monitoring coverage for critical errors
What is the correct action to take at this moment?
A) Wait for the maintenance window in 72 hours to safely fix the query, avoiding any changes in production outside the established process.
B) Disable the alert immediately to stop the noise, and reactivate only after the maintenance window with the corrected query.
C) Edit the alert query now to fix the deduplication logic and adjust the threshold if necessary, keeping the alert active and functional during the correction.
D) Redirect the alert notifications to a low-priority secondary channel until the maintenance window, without changing the query.
Scenario 3: Root Cause
An administrator is investigating slowness in a web application hosted on Azure App Service. The administrator opens the Log Analytics Workspace and executes the following query to calculate the average response time per hour over the last 6 hours:
AppRequests
| where TimeGenerated > ago(6h)
| summarize AvgDuration = avg(DurationMs) by bin(TimeGenerated, 1h), Name
| order by TimeGenerated desc
The results show AvgDuration values between 200ms and 400ms in all periods, with no anomalies. The administrator concludes there is no performance issue.
However, the support team continues to receive user complaints about slowness between 2 PM and 3 PM. The App Service Plan was scaled up yesterday from P1v3 to P2v3. The application uses authentication via Microsoft Entra ID. The internal web server logs show requests taking more than 8 seconds in that interval.
What is the root cause of the discrepancy between what the query returns and what users experienced?
A) The query uses avg() instead of percentile(), and the average is being distorted by a large volume of fast requests that hide the slow cases experienced by a subset of users.
B) The vertical scaling of the App Service Plan from P1v3 to P2v3 caused a momentary interruption that erased the logs from the period between 2 PM and 3 PM from the AppRequests table.
C) Authentication via Microsoft Entra ID adds latency that is not recorded in the DurationMs column of the AppRequests table, being invisible to the query.
D) The query filters by TimeGenerated > ago(6h), but the Azure Monitor ingestion delay means the interval between 2 PM and 3 PM is not yet available in the workspace.
Scenario 4: Diagnostic Sequence
An administrator receives the following complaint: "We created a log alert in Azure Monitor two days ago, but it never fired, even with the conditions being met."
The administrator has access to the Azure portal and the Log Analytics Workspace. The available investigation steps are:
1. Manually execute the alert query in the Log Analytics Workspace to verify if it returns results with data from the configured evaluation period.
2. Check the alert execution history in the Alert Rule tab to confirm if the query is being executed and what result is being returned.
3. Confirm if the Action Group associated with the alert is correctly configured and if the notification channel is active.
4. Verify if the alert is in Enabled state and if it is not in suppression period or within a Maintenance window.
5. Review the firing condition (threshold, aggregation type, and evaluation period) to verify if the configured criterion is mathematically compatible with the values returned by the query.
What is the correct investigation sequence?
A) 1 → 2 → 5 → 4 → 3
B) 4 → 1 → 2 → 5 → 3
C) 2 → 1 → 4 → 5 → 3
D) 4 → 2 → 1 → 5 → 3
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive clue in the statement is that the Heartbeat table contains normal data from the VM, while SecurityEvent does not contain the expected events. Unlike the legacy MMA agent, AMA does not collect logs automatically by default: all collection is governed by the Data Collection Rule. The DCR must explicitly specify which Windows Event Log channels are collected and with what level or EventID filter. If the DCR associated with the VM does not include the Security channel with a scope broad enough to capture "Audit Failure" events, which include EventID 4625, these events are simply never ingested.
The information about absence of firewall changes and the VM status are irrelevant to this diagnosis and were purposely included to divert attention to the network layer.
Alternative A is false: retention affects how long data remains available, not which events are collected. Alternative C represents a common misconception: heartbeats and security events are sent by the same agent through the same pipeline, so the presence of heartbeats confirms that the agent is functional. Alternative D deserves a quick check rather than outright dismissal, because the == operator in KQL is case-sensitive for strings (the case-insensitive form is =~). Nothing in the statement points to a naming mismatch, however, while the statement does establish that collection depends entirely on the DCR, so D is at most a secondary verification and not the root cause here.
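Alternative D can also be ruled out empirically rather than by argument. The sketch below reruns the original filter with the case-insensitive =~ operator and additionally accepts a fully qualified machine name; the startswith clause and the grouping are assumptions added for illustration and are not part of the original query.

```kusto
SecurityEvent
| where TimeGenerated > ago(2h)
// =~ compares strings case-insensitively; startswith (also case-insensitive)
// covers a hypothetical fully qualified name such as vm-prod-03.<domain>.
| where Computer =~ "vm-prod-03" or Computer startswith "vm-prod-03."
| summarize Events = count() by Computer, EventID
| order by Events desc
```

If this returns rows for other EventIDs but none for 4625, or no rows at all, the spelling of the computer name is not what hides the events.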
The most dangerous distractor is C: acting on it would lead to reinstalling the agent instead of reviewing the DCR, which would not solve the problem and would cause unnecessary interruption to the VM.
Answer Key: Scenario 2
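Before touching the agent, it may help to see which tables are actually receiving rows from the VM. The following is a minimal sketch, not a definitive procedure: including the Event table (the general Windows event log table) and the 24-hour window are assumptions chosen for illustration.

```kusto
// isfuzzy=true lets the union succeed even if a listed table does not exist in the workspace.
union isfuzzy=true withsource=SourceTable Heartbeat, Event, SecurityEvent
| where TimeGenerated > ago(24h)
| where Computer =~ "vm-prod-03"
| summarize Rows = count(), LastRecord = max(TimeGenerated) by SourceTable
```

Recent rows under Heartbeat with nothing under SecurityEvent would be consistent with a working agent whose DCR does not cover the Security channel.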
Answer: C
The scenario presents an identified cause (query with incorrect logic generating duplicate alerts) and a critical ongoing constraint: the team has already ignored real incidents in the last 24 hours due to alert fatigue. This transforms the situation into an active operational risk, not a problem to be managed calmly until the maintenance window.
Alternative A ignores the real ongoing impact. Waiting 72 hours maintains the degraded state and prolongs the risk of real incidents being ignored. Alternative B removes monitoring coverage from a critical financial system, which is worse than the current noise. Alternative D does not solve the problem: redirecting notifications to a low-priority channel only confirms that alerts will be ignored, increasing the risk.
Alternative C is the only one that balances all constraints: keeps the alert active, fixes the root cause immediately, and requires no service interruption. Editing the query of an Alert Rule in Azure Monitor does not require a maintenance window and can be done with zero impact on the monitored application.
The reasoning error of the distractors is treating the process (maintenance window) as more important than the already materialized operational risk.
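The scenario does not show the alert query, so the following is only a sketch of one possible deduplication approach, using hypothetical sample data. The table name CriticalErrors and the columns ErrorId and Message are invented for illustration; the real query would need the application's own table and key column.

```kusto
// Hypothetical sample rows standing in for the application's critical-error records.
let CriticalErrors = datatable(TimeGenerated: datetime, ErrorId: string, Message: string)
[
    datetime(2024-01-01 10:00:00), "E-100", "Payment gateway timeout",
    datetime(2024-01-01 10:01:00), "E-100", "Payment gateway timeout",
    datetime(2024-01-01 10:02:00), "E-100", "Payment gateway timeout",
    datetime(2024-01-01 10:03:00), "E-200", "Ledger write failed"
];
CriticalErrors
| summarize arg_max(TimeGenerated, *) by ErrorId   // keep only the most recent row per ErrorId
| summarize DistinctErrors = count()               // 2 here, whereas a plain count() over the raw rows gives 4
```

Collapsing recurring events to one row per key keeps the measured value from growing with every repetition of the same error.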
Answer Key: Scenario 3
Answer: A
The query uses avg(DurationMs), which calculates the arithmetic mean of all requests in the interval. If in the period between 2 PM and 3 PM there was a large volume of fast requests (200ms to 300ms) and a smaller subset of extremely slow requests (8 seconds or more), the average can easily result in an apparently normal value, like 350ms, that does not represent the experience of affected users.
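The effect can be reproduced with synthetic numbers. The sketch below assumes 990 requests at 250 ms and 10 requests at 8000 ms; these figures are illustrative only and do not come from the scenario.

```kusto
// Synthetic data: 990 fast requests (250 ms) and 10 slow requests (8000 ms).
range i from 1 to 1000 step 1
| extend DurationMs = iff(i <= 990, 250.0, 8000.0)
| summarize AvgDuration = avg(DurationMs),
            MaxDuration = max(DurationMs),
            SlowRequests = countif(DurationMs >= 8000)
// AvgDuration = 327.5, MaxDuration = 8000, SlowRequests = 10
```

An average of 327.5 ms sits comfortably inside the 200 ms to 400 ms band even though ten requests took 8 seconds each.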
The critical clue is that the internal web server logs show requests above 8 seconds in the same interval. This confirms that the data exists and that the problem is the chosen aggregation metric, not absence of data. Using percentile(DurationMs, 95) or percentile(DurationMs, 99) would reveal the anomaly.
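A possible revision of the administrator's query is sketched below. It keeps the original table, filter, and grouping and adds percentile, maximum, and count columns; the added column names are arbitrary labels chosen for this example.

```kusto
AppRequests
| where TimeGenerated > ago(6h)
| summarize
    AvgDuration = avg(DurationMs),
    P95Duration = percentile(DurationMs, 95),
    P99Duration = percentile(DurationMs, 99),
    MaxDuration = max(DurationMs),
    Requests = count()
    by bin(TimeGenerated, 1h), Name
| order by TimeGenerated desc
```

Keeping the average next to the tail metrics makes the gap between typical and worst-case requests visible in a single result set.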
The information about the App Service Plan scaling is irrelevant and purposely included to induce the reader to consider alternative B. Vertical scaling in Azure App Service does not erase logs already ingested in the workspace.
Alternative C describes incorrect behavior: Microsoft Entra ID authentication latency occurs before the request reaches the application, and the time recorded in DurationMs represents the processing time on the application server. Alternative D does not hold either: the query returned values for every hourly period, including 2 PM to 3 PM, so the data for that interval was already available in the workspace.
Answer Key: Scenario 4
Answer: D
The correct sequence is: 4 → 2 → 1 → 5 → 3.
The correct diagnostic reasoning goes from the simplest and most verifiable to the most complex:
Step 4 should be first because a disabled alert or one in suppression does not fire by design. Checking this state immediately eliminates the most trivial cause before any data analysis.
Step 2 comes next because the execution history reveals if Azure Monitor is executing the query and what value it is returning. If the history shows executions with results above the threshold but no firing, the problem is in the condition. If there are no recorded executions, the problem is in the alert state.
Step 1 validates the query in isolation, confirming that it returns real data in the workspace context, which is necessary before analyzing if the threshold is adequate.
Step 5 analyzes if the mathematical firing condition is compatible with what the query returns, which only makes sense after confirming that the query works correctly.
Step 3 is last because the Action Group is only relevant if the alert should have fired but the notification did not arrive. Checking it before confirming that firing should have occurred is a waste of diagnostic effort.
Alternative A makes the error of executing the query before checking if the alert is active, potentially investigating data when the problem is purely configuration state.
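Steps 1 and 5 can be combined in a single manual check, sketched here under stated assumptions: the 15-minute window, the threshold of 10, and the choice of failed requests in AppRequests as the alert signal are all hypothetical stand-ins for whatever the real rule is configured with.

```kusto
let evaluationWindow = 15m;   // hypothetical: the rule's configured evaluation period
let threshold = 10;           // hypothetical: the rule's configured threshold
AppRequests
| where TimeGenerated > ago(evaluationWindow)
| where Success == false
| summarize FailedRequests = count()
// Mirrors a "greater than" condition; adjust the operator to match the rule.
| extend WouldFire = FailedRequests > threshold
```

If WouldFire is false while the team believes the condition was met, the mismatch lies in the window, the aggregation, or the threshold rather than in the Action Group.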
Troubleshooting Tree: Query and Analyze Logs in Azure Monitor
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |
To use this tree when facing a real problem, start with the root node identifying whether the problem is absence of data or alert failure. Follow the closed questions answering with what you observe in the environment, without skipping steps. Each orange node represents an intermediate verification that must be completed before advancing to the next question. Red nodes indicate where to stop and act. If the path taken leads to a cause that does not resolve the problem after correction, return to the origin node and follow the next unexplored branch.