Skip to main content

Troubleshooting Lab: Set up alert rules, action groups, and alert processing rules in Azure Monitor

Diagnostic Scenarios​

Scenario 1 β€” Root Cause​

The operations team reports that an alert rule configured to monitor the Percentage CPU metric of a virtual machine never fired, even during a high load event documented by the application team. The event occurred last Friday at 14:30 UTC and lasted approximately 20 minutes, with sustained CPU above 95% according to the Azure portal metrics dashboard.

The alert rule was created three weeks ago and is listed as Enabled in the portal. The associated action group was manually tested the same week it was created and sent email correctly. The subscription has no Azure Policy blocking alert creation. The virtual machine is in a different region from the team's default region, but still within the same subscription.

The current rule configuration is as follows:

Resource:             /subscriptions/xxx/resourceGroups/rg-prod/providers/
Microsoft.Compute/virtualMachines/vm-prod-app01
Metric: Percentage CPU
Operator: Greater than
Threshold: 80
Aggregation type: Average
Aggregation period: 5 minutes
Evaluation frequency: 15 minutes
Severity: Sev 2
Action Group: ag-ops-standard

After initial investigation, the team observes in the rule's evaluation history that it was evaluated normally during the incident period and that no violation condition was recorded, despite the values visible in the metrics dashboard.

What is the root cause of the problem?

A. The action group ag-ops-standard lost association with the rule after a permissions update in the subscription.

B. The metric displayed in the dashboard uses 1-minute granularity, while the rule aggregates data in 5-minute windows, and the average calculated in that window did not exceed the 80% threshold.

C. The 15-minute evaluation frequency caused the incident window to be ignored completely.

D. The virtual machine being in a different region from the default caused delay in metric collection, preventing correct evaluation.


Scenario 2 β€” Action Decision​

The cause of the problem has been identified: an alert processing rule of type Suppress notifications is active at the entire subscription scope, without recurring scheduling, and was created six months ago by an engineer who has left the company. There is no documentation about the original purpose of this rule.

The environment is active production, with contractual SLA for incident response in effect. The security team informed that there is an audit of monitoring configurations scheduled for tomorrow at 09:00 UTC. The operations manager authorized any necessary action to restore notifications immediately.

What is the correct action to take at this moment?

A. Immediately delete the alert processing rule to restore default notification behavior across the entire subscription.

B. Disable the alert processing rule, validate that notifications are working correctly again, and only after confirmation, decide between deleting it or documenting it and rewriting it with appropriate scope and scheduling.

C. Create a new alert processing rule of type Apply action groups in the same scope to override the existing suppression rule.

D. Wait for tomorrow's audit so that the decision about the rule is made with formal review, avoiding undocumented changes on the eve of the process.


Scenario 3 β€” Root Cause​

A developer created an alert rule based on Log Analytics to detect critical errors in a web application. The rule was configured to query the workspace law-prod-eastus and send notification to the action group ag-dev-alerts whenever the number of events with severity = critical exceeded 5 in a 10-minute window.

In the past week, the application generated 47 confirmed critical errors in the application logs, but no alerts were fired. The team verified:

  • The action group ag-dev-alerts is active and the destination email is correct.
  • The alert rule is enabled and has no alert processing rules applied to its scope.
  • The workspace law-prod-eastus is receiving data normally, as confirmed by other queries executed manually.

The query configured in the alert rule is:

AppEvents
| where Properties.severity == "critical"
| summarize count() by bin(TimeGenerated, 10m)
| where count_ > 5

When executing this query manually in Log Analytics during the incident period, the result returned 0 rows.

An alternative query executed by the team returned the expected 47 errors:

AppExceptions
| where outerMessage contains "critical"
| summarize count() by bin(TimeGenerated, 10m)

What is the root cause of the problem?

A. The alert rule is querying the wrong table; the application error events are being ingested into the AppExceptions table, not the AppEvents table.

B. The bin(TimeGenerated, 10m) function is not compatible with log-based alert rules in Azure Monitor.

C. The action group ag-dev-alerts does not support notifications originating from log-based alerts; this type of alert requires a dedicated action group.

D. The workspace law-prod-eastus is in a different region from the alert rule, which introduces ingestion latency greater than the configured evaluation window.


Scenario 4 β€” Diagnostic Sequence​

A metric alert rule was fired correctly and the state in the Azure Monitor portal shows Fired. However, no member of the on-call team received email notification. The action group associated with the rule contains an Email/SMS/Push/Voice action with three registered email addresses.

The available investigation steps are:

  1. Check the alert rule's firing history in the portal to confirm that the Fired state was recorded with correct timestamp.
  2. Check if there is any active alert processing rule with scope that encompasses the alert rule or resource group, of type Suppress notifications.
  3. Access the action group in the portal and use the Test action group functionality to send a test notification to the registered emails.
  4. Confirm with recipients if the email was received in the spam folder or blocked by corporate filter.
  5. Check in the portal if the action group is correctly associated with the alert rule and was not disassociated after a recent rule edit.

What is the correct investigation sequence?

A. 3 -> 4 -> 5 -> 2 -> 1

B. 1 -> 5 -> 2 -> 3 -> 4

C. 2 -> 1 -> 3 -> 5 -> 4

D. 5 -> 3 -> 1 -> 2 -> 4


Answer Key and Explanations​

Answer Key β€” Scenario 1​

Answer: B

The decisive clue is in the rule's evaluation history: it was evaluated normally and no violation was recorded. This immediately eliminates any hypothesis related to firing failure, action group problem, or region delay, because the evaluation engine worked and concluded that the condition was not satisfied.

The Azure portal metrics dashboard displays, by default, points with 1-minute granularity. CPU spikes that appear visually high in this visualization can be smoothed when aggregated in 5-minute windows by the Average function. If the average calculated over the 5 minutes did not exceed 80%, the rule behaved correctly by not firing.

The information about the different region is purposely irrelevant: Azure Monitor collects metrics from resources in any region within the same subscription without evaluation restriction. Focusing on this detail is the most common diagnostic error in this type of scenario.

The most dangerous distractor is alternative C. The 15-minute evaluation frequency does not make the rule "skip" incident windows; it defines how often the 5-minute window is re-evaluated. During a 20-minute incident, the rule would have been evaluated at least once. The risk of acting based on this alternative would be changing the frequency without solving the real cause, which is aggregation by average value.


Answer Key β€” Scenario 2​

Answer: B

The scenario presents an identified cause and concrete constraints: active production, response SLA in effect, and audit scheduled for the next day. The correct action must restore notifications immediately without introducing additional risk.

Disabling the rule is preferable to deleting it immediately because it preserves the original configuration for later analysis, which is relevant especially before an audit. Immediate deletion (alternative A) solves the operational problem but eliminates evidence that may be relevant for tomorrow's audit process.

Alternative C is technically incorrect: an alert processing rule of type Apply action groups does not override or cancel a suppression rule; both coexist and suppression prevails because it removes notifications before actions are executed.

Alternative D ignores the most critical constraint of the scenario: the incident response SLA in effect. Waiting for the audit with notifications suppressed in production is a decision that violates contractual commitment and represents the most dangerous distractor.


Answer Key β€” Scenario 3​

Answer: A

The root cause confirmation is in the direct evidence provided in the statement: the rule's query returned 0 rows when executed manually, while an alternative query on the AppExceptions table returned the expected 47 errors. This indicates that the data was never in the AppEvents table queried by the rule.

The workspace is working, the action group is correct, the rule is enabled, and no suppression is active. All these elements were explicitly verified in the statement, eliminating alternatives B, C, and D as possible causes.

Alternative D is the distractor that exploits the region information, present in the previous scenario as an irrelevant element and reused here to test if the reader has already learned to automatically disregard it.

The most dangerous distractor is alternative B: believing that the bin() function is incompatible with log alerts would lead the team to rewrite the query without correcting the table, and the alert would continue not firing for the next 47 events.


Answer Key β€” Scenario 4​

Answer: B

The correct sequence is 1 -> 5 -> 2 -> 3 -> 4, which follows progressive diagnostic logic: confirm that the problem actually occurred, verify Azure Monitor configuration before testing external components, and only then involve factors outside platform control.

Step 1 confirms that the firing was recorded with correct timestamp, establishing the diagnostic foundation. Step 5 verifies if the action group is still associated with the rule, as recent edits can silently undo this connection. Step 2 investigates if an alert processing rule is suppressing notification after firing. Step 3 tests the action group in isolation to confirm if it works independently of the rule. Step 4, finally, involves recipients and external factors like spam and corporate filters, which should be verified only after confirming the platform sent correctly.

Alternative A makes the classic error of testing the action group before confirming the problem is in it, which can lead to incorrect diagnosis if the action group works in testing but suppression is blocking only real firings. Alternative C inverts logic by prioritizing the search for suppression before confirming that firing actually occurred.


Troubleshooting Tree: Set up alert rules, action groups, and alert processing rules in Azure Monitor​

100%
Scroll para zoom Β· Arraste para mover Β· πŸ“± Pinch para zoom no celular

Color Legend:

ColorNode Type
Dark blueInitial symptom (root)
BlueDiagnostic question
RedIdentified cause or corrective action
OrangeIntermediate validation or verification

To use this tree when facing a real problem, start at the root node describing the observed symptom and answer each diagnostic question based on what you can verify directly in the Azure portal or via CLI. Follow the path corresponding to the answer obtained at each node until reaching a red node, which names the cause and indicated corrective action. When the path passes through an orange node, execute the described validation before proceeding, as it protects against premature diagnoses that lead to incorrect actions.