Theoretical Foundation: Set Up Alert Rules, Action Groups, and Alert Processing Rules in Azure Monitor
1. Initial Intuition
Imagine that you manage a physical datacenter. You don't stare at the servers 24 hours a day. Instead, you install sensors: a temperature sensor that triggers a siren if the air conditioning fails, a load sensor that activates an alarm if the disk reaches 90% usage, cameras that alert about suspicious movement outside business hours.
Alert Rules in Azure are these sensors and trigger conditions. You define: "when the VM's CPU stays above 85% for more than 5 minutes, notify me".
Action Groups are the contact list and actions to execute when the alarm triggers: send email to the operations team, send SMS to the on-call manager, call a webhook that automatically opens a ticket in Jira.
Alert Processing Rules are the silence and routing policies: "during scheduled maintenance windows, don't send alerts", or "all production alerts should also trigger the security team".
Together, these three components form Azure's complete alert system.
2. Context
2.1 The three pillars of the alert system
2.2 Types of Alert Rules by data source
Azure has different types of alert rules depending on where the data comes from:
| Type | Data source | Latency | Example usage |
|---|---|---|---|
| Metric alerts | Azure Monitor Metrics | 1-5 min | CPU > 85%, Storage > 90% |
| Log alerts | Log Analytics (KQL) | 5-15 min | 10+ login failures in 5 min |
| Activity log alerts | Azure Activity Log | 5-10 min | VM deleted, NSG modified |
| Service Health alerts | Azure Service Health | Variable | Region incidents, maintenance |
| Resource health alerts | Azure Resource Health | Variable | VM became unavailable |
3. Building the Concepts
3.1 Alert Rule: complete anatomy
An Alert Rule is composed of:
1. Scope: The resource or set of resources being monitored. Can be a VM, an entire Resource Group, or a Subscription.
2. Condition: What is being measured and when the alert should fire. Defines:
- Signal: which metric, log or activity to monitor
- Operator: greater than, less than, equal to
- Threshold: the value that triggers the alert
- Aggregation: how values are combined (Average, Maximum, etc.)
- Evaluation period: time window of data evaluated
- Frequency: how often to evaluate the condition
3. Action Group: Which Action Group to trigger when the condition is satisfied.
4. Alert Details: Name, description, severity and other alert properties.
3.2 Alert severity
| Level | Name | Typical usage |
|---|---|---|
| Sev 0 | Critical | Critical failure with immediate production impact |
| Sev 1 | Error | Serious problem requiring quick action |
| Sev 2 | Warning | Concerning condition needing attention |
| Sev 3 | Informational | Relevant information without urgency |
| Sev 4 | Verbose | Detailed diagnostics |
3.3 Stateful vs Stateless alerts
Stateful (default for metric alerts): The alert has two states, Fired and Resolved. When the condition is no longer true, the alert is automatically resolved and a resolution notification is sent.
Stateless (default for log alerts): Each evaluation that satisfies the condition fires a notification, regardless of previous state. Useful for events that don't have a natural "resolved" concept.
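The difference can be sketched as a small simulation. This is a simplified model, not Azure's implementation: the function name `notifications` is hypothetical, and a real stateful rule may wait for several healthy evaluations before it resolves.

```python
from typing import List

def notifications(evaluations: List[bool], stateful: bool) -> List[str]:
    """Return one label per evaluation ('' when no notification is sent)."""
    sent = []
    firing = False  # tracked only for stateful rules
    for condition_met in evaluations:
        if not stateful:
            # Stateless: every evaluation that meets the condition notifies
            sent.append("Fired" if condition_met else "")
        elif condition_met and not firing:
            sent.append("Fired")
            firing = True
        elif not condition_met and firing:
            sent.append("Resolved")
            firing = False
        else:
            sent.append("")  # state unchanged, nothing to send
    return sent

history = [False, True, True, True, False]
print(notifications(history, stateful=True))   # ['', 'Fired', '', '', 'Resolved']
print(notifications(history, stateful=False))  # ['', 'Fired', 'Fired', 'Fired', '']
```

With the same five evaluations, the stateful rule sends two notifications and the stateless rule sends three, one for every evaluation where the condition holds.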
3.4 Metric Alert: advanced conditions
Static threshold: Fixed value compared with the metric.
CPU Percentage > 85 (average, over 5 minutes)
Dynamic threshold: Azure learns the metric's historical behavior and automatically sets thresholds based on deviations from normality. Useful when normal behavior varies by time of day or day of week.
Dimensions in metric alerts: You can create an alert that monitors a metric filtered by dimension. Example: alert for 500 errors in only a specific region.
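To make the contrast between static and dynamic thresholds concrete, here is a minimal sketch. The dynamic part uses a plain "mean plus k standard deviations" rule purely as a stand-in; how Azure actually learns its thresholds is not described here, and the function names and the value of `k` are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence

def static_breach(window: Sequence[float], threshold: float) -> bool:
    """Average aggregation over the window compared with a fixed value."""
    return mean(window) > threshold

def dynamic_breach(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Flag a value more than k standard deviations above the historical mean."""
    upper_bound = mean(history) + k * stdev(history)
    return current > upper_bound

cpu_last_5_min = [80.0, 88.0, 90.0, 91.0, 86.0]       # average is 87.0
print(static_breach(cpu_last_5_min, threshold=85))    # True

usual_load = [20.0, 22.0, 21.0, 19.0, 23.0, 20.0, 21.0, 22.0]
print(dynamic_breach(usual_load, current=60.0))       # True: far outside the usual range
print(dynamic_breach(usual_load, current=23.0))       # False: within normal variation
```

Note that 60% CPU would never trip the static rule at 85, yet it is clearly abnormal for a workload that normally sits near 21%.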
3.5 Log Alert: KQL query configuration
Log alerts execute a KQL query at regular intervals and fire when the result satisfies a condition:
Number of results: Alert if the query returns more (or fewer) than X rows.
SecurityEvent
| where EventID == 4625
| where TimeGenerated > ago(5m)
Configuration: "If count > 10, fire alert"
Metric measurement: The query calculates a numeric value per resource/time, and the alert fires when this value crosses a threshold.
Perf
| where CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| where AvgCPU > 85
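The two measurement styles can be mimicked on in-memory records to show what each one computes. This is only an illustration of the logic; the record layout and the function names `count_results` and `avg_by_computer` are hypothetical, and the real evaluation happens inside Log Analytics.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def count_results(events, now, window, event_id):
    """'Number of results': count matching rows inside the look-back window."""
    start = now - window
    return sum(1 for e in events
               if e["EventID"] == event_id and e["TimeGenerated"] > start)

def avg_by_computer(samples):
    """'Metric measurement': one averaged value per computer."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s["Computer"]].append(s["CounterValue"])
    return {computer: sum(v) / len(v) for computer, v in grouped.items()}

now = datetime(2025, 1, 15, 14, 30)
# Twelve failed logins (4625) spread over the last four minutes, plus one success
events = [{"EventID": 4625, "TimeGenerated": now - timedelta(seconds=20 * i)}
          for i in range(12)]
events.append({"EventID": 4624, "TimeGenerated": now})
failed = count_results(events, now, timedelta(minutes=5), 4625)
print(failed, failed > 10)   # 12 True

samples = [
    {"Computer": "vm-a", "CounterValue": 90.0},
    {"Computer": "vm-a", "CounterValue": 92.0},
    {"Computer": "vm-b", "CounterValue": 40.0},
]
print({c: v for c, v in avg_by_computer(samples).items() if v > 85})  # {'vm-a': 91.0}
```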
3.6 Action Groups: action types
Notifications:
| Type | Description | Limitation |
|---|---|---|
| Email (Azure Resource Manager role) | Email to RBAC role members | Up to 1,000 emails/hour |
| Email/SMS/Push/Voice | Email, SMS, Azure app notification, call | Limits per subscriber |
| Azure App Push | Notification in Azure mobile app | Requires installed app |
Actions:
| Type | Description | When to use |
|---|---|---|
| Automation Runbook | Executes a PowerShell/Python runbook | Automatic remediation |
| Azure Function | Invokes a Function App | Custom logic |
| Event Hub | Publishes to Event Hub | Streaming integration |
| ITSM | Opens ticket in ServiceNow, Cherwell | Incident management |
| Logic App | Starts a Logic App workflow | Complex orchestration |
| Secure Webhook | Calls HTTPS endpoint with AAD auth | External systems |
| Webhook | Calls HTTPS endpoint | Simple integrations |
3.7 Alert Processing Rules: the orchestrator
Alert Processing Rules act on already fired alerts. They can:
1. Suppress notifications: Silence alert notifications during maintenance windows. The alert is still fired and recorded, but notifications are not sent.
2. Apply additional Action Group: Add an Action Group to alerts matching defined filters. Example: all severity 0 and 1 alerts in production should also trigger the manager via SMS.
Available filters in Alert Processing Rules:
- Subscription
- Resource Group
- Resource Type
- Resource
- Alert Rule
- Severity
- Monitor Condition (Fired/Resolved)
- Alert Context
Scheduling: Can be configured to act always or only in specific time windows (e.g. Saturday and Sunday from 22h to 06h for maintenance windows).
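How filters and a schedule combine can be sketched as follows. The dictionary layout and the names `in_weekly_window` and `rule_applies` are hypothetical. The sketch also assumes that a window whose end time is earlier than its start time (22h to 06h) crosses midnight and belongs to the day on which it started; verify how Azure treats that case before relying on it.

```python
from datetime import datetime, time

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def in_weekly_window(moment, days, start, end):
    """True when `moment` falls inside the recurring daily window."""
    today = DAY_NAMES[moment.weekday()]
    yesterday = DAY_NAMES[(moment.weekday() - 1) % 7]
    t = moment.time()
    if start <= end:
        return today in days and start <= t < end
    # Window crosses midnight: evening part today, morning part from yesterday
    return (today in days and t >= start) or (yesterday in days and t < end)

def rule_applies(alert, rule, moment):
    """Every filter must match and the moment must be inside the schedule."""
    filters_match = all(alert.get(field) in allowed
                        for field, allowed in rule["filters"].items())
    return filters_match and in_weekly_window(moment, **rule["schedule"])

rule = {
    "filters": {"severity": {"Sev0", "Sev1", "Sev2"},
                "resource_group": {"production-rg"}},
    "schedule": {"days": {"Saturday", "Sunday"},
                 "start": time(22, 0), "end": time(6, 0)},
}
alert = {"severity": "Sev1", "resource_group": "production-rg"}

print(rule_applies(alert, rule, datetime(2025, 1, 4, 23, 0)))  # Saturday 23:00 -> True
print(rule_applies(alert, rule, datetime(2025, 1, 6, 3, 0)))   # Monday 03:00 -> True
print(rule_applies(alert, rule, datetime(2025, 1, 6, 12, 0)))  # Monday noon -> False
```

The Monday 03:00 case is the one that usually surprises people: it is still inside the window that opened on Sunday night.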
4. Structural View
5. Practical Operation
5.1 Metric Alert: evaluation behavior
When you configure a metric alert, two parameters define the behavior:
Evaluation frequency: How often the condition is checked. Example: every 5 minutes.
Aggregation granularity (window): What data period is considered in each evaluation. Example: 15-minute window.
If frequency = 5m and window = 15m: every 5 minutes, Azure evaluates the last 15 minutes of aggregated data.
Non-obvious behavior: An alert with window = 15m and frequency = 5m can take 20 minutes or more to fire after the condition starts being met (15 min window + up to 5 min until the next evaluation, plus metric collection latency). For critical alerts that need to fire quickly, use window = 5m and frequency = 1m.
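The worst-case delay is simple arithmetic, shown here as a tiny helper. The function name is hypothetical, and collection latency defaults to zero because it varies by metric.

```python
from datetime import timedelta

def worst_case_delay(window, frequency, collection_latency=timedelta(0)):
    """Upper bound: full window + wait for next evaluation + collection latency."""
    return window + frequency + collection_latency

print(worst_case_delay(timedelta(minutes=15), timedelta(minutes=5)))  # 0:20:00
print(worst_case_delay(timedelta(minutes=5), timedelta(minutes=1)))   # 0:06:00
```

Shrinking the window from 15 to 5 minutes and the frequency from 5 minutes to 1 cuts the bound from 20 minutes to 6, before latency is added.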
5.2 Suppression vs Disable alert
There's an important distinction:
Suppress via Alert Processing Rule: The alert is still created and recorded in history. Notifications are suppressed. When the suppression window ends, alerts that continue firing will resume notifying.
Disable the Alert Rule: The condition is no longer evaluated. There's no alert record during the disabled period. Not recommended for maintenance because you lose visibility.
6. Implementation Methods
6.1 Azure Portal
When to use: Initial creation, exploring available options, visual troubleshooting.
Creating metric alert:
Azure Monitor > Alerts > + Create > Alert rule
Creating Action Group:
Azure Monitor > Alerts > Action groups > + Create
Creating Alert Processing Rule:
Azure Monitor > Alerts > Alert processing rules > + Create
6.2 Azure CLI
Creating Action Group:
az monitor action-group create \
--resource-group myRG \
--name "ops-team-ag" \
--short-name "OpsTeam" \
--action email OnCall oncall@company.com usecommonalertschema \
--action sms Manager 55 11999999999 \
--action webhook PagerDuty https://events.pagerduty.com/integration/xxx/enqueue usecommonalertschema
Creating Metric Alert:
az monitor metrics alert create \
--resource-group myRG \
--name "High-CPU-Alert" \
--scopes <vm-resource-id> \
--condition "avg Percentage CPU > 85" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--action <action-group-id> \
--description "Average CPU above 85% for 5 minutes" \
--auto-mitigate true
Creating Log Alert:
az monitor scheduled-query create \
--resource-group myRG \
--name "Failed-Login-Alert" \
--scopes <workspace-resource-id> \
--condition "count 'FailedLogins' > 10" \
--condition-query FailedLogins="SecurityEvent | where EventID == 4625" \
--evaluation-frequency 5m \
--window-size 5m \
--severity 1 \
--action-groups <action-group-id>
Creating Activity Log Alert:
az monitor activity-log alert create \
--resource-group myRG \
--name "VM-Delete-Alert" \
--scope /subscriptions/<sub-id> \
--condition category=Administrative and operationName=Microsoft.Compute/virtualMachines/delete \
--action-group <action-group-id>
Creating Alert Processing Rule (maintenance suppression):
az monitor alert-processing-rule create \
--resource-group myRG \
--name "Weekend-Maintenance" \
--rule-type RemoveAllActionGroups \
--scopes /subscriptions/<sub-id>/resourceGroups/production-rg \
--filter-severity Equals Sev0 Sev1 Sev2 \
--schedule-recurrence-type Weekly \
--schedule-recurrence Saturday Sunday \
--schedule-recurrence-start-time "22:00:00" \
--schedule-recurrence-end-time "06:00:00"
6.3 Bicep
// Action Group
resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
name: 'ops-team-ag'
location: 'global'
properties: {
groupShortName: 'OpsTeam'
enabled: true
emailReceivers: [
{
name: 'OnCall'
emailAddress: 'oncall@company.com'
useCommonAlertSchema: true
}
]
smsReceivers: [
{
name: 'Manager'
countryCode: '55'
phoneNumber: '11999999999'
}
]
webhookReceivers: [
{
name: 'Automation'
serviceUri: 'https://prod.webhook.office.com/...'
useCommonAlertSchema: true
}
]
}
}
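// Existing VM to monitor (the name 'myVM' is a placeholder)
resource vm 'Microsoft.Compute/virtualMachines@2023-03-01' existing = {
name: 'myVM'
}
// Metric Alert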
resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'High-CPU-Alert'
location: 'global'
properties: {
description: 'Average CPU above 85% for 5 minutes'
severity: 2
enabled: true
scopes: [vm.id]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
name: 'HighCPU'
metricName: 'Percentage CPU'
operator: 'GreaterThan'
threshold: 85
timeAggregation: 'Average'
criterionType: 'StaticThresholdCriterion'
}
]
}
autoMitigate: true
actions: [
{
actionGroupId: actionGroup.id
}
]
}
}
// Alert Processing Rule (maintenance suppression)
resource maintenanceRule 'Microsoft.AlertsManagement/actionRules@2021-08-08' = {
name: 'Weekend-Maintenance'
location: 'global'
properties: {
scopes: [resourceGroup().id]
conditions: [
{
field: 'Severity'
operator: 'Equals'
values: ['sev0', 'sev1', 'sev2']
}
]
actions: [
{
actionType: 'RemoveAllActionGroups'
}
]
schedule: {
recurrences: [
{
recurrenceType: 'Weekly'
daysOfWeek: ['Saturday', 'Sunday']
startTime: '22:00:00'
endTime: '06:00:00'
}
]
}
enabled: true
}
}
7. Control and Security
7.1 Required permissions
| Operation | Minimum role |
|---|---|
| Create/edit Alert Rules | Monitoring Contributor |
| Create/edit Action Groups | Monitoring Contributor |
| Create/edit Alert Processing Rules | Monitoring Contributor |
| Only view fired alerts | Monitoring Reader |
| Acknowledge alerts | Monitoring Contributor |
7.2 Common Alert Schema
The Common Alert Schema is a standardized JSON format for notifications from all alert types (metric, log, activity log). When enabled in the Action Group, all webhooks and Logic Apps receive the same structure, regardless of alert type.
{
"schemaId": "azureMonitorCommonAlertSchema",
"data": {
"essentials": {
"alertId": "/subscriptions/.../alerts/...",
"alertRule": "High-CPU-Alert",
"severity": "Sev2",
"signalType": "Metric",
"monitorCondition": "Fired",
"monitoringService": "Platform",
"alertTargetIDs": ["/subscriptions/.../virtualMachines/myVM"],
"firedDateTime": "2025-01-15T14:32:00Z"
},
"alertContext": {
"properties": {},
"conditionType": "SingleResourceMultipleMetricCriteria",
"condition": {
"windowSize": "PT5M",
"allOf": [
{
"metricName": "Percentage CPU",
"metricNamespace": "Microsoft.Compute/virtualMachines",
"operator": "GreaterThan",
"threshold": "85",
"timeAggregation": "Average",
"dimensions": [],
"metricValue": 92.3
}
]
}
}
}
}
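A webhook receiver or function typically pulls a few fields out of this payload. The sketch below parses a trimmed version of it and splits the target resource ID into its parts. The subscription ID in the sample is a made-up placeholder, the function name `summarize_alert` is hypothetical, and the code assumes the alert targets a single resource.

```python
import json

def summarize_alert(payload: str) -> dict:
    """Extract routing-relevant fields from a Common Alert Schema payload."""
    essentials = json.loads(payload)["data"]["essentials"]
    target = essentials["alertTargetIDs"][0]
    # Resource IDs alternate key/value segments:
    # /subscriptions/<id>/resourceGroups/<rg>/providers/<ns>/<type>/<name>
    parts = target.strip("/").split("/")
    pairs = {key.lower(): value for key, value in zip(parts[0::2], parts[1::2])}
    return {
        "rule": essentials["alertRule"],
        "severity": essentials["severity"],
        "condition": essentials["monitorCondition"],
        "resource_group": pairs.get("resourcegroups"),
        "resource_name": parts[-1],
    }

sample = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {"essentials": {
        "alertRule": "High-CPU-Alert",
        "severity": "Sev2",
        "monitorCondition": "Fired",
        # Placeholder resource ID for illustration only
        "alertTargetIDs": ["/subscriptions/00000000-0000-0000-0000-000000000000"
                           "/resourceGroups/myRG/providers/Microsoft.Compute"
                           "/virtualMachines/myVM"],
    }},
})
print(summarize_alert(sample))
```

Because every alert type shares the `essentials` block, the same parsing code works for metric, log, and activity log alerts; only `alertContext` differs.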
8. Best Practices
8.1 Alert fatigue prevention
Start with higher thresholds: Begin with conservative thresholds and adjust based on false positive rates.
Use dynamic thresholds for seasonal patterns: Services with predictable daily/weekly patterns benefit from machine learning-based thresholds.
Implement alert hierarchy: Use different severities and Action Groups. Not every alert needs to wake someone up.
8.2 Action Group organization
By team: Create Action Groups per responsible team (network-team-ag, database-team-ag).
By urgency: Create Action Groups for different response levels (critical-24x7-ag, business-hours-ag, info-only-ag).
By environment: Separate production from non-production notifications.
8.3 Alert Processing Rules strategy
Maintenance windows: Create recurring suppression rules for known maintenance windows.
Environment-specific routing: Use Alert Processing Rules to add environment-specific Action Groups (e.g., all prod alerts also go to management).
Geographic considerations: For global services, route alerts to the appropriate on-call team based on time zones.
This comprehensive foundation covers Azure Monitor's complete alerting system, from individual alert rules to sophisticated processing and routing logic, enabling you to build robust monitoring solutions that scale with your infrastructure needs.
Always use Common Alert Schema whenever possible. It greatly simplifies processing in Logic Apps and webhooks.
7.3 Notification rate limits
| Channel | Limit |
|---|---|
| Email | 100 emails/hour per Action Group |
| SMS | 1 SMS every 5 minutes per receiver |
| Voice | 1 call every 5 minutes per receiver |
| Webhook | No explicit limit (depends on endpoint) |
| Azure Function | No explicit limit |
Important behavior: If multiple alerts fire at the same time and there are many email receivers, Azure silently applies rate limiting. Not all notifications arrive immediately. For high-load scenarios, use webhook or Event Hub as the primary channel.
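The effect of an hourly cap during an alert storm can be illustrated with a sliding-window counter. This is a model for reasoning only: the class name is hypothetical, and whether Azure counts with a sliding or a fixed window is an assumption made here.

```python
from collections import deque

class HourlyEmailLimiter:
    """Allow at most `limit` sends within any `period_s`-second span."""

    def __init__(self, limit=100, period_s=3600):
        self.limit = limit
        self.period_s = period_s
        self.sent = deque()  # timestamps of accepted sends, oldest first

    def allow(self, now_s: float) -> bool:
        # Forget sends that have aged out of the window
        while self.sent and now_s - self.sent[0] >= self.period_s:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now_s)
            return True
        return False  # over the cap: this notification is not delivered

limiter = HourlyEmailLimiter()
# 150 alerts fire one second apart
burst = [limiter.allow(now_s=i) for i in range(150)]
print(sum(burst))      # 100 delivered
print(burst.index(False))  # 100: the 101st alert is the first one dropped
```

In this model the last 50 alerts of the storm never reach an inbox, which is why a webhook or Event Hub is the safer primary channel for high-volume scenarios.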
8. Decision Making
8.1 Which alert type to use for each scenario
| Scenario | Alert type | Reason |
|---|---|---|
| VM CPU > 85% | Metric Alert | Platform metric with fixed threshold |
| Anomalous CPU pattern (no known threshold) | Metric Alert with Dynamic Threshold | Azure learns historical pattern |
| 10+ login failures in 5 min | Log Alert | Requires KQL query over logs |
| VM deleted (audit) | Activity Log Alert | Control plane event |
| Azure region with incident | Service Health Alert | Azure service health |
| OS disk with over 90% | Metric Alert (Guest metrics) | Requires Azure Monitor Agent on VM |
8.2 When to use Alert Processing Rules vs configuring Action Group in Alert Rule
| Situation | Approach | Reason |
|---|---|---|
| Recurring maintenance window | Alert Processing Rule (Suppression) | Centralized configuration, no need to modify each Alert Rule |
| All prod alerts should cc the manager | Alert Processing Rule (Add Action Group) | Avoids duplicating the Action Group in each Alert Rule |
| Specific alert with specific action | Action Group directly in Alert Rule | Simpler, no need for additional layer |
| Silence only during specific deployment | Alert Processing Rule with single window | Flexibility without modifying Alert Rules |
8.3 Frequency and Window Size: balancing speed and cost
| Requirement | Frequency | Window | Consideration |
|---|---|---|---|
| Ultra-fast detection (critical) | 1 min | 5 min | Higher evaluation cost |
| Moderate detection (standard) | 5 min | 15 min | Cost/speed balance |
| Trend alert (not urgent) | 15 min | 1 hour | Lower cost, more smoothed |
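The trade-off in the table can be quantified by counting evaluations. The sketch below makes no pricing assumptions; it only computes how many evaluations run per day and how many consecutive evaluations see the same data point. Function names are hypothetical.

```python
def evaluations_per_day(frequency_min: int) -> int:
    """How many times the condition is checked in 24 hours."""
    return (24 * 60) // frequency_min

def evaluations_per_sample(frequency_min: int, window_min: int) -> float:
    """How many consecutive evaluations include a given data point."""
    return window_min / frequency_min

for label, freq, window in [("critical", 1, 5), ("standard", 5, 15), ("trend", 15, 60)]:
    print(label, evaluations_per_day(freq), evaluations_per_sample(freq, window))
# critical 1440 5.0
# standard 288 3.0
# trend 96 4.0
```

The critical profile evaluates 15 times as often as the trend profile, which is where the higher evaluation cost in the table comes from.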
9. Best Practices
- Create reusable Action Groups separated by team (ops-team, security-team, management) and reference them in multiple Alert Rules instead of creating one Action Group per alert.
- Use Common Alert Schema in all webhooks and Logic Apps to simplify processing.
- Separate alerts by severity and configure different Action Groups: Sev0/Sev1 triggers SMS + immediate email; Sev2/Sev3 sends email only.
- Configure auto-mitigate = true in metric alerts to receive automatic resolution notification when the condition is no longer satisfied.
- Use Alert Processing Rules for maintenance instead of disabling Alert Rules. The alert continues to be evaluated and logged.
- Document the meaning of each alert in the Alert Rule description: what it means when it fires, what action is expected, what response runbook to use.
- Group related alerts using the "Alert Rule Name" field as a filter in Alert Processing Rules to apply consistent logic.
- Test Action Groups using the "Test" button in the portal before trusting them in production. Check if emails arrive, if webhooks respond with 200.
- Configure service health alerts (Service Health) to receive notifications of incidents and maintenance in the Azure regions you use.
10. Common Errors
| Error | Why it happens | How to avoid |
|---|---|---|
| Alert constantly firing (alert fatigue) | Threshold too sensitive or window too small | Adjust threshold; use dynamic thresholds |
| Didn't receive critical alert notification | Email rate limiting active | Use webhook or Azure Function as primary channel for Sev0/Sev1 |
| Alert doesn't fire during real incident | Alert Rule disabled or threshold too high | Test regularly with Test Action Group |
| Alert fires during maintenance | No suppression Alert Processing Rule | Create APR with maintenance window |
| Log alert doesn't fire | Incorrect KQL query or data still in transit | Test query separately in Log Analytics; check ingestion latency |
| Too many duplicate alerts | Short evaluation interval + large window, so consecutive evaluations overlap | Adjust frequency and window; enable stateful |
| Action Group doesn't trigger webhook | Endpoint with invalid certificate or timeout | Use Secure Webhook with AAD authentication for robustness |
| Alert Processing Rule doesn't suppress | Incorrect filters (severity, scope) | Verify filters exactly match the fired alert |
11. Operation and Maintenance
11.1 Viewing fired alerts
# List active alerts in subscription
az monitor alert list \
--resource-group myRG \
--output table
# View alert history (last 24h)
az monitor alert list \
--state "all" \
--time-range 1d \
--output table
In portal: Azure Monitor > Alerts > Alerts (preview) shows current state of all alerts.
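11.2 Acknowledging alerts
Each alert carries a user response state that is managed manually, independent of the monitor condition (Fired/Resolved) that Azure sets automatically:
- New: The alert fired and has not been reviewed yet
- Acknowledged: Someone acknowledged the alert (being investigated)
- Closed: The issue has been handled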
az monitor alert update \
--ids <alert-resource-id> \
--status Acknowledged
11.3 Testing Action Groups
# Send test notification to an Action Group
az monitor action-group test-notifications create \
--resource-group myRG \
--action-group "ops-team-ag" \
--alert-type servicehealth \
--add-action email OnCall oncall@company.com usecommonalertschema
Verify that:
- Email arrived in inbox (not spam)
- SMS was received
- Webhook returned 200 OK
- Azure Function was invoked
11.4 Important limits
| Resource | Limit |
|---|---|
| Alert Rules per subscription | 5,000 |
| Action Groups per subscription | 2,000 |
| Receivers per Action Group | 10 per type |
| Alert Processing Rules per subscription | 1,000 |
| Alert history retention | 30 days |
| Minimum metric alert frequency | 1 minute |
| Minimum log alert frequency | 5 minutes |
12. Integration and Automation
12.1 Automatic remediation with Automation Runbook
Configure an Action Group that calls a runbook for auto-remediation:
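# Runbook: restart the VM identified in the alert payload
# Assumes the Action Group uses the Common Alert Schema and the alert targets a single VM
param(
[Parameter(Mandatory=$true)]
[object]$WebhookData
)
# Action Groups deliver the alert as JSON in WebhookData.RequestBody
$alert = $WebhookData.RequestBody | ConvertFrom-Json
$targetId = $alert.data.essentials.alertTargetIDs[0]
# Resource ID layout: /subscriptions/<id>/resourceGroups/<rg>/providers/.../<vm>
$parts = $targetId.Split('/')
$resourceGroup = $parts[4]
$vmName = $parts[-1]
Connect-AzAccount -Identity
Write-Output "Restarting VM $vmName in response to high CPU alert"
Restart-AzVM -ResourceGroupName $resourceGroup -Name $vmName -ErrorAction Stop
Write-Output "VM restarted successfully"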
12.2 Microsoft Teams integration via Logic App
- Create a Logic App with HTTP trigger
- Configure the Logic App to post message to Teams
- Add the Logic App endpoint as Webhook in the Action Group
The Common Alert Schema payload allows creating rich messages in Teams with all alert details.
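The formatting step inside such a workflow can be sketched as a pure function that turns the `essentials` block into a one-line message. The function name `teams_text` and the message layout are hypothetical; posting to Teams itself is left to the Logic App.

```python
def teams_text(essentials: dict) -> str:
    """Build a short, human-readable line from Common Alert Schema essentials."""
    # Keep only the last segment of each resource ID (the resource name)
    targets = ", ".join(t.rsplit("/", 1)[-1]
                        for t in essentials.get("alertTargetIDs", []))
    return (
        f"[{essentials['severity']}] {essentials['alertRule']} is "
        f"{essentials['monitorCondition']} on {targets} "
        f"(since {essentials['firedDateTime']})"
    )

essentials = {
    "alertRule": "High-CPU-Alert",
    "severity": "Sev2",
    "monitorCondition": "Fired",
    "alertTargetIDs": ["/subscriptions/.../virtualMachines/myVM"],
    "firedDateTime": "2025-01-15T14:32:00Z",
}
print(teams_text(essentials))
# [Sev2] High-CPU-Alert is Fired on myVM (since 2025-01-15T14:32:00Z)
```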
12.3 Azure Policy to ensure alert coverage
# Policy: ensure all VMs have CPU alert rule
az policy assignment create \
--name "vm-cpu-alert-required" \
--policy "<policy-definition-id>" \
--scope "/subscriptions/<sub-id>"
DeployIfNotExists policies can automatically create Alert Rules on new resources.
13. Final Summary
Essential concepts:
- Alert Rule defines the firing condition: which signal to monitor (metric, log, activity log), which threshold, in which time window and with which frequency.
- Action Group defines what happens when the alert fires: notifications (email, SMS, push) and actions (runbook, function, webhook, ITSM).
- Alert Processing Rule modifies the behavior of already fired alerts: suppresses notifications in maintenance windows or adds additional Action Groups based on filters.
Critical differences:
- Metric Alert vs Log Alert: Metric operates over numerical values in near real-time (1-5 min). Log Alert executes KQL query over logs with higher latency (5-15 min).
- Stateful vs Stateless: Stateful fires once when reaching the condition and resolves automatically. Stateless fires at each evaluation that satisfies the condition.
- Suppress vs Disable: Suppress (via APR) keeps the alert being evaluated and logged. Disabling the Alert Rule completely stops evaluation.
- Frequency vs Window: Frequency is how often the condition is evaluated. Window is the period of data considered in each evaluation.
What needs to be remembered:
- An alert can take up to window + frequency + collection latency to fire after the condition is met.
- Action Groups have rate limits: 100 emails/hour, 1 SMS every 5 minutes per receiver.
- Use Common Alert Schema in webhooks and Logic Apps to receive standardized format regardless of alert type.
- Alert Processing Rules filter by subscription, resource group, resource type, resource, severity, monitor condition (Fired/Resolved), alert rule name and alert context.
- For maintenance, use Alert Processing Rules with time window instead of disabling Alert Rules.
- Test Action Groups with the "Test" button before trusting them in production.
- Alert history is retained for 30 days in the Azure Monitor portal.