Theoretical Foundation: Interpret Metrics in Azure Monitor
1. Initial Intuition
Imagine you're driving a car. On the dashboard, you have a speedometer, tachometer, fuel gauge, and engine temperature indicator. Each of these instruments collects a specific numerical measurement in real-time and displays it so you can make decisions while driving. If the temperature rises too high, you know something's wrong and you need to act.
Metrics in Azure Monitor are exactly these instruments for your cloud resources. Each Azure resource (VMs, Storage Accounts, databases, networks) continuously generates numerical values that describe its state and behavior: CPU percentage, bytes transferred, number of requests, response latency, used storage capacity.
The difference from a car dashboard is that in Azure you can query these measurements historically, combine multiple metrics in one chart, calculate averages and percentiles, and configure automatic alerts when a value crosses a defined threshold.
2. Context
2.1 Metrics within Azure Monitor
Azure Monitor is Azure's central observability platform. It collects three types of data:
Why do metrics exist separately from logs? Metrics are optimized for fast queries and real-time alerts. They're stored in compressed time-series format, suitable for rendering charts and evaluating alert conditions in seconds. Logs are semi-structured text, suitable for deep investigation but with higher ingestion latency.
3. Building Concepts
3.1 What is a metric: fundamental structure
A metric is a time series of numerical values associated with an Azure resource. Each data point has:
- Timestamp: When it was collected
- Value: The numerical value (e.g., 78.5)
- Metric Name: What's being measured (e.g., "Percentage CPU")
- Resource: The Azure resource it belongs to
- Dimensions (optional): Subdivisions of the metric by attribute
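The fields above can be pictured as a small record type. The following is a minimal sketch for illustration only: the `MetricPoint` class and its field names are hypothetical and are not part of any Azure SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class MetricPoint:
    """One data point of a metric time series (hypothetical model, not an Azure SDK type)."""
    timestamp: datetime   # when the value was collected
    value: float          # the numerical value, e.g. 78.5
    metric_name: str      # what is being measured
    resource_id: str      # the Azure resource it belongs to
    dimensions: Dict[str, str] = field(default_factory=dict)  # optional attributes


point = MetricPoint(
    timestamp=datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc),
    value=78.5,
    metric_name="Percentage CPU",
    resource_id="<resource-id>",  # placeholder, not a real resource ID
)
print(point.metric_name, point.value, point.dimensions)
```

A metric without dimensions simply carries an empty `dimensions` mapping.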
3.2 Dimensions: the concept that multiplies the power of metrics
A dimension is an attribute that allows filtering or segmenting a metric. It's the difference between "how many total requests arrived" and "how many requests arrived per HTTP response code."
Concrete example with Storage Account:
The Transactions metric (number of operations) has dimensions like:
- ResponseType: Success, ServerError, ClientError
- ApiName: GetBlob, PutBlob, ListContainers
- Authentication: SAS, AccountKey, AzureActiveDirectory
Without dimensions, you only see the total. With dimensions, you can answer: "How many GetBlob operations failed with server error in the last 4 hours?"
3.3 Aggregation types
Metrics aren't displayed as individual points for each second. They are aggregated over time intervals. Understanding which aggregation to use is fundamental for correct interpretation:
| Aggregation | Description | When to use |
|---|---|---|
| Average | Mean of values in the interval | CPU%, average latency |
| Maximum | Highest value in the interval | CPU peak, maximum connections |
| Minimum | Lowest value in the interval | Minimum available memory |
| Sum | Sum of all values | Total requests, bytes transferred |
| Count | Number of data points | Number of operations |
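| Percentile (P50, P95, P99) | Percentile of values; offered only where the metric source exposes it (for example Application Insights log-based metrics), since most platform metrics provide only the five aggregations above | Latency percentile (e.g., "95% of requests responded in less than X ms") |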
Classic mistake: Using Average to analyze tail latency. A 100ms average can hide that 5% of requests take 2 seconds. Use P95 or P99 to understand the real experience of the slowest users.
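The gap between an average and a tail percentile is easy to reproduce with synthetic numbers. The sketch below uses made-up latencies and a simple nearest-rank percentile. The `percentile` helper is written here for illustration and is not how Azure Monitor computes its values.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 synthetic request latencies in ms: 94 fast requests and 6 slow ones.
latencies_ms = [50] * 94 + [2000] * 6

avg = mean(latencies_ms)             # 167 ms, looks acceptable
p50 = percentile(latencies_ms, 50)   # 50 ms, the typical request
p95 = percentile(latencies_ms, 95)   # 2000 ms, what the slowest users see

print(f"avg={avg} ms  p50={p50} ms  p95={p95} ms")
```

The average lands between the two populations and describes neither of them, which is why a tail percentile is the more honest summary for latency.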
3.4 Time Granularity
When querying metrics, you define the time range (e.g., last 24 hours) and the granularity (e.g., points every 5 minutes). Granularity determines the size of the aggregation window.
| Period queried | Minimum available granularity |
|---|---|
| 1 hour | 1 minute |
| 24 hours | 5 minutes |
| 7 days | 1 hour |
| 30 days | 1 day |
| More than 30 days | 1 day |
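To see how granularity changes what a chart shows, the following sketch regroups 1-minute points into larger windows. It assumes simple fixed windows aligned to the epoch. The `rebin` function is a hypothetical helper, and the values are synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def rebin(points, minutes):
    """Group (timestamp, value) pairs into fixed windows; return avg and max per window."""
    window = timedelta(minutes=minutes)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    buckets = defaultdict(list)
    for ts, value in points:
        # Floor the timestamp to the start of its window.
        start = epoch + ((ts - epoch) // window) * window
        buckets[start].append(value)
    return {
        start: {"avg": sum(vals) / len(vals), "max": max(vals)}
        for start, vals in sorted(buckets.items())
    }


start = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
values = [20, 20, 20, 100, 20, 20, 20, 20, 20, 20]  # one 1-minute spike
points = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

five_min = rebin(points, 5)
for window_start, stats in five_min.items():
    print(window_start.strftime("%H:%M"), stats)
```

At 5-minute granularity the spike to 100 is averaged down to 36 in the first window. Only the Maximum aggregation, or a 1-minute granularity, still reveals it.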
Retention: Platform metrics are retained for 93 days, at granularities as fine as 1 minute. Older data is no longer available from the metrics store, so for long-term retention, export metrics to Log Analytics or a Storage Account.
3.5 Platform metrics vs custom metrics vs guest metrics
Platform Metrics: Automatically collected by Azure for each resource, no configuration required. Examples: VM CPU, Storage transactions, SQL Database DTU. Available immediately after creating the resource.
Guest OS Metrics: Operating system metrics inside the VM: memory usage, disk, processes. Require installing the Azure Monitor Agent on the VM since Azure has no visibility into the OS by default.
Custom Metrics: Created by your application or scripts. Sent to Azure Monitor via the Application Insights SDK or the Azure Monitor custom metrics REST API. Allow measuring anything specific to your application.
3.6 Multi-dimensional metrics: Splitting and Filtering
Two powerful concepts when working with dimensions in Metrics Explorer:
Splitting: Divides a metric into separate series by dimension value. For example: splitting Storage Account Transactions by ResponseType shows separate lines for Success, ServerError, and ClientError on the same chart.
Filtering: Shows only data where the dimension has a specific value. For example: filtering Transactions by ApiName = GetBlob shows only blob download operations.
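The difference between the two operations can be sketched on a handful of synthetic records. The record layout and the `filter_by` and `split_by` helpers below are assumptions made for illustration. Metrics Explorer performs the equivalent work for you.

```python
from collections import defaultdict

# Synthetic Transactions records: (hour, ApiName, ResponseType, count).
records = [
    (0, "GetBlob", "Success", 120),
    (0, "GetBlob", "ServerError", 3),
    (0, "PutBlob", "Success", 40),
    (1, "GetBlob", "Success", 90),
    (1, "PutBlob", "ClientError", 5),
    (1, "GetBlob", "ServerError", 7),
]
FIELDS = {"ApiName": 1, "ResponseType": 2}  # column index of each dimension


def filter_by(rows, dimension, value):
    """Filtering: keep only rows where the dimension has the given value."""
    return [r for r in rows if r[FIELDS[dimension]] == value]


def split_by(rows, dimension):
    """Splitting: build one series (hour -> total) per dimension value."""
    series = defaultdict(lambda: defaultdict(int))
    for row in rows:
        series[row[FIELDS[dimension]]][row[0]] += row[3]
    return {name: dict(points) for name, points in series.items()}


get_blob = filter_by(records, "ApiName", "GetBlob")
by_response = split_by(get_blob, "ResponseType")
failed_get_blob = sum(by_response["ServerError"].values())

print(by_response)
print("GetBlob server errors:", failed_get_blob)
```

Filtering narrows the data to one slice, while splitting keeps all slices but draws them as separate series. Combining both answers questions such as how many GetBlob operations failed with a server error.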
4. Structural View
5. Practical Operation
5.1 Navigating the Metrics Explorer
The Metrics Explorer is the main interface for viewing metrics. Access via:
Azure Monitor > Metrics or [Resource] > Metrics
The interface has four main controls:
Scope: The resource (or resource group or subscription) whose metrics you want to see.
Metric Namespace: Groups related metrics. A VM has several namespaces: Virtual Machine Host (platform metrics), azure.vm.windows.guestmetrics (Windows guest metrics), etc.
Metric: The specific metric (e.g., Percentage CPU, Available Memory Bytes).
Aggregation: How values will be combined over the time interval (Average, Max, Sum, etc.).
5.2 Practical interpretation examples
Scenario 1: VM with consistently high CPU
Metric: Percentage CPU | Aggregation: Average | Period: 24 hours | Granularity: 5 min
If the chart shows 85-90% average for several hours, this indicates CPU saturation. Compare with peak (Maximum) to see if it reaches 100% and at what times.
Scenario 2: Storage Account with errors
Metric: Transactions | Splitting by ResponseType
If you see lines for ServerError or ThrottlingError growing, this indicates the Storage Account is being throttled or has internal problems.
Scenario 3: Database latency
Metric: Connection Failed or DTU Consumption Percent (Azure SQL) | Aggregation: Maximum
Peaks in Maximum with normal Average indicate intermittent problems that the average hides.
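The pattern described in Scenario 3 (a normal Average with a spiking Maximum) can be checked mechanically once both aggregations are available for each interval. This is a minimal sketch with synthetic values. The `hidden_peaks` function and its two limits are hypothetical and should be tuned to your own baseline.

```python
def hidden_peaks(intervals, avg_limit=70.0, max_limit=95.0):
    """Return labels of intervals whose Average looks normal while Maximum spikes."""
    return [
        label
        for label, avg, maximum in intervals
        if avg < avg_limit and maximum >= max_limit
    ]


# (interval start, Average, Maximum) for four 5-minute intervals.
intervals = [
    ("00:00", 35.0, 48.0),
    ("00:05", 41.0, 99.0),   # short spike: the average still looks healthy
    ("00:10", 88.0, 100.0),  # sustained load: the average already shows it
    ("00:15", 30.0, 42.0),
]

flagged = hidden_peaks(intervals)
print(flagged)
```

Only the 00:05 interval is flagged, because it is the one where looking at Average alone would miss the problem.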
5.3 Time Range Comparison
Metrics Explorer allows adding a comparison line with a previous period. Example: compare CPU from the last 24 hours with the 24 hours from the same time last week. This reveals anomalous behavior patterns versus expected normal behavior.
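The same comparison can be done numerically on exported values. The sketch below pairs points from two periods by position and flags large relative changes. The `compare_periods` helper and the 25% tolerance are assumptions chosen for illustration.

```python
def compare_periods(current, previous, tolerance_pct=25.0):
    """Flag positions where the current value deviates from the earlier period by more than tolerance_pct."""
    anomalies = []
    for index, (now, before) in enumerate(zip(current, previous)):
        if before == 0:
            continue  # relative change is undefined against a zero baseline
        change = (now - before) / before * 100
        if abs(change) > tolerance_pct:
            anomalies.append((index, round(change, 1)))
    return anomalies


# Synthetic hourly CPU averages for the same window, one week apart.
last_week = [40.0, 42.0, 45.0, 80.0, 44.0]
today = [41.0, 43.0, 90.0, 82.0, 45.0]

anomalies = compare_periods(today, last_week)
print(anomalies)
```

The 82% reading at position 3 is not flagged because the same peak occurred last week, while the 90% reading at position 2 is new behavior worth investigating.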
5.4 Multi-resource metrics simultaneously
With Multi-resource metrics, you can compare the same metric across multiple VMs simultaneously. For example: see Percentage CPU of all VMs in a Scale Set side by side to identify if a specific VM is unbalanced.
6. Implementation Approaches
6.1 Azure Portal (Metrics Explorer)
When to use: Interactive investigation, ad-hoc dashboard creation, real-time troubleshooting.
Advantages: Intuitive visual interface, no need to know metric names beforehand, easy dimension exploration with splitting and filtering.
Limitation: Not automatable; each query is manual.
Tip: Use the "Pin to dashboard" button to save useful charts on a permanent dashboard.
6.2 Azure CLI
# List all available metrics for a resource
az monitor metrics list-definitions \
--resource <resource-id> \
--output table
# Query specific metric
az monitor metrics list \
--resource <resource-id> \
--metric "Percentage CPU" \
--interval PT5M \
--start-time 2025-01-15T00:00:00Z \
--end-time 2025-01-15T23:59:59Z \
--aggregation Average Maximum \
--output table
# Query metric with dimension filter
az monitor metrics list \
--resource <storage-account-id> \
--metric "Transactions" \
--interval PT1H \
--aggregation Total \
--filter "ResponseType eq 'ServerError'" \
--output table
When to use: Automation scripts, periodic reports, when you need to process values programmatically.
6.3 Azure PowerShell
# Query metrics
$result = Get-AzMetric `
-ResourceId "<resource-id>" `
-MetricName "Percentage CPU" `
-StartTime (Get-Date).AddHours(-24) `
-EndTime (Get-Date) `
-TimeGrain 00:05:00 `
-AggregationType Average
# Process result
$result.Data | Select-Object TimeStamp, Average | Format-Table
6.4 Azure Monitor REST API
For integration with external systems or custom dashboards:
# Query via REST API
curl -X GET \
"https://management.azure.com{resource-id}/providers/microsoft.insights/metrics?metricnames=Percentage%20CPU&timespan=2025-01-15T00:00:00Z/2025-01-15T23:59:59Z&interval=PT5M&aggregation=average&api-version=2019-07-01" \
-H "Authorization: Bearer <token>"
The response returns JSON with timestamps and aggregated values.
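As a rough illustration of consuming that JSON, the sketch below flattens a trimmed, hand-written sample into tuples. The sample payload is an assumption that mirrors the general `value` / `timeseries` / `data` layout of the metrics response. Real responses carry more fields, and the `extract_points` helper is hypothetical.

```python
import json

# Trimmed, hand-written sample; not captured from a real subscription.
sample_response = """
{
  "value": [
    {
      "name": {"value": "Percentage CPU"},
      "unit": "Percent",
      "timeseries": [
        {
          "data": [
            {"timeStamp": "2025-01-15T00:00:00Z", "average": 12.5},
            {"timeStamp": "2025-01-15T00:05:00Z", "average": 78.5},
            {"timeStamp": "2025-01-15T00:10:00Z"}
          ]
        }
      ]
    }
  ]
}
"""


def extract_points(payload, aggregation="average"):
    """Flatten the response into (metric name, timestamp, value) tuples."""
    points = []
    for metric in payload.get("value", []):
        name = metric["name"]["value"]
        for series in metric.get("timeseries", []):
            for item in series.get("data", []):
                if aggregation in item:  # intervals without data omit the field
                    points.append((name, item["timeStamp"], item[aggregation]))
    return points


points = extract_points(json.loads(sample_response))
for name, timestamp, value in points:
    print(name, timestamp, value)
```

Intervals with no data are skipped rather than treated as zero, which avoids confusing missing data with an idle resource.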
6.5 Kusto (Log Analytics) for archived metrics
When you export metrics to Log Analytics, you can query them with Kusto Query Language (KQL):
// Average CPU of all VMs in the last 24 hours
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName == "Percentage CPU"
| where ResourceType == "MICROSOFT.COMPUTE/VIRTUALMACHINES"
| summarize AvgCPU = avg(Average) by Resource, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
When to use: Historical analysis beyond 93 days, correlating metrics with logs, complex reports.
7. Control and Security
7.1 Permissions for reading metrics
| Role | Metrics access |
|---|---|
| Monitoring Reader | Read metrics and alerts (without modifying) |
| Monitoring Contributor | Create and modify alerts, action groups |
| Reader | View resource metrics (inherited) |
| Owner / Contributor | Full access |
For operations teams that only need to monitor without modifying resources, Monitoring Reader is the ideal role.
7.2 Diagnostics: enabling diagnostic metrics
Some resources require explicit enabling of diagnostics to export metrics and logs beyond the default:
az monitor diagnostic-settings create \
--name "vm-diagnostics" \
--resource <vm-resource-id> \
--metrics '[{"category":"AllMetrics","enabled":true,"retentionPolicy":{"days":30,"enabled":true}}]' \
--workspace <log-analytics-workspace-id>
This sends metrics to Log Analytics, enabling historical analysis beyond the default 93 days.
7.3 Continuous metrics export
For long-term retention or integration with third-party systems (Grafana, Datadog, Splunk):
# Export metrics to Storage Account
az monitor diagnostic-settings create \
--name "metrics-export" \
--resource <resource-id> \
--metrics '[{"category":"AllMetrics","enabled":true}]' \
--storage-account <storage-account-id>
8. Decision Making
8.1 Which aggregation to use for each scenario
| Metric | Recommended aggregation | Reason |
|---|---|---|
| CPU Percentage | Average + Maximum | Average shows trend; Max shows peaks |
| Available Memory Bytes | Minimum | You want to know the worst case |
| Network In/Out bytes | Sum | Total data transferred in period |
| Request Count | Sum | Total requests |
| Response Latency | P95 or P99 | Slowest user experience |
| Error Count | Sum | Total errors |
| Disk Queue Depth | Average | Average I/O pressure |
| Connections Active | Maximum | Peak simultaneous connections |
8.2 Platform metrics vs Log Analytics for queries
| Situation | Best approach | Reason |
|---|---|---|
| Near-real-time alert (evaluated every 1 min) | Platform Metrics | Minimal latency |
| Historical analysis > 93 days | Log Analytics | Metrics exported for long retention |
| Correlate metrics with log events | Log Analytics | Data joined in same KQL query |
| Live operational dashboard | Platform Metrics + Metrics Explorer | Frequent updates |
| Monthly capacity report | Log Analytics + KQL | Long-term trend analysis |
| VM Scale Set autoscale | Platform Metrics | Autoscale rules evaluate metrics directly, most commonly platform metrics |
8.3 Appropriate granularity by scenario
| Scenario | Recommended granularity |
|---|---|
| Recent incident investigation | 1 minute |
| Daily operational dashboard | 5 minutes |
| Weekly capacity trend | 1 hour |
| Monthly report | 1 day |
| Seasonality analysis | 1 day or 1 week |
9. Best Practices
- Combine Average with Maximum when analyzing CPU: Average shows general trend; Maximum reveals peaks that the average hides.
- Use P95 or P99 for latency metrics, never just Average. Latency averages mask the experience of the slowest users.
- Enable splitting by dimension when investigating errors: splitting Transactions by ResponseType immediately reveals if errors come from server or client.
- Save useful charts to shared dashboards for the team, avoiding recreating the same visualizations during incidents.
- Configure extended retention by exporting metrics to Log Analytics if you need historical analysis beyond 93 days.
- Use period comparison (previous period) when investigating anomalies: comparing with the same window from last week reveals if behavior is new or standard.
- Document normal limits for critical metrics of each resource. Without a baseline, you cannot tell whether a given value is suspicious or normal.
- Combine platform metrics with guest metrics for VMs: the platform shows CPU and network; guest metrics show memory and internal disk I/O.
10. Common Errors
| Error | Why it happens | How to avoid |
|---|---|---|
| CPU looks low but application is slow | Average hides short peaks | Add Maximum and smaller granularity (1 min) |
| Latency looks good but users complain | Using Average instead of P95/P99 | Use percentiles for latency metrics |
| Memory metric doesn't appear for VM | Guest metrics not configured | Install Azure Monitor Agent on VM |
| Can't find expected metric | Wrong namespace selected | Check all available namespaces for the resource |
| Chart shows "No data" | Resource without data in the period | Increase time range or verify if resource was active |
| Storage throttling not detected | Not applying splitting by ResponseType | Use split by ResponseType to see ThrottlingError separately |
| Alert triggering unnecessarily | Threshold too sensitive for long granularity | Adjust granularity or use more appropriate aggregation |
| Historical data unavailable | Period beyond 93 days without export configured | Configure export to Log Analytics in advance |
11. Operation and Maintenance
11.1 Essential metrics by resource type
Virtual Machines:
| Metric | Namespace | Aggregation | Attention threshold |
|---|---|---|---|
| Percentage CPU | Virtual Machine Host | Average + Max | > 80% avg or 100% max for > 5 min |
| Available Memory Bytes | Guest OS | Min | < 10% of total memory |
| OS Disk Queue Depth | Virtual Machine Host | Average | > 10 |
| Network In/Out | Virtual Machine Host | Sum | Abnormal peak vs baseline |
Storage Accounts:
| Metric | Aggregation | Attention threshold |
|---|---|---|
| Transactions | Sum, split by ResponseType | Any ThrottlingError |
| SuccessE2ELatency | Average + P95 | > 200ms average |
| Availability | Average | < 99.9% |
| UsedCapacity | Average | > 80% of limit |
Azure SQL Database:
| Metric | Aggregation | Attention threshold |
|---|---|---|
| DTU Consumption Percent | Maximum | > 80% |
| Connection Failed | Sum | Any value > 0 |
| Deadlocks | Sum | Any value > 0 |
| Sessions Percent | Maximum | > 80% |
11.2 Monitoring Azure Monitor itself
If metrics stop appearing, check:
# Check if Azure Monitor Agent is active on VM
az vm extension list \
--resource-group myRG \
--vm-name myVM \
--query "[?name=='AzureMonitorLinuxAgent'].{Name:name, State:provisioningState}" \
--output table
11.3 Important limits
| Aspect | Limit |
|---|---|
| Platform metrics retention | 93 days |
| Minimum granularity available | 1 minute |
| Custom metrics per resource | 50 dimensions, 10 values per dimension |
| Metrics ingestion latency | 2 to 3 minutes (platform metrics) |
| Guest metrics latency | 5 to 10 minutes after configuration |
| Platform metrics cost | Free |
| Custom metrics cost | Per data point sent |
12. Integration and Automation
12.1 Creating alerts based on metrics
az monitor metrics alert create \
--name "High-CPU-Alert" \
--resource-group myRG \
--scopes <vm-resource-id> \
--condition "avg Percentage CPU > 85" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--action <action-group-id> \
--description "CPU above 85% for 5 minutes"
The --condition field takes an aggregation (avg, min, max, total or count), the metric name, a comparison operator (>, <, >=, <=, = or !=) and a threshold.
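To build intuition for how the window size and evaluation frequency interact with the condition, here is a simplified simulation. It assumes one sample per minute and re-evaluates the average of the last five samples at every step. The `evaluate_alert` function is hypothetical and ignores details of real alert rules such as missing data handling.

```python
from statistics import mean


def evaluate_alert(samples, threshold=85.0, window=5):
    """At each step, report whether the average of the last `window` samples exceeds the threshold."""
    results = []
    for end in range(window, len(samples) + 1):
        results.append(mean(samples[end - window:end]) > threshold)
    return results


# Synthetic 1-minute CPU samples.
cpu = [70, 80, 90, 95, 92, 91, 60, 50]

fired = evaluate_alert(cpu)
print(fired)
```

The condition is first met at the fifth sample, once the 5-minute average reaches 85.4, and it clears once the low readings pull the window average back under 85.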
12.2 Integrating with Grafana
Azure Monitor has a native data source plugin for Grafana. Configure the data source pointing to your Azure subscription and create dashboards with metrics from any resource:
{
"type": "grafana-azure-monitor-datasource",
"name": "Azure Monitor",
"access": "proxy",
"jsonData": {
"subscriptionId": "<sub-id>",
"tenantId": "<tenant-id>"
}
}
12.3 Autoscale based on metrics
Azure Autoscale uses platform metrics to automatically scale resources:
az monitor autoscale create \
--resource-group myRG \
--resource <vmss-resource-id> \
--resource-type Microsoft.Compute/virtualMachineScaleSets \
--name myAutoscaleSettings \
--min-count 2 \
--max-count 10 \
--count 2
# Add scale-out rule based on CPU
az monitor autoscale rule create \
--resource-group myRG \
--autoscale-name myAutoscaleSettings \
--condition "Percentage CPU > 75 avg 5m" \
--scale out 1
12.4 Exporting metrics via Diagnostic Settings to multiple destinations
az monitor diagnostic-settings create \
--name "full-diagnostics" \
--resource <resource-id> \
--metrics '[{"category":"AllMetrics","enabled":true}]' \
--workspace <log-analytics-workspace-id> \
--storage-account <storage-account-id> \
--event-hub <event-hub-name> \
--event-hub-rule <event-hub-auth-rule-id>
You can send to Log Analytics (KQL query), Storage Account (long-term archiving) and Event Hub (integration with Splunk, Datadog, etc.) simultaneously.
13. Final Summary
Essential concepts:
- Metrics are time series of numerical values automatically collected from Azure resources, stored for 93 days at granularities as fine as 1 minute.
- Each metric can have dimensions that allow filtering (see only errors) or splitting (separate by operation type) the data.
- The Metrics Explorer is the main interface for interactive visualization with scope, namespace, metric and aggregation as main controls.
Critical differences:
- Platform Metrics vs Guest OS Metrics: Platform metrics are automatic. Guest metrics require Azure Monitor Agent installed on VM to collect internal OS metrics (memory, disk).
- Average vs Maximum vs Percentile: Average for trends. Maximum for peaks. P95/P99 for real user experience latency.
- Splitting vs Filtering: Splitting creates separate lines in the chart by dimension value. Filtering shows only data corresponding to a specific dimension value.
- Platform Metrics vs Log Analytics: Metrics are optimized for real-time alerts and have 93-day retention. Log Analytics has configurable retention and allows correlation with logs.
What needs to be remembered:
- Platform metrics have latency of 2 to 3 minutes after the event. Don't confuse absence of data with non-existence of activity.
- For analysis beyond 93 days, configure export to Log Analytics or Storage Account before needing it.
- Autoscale rules are driven by metrics (most commonly platform metrics), not by logs.
- For VMs, memory metrics don't appear by default in the platform namespace; Azure Monitor Agent is needed for guest metrics.
- Use P95 or P99 for latency metrics in any user experience analysis. Latency averages are misleading.
- Available granularity depends on the queried range: 1-minute data remains available for 93 days, wide time ranges are charted at coarser granularity, and data older than 93 days exists only if it was exported beforehand.