Theoretical Foundation: Interpret Metrics in Azure Monitor

1. Initial Intuition

Imagine you're driving a car. On the dashboard, you have a speedometer, tachometer, fuel gauge, and engine temperature indicator. Each of these instruments collects a specific numerical measurement in real-time and displays it so you can make decisions while driving. If the temperature rises too high, you know something's wrong and you need to act.

Metrics in Azure Monitor are exactly these instruments for your cloud resources. Each Azure resource (VMs, Storage Accounts, databases, networks) continuously generates numerical values that describe their state and behavior: CPU percentage, bytes transferred, number of requests, response latency, used storage capacity.

The difference from a car dashboard is that in Azure you can query these measurements historically, combine multiple metrics in one chart, calculate averages and percentiles, and configure automatic alerts when a value crosses a defined threshold.

2. Context

2.1 Metrics within Azure Monitor

Azure Monitor is Azure's central observability platform. It collects three types of data:

[Diagram not reproduced: the three types of data collected by Azure Monitor]

Why do metrics exist separately from logs? Metrics are optimized for fast queries and real-time alerts. They're stored in compressed time-series format, suitable for rendering charts and evaluating alert conditions in seconds. Logs are semi-structured text, suitable for deep investigation but with higher ingestion latency.

3. Building Concepts

3.1 What is a metric: fundamental structure

A metric is a time series of numerical values associated with an Azure resource. Each data point has:

Timestamp: When it was collected
Value: The numerical value (e.g., 78.5)
Metric Name: What's being measured (e.g., "Percentage CPU")
Resource: The Azure resource it belongs to
Dimensions (optional): Subdivisions of the metric by attribute
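
As a rough illustration, the five fields above can be modelled as a small record type. This is only a sketch for reasoning about the structure: the class name MetricPoint, its field names, and the placeholder resource ID are assumptions made for this example, not an Azure SDK type.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MetricPoint:
    """One data point of a metric time series (illustrative model only)."""
    timestamp: datetime      # when the value was collected
    value: float             # the numerical value, e.g. 78.5
    metric_name: str         # what is being measured
    resource_id: str         # the Azure resource the point belongs to
    dimensions: dict = field(default_factory=dict)  # optional attribute -> value pairs


# Hypothetical data point for a VM; the resource ID is a placeholder.
point = MetricPoint(
    timestamp=datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc),
    value=78.5,
    metric_name="Percentage CPU",
    resource_id="/subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM",
)
print(point.metric_name, point.value, point.dimensions)
```

A metric without dimensions simply carries an empty dictionary, which is why the field is optional.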

3.2 Dimensions: the concept that multiplies the power of metrics

A dimension is an attribute that allows filtering or segmenting a metric. It's the difference between "how many total requests arrived" and "how many requests arrived per HTTP response code."

Concrete example with Storage Account:

The Transactions metric (number of operations) has dimensions like:

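ResponseType: Success, ServerError, ClientError (simplified here; the service reports more specific values such as ServerOtherError, ServerBusyError, or ClientThrottlingError)
ApiName: GetBlob, PutBlob, ListContainers
Authentication: SAS, AccountKey, OAuth (Microsoft Entra ID)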

Without dimensions, you only see the total. With dimensions, you can answer: "How many GetBlob operations failed with server error in the last 4 hours?"

3.3 Aggregation types

Metrics aren't displayed as individual points for each second. They are aggregated over time intervals. Understanding which aggregation to use is fundamental for correct interpretation:

Aggregation	Description	When to use
Average	Mean of values in the interval	CPU%, average latency
Maximum	Highest value in the interval	CPU peak, maximum connections
Minimum	Lowest value in the interval	Minimum available memory
Sum	Sum of all values (called Total in the CLI and REST API)	Total requests, bytes transferred
Count	Number of measurements captured in the interval	Checking how many samples an aggregate is based on
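Percentile (P50, P95, P99)	Percentile of values; typically computed from log data (for example with KQL), because platform metrics expose only the five aggregations above	Latency percentile (e.g., "95% of requests responded in less than X ms")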

Classic mistake: Using Average to analyze tail latency. A 100ms average can hide that 5% of requests take 2 seconds. Use P95 or P99 to understand the real experience of the slowest users.
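
The effect is easy to reproduce with a few lines of Python. The sketch below uses invented latency values (94 fast requests and 6 slow ones) and a hand-written nearest-rank percentile helper; both are assumptions for illustration and say nothing about how Azure Monitor computes its aggregates internally.

```python
import math
import statistics

# Hypothetical latencies (ms) for 100 requests in one interval:
# 94 fast requests and 6 slow ones.
latencies_ms = [50.0] * 94 + [2000.0] * 6


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


aggregates = {
    "Average": statistics.fmean(latencies_ms),
    "Maximum": max(latencies_ms),
    "Minimum": min(latencies_ms),
    "Sum": sum(latencies_ms),
    "Count": len(latencies_ms),
    "P95": percentile(latencies_ms, 95),
}

for name, value in aggregates.items():
    print(f"{name}: {value}")
```

With this data the average is 167 ms while P95 is 2000 ms: the average looks healthy even though the slowest requests take two seconds.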

3.4 Time Granularity

When querying metrics, you define the time range (e.g., last 24 hours) and the granularity (e.g., points every 5 minutes). Granularity determines the size of the aggregation window.

Period queried	Minimum available granularity
1 hour	1 minute
24 hours	5 minutes
7 days	1 hour
30 days	1 day
More than 30 days	1 day
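
To make the idea of an aggregation window concrete, here is a minimal sketch that groups 1-minute samples into 5-minute windows and averages each one. The function name aggregate_average and the sample values are hypothetical; the code only mirrors the concept and is not how the service stores data.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def aggregate_average(points, granularity):
    """Group (timestamp, value) pairs into fixed windows and average each window."""
    step = int(granularity.total_seconds())
    buckets = defaultdict(list)
    for ts, value in points:
        epoch = int(ts.timestamp())
        # Round the timestamp down to the start of its window.
        window_start = datetime.fromtimestamp(epoch - epoch % step, tz=timezone.utc)
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


start = datetime(2025, 1, 15, 0, 0, tzinfo=timezone.utc)
# Ten hypothetical 1-minute CPU samples: 10, 20, ..., 100
samples = [(start + timedelta(minutes=i), 10.0 * (i + 1)) for i in range(10)]

five_minute = aggregate_average(samples, timedelta(minutes=5))
for window_start, avg in five_minute.items():
    print(window_start.isoformat(), avg)
```

Ten points collapse into two: a larger granularity smooths the series, which is useful for trends and harmful when you are looking for short peaks.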

Retention: Platform metrics are stored at 1-minute granularity and retained for 93 days; older data is deleted rather than rolled up into coarser granularities. For long-term retention, export metrics to Log Analytics or a Storage Account.

3.5 Platform metrics vs custom metrics vs guest metrics

Platform Metrics: Automatically collected by Azure for each resource, no configuration required. Examples: VM CPU, Storage transactions, SQL Database DTU. Available immediately after creating the resource.

Guest OS Metrics: Operating system metrics inside the VM: memory usage, disk, processes. Require installing the Azure Monitor Agent on the VM since Azure has no visibility into the OS by default.

Custom Metrics: Created by your application or scripts and sent to Azure Monitor through the Application Insights SDK or the Azure Monitor custom metrics REST API. They allow measuring anything specific to your application.

3.6 Multi-dimensional metrics: Splitting and Filtering

Two powerful concepts when working with dimensions in Metrics Explorer:

Splitting: Divides a metric into separate series by dimension value. For example: splitting Storage Account Transactions by ResponseType shows separate lines for Success, ServerError, and ClientError on the same chart.

Filtering: Shows only data where the dimension has a specific value. For example: filtering Transactions by ApiName = GetBlob shows only blob download operations.
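
The difference between the two operations can be sketched on a handful of in-memory records. The records, counts, and helper names (split_by, filter_by) below are invented for illustration, and the dimension values are the simplified ones used in this section.

```python
from collections import defaultdict

# Hypothetical Transactions records: (ApiName, ResponseType, count)
records = [
    ("GetBlob", "Success", 120),
    ("GetBlob", "ServerError", 3),
    ("PutBlob", "Success", 40),
    ("PutBlob", "ClientError", 2),
    ("ListContainers", "Success", 5),
]


def split_by(rows, index):
    """Splitting: one total per distinct value of the chosen dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[2]
    return dict(totals)


def filter_by(rows, index, wanted):
    """Filtering: keep only rows where the dimension equals the wanted value."""
    return [row for row in rows if row[index] == wanted]


by_response = split_by(records, 1)              # one series per ResponseType
get_blob_only = filter_by(records, 0, "GetBlob")
failed_get_blob = split_by(get_blob_only, 1).get("ServerError", 0)

print(by_response)
print(failed_get_blob)
```

Combining both, as in the last step, answers questions such as "how many GetBlob operations failed with a server error".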

4. Structural View

[Diagram not reproduced: structural view of Azure Monitor metrics]

5. Practical Operation

5.1 Navigating the Metrics Explorer

The Metrics Explorer is the main interface for viewing metrics. Access via:

Azure Monitor > Metrics or [Resource] > Metrics

The interface has four main controls:

Scope: The resource (or resource group or subscription) whose metrics you want to see.

Metric Namespace: Groups related metrics. A VM has several namespaces: Virtual Machine Host (platform metrics), azure.vm.windows.guestmetrics (Windows guest metrics), etc.

Metric: The specific metric (e.g., Percentage CPU, Available Memory Bytes).

Aggregation: How values will be combined over the time interval (Average, Max, Sum, etc.).

5.2 Practical interpretation examples

Scenario 1: VM with consistently high CPU

Metric: Percentage CPU | Aggregation: Average | Period: 24 hours | Granularity: 5 min

If the chart shows 85-90% average for several hours, this indicates CPU saturation. Compare with peak (Maximum) to see if it reaches 100% and at what times.

Scenario 2: Storage Account with errors

Metric: Transactions | Splitting by ResponseType

If you see lines for ServerError or ThrottlingError growing, this indicates the Storage Account is being throttled or has internal problems.

Scenario 3: Intermittent database problems

Metric: Connection Failed or DTU Consumption Percent (Azure SQL) | Aggregation: Maximum

Peaks in Maximum with normal Average indicate intermittent problems that the average hides.

5.3 Time Range Comparison

Metrics Explorer allows adding a comparison line with a previous period. Example: compare CPU from the last 24 hours with the 24 hours from the same time last week. This reveals anomalous behavior patterns versus expected normal behavior.
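
The same comparison can be done on exported values. The following sketch pairs two equally long series and flags the points where the current period exceeds the previous one by more than a chosen margin; the values and the 20-point margin are assumptions for the example.

```python
def compare_periods(current, previous):
    """Pair up values from two equally long windows and report the change."""
    if len(current) != len(previous):
        raise ValueError("both periods must have the same number of points")
    return [
        {"current": cur, "previous": prev, "delta": cur - prev}
        for cur, prev in zip(current, previous)
    ]


# Hypothetical hourly CPU averages for the same hours, one week apart.
last_week = [30.0, 32.0, 35.0, 31.0]
today = [31.0, 33.0, 70.0, 32.0]

rows = compare_periods(today, last_week)
# Flag hours where CPU is more than 20 points above last week's value.
anomalies = [i for i, row in enumerate(rows) if row["delta"] > 20.0]
print(anomalies)
```

Only the third hour is flagged, which matches what the overlaid comparison line would show visually.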

5.4 Multi-resource metrics simultaneously

With multi-resource metrics, you can compare the same metric across multiple VMs simultaneously. For example: view Percentage CPU of all VMs in a Scale Set side by side to identify whether a specific instance is carrying a disproportionate share of the load.

6. Implementation Approaches

6.1 Azure Portal (Metrics Explorer)

When to use: Interactive investigation, ad-hoc dashboard creation, real-time troubleshooting.

Advantages: Intuitive visual interface, no need to know metric names beforehand, easy dimension exploration with splitting and filtering.

Limitation: Not automatable; each query is manual.

Tip: Use the "Pin to dashboard" button to save useful charts on a permanent dashboard.

6.2 Azure CLI

# List all available metrics for a resource
az monitor metrics list-definitions \
  --resource <resource-id> \
  --output table

# Query specific metric
az monitor metrics list \
  --resource <resource-id> \
  --metric "Percentage CPU" \
  --interval PT5M \
  --start-time 2025-01-15T00:00:00Z \
  --end-time 2025-01-15T23:59:59Z \
  --aggregation Average Maximum \
  --output table

# Query metric with dimension filter
# (ResponseType values are specific, e.g. ServerOtherError or ServerBusyError)
az monitor metrics list \
  --resource <storage-account-id> \
  --metric "Transactions" \
  --interval PT1H \
  --aggregation Total \
  --filter "ResponseType eq 'ServerOtherError'" \
  --output table

When to use: Automation scripts, periodic reports, when you need to process values programmatically.

6.3 Azure PowerShell

# Query metrics (average CPU over the last 24 hours, 5-minute granularity)
$result = Get-AzMetric `
  -ResourceId "<resource-id>" `
  -MetricName "Percentage CPU" `
  -StartTime (Get-Date).AddHours(-24) `
  -EndTime (Get-Date) `
  -TimeGrain 00:05:00 `
  -AggregationType Average

# Process result
$result.Data | Select-Object TimeStamp, Average | Format-Table

6.4 Azure Monitor REST API

For integration with external systems or custom dashboards:

# Query via REST API
curl -X GET \
  "https://management.azure.com{resource-id}/providers/microsoft.insights/metrics?metricnames=Percentage%20CPU&timespan=2025-01-15T00:00:00Z/2025-01-15T23:59:59Z&interval=PT5M&aggregation=average&api-version=2019-07-01" \
  -H "Authorization: Bearer <token>"

The response returns JSON with timestamps and aggregated values.
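
As a sketch of how such a response might be consumed, the snippet below walks a hand-written, abbreviated payload and collects the (timestamp, value) pairs. The sample JSON is an assumption modelled on the general value / timeseries / data layout and should be checked against a real response; extract_points is a hypothetical helper name.

```python
import json

# Abbreviated, hand-written sample; a real response carries more fields.
sample_response = """
{
  "value": [
    {
      "name": {"value": "Percentage CPU"},
      "unit": "Percent",
      "timeseries": [
        {
          "data": [
            {"timeStamp": "2025-01-15T00:00:00Z", "average": 12.5},
            {"timeStamp": "2025-01-15T00:05:00Z", "average": 48.0},
            {"timeStamp": "2025-01-15T00:10:00Z"}
          ]
        }
      ]
    }
  ]
}
"""


def extract_points(payload, aggregation="average"):
    """Collect (timestamp, value) pairs, skipping intervals with no value."""
    points = []
    for metric in payload.get("value", []):
        for series in metric.get("timeseries", []):
            for item in series.get("data", []):
                if aggregation in item:
                    points.append((item["timeStamp"], item[aggregation]))
    return points


points = extract_points(json.loads(sample_response))
print(points)
```

Intervals without data come back without the aggregation key, so the helper skips them instead of treating them as zero.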

6.5 Kusto (Log Analytics) for archived metrics

When you export metrics to Log Analytics, you can query them with Kusto Query Language (KQL):

// Average CPU of all VMs in the last 24 hours
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName == "Percentage CPU"
| where ResourceType == "MICROSOFT.COMPUTE/VIRTUALMACHINES"
| summarize AvgCPU = avg(Average) by Resource, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

When to use: Historical analysis beyond 93 days, correlating metrics with logs, complex reports.

7. Control and Security

7.1 Permissions for reading metrics

Role	Metrics access
Monitoring Reader	Read metrics and alerts (without modifying)
Monitoring Contributor	Create and modify alerts, action groups
Reader	View resource metrics (inherited)
Owner / Contributor	Full access

For operations teams that only need to monitor without modifying resources, Monitoring Reader is the ideal role.

7.2 Diagnostics: enabling diagnostic metrics

Some resources require explicit enabling of diagnostics to export metrics and logs beyond the default:

az monitor diagnostic-settings create \
  --name "vm-diagnostics" \
  --resource <vm-resource-id> \
  --metrics '[{"category":"AllMetrics","enabled":true,"retentionPolicy":{"days":30,"enabled":true}}]' \
  --workspace <log-analytics-workspace-id>

This sends metrics to Log Analytics, enabling historical analysis beyond the default 93 days.

7.3 Continuous metrics export

For long-term retention or integration with third-party systems (Grafana, Datadog, Splunk):

# Export metrics to Storage Account
az monitor diagnostic-settings create \
  --name "metrics-export" \
  --resource <resource-id> \
  --metrics '[{"category":"AllMetrics","enabled":true}]' \
  --storage-account <storage-account-id>

8. Decision Making

8.1 Which aggregation to use for each scenario

Metric	Recommended aggregation	Reason
CPU Percentage	Average + Maximum	Average shows trend; Max shows peaks
Available Memory Bytes	Minimum	You want to know the worst case
Network In/Out bytes	Sum	Total data transferred in period
Request Count	Sum	Total requests
Response Latency	P95 or P99	Slowest user experience
Error Count	Sum	Total errors
Disk Queue Depth	Average	Average I/O pressure
Connections Active	Maximum	Peak simultaneous connections

8.2 Platform metrics vs Log Analytics for queries

Situation	Best approach	Reason
Near-real-time alert (evaluated as often as every minute)	Platform Metrics	Lowest latency
Historical analysis > 93 days	Log Analytics	Metrics exported for long retention
Correlate metrics with log events	Log Analytics	Data joined in same KQL query
Live operational dashboard	Platform Metrics + Metrics Explorer	Frequent updates
Monthly capacity report	Log Analytics + KQL	Long-term trend analysis
VM Scale Set autoscale	Azure Monitor metrics	Autoscale rules evaluate metrics, not log queries

8.3 Appropriate granularity by scenario

Scenario	Recommended granularity
Recent incident investigation	1 minute
Daily operational dashboard	5 minutes
Weekly capacity trend	1 hour
Monthly report	1 day
Seasonality analysis	1 day or 1 week

9. Best Practices

Combine Average with Maximum when analyzing CPU: Average shows general trend; Maximum reveals peaks that the average hides.
Use P95 or P99 for latency metrics, never just Average. Latency averages mask the experience of the slowest users.
Enable splitting by dimension when investigating errors: splitting Transactions by ResponseType immediately reveals if errors come from server or client.
Save useful charts to shared dashboards for the team, avoiding recreating the same visualizations during incidents.
Configure extended retention by exporting metrics to Log Analytics if you need historical analysis beyond 93 days.
Use period comparison (previous period) when investigating anomalies: comparing with the same window from last week reveals whether the behavior is new or typical.
Document normal ranges for the critical metrics of each resource. Without a baseline, you cannot tell whether a value is suspicious or normal.
Combine platform metrics with guest metrics for VMs: the platform shows CPU and network; guest metrics show memory and internal disk I/O.

10. Common Errors

Error	Why it happens	How to avoid
CPU looks low but application is slow	Using Average hiding short peaks	Add Maximum and smaller granularity (1 min)
Latency looks good but users complain	Using Average instead of P95/P99	Use percentiles for latency metrics
Memory metric doesn't appear for VM	Guest metrics not configured	Install Azure Monitor Agent on VM
Can't find expected metric	Wrong namespace selected	Check all available namespaces for the resource
Chart shows "No data"	Resource without data in the period	Increase time range or verify if resource was active
Storage throttling not detected	Not applying splitting by ResponseType	Use split by ResponseType to see ThrottlingError separately
Alert triggering unnecessarily	Threshold too sensitive for the evaluation window	Adjust the window size or use a more appropriate aggregation
Historical data unavailable	Period beyond 93 days without export configured	Configure export to Log Analytics in advance

11. Operation and Maintenance

11.1 Essential metrics by resource type

Virtual Machines:

Metric	Namespace	Aggregation	Attention threshold
Percentage CPU	Virtual Machine Host	Average + Max	> 80% avg or 100% max for > 5 min
Available Memory Bytes	Guest OS	Min	< 10% of total memory
OS Disk Queue Depth	Virtual Machine Host	Average	> 10
Network In/Out	Virtual Machine Host	Sum	Abnormal peak vs baseline

Storage Accounts:

Metric	Aggregation	Attention threshold
Transactions	Sum, split by ResponseType	Any ThrottlingError
SuccessE2ELatency	Average + P95	> 200ms average
Availability	Average	< 99.9%
UsedCapacity	Average	> 80% of limit

Azure SQL Database:

Metric	Aggregation	Attention threshold
DTU Consumption Percent	Maximum	> 80%
Connection Failed	Sum	Any value > 0
Deadlocks	Sum	Any value > 0
Sessions Percent	Maximum	> 80%

11.2 Monitoring Azure Monitor itself

If metrics stop appearing, check:

# Check if Azure Monitor Agent is installed on the VM (Linux shown; the Windows extension is named AzureMonitorWindowsAgent)
az vm extension list \
  --resource-group myRG \
  --vm-name myVM \
  --query "[?name=='AzureMonitorLinuxAgent'].{Name:name, State:provisioningState}" \
  --output table

11.3 Important limits

Aspect	Limit
Platform metrics retention	93 days
Minimum granularity available	1 minute
Custom metrics dimensions	Up to 10 per custom metric
Metrics ingestion latency	2 to 3 minutes (platform metrics)
Guest metrics latency	5 to 10 minutes after configuration
Platform metrics cost	Free
Custom metrics cost	Per data point sent

12. Integration and Automation

12.1 Creating alerts based on metrics

az monitor metrics alert create \
  --name "High-CPU-Alert" \
  --resource-group myRG \
  --scopes <vm-resource-id> \
  --condition "avg Percentage CPU > 85" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action <action-group-id> \
  --description "CPU above 85% for 5 minutes"

The --condition field takes an aggregation (avg, min, max, total, count), a metric name, a comparison operator (>, >=, <, <=, =, !=), and a threshold.
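
Conceptually, a rule like the one above averages the samples in the most recent window and compares the result with the threshold at each evaluation. The sketch below mimics that logic on 1-minute samples; alert_fires and the sample values are hypothetical, and the real service evaluation has more nuances than this.

```python
def alert_fires(samples, threshold, window_size):
    """Return True when the average of the most recent window exceeds the threshold."""
    if len(samples) < window_size:
        return False  # not enough data to fill the window yet
    window = samples[-window_size:]
    return sum(window) / len(window) > threshold


# Hypothetical 1-minute CPU samples, oldest first.
sustained_load = [70.0, 80.0, 90.0, 95.0, 92.0, 88.0]
single_spike = [50.0, 50.0, 50.0, 50.0, 100.0]

print(alert_fires(sustained_load, 85, 5))  # average of the last 5 samples is 89.0
print(alert_fires(single_spike, 85, 5))    # average of the last 5 samples is 60.0
```

Averaging over a 5-minute window is what keeps a single 100% spike from firing the alert while sustained load does.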

12.2 Integrating with Grafana

Azure Monitor has a native data source plugin for Grafana. Configure the data source pointing to your Azure subscription and create dashboards with metrics from any resource:

{
  "type": "grafana-azure-monitor-datasource",
  "name": "Azure Monitor",
  "access": "proxy",
  "jsonData": {
    "subscriptionId": "<sub-id>",
    "tenantId": "<tenant-id>"
  }
}

12.3 Autoscale based on metrics

Azure Autoscale uses metrics (most commonly platform metrics such as Percentage CPU) to automatically scale resources:

az monitor autoscale create \
  --resource-group myRG \
  --resource <vmss-resource-id> \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name myAutoscaleSettings \
  --min-count 2 \
  --max-count 10 \
  --count 2

# Add scale-out rule based on CPU
az monitor autoscale rule create \
  --resource-group myRG \
  --autoscale-name myAutoscaleSettings \
  --condition "Percentage CPU > 75 avg 5m" \
  --scale out 1
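
The decision encoded by these two commands can be summarised as: add one instance when average CPU exceeds 75%, but never leave the 2 to 10 range. A minimal sketch of that logic follows; next_instance_count is a hypothetical helper, and it ignores cooldowns and scale-in rules that a real autoscale setting would also have.

```python
def next_instance_count(current, avg_cpu, *, min_count=2, max_count=10,
                        threshold=75.0, step=1):
    """Scale out by `step` when average CPU exceeds the threshold, within bounds."""
    target = current + step if avg_cpu > threshold else current
    # Clamp the result to the configured minimum and maximum.
    return max(min_count, min(max_count, target))


print(next_instance_count(2, 80.0))   # above threshold: one more instance
print(next_instance_count(10, 90.0))  # already at the maximum
print(next_instance_count(4, 60.0))   # below threshold: unchanged
```

Without a matching scale-in rule the instance count only ever grows, so in practice a second rule with a lower threshold usually accompanies the one shown.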

12.4 Exporting metrics via Diagnostic Settings to multiple destinations

az monitor diagnostic-settings create \
  --name "full-diagnostics" \
  --resource <resource-id> \
  --metrics '[{"category":"AllMetrics","enabled":true}]' \
  --workspace <log-analytics-workspace-id> \
  --storage-account <storage-account-id> \
  --event-hub <event-hub-name> \
  --event-hub-rule <event-hub-auth-rule-id>

You can send to Log Analytics (KQL query), Storage Account (long-term archiving) and Event Hub (integration with Splunk, Datadog, etc.) simultaneously.

13. Final Summary

Essential concepts:

Metrics are time series of numerical values collected from Azure resources (automatically, in the case of platform metrics) and stored for 93 days at a granularity as fine as 1 minute.
Each metric can have dimensions that allow filtering (see only errors) or splitting (separate by operation type) the data.
The Metrics Explorer is the main interface for interactive visualization with scope, namespace, metric and aggregation as main controls.

Critical differences:

Platform Metrics vs Guest OS Metrics: Platform metrics are automatic. Guest metrics require Azure Monitor Agent installed on VM to collect internal OS metrics (memory, disk).
Average vs Maximum vs Percentile: Average for trends. Maximum for peaks. P95/P99 for real user experience latency.
Splitting vs Filtering: Splitting creates separate lines in the chart by dimension value. Filtering shows only data corresponding to a specific dimension value.
Platform Metrics vs Log Analytics: Metrics are optimized for real-time alerts and have 93-day retention. Log Analytics has configurable retention and allows correlation with logs.

What needs to be remembered:

Platform metrics have latency of 2 to 3 minutes after the event. Don't confuse absence of data with non-existence of activity.
For analysis beyond 93 days, configure export to Log Analytics or Storage Account before needing it.
Autoscale rules evaluate metrics, not log queries. Platform metrics work without extra setup, while guest or custom metrics must be collected first.
For VMs, memory metrics don't appear by default in the platform namespace; Azure Monitor Agent is needed for guest metrics.
Use P95 or P99 for latency metrics in any user experience analysis. Latency averages are misleading.
Platform metrics are available at 1-minute granularity for the full 93-day retention period; data older than that is not kept at any granularity unless you exported it.
