
Theoretical Foundation: Interpret Metrics in Azure Monitor


1. Initial Intuition​

Imagine you're driving a car. On the dashboard, you have a speedometer, tachometer, fuel gauge, and engine temperature indicator. Each of these instruments collects a specific numerical measurement in real-time and displays it so you can make decisions while driving. If the temperature rises too high, you know something's wrong and you need to act.

Metrics in Azure Monitor are exactly these instruments for your cloud resources. Each Azure resource (VM, Storage Account, database, network) continuously generates numerical values that describe its state and behavior: CPU percentage, bytes transferred, number of requests, response latency, used storage capacity.

The difference from a car dashboard is that in Azure you can query these measurements historically, combine multiple metrics in one chart, calculate averages and percentiles, and configure automatic alerts when a value crosses a defined threshold.


2. Context​

2.1 Metrics within Azure Monitor​

Azure Monitor is Azure's central observability platform. It collects three types of data:

[Diagram: the three types of data collected by Azure Monitor]

Why do metrics exist separately from logs? Metrics are optimized for fast queries and real-time alerts. They're stored in compressed time-series format, suitable for rendering charts and evaluating alert conditions in seconds. Logs are semi-structured text, suitable for deep investigation but with higher ingestion latency.


3. Building Concepts​

3.1 What is a metric: fundamental structure​

A metric is a time series of numerical values associated with an Azure resource. Each data point has:

  • Timestamp: When it was collected
  • Value: The numerical value (e.g., 78.5)
  • Metric Name: What's being measured (e.g., "Percentage CPU")
  • Resource: The Azure resource it belongs to
  • Dimensions (optional): Subdivisions of the metric by attribute
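As a rough illustration, the fields above can be modeled as a small record type. This is only a sketch: `MetricPoint` and its field names are hypothetical and are not part of any Azure SDK, and the resource ID is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MetricPoint:
    """One data point of a metric time series (illustrative model, not an Azure SDK type)."""
    timestamp: datetime      # when the value was collected
    value: float             # the numerical value, e.g. 78.5
    metric_name: str         # what is being measured
    resource_id: str         # the Azure resource the point belongs to
    dimensions: dict = field(default_factory=dict)  # optional attribute -> value pairs


point = MetricPoint(
    timestamp=datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc),
    value=78.5,
    metric_name="Percentage CPU",
    resource_id="<vm-resource-id>",  # placeholder
)
print(point.metric_name, point.value, point.dimensions)
```

A point with no dimensions simply carries an empty mapping, which matches the idea that dimensions are optional.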

3.2 Dimensions: the concept that multiplies the power of metrics​

A dimension is an attribute that allows filtering or segmenting a metric. It's the difference between "how many total requests arrived" and "how many requests arrived per HTTP response code."

Concrete example with Storage Account:

The Transactions metric (number of operations) has dimensions like:

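  • ResponseType: Success, ServerError, ClientError (simplified here; the actual values are more specific, for example ServerBusyError, ServerTimeoutError, or ClientThrottlingError)
  • ApiName: GetBlob, PutBlob, ListContainers
  • Authentication: SAS, AccountKey, OAuth (Microsoft Entra ID, formerly Azure Active Directory)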

Without dimensions, you only see the total. With dimensions, you can answer: "How many GetBlob operations failed with server error in the last 4 hours?"


3.3 Aggregation types​

Metrics aren't displayed as individual points for each second. They are aggregated over time intervals. Understanding which aggregation to use is fundamental for correct interpretation:

| Aggregation | Description | When to use |
| --- | --- | --- |
| Average | Mean of values in the interval | CPU%, average latency |
| Maximum | Highest value in the interval | CPU peak, maximum connections |
| Minimum | Lowest value in the interval | Minimum available memory |
| Sum | Sum of all values | Total requests, bytes transferred |
| Count | Number of data points | Number of operations |
| Percentile (P50, P95, P99) | Percentile of values | Latency percentile (e.g., "95% of requests responded in less than X ms") |

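Classic mistake: Using Average to analyze tail latency. An average of about 150 ms can hide that 5% of requests take 2 seconds while the rest take 50 ms. Use P95 or P99 to understand the real experience of the slowest users. Percentile aggregations are not offered for every metric: where Metrics Explorer lists only Average, Minimum, Maximum, Sum, and Count, compute percentiles from log-based or exported data instead.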
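The gap between the average and the tail is easy to reproduce with made-up numbers. The sketch below uses a simple nearest-rank percentile. The `percentile` helper and the sample latencies are illustrative assumptions, not how Azure Monitor computes percentiles internally.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# 94 fast requests at 50 ms and 6 slow requests at 2000 ms (made-up sample).
latencies_ms = [50] * 94 + [2000] * 6

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

print(f"average={average} ms  p50={p50} ms  p95={p95} ms")
```

With this sample the average is 167 ms and the median is 50 ms, while P95 is 2000 ms: the slow requests are only visible in the percentile.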


3.4 Time Granularity​

When querying metrics, you define the time range (e.g., last 24 hours) and the granularity (e.g., points every 5 minutes). Granularity determines the size of the aggregation window.

| Period queried | Minimum available granularity |
| --- | --- |
| 1 hour | 1 minute |
| 24 hours | 5 minutes |
| 7 days | 1 hour |
| 30 days | 1 day |
| More than 30 days | 1 day |
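To make the effect of the aggregation window concrete, the following sketch re-buckets 1-minute samples into 5-minute windows. The `rebucket` helper and the CPU samples are hypothetical; the point is only that a coarser window flattens a short spike in the Average while the Maximum still shows it.

```python
from collections import defaultdict


def rebucket(points, window_minutes):
    """Group (minute, value) samples into windows and return {window_start: (average, maximum)}."""
    buckets = defaultdict(list)
    for minute, value in points:
        buckets[minute - minute % window_minutes].append(value)
    return {start: (sum(v) / len(v), max(v)) for start, v in sorted(buckets.items())}


# Ten 1-minute CPU samples with a single one-minute spike at minute 3 (made-up data).
cpu = [(0, 20), (1, 20), (2, 20), (3, 100), (4, 20),
       (5, 20), (6, 20), (7, 20), (8, 20), (9, 20)]

five_min = rebucket(cpu, 5)
for start, (avg, peak) in five_min.items():
    print(f"window starting at minute {start}: average={avg}, maximum={peak}")
```

In the first window the spike to 100% becomes an average of 36%, which is why a short incident is best investigated at the finest granularity available.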

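Retention: Platform metrics are retained for 93 days, including at 1-minute granularity. Older data is no longer available from the metrics store. A single Metrics Explorer chart can display at most 30 days at a time, so older data within the 93 days must be viewed in separate windows. For long-term retention, export metrics to Log Analytics or a Storage Account.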


3.5 Platform metrics vs custom metrics vs guest metrics​

Platform Metrics: Automatically collected by Azure for each resource, no configuration required. Examples: VM CPU, Storage transactions, SQL Database DTU. Available immediately after creating the resource.

Guest OS Metrics: Operating system metrics inside the VM: memory usage, disk, processes. Require installing the Azure Monitor Agent on the VM since Azure has no visibility into the OS by default.

Custom Metrics: Created by your application or scripts and sent to Azure Monitor through the Application Insights SDK or the Azure Monitor custom metrics REST API. They let you measure anything specific to your application.
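A custom metric is typically sent pre-aggregated: each series carries min, max, sum, and count for the reporting interval. The sketch below builds such a request body in the shape used by the custom metrics REST API. The helper name, the metric name `QueueDepth`, the namespace, and the dimension are hypothetical, and the code only constructs the JSON; it does not authenticate or send anything.

```python
import json
from datetime import datetime, timezone


def build_custom_metric(metric, namespace, dimensions, samples, when):
    """Pre-aggregate raw samples into the min/max/sum/count form of a single series."""
    return {
        "time": when.strftime("%Y-%m-%dT%H:%M:%S"),
        "data": {
            "baseData": {
                "metric": metric,
                "namespace": namespace,
                "dimNames": list(dimensions.keys()),
                "series": [{
                    "dimValues": list(dimensions.values()),
                    "min": min(samples),
                    "max": max(samples),
                    "sum": sum(samples),
                    "count": len(samples),
                }],
            }
        },
    }


payload = build_custom_metric(
    metric="QueueDepth",                 # hypothetical metric name
    namespace="QueueProcessing",         # hypothetical namespace
    dimensions={"QueueName": "orders"},  # hypothetical dimension
    samples=[3, 5, 20],
    when=datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

Sending min, max, sum, and count rather than raw values is what lets the service later answer Average (sum divided by count), Maximum, and Minimum queries for the same interval.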


3.6 Multi-dimensional metrics: Splitting and Filtering​

Two powerful concepts when working with dimensions in Metrics Explorer:

Splitting: Divides a metric into separate series by dimension value. For example: splitting Storage Account Transactions by ResponseType shows separate lines for Success, ServerError, and ClientError on the same chart.

Filtering: Shows only data where the dimension has a specific value. For example: filtering Transactions by ApiName = GetBlob shows only blob download operations.
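The difference between the two operations can be sketched on a handful of in-memory records. The records, counts, and helper names below are made up, and the ResponseType labels mirror the simplified ones used in this text rather than the exact values a Storage Account emits.

```python
from collections import Counter

# Made-up Transactions records, one per (ApiName, ResponseType) combination.
transactions = [
    {"ApiName": "GetBlob", "ResponseType": "Success", "count": 120},
    {"ApiName": "GetBlob", "ResponseType": "ServerError", "count": 3},
    {"ApiName": "PutBlob", "ResponseType": "Success", "count": 40},
    {"ApiName": "PutBlob", "ResponseType": "ClientError", "count": 2},
    {"ApiName": "ListContainers", "ResponseType": "Success", "count": 5},
]


def split_by(records, dimension):
    """Splitting: one total per distinct value of the dimension."""
    totals = Counter()
    for record in records:
        totals[record[dimension]] += record["count"]
    return dict(totals)


def filter_by(records, dimension, value):
    """Filtering: keep only the records where the dimension has the given value."""
    return [r for r in records if r[dimension] == value]


print(split_by(transactions, "ResponseType"))
get_blob = filter_by(transactions, "ApiName", "GetBlob")
print(split_by(get_blob, "ResponseType"))
```

Combining both, as in the last line, answers the earlier question of how many GetBlob operations failed with a server error: filter to GetBlob first, then split by ResponseType.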


4. Structural View​

[Diagram: structural view of Azure Monitor metrics]

5. Practical Operation​

5.1 Navigating the Metrics Explorer​

The Metrics Explorer is the main interface for viewing metrics. Access via:

Azure Monitor > Metrics or [Resource] > Metrics

The interface has four main controls:

Scope: The resource (or resource group or subscription) whose metrics you want to see.

Metric Namespace: Groups related metrics. A VM has several namespaces: Virtual Machine Host (platform metrics), azure.vm.windows.guestmetrics (Windows guest metrics), etc.

Metric: The specific metric (e.g., Percentage CPU, Available Memory Bytes).

Aggregation: How values will be combined over the time interval (Average, Max, Sum, etc.).


5.2 Practical interpretation examples​

Scenario 1: VM with consistently high CPU

Metric: Percentage CPU | Aggregation: Average | Period: 24 hours | Granularity: 5 min

If the chart shows 85-90% average for several hours, this indicates CPU saturation. Compare with peak (Maximum) to see if it reaches 100% and at what times.

Scenario 2: Storage Account with errors

Metric: Transactions | Splitting by ResponseType

If the lines for server errors or throttling errors (for example ServerBusyError or ClientThrottlingError) are growing, the Storage Account is being throttled or has internal problems.

Scenario 3: Intermittent database problems

Metric: Connection Failed or DTU Consumption Percent (Azure SQL) | Aggregation: Maximum

Peaks in Maximum with normal Average indicate intermittent problems that the average hides.
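One way to surface this pattern programmatically is to flag windows whose Maximum breaches a threshold while the Average stays below it. This is a sketch only: the `hidden_peaks` helper, the 80% threshold, and the sample windows are assumptions chosen for illustration.

```python
def hidden_peaks(windows, threshold):
    """Return the start of each window whose Maximum reaches the threshold while its Average does not."""
    return [w["start"] for w in windows if w["maximum"] >= threshold > w["average"]]


# Made-up 5-minute windows of DTU consumption percent.
dtu = [
    {"start": "10:00", "average": 35.0, "maximum": 42.0},  # quiet
    {"start": "10:05", "average": 38.0, "maximum": 97.0},  # short spike hidden by the average
    {"start": "10:10", "average": 88.0, "maximum": 99.0},  # sustained load, visible in both
]

print(hidden_peaks(dtu, threshold=80))
```

Only the 10:05 window is reported: the 10:10 window is also a problem, but the Average already reveals it.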


5.3 Time Range Comparison​

Metrics Explorer allows adding a comparison line with a previous period. Example: compare CPU from the last 24 hours with the 24 hours from the same time last week. This reveals anomalous behavior patterns versus expected normal behavior.
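The same comparison can be sketched outside the portal by pairing two equally long series, for example values exported for today and for the same window last week. The `compare_periods` helper and the sample values are hypothetical.

```python
def compare_periods(current, previous):
    """Pair two equally long series and return (current, previous, percent change) per interval."""
    if len(current) != len(previous):
        raise ValueError("both series must cover the same number of intervals")
    rows = []
    for now, before in zip(current, previous):
        # Percent change is undefined when the previous value is zero.
        change = None if before == 0 else (now - before) * 100 / before
        rows.append((now, before, change))
    return rows


cpu_today = [40.0, 45.0, 90.0]      # made-up hourly averages
cpu_last_week = [40.0, 50.0, 45.0]  # same hours, one week earlier

for now, before, change in compare_periods(cpu_today, cpu_last_week):
    print(f"now={now}  last week={before}  change={change}%")
```

The third interval doubles relative to last week, which is the kind of deviation the comparison line makes visible at a glance.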


5.4 Multi-resource metrics simultaneously​

With Multi-resource metrics, you can compare the same metric across multiple VMs simultaneously. For example: see Percentage CPU of all VMs in a Scale Set side by side to identify if a specific VM is unbalanced.


6. Implementation Approaches​

6.1 Azure Portal (Metrics Explorer)​

When to use: Interactive investigation, ad-hoc dashboard creation, real-time troubleshooting.

Advantages: Intuitive visual interface, no need to know metric names beforehand, easy dimension exploration with splitting and filtering.

Limitation: Not automatable; each query is manual.

Tip: Use the "Pin to dashboard" button to save useful charts on a permanent dashboard.


6.2 Azure CLI​

# List all available metrics for a resource
az monitor metrics list-definitions \
    --resource <resource-id> \
    --output table

# Query a specific metric (average and peak CPU at 5-minute granularity)
az monitor metrics list \
    --resource <resource-id> \
    --metric "Percentage CPU" \
    --interval PT5M \
    --start-time 2025-01-15T00:00:00Z \
    --end-time 2025-01-15T23:59:59Z \
    --aggregation Average Maximum \
    --output table

# Query a metric with a dimension filter
# (ResponseType values are specific, e.g. ServerBusyError or ServerTimeoutError)
az monitor metrics list \
    --resource <storage-account-id> \
    --metric "Transactions" \
    --interval PT1H \
    --aggregation Total \
    --filter "ResponseType eq 'ServerBusyError'" \
    --output table

When to use: Automation scripts, periodic reports, when you need to process values programmatically.


6.3 Azure PowerShell​

# Query the last 24 hours of CPU at 5-minute granularity
$result = Get-AzMetric `
    -ResourceId "<resource-id>" `
    -MetricName "Percentage CPU" `
    -StartTime (Get-Date).AddHours(-24) `
    -EndTime (Get-Date) `
    -TimeGrain 00:05:00 `
    -AggregationType Average

# Show the timestamp and average of each data point
$result.Data | Select-Object TimeStamp, Average | Format-Table

6.4 Azure Monitor REST API​

For integration with external systems or custom dashboards:

# Query via REST API ({resource-id} starts with /subscriptions/...)
curl -X GET \
    "https://management.azure.com{resource-id}/providers/microsoft.insights/metrics?metricnames=Percentage%20CPU&timespan=2025-01-15T00:00:00Z/2025-01-15T23:59:59Z&interval=PT5M&aggregation=average&api-version=2019-07-01" \
    -H "Authorization: Bearer <token>"

The response returns JSON with timestamps and aggregated values.
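As a sketch of how that JSON might be consumed, the code below flattens a response into (timestamp, value) pairs. The sample response is made up and trimmed to the fields used here, and the `extract_series` helper is hypothetical. It assumes that an interval without data omits the aggregation field, so such entries are skipped.

```python
import json

# Made-up, trimmed response in the general shape returned by the metrics endpoint.
SAMPLE_RESPONSE = """
{
  "value": [
    {
      "name": {"value": "Percentage CPU"},
      "unit": "Percent",
      "timeseries": [
        {
          "data": [
            {"timeStamp": "2025-01-15T00:00:00Z", "average": 12.5},
            {"timeStamp": "2025-01-15T00:05:00Z"},
            {"timeStamp": "2025-01-15T00:10:00Z", "average": 80.0}
          ]
        }
      ]
    }
  ]
}
"""


def extract_series(response, aggregation="average"):
    """Flatten a metrics response into a list of (timestamp, value) pairs."""
    points = []
    for metric in response.get("value", []):
        for series in metric.get("timeseries", []):
            for item in series.get("data", []):
                if aggregation in item:  # skip intervals that carry no value
                    points.append((item["timeStamp"], item[aggregation]))
    return points


points = extract_series(json.loads(SAMPLE_RESPONSE))
for timestamp, value in points:
    print(timestamp, value)
```

When a dimension filter or split is applied, a response can contain several entries under `timeseries`; this sketch simply concatenates them, so a real integration would likely keep them separate.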


6.5 Kusto (Log Analytics) for archived metrics​

When you export metrics to Log Analytics, you can query them with Kusto Query Language (KQL):

// Average CPU of all VMs in the last 24 hours, in 1-hour bins
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName == "Percentage CPU"
// Keep only virtual machines; contains is case-insensitive
| where ResourceId contains "/PROVIDERS/MICROSOFT.COMPUTE/VIRTUALMACHINES/"
| summarize AvgCPU = avg(Average) by Resource, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

When to use: Historical analysis beyond 93 days, correlating metrics with logs, complex reports.


7. Control and Security​

7.1 Permissions for reading metrics​

| Role | Metrics access |
| --- | --- |
| Monitoring Reader | Read metrics and alerts (without modifying) |
| Monitoring Contributor | Create and modify alerts, action groups |
| Reader | View resource metrics (inherited) |
| Owner / Contributor | Full access |

For operations teams that only need to monitor without modifying resources, Monitoring Reader is the ideal role.


7.2 Diagnostics: enabling diagnostic metrics​

Some resources require explicit enabling of diagnostics to export metrics and logs beyond the default:

# Retention is governed by the Log Analytics workspace, so no retentionPolicy is set here
az monitor diagnostic-settings create \
    --name "vm-diagnostics" \
    --resource <vm-resource-id> \
    --metrics '[{"category":"AllMetrics","enabled":true}]' \
    --workspace <log-analytics-workspace-id>

This sends metrics to Log Analytics, enabling historical analysis beyond the default 93 days.


7.3 Continuous metrics export​

For long-term retention or integration with third-party systems (Grafana, Datadog, Splunk):

# Export metrics to Storage Account
az monitor diagnostic-settings create \
    --name "metrics-export" \
    --resource <resource-id> \
    --metrics '[{"category":"AllMetrics","enabled":true}]' \
    --storage-account <storage-account-id>

8. Decision Making​

8.1 Which aggregation to use for each scenario​

| Metric | Recommended aggregation | Reason |
| --- | --- | --- |
| CPU Percentage | Average + Maximum | Average shows trend; Max shows peaks |
| Available Memory Bytes | Minimum | You want to know the worst case |
| Network In/Out bytes | Sum | Total data transferred in period |
| Request Count | Sum | Total requests |
| Response Latency | P95 or P99 | Slowest user experience |
| Error Count | Sum | Total errors |
| Disk Queue Depth | Average | Average I/O pressure |
| Connections Active | Maximum | Peak simultaneous connections |

8.2 Platform metrics vs Log Analytics for queries​

| Situation | Best approach | Reason |
| --- | --- | --- |
| Near-real-time alert (1-minute evaluation) | Platform Metrics | Minimal latency |
| Historical analysis > 93 days | Log Analytics | Metrics exported for long retention |
| Correlate metrics with log events | Log Analytics | Data joined in same KQL query |
| Live operational dashboard | Platform Metrics + Metrics Explorer | Frequent updates |
| Monthly capacity report | Log Analytics + KQL | Long-term trend analysis |
| VM Scale Set autoscale | Platform Metrics | Autoscale rules evaluate metrics, not log queries |

8.3 Appropriate granularity by scenario​

| Scenario | Recommended granularity |
| --- | --- |
| Recent incident investigation | 1 minute |
| Daily operational dashboard | 5 minutes |
| Weekly capacity trend | 1 hour |
| Monthly report | 1 day |
| Seasonality analysis | 1 day or 1 week |

9. Best Practices​

  • Combine Average with Maximum when analyzing CPU: Average shows general trend; Maximum reveals peaks that the average hides.
  • Use P95 or P99 for latency metrics, never just Average. Latency averages mask the experience of the slowest users.
  • Enable splitting by dimension when investigating errors: splitting Transactions by ResponseType immediately reveals if errors come from server or client.
  • Save useful charts to shared dashboards for the team, avoiding recreating the same visualizations during incidents.
  • Configure extended retention by exporting metrics to Log Analytics if you need historical analysis beyond 93 days.
  • Use period comparison (previous period) when investigating anomalies: comparing with the same window from last week reveals if behavior is new or standard.
  • Document normal limits for critical metrics of each resource. Without a baseline, you cannot tell whether a given value is normal or suspicious.
  • Combine platform metrics with guest metrics for VMs: the platform shows CPU and network; guest metrics show memory and internal disk I/O.
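The baseline practice above can be sketched with simple statistics: record what "normal" looks like, then flag values that fall far outside it. The three-sigma rule, the helper names, and the sample history are assumptions for illustration; real workloads with daily or weekly patterns usually need a baseline per time of day.

```python
from statistics import mean, pstdev


def baseline(history):
    """Summarize normal behavior as (mean, population standard deviation)."""
    return mean(history), pstdev(history)


def is_anomalous(value, history, sigmas=3.0):
    """Flag a value that lies more than `sigmas` standard deviations from the baseline mean."""
    avg, spread = baseline(history)
    return abs(value - avg) > sigmas * spread


# Made-up CPU averages recorded during normal operation.
normal_cpu = [40, 42, 38, 41, 39, 40, 43, 37]

print(baseline(normal_cpu))
print(is_anomalous(44, normal_cpu))  # within normal variation
print(is_anomalous(85, normal_cpu))  # far outside the baseline
```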

10. Common Errors​

| Error | Why it happens | How to avoid |
| --- | --- | --- |
| CPU looks low but application is slow | Using Average hiding short peaks | Add Maximum and smaller granularity (1 min) |
| Latency looks good but users complain | Using Average instead of P95/P99 | Use percentiles for latency metrics |
| Memory metric doesn't appear for VM | Guest metrics not configured | Install Azure Monitor Agent on VM |
| Can't find expected metric | Wrong namespace selected | Check all available namespaces for the resource |
| Chart shows "No data" | Resource without data in the period | Increase time range or verify if resource was active |
| Storage throttling not detected | Not applying splitting by ResponseType | Use split by ResponseType to see ThrottlingError separately |
| Alert triggering unnecessarily | Threshold too sensitive for long granularity | Adjust granularity or use more appropriate aggregation |
| Historical data unavailable | Period beyond 93 days without export configured | Configure export to Log Analytics in advance |

11. Operation and Maintenance​

11.1 Essential metrics by resource type​

Virtual Machines:

| Metric | Namespace | Aggregation | Attention threshold |
| --- | --- | --- | --- |
| Percentage CPU | Virtual Machine Host | Average + Max | > 80% avg or 100% max for > 5 min |
| Available Memory Bytes | Guest OS | Min | < 10% of total memory |
| OS Disk Queue Depth | Virtual Machine Host | Average | > 10 |
| Network In/Out | Virtual Machine Host | Sum | Abnormal peak vs baseline |

Storage Accounts:

| Metric | Aggregation | Attention threshold |
| --- | --- | --- |
| Transactions | Sum, split by ResponseType | Any ThrottlingError |
| SuccessE2ELatency | Average + P95 | > 200 ms average |
| Availability | Average | < 99.9% |
| UsedCapacity | Average | > 80% of limit |

Azure SQL Database:

| Metric | Aggregation | Attention threshold |
| --- | --- | --- |
| DTU Consumption Percent | Maximum | > 80% |
| Connection Failed | Sum | Any value > 0 |
| Deadlocks | Sum | Any value > 0 |
| Sessions Percent | Maximum | > 80% |

11.2 Monitoring Azure Monitor itself​

If metrics stop appearing, check:

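# Check whether the Azure Monitor Agent extension is provisioned on the VM
# (AzureMonitorLinuxAgent on Linux, AzureMonitorWindowsAgent on Windows)
az vm extension list \
    --resource-group myRG \
    --vm-name myVM \
    --query "[?starts_with(name, 'AzureMonitor')].{Name:name, State:provisioningState}" \
    --output table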

11.3 Important limits​

| Aspect | Limit |
| --- | --- |
| Platform metrics retention | 93 days |
| Minimum granularity available | 1 minute |
| Custom metric dimensions | Up to 10 dimension keys per metric |
| Metrics ingestion latency | 2 to 3 minutes (platform metrics) |
| Guest metrics latency | 5 to 10 minutes after configuration |
| Platform metrics cost | Free |
| Custom metrics cost | Per data point sent |

12. Integration and Automation​

12.1 Creating alerts based on metrics​

# Alert when average CPU over a 5-minute window exceeds 85%, evaluated every minute
az monitor metrics alert create \
    --name "High-CPU-Alert" \
    --resource-group myRG \
    --scopes <vm-resource-id> \
    --condition "avg Percentage CPU > 85" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --severity 2 \
    --action <action-group-id> \
    --description "CPU above 85% for 5 minutes"

The --condition field takes an aggregation (avg, min, max, total, count), the metric name, a comparison operator (>, >=, <, <=, =, !=), and a threshold.
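The interaction between window size and evaluation frequency can be simulated locally. The sketch below assumes one CPU sample per minute, a 5-minute window, and an evaluation after every new sample. The function name and the data are hypothetical, and the real alert engine has additional behavior (such as alert state and resolution) that is not modeled here.

```python
def evaluate_alert(samples, threshold=85.0, window=5):
    """For each evaluation with a full window, report whether the windowed average exceeds the threshold."""
    results = []
    for end in range(window, len(samples) + 1):
        recent = samples[end - window:end]  # the last `window` one-minute samples
        results.append(sum(recent) / window > threshold)
    return results


# Made-up per-minute CPU values: load ramps up and then stays high.
cpu_per_minute = [60, 70, 90, 95, 95, 95, 95]

print(evaluate_alert(cpu_per_minute))
```

The first evaluation does not fire even though three of its five samples are above 85, because the condition is on the window average (82), not on individual samples.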


12.2 Integrating with Grafana​

Azure Monitor has a native data source plugin for Grafana. Configure the data source pointing to your Azure subscription and create dashboards with metrics from any resource:

{
    "type": "grafana-azure-monitor-datasource",
    "name": "Azure Monitor",
    "access": "proxy",
    "jsonData": {
        "subscriptionId": "<sub-id>",
        "tenantId": "<tenant-id>"
    }
}

12.3 Autoscale based on metrics​

Azure Autoscale uses platform metrics to automatically scale resources:

# Create the autoscale setting with instance limits
az monitor autoscale create \
    --resource-group myRG \
    --resource <vmss-resource-id> \
    --resource-type Microsoft.Compute/virtualMachineScaleSets \
    --name myAutoscaleSettings \
    --min-count 2 \
    --max-count 10 \
    --count 2

# Add scale-out rule based on CPU
az monitor autoscale rule create \
    --resource-group myRG \
    --autoscale-name myAutoscaleSettings \
    --condition "Percentage CPU > 75 avg 5m" \
    --scale out 1
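The decision this rule encodes can be sketched as a small function: add one instance when average CPU exceeds 75%, and never leave the configured minimum and maximum. This is a simplified model with a hypothetical function name; it ignores cooldown periods and scale-in rules, which a real autoscale setting would also need.

```python
def next_instance_count(current, avg_cpu, minimum=2, maximum=10, threshold=75.0, step=1):
    """Return the instance count after applying a single scale-out rule, clamped to the limits."""
    if avg_cpu > threshold:
        current += step
    return max(minimum, min(maximum, current))


print(next_instance_count(current=2, avg_cpu=80.0))   # scales out by one
print(next_instance_count(current=10, avg_cpu=99.0))  # already at the maximum
print(next_instance_count(current=2, avg_cpu=50.0))   # below threshold, unchanged
```

Without a matching scale-in rule the instance count only ever grows, so in practice a scale-out rule is usually paired with one that removes instances when load drops.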

12.4 Exporting metrics via Diagnostic Settings to multiple destinations​

az monitor diagnostic-settings create \
    --name "full-diagnostics" \
    --resource <resource-id> \
    --metrics '[{"category":"AllMetrics","enabled":true}]' \
    --workspace <log-analytics-workspace-id> \
    --storage-account <storage-account-id> \
    --event-hub <event-hub-name> \
    --event-hub-rule <event-hub-auth-rule-id>

You can send to Log Analytics (KQL query), Storage Account (long-term archiving) and Event Hub (integration with Splunk, Datadog, etc.) simultaneously.


13. Final Summary​

Essential concepts:

  • Metrics are time series of numerical values automatically collected from Azure resources, stored for 93 days at granularities as fine as 1 minute.
  • Each metric can have dimensions that allow filtering (see only errors) or splitting (separate by operation type) the data.
  • The Metrics Explorer is the main interface for interactive visualization with scope, namespace, metric and aggregation as main controls.

Critical differences:

  • Platform Metrics vs Guest OS Metrics: Platform metrics are automatic. Guest metrics require Azure Monitor Agent installed on VM to collect internal OS metrics (memory, disk).
  • Average vs Maximum vs Percentile: Average for trends. Maximum for peaks. P95/P99 for real user experience latency.
  • Splitting vs Filtering: Splitting creates separate lines in the chart by dimension value. Filtering shows only data corresponding to a specific dimension value.
  • Platform Metrics vs Log Analytics: Metrics are optimized for real-time alerts and have 93-day retention. Log Analytics has configurable retention and allows correlation with logs.

What needs to be remembered:

  • Platform metrics have latency of 2 to 3 minutes after the event. Don't confuse absence of data with non-existence of activity.
  • For analysis beyond 93 days, configure export to Log Analytics or Storage Account before needing it.
  • Autoscale rules evaluate metrics, most commonly platform metrics; they cannot evaluate log queries.
  • For VMs, memory metrics don't appear by default in the platform namespace; Azure Monitor Agent is needed for guest metrics.
  • Use P95 or P99 for latency metrics in any user experience analysis. Latency averages are misleading.
  • Retention does not extend itself: platform metrics are available for 93 days, and older data exists only if it was exported beforehand.