Theoretical Foundation: Troubleshoot load balancing
1. Initial Intuition
You configured a Load Balancer, added three VMs to the backend pool, created the load balancing rule and the health probe. Everything looks correct in the portal. But when you try to access the Load Balancer's public IP, the connection fails or only works for some requests.
Load Balancer diagnostics is different from general network connectivity diagnostics. Here, the problem can be in any of the five LB components (frontend, probe, backend pool, rule, NSG), and each component has its own behavior and failure point.
The analogy continues with the restaurant: the host (Load Balancer) may be working, but if the cashiers (VMs) don't respond to the manager's signal (health probe), the host considers them closed and doesn't direct customers. Or the cashiers are open, but a security guard at the door (NSG) is blocking customers from arriving.
Diagnosing Load Balancing means systematically checking each component in this chain until you find where the failure is.
2. Context
Load Balancer diagnostics integrates all concepts from previous modules. A failure can originate in any of several layers.
The main tools are Load Balancer metrics in Azure Monitor, Network Watcher (studied previously), and LB diagnostic logs. Understanding what each metric indicates is the core of this module.
3. Building Concepts
3.1 The Fundamental Metrics of Load Balancer Standard
Azure Load Balancer Standard exposes metrics in Azure Monitor that are the first line of diagnostics:
| Metric | What it measures | Healthy value | Problem signal |
|---|---|---|---|
| Data Path Availability | Availability of the data path from the frontend through to the backend | 100% | < 100% indicates a degraded data path (platform issue or no healthy backend instance) |
| Health Probe Status | Percentage of VMs passing health probe | 100% | < 100% indicates VMs failing probe |
| Byte Count | Bytes processed by LB (inbound + outbound) | Positive and constant | Zero = no traffic arriving |
| Packet Count | Packets processed | Proportional to traffic | Abrupt drop = connectivity problem |
| SYN Count | SYN packets received (new TCP connections) | Proportional to traffic | Zero = traffic not reaching LB |
| SNAT Connection Count | Active and failed SNAT connections | Failures = 0 | Failures > 0 = SNAT port exhaustion |
| Allocated SNAT Ports | Allocated SNAT ports | Proportional to VMs | Close to maximum = exhaustion risk |
| Used SNAT Ports | SNAT ports in use | Less than allocated | Equal to maximum = active exhaustion |
3.2 Data Path Availability vs. Health Probe Status
These two metrics are complementary and should be analyzed together.
The most confusing combination is probes passing but no traffic flowing: it indicates that VMs are healthy according to the probe, but real traffic isn't arriving. This usually points to a problem in the load balancing rule, the frontend IP configuration, or an NSG that blocks business traffic (but not the probe, which uses AzureLoadBalancer as source).
3.3 The Three Types of Failures and Their Patterns
Type 1: No traffic reaches the Load Balancer
Symptom: SYN Count = 0, Byte Count = 0. The problem lies before the LB: incorrect DNS, the wrong public IP, or, for an internal LB, a client that is not in the correct VNet or has no route to the LB IP. An NSG does not sit in front of a Standard Load Balancer, because the LB is a managed service; NSGs act on the backend subnet and on the VM NICs, where they can still block the traffic.
Type 2: Traffic reaches LB, but VMs are unhealthy
Symptom: Health Probe Status < 100%; Data Path Availability can also drop once no healthy instance remains. VMs don't respond to the probe. Causes: application stopped, NSG blocking the probe, wrong port in the probe, or the application returning error 500 when the probe expects 200.
Type 3: VMs healthy, but connections fail
Symptom: Health Probe Status = 100%, Data Path Availability = 100%, but client receives timeout or error. Causes: NSG blocks business traffic (but not the probe), application fails only for real traffic (probe on simplified endpoint doesn't detect the problem), misconfigured session persistence, SNAT exhaustion for outbound connections.
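The three patterns above can be read as a small decision procedure. The sketch below is only an illustration of that reasoning: `classify_lb_failure` is a hypothetical helper, not an Azure tool, and it assumes you have already read the three metric values from Azure Monitor.

```shell
#!/bin/sh
# Hypothetical helper (not part of Azure CLI): map three metric readings
# to the failure types described above.
# $1 = SYN Count, $2 = Health Probe Status (%), $3 = Data Path Availability (%)
classify_lb_failure() (
  awk -v syn="$1" -v probe="$2" -v dpa="$3" 'BEGIN {
    if (syn == 0)         print "type1"      # nothing reaches the LB
    else if (probe < 100) print "type2"      # backends failing the probe
    else if (dpa < 100)   print "data-path"  # probes pass, data path degraded
    else                  print "type3"      # metrics healthy: check NSG, app, SNAT
  }'
)

classify_lb_failure 0 100 100     # prints: type1
classify_lb_failure 350 66.7 100  # prints: type2
```

A result of `type3` only means the metrics look healthy; it is a cue to move on to NSG rules for business traffic, application behavior, and SNAT metrics.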
4. Structural View
The Systematic Diagnostic Flow
5. Practical Implementation
Checking Health Probe Status per Individual VM
One of the most useful features for diagnostics is checking which specific VM is failing the probe, not just the general average. In Azure Monitor, filter the Health Probe Status metric by Backend IP Address dimension:
# Check probe status per individual VM via CLI
# (DipAvailability is the API name of the Health Probe Status metric)
az monitor metrics list \
--resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--metric "DipAvailability" \
--dimension "BackendIPAddress" \
--aggregation Average \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--output table
This shows, for each VM, whether the probe is passing or failing, allowing you to identify exactly which VM has a problem.
Testing the Health Probe Manually
To verify if an HTTP/HTTPS health probe would work, perform the same test that the LB does, from the VM's perspective or from a client on the same network:
# Test the probe endpoint from another VM in the same VNet (-k skips certificate validation)
curl -v -k https://10.0.1.4:443/health
# Check if the TCP port is open, for a TCP probe (PowerShell)
Test-NetConnection -ComputerName 10.0.1.4 -Port 443
If the endpoint returns 200, the probe should be passing. If it returns another code, the probe fails. If the connection is refused, the application isn't listening or the OS firewall is blocking.
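The pass/fail rule just described can be written down explicitly. This is a minimal sketch: `probe_verdict` is a hypothetical helper, and the commented-out wiring assumes the same backend address and `/health` path used in the example above.

```shell
#!/bin/sh
# Hypothetical helper: interpret the outcome of a manual HTTP probe test.
# $1 = curl exit status, $2 = HTTP status code reported by curl
probe_verdict() (
  curl_exit=$1; http_code=$2
  if [ "$curl_exit" -ne 0 ]; then
    # No HTTP response at all: refused, timed out, or TLS failure
    echo "fail: no HTTP response"
  elif [ "$http_code" -eq 200 ]; then
    echo "pass"
  else
    echo "fail: HTTP $http_code"
  fi
)

# Example wiring (commented out because it needs network access to the backend):
# code=$(curl -sk -o /dev/null -m 5 -w '%{http_code}' https://10.0.1.4:443/health); rc=$?
# probe_verdict "$rc" "$code"
```

Separating the curl exit status from the HTTP code keeps the two failure modes apart: a non-zero exit points at the listener or the OS firewall, while a non-200 code points at the application.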
Checking NSG for Probes
The Load Balancer probe originates from 168.63.129.16, which is covered by the AzureLoadBalancer service tag. To confirm that the NSG allows it:
# View effective NSG rules on VM's NIC
az network nic list-effective-nsg \
--name nic-vm-web-01 \
--resource-group rg-producao \
--output json | grep -A 5 "AzureLoadBalancer"
If AzureLoadBalancer doesn't appear in inbound rules as Allow, the probe is being blocked.
SNAT Exhaustion Diagnostics
SNAT port exhaustion is silent: the health probe keeps working (the probe is inbound and doesn't use SNAT), VMs appear healthy, but outbound connections to the internet fail or time out.
# Check SNAT metrics (SnatConnectionCount is the API name of SNAT Connection Count)
az monitor metrics list \
--resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--metric "SnatConnectionCount" \
--filter "ConnectionState eq 'Failed'" \
--aggregation Total \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
If ConnectionState = Failed shows elevated values, SNAT ports are being exhausted. Inside the VMs, a large number of outbound TCP connections lingering in TIME_WAIT or CLOSE_WAIT is a sign of the connection churn that leads to exhaustion:
# Count connections in TIME_WAIT on the VM (Linux; the count includes one header line)
ss -tan state time-wait | wc -l
# Windows (PowerShell)
netstat -n | findstr TIME_WAIT | Measure-Object -Line
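To turn the Used SNAT Ports and Allocated SNAT Ports readings into a quick verdict, a sketch along these lines may help. `snat_pressure` is a hypothetical helper, and the 80% warning threshold is an arbitrary assumption, not an Azure recommendation.

```shell
#!/bin/sh
# Hypothetical helper: classify SNAT pressure for one backend instance.
# $1 = used SNAT ports, $2 = allocated SNAT ports (integers)
snat_pressure() (
  used=$1; allocated=$2
  if [ "$allocated" -le 0 ]; then echo "unknown"; exit 0; fi
  pct=$(( used * 100 / allocated ))   # integer percentage, rounded down
  if [ "$pct" -ge 100 ]; then
    echo "exhausted ($pct%)"
  elif [ "$pct" -ge 80 ]; then        # assumed warning threshold
    echo "at-risk ($pct%)"
  else
    echo "ok ($pct%)"
  fi
)

snat_pressure 900 1024   # prints: at-risk (87%)
```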
6. Implementation Methods
6.1 Azure Portal: Azure Monitor for Load Balancer
When to use: trend visualization, initial diagnostics with charts.
Path: Load Balancer > Monitoring > Metrics
The portal offers charts of all metrics with filters by dimension (by VM, by port, by frontend). The Backend IP Address dimension is especially useful for identifying which specific VM has a problem.
Configure alerts directly in the portal for proactive notification:
Path: Load Balancer > Monitoring > Alerts > New alert rule
Recommended alerts:
- Health Probe Status < 100% for more than 5 minutes
- Data Path Availability < 100%
- SNAT Connection Count with ConnectionState = Failed > 0
6.2 Azure CLI
Check complete Load Balancer configuration (components, rules, probes):
# Complete Load Balancer view
az network lb show \
--name lb-web-public \
--resource-group rg-networking \
--output json
# List only probes
az network lb probe list \
--lb-name lb-web-public \
--resource-group rg-networking \
--output table
# List load balancing rules
az network lb rule list \
--lb-name lb-web-public \
--resource-group rg-networking \
--output table
# Check VMs in backend pool
az network lb address-pool show \
--lb-name lb-web-public \
--name bp-vms-web \
--resource-group rg-networking \
--query "loadBalancerBackendAddresses[].{Name:name, IP:networkInterfaceIPConfiguration.id}"
Check if a VM is correctly added to the pool:
# See which backend pool a NIC belongs to
az network nic show \
--name nic-vm-web-01 \
--resource-group rg-producao \
--query "ipConfigurations[0].loadBalancerBackendAddressPools[].id"
If the result is empty, the NIC is not in the backend pool. This is one of the most common problems: the Load Balancer was created with a pool, but VMs were never added.
6.3 PowerShell
# Get LB and inspect all components
$lb = Get-AzLoadBalancer -Name "lb-web-public" -ResourceGroupName "rg-networking"
# List probes
$lb.Probes | Select-Object Name, Protocol, Port, RequestPath, IntervalInSeconds, NumberOfProbes
# List rules
$lb.LoadBalancingRules | Select-Object Name, Protocol, FrontendPort, BackendPort, LoadDistribution
# Check VMs in backend pool
$lb.BackendAddressPools[0].LoadBalancerBackendAddresses | Select-Object Name
# Check if VM is in pool via NIC
$nic = Get-AzNetworkInterface -Name "nic-vm-web-01" -ResourceGroupName "rg-producao"
$nic.IpConfigurations[0].LoadBalancerBackendAddressPools
6.4 Network Watcher: Diagnostic Complement
Network Watcher complements Load Balancer diagnostics:
# IP Flow Verify: check if NSG allows traffic from internet to VM
az network watcher test-ip-flow \
--direction Inbound \
--protocol TCP \
--local 10.0.1.4:443 \
--remote 203.0.113.1:54321 \
--vm /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-web-01
# Connection Troubleshoot: test connectivity through LB path
az network watcher test-connectivity \
--source-resource /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-cliente \
--dest-address 40.68.100.50 \
--dest-port 443 \
--protocol Tcp
7. Control and Security
Diagnosing Conflicting NSG Rules
A very common failure pattern involves the two NSG layers disagreeing: the rules allow AzureLoadBalancer (for probes) but not business traffic, or allow business traffic but not the probe. Because inbound traffic must pass both the subnet NSG and the NIC NSG, checking both is essential:
# Check subnet NSG
az network nsg rule list \
--nsg-name nsg-subnet-web \
--resource-group rg-networking \
--output table
# Check NIC NSG
az network nsg rule list \
--nsg-name nsg-nic-vm-web-01 \
--resource-group rg-networking \
--output table
The analysis should check:
- Is there an Allow rule for AzureLoadBalancer on the probe port?
- Is there an Allow rule for business traffic (port 80/443) from any origin (or from the internet via the Internet tag)?
- Is there any high-priority Deny rule that might be overriding the Allow rules?
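The third check depends on how NSG rules are evaluated: in ascending priority order, with the first match deciding. The sketch below is a deliberately simplified model of that behavior (hypothetical `nsg_verdict` helper, rules reduced to priority, access, source, and port; real NSGs also match protocol, destination, and ranges, and include default rules).

```shell
#!/bin/sh
# Simplified first-match model of NSG evaluation (illustrative only).
# Rules arrive on stdin, one per line: "priority access source port".
# "*" matches anything. $1 = source to test, $2 = destination port to test.
nsg_verdict() (
  sort -n | awk -v s="$1" -v p="$2" '
    ($3 == s || $3 == "*") && ($4 == p || $4 == "*") {
      print $2 " (priority " $1 ")"; found = 1; exit
    }
    END { if (!found) print "Deny (no matching rule)" }
  '
)

# A broad Deny at priority 100 shadows the probe Allow at priority 200.
rules='200 Allow AzureLoadBalancer 443
100 Deny * 443
300 Allow Internet 443'
printf '%s\n' "$rules" | nsg_verdict AzureLoadBalancer 443   # prints: Deny (priority 100)
```

Running the same rule set with `Internet` as the source returns the same Deny, which is the "probe and business traffic both blocked by one high-priority rule" case from the checklist.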
Check Load Balancer Diagnostic Logs
Load Balancer Standard can send its metrics to Log Analytics through a diagnostic setting:
# Enable the diagnostic setting (routes all metrics to the workspace)
az monitor diagnostic-settings create \
--name diag-lb-web \
--resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--workspace /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-monitoring \
--metrics '[{"category": "AllMetrics", "enabled": true}]'
With the data in Log Analytics, KQL queries enable historical analysis:
// Health probe degradation in the last 24 hours
// (exported metrics are aggregated across dimensions, so there is no per-VM column here)
AzureMetrics
| where TimeGenerated > ago(24h)
| where ResourceId contains "/LOADBALANCERS/"
| where MetricName == "DipAvailability"
| where Average < 100
| project TimeGenerated, ResourceId, Average
| sort by TimeGenerated desc
8. Decision Making
Which metric to check first based on symptoms?
| Symptom | First metric | What to look for |
|---|---|---|
| No traffic works | SYN Count | Zero = traffic not reaching LB |
| Some requests work, others don't | Health Probe Status by VM | Specific VMs with failing probe |
| Slowness / timeouts | Data Path Availability + SNAT Connection Count | Gradual degradation or SNAT exhaustion |
| VM unexpectedly removed from pool | Health Probe Status by VM in period | Intermittent probe failures |
| VM outbound connections failing | SNAT Connection Count with Failed | SNAT port exhaustion |
| Uneven traffic between VMs | No direct metric | Check Session Persistence and distribution by 5-tuple hash |
Which diagnostic tool to use?
| Situation | Tool | Reason |
|---|---|---|
| Check if NSG blocks probe | IP Flow Verify with remote IP 168.63.129.16 (the AzureLoadBalancer probe source) | Simulates probe packet |
| Check if NSG blocks business traffic | IP Flow Verify with real source IP | Simulates client packet |
| Test if LB delivers traffic to VM | Connection Troubleshoot from a client VM to the LB frontend IP | End-to-end test |
| See exactly which packets reach VM | Packet Capture on VM NIC | Raw traffic analysis |
| VPN Gateway diagnostics in path | VPN Troubleshoot | Specific for gateways |
| Historical probe failure | Azure Monitor Metrics history | Identify temporal pattern |
9. Best Practices
Configure proactive alerts before having problems: creating alerts for Health Probe Status < 100% and Data Path Availability < 100% ensures immediate notification when a VM is removed from the pool, before the impact is reported by users.
Use dedicated probe endpoints with deep checks: an HTTP probe on /health that just returns 200 immediately doesn't detect real application failures. A /health endpoint that checks database connection, queues and other dependencies removes VMs from the pool when there are functional problems, not just when the server is offline.
Enable Load Balancer diagnostics for Log Analytics from creation: historical metrics allow retroactive analysis when a problem is reported with delay ("yesterday afternoon the site was slow"). Without historical logs, the diagnostic window is limited to what Azure Monitor retains by default (93 days for metrics).
Document the expected NSG design for the LB: maintaining documentation of which NSG rules are necessary for the Load Balancer to function (AzureLoadBalancer for probe + internet source for traffic) facilitates diagnosis when someone inadvertently modifies the rules.
10. Common Errors
Probe configured on wrong port
The probe is configured on port 443, but the application only listens on port 8443 (or vice versa). The TCP probe checks 443, finds nothing listening (connection refused), marks the VM as unhealthy. The VM is working perfectly, but the LB doesn't send traffic to it. Checking the actual application port via netstat -an | grep LISTEN on the VM and correcting the probe resolves it.
VM added to backend pool but NIC disassociated
A VM was deleted and recreated, but the old NIC remains associated with the backend pool. The pool shows an entry but with no NIC associated with an active VM. The LB tries to send probes to an IP address that no longer exists. Verify that each entry in the backend pool corresponds to an existing NIC and VM.
Health probe successful, but application fails for real traffic
The probe is GET /health HTTP/1.1 and returns 200 immediately without doing anything. But the application's real traffic (GET /api/data) fails because the database connection is broken. The VM remains in the pool. The probe doesn't detect the real problem because the /health endpoint doesn't check critical subsystems.
Session Persistence configured incorrectly causing uneven load
With Client IP session persistence, corporate clients behind NAT (all sharing the same outbound IP) are always sent to the same VM, overloading it while the others remain idle. Adding VMs to the pool doesn't improve the distribution. For stateless applications the solution is None (5-tuple hash), which also hashes the source port, so connections from the same client IP spread across the pool.
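A toy model makes the difference between the two modes visible. In this sketch `cksum` stands in for the load balancer's real hash, which Azure does not expose, and the frontend address 10.0.0.100:443 is a made-up value; only the choice of hash inputs matters here.

```shell
#!/bin/sh
# Toy model of hash-based distribution (cksum is a stand-in, not Azure's hash).
# $1 = mode (clientip | fivetuple), $2 = source IP, $3 = source port,
# $4 = number of backends. Prints the index of the chosen backend.
pick_backend() (
  mode=$1; src_ip=$2; src_port=$3; backends=$4
  case $mode in
    clientip)  key="$src_ip|10.0.0.100" ;;                    # source IP + destination IP
    fivetuple) key="$src_ip|$src_port|10.0.0.100|443|tcp" ;;  # adds ports and protocol
    *) echo "unknown mode" >&2; exit 1 ;;
  esac
  crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo $(( crc % backends ))
)

# Same NATed client IP, four connections with different source ports.
for port in 50001 50002 50003 50004; do
  echo "port $port: clientip -> $(pick_backend clientip 203.0.113.10 "$port" 3)," \
       "fivetuple -> $(pick_backend fivetuple 203.0.113.10 "$port" 3)"
done
```

In `clientip` mode the source port is not part of the key, so every connection from the shared IP lands on the same backend index; in `fivetuple` mode the index can vary from connection to connection.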
Ignoring BackendIPAddress dimension in probe metrics
The administrator sees Health Probe Status = 80% (some problem), but doesn't filter by individual VM. Spends hours trying to diagnose "the LB" without identifying that only one of 5 VMs has a failing probe. Always filter by BackendIPAddress when investigating probe failures.
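The arithmetic behind this error is easy to reproduce. The sketch below uses made-up per-VM readings and a hypothetical `probe_summary` helper to show how one fully failing VM out of five produces the 80% aggregate.

```shell
#!/bin/sh
# Hypothetical helper: summarize per-VM probe status.
# Input on stdin, one line per VM: "backend-ip probe-status-percent".
probe_summary() (
  awk '
    { sum += $2; n++; if ($2 < 100) failing = failing " " $1 }
    END {
      if (n == 0) { print "no data"; exit }
      printf "average=%d%% failing=%s\n", sum / n,
             (failing == "" ? "none" : substr(failing, 2))
    }
  '
)

# Illustrative readings: four healthy VMs and one failing every probe.
printf '%s\n' \
  '10.0.1.4 100' '10.0.1.5 100' '10.0.1.6 0' '10.0.1.7 100' '10.0.1.8 100' |
  probe_summary   # prints: average=80% failing=10.0.1.6
```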
11. Operation and Maintenance
Automated Diagnostic Script for Load Balancer
#!/bin/bash
# Snapshot of a Load Balancer's configuration and recent health metrics.
LB_NAME="lb-web-public"
RG="rg-networking"
POOL_NAME="bp-vms-web"
echo "=== Load Balancer Configuration ==="
az network lb show --name "$LB_NAME" --resource-group "$RG" \
--query "{SKU:sku.name, FrontendIPs:frontendIPConfigurations[].name, ProbeCount:length(probes), RuleCount:length(loadBalancingRules)}" \
--output table
echo ""
echo "=== Health Probes ==="
az network lb probe list --lb-name "$LB_NAME" --resource-group "$RG" \
--query "[].{Name:name, Protocol:protocol, Port:port, Path:requestPath, Interval:intervalInSeconds, Threshold:numberOfProbes}" \
--output table
echo ""
echo "=== Load Balancing Rules ==="
az network lb rule list --lb-name "$LB_NAME" --resource-group "$RG" \
--query "[].{Name:name, Protocol:protocol, FrontendPort:frontendPort, BackendPort:backendPort, Distribution:loadDistribution}" \
--output table
echo ""
echo "=== VMs in Backend Pool ==="
az network lb address-pool show \
--lb-name "$LB_NAME" \
--name "$POOL_NAME" \
--resource-group "$RG" \
--query "loadBalancerBackendAddresses[].{Name:name, NIC:networkInterfaceIPConfiguration.id}" \
--output table
echo ""
echo "=== Recent Metrics (last 30 min) ==="
# GNU date syntax; adjust on systems without "date -d"
START=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
LB_ID=$(az network lb show --name "$LB_NAME" --resource-group "$RG" --query id -o tsv)
# VipAvailability is the API name of Data Path Availability
echo "Data Path Availability:"
az monitor metrics list --resource "$LB_ID" \
--metric "VipAvailability" --aggregation Average \
--start-time "$START" --end-time "$END" \
--query "value[0].timeseries[0].data[-1].average" --output tsv
# DipAvailability is the API name of Health Probe Status
echo "Health Probe Status:"
az monitor metrics list --resource "$LB_ID" \
--metric "DipAvailability" --aggregation Average \
--start-time "$START" --end-time "$END" \
--query "value[0].timeseries[0].data[-1].average" --output tsv
KQL Queries for Load Balancer Analysis
// Health probe history per Load Balancer
// (metrics exported through diagnostic settings are aggregated across
// dimensions, so AzureMetrics has no per-VM breakdown)
AzureMetrics
| where ResourceId contains "/LOADBALANCERS/"
| where MetricName == "DipAvailability"
| where TimeGenerated > ago(6h)
| summarize avg(Average) by bin(TimeGenerated, 5m), ResourceId
| render timechart
// SNAT connection volume over time. The ConnectionState split (Failed vs.
// successful) is not exported; use the Azure Monitor metric with the
// ConnectionState filter to isolate failures.
AzureMetrics
| where ResourceId contains "/LOADBALANCERS/"
| where MetricName == "SnatConnectionCount"
| where TimeGenerated > ago(24h)
| where Total > 0
| project TimeGenerated, ResourceId, Total
| sort by TimeGenerated desc
Relevant Limits for Diagnosis
| Item | Limit | Diagnosis Impact |
|---|---|---|
| Azure Monitor metrics retention | 93 days | History available for retroactive analysis |
| SNAT ports per VM (without Outbound Rule) | 1,024 by default for pools of up to 50 VMs; fewer in larger pools | Can be exhausted on VMs with many outbound connections |
| VMs per backend pool (Standard) | 1,000 | Rarely reached, but check in large VMSS |
| Minimum probe interval | 5 seconds | With an unhealthy threshold of 2 consecutive failures, a failing VM takes about 10 seconds to be marked unhealthy |
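Two of these limits reduce to simple arithmetic. The sketch below uses hypothetical helper names; the SNAT tiers are the default per-instance allocations by pool size as published in Azure documentation at the time of writing, so treat them as an assumption to verify against the current limits page.

```shell
#!/bin/sh
# Default SNAT ports per instance for a given backend pool size
# (assumed tier table; verify against current Azure documentation).
default_snat_ports() (
  n=$1
  if   [ "$n" -le 50 ];  then echo 1024
  elif [ "$n" -le 100 ]; then echo 512
  elif [ "$n" -le 200 ]; then echo 256
  elif [ "$n" -le 400 ]; then echo 128
  elif [ "$n" -le 800 ]; then echo 64
  else echo 32
  fi
)

# Approximate time for a failing instance to be marked unhealthy:
# probe interval (seconds) x consecutive failures required.
probe_detection_seconds() ( echo $(( $1 * $2 )) )

default_snat_ports 3           # prints: 1024
default_snat_ports 120         # prints: 256
probe_detection_seconds 5 2    # prints: 10
```

The second call shows why SNAT exhaustion often appears after a scale-out: growing the pool past a tier boundary reduces the ports available to every instance.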
12. Integration and Automation
Automated Alerts and Runbooks
Configure alerts that trigger automatic diagnostic runbooks.
Load Balancer Health Dashboard
Create a centralized Azure Monitor workbook with the critical LB metrics:
# Create a shared workbook for consolidated visualization
# (the workbook command group is provided by the application-insights CLI extension;
# lb-workbook-template.json must already exist locally)
az monitor app-insights workbook create \
--resource-group rg-monitoring \
--name "lb-health-dashboard" \
--display-name "Load Balancer Health Dashboard" \
--kind shared \
--source-id /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--serialized-data @lb-workbook-template.json
13. Final Summary
Essential points:
- The two fundamental metrics are Data Path Availability (can the LB reach the VMs?) and Health Probe Status (do the VMs respond to the probe?). Analyzing them together reveals the type of problem.
- The Load Balancer probe originates from 168.63.129.16 (AzureLoadBalancer tag). NSGs that block this IP cause unhealthy VMs even with the application working.
- SNAT Exhaustion is silent: probes pass, VMs appear healthy, but outbound connections from VMs fail. The SNAT Connection Count (Failed) metric reveals the problem.
- The BackendIPAddress dimension in probe metrics allows identifying which specific VM has a problem, not just the general average.
Critical differences:
- Probe failing vs. NSG blocking business traffic: the probe uses AzureLoadBalancer as source; business traffic uses the client or internet IP. An NSG can allow one and block the other, resulting in healthy VMs (probe OK) but no real traffic reaching them.
- Health Probe Status vs. Data Path Availability: Health Probe Status measures if the probe reaches the VM; Data Path Availability measures if real traffic can be delivered. They can have different values.
- SNAT Exhaustion vs. Probe failure: they are completely independent. SNAT affects outbound connections from VMs. Probe affects whether VMs stay in the pool to receive inbound traffic.
What needs to be remembered:
- Always check metrics with the BackendIPAddress dimension to identify which VM has the problem.
- Configure proactive alerts for Health Probe Status < 100% in production environments.
- If VMs are in the pool and probes are passing, but real traffic fails: check the NSG for the business port (not just the probe port).
- Use az network lb address-pool show to confirm that VM NICs are correctly associated with the backend pool.
- For SNAT Exhaustion: configure Outbound Rules with explicit port allocation or migrate to NAT Gateway.
- The fastest initial diagnosis is checking Health Probe Status and Data Path Availability in Azure Monitor for the affected LB.