
Theoretical Foundation: Troubleshoot load balancing


1. Initial Intuition​

You configured a Load Balancer, added three VMs to the backend pool, created the load balancing rule and the health probe. Everything looks correct in the portal. But when you try to access the Load Balancer's public IP, the connection fails or only works for some requests.

Load Balancer diagnostics is different from general network connectivity diagnostics. Here, the problem can be in any of the five LB components (frontend, probe, backend pool, rule, NSG), and each component has its own behavior and failure point.

The analogy continues with the restaurant: the host (Load Balancer) may be working, but if the cashiers (VMs) don't respond to the manager's signal (health probe), the host considers them closed and doesn't direct customers. Or the cashiers are open, but a security guard at the door (NSG) is blocking customers from arriving.

Diagnosing Load Balancing means systematically checking each component in this chain until you find where the failure is.


2. Context​

Load Balancer diagnostics integrates all concepts from previous modules. A failure can originate from multiple layers:


The main tools are Load Balancer metrics in Azure Monitor, Network Watcher (studied previously), and LB diagnostic logs. Understanding what each metric indicates is the core of this module.


3. Building Concepts​

3.1 The Fundamental Metrics of Load Balancer Standard​

Azure Load Balancer Standard exposes metrics in Azure Monitor that are the first line of diagnostics:

| Metric | What it measures | Healthy value | Problem signal |
|---|---|---|---|
| Data Path Availability | Availability of the data path from within the region to the LB frontend and on to the backend | 100% | < 100% indicates a data-path problem, or that no healthy backend VM remains |
| Health Probe Status | Percentage of health probes succeeding against backend VMs | 100% | < 100% indicates VMs failing the probe |
| Byte Count | Bytes processed by the LB (inbound + outbound) | Positive and steady | Zero = no traffic arriving |
| Packet Count | Packets processed | Proportional to traffic | Abrupt drop = connectivity problem |
| SYN Count | SYN packets received (new TCP connections) | Proportional to traffic | Zero = traffic not reaching the LB |
| SNAT Connection Count | Successful and failed SNAT connections | Failed = 0 | Failed > 0 = SNAT port exhaustion |
| Allocated SNAT Ports | SNAT ports allocated to each backend instance | As expected for the pool size | Used ports close to this value = exhaustion risk |
| Used SNAT Ports | SNAT ports in use | Less than allocated | Equal to allocated = active exhaustion |

3.2 Data Path Availability vs. Health Probe Status​

These two metrics are complementary and should be analyzed together:


The scenario in which probes pass but no traffic arrives is the most confusing: VMs are healthy according to the probe, but real traffic isn't reaching them. This usually points to a problem in the load balancing rule, the frontend IP configuration, or an NSG that blocks business traffic (but not the probe, which uses AzureLoadBalancer as source).

3.3 The Three Types of Failures and Their Patterns​

Type 1: No traffic reaches the Load Balancer

Symptom: SYN Count = 0, Byte Count = 0. The problem is upstream of the LB: incorrect DNS or the wrong public IP. NSGs are not the cause at this stage: the Standard LB frontend is a managed service and no NSG is evaluated there. NSGs apply only on the backend subnet and the VM NICs, after the LB has already counted the SYN. For an internal LB, check that the client is in the correct VNet and has a route to the LB IP.

Type 2: Traffic reaches LB, but VMs are unhealthy

Symptom: Health Probe Status < 100%; Data Path Availability also drops once no healthy VM remains. VMs don't respond to the probe. Causes: application stopped, NSG blocks the probe, wrong port in the probe, or the application returns error 500 when the probe expects 200.

Type 3: VMs healthy, but connections fail

Symptom: Health Probe Status = 100%, Data Path Availability = 100%, but client receives timeout or error. Causes: NSG blocks business traffic (but not the probe), application fails only for real traffic (probe on simplified endpoint doesn't detect the problem), misconfigured session persistence, SNAT exhaustion for outbound connections.
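The three patterns above can be turned into a small triage helper. The following is a minimal sketch, not an official tool: `classify_lb_failure` is a hypothetical function name, and it assumes you have already read SYN Count and Health Probe Status (as an integer percentage) from Azure Monitor.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map two metric readings to the failure types above.
# $1 = SYN Count over the observation window (integer)
# $2 = Health Probe Status as an integer percentage (0-100)
classify_lb_failure() {
  local syn_count="$1"
  local probe_pct="$2"

  if [ "$syn_count" -eq 0 ]; then
    echo "type1: no traffic reaches the Load Balancer"
  elif [ "$probe_pct" -lt 100 ]; then
    echo "type2: traffic arrives, but some backends fail the probe"
  else
    echo "type3: backends healthy, connections still fail"
  fi
}

classify_lb_failure 0 100
classify_lb_failure 1500 66
classify_lb_failure 1500 100
```

The order of the checks mirrors the diagnostic order: there is no point in inspecting probes if no SYN ever reaches the frontend.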


4. Structural View​

The Systematic Diagnostic Flow​


5. Practical Implementation​

Checking Health Probe Status per Individual VM​

One of the most useful features for diagnostics is checking which specific VM is failing the probe, not just the general average. In Azure Monitor, filter the Health Probe Status metric by Backend IP Address dimension:

# Check probe status per individual VM via CLI
az monitor metrics list \
--resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--metric "DipAvailability" \
--dimension "BackendIPAddress" \
--aggregation Average \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--output table

This shows, for each VM, whether the probe is passing or failing, allowing you to identify exactly which VM has a problem.
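Once the per-backend values are available, picking out the failing VMs is a simple filter. The sketch below assumes the values have been reduced to one `backend_ip average` pair per line (that intermediate format and the function name `failing_backends` are assumptions for illustration, not the CLI's native output).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "backend_ip average" pairs on stdin and
# print the IP of every backend whose probe average is below 100.
failing_backends() {
  awk '$2 < 100 { print $1 }'
}

# Example input: three backends, one of them failing the probe
printf '%s\n' \
  "10.0.1.4 100" \
  "10.0.1.5 0" \
  "10.0.1.6 100" | failing_backends
```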

Testing the Health Probe Manually​

To verify if an HTTP/HTTPS health probe would work, perform the same test that the LB does, from the VM's perspective or from a client on the same network:

# Test the probe endpoint from another VM in the same VNet
curl -v -k https://10.0.1.4:443/health

# Check if TCP port is open (for TCP probe); PowerShell, run from a Windows VM
Test-NetConnection -ComputerName 10.0.1.4 -Port 443

If the endpoint returns 200, the probe should be passing. If it returns another code, the probe fails. If the connection is refused, the application isn't listening or the OS firewall is blocking.
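The interpretation rule above can be written down as a tiny function. This is a sketch under the assumption that the status code comes from `curl -w '%{http_code}'`, which prints `000` when no HTTP response was received; `probe_verdict` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret an HTTP status code the way an HTTP/HTTPS
# health probe would, as described above (only 200 counts as healthy).
probe_verdict() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "no-response" ;;  # curl reports 000 when it got no HTTP reply
    *)   echo "unhealthy" ;;
  esac
}

probe_verdict 200
probe_verdict 500

# Usage against a live endpoint (address and path are the example values
# used in this section):
#   code=$(curl -sk -o /dev/null -m 5 -w '%{http_code}' https://10.0.1.4:443/health)
#   probe_verdict "$code"
```

Keep in mind that a test from another VM exercises the application and the OS firewall, but not the NSG rule for the real probe source.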

Checking NSG for Probes​

The Load Balancer probe originates from 168.63.129.16 with the AzureLoadBalancer service tag. To confirm if NSG is allowing:

# View effective NSG rules on VM's NIC
az network nic list-effective-nsg \
--name nic-vm-web-01 \
--resource-group rg-producao \
--output json | grep -A 5 "AzureLoadBalancer"

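The default rule AllowAzureLoadBalancerInBound (priority 65001) normally permits the probe. If the effective rules show no Allow for AzureLoadBalancer on the probe port, or a higher-priority Deny rule precedes it, the probe is being blocked.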

SNAT Exhaustion Diagnostics​

SNAT port exhaustion is silent: the health probe continues working (the probe is inbound and doesn't use SNAT), VMs appear healthy, but outbound connections to the internet fail or time out.

# Check SNAT metrics
az monitor metrics list \
--resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--metric "SnatConnectionCount" \
--filter "ConnectionState eq 'Failed'" \
--aggregation Total \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

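If ConnectionState = Failed has elevated values, there's SNAT exhaustion. Inside the VMs, the symptom is outbound TCP connections that time out or fail intermittently. A large number of sockets in TIME_WAIT or CLOSE_WAIT points to connection churn or leaked connections, both of which consume SNAT ports:

# Count connections in TIME_WAIT on VM (Linux); tail skips the header line
ss -tan state time-wait | tail -n +2 | wc -l

# Windows (PowerShell)
netstat -n | findstr TIME_WAIT | measure -line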
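
To judge how close a backend is to exhaustion, compare Used SNAT Ports against Allocated SNAT Ports. The sketch below is illustrative only: the function names are hypothetical and the 80% warning threshold is an assumption, not an Azure default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: integer percentage of allocated SNAT ports in use.
# $1 = Used SNAT Ports, $2 = Allocated SNAT Ports
snat_usage_pct() {
  local used="$1" allocated="$2"
  if [ "$allocated" -le 0 ]; then
    echo "n/a"
    return 1
  fi
  echo $(( used * 100 / allocated ))
}

# Hypothetical helper: classify the usage. The 80% threshold is an assumption.
snat_risk() {
  local pct
  pct=$(snat_usage_pct "$1" "$2") || { echo "unknown"; return 0; }
  if [ "$pct" -ge 100 ]; then
    echo "exhausted"
  elif [ "$pct" -ge 80 ]; then
    echo "at-risk"
  else
    echo "ok"
  fi
}

snat_risk 100 1024
snat_risk 900 1024
snat_risk 1024 1024
```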

6. Implementation Methods​

6.1 Azure Portal: Azure Monitor for Load Balancer​

When to use: trend visualization, initial diagnostics with charts.

Path: Load Balancer > Monitoring > Metrics

The portal offers charts of all metrics with filters by dimension (by VM, by port, by frontend). The Backend IP Address dimension is especially useful for identifying which specific VM has a problem.

Configure alerts directly in the portal for proactive notification:

Path: Load Balancer > Monitoring > Alerts > New alert rule

Recommended alerts:

  • Health Probe Status < 100% for more than 5 minutes
  • Data Path Availability < 100%
  • SNAT Connection Count with ConnectionState = Failed > 0
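
The first recommended alert can also be created from the CLI. The following is a sketch, assuming an existing action group: the function name, the alert name `lb-probe-degraded`, and the three arguments are hypothetical placeholders, while `DipAvailability` is the metric name used for Health Probe Status in the script later in this document.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: create a metric alert that fires when the average
# Health Probe Status stays below 100% over a 5-minute window.
# $1 = Load Balancer resource ID, $2 = resource group for the alert,
# $3 = action group resource ID (all supplied by the caller)
create_probe_alert() {
  local lb_id="$1" rg="$2" action_group_id="$3"
  az monitor metrics alert create \
    --name "lb-probe-degraded" \
    --resource-group "$rg" \
    --scopes "$lb_id" \
    --condition "avg DipAvailability < 100" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --action "$action_group_id" \
    --description "Health Probe Status below 100% for 5 minutes"
}

# Usage (IDs are placeholders):
#   create_probe_alert "<lb-resource-id>" rg-monitoring "<action-group-id>"
```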

6.2 Azure CLI​

Check complete Load Balancer configuration (components, rules, probes):

# Complete Load Balancer view
az network lb show \
--name lb-web-public \
--resource-group rg-networking \
--output json

# List only probes
az network lb probe list \
--lb-name lb-web-public \
--resource-group rg-networking \
--output table

# List load balancing rules
az network lb rule list \
--lb-name lb-web-public \
--resource-group rg-networking \
--output table

# Check VMs in backend pool
az network lb address-pool show \
--lb-name lb-web-public \
--name bp-vms-web \
--resource-group rg-networking \
--query "loadBalancerBackendAddresses[].{Name:name, IpConfigId:networkInterfaceIPConfiguration.id}"

Check if a VM is correctly added to the pool:

# See which backend pool a NIC belongs to
az network nic show \
--name nic-vm-web-01 \
--resource-group rg-producao \
--query "ipConfigurations[0].loadBalancerBackendAddressPools[].id"

If the result is empty, the NIC is not in the backend pool. This is one of the most common problems: the Load Balancer was created with a pool, but VMs were never added.

6.3 PowerShell​

# Get LB and inspect all components
$lb = Get-AzLoadBalancer -Name "lb-web-public" -ResourceGroupName "rg-networking"

# List probes
$lb.Probes | Select-Object Name, Protocol, Port, RequestPath, IntervalInSeconds, NumberOfProbes

# List rules
$lb.LoadBalancingRules | Select-Object Name, Protocol, FrontendPort, BackendPort, LoadDistribution

# Check VMs in backend pool
$lb.BackendAddressPools[0].LoadBalancerBackendAddresses | Select-Object Name

# Check if VM is in pool via NIC
$nic = Get-AzNetworkInterface -Name "nic-vm-web-01" -ResourceGroupName "rg-producao"
$nic.IpConfigurations[0].LoadBalancerBackendAddressPools

6.4 Network Watcher: Diagnostic Complement​

Network Watcher complements Load Balancer diagnostics:

# IP Flow Verify: check if NSG allows traffic from internet to VM
az network watcher test-ip-flow \
--direction Inbound \
--protocol TCP \
--local 10.0.1.4:443 \
--remote 203.0.113.1:54321 \
--vm /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-web-01

# Connection Troubleshoot: test connectivity through LB path
az network watcher test-connectivity \
--source-resource /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-cliente \
--dest-address 40.68.100.50 \
--dest-port 443 \
--protocol Tcp

7. Control and Security​

Diagnosing Conflicting NSG Rules​

A very common failure pattern is a mismatch between the two NSG levels (subnet and NIC): the rules allow AzureLoadBalancer (for probes) but not business traffic, or allow business traffic but not the probe. Because inbound traffic must be allowed at both levels, checking both NSGs (NIC and Subnet) is essential:

# Check subnet NSG
az network nsg rule list \
--nsg-name nsg-subnet-web \
--resource-group rg-networking \
--output table

# Check NIC NSG
az network nsg rule list \
--nsg-name nsg-nic-vm-web-01 \
--resource-group rg-networking \
--output table

The analysis should check:

  1. Is there an Allow rule for AzureLoadBalancer on the probe port?
  2. Is there an Allow rule for business traffic (port 80/443) from any origin (or from the internet via Internet tag)?
  3. Is there any high-priority Deny rule that might be overriding the Allow rules?
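
The third question comes down to how NSG rules are evaluated: the matching rule with the lowest priority number wins. The sketch below illustrates that logic only. It assumes the input has already been narrowed to rules that match the flow, one `priority access name` triple per line, and `effective_nsg_decision` and `deny-all-inbound` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the rules that match a flow (one
# "<priority> <Allow|Deny> <rule-name>" per line on stdin), print the
# decision of the rule with the lowest priority number.
effective_nsg_decision() {
  sort -n -k1,1 | head -n 1 | awk '{ print $2 " (" $3 ", priority " $1 ")" }'
}

# Example: a custom Deny at priority 200 overrides the default
# AllowAzureLoadBalancerInBound rule at priority 65001.
printf '%s\n' \
  "65001 Allow AllowAzureLoadBalancerInBound" \
  "200 Deny deny-all-inbound" \
  "65500 Deny DenyAllInBound" | effective_nsg_decision
```

In this example the probe is blocked even though the default Allow rule for AzureLoadBalancer still exists.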

Check Load Balancer Diagnostic Logs​

Load Balancer Standard can export its metrics to Log Analytics through a diagnostic setting:

# Enable the diagnostic setting (exports metrics to the workspace)
az monitor diagnostic-settings create \
--name diag-lb-web \
--resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--workspace /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-monitoring \
--metrics '[{"category": "AllMetrics", "enabled": true}]'

With logs in Log Analytics, KQL queries enable historical analysis:

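// Health probe degradation in the last 24 hours, per Load Balancer
// Note: metrics exported through diagnostic settings are flattened, so the
// per-backend dimension is not available in AzureMetrics. Use the
// BackendIPAddress dimension in Azure Monitor Metrics for a per-VM view.
AzureMetrics
| where TimeGenerated > ago(24h)
| where ResourceProvider == "MICROSOFT.NETWORK" and ResourceId contains "/LOADBALANCERS/"
| where MetricName == "DipAvailability"
| where Average < 100
| project TimeGenerated, ResourceId, Average
| sort by TimeGenerated desc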

8. Decision Making​

Which metric to check first based on symptoms?​

| Symptom | First metric | What to look for |
|---|---|---|
| No traffic works | SYN Count | Zero = traffic not reaching LB |
| Some requests work, others don't | Health Probe Status by VM | Specific VMs with failing probe |
| Slowness / timeouts | Data Path Availability + SNAT Connection Count | Gradual degradation or SNAT exhaustion |
| VM unexpectedly removed from pool | Health Probe Status by VM in period | Intermittent probe failures |
| VM outbound connections failing | SNAT Connection Count with Failed | SNAT port exhaustion |
| Uneven traffic between VMs | No direct metric | Check Session Persistence and distribution by 5-tuple hash |

Which diagnostic tool to use?​

| Situation | Tool | Reason |
|---|---|---|
| Check if NSG blocks probe | IP Flow Verify with remote IP 168.63.129.16 | Simulates probe packet |
| Check if NSG blocks business traffic | IP Flow Verify with real source IP | Simulates client packet |
| Test if LB delivers traffic to VM | Connection Troubleshoot from a client VM to the LB frontend | End-to-end test |
| See exactly which packets reach VM | Packet Capture on VM NIC | Raw traffic analysis |
| VPN Gateway diagnostics in path | VPN Troubleshoot | Specific for gateways |
| Historical probe failure | Azure Monitor Metrics history | Identify temporal pattern |

9. Best Practices​

Configure proactive alerts before having problems: creating alerts for Health Probe Status < 100% and Data Path Availability < 100% ensures immediate notification when a VM is removed from the pool, before the impact is reported by users.

Use dedicated probe endpoints with deep checks: an HTTP probe on /health that just returns 200 immediately doesn't detect real application failures. A /health endpoint that checks database connection, queues and other dependencies removes VMs from the pool when there are functional problems, not just when the server is offline.

Enable Load Balancer diagnostics for Log Analytics from creation: historical metrics allow retroactive analysis when a problem is reported with delay ("yesterday afternoon the site was slow"). Without historical logs, the diagnostic window is limited to what Azure Monitor retains by default (93 days for metrics).

Document the expected NSG design for the LB: maintaining documentation of which NSG rules are necessary for the Load Balancer to function (AzureLoadBalancer for probe + internet source for traffic) facilitates diagnosis when someone inadvertently modifies the rules.


10. Common Errors​

Probe configured on wrong port

The probe is configured on port 443, but the application only listens on port 8443 (or vice versa). The TCP probe checks 443, finds nothing listening (connection refused), marks the VM as unhealthy. The VM is working perfectly, but the LB doesn't send traffic to it. Checking the actual application port via netstat -an | grep LISTEN on the VM and correcting the probe resolves it.

VM added to backend pool but NIC disassociated

A VM was deleted and recreated, but the old NIC remains associated with the backend pool. The pool shows an entry but with no NIC associated with an active VM. The LB tries to send probes to an IP address that no longer exists. Verify that each entry in the backend pool corresponds to an existing NIC and VM.

Health probe successful, but application fails for real traffic

The probe is GET /health HTTP/1.1 and returns 200 immediately without doing anything. But the application's real traffic (GET /api/data) fails because the database connection is broken. The VM remains in the pool. The probe doesn't detect the real problem because the /health endpoint doesn't check critical subsystems.

Session Persistence configured incorrectly causing uneven load

With Client IP session persistence, corporate clients behind NAT (all with the same outbound IP) are always sent to the same VM, overloading it while others remain idle. The administrator adds VMs to the pool, but distribution doesn't improve. The solution for stateless applications is None (5-tuple hash), which distributes by source IP, source port, destination IP, destination port, and protocol, so each new connection from the shared NAT address (with a different source port) can land on a different VM.
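
The effect of the hash key can be illustrated with a toy model. This is not Azure's actual hashing algorithm: the sketch uses `cksum` as a stand-in hash, and `pick_backend`, the client address, and the pool size of 3 are assumptions chosen only to show why a key without the source port pins a NAT'd client to one backend.

```shell
#!/usr/bin/env bash
# Toy model (NOT the real Azure hash): map a flow key to a backend index
# using cksum as a stand-in hash function.
# $1 = flow key string, $2 = number of backends
pick_backend() {
  local key="$1" backends="$2" crc
  crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo $(( crc % backends ))
}

# Four connections from the same NAT address with different source ports.
for port in 50001 50002 50003 50004; do
  # Client IP persistence: the key ignores the source port, so the index
  # is identical for every connection.
  by_ip=$(pick_backend "203.0.113.10->40.68.100.50" 3)
  # 5-tuple: the source port is part of the key, so the index can vary.
  by_tuple=$(pick_backend "203.0.113.10:${port}->40.68.100.50:443/tcp" 3)
  echo "port ${port}: client-ip -> vm${by_ip}, five-tuple -> vm${by_tuple}"
done
```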

Ignoring BackendIPAddress dimension in probe metrics

The administrator sees Health Probe Status = 80% (some problem) but doesn't filter by individual VM, then spends hours trying to diagnose "the LB" without realizing that only one of 5 VMs has a failing probe. Always filter by BackendIPAddress when investigating probe failures.


11. Operation and Maintenance​

Automated Diagnostic Script for Load Balancer​

#!/bin/bash
LB_NAME="lb-web-public"
RG="rg-networking"
POOL_NAME="bp-vms-web"

echo "=== Load Balancer Configuration ==="
az network lb show --name $LB_NAME --resource-group $RG \
--query "{SKU:sku.name, FrontendIPs:frontendIPConfigurations[].name, ProbeCount:length(probes), RuleCount:length(loadBalancingRules)}" \
--output table

echo ""
echo "=== Health Probes ==="
az network lb probe list --lb-name $LB_NAME --resource-group $RG \
--query "[].{Name:name, Protocol:protocol, Port:port, Path:requestPath, Interval:intervalInSeconds, Threshold:numberOfProbes}" \
--output table

echo ""
echo "=== Load Balancing Rules ==="
az network lb rule list --lb-name $LB_NAME --resource-group $RG \
--query "[].{Name:name, Protocol:protocol, FrontendPort:frontendPort, BackendPort:backendPort, Distribution:loadDistribution}" \
--output table

echo ""
echo "=== VMs in Backend Pool ==="
az network lb address-pool show \
--lb-name $LB_NAME \
--name $POOL_NAME \
--resource-group $RG \
--query "loadBalancerBackendAddresses[].{Name:name, NIC:networkInterfaceIPConfiguration.id}" \
--output table

echo ""
echo "=== Recent Metrics (last 30 min) ==="
START=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

LB_ID=$(az network lb show --name $LB_NAME --resource-group $RG --query id -o tsv)

echo "Data Path Availability:"
az monitor metrics list --resource $LB_ID \
--metric "VipAvailability" --aggregation Average \
--start-time $START --end-time $END \
--query "value[0].timeseries[0].data[-1].average" --output tsv

echo "Health Probe Status:"
az monitor metrics list --resource $LB_ID \
--metric "DipAvailability" --aggregation Average \
--start-time $START --end-time $END \
--query "value[0].timeseries[0].data[-1].average" --output tsv

KQL Queries for Load Balancer Analysis​

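// Health probe history per Load Balancer
// (AzureMetrics does not carry metric dimensions; for a per-VM breakdown use
// the BackendIPAddress dimension in Azure Monitor Metrics)
AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK" and ResourceId contains "/LOADBALANCERS/"
| where MetricName == "DipAvailability"
| where TimeGenerated > ago(6h)
| summarize avg(Average) by bin(TimeGenerated, 5m), ResourceId
| render timechart

// SNAT connection volume over time
// (the ConnectionState dimension is not exported to AzureMetrics; check failed
// connections in Azure Monitor Metrics with the ConnectionState filter)
AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK" and ResourceId contains "/LOADBALANCERS/"
| where MetricName == "SnatConnectionCount"
| where TimeGenerated > ago(24h)
| where Total > 0
| project TimeGenerated, ResourceId, Total
| sort by TimeGenerated desc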

Relevant Limits for Diagnosis​

| Item | Limit | Diagnosis Impact |
|---|---|---|
| Azure Monitor metrics retention | 93 days | History available for retroactive analysis |
| SNAT ports per VM (without Outbound Rule) | 1,024 by default for small pools; the default shrinks as the pool grows | Can exhaust in VMs with many outbound connections |
| VMs per backend pool (Standard) | 1,000 | Rarely reached, but check in large VMSS |
| Probe interval | 5 s minimum | With a threshold of 2 consecutive failures, a VM is marked unhealthy after roughly 10 s at the minimum interval |
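
Two of these limits lend themselves to quick arithmetic. The sketch below is a rough aid, not a reference: the pool-size tiers in `default_snat_ports` reflect the default allocation table as recalled from Azure documentation and should be verified against the current docs, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: default SNAT ports per instance for a given backend
# pool size when no outbound rule sets an explicit allocation.
# Tier values are an assumption to verify against current Azure documentation.
default_snat_ports() {
  local n="$1"
  if   [ "$n" -le 50 ];  then echo 1024
  elif [ "$n" -le 100 ]; then echo 512
  elif [ "$n" -le 200 ]; then echo 256
  elif [ "$n" -le 400 ]; then echo 128
  elif [ "$n" -le 800 ]; then echo 64
  else echo 32
  fi
}

# Hypothetical helper: approximate seconds until a failing VM is marked
# unhealthy = probe interval x consecutive-failure threshold.
probe_detection_seconds() {
  echo $(( $1 * $2 ))
}

default_snat_ports 3
default_snat_ports 120
probe_detection_seconds 5 2
```

The first function makes the scaling effect visible: adding VMs to a pool without an outbound rule can reduce the ports available to each one.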

12. Integration and Automation​

Automated Alerts and Runbooks​

Configure alerts that trigger automatic diagnostic runbooks:


Load Balancer Health Dashboard​

Create a centralized Azure Monitor workbook with critical LB metrics:

# Create workbook in Azure Monitor for consolidated visualization
# (requires the application-insights CLI extension; the workbook resource name must be a GUID)
az monitor app-insights workbook create \
--resource-group rg-monitoring \
--name "$(uuidgen)" \
--display-name "Load Balancer Health Dashboard" \
--category workbook \
--kind shared \
--source-id /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
--serialized-data @lb-workbook-template.json

13. Final Summary​

Essential points:

  • The two fundamental metrics are Data Path Availability (is the data path through the LB frontend to the backend available?) and Health Probe Status (do the VMs respond to the probe?). Analyzing them together reveals the type of problem.
  • The Load Balancer probe originates from 168.63.129.16 (AzureLoadBalancer tag). NSGs that block this IP cause unhealthy VMs even with the application working.
  • SNAT Exhaustion is silent: probes pass, VMs appear healthy, but outbound connections from VMs fail. The SNAT Connection Count (Failed) metric reveals the problem.
  • The BackendIPAddress dimension in probe metrics allows identifying which specific VM has a problem, not just the general average.

Critical differences:

  • Probe failing vs. NSG blocking business traffic: the probe uses AzureLoadBalancer as source; business traffic uses the client or internet IP. An NSG can allow one and block the other, resulting in healthy VMs (probe OK) but no real traffic reaching them.
  • Health Probe Status vs. Data Path Availability: Health Probe Status measures whether backend VMs answer the probe; Data Path Availability measures whether the data path through the LB frontend is available, and it stays at 100% while at least one healthy VM remains. They can have different values.
  • SNAT Exhaustion vs. Probe failure: they are completely independent. SNAT affects outbound connections from VMs. Probe affects whether VMs stay in the pool to receive inbound traffic.

What needs to be remembered:

  • Always check metrics with BackendIPAddress dimension to identify which VM has the problem.
  • Configure proactive alerts for Health Probe Status < 100% in production environment.
  • If VMs are in the pool, probes passing, but real traffic fails: check the NSG for the business port (not just the probe port).
  • Use az network lb address-pool show to confirm that VM NICs are correctly associated with the backend pool.
  • For SNAT Exhaustion: configure Outbound Rules with explicit port allocation or migrate to NAT Gateway.
  • The fastest command for initial diagnosis is checking Health Probe Status and Data Path Availability in Azure Monitor for the affected LB.