Theoretical Foundation: Troubleshoot load balancing

1. Initial Intuition

You configured a Load Balancer, added three VMs to the backend pool, created the load balancing rule and the health probe. Everything looks correct in the portal. But when you try to access the Load Balancer's public IP, the connection fails or only works for some requests.

Load Balancer diagnostics differs from general network connectivity diagnostics. Here, the problem can sit in any of five components in the traffic path (frontend, probe, backend pool, rule, NSG), and each has its own behavior and failure point.

The analogy continues with the restaurant: the host (Load Balancer) may be working, but if the cashiers (VMs) don't respond to the manager's signal (health probe), the host considers them closed and doesn't direct customers. Or the cashiers are open, but a security guard at the door (NSG) is blocking customers from arriving.

Diagnosing Load Balancing means systematically checking each component in this chain until you find where the failure is.

2. Context

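Load Balancer diagnostics integrates all concepts from previous modules. A failure can originate in any of several layers: the client and DNS, the frontend and rule, the health probe, the NSGs, or the application on the VM.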

The main tools are Load Balancer metrics in Azure Monitor, Network Watcher (studied previously), and LB diagnostic logs. Understanding what each metric indicates is the core of this module.

3. Building Concepts

3.1 The Fundamental Metrics of Load Balancer Standard

Azure Load Balancer Standard exposes metrics in Azure Monitor that are the first line of diagnostics:

Metric	What it measures	Healthy value	Problem signal
Data Path Availability	Availability of the data path to the frontend IP	100%	< 100% indicates a degraded data path; 0% when no backend is healthy
Health Probe Status	Percentage of VMs passing health probe	100%	< 100% indicates VMs failing probe
Byte Count	Bytes processed by LB (inbound + outbound)	Positive and steady	Zero = no traffic arriving
Packet Count	Packets processed	Proportional to traffic	Abrupt drop = connectivity problem
SYN Count	SYN packets received (new TCP connections)	Proportional to traffic	Zero = traffic not reaching LB
SNAT Connection Count	Active and failed SNAT connections	Failures = 0	Failures > 0 = SNAT port exhaustion
Allocated SNAT Ports	SNAT ports allocated per backend instance	Proportional to VMs	Used ports close to this value = exhaustion risk
Used SNAT Ports	SNAT ports in use	Less than allocated	Equal to allocated = active exhaustion

3.2 Data Path Availability vs. Health Probe Status

These two metrics are complementary and should be analyzed together. Health Probe Status reports whether the backend instances answer the probe; Data Path Availability reports whether the path to the frontend IP is working.

The combination in which probes pass but no traffic flows is the most confusing: the VMs are healthy according to the probe, yet real traffic isn't arriving. This usually points to a problem in the load balancing rule, the frontend IP configuration, or an NSG that blocks business traffic but not the probe, which uses AzureLoadBalancer as its source.

3.3 The Three Types of Failures and Their Patterns

Type 1: No traffic reaches the Load Balancer

Symptom: SYN Count = 0, Byte Count = 0. The problem is upstream of the LB: incorrect DNS, a wrong public IP, or, for an internal LB, a client that is not in the correct VNet or has no route to the LB IP. NSGs are not evaluated at the Load Balancer frontend, since the LB is a managed service; they apply on the backend subnet and NIC, so an NSG block normally shows up as Type 2 or Type 3 rather than here.

Type 2: Traffic reaches LB, but VMs are unhealthy

Symptom: Health Probe Status < 100%; Data Path Availability also drops below 100% once no healthy VM remains. VMs don't respond to the probe. Causes: application stopped, NSG blocks the probe, wrong port in probe, error 500 returned by application when probe expects 200.

Type 3: VMs healthy, but connections fail

Symptom: Health Probe Status = 100%, Data Path Availability = 100%, but client receives timeout or error. Causes: NSG blocks business traffic (but not the probe), application fails only for real traffic (probe on simplified endpoint doesn't detect the problem), misconfigured session persistence, SNAT exhaustion for outbound connections.
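The three patterns can be turned into a first-pass triage. The sketch below is only an illustration, assuming you have already read SYN Count, Health Probe Status and Data Path Availability from Azure Monitor for the same time window; the function name classify_lb_failure and its type1/type2/type3 labels are hypothetical and not part of any Azure tool.

```shell
#!/usr/bin/env bash
# classify_lb_failure SYN_COUNT PROBE_STATUS_PCT DATA_PATH_PCT
# Hypothetical helper: maps three metric readings to the failure types above.
classify_lb_failure() {
  awk -v syn="$1" -v probe="$2" -v dpath="$3" 'BEGIN {
    if (syn == 0)
      print "type1"   # nothing reaches the LB: DNS, public IP, routing
    else if (probe < 100 || dpath < 100)
      print "type2"   # backends are failing the probe
    else
      print "type3"   # metrics healthy: NSG on business port, app, SNAT
  }'
}

classify_lb_failure 0 100 100       # type1
classify_lb_failure 4200 66.7 100   # type2
classify_lb_failure 4200 100 100    # type3
```

The order of the checks matters: a zero SYN Count makes the other two readings irrelevant, so it is tested first.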

4. Structural View

The Systematic Diagnostic Flow

[Diagram: systematic diagnostic flow]

5. Practical Implementation

Checking Health Probe Status per Individual VM

One of the most useful features for diagnostics is checking which specific VM is failing the probe, not just the general average. In Azure Monitor, filter the Health Probe Status metric by Backend IP Address dimension:

# Check probe status per individual VM via CLI
# (Health Probe Status is the display name; the metric ID is DipAvailability)
az monitor metrics list \
  --resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
  --metric "DipAvailability" \
  --dimension "BackendIPAddress" \
  --aggregation Average \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --output table

This shows, for each VM, whether the probe is passing or failing, allowing you to identify exactly which VM has a problem.
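To turn that per-VM view into a short list of suspects, a small filter is enough. The sketch below assumes the metric output has already been reduced to one line per backend in the form "BACKEND_IP AVERAGE_PCT"; that line format and the function name failing_backends are assumptions made for illustration, not something the CLI emits directly.

```shell
#!/usr/bin/env bash
# failing_backends: reads "BACKEND_IP AVERAGE_PCT" lines on stdin and prints
# the IPs whose average Health Probe Status is below 100.
failing_backends() {
  awk 'NF >= 2 && $2 < 100 { print $1 }'
}

printf '%s\n' "10.0.1.4 100" "10.0.1.5 0" "10.0.1.6 100" | failing_backends
# prints 10.0.1.5
```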

Testing the Health Probe Manually

To verify if an HTTP/HTTPS health probe would work, perform the same test that the LB does, from the VM's perspective or from a client on the same network:

# Test the probe endpoint from another VM in the same VNet
curl -v -k https://10.0.1.4:443/health

# Check if the TCP port is open, for a TCP probe (Windows PowerShell)
Test-NetConnection -ComputerName 10.0.1.4 -Port 443

If the endpoint returns 200, the probe should be passing. If it returns another code, the probe fails. If the connection is refused, the application isn't listening or the OS firewall is blocking.
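The same interpretation can be scripted. The sketch below is a minimal example, assuming the status code is captured with curl's `-w '%{http_code}'` option, which prints 000 when no HTTP response was received; the function name probe_verdict and its messages are hypothetical.

```shell
#!/usr/bin/env bash
# probe_verdict HTTP_CODE: interprets a status code the way an HTTP probe would.
probe_verdict() {
  case "$1" in
    200) echo "pass" ;;
    000) echo "fail: no HTTP response" ;;   # refused, timed out, or TLS error
    *)   echo "fail: HTTP $1" ;;            # any non-200 answer marks the VM unhealthy
  esac
}

probe_verdict 200   # pass
probe_verdict 500   # fail: HTTP 500

# Usage against a real endpoint (needs network access, so left commented out):
# code=$(curl -s -k -o /dev/null -m 5 -w '%{http_code}' https://10.0.1.4:443/health)
# probe_verdict "$code"
```

A passing result from another VM does not prove that the NSG also admits the probe source 168.63.129.16, so the NSG check in the next section is still needed.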

Checking NSG for Probes

The Load Balancer probe originates from 168.63.129.16 with the AzureLoadBalancer service tag. To confirm if NSG is allowing:

# View effective NSG rules on VM's NIC
az network nic list-effective-nsg \
  --name nic-vm-web-01 \
  --resource-group rg-producao \
  --output json | grep -A 5 "AzureLoadBalancer"

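If no inbound Allow rule for AzureLoadBalancer is in effect, or a Deny rule with a higher priority (lower number) matches first, the probe is being blocked. The default rule AllowAzureLoadBalancerInBound (priority 65001) only helps when no custom Deny rule precedes it.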

SNAT Exhaustion Diagnostics

SNAT port exhaustion is silent: the health probe continues working (probe is inbound, doesn't use SNAT), VMs appear healthy, but outbound connections to the internet fail or timeout.

# Check SNAT metrics
az monitor metrics list \
  --resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
  --metric "SnatConnectionCount" \
  --filter "ConnectionState eq 'Failed'" \
  --aggregation Total \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

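If ConnectionState = Failed shows elevated values, SNAT ports are being exhausted. Inside the VMs, the typical symptom is new outbound TCP connections that stay in SYN-SENT and then time out; a high churn of short-lived connections, visible as many sockets in TIME_WAIT, is a common contributor:

# Linux: count sockets in SYN-SENT and in TIME-WAIT
ss -Htan state syn-sent | wc -l
ss -Htan state time-wait | wc -l

# Windows (PowerShell)
netstat -n | findstr TIME_WAIT | Measure-Object -Line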
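The Used SNAT Ports and Allocated SNAT Ports metrics are easier to act on as a ratio. The sketch below assumes you pass in the two readings for one backend instance; the function name snat_port_pressure and the 80% warning threshold are arbitrary choices for illustration, not Azure recommendations.

```shell
#!/usr/bin/env bash
# snat_port_pressure USED ALLOCATED: prints the usage percentage and a status.
snat_port_pressure() {
  awk -v used="$1" -v alloc="$2" 'BEGIN {
    if (alloc <= 0) { print "unknown: no ports allocated"; exit }
    pct = 100 * used / alloc
    if (pct >= 100)     status = "exhausted"
    else if (pct >= 80) status = "at risk"    # hypothetical threshold
    else                status = "ok"
    printf "%.0f%% %s\n", pct, status
  }'
}

snat_port_pressure 512 1024    # 50% ok
snat_port_pressure 900 1024    # 88% at risk
snat_port_pressure 1024 1024   # 100% exhausted
```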

6. Implementation Methods

6.1 Azure Portal: Azure Monitor for Load Balancer

When to use: trend visualization, initial diagnostics with charts.

Path: Load Balancer > Monitoring > Metrics

The portal offers charts of all metrics with filters by dimension (by VM, by port, by frontend). The Backend IP Address dimension is especially useful for identifying which specific VM has a problem.

Configure alerts directly in the portal for proactive notification:

Path: Load Balancer > Monitoring > Alerts > New alert rule

Recommended alerts:

Health Probe Status < 100% for more than 5 minutes
Data Path Availability < 100%
SNAT Connection Count with ConnectionState = Failed > 0

6.2 Azure CLI

Check complete Load Balancer configuration (components, rules, probes):

# Complete Load Balancer view
az network lb show \
  --name lb-web-public \
  --resource-group rg-networking \
  --output json

# List only probes
az network lb probe list \
  --lb-name lb-web-public \
  --resource-group rg-networking \
  --output table

# List load balancing rules
az network lb rule list \
  --lb-name lb-web-public \
  --resource-group rg-networking \
  --output table

# Check VMs in backend pool
az network lb address-pool show \
  --lb-name lb-web-public \
  --name bp-vms-web \
  --resource-group rg-networking \
  --query "loadBalancerBackendAddresses[].{Name:name, NIC:networkInterfaceIPConfiguration.id}"

Check if a VM is correctly added to the pool:

# See which backend pool a NIC belongs to
az network nic show \
  --name nic-vm-web-01 \
  --resource-group rg-producao \
  --query "ipConfigurations[0].loadBalancerBackendAddressPools[].id"

If the result is empty, the NIC is not in the backend pool. This is one of the most common problems: the Load Balancer was created with a pool, but VMs were never added.

6.3 PowerShell

# Get LB and inspect all components
$lb = Get-AzLoadBalancer -Name "lb-web-public" -ResourceGroupName "rg-networking"

# List probes
$lb.Probes | Select-Object Name, Protocol, Port, RequestPath, IntervalInSeconds, NumberOfProbes

# List rules
$lb.LoadBalancingRules | Select-Object Name, Protocol, FrontendPort, BackendPort, LoadDistribution

# Check VMs in backend pool
$lb.BackendAddressPools[0].LoadBalancerBackendAddresses | Select-Object Name

# Check if VM is in pool via NIC
$nic = Get-AzNetworkInterface -Name "nic-vm-web-01" -ResourceGroupName "rg-producao"
$nic.IpConfigurations[0].LoadBalancerBackendAddressPools

6.4 Network Watcher: Diagnostic Complement

Network Watcher complements Load Balancer diagnostics:

# IP Flow Verify: check if NSG allows traffic from internet to VM
# (the VM is given as a full resource ID, so no resource group is needed)
az network watcher test-ip-flow \
  --direction Inbound \
  --protocol TCP \
  --local 10.0.1.4:443 \
  --remote 203.0.113.1:54321 \
  --vm /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-web-01

# Connection Troubleshoot: test connectivity from a client VM to the LB frontend IP
az network watcher test-connectivity \
  --source-resource /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-cliente \
  --dest-address 40.68.100.50 \
  --dest-port 443 \
  --protocol Tcp

7. Control and Security

Diagnosing Conflicting NSG Rules

A very common failure pattern involves two NSGs, one on the subnet and one on the NIC, that disagree: one allows AzureLoadBalancer (for probes) but not business traffic, or allows business traffic but not the probe. Inbound traffic must be allowed by both, so checking both NSGs is essential:

# Check subnet NSG
az network nsg rule list \
  --nsg-name nsg-subnet-web \
  --resource-group rg-networking \
  --output table

# Check NIC NSG
az network nsg rule list \
  --nsg-name nsg-nic-vm-web-01 \
  --resource-group rg-networking \
  --output table

The analysis should check:

Is there an Allow rule for AzureLoadBalancer on the probe port?
Is there an Allow rule for business traffic (port 80/443) from any origin (or from the internet via Internet tag)?
Is there any high-priority Deny rule that might be overriding the Allow rules?
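These three questions come down to first-match evaluation in priority order. The sketch below is a deliberately simplified model, assuming rules are written as "PRIORITY ACCESS SOURCE PORT" lines and matched by exact value or "*"; real NSGs also match on prefixes, port ranges, destination and protocol. The function name nsg_verdict and the sample rules are hypothetical.

```shell
#!/usr/bin/env bash
# nsg_verdict SOURCE PORT: reads rules on stdin, sorts them by priority
# (lowest number first) and prints the first rule that matches.
nsg_verdict() {
  sort -n -k1,1 | awk -v src="$1" -v port="$2" '
    ($3 == src || $3 == "*") && ($4 == port || $4 == "*") {
      print $2 " (priority " $1 ")"
      found = 1
      exit
    }
    END { if (!found) print "no match" }'
}

# A custom deny-all at 4000 shadows the default AzureLoadBalancer allow at 65001.
rules='300 Allow Internet 443
4000 Deny * *
65001 Allow AzureLoadBalancer *'

printf '%s\n' "$rules" | nsg_verdict Internet 443            # Allow (priority 300)
printf '%s\n' "$rules" | nsg_verdict AzureLoadBalancer 443   # Deny (priority 4000)
```

In this sample the business traffic is admitted while the probe is denied, so every VM would be reported unhealthy even though clients could technically reach them.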

Check Load Balancer Diagnostic Logs

Standard Load Balancer can send its metrics to a Log Analytics workspace through a diagnostic setting:

# Enable diagnostic logs
az monitor diagnostic-settings create \
  --name diag-lb-web \
  --resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
  --workspace /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-monitoring \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'

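With the metrics in Log Analytics, KQL queries enable historical analysis. Metrics exported through diagnostic settings are aggregated across dimensions, so the per-VM split stays in Azure Monitor metrics while the workspace holds the history per Load Balancer:

// Health probe failures in the last 24 hours, per Load Balancer
AzureMetrics
| where TimeGenerated > ago(24h)
| where ResourceType == "LOADBALANCERS"
| where MetricName == "DipAvailability"
| where Average < 100
| project TimeGenerated, ResourceId, Average
| sort by TimeGenerated desc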

8. Decision Making

Which metric to check first based on symptoms?

Symptom	First metric	What to look for
No traffic works	`SYN Count`	Zero = traffic not reaching LB
Some requests work, others don't	`Health Probe Status` by VM	Specific VMs with failing probe
Slowness / timeouts	`Data Path Availability` + `SNAT Connection Count`	Gradual degradation or SNAT exhaustion
VM unexpectedly removed from pool	`Health Probe Status` by VM in period	Intermittent probe failures
VM outbound connections failing	`SNAT Connection Count` with Failed	SNAT port exhaustion
Uneven traffic between VMs	No direct metric	Check Session Persistence and distribution by 5-tuple hash

Which diagnostic tool to use?

Situation	Tool	Reason
Check if NSG blocks probe	IP Flow Verify with remote IP 168.63.129.16	Simulates probe packet
Check if NSG blocks business traffic	IP Flow Verify with real source IP	Simulates client packet
Test if LB delivers traffic to VM	Connection Troubleshoot from a client VM to the LB frontend IP	End-to-end test
See exactly which packets reach VM	Packet Capture on VM NIC	Raw traffic analysis
VPN Gateway diagnostics in path	VPN Troubleshoot	Specific for gateways
Historical probe failure	Azure Monitor Metrics history	Identify temporal pattern

9. Best Practices

Configure proactive alerts before having problems: creating alerts for Health Probe Status < 100% and Data Path Availability < 100% ensures immediate notification when a VM is removed from the pool, before the impact is reported by users.

Use dedicated probe endpoints with deep checks: an HTTP probe on /health that just returns 200 immediately doesn't detect real application failures. A /health endpoint that checks database connection, queues and other dependencies removes VMs from the pool when there are functional problems, not just when the server is offline.

Enable Load Balancer diagnostics for Log Analytics from creation: historical metrics allow retroactive analysis when a problem is reported with delay ("yesterday afternoon the site was slow"). Without historical logs, the diagnostic window is limited to what Azure Monitor retains by default (93 days for metrics).

Document the expected NSG design for the LB: maintaining documentation of which NSG rules are necessary for the Load Balancer to function (AzureLoadBalancer for probe + internet source for traffic) facilitates diagnosis when someone inadvertently modifies the rules.

10. Common Errors

Probe configured on wrong port

The probe is configured on port 443, but the application only listens on port 8443 (or vice versa). The TCP probe checks 443, finds nothing listening (connection refused), marks the VM as unhealthy. The VM is working perfectly, but the LB doesn't send traffic to it. Checking the actual application port via netstat -an | grep LISTEN on the VM and correcting the probe resolves it.

VM added to backend pool but NIC disassociated

A VM was deleted and recreated, but the old NIC remains associated with the backend pool. The pool shows an entry but with no NIC associated with an active VM. The LB tries to send probes to an IP address that no longer exists. Verify that each entry in the backend pool corresponds to an existing NIC and VM.

Health probe successful, but application fails for real traffic

The probe is GET /health HTTP/1.1 and returns 200 immediately without doing anything. But the application's real traffic (GET /api/data) fails because the database connection is broken. The VM remains in the pool. The probe doesn't detect the real problem because the /health endpoint doesn't check critical subsystems.

Session Persistence configured incorrectly causing uneven load

With Client IP session persistence, corporate clients behind NAT (all sharing the same outbound IP) are always sent to the same VM, overloading it while the others remain idle. The administrator adds VMs to the pool, but the distribution doesn't improve. For stateless applications the solution is None (5-tuple hash), which hashes source IP, source port, destination IP, destination port and protocol, so connections from the same client IP spread across VMs as the source port changes.
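The effect is easy to reproduce with a toy hash. The sketch below uses cksum as a stand-in for the Load Balancer's real hash, which is not public, and simulates 50 connections from one NATed client IP; the function names pick_backend and distinct_backends, the addresses and the pool size are all hypothetical.

```shell
#!/usr/bin/env bash
# pick_backend KEY POOL_SIZE: hashes KEY with cksum (a stand-in hash) and
# maps the result to a backend index between 0 and POOL_SIZE-1.
pick_backend() {
  local crc
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % $2 ))
}

# distinct_backends MODE POOL_SIZE: opens 50 connections from one client IP
# with changing source ports and prints how many different backends were hit.
distinct_backends() {
  local mode="$1" pool="$2" port key
  for port in $(seq 50001 50050); do
    if [ "$mode" = "client-ip" ]; then
      key="203.0.113.10->40.68.100.50"                   # source and destination IP only
    else
      key="203.0.113.10:${port}->40.68.100.50:443/tcp"   # full 5-tuple
    fi
    pick_backend "$key" "$pool"
  done | sort -u | awk 'END { print NR }'
}

distinct_backends client-ip 3    # always 1: every connection shares the same key
distinct_backends five-tuple 3   # usually more than 1: the source port changes the key
```

Adding VMs does not change the first result, which is why scaling out the pool does not relieve the overloaded VM under Client IP persistence.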

Ignoring BackendIPAddress dimension in probe metrics

The administrator sees Health Probe Status = 80% (some problem), but doesn't filter by individual VM. Spends hours trying to diagnose "the LB" without identifying that only one of 5 VMs has a failing probe. Always filter by BackendIPAddress when investigating probe failures.

11. Operation and Maintenance

Automated Diagnostic Script for Load Balancer

#!/bin/bash
# Prints the Load Balancer configuration and its most recent health metrics.
set -euo pipefail

LB_NAME="lb-web-public"
RG="rg-networking"
POOL_NAME="bp-vms-web"

echo "=== Load Balancer Configuration ==="
az network lb show --name "$LB_NAME" --resource-group "$RG" \
  --query "{SKU:sku.name, FrontendIPs:join(', ', frontendIPConfigurations[].name), ProbeCount:length(probes), RuleCount:length(loadBalancingRules)}" \
  --output table

echo ""
echo "=== Health Probes ==="
az network lb probe list --lb-name "$LB_NAME" --resource-group "$RG" \
  --query "[].{Name:name, Protocol:protocol, Port:port, Path:requestPath, Interval:intervalInSeconds, Threshold:numberOfProbes}" \
  --output table

echo ""
echo "=== Load Balancing Rules ==="
az network lb rule list --lb-name "$LB_NAME" --resource-group "$RG" \
  --query "[].{Name:name, Protocol:protocol, FrontendPort:frontendPort, BackendPort:backendPort, Distribution:loadDistribution}" \
  --output table

echo ""
echo "=== VMs in Backend Pool ==="
az network lb address-pool show \
  --lb-name "$LB_NAME" \
  --name "$POOL_NAME" \
  --resource-group "$RG" \
  --query "loadBalancerBackendAddresses[].{Name:name, NIC:networkInterfaceIPConfiguration.id}" \
  --output table

echo ""
echo "=== Recent Metrics (last 30 min) ==="
# date -d is GNU date syntax (Linux, Azure Cloud Shell)
START=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

LB_ID=$(az network lb show --name "$LB_NAME" --resource-group "$RG" --query id -o tsv)

# VipAvailability is the metric ID behind Data Path Availability
echo "Data Path Availability:"
az monitor metrics list --resource "$LB_ID" \
  --metric "VipAvailability" --aggregation Average \
  --start-time "$START" --end-time "$END" \
  --query "value[0].timeseries[0].data[-1].average" --output tsv

# DipAvailability is the metric ID behind Health Probe Status
echo "Health Probe Status:"
az monitor metrics list --resource "$LB_ID" \
  --metric "DipAvailability" --aggregation Average \
  --start-time "$START" --end-time "$END" \
  --query "value[0].timeseries[0].data[-1].average" --output tsv

KQL Queries for Load Balancer Analysis

// Health probe history per Load Balancer
// (metrics exported via diagnostic settings are aggregated across dimensions,
// so the per-VM split is only available in Azure Monitor metrics)
AzureMetrics
| where ResourceType == "LOADBALANCERS"
| where MetricName == "DipAvailability"
| where TimeGenerated > ago(6h)
| summarize avg(Average) by bin(TimeGenerated, 5m), ResourceId
| render timechart

// SNAT connection volume over time; the Failed/Successful split is a dimension
// and has to be read from Azure Monitor metrics with a ConnectionState filter
AzureMetrics
| where ResourceType == "LOADBALANCERS"
| where MetricName == "SnatConnectionCount"
| where TimeGenerated > ago(24h)
| where Total > 0
| project TimeGenerated, ResourceId, Total
| sort by TimeGenerated desc

Relevant Limits for Diagnosis

Item	Limit	Diagnosis Impact
Azure Monitor metrics retention	93 days	Historical available for retroactive analysis
SNAT ports per VM (without Outbound Rule)	1,024 by default for pools of up to 50 VMs; fewer for larger pools	Can exhaust in VMs with many outbound connections
VMs per backend pool (Standard)	1,000	Rarely reached, but check in large VMSS
Health probe interval	5 seconds minimum	With a threshold of 2 consecutive failures, a VM is marked unhealthy after about 10 seconds
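Two of these limits lend themselves to quick arithmetic. The sketch below encodes the default SNAT port tiers as I understand them from Azure's outbound connections documentation (they should be verified against the current docs before relying on them) and the interval-times-threshold rule for probe detection; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# default_snat_ports POOL_SIZE: default SNAT ports per instance when ports
# are allocated automatically (tiers assumed from Azure documentation).
default_snat_ports() {
  local n="$1"
  if   [ "$n" -le 50 ];  then echo 1024
  elif [ "$n" -le 100 ]; then echo 512
  elif [ "$n" -le 200 ]; then echo 256
  elif [ "$n" -le 400 ]; then echo 128
  elif [ "$n" -le 800 ]; then echo 64
  else                        echo 32
  fi
}

# probe_detection_seconds INTERVAL THRESHOLD: approximate time until a failing
# VM is marked unhealthy, ignoring probe timeout and scheduling jitter.
probe_detection_seconds() {
  echo $(( $1 * $2 ))
}

default_snat_ports 3          # 1024
default_snat_ports 120        # 256
probe_detection_seconds 5 2   # 10
```

The first function shows why growing a backend pool past a tier boundary can trigger SNAT exhaustion on VMs that were previously fine: each instance's default allocation is halved.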

12. Integration and Automation

Automated Alerts and Runbooks

Configure alerts that trigger automatic diagnostic runbooks: when a metric such as Health Probe Status drops below 100%, the alert's action group starts a runbook that collects the diagnostics described in the previous sections.

Load Balancer Health Dashboard

Create a centralized Azure Monitor workbook with the critical LB metrics:

# Create workbook in Azure Monitor for consolidated visualization
az monitor workbook create \
  --resource-group rg-monitoring \
  --name "lb-health-dashboard" \
  --display-name "Load Balancer Health Dashboard" \
  --kind shared \
  --source-id /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/loadBalancers/lb-web-public \
  --serialized-data @lb-workbook-template.json

13. Final Summary

Essential points:

The two fundamental metrics are Data Path Availability (is the path to the frontend working, with at least one healthy VM behind it?) and Health Probe Status (do the VMs respond to the probe?). Analyzing them together reveals the type of problem.
The Load Balancer probe originates from 168.63.129.16 (AzureLoadBalancer tag). NSGs that block this IP cause unhealthy VMs even with the application working.
SNAT Exhaustion is silent: probes pass, VMs appear healthy, but outbound connections from VMs fail. The SNAT Connection Count (Failed) metric reveals the problem.
The BackendIPAddress dimension in probe metrics allows identifying which specific VM has a problem, not just the general average.

Critical differences:

Probe failing vs. NSG blocking business traffic: the probe uses AzureLoadBalancer as source; business traffic uses the client or internet IP. An NSG can allow one and block the other, resulting in healthy VMs (probe OK) but no real traffic reaching them.
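Health Probe Status vs. Data Path Availability: Health Probe Status measures whether each backend answers the probe; Data Path Availability measures whether traffic sent to the frontend can be delivered. They can have different values: with one of three VMs failing, Health Probe Status drops while Data Path Availability can stay at 100%.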
SNAT Exhaustion vs. Probe failure: they are completely independent. SNAT affects outbound connections from VMs. Probe affects whether VMs stay in the pool to receive inbound traffic.

What needs to be remembered:

Always check metrics with BackendIPAddress dimension to identify which VM has the problem.
Configure proactive alerts for Health Probe Status < 100% in production environment.
If VMs are in the pool, probes passing, but real traffic fails: check the NSG for the business port (not just the probe port).
Use az network lb address-pool show to confirm that VM NICs are correctly associated with the backend pool.
For SNAT Exhaustion: configure Outbound Rules with explicit port allocation or migrate to NAT Gateway.
The fastest first step in an initial diagnosis is checking Health Probe Status and Data Path Availability in Azure Monitor for the affected LB.
