Troubleshooting Lab: Identify appropriate use cases for Azure Load Balancer
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that a web application hosted on three VMs behind a Standard public Azure Load Balancer has started experiencing intermittent failures. Some users can access it normally, while others receive timeouts. The application has been in production for six months with no changes to the Load Balancer configuration.
Last week, the security team applied a new NSG policy to the VM subnets to restrict lateral traffic between them. Diagnostic metrics were also enabled on the Load Balancer through Azure Monitor.
The team checks the following in the portal:
Health Probe Status:
VM-01: Healthy
VM-02: Degraded
VM-03: Degraded
NSG applied to subnet:
Inbound rules:
Priority 100 | Source: VirtualNetwork | Port: 443 | Allow
Priority 200 | Source: Internet | Port: 443 | Allow
Priority 4096 | Default Deny All
Probe configured:
Protocol: HTTP
Port: 80
Path: /health
Users who can reach the application are all being routed to VM-01. The other VMs are running and respond locally when tested over SSH with curl localhost:80/health.
What is the root cause of the intermittent failures?
A) Azure Load Balancer Standard does not support HTTP health probes on port 80 when the load balancing rule uses port 443
B) The new NSG rule blocks traffic originating from Azure's probe address (168.63.129.16) on port 80, causing VM-02 and VM-03 to be marked as degraded
C) The diagnostics enabled via Azure Monitor are consuming VM resources and slowing health probe responses
D) The NSG rule with priority 100 restricts inbound traffic only to port 443, preventing users from accessing the application on degraded VMs
Scenario 2: Action Decision
The network team identified that an internal Azure Load Balancer is configured incorrectly: the load balancing rule points to the correct backend pool, but Floating IP (also called Direct Server Return) was inadvertently enabled. The target application was not developed to operate with Floating IP and is not configured to respond using the Load Balancer frontend IP.
The environment is in production. There is a scheduled maintenance window in 72 hours. Changing the Floating IP configuration on an existing rule does not cause immediate downtime; it only requires the rule to be saved again. The development team confirmed that none of the backend instances need additional reconfiguration after the rule is corrected.
What is the correct action to take at this moment?
A) Wait for the 72-hour maintenance window to disable Floating IP, as any change to Load Balancer rules in production must follow the formal change process
B) Disable Floating IP on the load balancing rule immediately, as the correction does not cause downtime and the environment is operating with active routing failure
C) Recreate the Load Balancer from scratch without Floating IP, taking advantage of the maintenance window to ensure a clean configuration
D) Add a static route on the backend VMs pointing the Load Balancer frontend IP to loopback, as a temporary solution until the maintenance window
Scenario 3: Root Cause
A company recently migrated a legacy industrial monitoring system to Azure. The system uses Modbus TCP protocol on port 502 for communication between a central server and 12 field devices represented by VMs. The team configured an internal Azure Load Balancer Standard to distribute requests from the central server to the VMs.
After migration, the central server can establish TCP connection on port 502, but the field devices report receiving duplicate requests irregularly. The team suspects a backend pool configuration problem.
Information collected:
Load Balancer SKU: Standard (Internal)
Frontend IP: 10.10.1.100
Backend pool: 12 VMs
Load balancing rule:
Protocol: TCP
Frontend port: 502
Backend port: 502
Session Persistence: None
Idle Timeout: 4 minutes
Floating IP: Disabled
Connection logs on central server:
[10:01:22] Connected to 10.10.1.100:502
[10:01:22] Request sent: READ_HOLDING_REGISTERS
[10:01:23] Response received from 10.10.1.101
[10:01:45] New connection established to 10.10.1.100:502
[10:01:45] Request sent: READ_HOLDING_REGISTERS
[10:01:46] Response received from 10.10.1.107
The central server does not intentionally reopen connections. The VMs are healthy and the Modbus application is responding normally in isolated tests. The 4-minute Idle Timeout was configured by the team to reduce resource consumption.
What is the root cause of the duplicate requests?
A) The backend pool with 12 VMs exceeds the instance limit supported by internal Azure Load Balancer Standard, causing unexpected rebalancing
B) The lack of Session Persistence causes each new TCP connection to be routed to a different VM, and the Modbus protocol, being stateful per session, interprets the backend switch as request duplication
C) The 4-minute Idle Timeout is terminating active TCP connections before the central server completes the polling cycle, forcing reconnections that the Load Balancer routes to different backends
D) Modbus TCP protocol is not supported by Azure Load Balancer because it operates on a port below 1024, requiring special privileged port configuration
Scenario 4: Diagnostic Sequence
An engineer receives the following report: "Backend pool VMs are not receiving traffic from the Load Balancer. The frontend IP responds to ping, but the application on the VMs is not reached."
The engineer has access to the Azure portal and the VMs via Bastion. The following investigation steps are available, but out of order:
Step P: Check if VMs are in "Running" state and if the OS agent is responsive
Step Q: Check health probe status in Azure portal (Healthy/Degraded)
Step R: Confirm that the load balancing rule is associated with the correct frontend IP and backend pool
Step S: Manually test if the application responds on the configured port by curling directly to the VM IP via Bastion
Step T: Check if the subnet or NIC NSG allows probe traffic (168.63.129.16) on the configured port
Which sequence represents the most efficient diagnostic reasoning, from general to specific?
A) P -> Q -> R -> T -> S
B) R -> Q -> T -> S -> P
C) Q -> R -> P -> S -> T
D) S -> T -> Q -> R -> P
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The central clue is the combination of two facts: the new NSG policy was applied recently, and the health probe uses HTTP on port 80, while the NSG rules allow inbound traffic only on port 443 and only from the "VirtualNetwork" and "Internet" sources. Azure Load Balancer health probes originate from 168.63.129.16, which NSGs identify through the AzureLoadBalancer service tag. No existing rule names that tag, and no rule allows port 80, so the Deny All rule at priority 4096 drops the probes before the default AllowAzureLoadBalancerInBound rule (priority 65001) is ever evaluated. As a result, probes do not reach VM-02 and VM-03, which are marked as degraded.
The irrelevant detail in the scenario is that Azure Monitor diagnostics were enabled. It is plausible as a distractor because it coincided in time with the problem, but passive monitoring does not interfere with VM responsiveness.
The most dangerous distractor is alternative A, which suggests a non-existent technical limitation between probe protocol and rule port. This belief would lead the engineer to change the probe configuration without solving the actual NSG blocking, keeping the environment degraded.
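The priority-ordered evaluation behind this answer can be sketched in a few lines of Python. This is a simplified model, not the actual NSG engine: the `Rule` class and `evaluate` function are hypothetical names, sources are matched as plain service-tag labels, and the probe source 168.63.129.16 is assumed to be classified under the AzureLoadBalancer tag. The priority-150 rule is an illustrative fix, not one taken from the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int
    source: str   # service tag label, or "*" for any source
    port: str     # destination port as a string, or "*" for any port
    action: str   # "Allow" or "Deny"


def evaluate(rules, source_tag, port):
    """Return the action of the first matching rule, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        source_ok = rule.source in ("*", source_tag)
        port_ok = rule.port in ("*", str(port))
        if source_ok and port_ok:
            return rule.action
    return "Deny"  # modeling assumption: unmatched traffic is treated as denied


# Inbound rules as listed in the scenario.
scenario_rules = [
    Rule(100, "VirtualNetwork", "443", "Allow"),
    Rule(200, "Internet", "443", "Allow"),
    Rule(4096, "*", "*", "Deny"),
]

# Health probe on port 80: only the deny rule at priority 4096 matches.
print(evaluate(scenario_rules, "AzureLoadBalancer", 80))  # Deny

# Hypothetical fix: allow the probe source on the probe port ahead of the deny rule.
fixed_rules = scenario_rules + [Rule(150, "AzureLoadBalancer", "80", "Allow")]
print(evaluate(fixed_rules, "AzureLoadBalancer", 80))     # Allow
print(evaluate(fixed_rules, "Internet", 443))             # Allow
```

Under these assumptions, the probe is denied by the existing rules and allowed once a rule for the probe source and port sits above the deny rule, while user traffic on port 443 is unaffected.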
Answer Key: Scenario 2
Answer: B
The scenario explicitly states that the change does not cause downtime and that the backend does not require additional reconfiguration. Given these conditions, waiting 72 hours (alternative A) means deliberately maintaining an environment with active routing failure, which is not justifiable. The correction is low risk and should be applied immediately.
Alternative C is technically valid as a cleanup approach, but rebuilding the entire Load Balancer to correct a single property of one rule represents unnecessary operational risk and violates the principle of minimal impact in production.
Alternative D is a workaround for the scenario where Floating IP is intentional and the backend needs to respond using the frontend IP. It would be applicable if the cause were different from what's described, but in this case it introduces complexity without solving the root problem.
The reasoning error behind alternative A is treating process restrictions (the maintenance window) as absolute even when the correction carries no risk and the impact of inaction is concrete.
Answer Key: Scenario 3
Answer: C
The Modbus TCP protocol is session-oriented: the central server opens a TCP connection and performs multiple polling cycles within it. The 4-minute Idle Timeout is shorter than the interval between some polling cycles of the legacy system, so the Load Balancer terminates TCP connections it considers idle. When the central server tries to reuse a terminated connection, its operating system establishes a new TCP connection, which the Load Balancer routes to a different backend (without Session Persistence, each new connection is treated independently). The field device that receives the new connection has no context from the previous session, and the request repeated after the reconnect is seen as duplication or inconsistency.
The irrelevant information is the number of VMs in the backend pool (12). It makes alternative A look plausible, but Azure Load Balancer Standard supports backend pools far larger than 12 instances.
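The timing argument can be checked with a small sketch. The function name `forced_reconnects` and the polling schedule are hypothetical; only the 4-minute timeout comes from the scenario, and the 30-minute value is an illustrative longer setting.

```python
def forced_reconnects(request_times_s, idle_timeout_s):
    """Return the request times at which the previous flow had already idled out."""
    reconnects = []
    for previous, current in zip(request_times_s, request_times_s[1:]):
        # A gap longer than the idle timeout means the flow is gone
        # and the next request needs a new TCP connection.
        if current - previous > idle_timeout_s:
            reconnects.append(current)
    return reconnects


IDLE_TIMEOUT_S = 4 * 60  # from the scenario's load balancing rule

# Hypothetical polling schedule in seconds; the 300 s gap is invented for illustration.
polls = [0, 60, 120, 420, 480]

print(forced_reconnects(polls, IDLE_TIMEOUT_S))  # [420]
print(forced_reconnects(polls, 30 * 60))         # []
```

With the 4-minute timeout, the single 300-second gap forces one reconnection; with the longer timeout, the same schedule stays on one connection.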
Alternative B correctly describes the expected Load Balancer behavior without Session Persistence, but is not the cause of duplications in this specific scenario: the central server does not intentionally reopen connections, so the lack of Session Persistence alone would not explain the problem if the Idle Timeout were adequate.
The most dangerous distractor is B, as it would lead the team to enable Session Persistence without increasing the Idle Timeout, which would only partially resolve the problem and would not eliminate the reconnections forced by the timeout.
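The per-connection distribution described for alternative B can be illustrated with a hash over the flow's five-tuple. This is a sketch only: the real hash used by Azure Load Balancer is not documented here, so SHA-256 is a stand-in, and `pick_backend`, the central server address 10.10.2.10, and the backend address range are hypothetical.

```python
import hashlib


def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Map a flow's five-tuple to one backend; the same tuple always maps to the same backend."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# 12 hypothetical backend addresses standing in for the field-device VMs.
backends = [f"10.10.1.{n}" for n in range(101, 113)]

# Each reconnect uses a new ephemeral source port, so the tuple (and possibly
# the chosen backend) changes even though the client and frontend stay the same.
for src_port in (50001, 50002, 50003):
    print(src_port, pick_backend("10.10.2.10", src_port, "10.10.1.100", 502, "TCP", backends))
```

The point of the sketch is that an established connection stays on one backend, while every forced reconnect is a fresh draw, which is why reducing reconnects matters more than persistence alone.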
Answer Key: Scenario 4
Answer: B
The correct sequence is R -> Q -> T -> S -> P.
Efficient diagnostic reasoning starts with logical configuration (is the rule correctly associated with frontend and backend?), advances to service-reported state (do probes indicate healthy backends?), investigates network blocking (does the NSG allow probe traffic?), validates the application directly (does the service respond on the correct port?), and only then checks VM state as a last resort.
Sequence A (P -> Q -> R -> T -> S) starts with VM state, which is the least likely factor to be the cause in an environment where VMs were described as running. Starting there wastes diagnostic time.
Sequence D (S -> T -> Q -> R -> P) starts with direct VM testing, which requires Bastion access and is the most operationally expensive step. Applying this step before checking logical configuration and probe state reverses the order of cost and probability.
Progressive diagnostic discipline requires that high-impact, low-operational-cost questions come before direct tests that require instance access.
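The ordering above can be expressed as a small runner that executes the checks in sequence and records which ones fail. Everything here is illustrative: `run_sequence` is a hypothetical helper, and the `observed` values are invented stand-ins for what the portal or a Bastion session would report.

```python
def run_sequence(order, checks):
    """Run checks in the given order and return the steps whose check failed."""
    findings = []
    for step in order:
        description, check = checks[step]
        passed = check()
        print(f"Step {step}: {description}: {'ok' if passed else 'FAILED'}")
        if not passed:
            findings.append(step)
    return findings


# Hypothetical observations; in practice each check would wrap a portal or VM lookup.
observed = {"R": True, "Q": False, "T": False, "S": True, "P": True}

checks = {
    "R": ("rule bound to correct frontend and backend pool", lambda: observed["R"]),
    "Q": ("health probes report healthy", lambda: observed["Q"]),
    "T": ("NSG allows probe traffic from 168.63.129.16", lambda: observed["T"]),
    "S": ("application responds on the configured port", lambda: observed["S"]),
    "P": ("VM is running and the OS agent responds", lambda: observed["P"]),
}

# Sequence from the answer key: cheap configuration checks first, instance access last.
findings = run_sequence(["R", "Q", "T", "S", "P"], checks)
print(findings)  # ['Q', 'T']
```

In this invented case the unhealthy probe (Q) points to the NSG (T) before any instance access is needed, which is the cost ordering the answer argues for.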
Troubleshooting Tree: Identify appropriate use cases for Azure Load Balancer
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (verifiable decision) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Validation or intermediate verification |
When facing a real problem with Azure Load Balancer, start at the root node and answer each question based on what is observable in the environment: Azure portal, Monitor metrics, or direct VM access. Each answer leads to the next level of investigation. Never jump to an action node without following the complete path from the symptom, because different causes can produce identical symptoms and an incorrect action can mask the real problem without solving it.