Troubleshooting Lab: Map requirements to features and capabilities of Azure Load Balancer
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team receives alerts that a production web application is experiencing high latency and intermittent errors. The environment uses a public Standard Azure Load Balancer with three VMs in the backend pool. The on-call engineer checks the Azure dashboard and collects the following information:
Load Balancer: lb-prod-web (Standard, Public)
Frontend IP: 20.120.45.10
Backend Pool:
- VM-prod-01: Running, NIC associated, responding to internal ping
- VM-prod-02: Running, NIC associated, responding to internal ping
- VM-prod-03: Running, NIC associated, responding to internal ping
Health Probe:
- Protocol: HTTP
- Port: 8080
- Path: /healthz
- Interval: 15s
- Unhealthy threshold: 2
NSG associated with VM subnet:
- Inbound Rule 100: Allow TCP 443 from Any
- Inbound Rule 200: Allow TCP 8080 from 10.0.0.0/8
- Inbound Rule 4096: Deny All Inbound
The engineer confirms that all three VMs are running, the application on port 443 responds when tested directly via each VM's private IP, and there has been no recent deployment. User traffic arrives on port 443.
What is the root cause of the intermittent errors?
A. The health probe is using HTTP instead of HTTPS, which is incompatible with the Standard SKU and generates false negatives
B. The NSG blocks health probe requests because the probe source IP range is not covered by the rule that allows port 8080
C. The default Idle Timeout of 4 minutes is terminating long connections before the application completes processing
D. The Load Balancer cannot distribute traffic because no rule explicitly allows the frontend IP in the subnet NSG
Scenario 2: Root Cause
An internal Standard Load Balancer was deployed to balance requests between instances of a data processing service. The VMs in the backend pool need to query an external API on the internet during processing. After migrating the environment from Basic to Standard, jobs began failing with timeout errors when calling the external API. The team verifies:
Load Balancer: lb-internal-proc (Standard, Internal)
Frontend IP: 10.1.2.100 (private)
Backend Pool:
- VM-proc-01: no public IP assigned to NIC
- VM-proc-02: no public IP assigned to NIC
- VM-proc-03: no public IP assigned to NIC
Effective routes on VMs:
- 0.0.0.0/0 -> Next hop: Internet
Subnet NSG:
- Outbound Rule 100: Allow TCP 443 to Internet
- Outbound Rule 4096: Deny All Outbound
The NSG allows outbound traffic on port 443, routes point to the internet, and there is no Azure Firewall or NVA in the path. The environment was functional when using Basic Load Balancer.
What is the root cause of the outbound connectivity failure?
A. The NSG is blocking outbound traffic because the Allow TCP 443 rule doesn't specifically cover the external API's IP address
B. The Standard SKU doesn't provide default outbound connectivity for VMs without public IPs, unlike Basic, and no outbound solution has been configured
C. The default route 0.0.0.0/0 with next hop Internet is being ignored because the internal Load Balancer intercepts all outbound traffic
D. VMs need individual public IPs on their NICs to have outbound connectivity, regardless of Load Balancer SKU
Scenario 3: Action Decision
The cause has been identified: a public Standard Load Balancer in production has all VMs in the backend pool marked as unhealthy by the HTTP health probe on port 80. Investigation confirmed that the application was updated yesterday and the new version exposes the health check endpoint on port 8080, no longer on port 80. Port 80 no longer responds on any of the VMs.
The environment has the following constraints:
- Allowed maintenance window: Saturdays from 10 PM to 2 AM. Today is Tuesday, 2 PM.
- The Load Balancer has Session Persistence: Client IP configured.
- User traffic is arriving and being dropped because there are no healthy VMs.
- A second standby Load Balancer is available, pre-configured with probe on port 8080, but has never been tested in production.
- The team has permission to change configurations on the existing Load Balancer outside the maintenance window in case of an active incident.
What is the correct action at this time?
A. Wait for Saturday's maintenance window to change the health probe port on the existing Load Balancer, avoiding risk outside the window
B. Immediately promote the standby Load Balancer to production, redirecting DNS to the new frontend IP
C. Change the health probe port on the existing Load Balancer from 80 to 8080, leveraging the permission for changes during active incidents
D. Roll back the application deployment to the previous version, restoring response on port 80 until the next maintenance window
Scenario 4: Diagnostic Sequence
An engineer receives the following report: "After adding a new VM to the Load Balancer's backend pool, some users started reporting that their sessions are randomly restarting."
The engineer has the following investigation steps available:
1. Verify if the Session Persistence configuration in the load balancing rule is different from None
2. Confirm if the new VM started receiving traffic based on Load Balancer metrics data
3. Verify if the health probe is marking the new VM as healthy
4. Compare application configurations and software versions between the new VM and existing VMs in the pool
5. Verify if the Idle Timeout of the load balancing rule was changed along with the VM addition
Which diagnostic sequence is the most logical and efficient for this scenario?
A. 3 -> 2 -> 1 -> 4 -> 5
B. 1 -> 3 -> 2 -> 4 -> 5
C. 5 -> 1 -> 3 -> 2 -> 4
D. 3 -> 1 -> 5 -> 2 -> 4
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The HTTP health probe on port 8080 originates from a special Azure platform IP address: 168.63.129.16. This IP doesn't belong to the 10.0.0.0/8 block, so NSG rule 200, which allows port 8080 only from internal sources, doesn't cover it. Rule 4096 then denies everything else, and because it is evaluated before the default AllowAzureLoadBalancerInBound rule (priority 65001), that default rule never gets to admit the probe. Therefore, probes are dropped, the VMs are intermittently marked as unhealthy and taken out of rotation, and traffic stops being sent to them, causing the errors.
The hint in the scenario is the combination of the restricted source in the NSG rule (10.0.0.0/8) with the fact that the probe uses port 8080, which isn't open to all sources.
The irrelevant information is that VMs respond to internal ping and port 443 works on direct access. This data confirms the application is healthy but has no relation to the health probe path.
The distractors lead to errors by focusing on the probe protocol (A), timeout (C), or absence of an explicit rule for the frontend (D). The most dangerous error would be C: increasing the Idle Timeout wouldn't solve anything and would delay identifying the real cause for hours.
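The rule evaluation described in this answer can be sketched as a toy first-match evaluator. This is a minimal illustration, not Azure's NSG engine: the `NsgRule` class and `evaluate` function are hypothetical names, only port and source are modeled, and "Any" is written as `0.0.0.0/0`.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NsgRule:
    """Hypothetical, simplified inbound rule (port and source only)."""
    priority: int
    action: str            # "Allow" or "Deny"
    port: Optional[int]    # None means any port
    source: str            # CIDR; "0.0.0.0/0" stands in for Any


# The three inbound rules from Scenario 1.
RULES = [
    NsgRule(100, "Allow", 443, "0.0.0.0/0"),
    NsgRule(200, "Allow", 8080, "10.0.0.0/8"),
    NsgRule(4096, "Deny", None, "0.0.0.0/0"),
]

PROBE_SOURCE = "168.63.129.16"  # health probe source named in the answer key


def evaluate(rules, source_ip, port):
    """Return the first rule (lowest priority number) that matches."""
    src = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port is not None and rule.port != port:
            continue
        if src in ipaddress.ip_network(rule.source):
            return rule
    return None


probe_hit = evaluate(RULES, PROBE_SOURCE, 8080)
internal_hit = evaluate(RULES, "10.0.1.4", 8080)
print(probe_hit.priority, probe_hit.action)        # 4096 Deny
print(internal_hit.priority, internal_hit.action)  # 200 Allow
```

In this model the probe falls through rule 200 and lands on the deny rule, while an internal client on the same port is allowed. In a real NSG, the usual fix is an allow rule for the probe port whose source is the `AzureLoadBalancer` service tag, placed at a priority ahead of the deny rule.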
Answer Key: Scenario 2
Answer: B
The Basic SKU Load Balancer provides default outbound access implicitly for VMs in the backend pool. The Standard SKU eliminates this behavior by design, requiring explicitly configured outbound connectivity via Outbound Rules (which need a public Standard Load Balancer frontend), NAT Gateway, or a public IP on the NIC. Since none of these alternatives were configured during migration, the VMs lost the ability to initiate connections to the internet.
The hint in the scenario is direct: the environment "was functional when using Basic Load Balancer" and no other changes were made.
The irrelevant information is the 0.0.0.0/0 route with next hop Internet. The route is correct and isn't the problem: routing permits the traffic, but without a SNAT mechanism the VMs' private addresses have no public address to be translated to, so the connections can't be established.
Distractor D is the most dangerous: it claims VMs always need individual public IPs, which is technically false. Outbound Rules and NAT Gateway are valid solutions without public IPs on NICs. Acting on this belief would lead to incorrect and more expensive architecture.
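The outbound reasoning in this answer can be condensed into a small decision function. This is a sketch under stated assumptions: `BackendVm` and `outbound_path` are hypothetical names, the SKU behavior is modeled exactly as this answer key describes it, and the order in which the options are checked is a simplification rather than a statement of Azure's precedence rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackendVm:
    """Hypothetical view of the outbound-related settings of one VM."""
    name: str
    public_ip_on_nic: bool = False
    nat_gateway_on_subnet: bool = False
    lb_outbound_rule: bool = False


def outbound_path(vm: BackendVm, lb_sku: str) -> Optional[str]:
    """Return the mechanism providing SNAT for the VM, or None if there is none."""
    if vm.nat_gateway_on_subnet:
        return "NAT Gateway"
    if vm.public_ip_on_nic:
        return "public IP on NIC"
    if vm.lb_outbound_rule:
        return "Load Balancer outbound rule"
    if lb_sku == "Basic":
        return "default outbound access (implicit)"
    return None  # Standard SKU with nothing configured: no outbound path


vm = BackendVm("VM-proc-01")
print(outbound_path(vm, "Basic"))     # default outbound access (implicit)
print(outbound_path(vm, "Standard"))  # None

vm_fixed = BackendVm("VM-proc-01", nat_gateway_on_subnet=True)
print(outbound_path(vm_fixed, "Standard"))  # NAT Gateway
```

The same VM that had an implicit path under Basic has none under Standard, which matches the timeouts seen after the migration. Attaching a NAT Gateway restores outbound access without placing a public IP on any NIC.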
Answer Key: Scenario 3
Answer: C
The cause is identified and the impact is immediate: 100% of traffic is being dropped. The environment has explicit permission for changes outside the window during active incidents. Changing the health probe port from 80 to 8080 on the existing Load Balancer is the surgical, safe, and reversible action that solves the problem without introducing additional risk.
Alternative A ignores the incident permission and prolongs unavailability until Saturday, which is unacceptable. Alternative B introduces unnecessary risk by putting a never-tested resource into production, potentially trading one problem for another. Alternative D reverts an application deployment, which is an operation with its own impact and wouldn't be justified when the correct Load Balancer fix is available and low-risk.
The key point is that maintenance window restrictions exist for planned changes, not to block incident response when explicit permission exists.
Answer Key: Scenario 4
Answer: A
The correct sequence is: 3 -> 2 -> 1 -> 4 -> 5.
The progressive reasoning starts by validating that the new VM is actually participating in the pool in a healthy way (step 3). Without confirming this, any hypothesis about sessions is premature. Next, confirming through metrics that it's receiving traffic (step 2) establishes that the problem isn't pool exclusion. With this confirmed, checking if Session Persistence is configured (step 1) is the next logical step, as adding a new VM with 5-tuple hash can redistribute existing sessions if persistence isn't active. Software version comparison (step 4) identifies compatibility issues at the application layer. Idle Timeout (step 5) is least likely to have been changed along with VM addition and should be checked last.
Alternative B starts with Session Persistence before confirming if the VM is healthy and receiving traffic, which reverses the hypothesis elimination order. Alternatives C and D place Idle Timeout in central positions of the diagnosis, which isn't justified by the reported symptom.
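The effect of pool growth on hash-based distribution can be illustrated with a toy model. This is not Azure's actual hashing algorithm, which is not described here: the sketch assumes a SHA-256 hash reduced modulo the pool size, and the VM names, client address, and frontend address are hypothetical documentation values.

```python
import hashlib

OLD_POOL = ["vm-1", "vm-2", "vm-3"]
NEW_POOL = OLD_POOL + ["vm-4"]  # pool after adding the new VM

# 1000 flows from one client, differing only in source port.
# Tuple layout: (src_ip, src_port, dst_ip, dst_port, protocol)
FLOWS = [("198.51.100.7", port, "203.0.113.10", 443, "TCP")
         for port in range(50000, 51000)]


def pick_backend(pool, key_fields):
    """Toy distribution: hash the key fields, then take modulo pool size."""
    key = "|".join(str(field) for field in key_fields).encode()
    digest = hashlib.sha256(key).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


def key_five_tuple(flow):
    """Session Persistence: None (all five fields are hashed)."""
    return flow


def key_client_ip(flow):
    """Session Persistence: Client IP (source IP and destination IP only)."""
    return (flow[0], flow[2])


def backends_used(pool, key_fn):
    """Set of backends that the client's flows land on."""
    return {pick_backend(pool, key_fn(flow)) for flow in FLOWS}


def moved(key_fn):
    """Number of flows whose backend changes when the pool grows."""
    return sum(pick_backend(OLD_POOL, key_fn(f)) != pick_backend(NEW_POOL, key_fn(f))
               for f in FLOWS)


print("5-tuple backends used:", len(backends_used(OLD_POOL, key_five_tuple)))
print("Client IP backends used:", len(backends_used(OLD_POOL, key_client_ip)))
print("5-tuple flows remapped after adding a VM:", moved(key_five_tuple))
```

In this model, a client whose connections are hashed on the full 5-tuple is spread across several backends, and many of its flows map to a different backend once the pool grows. With the Client IP key, all of the client's flows share one backend. Even then, a pool change can move the client as a whole, so an application that keeps session state locally on each VM remains exposed.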
Troubleshooting Tree: Map requirements to features and capabilities of Azure Load Balancer
Legend:
| Color | Meaning |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (verifiable decision) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |
To use this tree when facing a real problem, start at the root node describing the observed symptom. At each blue node, answer the question based on what is observable in the Azure portal, logs, or metrics. Follow the path corresponding to your answer until reaching a red node (identified cause) and then execute the action in the associated green node. If the action doesn't resolve the symptom, return to the nearest orange verification node and reassess the hypotheses before advancing to another branch.