Troubleshooting Lab: Configure an internal or public load balancer
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that a web application exposed via a public Azure Load Balancer Standard stopped responding to external clients at 2:32 PM. The backend pool VMs are running, NSGs allow traffic on port 80, and the team confirms that no changes were made to the load balancing rules or the frontend IP in the last 48 hours.
The on-call engineer collects the following information:
Health probe status (Azure portal):
vm-web-01: Degraded
vm-web-02: Degraded
vm-web-03: Degraded
NSG associated with backend subnet:
Rule 100 - Allow - TCP - Any - 80 - Inbound
Rule 200 - Allow - TCP - Any - 443 - Inbound
Rule 65000 - Allow - VirtualNetwork - Any - Any - Inbound
Rule 65001 - Allow - AzureLoadBalancer - Any - Any - Inbound
Direct connectivity check (via Bastion):
curl http://localhost:80 -> HTTP 200 OK (on all VMs)
Recent event in Activity Log:
2:28 PM - NSG modification: subnet-backend
The Load Balancer public IP remains accessible via ping. The application responds locally on all VMs. The team mentions that the application version was updated the day before, but the deployment was completed without errors.
What is the root cause of the problem?
A) The application update introduced a bug that makes VMs respond locally but reject external connections
B) An NSG rule added at 2:28 PM is blocking health probe traffic from Azure Load Balancer, causing all instances to be marked as degraded
C) The public frontend IP was dissociated from the Load Balancer during the NSG modification window
D) Load Balancer Standard requires VMs to have individual public IPs to work correctly with HTTP health probes
Scenario 2: Action Decision
The cause of a failure has been identified: the Internal Load Balancer serving the application tier lost all backend pool instances because VMs were migrated to a new subnet during network reorganization. The new subnet belongs to the same VNet. The service is completely unavailable to internal users.
The operational context is as follows:
- The environment is production and serves real-time e-commerce orders
- The official maintenance window starts in 4 hours
- The team has Contributor permission on the Resource Group
- Recreating the Load Balancer from scratch would take approximately 40 minutes
- Adding VMs from the new subnet to the existing backend pool takes less than 5 minutes and requires no additional downtime
- The security team must be notified before any network topology changes
What is the correct action to take at this moment?
A) Wait for the official maintenance window in 4 hours and execute the correction with the complete change management process
B) Recreate the Load Balancer from scratch immediately to ensure a clean configuration without inconsistencies
C) Add VMs from the new subnet to the existing backend pool now, notifying the security team in parallel, to restore service with minimal impact
D) Revert VM migration to the original subnet and plan network reorganization for the next maintenance window
Scenario 3: Root Cause
A developer reports that after creating a public Azure Load Balancer Standard for a new test environment, backend VMs receive traffic normally, but any outbound connection attempt from the VMs themselves (for example, API calls to external services during application execution) fails with a timeout.
Test executed from vm-test-01:
curl -v https://api.partner.com/health --max-time 10
* Trying 203.0.113.45:443...
* Connection timed out after 10001 milliseconds
curl: (28) Connection timed out after 10001 milliseconds
Test executed from vm-test-01 (DNS):
nslookup api.partner.com
Server: 168.63.129.16
Address: 168.63.129.16#53
Name: api.partner.com
Address: 203.0.113.45
Load Balancer configuration:
SKU: Standard
Frontend IP: 52.170.20.10 (Public)
Backend pool: vm-test-01, vm-test-02
Load balancing rule: TCP 80 -> 80
Outbound rules: none configured
Health probe: TCP 80, 5s interval
The subnet NSG allows all outbound traffic. VMs do not have individual public IPs. DNS resolution works correctly.
What is the root cause of the problem?
A) The NSG is blocking outbound traffic despite indicating permission, as Azure implicit rules take precedence over explicit allow rules
B) The TCP health probe prevents outbound connections while running, as it consumes available SNAT capacity
C) Azure Load Balancer Standard does not provision implicit SNAT for outbound traffic, and without an outbound rule or NAT Gateway, VMs without public IP cannot initiate external connections
D) The Load Balancer frontend IP is configured as static, which blocks the use of dynamic SNAT for outbound traffic
Scenario 4: Diagnostic Sequence
An engineer receives the following alert at 9:15 AM:
"Customers report intermittent access to the application exposed by Load Balancer. Some users can access, others receive timeout. The issue started about 20 minutes ago."
The environment has a public Azure Load Balancer Standard with four VMs in the backend pool. The engineer has access to the Azure portal and VMs via Bastion.
Available investigation steps are:
- P1: Check health probe status for each instance in the Azure portal
- P2: Access one of the VMs via Bastion and test the application locally with curl
- P3: Analyze application logs on VMs to identify recent errors or exceptions
- P4: Check Activity Log of Load Balancer and NSGs for recent changes
- P5: Confirm if the public frontend IP is responding to TCP connections on the correct port
What is the most efficient diagnostic sequence for this symptom?
A) P5 -> P1 -> P4 -> P2 -> P3
B) P2 -> P3 -> P1 -> P5 -> P4
C) P1 -> P2 -> P4 -> P5 -> P3
D) P4 -> P5 -> P1 -> P2 -> P3
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive clue is in the Activity Log: an NSG modification occurred at 2:28 PM, four minutes before the failure started at 2:32 PM. Azure Load Balancer health probes originate from the special address 168.63.129.16, which is covered by the AzureLoadBalancer service tag. If an NSG rule added at that time blocked traffic from this source on the probe port, all instances would be marked as degraded simultaneously, and the Load Balancer would stop forwarding traffic even though the VMs are functioning.
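The effect of such a rule can be illustrated with a minimal sketch of first-match NSG evaluation. It is a simplified model, not Azure's implementation: protocol and destination are ignored, and service tags are compared as plain strings. The deny rule at priority 90 is hypothetical, since the scenario does not show the rule that was added at 2:28 PM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number is evaluated first
    action: str    # "Allow" or "Deny"
    source: str    # service tag name or "Any"
    port: str      # destination port or "Any"


def evaluate(rules, source, port):
    """Return (action, priority) of the first rule, by priority, that matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        source_ok = rule.source in ("Any", source)
        port_ok = rule.port in ("Any", str(port))
        if source_ok and port_ok:
            return rule.action, rule.priority
    return "Deny", None  # nothing matched in this simplified model


# Inbound rules as listed in the scenario.
baseline = [
    Rule(100, "Allow", "Any", "80"),
    Rule(200, "Allow", "Any", "443"),
    Rule(65000, "Allow", "VirtualNetwork", "Any"),
    Rule(65001, "Allow", "AzureLoadBalancer", "Any"),
]

# Hypothetical rule standing in for the 2:28 PM modification (an assumption).
hypothetical_deny = Rule(90, "Deny", "AzureLoadBalancer", "Any")
modified = baseline + [hypothetical_deny]

print(evaluate(baseline, "AzureLoadBalancer", 80))  # ('Allow', 100)
print(evaluate(modified, "AzureLoadBalancer", 80))  # ('Deny', 90)
print(evaluate(modified, "Internet", 80))           # ('Allow', 100)
```

In this model, client traffic on port 80 is still allowed after the change, while probe traffic is denied first, which matches the symptom of healthy VMs that are all reported as degraded.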
Alternative A is the most dangerous distractor: the application update the day before creates a plausible narrative, but the application responds with HTTP 200 locally on all VMs, which directly contradicts the application bug hypothesis. The update is irrelevant information in this context. Alternative C is false because the Activity Log records only an NSG modification, not a frontend IP dissociation. Alternative D invents a requirement that doesn't exist in Load Balancer Standard.
The most dangerous error would be initiating an application rollback based on temporal correlation with the previous deployment, wasting time while the real cause remains active.
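One way to avoid being misled by the older deployment is to filter changes by how close they are to the failure. The sketch below assumes a 15-minute look-back window, a placeholder calendar date, and a placeholder time for the deployment; none of these values come from the scenario, which only gives 2:28 PM, 2:32 PM, and "the day before".

```python
from datetime import datetime, timedelta


def changes_before(events, failure_time, window=timedelta(minutes=15)):
    """Return the events that occurred within `window` before `failure_time`."""
    return [
        (when, what)
        for when, what in events
        if timedelta(0) <= failure_time - when <= window
    ]


# Placeholder date; the deployment time of 14:00 is also an assumption.
events = [
    (datetime(2024, 1, 9, 14, 0), "Application deployment"),
    (datetime(2024, 1, 10, 14, 28), "NSG modification: subnet-backend"),
]
failure = datetime(2024, 1, 10, 14, 32)

for when, what in changes_before(events, failure):
    print(when.strftime("%H:%M"), what)  # 14:28 NSG modification: subnet-backend
```

Only the NSG modification falls inside the window, so it is the first change worth examining.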
Answer Key: Scenario 2
Answer: C
The scenario explicitly declares the cause (VMs migrated to new subnet) and presents context constraints. The correct fix is to add VMs to the existing backend pool, as this action takes less than 5 minutes, requires no additional downtime, and restores production service immediately. Notifying the security team in parallel respects the process without sacrificing service availability.
Alternative A ignores the criticality of the active production environment: waiting 4 hours with service completely unavailable is unacceptable when a quick and safe fix is available. Alternative B is technically valid but disproportionate to the problem and available time: recreating the Load Balancer would take 40 minutes unnecessarily, as the existing configuration is correct. Alternative D would revert a planned architectural change and cause a second unavailability cycle, making the problem bigger than it was.
The critical point of this scenario is distinguishing between the most technically complete action and the correct action given the real constraints.
Answer Key: Scenario 3
Answer: C
The evidence points directly to the absence of SNAT. The VMs don't have individual public IPs, no outbound rules are configured, and the SKU is Standard. Without an explicitly configured outbound path, traffic originating from the VMs cannot be translated to a public IP that is routable on the internet. The timeout occurs because packets leave the VM but are dropped when trying to traverse the internet without a valid source address.
Alternative A is incorrect: the NSG explicitly allows all outbound traffic, and NSG rules are evaluated in priority order, so the default rules (priority 65000 and above) never override an explicit rule with a lower priority number. DNS resolution works correctly, which also contradicts a generalized NSG block. Alternative B is false: health probes are inbound connections to the backend and do not consume SNAT ports. Alternative D invents a restriction that doesn't exist in Azure: the static or dynamic nature of the frontend IP doesn't interfere with SNAT availability.
The irrelevant information in this scenario is DNS resolution working correctly. It confirms the problem isn't DNS, but doesn't help identify the root cause of the TCP connection timeout.
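The distinction between a timeout and an active refusal is what separates "packets are silently dropped" from "the remote host was reached". A minimal sketch using only the standard library follows; the function name `probe_tcp` is illustrative, and the 10-second default simply mirrors the `--max-time 10` value in the curl test.

```python
import socket


def probe_tcp(host, port, timeout=10.0):
    """Classify the outcome of a single TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all: consistent with packets dropped along the path.
        return "timeout"
    except ConnectionRefusedError:
        # The remote host answered and rejected the connection.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


# Example call from a backend VM (not executed here):
# probe_tcp("api.partner.com", 443)
```

In the situation described in this scenario, the call would be expected to return "timeout" rather than "refused", just as curl reported error 28.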
Answer Key: Scenario 4
Answer: A
The correct sequence is P5 -> P1 -> P4 -> P2 -> P3, as it follows progressive diagnostic logic from entry point to cause.
P5 first: checking if the frontend IP responds to TCP connections confirms whether the problem is in the Load Balancer or before it. This eliminates or confirms the entry component with a quick check without VM access.
P1 next: health probe status immediately reveals if the Load Balancer considers instances healthy. Intermittency with some degraded instances explains why some users get access and others don't.
P4 then: the Activity Log reveals whether a recent change could explain the problem starting at about 8:55 AM, correlating time and change.
P2 next: testing locally on VMs confirms if the application is responding, separating network problems from application problems.
P3 last: application logs are the most time-consuming investigation and are accessed only after confirming the problem is at the application layer, avoiding unnecessary work.
Alternative B starts at the VMs before checking the Load Balancer, ignoring that the symptom may be at the entry component. Alternative D starts with the Activity Log, which is useful but doesn't confirm the current system state. Alternative C checks health probe status before confirming whether the frontend is accessible, inverting the priority.
Troubleshooting Tree: Configure an internal or public load balancer
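For an intermittent symptom, step P5 is more informative when repeated. Azure Load Balancer distributes flows with a 5-tuple hash by default, so each new connection uses a new source port and may land on a different backend. The sketch below opens several separate connections and counts the successes; the function name, attempt count, and timeout are illustrative assumptions.

```python
import socket


def sample_frontend(host, port, attempts=10, timeout=3.0):
    """Open `attempts` separate TCP connections and return how many succeeded."""
    successes = 0
    for _ in range(attempts):
        try:
            # Each attempt is a new flow with its own ephemeral source port.
            with socket.create_connection((host, port), timeout=timeout):
                successes += 1
        except OSError:
            pass  # timeouts, refusals, and resolution errors all count as failures
    return successes


# Example call against the frontend (hypothetical address, not executed here):
# sample_frontend("<frontend-ip>", 80)
```

A result between zero and the number of attempts is consistent with the reported symptom and is a reason to move on to P1. A TCP handshake alone does not prove the application is answering, which is why P2 remains in the sequence.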
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary decision or verification) |
| Red | Identified cause |
| Green | Recommended action or resolved state |
| Orange | Validation or intermediate verification |
When facing a real problem, start at the root node by identifying the main symptom. Follow the branches by answering each question based on what you observe in the environment, not what you suspect. Blue questions are directly verifiable in the Azure portal, via CLI, or via Bastion, without assumptions. When you reach a red node, you have the root cause. When you reach a green node, you have the recommended action or confirmation of resolution. Orange nodes indicate that more information needs to be collected before proceeding.