Troubleshooting Lab: Create and Configure an Azure Load Balancer

Diagnostic Scenarios​

Scenario 1 - Root Cause

An operations team receives alerts that incoming traffic has stopped being distributed to VMs in a backend pool. The environment uses a public Azure Load Balancer Standard. The VMs are running, the monitoring agent installed on them reports normal CPU and memory, and the network team confirms that the VNet and subnets haven't had recent changes.

The responsible engineer collects the following information:

Load Balancer SKU: Standard
Frontend IP: 20.10.5.100 (public, static)
Backend pool: VM1, VM2, VM3 (same VNet, subnet: 10.0.2.0/24)
Health probe: HTTP / port 8080 / path: /healthz / interval: 15s / threshold: 2
Load balancing rule: TCP / port 80 -> port 80

Probe results (Azure portal):
VM1: Unhealthy
VM2: Unhealthy
VM3: Unhealthy

Direct test via Bastion Host to VM1:
curl http://10.0.2.4:8080/healthz
HTTP 200 OK

Recent changes recorded in Activity Log:
- VM1 resize (yesterday at 23:14)
- Resource Group tag update (today at 08:02)

What is the root cause of the observed problem?

A. The VM1 resize disassociated the network interface from the backend pool, automatically removing all VMs from the pool

B. The health probe is configured with HTTP protocol and the Load Balancer Standard doesn't support HTTP probes, only TCP

C. An NSG on the VMs subnet is blocking the source IP of the health probes, preventing the Load Balancer from validating instance availability

D. The balancing rule uses TCP protocol on port 80, but the probe monitors port 8080; since the application doesn't respond on port 80, traffic is dropped


Scenario 2 - Action Decision

The platform team identified that the public Load Balancer Standard in a production environment has no outbound rule configured and no NAT Gateway associated with the backend pool subnet. The VMs in the pool need to access an external licensing service on the internet to validate tokens every hour. In the last 30 minutes, requests to the licensing service started timing out on the VMs, and application logs show outbound connection failures.

The cause was confirmed: the VMs lost outbound connectivity to the internet because the Standard SKU does not provide implicit SNAT without an outbound rule.

Environment restrictions:

  • Maintenance window: only on Saturdays between 00h and 04h (today is Tuesday)
  • VMs cannot be restarted during business hours
  • The network team has permission to modify Load Balancer configurations and create new resources without additional approval
  • The application tolerates up to 5 minutes of unavailability in the licensing service before entering degraded mode

What is the correct action to take at this moment?

A. Wait for Saturday's maintenance window to configure an outbound rule on the Load Balancer, as network changes in production outside the window are not allowed

B. Associate a NAT Gateway to the backend pool subnet immediately, as this operation doesn't require VM restart and restores outbound connectivity without impact on inbound traffic

C. Assign an individual public IP to each VM in the backend pool to restore per-instance SNAT, restarting each VM after assignment to apply the configuration

D. Create an outbound rule on the Load Balancer pointing to a new public frontend IP dedicated to outbound, but wait for security team approval before applying


Scenario 3 - Root Cause

An internal Load Balancer Standard was configured to expose an API service on port 443 within a VNet. The frontend IP is 172.16.1.10. Clients in the same VNet report that they can establish a TCP connection to 172.16.1.10:443 but receive an immediate reset after the TLS handshake.

The engineer analyzes the environment:

Frontend IP: 172.16.1.10 (internal, static)
Backend pool: API-VM1, API-VM2
Health probe: TCP / port 443 / interval: 5s / threshold: 2

Probe status:
API-VM1: Healthy
API-VM2: Healthy

Load balancing rule configuration:
Protocol: TCP
Frontend port: 443
Backend port: 443
Floating IP: Enabled
Session persistence: None

TLS certificate installed on VMs: valid, issued for CN=api.internal.corp
Direct test on VM (without Load Balancer):
openssl s_client -connect <VM IP>:443: handshake OK

Additional information: the security team updated the certificate policy two weeks ago, but the certificates on the VMs were renewed and validated manually yesterday.

What is the root cause of the problem observed?

A. The TLS certificate was issued for the name api.internal.corp and not for the IP 172.16.1.10, causing validation failure on the client

B. With Floating IP enabled, packets arrive at VMs with destination IP 172.16.1.10; since VMs don't have this IP configured on any interface, the VM's TCP stack terminates the connection after handshake

C. The internal Load Balancer Standard doesn't support TLS traffic on port 443 without an Application Gateway in front

D. The TCP health probe is marking VMs as Healthy even with the TLS service failing, because TCP and TLS operate on different layers


Scenario 4 - Diagnostic Sequence

An engineer receives the following report: "The public Load Balancer stopped responding on port 80 after a configuration update made 20 minutes ago. External clients receive timeout."

The following investigation steps are available, but out of order:

  1. Check if the load balancing rule on port 80 exists and is associated with the correct frontend IP
  2. Confirm if the VMs in the backend pool have Healthy status in health probes
  3. Validate if the NSG associated with the subnet or VM NIC allows inbound traffic on port 80 and from source 168.63.129.16
  4. Test direct connectivity with the IP of one of the backend pool VMs on port 80, without going through the Load Balancer
  5. Check if the Load Balancer public frontend IP is active and with the correct address

What is the correct diagnostic sequence, starting from the most external failure point toward the most internal cause?

A. 2 -> 1 -> 5 -> 3 -> 4

B. 5 -> 1 -> 2 -> 3 -> 4

C. 1 -> 5 -> 3 -> 2 -> 4

D. 3 -> 5 -> 1 -> 2 -> 4


Answer Key and Explanations​

Answer Key - Scenario 1

Answer: C

The definitive clue is in the contradiction between the two collected data points: the probes mark all VMs as Unhealthy, but a direct test via Bastion on port 8080 of the /healthz endpoint returns HTTP 200. This means the application is working correctly and responding on the port and path configured in the probe. What's failing is the path between the Load Balancer and the VMs, not the application itself.

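The Load Balancer Standard sends health probes from the special address 168.63.129.16. Every NSG includes a default rule (AllowAzureLoadBalancerInBound) that permits this traffic, but a custom deny rule with a higher priority on the subnet or VM NIC overrides it. In that case, unless an explicit rule with an even higher priority allows inbound traffic from this IP (or from the AzureLoadBalancer service tag), the probes are silently dropped. The Load Balancer interprets the lack of response as failure and marks the instances as Unhealthy, which stops traffic from being routed to them.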
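The priority logic can be illustrated with a minimal Python sketch. It is a simplified model, not Azure's implementation: the `Rule` class, the `DenyAllCustom` and `AllowProbe8080` rules, and their priorities are hypothetical, and only source and port are matched.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

PROBE_SOURCE = "168.63.129.16"  # source address of Azure health probes


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int   # lower number is evaluated first
    source: str     # CIDR, "*", or the "AzureLoadBalancer" service tag
    port: str       # single port or "*"
    allow: bool


def source_matches(rule_source: str, src_ip: str) -> bool:
    if rule_source == "*":
        return True
    if rule_source == "AzureLoadBalancer":
        return src_ip == PROBE_SOURCE
    return ip_address(src_ip) in ip_network(rule_source)


def probe_allowed(rules: list[Rule], probe_port: int) -> tuple[bool, str]:
    """Return (allowed, name of the first inbound rule that matches the probe)."""
    for rule in sorted(rules, key=lambda r: r.priority):
        port_ok = rule.port in ("*", str(probe_port))
        if port_ok and source_matches(rule.source, PROBE_SOURCE):
            return rule.allow, rule.name
    return False, "no matching rule"


# Hypothetical rule set: a custom deny shadows the default allow rule.
rules = [
    Rule("DenyAllCustom", 4000, "*", "*", allow=False),
    Rule("AllowAzureLoadBalancerInBound", 65001, "AzureLoadBalancer", "*", allow=True),
]
before = probe_allowed(rules, 8080)
print(before)  # (False, 'DenyAllCustom')

# An explicit allow with a lower priority number lets the probes through again.
rules.append(Rule("AllowProbe8080", 300, "AzureLoadBalancer", "8080", allow=True))
after = probe_allowed(rules, 8080)
print(after)   # (True, 'AllowProbe8080')
```

The sketch shows why a Bastion test from inside the VNet can succeed while the probes fail: the two flows have different source addresses and can match different rules.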

The information about VM1 resize is irrelevant: resize doesn't remove VMs from backend pools nor affect other VMs in the pool. Alternative B is false: HTTP is a valid protocol for probes in the Standard SKU. Alternative D represents a classic misconception of confusing the probe port with the load balancing rule port; these two configurations are independent and data traffic being on port 80 doesn't interfere with the probe monitoring port 8080.

The most dangerous distractor is A: in a real pressure scenario, the recent resize is the most visible change in the Activity Log and can lead the engineer to investigate VM1 individually, delaying the correct diagnosis.


Answer Key - Scenario 2

Answer: B

The cause is confirmed in the statement; the challenge is choosing the correct action within the restrictions. The critical restriction is that VMs cannot be restarted during business hours. This immediately eliminates alternative C: assigning a public IP to a running VM may not itself require a restart, but the alternative explicitly includes restarting each VM, which the restriction forbids.

Associating a NAT Gateway to a subnet is a control plane operation that doesn't affect running VMs and doesn't require restart. Outbound connectivity is restored in seconds after association. This is the least invasive and fastest action within the restrictions.
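The elimination reasoning can be written down as a small filter. This is only a sketch of the decision logic: the `Option` fields and the values assigned to each alternative are this example's reading of the scenario, not part of any Azure API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    label: str
    description: str
    restarts_vms: bool   # would violate "no restarts during business hours"
    acts_now: bool       # can be applied immediately, without waiting


# Values below are an interpretation of alternatives A-D in the scenario.
options = [
    Option("A", "Wait for Saturday's window", restarts_vms=False, acts_now=False),
    Option("B", "Associate a NAT Gateway now", restarts_vms=False, acts_now=True),
    Option("C", "Public IP per VM, then restart", restarts_vms=True, acts_now=True),
    Option("D", "Outbound rule, wait for approval", restarts_vms=False, acts_now=False),
]


def viable(option: Option) -> bool:
    """An option is viable only if it acts immediately and restarts nothing."""
    return option.acts_now and not option.restarts_vms


chosen = [o.label for o in options if viable(o)]
print(chosen)  # ['B']
```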

Alternative A ignores the active impact in production: the licensing service has been failing for 30 minutes, well beyond the 5 minutes the application tolerates before entering degraded mode, so waiting until Saturday is not viable. Alternative D invents a security team approval requirement that doesn't exist in the statement and also ignores the necessary response time. The statement says the network team can modify Load Balancer configurations and create new resources without additional approval.


Answer Key - Scenario 3

Answer: B

The described behavior, TCP connection established followed by immediate reset after TLS handshake, is the exact behavior that occurs when a VM receives a packet destined for an IP that isn't configured on any of its interfaces. With Floating IP enabled, the Load Balancer delivers the packet with the destination IP preserved as 172.16.1.10. If the VM doesn't have this IP configured on its network interface or loopback, the operating system's network stack accepts the TCP packet (because the port is open), but the TLS process that binds to the VM's own IP doesn't recognize the connection as its own and issues a reset.

The clue is in the direct test: openssl s_client -connect <VM IP>:443 works because in this case the destination is the VM's real IP, which it recognizes. The same test via 172.16.1.10 would fail.
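A minimal Python sketch of this behavior follows. It models only the destination address the VM sees and whether the guest owns it; `VM_IP` is a hypothetical private address for API-VM1, since the scenario doesn't state one.

```python
FRONTEND_IP = "172.16.1.10"  # frontend IP from the scenario
VM_IP = "172.16.1.4"         # hypothetical private IP of API-VM1


def destination_seen_by_vm(floating_ip_enabled: bool) -> str:
    """With Floating IP the frontend address is preserved as the destination;
    without it the Load Balancer rewrites the destination to the VM's own IP."""
    return FRONTEND_IP if floating_ip_enabled else VM_IP


def vm_accepts(dest_ip: str, local_ips: set[str]) -> bool:
    """The guest only serves traffic addressed to an IP it has configured."""
    return dest_ip in local_ips


local_ips = {VM_IP}
with_floating = vm_accepts(destination_seen_by_vm(True), local_ips)
without_floating = vm_accepts(destination_seen_by_vm(False), local_ips)
print(with_floating)     # False
print(without_floating)  # True

# Configuring the frontend IP on the guest (for example on a loopback
# interface) makes the Floating IP case work as well.
local_ips.add(FRONTEND_IP)
after_fix = vm_accepts(destination_seen_by_vm(True), local_ips)
print(after_fix)         # True
```

The model suggests two ways out: disable Floating IP on the rule, or configure the frontend IP inside the guest.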

The information about certificate policy update two weeks ago is irrelevant: certificates were renewed and validated yesterday, and TCP handshake is established before any certificate validation by the client. Alternative A confuses certificate validation failure by the client with TCP connection failure; a certificate failure occurs within TLS, it doesn't reset the connection at TCP level. Alternative C is false: internal Load Balancer Standard supports any TCP port, including 443. Alternative D describes a correct fact (TCP and TLS are different layers), but it's not the cause of the observed problem.


Answer Key - Scenario 4

Answer: B

The correct diagnostic sequence for a Load Balancer that stopped responding externally follows the logic of validation from outside to inside, from control plane to data plane:

5 -> 1 -> 2 -> 3 -> 4

First, confirm that the frontend IP is active and correct (step 5): without this, no other investigation makes sense, as the problem might simply be that the IP changed or was disassociated. Next, validate the load balancing rule (step 1) to ensure port 80 is mapped to the correct frontend. The next step is to check the health probe status (step 2), because if VMs are Unhealthy, the Load Balancer doesn't route traffic regardless of any other configuration. Then, investigate the NSG (step 3) to identify network blocks that might be preventing both probes and data traffic. Finally, test direct connectivity with the VM (step 4) to isolate whether the problem is in the Load Balancer or the application.
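The ordering can be sketched as a list of checks run from the outside in. The check functions here are stand-ins that read from a hypothetical `observed` dictionary; in practice each would query the portal, the CLI, or a direct test.

```python
from typing import Callable

Check = tuple[int, str, Callable[[], bool]]


def failing_steps(checks: list[Check]) -> list[tuple[int, str]]:
    """Run the checks in the given order and return the ones that fail.
    The first entry is the most external failure point."""
    return [(step, text) for step, text, passed in checks if not passed()]


# Hypothetical findings for illustration only.
observed = {
    "frontend_ok": True,
    "rule_ok": True,
    "probes_healthy": False,
    "nsg_allows": False,
    "direct_ok": True,
}

checks: list[Check] = [
    (5, "frontend IP active and correct", lambda: observed["frontend_ok"]),
    (1, "rule on port 80 bound to the frontend", lambda: observed["rule_ok"]),
    (2, "backend VMs Healthy", lambda: observed["probes_healthy"]),
    (3, "NSG allows port 80 and 168.63.129.16", lambda: observed["nsg_allows"]),
    (4, "direct connection to a VM on port 80", lambda: observed["direct_ok"]),
]

result = failing_steps(checks)
print(result)
# [(2, 'backend VMs Healthy'), (3, 'NSG allows port 80 and 168.63.129.16')]
```

With these sample findings, the frontend and rule pass, the probes fail, and the NSG check explains why, while the direct test confirms the application itself is responding.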

Alternative A starts with probe status before validating that the frontend and rule exist, which can lead the engineer to investigate VMs before confirming that the Load Balancer itself is correctly configured. Alternative C starts with the rule, ignoring that the frontend might have been the element changed in the recent update. Alternative D starts with NSG, which is a possible cause but not the most external one to check.


Troubleshooting Tree: Create and Configure an Azure Load Balancer​

Color legend:

Color        | Meaning
Dark blue    | Initial symptom or entry point
Medium blue  | Diagnostic question or decision point
Orange       | Intermediate verification or validation
Green        | Recommended action or resolution
Red          | Identified cause that requires thorough investigation

To use this tree when facing a real problem, start with the root node that represents the observed symptom and answer each diagnostic question based on what can be verified in the portal, logs, or through direct testing. Each branch eliminates a hypothesis and directs to the next verification, preventing corrective actions from being taken before the diagnosis is complete. The goal is to reach a green recommended action node only after following the correct validation path.