Skip to main content

Troubleshooting Lab: Troubleshoot Load Balancing

Diagnostic Scenarios​

Scenario 1 β€” Root Cause​

The operations team reports that a web application exposed via Azure Application Gateway started returning HTTP 502 for all users at 2:32 PM. The infrastructure team verifies that the backend pool VMs are running, respond to internal ping, and serve pages correctly when accessed directly by private IP on port 8080.

The Application Gateway was configured three weeks ago and was operating normally. No infrastructure changes were recorded on the day of the incident. The TLS certificate for the listener was renewed yesterday by the security team.

The relevant backend configuration is as follows:

Backend Settings:
Protocol: HTTP
Port: 80
Cookie-based affinity: Disabled
Request timeout: 20s

Health Probe (custom):
Protocol: HTTP
Host: app.contoso.internal
Path: /healthz
Interval: 30s
Threshold: 3
Match status codes: 200-399

The Application Gateway log records the following:

ResourceId: /subscriptions/.../applicationGateways/agw-prod
Category: ApplicationGatewayFirewallLog
TimeGenerated: 2024-11-12T14:32:07Z
BackendPool: pool-web
BackendSettingName: settings-web
BackendHttpStatusCode: (none)
Reason: UnhealthyBackend

What is the root cause of the problem?

A) The TLS certificate renewed yesterday is incompatible with the Application Gateway listener, causing handshake failure with the backend
B) The health probe is trying to connect on port 80 as per Backend Settings, but the servers are listening on port 8080, marking all instances as unhealthy
C) The request timeout of 20 seconds is insufficient for the /healthz endpoint to respond, causing intermittent failure that escalated to total failure
D) The UnhealthyBackend log indicates that the Application Gateway subnet NSG was changed and is blocking return traffic from the backends


Scenario 2 β€” Action Decision​

The cause of a failure has been identified: a Azure Load Balancer Standard in a production environment has its health probe configured for TCP on port 443, but the backend pool servers terminated the service on that port after an application update that migrated HTTPS to port 8443. The Load Balancer marked all instances as unhealthy and traffic has been interrupted for 12 minutes.

The environment has the following restrictions:

  • The official maintenance window is at 10 PM; it's currently 3:45 PM
  • The security team requires formal approval for any firewall or NSG rule changes
  • Changing the health probe port on the Load Balancer does not require security approval and can be done immediately by the Azure administrator
  • Reverting the application update would require complete redeployment with an estimated time of 45 minutes and loss of active session data

What is the correct action to take at this time?

A) Wait for the 10 PM maintenance window to perform the change safely with adequate documentation
B) Immediately initiate redeployment of the previous application version to restore service on port 443
C) Change the Load Balancer health probe to monitor port 8443, restoring traffic without need for additional approval
D) Create a new NSG rule allowing port 8443 and submit approval to the security team before any other action


Scenario 3 β€” Root Cause​

A developer reports that their application, which consumes an API exposed by an internal Azure Load Balancer Standard, experiences sporadic connection failures. The failures occur exclusively on calls that take more than two minutes to complete, such as batch processing operations. Fast calls work without problems.

The administrator verifies that all backend pool VMs are healthy, the TCP health probe on port 443 returns success on all instances, and there are no CPU or memory alerts on the VMs. The client application logs show:

[ERROR] 2024-11-12 16:04:33 - Connection reset by peer
[ERROR] 2024-11-12 16:04:33 - Socket closed unexpectedly after 120s
[ERROR] 2024-11-12 16:04:33 - Retry attempt 1 of 3 failed

The Load Balancer was deployed six months ago with all default configurations. Recently the team added a second NIC to the VMs for storage traffic, but this NIC uses a different subnet and is not associated with the backend pool.

What is the root cause of the failures?

A) The second NIC added to the VMs is causing routing conflicts, making return traffic exit through the wrong interface and breaking the TCP flow
B) The Load Balancer default idle timeout is 4 minutes, but long connections without active traffic in the initial stages of processing reach this limit and are terminated
C) The TCP health probe does not validate the actual application state, only port availability, allowing degraded instances to receive traffic
D) The internal Load Balancer Standard does not support connections lasting more than 120 seconds due to layer 4 design limitations


Scenario 4 β€” Diagnostic Sequence​

An administrator receives an alert that the Azure Application Gateway in a production environment is returning HTTP 504 (Gateway Timeout) for a portion of requests. The problem is intermittent and affects only some users, not all.

The following investigation steps are available, out of order:

  1. Check Application Gateway logs in Log Analytics to identify which backend pool instances are being selected for failing requests
  2. Confirm if the health probe is marking any instances as unhealthy and how many instances remain active in the pool
  3. Check backend server logs to see if problematic requests arrive and what processing time is recorded
  4. Check the request timeout configuration in Application Gateway Backend Settings and compare it with average backend response time
  5. Reproduce the problem by making manual requests to the affected endpoint and record response time

What is the correct investigation sequence?

A) 5 β†’ 1 β†’ 2 β†’ 3 β†’ 4
B) 2 β†’ 1 β†’ 3 β†’ 4 β†’ 5
C) 2 β†’ 5 β†’ 1 β†’ 3 β†’ 4
D) 1 β†’ 2 β†’ 5 β†’ 3 β†’ 4


Answer Key and Explanations​

Answer Key β€” Scenario 1​

Answer: B

The decisive clue is in the direct description: servers respond correctly on port 8080, but Backend Settings defines port 80 for communication. The custom health probe inherits the port defined in Backend Settings when it doesn't specify its own port, so all probes reach port 80, find no response, and mark all instances as unhealthy. The result is the UnhealthyBackend logged.

The information about the TLS certificate renewal yesterday is intentionally irrelevant: the backend uses HTTP, not HTTPS, and the problem occurs in communication between the gateway and backend, not in the external listener. Focusing on the certificate is the most likely diagnostic error to occur under pressure, as it's the most recent recorded change in the environment.

Alternative C is theoretically plausible, but a 20-second timeout would cause gradual and intermittent failures, not a total and abrupt failure at 2:32 PM without previous history. Alternative D reverses the logic: NSG changes would leave a trail in the change history and would also affect direct VM access, which the statement explicitly rules out.

The most dangerous distractor is A: acting on the certificate without validating the backend port would delay resolution and introduce an unnecessary change in production.


Answer Key β€” Scenario 2​

Answer: C

The cause is already identified in the statement: the health probe port no longer corresponds to the port where the service is active. Changing the monitoring port from 443 to 8443 directly solves the problem, doesn't require security team approval as explicitly stated, and can be executed immediately by the administrator.

Alternative A ignores the severity of a production service interrupted for 12 minutes. Waiting six hours when an immediate corrective action is available without restrictions is the wrong decision. Alternative B is technically valid but unnecessarily costly: 45 additional minutes of downtime and session loss when the actual cause doesn't require redeployment. Alternative D blocks resolution in an approval process before acting, when the necessary action doesn't involve NSG.

The central reasoning error of the distractors is treating restrictions as absolute obstacles without checking which ones apply to the specific action available.


Answer Key β€” Scenario 3​

Answer: B

The failure pattern is the definitive clue: all occur after exactly two minutes of active connection without traffic in the flow. The Azure Load Balancer default idle timeout is 4 minutes, but batch processing operations frequently establish TCP connection and remain without traffic while processing on the backend before sending the response. If the Load Balancer doesn't detect traffic in the flow for sufficient time, it terminates the connection, which the client records as "Connection reset by peer".

The second NIC added to the VMs is irrelevant information inserted intentionally. It's on a different subnet, not associated with the backend pool, and has no relation to Load Balancer traffic flow. Focusing on it would be a classic error of confusing recent change with problem cause.

Alternative C describes a real limitation of TCP health probes, but doesn't explain the failure pattern at exactly 120 seconds. Alternative D is technically false: Load Balancer Standard has no fixed connection duration limit by design.

The most dangerous distractor is A: investigating routing conflicts in the second NIC would consume considerable time and not lead to resolution.


Answer Key β€” Scenario 4​

Answer: B

The correct sequence is: 2 β†’ 1 β†’ 3 β†’ 4 β†’ 5.

Progressive diagnostic reasoning starts from the highest level of available observability before descending to specific details. Step 2 (check instance status in pool) determines if the problem is backend availability, which would change the entire diagnostic direction. With instance status confirmed, step 1 (identify which instances receive failing requests) isolates whether the problem is in a specific instance or general behavior. Step 3 (server logs) verifies if requests reach the backend and how long they take. Step 4 (compare request timeout with actual response time) formulates the most likely hypothesis for the 504. Step 5 (manual reproduction) validates the hypothesis with concrete data.

Sequence A starts with manual reproduction, which doesn't add diagnostic information before understanding the environment state. Sequence C intersperses reproduction before consulting logs, inverting the cost versus information order. Sequence D starts by identifying specific instances without first checking the general pool state, which may waste time investigating the wrong instance.

The central discipline of this sequence is: validate environment state before trying to reproduce or hypothesize causes.


Troubleshooting Tree: Troubleshoot Load Balancing​

100%
Scroll para zoom Β· Arraste para mover Β· πŸ“± Pinch para zoom no celular

Color Legend:

ColorNode Type
Dark blueInitial symptom (entry point)
Medium blueDiagnostic question
RedIdentified cause
GreenRecommended action or resolution
OrangeIntermediate verification or validation

To use this tree when facing a real problem, start with the root node describing the observed symptom and answer each question based on what is directly verifiable in the portal, logs, or via CLI. At each bifurcation, choose the path that corresponds to the observed state, not the expected state. Follow the branches until reaching an identified cause node (red) and then apply the corresponding action (green). If the action doesn't resolve the problem, return to the last intermediate verification node (orange) and review the actual state before following another path.