Skip to main content

Troubleshooting Lab: Implement a load balancing rule

Diagnostic Scenarios​

Scenario 1 β€” Root Cause​

The operations team receives alerts that a web application is intermittently unavailable. The environment consists of a public Azure Load Balancer Standard with three VMs in the backend pool, all running the same application on port 443. The Load Balancer was deployed two weeks ago and was working normally until yesterday.

During investigation, the engineer collects the following information:

Load Balancer SKU: Standard
Frontend IP: 20.10.5.100 (static)
Load balancing rule:
- Protocol: TCP
- Frontend port: 443
- Backend port: 443
- Session persistence: None
- Floating IP: Disabled

Health Probe:
- Protocol: HTTPS
- Port: 443
- Path: /healthz
- Interval: 15s
- Unhealthy threshold: 2

Backend Pool:
- vm-app-01: marked as Degraded by probe
- vm-app-02: marked as Degraded by probe
- vm-app-03: marked as Degraded by probe

VM TLS certificate: renewed yesterday at 11:14 PM

When accessing each VM's IP directly via browser, the application responds normally. The /healthz endpoint also returns HTTP 200 when accessed directly.

What is the root cause of the unavailability?

A) The health probe is configured with insufficient threshold, marking healthy instances as degraded after two consecutive failures

B) The TLS certificate renewed on the VMs is not recognized by the Load Balancer HTTPS probe, which doesn't validate the new certificate correctly

C) The HTTPS probe fails because Load Balancer Standard doesn't support HTTPS protocol in health probes; only HTTP and TCP are accepted

D) The Network Security Group rules associated with the VM NICs were changed during certificate renewal, blocking traffic originating from Azure probe IP addresses


Scenario 2 β€” Action Decision​

An internal Load Balancer (ILB) Standard is distributing traffic to a cluster of four instances of an order processing service in production. The team identified that the root cause of a partial failure is the following: the load balancing rule is configured with Session persistence set to Client IP and Protocol, which is concentrating all traffic from an external integration system on a single overloaded instance, while the others remain idle.

The environment has the following constraints:

  • The external integration system makes batch calls every 5 minutes
  • The scheduled maintenance window is at 02:00 AM the next day
  • Changing the load balancing rule configuration causes a brief interruption of existing active connections
  • The overloaded instance is processing a critical batch started 3 minutes ago

What is the correct action to take at this time?

A) Immediately change the Session persistence to None in the load balancing rule, accepting the interruption of active connections to redistribute the load

B) Wait for the ongoing batch to complete and change Session persistence to None before the integrator's next call window, without waiting for the maintenance window

C) Wait for the scheduled maintenance window at 02:00 AM to make the change, as any production changes should follow the change management process

D) Add a second load balancing rule without Session persistence and remove the current rule, to avoid any interruption of existing connections


Scenario 3 β€” Root Cause​

A developer reports that after creating a new load balancing rule, no traffic is reaching the backend pool VMs. The responsible engineer inspects the configuration and collects the data below:

az network lb rule show \
--resource-group rg-producao \
--lb-name lb-api \
--name rule-http

{
"backendAddressPool": { "id": "...pools/pool-api-vms" },
"backendPort": 80,
"enableFloatingIp": false,
"frontendPort": 80,
"loadDistribution": "Default",
"probe": null,
"protocol": "Tcp",
"provisioningState": "Succeeded"
}

The engineer verifies that the VMs are running, respond on port 80 directly via private IP, and the NSG allows traffic on port 80 from any source. The backend pool contains two VMs with status listed as 0 of 2 instances are serving traffic.

The Load Balancer was created less than an hour ago. No other rules exist on this Load Balancer.

What is the root cause of the problem?

A) The Load Balancer was recently created and is still in the provisioning process; traffic will be forwarded automatically after complete propagation

B) The probe field has a null value in the load balancing rule, indicating no health probe was associated, and Standard SKU doesn't forward traffic without an active probe

C) The loadDistribution set to Default causes conflict with the backend pool when there are fewer than three available instances

D) The Tcp protocol is incompatible with HTTP applications in Standard SKU; the correct protocol would be Http for traffic on port 80


Scenario 4 β€” Diagnostic Sequence​

An engineer receives the following report: "The public Load Balancer Standard is responding on port 80, but clients report that some requests return random HTTP 502 errors, while others are successful."

The engineer has access to the Azure portal, Load Balancer diagnostic logs, VM consoles, and backend pool metrics.

The available investigation steps are:

  1. Check the health status of each instance individually in backend pool metrics
  2. Access application logs on each VM to identify errors during the failure period
  3. Confirm if the health probe is configured with the correct protocol and path for the application endpoint
  4. Check in Load Balancer metrics if the Health Probe Status counter shows oscillation per instance
  5. Test the probe endpoint directly on each VM from a machine on the same network

What is the correct investigation sequence?

A) 2 -> 1 -> 4 -> 3 -> 5

B) 3 -> 5 -> 4 -> 1 -> 2

C) 1 -> 4 -> 3 -> 5 -> 2

D) 4 -> 1 -> 3 -> 5 -> 2


Answer Key and Explanations​

Answer Key β€” Scenario 1​

Answer: D

The central clue is in the timing coincidence: the TLS certificate was renewed at 11:14 PM, and problems started shortly after. However, the cause is not the certificate itself, as direct access to VMs works normally, including the /healthz endpoint. What changes during certificate renewal in automated environments is often the execution of automation scripts that may alter adjacent configurations, such as NSG rules.

The Azure Load Balancer Standard health probe originates traffic from IP addresses in the 168.63.129.16 prefix. If an NSG rule was changed during the renewal process and blocked this source address on the VM NICs, probes will fail systematically, even though the application is healthy.

Alternative C is the most dangerous distractor: Load Balancer Standard does support HTTPS protocol in health probes. Acting based on this hypothesis would lead the engineer to unnecessarily change the probe to HTTP.

The information about the threshold of 2 failures (alternative A) is irrelevant to the diagnosis, as the observed behavior is all instances marked as degraded simultaneously, not oscillation.


Answer Key β€” Scenario 2​

Answer: B

The scenario presents an already identified cause and precise constraints. The correct action is to wait for the ongoing batch to complete (started 3 minutes ago) and make the change before the integrator's next call window, which occurs every 5 minutes. This minimizes impact on active connections without waiting for the maintenance window, which is unnecessary for a load balancing rule configuration change.

Alternative A ignores the critical constraint: the ongoing batch would be interrupted immediately, causing direct production impact unnecessarily.

Alternative C applies a correct process in the wrong context. Waiting until 02:00 AM prolongs the overload condition for hours when a natural low-impact window exists in less than 2 minutes.

Alternative D is technically invalid: it's not possible to have two load balancing rules for the same frontend IP and port simultaneously, and removing the active rule would interrupt traffic.


Answer Key β€” Scenario 3​

Answer: B

The "probe": null field in the command output is the direct and definitive evidence. In Azure Load Balancer Standard SKU, a load balancing rule without an associated health probe results in no backend pool instances being considered available to receive traffic. The status 0 of 2 instances are serving traffic confirms exactly this behavior.

The fact that VMs respond directly and the NSG allows traffic are relevant information to eliminate other hypotheses, but don't change the diagnosis, as the problem is at the Load Balancer configuration layer, not VM connectivity.

Alternative A represents the most dangerous diagnostic error: attributing the problem to provisioning time would lead to indefinite waiting without corrective action. The provisioningState: Succeeded already confirms provisioning completed.

Alternatives C and D are distractors that introduce non-existent restrictions: loadDistribution: Default has no relation to the number of instances, and the Tcp protocol is perfectly valid for HTTP traffic in Standard SKU.


Answer Key β€” Scenario 4​

Answer: C

The correct sequence starts from the most aggregated level and progressively advances to the most granular and specific level.

Step 1 (health status per instance) immediately reveals whether the problem is load balancing or application-related, establishing context. Step 4 (Health Probe Status metric) confirms if the Load Balancer is actively removing instances from the pool, which would explain random 502 errors. Step 3 (verify probe configuration) identifies if the probe is testing the correct endpoint. Step 5 (direct probe endpoint test) validates the hypothesis before any changes. Step 2 (application logs on VMs) is the most granular and should only be executed after confirming specific instances are failing.

Sequence D is the most plausible distractor but reverses the order of steps 4 and 1. Starting with aggregated Health Probe Status before checking individual instance status is less efficient, as oscillation in the aggregate would already be visible in the per-instance status.

Starting with application logs (alternative A) is the classic error of going to the deepest level without first confirming which layer the problem occurs in, consuming unnecessary time.


Troubleshooting Tree: Implement a load balancing rule​

100%
Scroll para zoom Β· Arraste para mover Β· πŸ“± Pinch para zoom no celular

Legend:

  • Dark blue: initial symptom or diagnostic entry point
  • Blue: diagnostic question with yes or no answer
  • Orange: intermediate verification or evidence collection
  • Red: identified cause
  • Green: recommended action or resolution

To use this tree when facing a real problem, start at the root node and answer each question based on what is directly observable in the portal, CLI, or Load Balancer metrics. Follow the path that corresponds to the found state, without skipping steps. When reaching a red node, the cause is identified; the green node immediately below indicates the corrective action. Resist the temptation to go directly to application logs before exhausting Load Balancer and probe configuration checks, as most failures in this objective originate from the configuration layer, not the application itself.