Troubleshooting Lab: Choose between regional and cross-region load balancers

Diagnostic Scenarios

Scenario 1 - Root Cause

The infrastructure team reports that a newly created Cross-Region Load Balancer is not distributing traffic to any of the configured regions. The environment was provisioned via Azure Portal and the team claims all resources were created correctly.

Upon investigation, you collect the following information:

Cross-Region Load Balancer
SKU: Standard
Frontend IP: 20.10.55.100 (Global, public)
Status: Successfully provisioned

Backend Pool
Member 1: lb-eastus-prod (East US) - IP: 20.20.10.5
Member 2: lb-brazilsouth (Brazil South) - IP: 10.0.1.4

Health Probe
Protocol: TCP
Port: 443
East US Status: Healthy
Brazil South Status: Unhealthy (no response)

Load Balancing Rules
Frontend port: 443
Backend port: 443
Session persistence: None

The team mentions that the TLS certificate for the Brazil South Load Balancer was renewed two days ago, but the service is responding normally when accessed directly by the regional public IP.

What is the root cause of the observed problem?

A) The Cross-Region Load Balancer has a misconfigured load balancing rule, as the frontend port must be different from the backend port for global traffic

B) The lb-brazilsouth member is configured with an internal IP (10.0.1.4) in the Cross-Region Load Balancer's backend pool, which is not supported; only public frontend IPs from Standard regional Load Balancers are accepted

C) The TLS certificate renewal caused a temporary interruption that has not yet been propagated to the Cross-Region Load Balancer's health probe

D) The Cross-Region Load Balancer requires both backend members to be in the same availability zone to operate correctly


Scenario 2 - Action Decision

The operations team identified the root cause of a production incident: a Standard regional Load Balancer serving as a backend for a Cross-Region Load Balancer was accidentally migrated to Basic SKU during a maintenance window. Since the migration, the Cross-Region Load Balancer stopped accepting that member in the backend pool and global traffic is being entirely directed to the only remaining region.

The current context is:

  • SKU migration from Basic to Standard requires recreating the resource; it is not an in-place operation
  • The affected region (West Europe) still has healthy and responding backend VMs
  • The Load Balancer that was recreated as Basic has not been added to the Cross-Region Load Balancer
  • There is an approved maintenance window in 4 hours
  • The only active region (East US) is operating at 85% capacity and can sustain full load for up to 6 hours without degradation

What is the correct action to take at this moment?

A) Immediately recreate the Standard Load Balancer in West Europe and add it to the Cross-Region Load Balancer pool, without waiting for the maintenance window, as the risk of East US overload justifies the urgency

B) Wait for the approved maintenance window in 4 hours, recreate the Standard Load Balancer in West Europe, validate health probes, and then add it to the Cross-Region Load Balancer pool

C) Change the existing Basic Load Balancer SKU to Standard directly in the portal, without recreation, taking advantage of the fact that the maintenance window hasn't started yet

D) Remove East US from the Cross-Region Load Balancer pool to force complete failover to West Europe while the problem is resolved


Scenario 3 - Root Cause

A company operates a critical application with the following network design:

        [External users]
               |
  [Cross-Region Load Balancer] - Global IP: 52.100.200.1
               |
         +-----+-----+
         |           |
  [Regional LB   [Regional LB
   East US]       Southeast Asia]
         |           |
  [VMs East US]  [VMs Southeast Asia]

Users from São Paulo report high latency and occasional refused connections over the last 3 hours. Users from London report no issues. The monitoring team shares the following data:

Metrics β€” Cross-Region Load Balancer (last hour)
Health probe status East US: Healthy
Health probe status Southeast Asia: Healthy
Outbound bytes East US: 12.4 GB
Outbound bytes Southeast Asia: 0.02 GB
Packet count East US: high
Packet count Southeast Asia: minimal

Logged event β€” 3 hours ago:
"Frontend IP configuration updated β€” Southeast Asia regional LB"
Type: frontend IP address change

The team argues that the problem must be at the application layer of the Southeast Asia VMs, since health probes are returning Healthy for both regions.

What is the root cause of the observed problem?

A) The Southeast Asia VMs are experiencing CPU overload, which explains the low traffic volume to the region; the health probes still pass because they operate on a different port than the application

B) The frontend IP of the Southeast Asia regional Load Balancer was changed, making the address registered in the Cross-Region Load Balancer's backend pool invalid; traffic is being sent almost entirely to East US, increasing latency for São Paulo users

C) The Cross-Region Load Balancer has the distribution algorithm prioritizing East US due to geographical proximity of anycast servers, which is expected behavior that requires no intervention

D) The Cross-Region Load Balancer health probes are configured with incorrect protocol and return false positive Healthy for Southeast Asia, while the application is effectively inactive


Scenario 4 - Diagnostic Sequence

An engineer receives the following report: "The Cross-Region Load Balancer is accepting connections, but some users from a specific region are receiving responses with latency much higher than expected. There are no active health probe alerts."

The available investigation steps are:

  1. Verify that the frontend IP registered for each regional Load Balancer in the Cross-Region Load Balancer's backend pool matches that regional Load Balancer's current public IP
  2. Analyze Cross-Region Load Balancer traffic distribution metrics by backend region to identify imbalance
  3. Confirm that regional Load Balancers added to the pool have Standard SKU and public IP, not Basic or internal
  4. Check network latency between anycast presence points and regional backends using Network Watcher
  5. Analyze CPU and throughput metrics on backend VMs in the high-latency region

What is the correct investigation sequence?

A) 3 → 1 → 2 → 4 → 5

B) 2 → 1 → 3 → 5 → 4

C) 5 → 4 → 2 → 1 → 3

D) 1 → 3 → 4 → 5 → 2


Answer Key and Explanations

Answer Key - Scenario 1

Answer: B

The decisive clue is in the second backend pool member's IP address: 10.0.1.4 is a private IP address (RFC 1918), not a public IP. The Cross-Region Load Balancer exclusively accepts public frontend IPs from Standard regional Load Balancers as backend pool members. An internal IP address invalidates the member and prevents any traffic distribution to that entry.
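The private-address check behind this answer can be sketched in a few lines of Python. This is only an illustration using the standard `ipaddress` module; the helper name `is_rfc1918` and the dictionary layout are assumptions, and the member names and addresses are the ones listed in Scenario 1.

```python
import ipaddress

# The three RFC 1918 private ranges. An address inside any of them
# cannot be the public frontend IP of a regional Load Balancer.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside an RFC 1918 private range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


# Backend pool members as listed in Scenario 1.
backend_members = {
    "lb-eastus-prod": "20.20.10.5",
    "lb-brazilsouth": "10.0.1.4",
}

# Members whose registered address is private and therefore unsupported.
invalid_members = [
    name for name, ip in backend_members.items() if is_rfc1918(ip)
]
print(invalid_members)  # ['lb-brazilsouth']
```

Running the same check against every pool member before looking at certificates or probes points straight at the structural problem.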

The information about TLS certificate renewal is intentionally irrelevant. The fact that the service responds directly via regional IP confirms the application is healthy, eliminating any hypothesis of certificate-related failure. Focusing on this information is the diagnostic error that distractor C exploits.

Alternative D is a non-existent restriction: the Cross-Region Load Balancer operates between distinct regions by design and imposes no availability zone requirements between members.

The most dangerous distractor is C, because it leads the team to investigate certificate renewal and configuration propagation, wasting time while the real problem is structural and immediate.


Answer Key - Scenario 2

Answer: B

The scenario presents every condition needed to wait for the maintenance window: East US can sustain full load for up to 6 hours without degradation, the window opens in 4 hours, and recreating the Standard Load Balancer is a destructive operation that should be planned. Acting before the window without real necessity exposes production to configuration errors at a time when the support planned for the window is not yet in place.

Alternative A disregards the critical constraint that East US has sufficient capacity for the necessary time; the invoked urgency is not real given the numerical context provided. Alternative C is factually incorrect: Basic to Standard SKU migration is not possible in-place and would result in operation failure. Alternative D would cause complete service interruption, since West Europe lacks a functional Standard Load Balancer in the pool.

The correct reasoning here is: validate if time and capacity constraints allow the controlled approach before assuming urgency.
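As a rough sketch, the time-versus-capacity check can be expressed as a single comparison. The function name and the `work_hours` parameter (time needed to recreate and validate the Load Balancer once the window opens) are assumptions made for illustration; the scenario itself only states the 4-hour wait and the 6-hour capacity figure.

```python
def can_wait_for_window(
    hours_until_window: float,
    sustainable_hours: float,
    work_hours: float = 0.0,
) -> bool:
    """Return True if the remaining region can carry full load until the
    planned work is finished inside the maintenance window.

    work_hours is a hypothetical estimate of how long the recreation and
    validation take after the window opens.
    """
    return hours_until_window + work_hours <= sustainable_hours


# Figures from Scenario 2: window in 4 hours, East US sustains 6 hours.
print(can_wait_for_window(4, 6))                # True
# With an assumed 1 hour of work inside the window there is still margin.
print(can_wait_for_window(4, 6, work_hours=1))  # True
# With an assumed 3 hours of work, waiting would no longer be safe.
print(can_wait_for_window(4, 6, work_hours=3))  # False
```

The third call shows when alternative A would become the better choice: only if the wait plus the work exceeded what East US can sustain.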


Answer Key - Scenario 3

Answer: B

The root cause is in the event logged 3 hours ago: the Southeast Asia regional Load Balancer's frontend IP was changed. The Cross-Region Load Balancer stores a reference to the regional frontend IP as it was when the member was added to the pool. When that IP changes, the reference becomes invalid. Traffic destined for Southeast Asia fails silently or is dropped, so the Cross-Region Load Balancer concentrates nearly all traffic on East US.
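The stale-reference check described here amounts to comparing two mappings: the frontend IP recorded for each pool member and the IP each regional Load Balancer currently exposes. The sketch below is a minimal illustration; the function name, the member names, and the addresses (taken from the RFC 5737 documentation ranges) are all hypothetical, since the scenario does not list the regional IPs.

```python
def find_stale_members(
    registered: dict[str, str],
    current: dict[str, str],
) -> list[str]:
    """Return pool members whose registered frontend IP no longer matches
    the regional Load Balancer's current frontend IP (or is missing)."""
    return [
        name
        for name, registered_ip in registered.items()
        if current.get(name) != registered_ip
    ]


# Hypothetical values: what the cross-region pool recorded per member.
registered_ips = {
    "lb-eastus": "203.0.113.10",
    "lb-southeastasia": "198.51.100.20",
}

# Hypothetical values: what each regional Load Balancer exposes now.
current_ips = {
    "lb-eastus": "203.0.113.10",
    "lb-southeastasia": "198.51.100.77",  # changed 3 hours ago
}

print(find_stale_members(registered_ips, current_ips))
# ['lb-southeastasia']
```

In practice both mappings would be filled from the portal or CLI output rather than typed by hand.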

Cross-Region Load Balancer health probes check the regional backend endpoint, but if the probe still reaches some valid IP in the chain, it can return Healthy even with compromised distribution. This explains why probes pass while routing behavior is wrong.

The team's claim that the problem lies in the application layer of the Southeast Asia VMs is the distractor represented by alternative A. It is the classic diagnostic error of focusing on the component closest to the visible failure instead of investigating the configuration chain.

Alternative C describes behavior as if it were expected, which would disorient the team and prematurely end the investigation.


Answer Key - Scenario 4

Answer: B

The correct sequence is 2 → 1 → 3 → 5 → 4.

The reasoning starts from the most directly observable symptom: analyzing traffic distribution metrics (step 2) immediately confirms whether there is an imbalance between regions, locating the problem before investigating causes. Next, verifying that the regional frontend IPs in the pool are correct (step 1) identifies the most common cause of silent imbalance. Confirming SKU and IP type (step 3) validates the structural compatibility of the members. Analyzing backend VM metrics (step 5) rules out overload at the compute layer. Finally, using Network Watcher for anycast latency (step 4) is the most granular and costly step, reserved for when the more common causes have been eliminated.
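Step 2 can be illustrated with a small calculation of each region's share of outbound traffic. This is a sketch only: the function names and the 5% threshold are assumptions, and the byte counts are borrowed from the Scenario 3 metrics to give the example concrete numbers.

```python
def traffic_share(bytes_by_region: dict[str, float]) -> dict[str, float]:
    """Return each region's fraction of the total outbound traffic."""
    total = sum(bytes_by_region.values())
    if total == 0:
        return {region: 0.0 for region in bytes_by_region}
    return {region: value / total for region, value in bytes_by_region.items()}


def underused_regions(
    bytes_by_region: dict[str, float],
    min_share: float = 0.05,  # hypothetical threshold, tune per workload
) -> list[str]:
    """Return regions whose share of traffic falls below min_share."""
    shares = traffic_share(bytes_by_region)
    return [region for region, share in shares.items() if share < min_share]


# Outbound traffic in GB, taken from the Scenario 3 metrics.
outbound_gb = {"East US": 12.4, "Southeast Asia": 0.02}

print(underused_regions(outbound_gb))  # ['Southeast Asia']
```

A region that reports Healthy but carries almost no traffic is the signal to move on to step 1 and compare the registered frontend IPs.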

Alternative A starts with SKU validation, which is a valid structural verification but less efficient as a first step when the symptom already points to routing imbalance. Alternative C starts at VMs, inverting the outside-in diagnostic logic. Alternative D starts with IP verification without first confirming imbalance exists, which might lead the engineer to investigate the right cause for the wrong reason.


Troubleshooting Tree: Choose between regional and cross-region load balancers

Color Legend:

Color | Node Type
Dark blue | Initial symptom (entry point)
Blue | Diagnostic question
Red | Identified cause
Green | Recommended action or resolution
Orange | Validation or intermediate verification

To use this tree when facing a real problem, start with the root node describing the observed symptom. At each question node, answer based on what you can verify directly in the portal or via CLI, and follow the corresponding path. Red nodes indicate where to stop and confirm the cause before acting. Green nodes indicate the precise corrective action for that cause. Orange nodes indicate that an action has been taken and that you must validate whether behavior has returned to normal before ending the diagnosis.