Troubleshooting Lab: Implement Gateway Load Balancer
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that, after chaining the Standard SKU public IP of a production VM to a newly created Gateway Load Balancer, all incoming traffic stopped reaching the VM. Before the chaining, the VM responded normally. The Gateway Load Balancer has a backend pool with two NVAs and a configured load balancing rule.
The operator runs the following tests:
# Connectivity test from external client
curl -v --connect-timeout 10 http://<vm-public-ip>
# Result: Connection timed out

# Check NVA status in Gateway LB backend pool
az network lb show \
  --name GatewayLB-Prod \
  --resource-group rg-networking \
  --query "backendAddressPools[].loadBalancerBackendAddresses[].name"
# Result: ["nva-1-nic", "nva-2-nic"]

# Health probe verification
az network lb probe show \
  --lb-name GatewayLB-Prod \
  --resource-group rg-networking \
  --name probe-nva
# Result: protocol: TCP, port: 80, intervalInSeconds: 15, numberOfProbes: 2
The team also reports that both NVAs run the inspection service on port 443, and that the network security group associated with the NVA NICs allows all inbound and outbound traffic.
What is the root cause of the traffic interruption?
A) Direct VM public IP chaining is not supported by Gateway Load Balancer; the consumer resource needs to be a Standard Load Balancer.
B) The health probe is configured to check TCP port 80, while the NVAs only listen on port 443, causing both to be marked as unhealthy and no traffic to be delivered.
C) The NVA network security group is blocking VXLAN-GPE encapsulated traffic, as rules that allow "all traffic" do not apply to tunneling protocols.
D) The Gateway Load Balancer load balancing rule is not correctly associated with the frontend IP, as the Azure portal requires an additional "Apply" step after chaining.
Scenario 2: Action Decision
The team has already identified the cause of the following problem: the tunnel interfaces of the NVAs in the Gateway Load Balancer backend pool are configured with the wrong type. The interface type is set to External for both client-originated traffic and return traffic, whereas the return interface should be set to Internal.
The environment is production, and it is peak hours. The correction requires administrative access to the NVAs, which are third-party appliances managed by an external vendor. A change ticket has been opened, but the approved maintenance window does not open until early the next morning. Production traffic is passing through the Gateway Load Balancer with partial inspection, without total service interruption.
The team considers the following immediate actions:
A) Immediately remove the chaining between the VM public IP and the Gateway Load Balancer to restore direct traffic flow without inspection, waiting for the maintenance window to correct the NVA configuration.
B) Request emergency access from the NVA vendor and correct the tunnel interface configuration now, outside the approved maintenance window, to ensure correct inspection immediately.
C) Keep the chaining active and wait for the approved maintenance window, documenting the current behavior and monitoring traffic to detect any service deterioration.
D) Create a second Gateway Load Balancer with the correct configurations and redirect the chaining to the new resource, eliminating the downtime of the correction.
Scenario 3: Root Cause
An architect analyzes complaints from users accessing an application published behind a public Standard Load Balancer, which is chained to a Gateway Load Balancer with three NVAs in the backend pool. Users report that long sessions uploading large files are intermittently interrupted, while short requests work perfectly.
The architect collects the following data:
Gateway Load Balancer load balancing rule:
Protocol: All
Session persistence: None
Idle timeout: 4 minutes (default value)
NVA configuration:
Type: Third-party stateful firewall
Session state synchronization between NVAs: Disabled
Uptime: 99.97% in the last 30 days
Health probes:
All 3 NVAs: Healthy
Interval: 5 seconds
Threshold: 2 consecutive failures
The architect also notes that interrupted uploads always occur with files above 500 MB and that the underlying network infrastructure underwent a firmware update on switches two weeks ago, coinciding with the start of complaints.
What is the root cause of the intermittent interruptions?
A) The switch firmware update introduced packet fragmentation for large files, causing drops before even reaching the Gateway Load Balancer.
B) The idle timeout of 4 minutes is terminating long upload connections before they complete, as uploads of large files exceed this limit without sufficient activity.
C) The absence of session persistence causes packets from the same upload session to be distributed among different NVAs, and as state synchronization between them is disabled, the session is dropped by the NVA that receives packets without context.
D) The health probe with a 5-second interval and threshold of 2 failures is removing and reinserting NVAs from the pool frequently enough to interrupt long sessions during the verification cycle.
Scenario 4: Diagnostic Sequence
An engineer receives an alert: incoming traffic to a public service chained with a Gateway Load Balancer is reaching the destination VM without passing through the inspection NVAs. The chaining was configured three days ago and worked correctly until this morning.
The following investigation steps are available, but out of order:
1. Verify if the Gateway Load Balancer frontend IP is still referenced in the consumer resource configuration (VM public IP or Standard Load Balancer).
2. Analyze Network Watcher flow logs to confirm if traffic is bypassing the NVAs or if the NVAs are forwarding without inspecting.
3. Check the health status of Gateway Load Balancer backend pool members.
4. Confirm if there were any recent changes to the consumer resource configuration, such as public IP recreation or SKU change.
5. Test direct connectivity to NVAs to verify they are active and responding to the health probe.
Which sequence represents the most efficient diagnostic reasoning, from most comprehensive to most specific?
A) 2 → 4 → 1 → 3 → 5
B) 1 → 4 → 3 → 5 → 2
C) 3 → 5 → 1 → 4 → 2
D) 4 → 1 → 2 → 5 → 3
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The health probe configured on TCP port 80 attempts to establish a connection with the NVAs on this port every 15 seconds. Since the NVAs only listen on port 443, the connections are refused or time out. After 2 consecutive failed attempts (the configured threshold), both NVAs are marked as unhealthy and removed from the active backend pool. With no healthy members left, the Gateway Load Balancer has nowhere to forward traffic, so every client connection times out.
The direct clue in the scenario is the combination of port: 80 in the probe and the fact that the NVAs listen on port 443. This discrepancy is sufficient to completely explain the symptom.
The information about the NSG allowing "all traffic" is deliberately irrelevant: even if the NSG blocked VXLAN encapsulation, the symptom would be different and wouldn't depend on which port the probe uses. Alternative A is false: chaining with a VM public IP is explicitly supported. Alternative D describes a portal flow that doesn't exist. The most dangerous distractor is alternative C, which plausibly invokes the VXLAN mechanism, but the NSG has already been described as permissive, eliminating this hypothesis with the available data.
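The probe reasoning above can be sketched as a small shell check. This is an illustrative model only: it does not call Azure, the function names `probe_matches_listener` and `detection_seconds` are hypothetical, and the detection time is treated as approximately the probe interval multiplied by the failure threshold.

```shell
#!/usr/bin/env bash
# Sketch only: models the probe logic from Scenario 1; it does not call Azure.

# Succeeds (exit 0) when the probe port is among the ports the NVA listens on.
# Usage: probe_matches_listener <probe_port> <listen_port>...
probe_matches_listener() {
  local probe_port=$1
  shift
  local port
  for port in "$@"; do
    if [ "$port" -eq "$probe_port" ]; then
      return 0
    fi
  done
  return 1
}

# Approximate seconds until a non-responding backend is marked unhealthy:
# probe interval multiplied by the consecutive-failure threshold.
# Usage: detection_seconds <interval_seconds> <failure_threshold>
detection_seconds() {
  echo $(( $1 * $2 ))
}

# Values from Scenario 1: probe on TCP 80, NVAs listening on 443 only.
if probe_matches_listener 80 443; then
  echo "probe port matches a listener"
else
  echo "mismatch: NVAs unhealthy after about $(detection_seconds 15 2)s"
fi
```

With the scenario's values, the check takes the mismatch branch and reports roughly 30 seconds, which fits traffic stopping almost immediately after the chaining.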
Answer Key: Scenario 2
Answer: C
The scenario explicitly states that the service is not totally interrupted: traffic flows with partial inspection. There is no immediate risk that justifies actions outside the approved change process. The correct action is to maintain the current state, actively monitor, and wait for the maintenance window, which opens early the following morning.
Alternative A would be correct if there were risk of total interruption or critical security exposure, but the scenario doesn't describe this level of impact. Removing the chaining would completely eliminate inspection, which could be worse than the current partial inspection depending on the security policy. Alternative B ignores the explicit restriction that the NVAs are managed by an external vendor and that there's an already approved maintenance window. Acting outside this window without authorization could violate contracts and governance processes. Alternative D is technically creative, but would create a new resource in production without planning, introducing unnecessary risk during peak hours, and doesn't solve the original problem in the existing NVAs.
Answer Key: Scenario 3
Answer: C
The key to diagnosis lies in the combination of two stated facts: session persistence disabled and state synchronization between NVAs disabled. Without session persistence, the Gateway Load Balancer distributes packets from the same TCP connection among different NVAs using the 5-tuple hash, which can vary for long-duration flows or with multiple packets. When a packet arrives at an NVA that doesn't have the initial session state, the stateful firewall drops the packet for not recognizing the connection, breaking the upload.
The fact that only large files are affected confirms this hypothesis: short sessions complete before load balancing redistributes packets to a different NVA, while long sessions have a higher probability of suffering redistribution over time.
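The failure mode described above can be modeled with a toy shell function. It is a sketch under the scenario's stated premise that packets of one session can land on different NVAs. The function name `first_dropped_packet` and the NVA labels are hypothetical, and the model covers only the state check of a stateful firewall that has no synchronization with its peers.

```shell
#!/usr/bin/env bash
# Sketch only: a toy model of unsynchronized stateful NVAs, not a real firewall.

# Prints the 1-based index of the first packet that lands on an NVA other than
# the one holding the session state, or "none" if every packet hits the owner.
# Usage: first_dropped_packet <owner_nva> <nva_for_packet_1> <nva_for_packet_2>...
first_dropped_packet() {
  local owner=$1
  shift
  local index=0 nva
  for nva in "$@"; do
    index=$((index + 1))
    if [ "$nva" != "$owner" ]; then
      # This NVA never saw the connection setup, so it drops the packet.
      echo "$index"
      return 0
    fi
  done
  echo "none"
}

first_dropped_packet nva-1 nva-1 nva-1 nva-1        # short request, prints: none
first_dropped_packet nva-1 nva-1 nva-1 nva-3 nva-1  # long upload, prints: 3
```

The longer the packet sequence, the more opportunities there are for one packet to reach an NVA without the session state, which is why only long uploads are affected in the scenario.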
The information about the switch firmware update is the scenario's red herring. Although the temporal coincidence is suggestive, the scenario provides no evidence of fragmentation or dropping at the switch level, and the cause explained by alternative C completely justifies the symptom without needing this element. The most dangerous distractor is alternative B: the 4-minute idle timeout is plausible, but active uploads of large files generate continuous traffic, which prevents the idle timer from expiring during active transfer.
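The distinction between total transfer time and idle time can be illustrated with a short shell sketch. It assumes a simplified model in which the idle timer resets on every packet, so only the silent gaps between packets matter. The function name `idle_timeout_expires` and the gap values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: simplified idle-timer model with whole-second gaps.

# Prints "yes" if any silent gap between packets reaches the idle timeout,
# otherwise "no". The total duration of the transfer is irrelevant.
# Usage: idle_timeout_expires <timeout_seconds> <gap_seconds>...
idle_timeout_expires() {
  local timeout=$1
  shift
  local gap
  for gap in "$@"; do
    if [ "$gap" -ge "$timeout" ]; then
      echo "yes"
      return 0
    fi
  done
  echo "no"
}

# 4-minute idle timeout = 240 seconds.
idle_timeout_expires 240 1 1 2 1   # active upload with short gaps, prints: no
idle_timeout_expires 240 5 300     # connection left silent for 300 s, prints: yes
```

An upload that runs for an hour but never pauses for 240 seconds would not trip the timer in this model, which is why alternative B does not explain the symptom.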
Answer Key: Scenario 4
Answer: B
The correct sequence is 1 → 4 → 3 → 5 → 2, which follows diagnostic reasoning from most comprehensive and least costly to most specific and confirmatory.
Step 1 first verifies if the chaining still exists: if the Gateway Load Balancer frontend IP is no longer referenced in the consumer resource, the bypass is explained immediately without needing further investigation. Step 4 investigates if there was a recent change (public IP recreation or SKU change) that could have silently undone the chaining. Step 3 verifies if the backend pool has healthy members. Step 5 tests the NVAs directly to confirm probe responsiveness. Step 2 is the most costly and granular (flow log analysis) and only makes sense after eliminating the previous structural causes.
Sequence A starts with log analysis, which is the most detailed but also the most time-consuming tool and assumes the problem is in data flow, not chaining configuration. Sequence C starts with health probe status, which is relevant but doesn't explain a complete bypass if the chaining is intact. Sequence D starts investigating recent changes before confirming if the chaining still exists, skipping a faster step to verify the current state.
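The short-circuit character of this ordering can be sketched in shell. The example assumes each check has already been reduced to an ok or fail result by the engineer. The function name `first_explaining_step` and the sample results are hypothetical, and no Azure commands are run.

```shell
#!/usr/bin/env bash
# Sketch only: walks pre-collected check results in the order they are given.

# Each argument is "<step number>:<ok|fail>", listed in the order to run.
# Prints the first step whose result explains the bypass, so that later and
# more costly steps (such as flow log analysis) are skipped.
# Usage: first_explaining_step <step:result>...
first_explaining_step() {
  local entry step result
  for entry in "$@"; do
    step=${entry%%:*}    # text before the first colon
    result=${entry#*:}   # text after the first colon
    if [ "$result" = "fail" ]; then
      echo "stop at step $step"
      return 0
    fi
  done
  echo "no structural cause found"
}

# Order from answer B: 1, 4, 3, 5, 2
first_explaining_step 1:ok 4:fail 3:ok 5:ok 2:ok   # prints: stop at step 4
```

If every check passes, the function reports that no structural cause was found. In the sequence above, that is the point at which flow log analysis (step 2) has already served as the final confirmation.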
Troubleshooting Tree: Implement Gateway Load Balancer
Color Legend:
| Color | Node Type |
|---|---|
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Green | Recommended action or resolution |
| Red | Identified cause |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start with the root node that describes the observed symptom and answer each question based on what you can directly verify in the environment. Follow the path corresponding to your answer until you reach a red node (identified cause) or green node (recommended action). Orange nodes indicate points where it's necessary to collect more data before proceeding. Don't skip steps: the diagnostic value lies in the progressive elimination of hypotheses, not in intuition about which cause seems most likely.