Troubleshooting Lab: Create and configure explicit outbound rules, including SNAT
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that VMs in a production environment lost outbound connectivity to external endpoints after a maintenance window the previous night. The environment uses a public Standard Load Balancer with explicit outbound rules that had been in place for months without issues. During maintenance, the infrastructure team added three new load balancing rules to expose new internal services via the existing frontend. No changes were made to the outbound rules, the frontend IPs, or the backend pool.
The on-call engineer runs the following command to check the current state:
az network lb outbound-rule list \
--resource-group rg-prod \
--lb-name lb-prod \
--output table
Name              Protocol    AllocatedOutboundPorts    EnableTcpReset    IdleTimeoutInMinutes
----------------  ----------  ------------------------  ----------------  ----------------------
outbound-rule-1   Tcp         1024                      True              4
The outbound rule is present and apparently intact. Load Balancer diagnostic logs show outbound connection attempts being silently dropped.
Additional information collected:
- The subnet NSG was not changed during maintenance
- The outbound rule's idleTimeoutInMinutes is 4, the minimum allowed value
- The new load balancing rules were created with the disableOutboundSnat property at its default value
What is the root cause of the outbound connectivity loss?
A. The allocatedOutboundPorts: 1024 value is insufficient for the current connection volume, causing SNAT exhaustion.
B. The new load balancing rules were created with disableOutboundSnat: false, reactivating automatic SNAT via inbound and creating conflict with the explicit outbound rule, which is now being ignored.
C. The idleTimeoutInMinutes with value 4 causes premature port release, leading to active connection drops.
D. The new load balancing rules created a frontend conflict, causing the Load Balancer to stop forwarding outbound traffic through the IP configured in the explicit outbound rule.
Scenario 2: Action Decision
The root cause has been identified: a NAT Gateway was mistakenly associated with the main subnet of the environment during a resource reorganization. The environment has a public Standard Load Balancer with explicit outbound rules configured and a dedicated public IP. Since the incorrect NAT Gateway association, all outbound traffic from VMs started using the NAT Gateway IP instead of the IP configured in the Load Balancer outbound rules.
Current constraints:
- The environment is in active production with 99.9% SLA
- External partners have firewall rules based on the Load Balancer public IP
- Removing the NAT Gateway causes outbound connectivity interruption for a few seconds during transition
- There is a scheduled maintenance window in 4 hours
- The security team needs to be notified before any network topology changes in production
What is the correct action to take at this moment?
A. Immediately remove the NAT Gateway from the subnet, since the cause is identified and every minute of traffic exiting through the wrong IP represents risk of blocking by partner firewalls.
B. Document the problem, notify the security team, and wait for the scheduled maintenance window to execute NAT Gateway removal with impact control.
C. Create a new public IP and associate it with the NAT Gateway so it starts using the same IP expected by partners, eliminating impact without need for immediate removal.
D. Add a UDR route in the subnet to force outbound traffic to ignore the NAT Gateway and return to the Load Balancer without need for maintenance window.
Scenario 3: Root Cause
A batch processing application executes thousands of short HTTP requests to an external API throughout each hour. The environment uses a Standard Load Balancer with an explicit outbound rule configured as follows:
{
"name": "outbound-batch",
"protocol": "Tcp",
"allocatedOutboundPorts": 2048,
"enableTcpReset": false,
"idleTimeoutInMinutes": 30,
"frontendIPConfigurations": [
{ "id": "/subscriptions/.../frontendIPConfigurations/pip-batch" }
],
"backendAddressPool": {
"id": "/subscriptions/.../backendAddressPools/pool-batch"
}
}
The backend pool has 8 instances. Failures begin occurring approximately 20 minutes after the start of each processing cycle. Application logs show connection timeout errors on new connections, while already established connections continue working normally. The network team confirms no changes were made to the Load Balancer configuration in the last 30 days. The subnet NSG allows all outbound traffic on port 443.
Additional information:
- Each instance processes an average of 400 simultaneous requests at peak
- The external API has an average latency of 800ms per request
- The external endpoint's TLS certificate was recently renewed
What is the root cause of the connection failures?
A. The external endpoint's TLS certificate renewal is causing handshake failures on new TCP connections after the 20th minute.
B. With idleTimeoutInMinutes: 30 and enableTcpReset: false, SNAT ports from previous connections remain reserved for 30 minutes even after connection termination, leading to gradual exhaustion of the 2048 port pool per instance throughout the cycle.
C. The allocatedOutboundPorts: 2048 is insufficient for 8 simultaneous instances, since the Load Balancer's total limit is divided equally and each instance receives less than configured.
D. The Tcp protocol in the outbound rule doesn't cover HTTPS requests on port 443, which require All protocol to work correctly.
Scenario 4: Diagnostic Sequence
An engineer receives the following alert: VMs in subnet-app cannot reach external endpoints on the internet. The environment has a public Standard Load Balancer with an outbound rule configured. No recent changes were logged in Change Management.
Available investigation steps are:
1. Check if there's a NAT Gateway associated with the subnet and which IP it uses
2. Confirm if backend pool VMs can resolve DNS for the external endpoint
3. Verify if the Load Balancer outbound rule is associated with the correct backend pool and if the frontend IP is active
4. Check if the subnet NSG has outbound deny rules for internet traffic
5. Validate if disableOutboundSnat is set to true in load balancing rules associated with the same backend pool
What is the correct investigation sequence?
A. 2, 1, 4, 3, 5
B. 4, 3, 5, 1, 2
C. 3, 5, 4, 1, 2
D. 1, 4, 3, 5, 2
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive clue is in the description: the new load balancing rules were created with the disableOutboundSnat property at its default value, which is false. When this property is false on load balancing rules that share a backend pool with an explicit outbound rule, the implicit SNAT provided by those load balancing rules conflicts with the outbound rule. Where the explicit outbound rule is meant to be solely responsible for outbound traffic, the default false value can prevent the Load Balancer from applying the outbound rule as expected for the affected pool, resulting in silent drops.
The information about the unchanged NSG is irrelevant and was included purposefully to divert focus. The idleTimeoutInMinutes: 4 (alternative C) doesn't cause active connection drops; it only controls the reservation time for idle ports. Alternative A describes SNAT exhaustion, which would be gradual and progressive, not an abrupt interruption coinciding with maintenance. Alternative D describes behavior that doesn't occur in Standard Load Balancer: load balancing rules don't interfere with frontend IP selection for outbound.
The most dangerous distractor is alternative A: an operator under pressure could add frontend IPs unnecessarily, without solving the real problem.
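The check implied by this explanation can be sketched in Python. This is a minimal sketch, assuming the output of `az network lb rule list --output json` has already been parsed into a list of dictionaries and that each rule exposes `name`, `backendAddressPool.id`, and `disableOutboundSnat`. The function name and the sample resource IDs are hypothetical.

```python
from typing import Dict, List


def rules_leaving_snat_enabled(lb_rules: List[Dict], backend_pool_id: str) -> List[str]:
    """Return names of load balancing rules on the given pool that do not disable implicit SNAT.

    Assumption: each rule dict carries 'name', 'backendAddressPool' ({'id': ...})
    and 'disableOutboundSnat'. A missing flag is treated as the default, False.
    """
    flagged = []
    for rule in lb_rules:
        pool = rule.get("backendAddressPool") or {}
        # Azure resource IDs are case-insensitive, so compare them lowercased.
        if pool.get("id", "").lower() != backend_pool_id.lower():
            continue
        if not rule.get("disableOutboundSnat", False):
            flagged.append(rule["name"])
    return flagged


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the scenario.
    sample = [
        {"name": "rule-new-1", "backendAddressPool": {"id": "/pools/prod"}, "disableOutboundSnat": False},
        {"name": "rule-old", "backendAddressPool": {"id": "/pools/prod"}, "disableOutboundSnat": True},
    ]
    print(rules_leaving_snat_enabled(sample, "/pools/prod"))
```

Any rule name returned here is a candidate for review against the outbound rule that serves the same pool.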
Answer Key: Scenario 2
Answer: B
The cause is identified and the solution is known, but the set of constraints defines the correct moment to act. There's a maintenance window in 4 hours, the security team needs to be notified before topology changes, and the interruption caused by NAT Gateway removal, though brief, occurs in active production under a 99.9% SLA.
Alternative A represents the technically correct action applied at the wrong time: it acts without notifying the security team and without using the available window, violating two operational processes stated explicitly in the scenario. Alternative C is unfeasible because a NAT Gateway uses its own IPs and cannot be reconfigured to use a Load Balancer's public IP. Alternative D is incorrect: UDRs don't override the behavior of a NAT Gateway directly associated with the subnet; the NAT Gateway takes precedence over outbound routes and cannot be bypassed via UDR within the same subnet.
The most dangerous distractor is alternative A, which represents poorly calibrated urgency: the cause was identified, the solution is simple, and acting immediately seems rational, but ignores process constraints that exist to protect the environment.
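The precedence described in this scenario can be illustrated with a small Python sketch. It assumes the output of `az network vnet subnet show --output json` has been parsed into a dictionary in which `natGateway` holds an `id` when a NAT Gateway is associated. The function names and the returned labels are hypothetical.

```python
from typing import Dict, Optional


def nat_gateway_id(subnet: Dict) -> Optional[str]:
    """Return the associated NAT Gateway resource ID, or None if there is none."""
    nat = subnet.get("natGateway")
    return nat.get("id") if nat else None


def outbound_path(subnet: Dict, has_outbound_rule: bool) -> str:
    """Name the outbound path the scenario describes for VMs in this subnet.

    A NAT Gateway on the subnet wins over Load Balancer outbound rules,
    which is why traffic in Scenario 2 leaves through the NAT Gateway IP.
    """
    if nat_gateway_id(subnet):
        return "nat-gateway"
    return "lb-outbound-rule" if has_outbound_rule else "no-explicit-outbound"
```

Running this before and after the maintenance window gives a quick confirmation that the subnet has returned to the Load Balancer outbound rule.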
Answer Key: Scenario 3
Answer: B
The scenario describes a clear pattern: failures in new connections after approximately 20 minutes of operation, while existing connections continue working. This behavior is characteristic of gradual SNAT port exhaustion, not an abrupt infrastructure failure.
The critical combination is: enableTcpReset: false with idleTimeoutInMinutes: 30. Without TCP Reset enabled, the Load Balancer doesn't send RST packets when terminating idle connections. This means SNAT ports from terminated connections remain reserved for the full 30-minute period. With 8 instances, 2048 ports per instance, and 400 simultaneous requests on average, the port pool gradually exhausts throughout the cycle, preventing new connections from approximately the 20th minute.
Alternative A is the distractor built on irrelevant information: the TLS certificate renewal was included purposefully, but has no causal relationship with failures that appear only after 20 minutes of normal operation. Alternative C incorrectly describes allocatedOutboundPorts behavior: the configured value is allocated per instance, not divided among instances. Alternative D is factually incorrect: the Tcp protocol in the outbound rule covers TCP connections on any port, including 443.
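The claim in alternative C can be checked with simple arithmetic. The sketch below assumes the documented figure of 64,000 SNAT ports per frontend public IP and takes the instance count and per-instance allocation from the scenario. The constant and function names are hypothetical.

```python
# Assumption: each frontend public IP contributes 64,000 SNAT ports.
PORTS_PER_FRONTEND_IP = 64_000


def required_frontend_ips(instances: int, ports_per_instance: int) -> int:
    """Frontend IPs needed so every instance gets its full per-instance allocation."""
    total_ports = instances * ports_per_instance
    # Ceiling division without floating point.
    return -(-total_ports // PORTS_PER_FRONTEND_IP)


def max_instances(ports_per_instance: int, frontend_ips: int = 1) -> int:
    """Largest backend pool the given frontend IPs can serve at this allocation."""
    return (frontend_ips * PORTS_PER_FRONTEND_IP) // ports_per_instance


if __name__ == "__main__":
    # Scenario 3: 8 instances, allocatedOutboundPorts = 2048, one frontend IP.
    print(8 * 2048)                        # 16384 ports requested in total
    print(required_frontend_ips(8, 2048))  # 1
    print(max_instances(2048))             # 31
```

Under that assumption, 8 instances at 2048 ports each need 16,384 ports, well within a single frontend IP, so each instance does receive the configured 2048 ports.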
Answer Key: Scenario 4
Answer: B
The correct sequence is: 4, 3, 5, 1, 2.
The correct diagnostic reasoning starts from the layer closest to the VM toward external layers, eliminating simple causes before investigating complex configurations.
Step 4 (check NSG) comes first because a denial rule in the NSG would block traffic regardless of any Load Balancer or NAT Gateway configuration. It's the simplest and most local cause to verify. Step 3 (check outbound rule and frontend) comes next, confirming if the Load Balancer configuration is intact. Step 5 (check disableOutboundSnat) complements step 3, as a conflict between inbound and outbound rules can explain silent drops even with the outbound rule present. Step 1 (NAT Gateway) comes after because its presence overrides everything previous and completely changes the diagnosis. Step 2 (DNS) comes last because DNS failure would produce different errors than connection timeout, being less likely given the described symptom.
Sequence A starts with DNS, which is unlikely as root cause of a generalized connection timeout. Sequence C starts with the outbound rule before checking NSG, skipping the simplest and most local cause. Sequence D starts with NAT Gateway, which is a valid hypothesis but less likely in the absence of documented recent changes.
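The ordering in answer B can be expressed as a short Python sketch. Each check is supplied by the caller as a function that returns True when it finds a problem. How each check is implemented (CLI calls, portal inspection) is left open, and all names below are hypothetical.

```python
from typing import Callable, Dict, Optional

# Step numbers follow the investigation list in Scenario 4; the order follows answer B.
DIAGNOSTIC_ORDER = [4, 3, 5, 1, 2]

STEP_LABELS = {
    1: "NAT Gateway associated with the subnet",
    2: "DNS resolution from backend pool VMs",
    3: "Outbound rule backend pool and frontend IP",
    4: "NSG outbound deny rules",
    5: "disableOutboundSnat on load balancing rules",
}


def run_diagnostics(checks: Dict[int, Callable[[], bool]]) -> Optional[int]:
    """Run the checks in answer-key order and stop at the first one that reports a problem.

    Returns the step number of the first failing check, or None if all pass.
    """
    for step in DIAGNOSTIC_ORDER:
        if checks[step]():
            return step
    return None
```

Stopping at the first finding mirrors the reasoning above: the simplest, most local cause is ruled out before the later steps are examined.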
Troubleshooting Tree: Create and configure explicit outbound rules, including SNAT
Color Legend:
| Color | Node Type |
|---|---|
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start with the root node that describes the observed symptom and answer each diagnostic question based on what is directly verifiable in the environment: NSG state, NAT Gateway presence, outbound rule existence and configuration, disableOutboundSnat value, backend pool composition, and temporal behavior of failures. Each answer eliminates a branch and leads to the next level, until the cause is identified and the corresponding action can be executed based on evidence, not assumption.