Troubleshooting Lab: Identify appropriate use cases for Azure NAT Gateway

Diagnostic Scenarios

Scenario 1 — Root Cause

An operations team reports that a set of production VMs can no longer establish new outbound connections to an external API. The VMs are in a subnet associated with a NAT Gateway that has two public IP addresses. The environment had been working without issues for six months.

The responsible engineer collects the following information:

The subnet contains 180 active VMs
Each VM maintains an average of 850 simultaneous outbound connections to the same external endpoint
The NAT Gateway has a Succeeded status in the Azure portal
A new Network Security Group was associated with the subnet three days ago, allowing outbound port 443
NAT Gateway metrics show the following counter growing continuously over recent hours:

Metric: SNATConnectionCount
State: Failed
Value: 38,420 (last 10 minutes)

What is the root cause of the observed connection failures?

A) The new NSG is blocking return traffic from established connections, as there's no corresponding inbound rule for asymmetric traffic.

B) The number of available SNAT ports has been exhausted: with 180 VMs and 850 simultaneous connections each, demand exceeds the total ports provided by the NAT Gateway's two public IPs.

C) The NAT Gateway doesn't support more than 100 VMs per subnet; the current configuration exceeds the documented service limit.

D) The external endpoint is blocking requests originating from multiple NAT Gateway IPs alternately, interpreting the behavior as suspicious traffic.

Scenario 2 — Action Decision

The root cause has been identified: the NAT Gateway of a critical workload subnet is associated with only one public IP address, and the volume of simultaneous outbound connections has reached the available SNAT port limit. The exhaustion is causing intermittent failures in calls to external payment services.

The operational context is as follows:

The environment is in active production with a 99.9% SLA
The payment partner enforces an IP allowlist; any new IP must be communicated 48 hours in advance
The network team has permission to modify the NAT Gateway without a maintenance window
There's a second public IP address already provisioned in the subscription, not yet associated with any resource
The business team is present and available to approve urgent communications to the partner

What is the correct action to take at this moment?

A) Replace the current NAT Gateway with a new one configured with two public IPs, performing the swap during low traffic hours.

B) Immediately associate the second public IP to the existing NAT Gateway without waiting, as IP addition doesn't interrupt existing traffic; notify the partner in parallel to update the allowlist.

C) Create a new subnet with a separate NAT Gateway and migrate half the VMs to distribute the SNAT port load between two resources.

D) Associate the second public IP to the NAT Gateway and wait for partner confirmation before activating any traffic through the new IP, keeping the current IP as the only one until the allowlist is updated.

Scenario 3 — Root Cause

An administrator configures a new environment and reports that a specific VM, called vm-app-01, is not using the subnet's NAT Gateway to access the internet, despite the subnet being correctly associated with the resource. Other VMs in the same subnet work normally through the NAT Gateway.

The administrator shares the following environment inventory:

Subnet: snet-app (10.2.0.0/24)
  NAT Gateway: natgw-prod (associated, status: Succeeded)
  NAT Gateway public IP: 20.10.5.80

VM: vm-app-01
  NIC: nic-app-01
  Private IP: 10.2.0.10
  Public IP assigned to NIC: 40.80.120.55

VM: vm-app-02
  NIC: nic-app-02
  Private IP: 10.2.0.11
  Public IP assigned to NIC: none

The administrator also reports that the subnet NSG was created a week ago and that the route table associated with the subnet has no custom routes.

What is the root cause of the observed behavior in vm-app-01?

A) The subnet NSG, recently created, is intercepting traffic from vm-app-01 before it reaches the NAT Gateway, diverting the outbound flow.

B) The route table without custom routes doesn't include an explicit route to the NAT Gateway, so vm-app-01 cannot reach the resource.

C) The public IP address directly assigned to vm-app-01's NIC takes precedence over the subnet's NAT Gateway, causing outbound traffic to use IP 40.80.120.55.

D) The NAT Gateway limits the number of private IPs served per subnet; since vm-app-01 was the last VM added, it was automatically excluded from the SNAT pool.

Scenario 4 — Diagnostic Sequence

An engineer receives the following report: "VMs in a specific subnet cannot access the internet. Other subnets in the same VNet work normally."

The engineer has access to the Azure portal and terminal. Below are five possible investigation steps, presented out of order:

[P1] Check if the NAT Gateway associated with the subnet has "Succeeded" status
     and has at least one public IP or IP prefix associated

[P2] Execute an outbound connectivity test from a VM in the subnet,
     pointing to a known external endpoint (e.g., curl -v https://example.com)

[P3] Confirm that the subnet in question is effectively associated
     with a NAT Gateway (and not another subnet)

[P4] Check if there's a UDR (User Defined Route) with route 0.0.0.0/0
     associated with the subnet that might be overriding the NAT Gateway path

[P5] Analyze the subnet NSG rules to confirm that outbound traffic
     on necessary ports is not being blocked

What is the correct investigation sequence?

A) P2 -> P3 -> P1 -> P5 -> P4

B) P3 -> P1 -> P4 -> P5 -> P2

C) P1 -> P3 -> P5 -> P4 -> P2

D) P4 -> P3 -> P1 -> P2 -> P5

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The calculation confirms the exhaustion: each NAT Gateway public IP provides 64,512 SNAT ports. With two IPs, the total available is 2 × 64,512 = 129,024 ports. With 180 VMs maintaining 850 simultaneous connections each to the same endpoint, demand is 180 × 850 = 153,000 connections, exceeding capacity by 23,976 ports (approximately 24,000). The SNATConnectionCount counter in the Failed state, growing continuously, is direct confirmation in the metrics.
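The arithmetic above can be reproduced with a short sketch. It assumes, as the scenario does, that every connection targets the same destination endpoint, so each simultaneous connection consumes its own SNAT port. The function names are illustrative and are not part of any Azure SDK.

```python
import math

# SNAT ports provided by each NAT Gateway public IP, as stated in the explanation.
PORTS_PER_PUBLIC_IP = 64_512


def snat_capacity(public_ips: int) -> int:
    """Total SNAT ports available across all attached public IPs."""
    return public_ips * PORTS_PER_PUBLIC_IP


def snat_demand(vms: int, connections_per_vm: int) -> int:
    """Simultaneous connections to a single destination endpoint."""
    return vms * connections_per_vm


def min_public_ips(demand: int) -> int:
    """Smallest number of public IPs whose combined ports cover the demand."""
    return math.ceil(demand / PORTS_PER_PUBLIC_IP)


capacity = snat_capacity(2)        # two public IPs on the NAT Gateway
demand = snat_demand(180, 850)     # 180 VMs, 850 connections each
shortfall = demand - capacity

print(f"capacity={capacity} demand={demand} shortfall={shortfall}")
print(f"public IPs needed: {min_public_ips(demand)}")
```

Under these assumptions the shortfall is 23,976 ports, and a third public IP would be the minimum needed to cover the stated demand.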

The information about the new NSG is the intentional irrelevant element of the scenario. The NSG allows outbound port 443, which is compatible with expected traffic. NSGs don't affect SNAT ports or NAT Gateway behavior. Including it forces the reader to resist temporal causality bias ("it happened after, so it was the cause").

Alternative A is the most dangerous distractor: NSGs are stateful, so return traffic for established outbound connections is allowed without a matching inbound rule, and NSG filtering has no relation to port exhaustion in any case. Alternative C invents a limit that doesn't exist in the NAT Gateway documentation. Alternative D attributes the failures to the external endpoint, which contradicts what the local metrics already show directly.

Answer Key — Scenario 2

Answer: D

The critical constraint of the scenario is the payment partner's IP allowlist. Adding a new IP to the NAT Gateway is an operation that doesn't interrupt existing traffic, but the new IP will immediately start being used by the NAT Gateway to balance outbound connections. If the partner hasn't updated the allowlist yet, connections exiting through the new IP will be rejected by the external side, which aggravates the problem instead of solving it.

Alternative B is technically correct regarding the NAT Gateway, but ignores the allowlist constraint, which is the central point of the scenario. In active production, acting without coordinating with the partner would mean introducing a second immediate failure point. Alternative A is unnecessarily disruptive: replacing the entire NAT Gateway causes unavailability when just adding an IP already solves the capacity problem. Alternative C introduces unnecessary architectural complexity and migration impact without technical justification.

The correct action is to bring the second IP into the NAT Gateway in step with the partner: outbound traffic should start using the new address only after the partner confirms the allowlist update.
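The allowlist constraint can be expressed as a simple pre-check, sketched below. The helper names are hypothetical, the addresses come from the 203.0.113.0/24 documentation range rather than from the scenario, and the partner's allowlist is assumed to be available as a plain list of addresses.

```python
from ipaddress import ip_address
from typing import List


def unlisted_ips(nat_public_ips: List[str], partner_allowlist: List[str]) -> List[str]:
    """Return the NAT Gateway IPs the partner would reject."""
    allowed = {ip_address(ip) for ip in partner_allowlist}
    return [ip for ip in nat_public_ips if ip_address(ip) not in allowed]


def safe_to_associate(current_ips: List[str], candidate_ip: str,
                      partner_allowlist: List[str]) -> bool:
    """True only if every IP the NAT Gateway would use is already allowlisted."""
    return not unlisted_ips([*current_ips, candidate_ip], partner_allowlist)


# Hypothetical addresses for illustration only.
current = ["203.0.113.10"]
candidate = "203.0.113.11"

allowlist_before = ["203.0.113.10"]
allowlist_after = ["203.0.113.10", "203.0.113.11"]

print(safe_to_associate(current, candidate, allowlist_before))  # partner not yet updated
print(safe_to_associate(current, candidate, allowlist_after))   # partner has confirmed
```

The check returns False until the partner's list contains the new address, which mirrors the reasoning that traffic leaving through an unlisted IP would be rejected on the partner's side.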

Answer Key — Scenario 3

Answer: C

The definitive clue is in the vm-app-01 NIC configuration: it has a public IP address (40.80.120.55) assigned directly. Azure applies a well-defined precedence order for outbound traffic: public IP on the NIC takes priority over the subnet's NAT Gateway. Therefore, vm-app-01 uses its own IP for outbound, not the NAT Gateway IP.

The behavior is not a failure. It's the expected functioning of Azure's outbound precedence.

The irrelevant elements are the recent NSG creation and the absence of custom routes in the route table. Neither is related to outbound IP precedence. The NSG plays no part in choosing which IP is used for SNAT. The absence of a UDR means default routing is active, which is consistent with the NAT Gateway working normally for the other VMs.

Alternative A is the distractor built on the temporally irrelevant element (the recent NSG). Alternative B confuses routing with address translation: the NAT Gateway doesn't require an explicit route in the route table to function. Alternative D invents an exclusion behavior that doesn't exist in the service.

Answer Key — Scenario 4

Answer: B

The correct diagnostic sequence follows the principle of validating from the most general to the most specific, eliminating structural hypotheses before testing behavior:

P3 confirms whether the subnet is actually associated with the NAT Gateway, because without this association no other investigation makes sense.

P1 validates that the NAT Gateway is operational and has at least one public IP, because a resource in a degraded state or without an IP won't provide outbound connectivity.

P4 checks whether a UDR is overriding the NAT Gateway's default path, which is a common and silent cause of outbound failure.

P5 confirms whether the NSG is blocking outbound traffic before it reaches the NAT Gateway.

P2 is the final validation test: it only makes sense to run it after confirming that the infrastructure is correctly configured, because otherwise the test result cannot distinguish between the possible causes.

Alternative A starts with the active test (P2), which only produces a symptom without diagnosis. Alternative C starts with NAT Gateway health before confirming if it's even associated. Alternative D starts with UDR, which is a specific hypothesis, skipping basic structural verification.
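The ordering in answer B can be sketched as a checklist that stops at the first check that fails. The finding keys and messages below are hypothetical labels for what the engineer records after each step; nothing here queries Azure.

```python
from typing import Dict, Optional, Tuple

# Checks in the order given by answer B: P3 -> P1 -> P4 -> P5 -> P2.
# Each entry: (step, hypothetical finding key, message if the check fails).
CHECKS = [
    ("P3", "subnet_associated_with_nat_gateway",
     "Subnet is not associated with a NAT Gateway"),
    ("P1", "nat_gateway_succeeded_with_public_ip",
     "NAT Gateway is not in Succeeded state or has no public IP or prefix"),
    ("P4", "no_overriding_default_udr",
     "A 0.0.0.0/0 UDR is overriding the NAT Gateway path"),
    ("P5", "nsg_allows_outbound",
     "NSG is blocking outbound traffic on the required ports"),
    ("P2", "connectivity_test_passed",
     "Outbound test fails although the configuration looks correct"),
]


def first_failed_check(findings: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Walk the checks in order and return the first one that fails.

    A finding that has not been recorded is treated as failed, so the
    sequence cannot skip ahead past an unverified step.
    """
    for step, key, message in CHECKS:
        if not findings.get(key, False):
            return step, message
    return None


# Example: association and NAT Gateway health are fine, but a UDR overrides the path.
findings = {
    "subnet_associated_with_nat_gateway": True,
    "nat_gateway_succeeded_with_public_ip": True,
    "no_overriding_default_udr": False,
}
print(first_failed_check(findings))
```

Treating unrecorded findings as failures is a design choice of this sketch: it enforces the rule that the active test (P2) is only meaningful once the structural checks before it have been confirmed.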

Troubleshooting Tree: Identify appropriate use cases for Azure NAT Gateway

Color Legend:

Color	Node Type
Dark Blue	Initial symptom (entry point)
Blue	Diagnostic question (binary or state decision)
Red	Identified cause
Green	Recommended action or resolution
Orange	Intermediate check or validation

To use this tree when facing a real problem, start at the root node describing the symptom and follow the branches answering each question based on what you observe in the environment. Blue questions require active verification in the portal, via CLI, or via metrics before advancing. When reaching a red node, you've identified the cause; the green node immediately connected to it indicates the correct action. Orange nodes signal points where additional evidence collection is needed before continuing diagnosis.
