
Troubleshooting Lab: Create and configure inbound NAT rules

Diagnostic Scenarios

Scenario 1: Root Cause

The operations team reports that RDP connections to a specific VM stopped working after a maintenance window. The environment has a Standard Load Balancer with public IP 20.30.10.50. The inbound NAT rule for the VM is listed as successfully provisioned in the portal.

Information gathered during the investigation:

Frontend IP:        20.30.10.50
Frontend port:      50010
Backend port:       3389
Protocol:           TCP
Floating IP:        Disabled
Target NIC:         nic-vm-prod-03
Provisioning state: Succeeded

The administrator also reports that during maintenance, a new load balancing rule was created on port 443 for a different backend pool, and the web server's TLS certificate was successfully renewed. The vm-prod-03 VM has Running status and responds to internal pings.

Additional tests performed from a VM within the same VNet:

# Executed from vm-jumpbox (10.0.1.10)
Test-NetConnection -ComputerName 10.0.2.15 -Port 3389

ComputerName     : 10.0.2.15
RemotePort       : 3389
TcpTestSucceeded : True

What is the root cause of the RDP access failure via public IP?

A) The load balancing rule created on port 443 generated a frontend IP conflict that disabled the existing NAT rule
B) The NSG associated with the VM's subnet or NIC started blocking TCP connections on port 3389 originating from outside the VNet
C) The TLS certificate renewal restarted the RDP service on the VM, leaving port 3389 temporarily unavailable
D) The public IP 20.30.10.50 was disassociated from the Load Balancer's frontend IP configuration during maintenance


Scenario 2: Root Cause

An engineer configures an inbound NAT rule pool on a Standard Load Balancer to allow SSH access to instances of a Virtual Machine Scale Set. After deployment, they attempt to connect and receive the following result:

$ ssh adminuser@40.90.22.11 -p 50000
ssh: connect to host 40.90.22.11 port 50000: Connection timed out

$ ssh adminuser@40.90.22.11 -p 50001
ssh: connect to host 40.90.22.11 port 50001: Connection timed out

All tested ports result in timeout. The engineer checks the NAT rule pool status in the portal and finds:

NAT rule pool name:  nat-pool-ssh
Frontend port start: 50000
Backend port:        22
Protocol:            TCP
Backend pool:        vmss-be-pool
Provisioning state:  Succeeded
Instances mapped:    3

The engineer also confirms that all three VMSS instances are in Running state and that an HTTP health probe on port 80 is returning Healthy for all of them. The SSH service is active and listening on port 22 on all instances, confirmed via serial console.

What is the root cause of the timeout on all ports?

A) The health probe is configured for HTTP/80, but should be on TCP/22 for the NAT rule pool to function correctly
B) The NAT rule pool requires VMSS instances to have individual public IPs in addition to port mapping
C) The NSG associated with the VMSS doesn't have an inbound rule allowing TCP on backend port 22 from the clients' source addresses
D) The frontend port start 50000 is above the limit supported by NAT rule pools on Standard Load Balancer


Scenario 3: Action Decision

During an audit, the security team identifies that the inbound NAT rule of a critical production VM has Floating IP enabled, but the application running on this VM was never configured to listen on the Load Balancer frontend IP. The VM processes real-time financial transactions and cannot tolerate any interruption during business hours (07:00 to 19:00). It's 14:30 on a Monday.

The team confirms that, despite the incorrect Floating IP configuration, connections currently work because the VM's NIC has a secondary IP address that coincidentally matches the frontend IP, the result of an undocumented legacy configuration.

The cause is identified: the Floating IP configuration is incorrect for the current usage model and needs to be corrected. What is the correct action at this time?

A) Disable Floating IP on the NAT rule immediately, as the correction doesn't require VM restart and the risk of impact is low
B) Remove the legacy secondary IP from the VM's NIC now to force traffic to follow the correct path without Floating IP
C) Document the current configuration, plan the correction for a maintenance window outside business hours, and validate behavior in test environment first
D) Create a new NAT rule without Floating IP pointing to the same NIC and delete the current rule while traffic migrates


Scenario 4: Diagnostic Sequence

An operator reports that external connections to a VM via an inbound NAT rule are failing with a timeout. The environment has never worked since it was created, so this is not a regression. The available investigation steps are listed below in random order:

[P1] Verify if the NIC/subnet NSG allows traffic on the backend port
[P2] Confirm that the VM is in Running state and the service is listening on the backend port
[P3] Confirm that the public IP is associated with the Load Balancer's frontend IP configuration
[P4] Check the Provisioning State of the inbound NAT rule in the portal
[P5] Test internal direct connectivity to the backend port from another VM in the same VNet

Which sequence represents the correct diagnostic reasoning, from most general to most specific?

A) P4 → P3 → P2 → P5 → P1
B) P3 → P4 → P2 → P1 → P5
C) P4 → P3 → P1 → P2 → P5
D) P2 → P5 → P3 → P4 → P1


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The decisive clue is the successful internal test: TcpTestSucceeded : True on port 3389 from a VM within the VNet confirms that the VM is healthy and the RDP service is active, so the problem lies in a layer that affects only external traffic. The Load Balancer configuration is an unlikely culprit because the NAT rule's provisioning state is Succeeded, although that state confirms only that the configuration was applied, not that traffic actually flows.

The component that differentiates external from internal traffic and could have been changed during maintenance without direct relation to the main task is the NSG. NSGs are frequently modified in maintenance windows as part of security reviews, and a rule denying RDP from outside the VNet would explain exactly the observed symptom.
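
The way an NSG can pass the internal test while blocking external clients can be illustrated with a minimal sketch of priority-ordered rule evaluation. This is a simplified model, not the Azure implementation: the `Rule` class, the `evaluate` function, the VNet range 10.0.0.0/16, the client address 203.0.113.7, and the rule set itself are all assumptions made for illustration. The sketch also assumes the NSG sees the backend port (3389) and the client's original source address.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number is evaluated first
    source: str    # CIDR prefix matched against the packet's source address
    port: int      # destination port as seen by the NSG
    allow: bool


def evaluate(rules, source_ip, port):
    """Return the decision of the first matching rule, or deny if none match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and ip_address(source_ip) in ip_network(rule.source):
            return rule.allow
    return False


# Hypothetical rule set left behind by the maintenance window.
rules = [
    Rule(priority=100, source="10.0.0.0/16", port=3389, allow=True),  # RDP inside the VNet
    Rule(priority=200, source="0.0.0.0/0", port=3389, allow=False),   # RDP from anywhere else
]

internal = evaluate(rules, "10.0.1.10", 3389)    # vm-jumpbox
external = evaluate(rules, "203.0.113.7", 3389)  # hypothetical internet client
print(internal, external)
```

Under these assumptions the jumpbox is allowed and the external client is denied, which matches the symptom: the internal test succeeds while RDP through the public IP fails.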

Alternative A is a classic misconception: load balancing rules and NAT rules can coexist on the same frontend IP on different ports without conflict. Alternative C is implausible because the internal test worked, ruling out that the RDP service is stopped. Alternative D, while technically possible, would have resulted in a different Provisioning State and would affect all Load Balancer rules, not just external access.

The most dangerous distractor is alternative A: an administrator under pressure could waste time looking for rule conflicts in the Load Balancer while ignoring the NSG, which is the real cause.


Answer Key: Scenario 2

Answer: C

The timeout on all ports, combined with confirmed active SSH service and the NAT rule pool provisioned with mapped instances, directs the diagnosis to a blocking layer before the packet even reaches the application. The NSG is that layer.

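In VMSS environments, the NSG can block at two points: the VMSS subnet or the instance NICs. The NSG evaluates the packet after the Load Balancer has translated the destination, so it sees the client's original source address and backend port 22, not the frontend ports (50000, 50001, 50002). For the NAT rule pool to work, an inbound rule must therefore allow TCP port 22 from the clients' source addresses. The AzureLoadBalancer service tag covers health probe traffic only and does not admit client connections. Without such a rule, a Standard Load Balancer drops the traffic and the client sees a timeout.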
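
The relationship between the frontend ports (50000, 50001, 50002) and backend port 22 can be sketched as a simple mapping. This is an illustration only: the function name `nat_pool_mapping` is hypothetical, and sequential allocation starting at the frontend port start is an assumption; the actual port assigned to each instance should be read from the portal or API.

```python
def nat_pool_mapping(frontend_port_start, backend_port, instance_count):
    """Map each frontend port to (instance index, backend port).

    Assumes ports are handed out sequentially from frontend_port_start.
    """
    return {
        frontend_port_start + index: (index, backend_port)
        for index in range(instance_count)
    }


# Values taken from the nat-pool-ssh listing in the scenario.
mapping = nat_pool_mapping(frontend_port_start=50000, backend_port=22, instance_count=3)
for frontend_port, (instance, backend_port) in mapping.items():
    print(f"{frontend_port} -> instance {instance}, port {backend_port}")
```

Every frontend port ends at port 22 on some instance, which is why a single missing allow rule produces the same timeout on all tested ports.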

Alternative A represents a frequent misconception: the health probe and NAT rule pool are independent. The probe determines whether the instance receives balanced traffic but doesn't control NAT functionality. Alternative B is false: NAT rule pools were designed precisely to eliminate the need for individual public IPs. Alternative D is incorrect as 50000 is a valid port within Standard Load Balancer limits.

The information about the health probe returning Healthy is deliberately irrelevant and serves to attract the wrong diagnosis of alternative A.


Answer Key: Scenario 3

Answer: C

The determining constraint in this scenario is not technical but operational: it's 14:30 during business hours, the VM processes real-time financial transactions, and any error in the correction could cause immediate interruption. The fact that connections are working, even through an undocumented legacy path, means there's no active incident justifying immediate action.

Alternative A ignores a critical risk: disabling Floating IP changes how the Load Balancer delivers packets to the NIC. Even without VM restart, if the application or legacy secondary IP doesn't behave as expected after the change, the service fails in production. Alternative B is the most dangerous: removing the secondary IP without first disabling Floating IP would guarantee immediate interruption, as the application would start receiving packets destined for the frontend IP without any configuration to accept them. Alternative D introduces a window of dual configuration and risk of connections being lost during transition, also during peak hours.
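
The interaction between Floating IP and the legacy secondary IP can be made concrete with a simplified model. It assumes that with Floating IP enabled the packet reaches the NIC still addressed to the frontend IP, and that without it the destination is rewritten to the VM's private IP. The function names and both addresses are hypothetical, since the scenario does not state them.

```python
def delivered_destination_ip(frontend_ip, backend_private_ip, floating_ip_enabled):
    """Destination IP on the packet handed to the NIC (simplified model)."""
    return frontend_ip if floating_ip_enabled else backend_private_ip


def guest_accepts(local_addresses, destination_ip):
    """A guest OS only accepts packets addressed to an IP it owns."""
    return destination_ip in local_addresses


FRONTEND_IP = "203.0.113.10"  # hypothetical frontend IP
PRIMARY_IP = "10.0.2.20"      # hypothetical primary private IP of the VM

with_legacy_ip = {PRIMARY_IP, FRONTEND_IP}  # current state: secondary IP equals frontend IP
without_legacy_ip = {PRIMARY_IP}            # state after removing the secondary IP

results = {}
for floating in (True, False):
    destination = delivered_destination_ip(FRONTEND_IP, PRIMARY_IP, floating)
    results[floating] = (
        guest_accepts(with_legacy_ip, destination),
        guest_accepts(without_legacy_ip, destination),
    )
    print(floating, destination, results[floating])
```

In this model the only failing combination is Floating IP enabled with the secondary IP removed, which is the state alternative B would create. It says nothing about how in-flight connections behave during a change, which is the risk that justifies a maintenance window.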

The principle applied here is: when the system is working and the risk of immediate correction is greater than the risk of delay, the correct action is to plan, not act.


Answer Key: Scenario 4

Answer: A

The correct sequence is: P4 → P3 → P2 → P5 → P1.

The reasoning goes from broadest to most specific, eliminating hypotheses from outside in:

  1. P4 (Provisioning State): confirm that the rule exists and is valid before any other step. If provisioning failed, the other steps are irrelevant.
  2. P3 (public IP associated): verify that the Load Balancer data plane has external connectivity. Without this, no NAT rule can receive external traffic.
  3. P2 (VM Running + service active): confirm that the destination is operational. A stopped service would explain the failure regardless of any network configuration.
  4. P5 (internal test): isolate whether the problem is in the VM/application or in the network/NAT layer. If the internal test fails, the problem is in the VM. If it passes, the problem is between the Load Balancer and VM.
  5. P1 (NSG): checked last because it requires context from previous steps to be interpreted correctly.
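
The sequence above can be sketched as an ordered list of checks that stops at the first failure. This is a minimal illustration: `run_diagnosis` is a hypothetical helper, and the lambdas stand in for real verifications with results invented for an environment where only the NSG is misconfigured.

```python
def run_diagnosis(checks):
    """Run checks in order; return the label of the first failing one, or None."""
    for label, check in checks:
        if not check():
            return label
    return None


# Hypothetical outcomes: every step passes except the NSG check.
checks = [
    ("P4 provisioning state", lambda: True),
    ("P3 public IP associated", lambda: True),
    ("P2 VM running and service listening", lambda: True),
    ("P5 internal connectivity", lambda: True),
    ("P1 NSG allows backend port", lambda: False),
]

first_failure = run_diagnosis(checks)
print(first_failure)
```

Stopping at the first failed check mirrors the reasoning in the list: each step is only meaningful once the steps before it have passed.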

Alternative D represents the most common error: starting with the VM (the destination) before validating that the ingress path is intact, wasting time on a component that may be perfectly functional.


Troubleshooting Tree: Create and configure inbound NAT rules


Color Legend:

Color      Node Type
Dark blue  Initial symptom (entry point)
Blue       Diagnostic question (binary decision)
Orange     Intermediate verification or validation point
Red        Identified cause
Green      Recommended action or resolution

To use this tree when facing a real problem, start with the root node representing the observed symptom and answer each question based on what you can verify directly in the environment. Follow the branch corresponding to your answer without skipping steps. Each path ends with a specific cause accompanied by a corrective action. If the path reaches an action node and the problem persists after correction, return to the previous node and reassess the given answer, as an assumption may have been made without direct verification.