Troubleshooting Lab: Design service chaining, including gateway transit

Diagnostic Scenarios

Scenario 1 — Root Cause

A network team reports that VMs in Spoke-A cannot communicate with the on-premises environment via VPN. The topology is classic hub-and-spoke: the Hub-VNet has an active and connected VPN gateway, and peerings between Hub and Spoke-A were created two weeks ago without incidents. Recently, the infrastructure team provisioned a virtual network gateway within Spoke-A itself for an ExpressRoute pilot project that was canceled before entering production. The gateway was kept for future use. The peering from Spoke-A to Hub-VNet continues to be listed as Connected in the portal.

The administrator runs the following command to check the effective routes on a VM NIC in Spoke-A:

az network nic show-effective-route-table \
  --resource-group rg-spoke-a \
  --name vm-spoke-a-nic01 \
  --output table

The output shows only routes from Spoke-A's local address space and no routes with on-premises network prefixes.

What is the root cause of the absence of on-premises routes on Spoke-A VMs?

A) The VPN gateway in Hub-VNet lost connection with the on-premises device and stopped advertising routes

B) The existence of a virtual network gateway in Spoke-A prevents Use remote gateways from being enabled on this VNet, blocking route propagation from the Hub

C) The peering between Spoke-A and Hub-VNet has Allow forwarded traffic disabled, preventing external routes from being accepted

D) The canceled ExpressRoute gateway left conflicting routes in Spoke-A's effective routing table

Scenario 2 — Action Decision

The security team identified that traffic between Spoke-B and Spoke-C is flowing directly between VMs without going through the centralized NVA in the Hub-VNet. The cause was confirmed: no UDRs were applied to Spoke-B and Spoke-C subnets. The environment is in production with dozens of active VMs in each Spoke. The NVA is already configured, with IP Forwarding enabled on the NIC, and the peering between each Spoke and the Hub has Allow forwarded traffic enabled.

The team has a 30-minute maintenance window available now. The application running on the VMs is sensitive to interruptions and any route change causes a TCP session reconnection of approximately 8 seconds.

What is the correct action to take at this time?

A) Recreate the peerings between Spoke-B, Spoke-C and the Hub-VNet with correct configurations to force route table updates

B) Apply UDRs to Spoke-B and Spoke-C subnets during the maintenance window, accepting the predicted 8-second reconnection impact

C) Enable Use remote gateways on Spoke-B and Spoke-C so the Hub will automatically control routing between Spokes

D) Wait for a larger maintenance window before applying any changes, as recreating peerings causes more severe impact than applying UDRs

Scenario 3 — Root Cause

An engineer configured service chaining between Spoke-D and Spoke-E via NVA in the Hub-VNet. UDRs were applied correctly in both Spokes. IP Forwarding is enabled on the NVA NIC. The peering between Spoke-D and Hub-VNet has Allow forwarded traffic enabled. The peering between Hub-VNet and Spoke-E also has Allow forwarded traffic enabled.

During tests, pings from VMs in Spoke-D reach the NVA (confirmed via tcpdump on the NVA VM), but do not advance to Spoke-E. The engineer checks the NVA operating system:

cat /proc/sys/net/ipv4/ip_forward
0

The engineer then checks the Azure portal and confirms that the NVA NIC has the IP Forwarding option marked as Enabled.

What is the root cause of the problem?

A) The peering between Hub-VNet and Spoke-E has Allow forwarded traffic disabled, dropping packets before they reach the destination VMs

B) The UDR in Spoke-D has the wrong next hop type; it should be VirtualNetworkGateway instead of VirtualAppliance

C) IP Forwarding is enabled in Azure (platform layer), but is disabled in the NVA VM operating system, which is where actual forwarding occurs

D) The NVA has only one NIC, which prevents routing between different subnet interfaces

Scenario 4 — Diagnostic Sequence

An administrator receives the following report: VMs in Spoke-F cannot reach an on-premises server with IP 192.168.10.50, but can reach other resources within Spoke-F itself and within Hub-VNet without problems. Hub-VNet has an active VPN gateway. The peering between Spoke-F and Hub-VNet exists and is listed as Connected.

The following investigation steps are available, out of order:

1. Check if Use remote gateways is enabled on the peering from Spoke-F to Hub-VNet
2. Check the effective routes on a VM NIC in Spoke-F and confirm if there is a route for 192.168.0.0/16
3. Check the VPN connection status on the Hub-VNet gateway and confirm if the tunnel with the on-premises environment is Connected
4. Check if Allow gateway transit is enabled on the peering from Hub-VNet to Spoke-F
5. Test direct connectivity from a VM in Hub-VNet to 192.168.10.50

What is the correct investigation sequence?

A) 3 -> 5 -> 4 -> 1 -> 2

B) 2 -> 1 -> 4 -> 3 -> 5

C) 5 -> 3 -> 4 -> 1 -> 2

D) 1 -> 2 -> 3 -> 4 -> 5

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The central clue is the combination of two facts: the peering remains Connected (therefore the peering itself is not the problem) and a gateway was provisioned within Spoke-A. Azure enforces a restriction that makes Use remote gateways and the existence of a local gateway mutually exclusive in the same VNet. With a gateway present in Spoke-A, the Use remote gateways option is automatically blocked, and routes learned by the Hub gateway are simply not propagated to Spoke-A.

The detail about the canceled ExpressRoute project is irrelevant information inserted intentionally. The fact that the project was canceled doesn't matter; what matters is that the gateway continues to exist in the VNet.

Alternative A is the most dangerous distractor: suspecting the Hub gateway is plausible, but the statement explicitly says the Hub-VNet VPN gateway is active and connected. Alternative C confuses Allow forwarded traffic with the gateway route propagation mechanism, which are distinct features. Alternative D is factually incorrect: the effective route table shows no on-premises routes at all, which is an absence of propagated routes rather than a conflict between routes, and a gateway that never entered production would not cause it.

Acting on distractor A would lead to investigating and possibly recreating the VPN connection in the Hub, generating unnecessary impact across the entire topology.
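The two facts behind answer B can be checked from the command line. The following is a minimal sketch, assuming the Azure CLI is installed and logged in and that the gateway lives in the same resource group as the spoke VNet. The function names, the VNet name Spoke-A, and the peering name spoke-a-to-hub are hypothetical; only rg-spoke-a comes from the scenario.

```shell
# Sketch only: function names and example resource names are assumptions.

# List virtual network gateways in a resource group ($1).
# A gateway deployed in the spoke VNet is what blocks "Use remote gateways".
list_spoke_gateways() {
  az network vnet-gateway list \
    --resource-group "$1" \
    --query "[].{name:name, type:gatewayType, state:provisioningState}" \
    --output table
}

# Show the gateway-related flags on the spoke side of the peering.
# Arguments: $1 resource group, $2 VNet name, $3 peering name.
show_peering_flags() {
  az network vnet peering show \
    --resource-group "$1" \
    --vnet-name "$2" \
    --name "$3" \
    --query "{state:peeringState, useRemoteGateways:useRemoteGateways, allowForwardedTraffic:allowForwardedTraffic}" \
    --output table
}

# Example calls (hypothetical names):
#   list_spoke_gateways rg-spoke-a
#   show_peering_flags rg-spoke-a Spoke-A spoke-a-to-hub
```

A gateway listed for the spoke together with useRemoteGateways reported as false would be consistent with the root cause described above.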

Answer Key — Scenario 2

Answer: B

The cause is already stated in the scenario (absence of UDRs) and the NVA is fully configured. The correct decision is to apply UDRs during the available window. The 8-second reconnection impact is predictable, controlled, and compatible with a maintenance window.

Alternative A is technically incorrect for this problem: recreating peerings does not solve the absence of UDRs and would cause much greater impact than necessary. Alternative C represents a serious conceptual error: Use remote gateways serves to share a VPN/ExpressRoute gateway, not to force the Hub to control routing between Spokes via NVA. This configuration would have no effect on inter-Spoke traffic. Alternative D could be valid if the impact were unpredictable or severe, but the scenario provides the exact 8-second reconnection data, making postponement unjustified.

The most dangerous distractor is A: recreating peerings in production causes complete connectivity interruption during the process, much worse than 8 seconds of reconnection.
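Applying the UDRs could look roughly like the sketch below, which assumes the Azure CLI is installed and logged in. The function name, the route table and route names, and every resource name, address prefix, and NVA IP in the example call are hypothetical placeholders, since the scenario does not provide them.

```shell
# Sketch only: all names, prefixes, and IPs are hypothetical placeholders.
# Creates a route table that sends traffic for the other spoke's prefix to the
# NVA, then associates it with a subnet. The association is the step that
# changes the effective routes of the VMs in that subnet.
# Arguments: $1 resource group, $2 VNet, $3 subnet,
#            $4 remote spoke prefix, $5 NVA private IP.
# The parentheses run the body in a subshell so the variables stay local.
apply_spoke_udr() (
  rg="$1"; vnet="$2"; subnet="$3"; remote_prefix="$4"; nva_ip="$5"
  rt="rt-${vnet}-to-nva"

  az network route-table create --resource-group "$rg" --name "$rt" &&
  az network route-table route create \
    --resource-group "$rg" \
    --route-table-name "$rt" \
    --name to-remote-spoke \
    --address-prefix "$remote_prefix" \
    --next-hop-type VirtualAppliance \
    --next-hop-ip-address "$nva_ip" &&
  az network vnet subnet update \
    --resource-group "$rg" \
    --vnet-name "$vnet" \
    --name "$subnet" \
    --route-table "$rt"
)

# Example call (hypothetical values):
#   apply_spoke_udr rg-spoke-b Spoke-B snet-workload 10.3.0.0/16 10.0.1.4
```

The same call would be repeated for Spoke-C with Spoke-B's prefix, so that traffic in both directions passes through the NVA.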

Answer Key — Scenario 3

Answer: C

The command cat /proc/sys/net/ipv4/ip_forward returning 0 is the definitive clue. IP Forwarding operates at two independent layers: the Azure platform layer (NIC configuration in the portal) and the VM operating system layer. The Azure setting only stops the platform from dropping packets that are not addressed to the NIC's own IP before they are delivered to the VM. The actual forwarding of the packet is done by the operating system kernel. If the kernel has ip_forward = 0, the packet reaches the VM and is dropped there.

Because pings reach the NVA (confirmed via tcpdump), the path from Spoke-D to the NVA is working. This eliminates alternative B: with the wrong next hop type, traffic would never arrive at the NVA. Alternative A contradicts the scenario, which states that the peering between Hub-VNet and Spoke-E has Allow forwarded traffic enabled. Alternative D is a plausible technical distractor in older architectures, but Azure supports packet forwarding with a single-NIC NVA via kernel routing, and the scenario indicates no NIC restrictions.

Acting on alternative A would lead the engineer to modify peering configurations that are correct, without solving the real problem.
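On a Linux NVA, the operating system layer could be checked and corrected along the lines of the sketch below. The function names and the drop-in file name 99-nva-forwarding.conf are arbitrary choices, and the persistence step assumes a distribution that reads /etc/sysctl.d.

```shell
# Sketch only: function names and the drop-in file name are assumptions.

# Report whether the kernel forwards IPv4 packets.
# The optional argument lets the check read a different file; by default it
# reads the same path used in the scenario.
check_ip_forward() {
  if [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]; then
    echo "kernel forwarding: enabled"
  else
    echo "kernel forwarding: disabled"
    return 1
  fi
}

# Enable forwarding immediately and persist it across reboots (requires root).
enable_ip_forward() {
  sysctl -w net.ipv4.ip_forward=1 &&
  echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/99-nva-forwarding.conf
}
```

Running check_ip_forward on the NVA from the scenario would report forwarding as disabled. After enable_ip_forward, the same check should report it as enabled.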

Answer Key — Scenario 4

Answer: A

The correct sequence is 3 -> 5 -> 4 -> 1 -> 2, and alternative A represents exactly this progressive elimination reasoning:

Step 3 confirms if the VPN tunnel is active. If the tunnel is down, all other steps are irrelevant.

Step 5 isolates whether the problem is specific to Spoke-F or affects the entire topology. If a VM in the Hub also cannot reach 192.168.10.50, the problem is with the gateway or connection, not the peering.

Step 4 checks if the Hub is sharing the gateway (Allow gateway transit).

Step 1 checks if Spoke-F is configured to consume this gateway (Use remote gateways).

Step 2 confirms the final result: if the route for 192.168.0.0/16 appears in the effective table, the propagation mechanism is working.

Alternative B starts with the final symptom (effective routes) without first validating if the gateway and tunnel are operational, which is a classic diagnostic error: investigating the consequence before checking if the primary cause exists. Alternative C jumps straight to testing from the Hub without first confirming tunnel status. Alternative D is random and follows no progressive elimination logic.
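The sequence 3 -> 5 -> 4 -> 1 -> 2 could be run from the Azure CLI roughly as sketched below, assuming the CLI is installed and logged in and that the Hub VM runs Linux. The function names and all resource names in the example calls are hypothetical; only the IP 192.168.10.50 comes from the scenario.

```shell
# Sketch only: function names and resource names are hypothetical placeholders.

# Step 3: tunnel status on the Hub gateway connection (expect "Connected").
# Arguments: $1 resource group, $2 VPN connection name.
check_tunnel() {
  az network vpn-connection show --resource-group "$1" --name "$2" \
    --query connectionStatus --output tsv
}

# Step 5: ping the on-premises server from a Linux VM in Hub-VNet.
# Arguments: $1 resource group, $2 VM name.
ping_from_hub_vm() {
  az vm run-command invoke --resource-group "$1" --name "$2" \
    --command-id RunShellScript --scripts "ping -c 3 192.168.10.50"
}

# Steps 4 and 1: read one boolean flag from a peering.
# Use allowGatewayTransit on the Hub side and useRemoteGateways on the Spoke side.
# Arguments: $1 resource group, $2 VNet name, $3 peering name, $4 flag name.
peering_flag() {
  az network vnet peering show --resource-group "$1" --vnet-name "$2" \
    --name "$3" --query "$4" --output tsv
}

# Step 2: effective routes on a Spoke-F VM NIC.
# Arguments: $1 resource group, $2 NIC name.
effective_routes() {
  az network nic show-effective-route-table --resource-group "$1" \
    --name "$2" --output table
}

# Example order of calls (hypothetical names):
#   check_tunnel rg-hub hub-to-onprem
#   ping_from_hub_vm rg-hub vm-hub-01
#   peering_flag rg-hub Hub-VNet hub-to-spoke-f allowGatewayTransit
#   peering_flag rg-spoke-f Spoke-F spoke-f-to-hub useRemoteGateways
#   effective_routes rg-spoke-f vm-spoke-f-nic01
```

Stopping at the first check that fails keeps the investigation aligned with the progressive elimination reasoning: later checks only matter once the earlier ones pass.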

Troubleshooting Tree: Design service chaining, including gateway transit

Legend:

Dark blue: initial symptom, investigation entry point
Blue: diagnostic question with yes or no answer
Orange: validation or intermediate verification node
Red: identified cause with recommended action
Green: not used in this diagram; reserved for confirmed resolutions

To use this tree when facing a real problem, start at the root node and answer each question based on what is directly observable in the environment: tunnel status, peering configurations in the portal, effective routes on the NIC, and IP Forwarding state in the operating system. Each answer eliminates an entire branch of hypotheses. The goal is to reach a red node with the identified cause before making any changes to the environment.
