Troubleshooting Lab: Diagnose and resolve routing issues
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that VMs in VNet-Spoke-A stopped reaching on-premises servers at 2:32 PM. The environment uses a hub-and-spoke topology. The VPN Gateway is in VNet-Hub and shows Connected status in the portal. The peering between VNet-Spoke-A and VNet-Hub was created three months ago and has never had issues. Yesterday afternoon, an engineer added a new subnet in VNet-Spoke-A and associated it with a Route Table that was already in use elsewhere in the organization.
The team confirms that VMs in other subnets of VNet-Spoke-A continue reaching on-premises normally. Only the VMs in the new subnet are affected.
The Network Watcher Effective Routes output for a NIC in the new subnet shows:
Address Prefix   Next Hop Type      Next Hop IP   Source
--------------   ----------------   -----------   -------
0.0.0.0/0        Internet           -             Default
10.0.0.0/8       VirtualAppliance   10.1.0.5      User
10.1.0.0/16      VNetPeering        -             Default
172.16.0.0/12    VirtualAppliance   10.1.0.5      User
The BGP routes propagated by the VPN Gateway to neighboring subnets include the prefix 192.168.0.0/16, which covers the entire on-premises network.
What is the root cause of the problem?
A) The VPN Gateway lost the BGP session with the on-premises device, stopping route propagation to the new subnet.
B) The Route Table associated with the new subnet does not have the Propagate Gateway Routes option enabled, preventing BGP routes from the VPN Gateway from being injected into the NIC's effective table.
C) The peering between VNet-Spoke-A and VNet-Hub has reached a route limit and stopped propagating new entries to the newly created subnet.
D) The UDR with prefix 10.0.0.0/8 is overriding the on-premises route because the prefixes overlap and UDR takes precedence over BGP.
Scenario 2: Action Decision
The network team has identified that traffic from a critical production application is being routed incorrectly. The cause has been confirmed: a UDR was mistakenly applied to the wrong subnet during a change made yesterday. The UDR contains a 0.0.0.0/0 route with next hop pointing to an NVA that has no rules for this application's traffic.
The environment has the following constraints:
- The official maintenance window is at 10 PM; it's currently 10 AM.
- The affected application processes payments and is experiencing partial degradation, not total unavailability.
- Removing the UDR from the subnet would restore default routing immediately.
- Adding a rule to the NVA to allow the traffic would also resolve the problem, but requires approval from the security team, which is not available until 2 PM.
- The company's change management classifies removing a UDR from a subnet as an emergency change, which can be executed outside the window with operations manager approval.
What is the correct action to take at this moment?
A) Wait until 10 PM and remove the UDR during the official maintenance window, as any change outside the window violates the change management policy.
B) Trigger the emergency change process, obtain operations manager approval, and remove the UDR from the subnet immediately.
C) Add the rule to the NVA without waiting for the security team, as the application is degraded and the impact justifies the action.
D) Create a new UDR in the subnet with a more specific route for the application traffic, pointing to the default gateway, without removing the existing UDR.
Scenario 3: Root Cause
A developer opens a ticket reporting that VM app-vm-01, in VNet-Prod in the East US region, cannot reach VM db-vm-01, in VNet-Data in the same region. The two VNets are connected via Global VNet Peering. The developer mentions that yesterday they deployed a new container on app-vm-01 and suspects it's an operating system firewall problem.
The network team executes Network Watcher Connection Troubleshoot between the two VMs and obtains:
Status: Reachable
Hops:
1. Source: app-vm-01 (10.1.0.4)
2. Destination: db-vm-01 (10.2.0.5) - Latency: 2ms
Next, the database administrator confirms that port 1433 is not responding and checks the Network Security Group (NSG) rules associated with db-vm-01's NIC:
Priority   Name              Port   Protocol   Source        Action
--------   ---------------   ----   --------   -----------   ------
100        AllowSSH          22     TCP        10.0.0.0/8    Allow
200        AllowAppTier      1433   TCP        10.1.0.0/24   Allow
300        AllowMonitoring   5000   TCP        10.3.0.0/16   Allow
65000      DenyAllInbound    *      *          *             Deny
The address of app-vm-01 is 10.1.0.4. The subnet prefix of app-vm-01 is 10.1.1.0/24.
What is the root cause of the port 1433 failure?
A) Global VNet Peering introduces additional latency that causes timeout in the TCP handshake of port 1433.
B) The AllowAppTier rule allows traffic originating from 10.1.0.0/24, but app-vm-01 is in 10.1.1.0/24, so traffic on port 1433 is blocked by the DenyAllInbound rule.
C) The operating system firewall on app-vm-01 is blocking outbound connections on port 1433 after the new container deployment.
D) The NSG has no explicit outbound rule in app-vm-01 allowing traffic to port 1433, and Azure denies this traffic by default.
Scenario 4: Collateral Impact
A network team diagnosed that VMs in VNet-Spoke-B were not reaching the internet because the Propagate Gateway Routes option was disabled in the subnet's Route Table, preventing the receipt of the default route via BGP. The team enabled the option and confirmed that internet access was restored.
Two minutes after the change, the monitoring team opens an alert: traffic from VNet-Spoke-B that previously passed through an NVA for security inspection stopped being inspected. The NVA continues to be operational and has no alerts of its own.
What is the collateral consequence that explains this behavior?
A) Enabling Propagate Gateway Routes internally restarts the Route Table, causing a brief convergence period during which the NVA loses established TCP sessions.
B) The routes propagated by the gateway include more specific routes than the UDRs that diverted traffic to the NVA, and the longest prefix match now prefers the gateway routes, bypassing the NVA.
C) Route propagation via gateway replaced the default route 0.0.0.0/0 from the UDR that pointed to the NVA, and now outbound traffic goes directly to the internet without inspection.
D) The NVA lost the return route to VNet-Spoke-B because the NVA's routing table also depends on gateway propagation and was affected by the change.
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The central clue is in the Effective Routes output: the prefix 192.168.0.0/16, which covers the on-premises network, simply does not appear in the new subnet's table. Neighboring subnets receive it normally. The VPN Gateway is Connected, eliminating hypothesis A.
When a Route Table with the Propagate Gateway Routes option disabled is associated with a subnet, Azure does not add the routes that the gateway learned via BGP to the effective routes of the NICs in that subnet. Since the Route Table already existed in the organization, it's highly likely that this option was disabled by design for other purposes, and the engineer didn't notice when reusing it.
Alternative D is the most tempting distractor: the UDR 10.0.0.0/8 would indeed take precedence over a BGP route for the same prefix, but the on-premises destination is 192.168.0.0/16, completely outside that range, so the UDR never matches this traffic. The private-range UDR prefixes in the output invite this incorrect reasoning; the stated BGP prefix is what rules it out. Alternative C is false: peerings don't have route limits that would affect subnets individually in this way.
The most dangerous distractor in production would be alternative A, as an engineer might unnecessarily restart or reconfigure the VPN Gateway, causing much greater impact than the existing one.
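The reasoning above can be checked with a minimal sketch of longest prefix match over the effective routes from the scenario. The host 192.168.10.4 is a hypothetical on-premises address, and the added gateway entry (including its next hop type and source labels) is an illustrative assumption about what the neighboring subnets see, not output taken from the scenario.

```python
import ipaddress

# Effective routes copied from the Scenario 1 output: (prefix, next hop type, source).
EFFECTIVE_ROUTES = [
    ("0.0.0.0/0", "Internet", "Default"),
    ("10.0.0.0/8", "VirtualAppliance", "User"),
    ("10.1.0.0/16", "VNetPeering", "Default"),
    ("172.16.0.0/12", "VirtualAppliance", "User"),
]


def select_route(destination, routes):
    """Return the route whose prefix is the longest match for the destination."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r[0])]
    # The largest prefix length is the most specific match.
    return max(candidates, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


# Hypothetical on-premises host inside 192.168.0.0/16.
ON_PREM_HOST = "192.168.10.4"

# New subnet: no gateway route, so only 0.0.0.0/0 matches.
print(select_route(ON_PREM_HOST, EFFECTIVE_ROUTES))

# Assumed view from a neighboring subnet that does receive the gateway route.
with_bgp = EFFECTIVE_ROUTES + [
    ("192.168.0.0/16", "VirtualNetworkGateway", "VirtualNetworkGateway"),
]
print(select_route(ON_PREM_HOST, with_bgp))
```

In this sketch neither UDR matches the on-premises address, which is why alternative D does not explain the failure: without the propagated prefix, the only matching entry is the default route to the Internet.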
Answer Key: Scenario 2
Answer: B
The cause has already been identified. The problem is now a decision: what is the correct action, given the set of constraints? The application is in partial degradation, not total unavailability, which means the impact is real but controlled. The company's change management policy explicitly provides the emergency change process for situations like this, with operations manager approval.
Alternative A ignores the emergency mechanism that the policy itself offers. Waiting 12 hours with active degradation is a wrong decision when there's a legitimate path available.
Alternative C violates the security process without approval, exposing the organization to additional risk and ignoring the critical constraint of the scenario.
Alternative D is technically creative but inadequate: creating a more specific UDR for the application traffic might work partially, but doesn't resolve the root cause (the wrong UDR still remains), creates technical debt, and doesn't follow the correct remediation process.
Answer Key: Scenario 3
Answer: B
Connection Troubleshoot confirmed that the routing path between the VMs is intact. This eliminates any hypothesis related to Global VNet Peering or routes, including alternative A.
The cause is in the AllowAppTier rule of the NSG, which allows port 1433 traffic only from sources in 10.1.0.0/24. The NSG compares the packet's source IP against that prefix: if app-vm-01 really sends from its stated subnet, 10.1.1.0/24, the source falls outside 10.1.0.0/24, no Allow rule matches, and DenyAllInbound drops the traffic. This is the reasoning behind alternative B.
The statement data is inconsistent, however, and this is a clue the reader should identify. The address 10.1.0.4 does belong to 10.1.0.0/24, so taken literally the AllowAppTier rule would match it, and an IP of 10.1.0.4 cannot sit in a /24 subnet that starts at 10.1.1.0. The contradiction points to an error in the documentation or in the actual subnet configuration. Before changing the NSG, confirm the real source address of app-vm-01: the VM's subnet may differ from the documented one, or outbound traffic from the new container may be leaving through a different interface.
Alternative C is the most attractive distractor because the developer mentioned the container, but Connection Troubleshoot validates that the routing path is correct up to layer 3, shifting the problem to the NSG or the application layer.
Alternative D is wrong because Azure allows outbound traffic by default in NSGs; the default AllowVnetOutbound rule covers this case.
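The NSG evaluation discussed above can be sketched as a first-match walk over the inbound rules in priority order. This is a simplified model, assuming rules match only on destination port and source prefix; the address 10.1.1.4 is a hypothetical host inside the stated subnet 10.1.1.0/24, not a value from the scenario.

```python
import ipaddress

# Inbound rules from the Scenario 3 NSG, listed in ascending priority order
# as in the table: (name, port, source prefix, action). "*" means any.
NSG_RULES = [
    ("AllowSSH", "22", "10.0.0.0/8", "Allow"),
    ("AllowAppTier", "1433", "10.1.0.0/24", "Allow"),
    ("AllowMonitoring", "5000", "10.3.0.0/16", "Allow"),
    ("DenyAllInbound", "*", "*", "Deny"),
]


def evaluate(source_ip, port, rules):
    """Return (rule name, action) for the first rule that matches, or None."""
    src = ipaddress.ip_address(source_ip)
    for name, rule_port, rule_source, action in rules:
        port_ok = rule_port == "*" or rule_port == str(port)
        source_ok = rule_source == "*" or src in ipaddress.ip_network(rule_source)
        if port_ok and source_ok:
            return name, action
    return None


# Address exactly as stated in the scenario.
print(evaluate("10.1.0.4", 1433, NSG_RULES))

# Hypothetical address inside the stated subnet 10.1.1.0/24.
print(evaluate("10.1.1.4", 1433, NSG_RULES))

# The stated IP is not a member of the stated subnet.
print(ipaddress.ip_address("10.1.0.4") in ipaddress.ip_network("10.1.1.0/24"))
```

Running the check against both candidate addresses makes the inconsistency concrete: the outcome depends entirely on which source address the traffic actually carries, so that address is the first thing to verify.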
Answer Key: Scenario 4
Answer: C
The action taken was to enable Propagate Gateway Routes, and its immediate effect was that the subnet began receiving the gateway's routes. Among the propagated routes there was probably a 0.0.0.0/0 route learned via BGP (a default route announced by the on-premises environment or by the gateway itself).
If the UDR that diverted traffic to the NVA used exactly the prefix 0.0.0.0/0, it had precedence over any system route. For an identical prefix, Azure also prefers a UDR over a BGP route, so the UDR should still win against a propagated 0.0.0.0/0. The observed behavior therefore suggests that the UDR was overwritten or that propagation introduced a more specific prefix that longest prefix match now prefers, bypassing the NVA UDR.
Alternative C describes this mechanism precisely: propagation added the gateway's 0.0.0.0/0 route that now competes with or replaces the UDR route to the NVA, depending on the exact Route Table configuration.
Alternative A is false: enabling propagation doesn't restart the Route Table or terminate TCP sessions.
Alternative B describes a valid mechanism (longest prefix match via more specific routes), but the reported symptom is of general outbound traffic bypass, not a specific destination, which points to the prefix 0.0.0.0/0 as the center of the problem.
Alternative D is implausible because the NVA has its own routing table independent of the workload subnet Route Tables.
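The precedence rules referenced in this explanation can be sketched as a two-step selection: longest prefix match first, then route source (user-defined, then BGP, then system) as the tie-breaker for identical prefixes. The route entries, the destination 203.0.113.10, and the prefix 203.0.113.0/24 are hypothetical values chosen for illustration; none of them come from the scenario.

```python
import ipaddress

# Tie-breaker for routes with the same prefix: lower rank wins.
SOURCE_RANK = {"User": 0, "BGP": 1, "Default": 2}


def select_route(destination, routes):
    """Pick the longest matching prefix; break ties by route source."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    return min(
        matches,
        key=lambda r: (
            -ipaddress.ip_network(r["prefix"]).prefixlen,  # longer prefix first
            SOURCE_RANK[r["source"]],                      # then UDR > BGP > system
        ),
    )


# Hypothetical routes for illustration only.
udr_to_nva = {"prefix": "0.0.0.0/0", "next_hop": "VirtualAppliance", "source": "User"}
bgp_default = {"prefix": "0.0.0.0/0", "next_hop": "VirtualNetworkGateway", "source": "BGP"}
bgp_specific = {"prefix": "203.0.113.0/24", "next_hop": "VirtualNetworkGateway", "source": "BGP"}

dest = "203.0.113.10"

# Same prefix from both sources: the UDR to the NVA still wins.
print(select_route(dest, [udr_to_nva, bgp_default])["next_hop"])

# A more specific propagated prefix bypasses the NVA for that destination.
print(select_route(dest, [udr_to_nva, bgp_default, bgp_specific])["next_hop"])

# If the UDR is no longer in effect, the propagated default route takes over.
print(select_route(dest, [bgp_default])["next_hop"])
```

Under these assumptions, the NVA is bypassed only when the UDR is no longer in effect or when a propagated prefix is more specific than the UDR, which is why the NIC's effective routes are the first thing to compare before and after the change.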
Troubleshooting Tree: Diagnose and resolve routing issues
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary decision or observable) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start at the root node and answer each question based on what you can directly observe in the environment: the Azure portal, Network Watcher, NIC effective routes, and peering configurations. Each answer eliminates a branch and narrows the set of possible causes. Orange nodes indicate you need to collect information before continuing. When you reach a red node, the cause is identified; the corresponding green node indicates the remediation action.