Troubleshooting Lab: Design and Implement Azure Route Server
Diagnostic Scenarios
Scenario 1 - Root Cause
A network engineer deployed Azure Route Server in a hub VNet and configured BGP peering with a third-party NVA. The NVA is running correctly, the BGP session was successfully established, and the Route Server appears as Succeeded in the portal. The NVA's ASN is 65001 and the Route Server's ASN is 65515.
However, VMs in the spoke VNets, connected via VNet Peering to the hub VNet, continue using Azure's default routes and cannot see the routes advertised by the NVA. The team verifies that the NVA is advertising prefixes correctly via BGP and that the routes appear in the Route Server's table.
The hub VNet has an ExpressRoute Gateway configured, but with no active circuit connected. The peering between the hub VNet and spoke VNets was created six months ago and is in Connected state.
What is the root cause of the problem?
A) The NVA's ASN 65001 conflicts with Azure's reserved range, preventing routes from being programmed in the spoke VNets.
B) The Use Remote Gateway or Route Server option is not enabled in the spoke VNets' peerings toward the hub VNet.
C) The presence of the ExpressRoute Gateway without an active circuit blocks route propagation from the Route Server to the spoke VNets.
D) Route Server does not propagate routes to spoke VNets when the BGP session is established with a third-party NVA outside the Microsoft ecosystem.
Scenario 2 - Action Decision
The operations team identified that the Branch-to-Branch functionality was inadvertently enabled on the production Azure Route Server. As a consequence, routes learned via VPN Gateway are being redistributed to the NVA, which in turn is re-advertising them to on-premises networks via ExpressRoute, creating an asymmetric routing loop that affects critical database sessions.
The cause is confirmed. The environment has the following constraints:
- ExpressRoute is the primary path for production financial workloads
- VPN Gateway is used by secondary branches with up to 30 minutes downtime tolerance
- Disabling Branch-to-Branch on Route Server does not require resource recreation
- A scheduled maintenance window is available in 48 hours
What is the correct action to take at this moment?
A) Wait for the maintenance window in 48 hours and disable Branch-to-Branch during the scheduled period, minimizing the risk of additional impact.
B) Immediately disable Branch-to-Branch on the Route Server, as the operation does not require resource recreation and the impact of the active loop exceeds the risk of the change.
C) Immediately remove the BGP peering between the NVA and Route Server to stop the loop, and reconfigure after the maintenance window.
D) Temporarily disconnect the VPN Gateway from the VNet to eliminate the source of problematic routes without altering the Route Server.
Scenario 3 - Root Cause
An NVA was deployed in a VNet and BGP peering with Azure Route Server was configured according to documentation. The operator observes that the BGP session oscillates between Connected and Idle every few minutes, never remaining stable.
The team collects the following output from the NVA:
BGP neighbor is 10.2.0.4, remote AS 65515
BGP state = Active
Last read 00:00:42, hold time is 90, keepalive interval is 30 seconds
Connect retry interval is 120 seconds
neighbor 10.2.0.4 remote-as 65515
neighbor 10.2.0.4 update-source loopback0
neighbor 10.2.0.4 ebgp-multihop 2
The operator verifies that:
- The NSG associated with RouteServerSubnet allows inbound traffic on port 179 from the NVA
- The NVA has IP connectivity to address 10.2.0.4 (confirmed via ping)
- The Route Server was deployed two days ago and has never established a stable session with this NVA
- The VNet has three subnets: default, RouteServerSubnet, and nva-subnet
What is the root cause of the BGP session instability?
A) The RouteServerSubnet NSG is blocking return traffic from the Route Server to the NVA on port 179, as the inbound rule is not sufficient for bidirectional BGP sessions.
B) The update-source loopback0 causes BGP packets to originate from an address not recognized by the Route Server as a valid peer source, preventing session stability.
C) The ebgp-multihop 2 is incorrect because the NVA and Route Server are in the same VNet, and the value should be 1 for direct connectivity.
D) The hold time of 90 seconds is incompatible with the Route Server's default timer, which requires a minimum hold time of 180 seconds for external eBGP sessions.
Scenario 4 - Collateral Impact
An operations team detected that Azure Route Server was advertising an unexpectedly high volume of prefixes to the NVA, including routes from the address space of all VNets connected via peering to the hub VNet. To simplify the routing table on the NVA and reduce BGP processing, the team decides to remove the VNet peering between the hub VNet and two lower-priority spoke VNets, keeping only the critical spoke VNets connected.
The action was executed successfully and the volume of prefixes advertised to the NVA was reduced as expected.
What is the most relevant secondary consequence of this action?
A) The Route Server starts advertising duplicate routes to the remaining spoke VNets, as it detects inconsistency in the peering topology.
B) VMs in the removed spoke VNets lose all connectivity to the hub VNet and to any resources accessible only through it, including gateways and shared services.
C) The NVA loses the ability to advertise routes via BGP to the Route Server, as the minimum quorum of connected VNets is no longer met.
D) The Route Server enters a degraded state and stops propagating routes to the remaining spoke VNets until a new peering is added.
Answer Key and Explanations
Answer Key - Scenario 1
Answer: B
The decisive clue is in the peering description: it was created six months ago, before the Route Server existed in the environment. Peerings created before Route Server deployment do not automatically inherit the necessary configuration. The Use Remote Gateway or Route Server option needs to be explicitly enabled in the peering of each spoke VNet toward the hub VNet for Azure to program the routes learned by the Route Server in those VNets.
The information about the ExpressRoute Gateway without an active circuit is deliberately irrelevant. The absence of an active circuit does not interfere with route propagation from the Route Server to the spoke VNets. This detail exists only to steer the reader toward alternative C.
ASN 65001 does not belong to Azure's reserved range (65515 is the Route Server's fixed ASN; the NVA can use any private ASN outside the reserved range), which rules out alternative A. The most dangerous distractor is alternative C: removing or reconfiguring the ExpressRoute Gateway without first checking the peering settings would cause real impact without solving the problem.
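The peering check described in this answer can be sketched as a small audit. This is a minimal sketch, assuming the spoke-to-hub peerings have already been exported as dictionaries. The field names `useRemoteGateways` and `peeringState` mirror properties of the Azure virtual network peering resource; the peering names and the function `spokes_missing_remote_gateway` are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def spokes_missing_remote_gateway(peerings: List[Dict]) -> List[str]:
    """Return the names of spoke peerings that are Connected but do not
    use the hub's gateway or Route Server."""
    return [
        p["name"]
        for p in peerings
        if p.get("peeringState") == "Connected"
        and not p.get("useRemoteGateways", False)
    ]


# Hypothetical export of the spoke-to-hub peerings from Scenario 1.
spoke_peerings = [
    {"name": "spoke1-to-hub", "peeringState": "Connected", "useRemoteGateways": False},
    {"name": "spoke2-to-hub", "peeringState": "Connected", "useRemoteGateways": True},
]

# Peerings listed here would not receive routes learned by the Route Server.
print(spokes_missing_remote_gateway(spoke_peerings))
```

A peering can be in Connected state and still appear in this list, which is why the state alone does not rule out alternative B.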
Answer Key - Scenario 2
Answer: B
The routing loop is active and affecting database sessions in production. The critical constraint here is not the ExpressRoute or VPN Gateway itself, but the nature of the fix: disabling Branch-to-Branch is an atomic operation that does not require Route Server recreation or unavailability of other resources. Waiting 48 hours (alternative A) means keeping the loop active and the production impact for two more days, which is unacceptable given that the fix is low risk.
Alternative C is dangerous because removing BGP peering with the NVA interrupts all dynamic routing of the infrastructure, causing much greater impact than the original problem. Alternative D disconnects the VPN Gateway, unnecessarily affecting secondary branches when the root cause can be fixed directly on the Route Server without touching the gateways. The 30-minute tolerance constraint for branches exists to confuse the reader and make them consider alternative D as acceptable.
Answer Key - Scenario 3
Answer: B
The root cause is the use of update-source loopback0. Azure Route Server expects the BGP peer to source its BGP packets from the IP address of its directly connected interface, which is the address configured as the peer on the Route Server. When the NVA uses a loopback interface as update-source, the TCP packets of the BGP session originate from an IP address that the Route Server does not associate with the configured peer, so the session is rejected and the NVA falls into the observed reconnection cycle.
The fact that ping to 10.2.0.4 works is a classic distractor: it confirms IP connectivity but does not validate the source address of the BGP packets. The NSG allowing inbound traffic on port 179 is another true but irrelevant piece of information for this specific cause. Alternative C about ebgp-multihop would be relevant if the value were 1 in a real multihop scenario, but a value of 2 does not cause instability by itself when the correct interface is used. Alternative D about hold time is technically incorrect: the Route Server accepts the 90-second value without problem.
The most dangerous distractor is alternative A: an operator might spend hours adjusting NSG rules without realizing that the source of BGP packets is the real problem.
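The source-address problem can be caught before any NSG work starts by scanning the NVA configuration for the offending statement. The following is a minimal sketch, assuming the configuration uses the `neighbor ... update-source ...` syntax shown in the scenario; the function name `find_loopback_update_source` is hypothetical, and other NVA vendors may use a different syntax.

```python
import re
from typing import List

# Neighbor statements copied from the Scenario 3 output.
NVA_CONFIG = """\
neighbor 10.2.0.4 remote-as 65515
neighbor 10.2.0.4 update-source loopback0
neighbor 10.2.0.4 ebgp-multihop 2
"""


def find_loopback_update_source(config: str) -> List[str]:
    """Return the neighbor IPs whose BGP session is sourced from a loopback."""
    pattern = re.compile(
        r"^\s*neighbor\s+(\S+)\s+update-source\s+loopback",
        re.IGNORECASE | re.MULTILINE,
    )
    return [match.group(1) for match in pattern.finditer(config)]


# Any neighbor listed here sources BGP from a loopback, not the interface IP.
print(find_loopback_update_source(NVA_CONFIG))
```

A check like this only flags the configuration pattern; it does not prove that the Route Server is rejecting the session, so it should be read together with the BGP state on the NVA.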
Answer Key - Scenario 4
Answer: B
Removing VNet peering between the hub VNet and spoke VNets is an infrastructure operation with immediate and total impact on those VNets' connectivity. Without peering, VMs in those VNets lose access to everything in the hub VNet: the NVA, VPN or ExpressRoute gateways, centralized DNS services, jumpboxes, and any other shared resources. The Route Server and NVA continue functioning normally for the remaining VNets.
Alternatives A, C, and D describe behaviors that the Route Server does not possess. There is no minimum quorum of connected VNets, no duplicate advertising due to topology inconsistency, and the Route Server does not enter a degraded state due to peering removal. These distractors exploit the tendency to attribute effects that belong to the VNet data plane to the Route Server. The real consequence is simple and severe: complete isolation of the VNets removed from peering.
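The isolation effect can be illustrated with a toy model of the hub-and-spoke topology. This is a minimal sketch, assuming each peering is an unordered pair of VNet names; the VNet names and the function `direct_peers` are hypothetical. Because VNet peering is non-transitive, only direct peers are counted.

```python
from typing import FrozenSet, Set


def direct_peers(vnet: str, peerings: Set[FrozenSet[str]]) -> Set[str]:
    """Return the VNets directly peered with `vnet`."""
    return {
        other
        for pair in peerings
        if vnet in pair
        for other in pair
        if other != vnet
    }


# Hypothetical topology before the change.
peerings = {
    frozenset({"hub", "spoke-critical-1"}),
    frozenset({"hub", "spoke-low-1"}),
    frozenset({"hub", "spoke-low-2"}),
}

# Remove the peerings of the two lower-priority spokes.
removed = {"spoke-low-1", "spoke-low-2"}
after_removal = {pair for pair in peerings if not pair & removed}

print(direct_peers("spoke-low-1", peerings))       # peers before the change
print(direct_peers("spoke-low-1", after_removal))  # peers after the change
print(direct_peers("hub", after_removal))          # hub keeps the critical spoke
```

In this model the removed spokes end up with no peers at all, while the hub and the remaining spoke are unaffected, which matches the explanation above.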
Troubleshooting Tree: Azure Route Server
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom or entry point |
| Blue | Diagnostic question |
| Orange | Intermediate verification or validation |
| Red | Identified cause |
| Green | Recommended action or resolution |
When facing a real problem with Azure Route Server, start at the root node and answer each question based on what is directly observable in the environment: the BGP session state on the NVA, the presence of routes in the Route Server table via portal or CLI, and the routing behavior in spoke VNets. Each branch eliminates a class of causes and directs to the next level of verification, until the cause is identified and the corresponding corrective action can be safely applied.
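The walk through the tree can be sketched as a chain of checks over the three observables named above. This is a minimal sketch under stated assumptions: the order of the checks and the wording of each recommendation are assumptions made for illustration, not a transcription of the tree, and the names `Observation` and `next_step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What is directly observable in the environment."""
    bgp_session_established: bool       # BGP session state seen on the NVA
    routes_in_route_server_table: bool  # routes visible via portal or CLI
    routes_effective_in_spokes: bool    # routing behavior in spoke VNets


def next_step(obs: Observation) -> str:
    """Return the next area to verify; each branch rules out one class of causes."""
    if not obs.bgp_session_established:
        return "Check the BGP session: peer IP, ASN, and packet source on the NVA"
    if not obs.routes_in_route_server_table:
        return "Check which prefixes the NVA advertises to the Route Server"
    if not obs.routes_effective_in_spokes:
        return "Check the spoke peering settings toward the hub VNet"
    return "Routing is consistent; look for causes outside the Route Server"


# Scenario 1: session up, routes learned, but spokes still use default routes.
scenario_1 = Observation(True, True, False)
print(next_step(scenario_1))
```

Applied to Scenario 3, where the session never stabilizes, the same function stops at the first check.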