Troubleshooting Lab: Design and implement ExpressRoute to meet requirements, including cross-region connectivity, redundancy, and disaster recovery
Diagnostic Scenarios
Scenario 1: Root Cause
The network team reports that after a successful workload migration to Azure East US, virtual machines in this region can no longer reach the on-premises environment. The existing ExpressRoute circuit was provisioned two years ago and shows Enabled status in the Azure portal. The connectivity provider confirmed that the physical link is operational and BGP peering with the provider side is established.
During investigation, the engineer runs the following command:
Get-AzExpressRouteCircuit -Name "er-circuit-prod" -ResourceGroupName "rg-network" |
Select-Object -ExpandProperty Peerings
The output returns:
Name : AzurePrivatePeering
PeeringType : AzurePrivatePeering
State : Disabled
AzureASN : 12076
PeerASN : 65100
PrimaryAzurePort :
SecondaryAzurePort :
PrimaryPeerAddressPrefix : 10.0.0.0/30
SecondaryPeerAddressPrefix : 10.0.0.4/30
The circuit uses the Standard SKU and is linked to a VNet in East US through an ExpressRoute gateway with the HighPerformance SKU. The subscription has available quota, and no blocking policy was identified. The virtual network gateway was recreated during the migration.
What is the root cause of the connectivity failure?
A) The Standard SKU circuit does not support association with HighPerformance type gateways
B) The virtual network gateway was recreated without reestablishing the connection to the ExpressRoute circuit
C) Azure Private Peering is in Disabled state, preventing BGP session establishment between Azure and on-premises
D) The configured peering prefixes (10.0.0.0/30 and 10.0.0.4/30) conflict with the VNet address space
Scenario 2: Action Decision
The cause has been identified: Azure Private Peering on the production ExpressRoute circuit is in the Disabled state after an administrative change made by an engineer during a maintenance window. The circuit serves critical database workloads with zero RPO and an RTO of less than 15 minutes.
The environment has:
- ExpressRoute circuit: er-circuit-prod (Standard SKU, active provider)
- ExpressRoute gateway: gw-er-prod (HighPerformance SKU, active-active mode)
- Circuit connection: conn-er-prod (state: Degraded)
- Available backup: Site-to-Site VPN (conn-vpn-backup, state: Connected, bandwidth: 1 Gbps)
The maintenance window ended 20 minutes ago. The impact is in production. Peering re-enablement requires a BGP session to be reestablished, which can take up to 5 minutes after configuration. The database team is waiting to restart services.
What is the correct action to take at this moment?
A) Immediately re-enable Azure Private Peering on circuit er-circuit-prod and wait for BGP reestablishment before releasing the database team
B) Immediately redirect traffic to Site-to-Site VPN and start peering re-enablement in parallel, notifying the database team about reduced bandwidth
C) Delete and recreate connection conn-er-prod to force reestablishment without manually re-enabling peering
D) Escalate to the connectivity provider to restart the BGP session on their side, avoiding intervention in Azure control plane
Scenario 3: Root Cause
A company operates two ExpressRoute circuits on different providers, both connected to the same gateway in active-active mode in Brazil South. The environment has been in production for six months without incidents. Last week, the operations team identified that after failover to Circuit B (during a scheduled test shutdown of Circuit A), traffic from some on-premises subnets stopped reaching specific virtual machines in Azure.
The team checks the effective route tables of one of the affected VMs:
Source AddressPrefix NextHopType NextHopIP
------- ------------- ----------- ---------
Default 10.0.0.0/16 VnetLocal -
Default 172.16.0.0/12 VirtualNetworkGateway 10.20.0.4
Default 0.0.0.0/0 Internet -
Circuit B shows Enabled status, and the BGP session with the provider is Established. The gateway reports both connections in the Connected state. The network team confirms that the affected on-premises prefixes belong to the 192.168.50.0/24 block, which is part of the impacted branch's addressing plan. Circuit B uses the Standard SKU.
What is the root cause of the problem observed during failover?
A) The gateway in active-active mode cannot process routes from two circuits simultaneously when one enters failover
B) Circuit B is not advertising the 192.168.50.0/24 prefix via BGP to Azure, which is why it doesn't appear in the VMs' effective route table
C) Circuit B's Standard SKU limits the number of advertised prefixes, and the 192.168.50.0/24 block was dropped due to route excess
D) The gateway's active-active mode distributes connections but doesn't synchronize route tables between the two redundant gateways
Scenario 4: Diagnostic Sequence
An engineer receives the following report: "Since last night, ExpressRoute is working, but only some VNets can communicate with on-premises. Other VNets, associated with the same gateway, have no connectivity".
The following investigation steps are available:
1. Check whether the VNets without connectivity are in a different geopolitical region from the circuit
2. Confirm that Azure Private Peering is in the Enabled state with the BGP session Established
3. Verify that the VNets without connectivity are associated with the same ExpressRoute gateway through an active connection
4. Confirm whether the circuit has the Premium add-on enabled, in case the VNets are in regions outside Standard SKU scope
5. Check whether there are routes advertised by on-premises that cover the address spaces of the affected VNets
What is the correct investigation sequence for this symptom?
A) 2 -> 3 -> 1 -> 4 -> 5
B) 1 -> 4 -> 2 -> 3 -> 5
C) 3 -> 2 -> 1 -> 4 -> 5
D) 2 -> 1 -> 4 -> 3 -> 5
Answer Key and Explanations
Answer Key: Scenario 1
Answer: C
The definitive clue is in the command output: State: Disabled for Azure Private Peering. Without peering enabled, no BGP session is established between Azure and on-premises, regardless of physical link status or gateway health. The ExpressRoute data plane depends entirely on the Private Peering BGP session to learn and advertise routes.
The detail about gateway recreation during the migration is a deliberate distractor: a gateway recreated without its connection to the circuit would cause similar symptoms, but the command output points elsewhere. The peering itself is disabled, and the gateway-circuit connection depends on that peering to exchange routes.
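The reading of the command output described above can be sketched in Python. This is a minimal illustration, not part of the lab: the sample text is the Scenario 1 output, while the helper names `parse_peering` and `peering_blocks_routing` are hypothetical and exist only for this example.

```python
# Sample text copied from the Scenario 1 command output.
SAMPLE_OUTPUT = """\
Name : AzurePrivatePeering
PeeringType : AzurePrivatePeering
State : Disabled
AzureASN : 12076
PeerASN : 65100
PrimaryAzurePort :
SecondaryAzurePort :
PrimaryPeerAddressPrefix : 10.0.0.0/30
SecondaryPeerAddressPrefix : 10.0.0.4/30
"""


def parse_peering(text):
    """Turn 'Key : Value' lines into a dict; missing values become ''."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def peering_blocks_routing(fields):
    """True when the peering state alone is enough to explain the outage."""
    return fields.get("State", "").lower() != "enabled"


peering = parse_peering(SAMPLE_OUTPUT)
print(peering["State"], peering_blocks_routing(peering))
```

The point of the sketch is the order of reasoning: the State field is checked before anything about the gateway or the connection is considered.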
Analysis of distractors:
- A is false: Standard SKU is compatible with HighPerformance gateways; SKU restrictions refer to features like Global Reach and number of VNets, not gateway type.
- B is plausible and represents the most dangerous reasoning error: focusing on gateway recreation ignores direct evidence in the command output.
- D is false: the /30 prefixes are used only for the BGP peering links and must not overlap the VNet address space; nothing in the scenario indicates such an overlap.
Acting based on distractor B would lead to unnecessary connection recreation, prolonging the incident without resolving the cause.
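To test a hypothesis like distractor D rather than assume it, the peering prefixes can be compared against the VNet address space. The sketch below uses Python's standard `ipaddress` module. The peering prefixes come from the Scenario 1 output; the VNet address spaces are hypothetical, because the scenario does not state them, and `overlapping_prefixes` is an illustrative helper name.

```python
import ipaddress

# Peering prefixes from the Scenario 1 output.
PEERING_PREFIXES = ["10.0.0.0/30", "10.0.0.4/30"]


def overlapping_prefixes(peering_prefixes, vnet_spaces):
    """Return (peering, vnet) pairs whose address ranges overlap."""
    conflicts = []
    for peering in peering_prefixes:
        peering_net = ipaddress.ip_network(peering)
        for vnet in vnet_spaces:
            if peering_net.overlaps(ipaddress.ip_network(vnet)):
                conflicts.append((peering, vnet))
    return conflicts


# Hypothetical VNet address spaces; the scenario does not provide them.
print(overlapping_prefixes(PEERING_PREFIXES, ["10.1.0.0/16"]))  # no conflict
print(overlapping_prefixes(PEERING_PREFIXES, ["10.0.0.0/16"]))  # both /30s conflict
```

An empty result for the real VNet address space would confirm that option D can be discarded.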
Answer Key: Scenario 2
Answer: A
The cause is identified and simple to fix: re-enable the peering. The scenario specifies an RTO of less than 15 minutes, and BGP reestablishment takes up to 5 minutes after configuration, which is within the acceptable window. The correct action is to restore the original production path immediately.
Analysis of incorrect alternatives:
- B seems pragmatic but is incorrect in this context: a 1 Gbps Site-to-Site VPN may be insufficient for database workloads with zero RPO, and introducing an unnecessary second failover increases the risk of a compound incident. In addition, the scenario states that the cause is known and the fix is direct.
- C is technically possible but ineffective: recreating the connection doesn't re-enable peering. Peering state is independent of the gateway-circuit connection.
- D is incorrect because peering was disabled on the Azure side, not provider side. Escalating to the provider creates unnecessary delay for an internally resolvable problem.
The main reasoning error in the distractors is confusing urgency with the need for an intermediate action when a direct and definitive action is available within the RTO.
Answer Key: Scenario 3
Answer: B
The effective route table of the affected VMs contains no entry for the 192.168.50.0/24 prefix. This means Azure never learned this route via BGP over Circuit B. The cause is that Circuit B is not advertising this specific prefix, either because of incomplete route summarization on the on-premises side for this provider or because a BGP route filter on Circuit B excludes this block.
The information that BGP is Established is the main trap of the scenario: an established BGP session doesn't guarantee all necessary prefixes are being advertised. Active session and complete route advertisement are independent conditions.
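The effect of the missing prefix can be illustrated with a longest-prefix-match lookup over the effective routes shown in the scenario. This is a simplified sketch using Python's standard `ipaddress` module: it models only prefix length and ignores any other tie-breaking between route sources, and `next_hop_for` is a hypothetical helper name.

```python
import ipaddress

# Effective routes copied from the Scenario 3 table (prefix, next hop type).
EFFECTIVE_ROUTES = [
    ("10.0.0.0/16", "VnetLocal"),
    ("172.16.0.0/12", "VirtualNetworkGateway"),
    ("0.0.0.0/0", "Internet"),
]


def next_hop_for(destination, routes=EFFECTIVE_ROUTES):
    """Return (prefix, next hop type) of the longest matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, hop = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), hop


print(next_hop_for("192.168.50.10"))  # ('0.0.0.0/0', 'Internet')
print(next_hop_for("172.16.5.10"))    # ('172.16.0.0/12', 'VirtualNetworkGateway')
```

With no 192.168.50.0/24 entry, return traffic toward the branch matches only the default route, so it never reaches the gateway even though the BGP session is Established.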
Analysis of distractors:
- A is false: active-active mode processes routes from multiple circuits normally; this is exactly its purpose.
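- C is plausible but incorrect: the Standard SKU limit for private peering is 4,000 prefixes, and nothing in the scenario indicates that the limit was approached or that routes were dropped. Exceeding the limit also brings down the BGP session rather than removing a single prefix, which contradicts the Established state.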
- D reveals a misunderstanding of gateway architecture: in active-active mode, both gateway instances share the same connections and route tables; there aren't two independent tables to synchronize.
The most dangerous distractor is A: an engineer believing this might reconfigure the gateway without solving the actual problem, prolonging impact.
Answer Key: Scenario 4
Answer: A
The correct sequence is 2 -> 3 -> 1 -> 4 -> 5, following diagnostic logic from most general to most specific:
Step 2 first: confirming Private Peering is enabled and BGP is established is the prerequisite for everything. If peering is degraded, subsequent steps are irrelevant.
Step 3 next: verifying that problematic VNets are actually associated with the gateway via active connection eliminates the most basic configuration error before investigating SKU restrictions.
Step 1 after: identifying if affected VNets are in different geopolitical regions directs investigation toward circuit SKU.
Step 4 in sequence: the Premium add-on question only makes sense after confirming VNets are in regions outside Standard scope, which step 1 would reveal.
Step 5 last: checking specific route advertisements is the most granular investigation and should only be done after ruling out all scope and association problems.
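Alternatives B and D make the error of investigating SKU restrictions before ruling out more basic causes: B starts with region and SKU before confirming peering state, and D checks region and SKU before confirming VNet association. Alternative C starts with VNet association and defers peering verification to the second step, which puts the diagnostic foundation in the wrong place.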
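The 2 -> 3 -> 1 -> 4 -> 5 ordering can be sketched as a chain of checks that stops at the first finding that explains the fault. This is an illustrative Python sketch only: the `findings` keys and the `investigate` function are hypothetical names, and in practice each value would come from the portal or from PowerShell output.

```python
def investigate(findings):
    """Walk the checks in the order 2 -> 3 -> 1 -> 4 -> 5.

    Returns (step_number, conclusion) for the first check that explains
    the fault, or (None, message) when none of the five checks does.
    """
    # Step 2: peering and BGP are the prerequisite for everything else.
    if not findings["peering_ok"]:
        return 2, "Private Peering is not Enabled or BGP is not Established"
    # Step 3: rule out a missing or inactive gateway connection.
    if not findings["connection_ok"]:
        return 3, "affected VNets lack an active connection to the gateway"
    # Steps 1 and 4: the SKU question matters only for cross-region VNets.
    if findings["cross_geo"] and not findings["premium_enabled"]:
        return 4, "VNets are outside Standard SKU scope and Premium is not enabled"
    # Step 5: the most granular check comes last.
    if not findings["routes_ok"]:
        return 5, "route advertisements for the affected VNets need review"
    return None, "no cause found by these five checks"


# Hypothetical findings for one investigation.
example = {
    "peering_ok": True,
    "connection_ok": True,
    "cross_geo": True,
    "premium_enabled": False,
    "routes_ok": True,
}
print(investigate(example))
```

Note that step 1 never returns a conclusion by itself: it only decides whether step 4 is worth asking, which is why it must precede step 4 in any valid sequence.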
Troubleshooting Tree: Design and Implement ExpressRoute
Legend:
- Dark blue: initial symptom, diagnostic entry point
- Blue: objective diagnostic question with yes or no answer
- Red: identified cause requiring deep investigation
- Green: recommended action or direct resolution
- Orange: validation state or intermediate checkpoint
To use this tree when facing a real problem, always start with the root node, which describes the loss-of-connectivity symptom. Answer each diagnostic question based on what is observable in the portal, PowerShell command output, or the effective route table. Follow the path indicated by your answer until you reach a red node (cause) or a green node (action). Orange nodes indicate that the current state appears correct and that investigation should either continue or end with active monitoring.
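The traversal described above can be sketched as a small data structure and a walk function. The tree below is a hypothetical miniature built from the scenarios in this lab, not a copy of the original diagram; the node names, question wording, and the `walk` helper are all assumptions made for illustration. The `kind` values mirror the legend: symptom, question, cause, action, and checkpoint.

```python
# Hypothetical miniature tree; nodes are illustrative, not the original diagram.
TREE = {
    "start": {"kind": "symptom", "text": "No connectivity over ExpressRoute",
              "next": "q_peering"},
    "q_peering": {"kind": "question", "text": "Is Azure Private Peering Enabled?",
                  "yes": "q_route", "no": "a_enable"},
    "a_enable": {"kind": "action", "text": "Re-enable Azure Private Peering"},
    "q_route": {"kind": "question",
                "text": "Is the on-premises prefix in the effective routes?",
                "yes": "ok", "no": "c_advert"},
    "c_advert": {"kind": "cause", "text": "Prefix is not advertised over BGP"},
    "ok": {"kind": "checkpoint", "text": "State appears correct; keep monitoring"},
}


def walk(tree, answers, node="start"):
    """Follow yes/no answers until a cause, action, or checkpoint is reached."""
    path = [node]
    while tree[node]["kind"] in ("symptom", "question"):
        current = tree[node]
        if current["kind"] == "symptom":
            node = current["next"]
        else:
            node = current["yes"] if answers[node] else current["no"]
        path.append(node)
    return path


# Scenario 3 style walk: peering is enabled, but the prefix is missing.
for name in walk(TREE, {"q_peering": True, "q_route": False}):
    print(TREE[name]["kind"], "-", TREE[name]["text"])
```

Each answer in `answers` corresponds to one observation from the portal, PowerShell, or the effective route table, so the walk stops as soon as the evidence is enough to name a cause or an action.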