Troubleshooting Lab: Diagnose and Resolve ExpressRoute Connection Issues

Diagnostic Scenarios

Scenario 1: Root Cause

A company has had an active ExpressRoute circuit with Private Peering configured for six months without incidents. In the past week, after a change to the on-premises firewall, the operations team reports that VMs in two Azure VNets stopped responding to pings originating from the corporate network. The network team confirms that the circuit shows "CircuitProvisioningState: Enabled" and "ServiceProviderProvisioningState: Provisioned". The connectivity provider has not reported any degradation in the physical link.

The team checked the BGP route table for the peering and obtained the following result:

Get-AzExpressRouteCircuitRouteTable `
    -ResourceGroupName "rg-expressroute" `
    -ExpressRouteCircuitName "erc-corp" `
    -PeeringType "AzurePrivatePeering" `
    -DevicePath "Primary"

Network        NextHop    LocPrf  Weight  Path
10.10.0.0/16   10.0.0.1   100     0       65100
10.20.0.0/16   10.0.0.1   100     0       65100
0.0.0.0/0      10.0.0.5   100     0       65100

The team also confirms that the virtual network gateway for both VNets was provisioned as ErGw1AZ and is associated with the circuit. The gateway uptime in the last 24 hours is 100%.

What is the root cause of the connectivity failure?

A) The ErGw1AZ gateway does not support ICMP traffic, blocking pings even when the route is present.

B) The default route 0.0.0.0/0 advertisement from the on-premises environment is overlapping the specific VNet routes in Azure's routing plane, redirecting return traffic to the on-premises environment and causing loops or drops.

C) The circuit is operating only on the primary path; the secondary path failed after the firewall change, reducing available bandwidth below the minimum required.

D) The on-premises firewall change introduced rules that block return traffic originated from Azure VNet prefixes, preventing responses from reaching the corporate network.


Scenario 2: Root Cause

An engineer is commissioning a new ExpressRoute circuit with Microsoft Peering to allow access to Azure Storage and Azure SQL via public addresses. The peering was configured and the returned state is "PeeringState: Enabled". A Route Filter was created and associated with the peering, containing BGP community 12076:5010 (Azure Storage). Connectivity testing to Azure SQL fails completely, while Azure Storage access works normally.

The current Route Filter configuration is:

RouteFilterName : rf-microsoft-peering
Rules:
  - Name        : allow-storage
    Action      : Allow
    Communities : 12076:5010
    Access      : Allow

The engineer confirms that the peering has a valid PeerASN, that the peering prefixes are public and registered with an RIR, and that the VLAN ID is correct. The circuit uses the Standard SKU.

What is the root cause of the Azure SQL access failure?

A) The circuit's Standard SKU does not support access to PaaS services like Azure SQL via Microsoft Peering; Premium SKU is required.

B) The Route Filter is configured only with the Azure Storage BGP community. The BGP community corresponding to Azure SQL was not included, so SQL prefixes are not being advertised to the on-premises environment.

C) The peering needs a second dedicated BGP session for each PaaS service; the current session only supports the service whose prefixes were learned first.

D) Azure SQL is not accessible via Microsoft Peering; this service requires Private Peering combined with Private Endpoint configured in the VNet.


Scenario 3: Action Decision

The root cause has been identified: the secondary path of the production ExpressRoute circuit is in a layer 2 failure state. The ARP table for the secondary path is empty, while the primary path operates normally. All production traffic is currently flowing through the primary path without perceptible degradation.

The responsible engineer has the following context information:

  • The incident occurred at 2 PM on a Friday
  • The connectivity provider confirmed equipment failure on their side and estimates resolution within 4 hours
  • The production environment SLA requires active redundancy on both paths
  • The company's change team requires formal approval for any changes to production circuits outside the standard window, which occurs on Tuesdays
  • The primary path is stable and shows no signs of degradation

What is the correct action to take at this time?

A) Open a formal emergency change ticket, wait for approval and, once approved, perform manual failover to a backup circuit while the provider resolves the failure.

B) Wait for resolution by the connectivity provider, actively monitor the primary path and trigger the formal incident process for registration and tracking, without making changes to the circuit.

C) Immediately execute failover of all traffic to an alternative backup circuit without opening a ticket, prioritizing redundancy recovery before the primary path also fails.

D) Disable and re-enable peering in the Azure portal to force BGP renegotiation on both paths, attempting to recover the secondary path without involving the provider.


Scenario 4: Diagnostic Sequence

An administrator receives the following alert at 09:15:

ALERT: ExpressRoute circuit 'erc-prod-br' connectivity degraded
Affected resource: VNet 'vnet-prod-eastus'
Symptom: On-premises hosts unable to reach VMs in vnet-prod-eastus
Circuit state: Enabled | Provider state: Provisioned
Time of first occurrence: 09:02 UTC

The administrator has access to the Azure portal, PowerShell, and the connectivity provider team. No planned changes were scheduled for this period.

The following investigation steps are available:

  1. Check the ARP table of primary and secondary paths to identify layer 2 failures
  2. Check the virtual network gateway state and confirm it is associated with the circuit
  3. Check the BGP route table for advertised and received routes in private peering
  4. Confirm with the connectivity provider if there is reported degradation on their side
  5. Check "ServiceProviderProvisioningState" and "CircuitProvisioningState" in the portal

What is the correct investigation sequence?

A) 2 -> 5 -> 1 -> 3 -> 4

B) 5 -> 1 -> 3 -> 4 -> 2

C) 5 -> 4 -> 1 -> 3 -> 2

D) 1 -> 3 -> 5 -> 2 -> 4


Answer Key and Explanations

Answer Key: Scenario 1

Answer: D

The decisive clue in the statement is when the failure began: immediately after a change to the on-premises firewall. The circuit is operational, provisioning is correct, the BGP table shows that VNet routes are being received normally by the Microsoft side, and the gateway is healthy. This eliminates any hypothesis of failure in the Azure control plane.

What changed was the firewall. The "ping does not respond" failure is a symptom of traffic being blocked on return, not absence of route. The VMs receive the packets (the route exists), but the response is blocked by the new firewall rules when trying to return to the corporate network.

The information about 100% gateway uptime is irrelevant and was purposely included to divert the diagnosis to the Azure side. Alternative B is the most dangerous distractor: the default route 0.0.0.0/0 appears in the table and may seem suspicious, but it is advertised by the on-premises side and does not affect return routing from the VMs to the corporate network in a way that would cause the described symptom. Acting on it would lead to a lengthy and misdirected investigation of the Azure routing plane while the real cause sits in the on-premises firewall. Alternative C describes a capacity problem, not a total loss of connectivity.
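Why the default route is only a distractor can be checked with a longest-prefix-match lookup. The sketch below is a minimal Python illustration, not Azure's actual routing implementation: it loads the three prefixes from the Scenario 1 route table and resolves a destination address. The function name `best_route` and the sample destination addresses are assumptions made for illustration.

```python
import ipaddress

# Prefixes and next hops copied from the Scenario 1 route table.
ROUTES = {
    "10.10.0.0/16": "10.0.0.1",
    "10.20.0.0/16": "10.0.0.1",
    "0.0.0.0/0": "10.0.0.5",
}


def best_route(destination, routes=ROUTES):
    """Return (prefix, next_hop) for the longest prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        ipaddress.ip_network(prefix)
        for prefix in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    # The most specific match (largest prefix length) wins.
    winner = max(candidates, key=lambda net: net.prefixlen)
    return str(winner), routes[str(winner)]


# Hypothetical destinations: one inside a /16, one covered only by the default route.
print(best_route("10.10.4.7"))
print(best_route("172.16.0.9"))
```

A destination inside 10.10.0.0/16 resolves to the specific route even though 0.0.0.0/0 also matches, so the presence of the default route alone does not explain the symptom.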


Answer Key: Scenario 2

Answer: B

The Route Filter controls which BGP prefixes are advertised to the on-premises environment via Microsoft Peering. Prefixes are grouped by BGP community: in this scenario Azure Storage is matched by 12076:5010, and Azure SQL belongs to a different community. Since the Route Filter contains only the rule for Storage, only Storage prefixes are advertised. Azure SQL prefixes never reach the on-premises router, which makes the service unreachable.
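The filtering behavior can be modeled in a few lines. This is a simplified Python sketch of community-based filtering, not the Azure implementation: the prefixes are documentation ranges rather than real Azure ranges, `advertised` is a hypothetical helper, and the Azure SQL community is a labeled placeholder because the scenario does not give its value.

```python
from dataclasses import dataclass

STORAGE_COMMUNITY = "12076:5010"  # value used in the scenario's Route Filter
SQL_COMMUNITY = "sql-community-placeholder"  # hypothetical; not given in the scenario


@dataclass(frozen=True)
class Prefix:
    cidr: str
    community: str


# Hypothetical prefixes from documentation ranges, not real Azure ranges.
MICROSOFT_PREFIXES = [
    Prefix("203.0.113.0/24", STORAGE_COMMUNITY),
    Prefix("198.51.100.0/24", SQL_COMMUNITY),
]


def advertised(prefixes, allowed_communities):
    """Keep only the prefixes whose community the filter allows."""
    return [p.cidr for p in prefixes if p.community in allowed_communities]


current_filter = {STORAGE_COMMUNITY}
fixed_filter = {STORAGE_COMMUNITY, SQL_COMMUNITY}

print(advertised(MICROSOFT_PREFIXES, current_filter))
print(advertised(MICROSOFT_PREFIXES, fixed_filter))
```

With the current filter only the Storage prefix survives; adding the second community to the allowed set is enough for the SQL prefix to be advertised as well.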

Standard SKU supports Microsoft Peering normally; Premium SKU is required for access to regions outside the geopolitical area, not for service types. This is the confusion that alternative A exploits. Alternative C invents behavior that does not exist: a single BGP session per peering supports multiple communities and prefixes simultaneously. Alternative D is technically incorrect: Azure SQL is accessible via Microsoft Peering using its public addresses; Private Endpoint is an alternative, not an exclusive requirement. The most dangerous distractor is D, as an engineer who believes it might create an unnecessary Private Endpoint architecture, adding complexity and cost.


Answer Key: Scenario 3

Answer: B

The scenario establishes critical constraints that eliminate the other alternatives. The primary path is stable and operating normally, the provider has already identified the cause and has a 4-hour resolution estimate, and any production changes require formal approval outside the standard window.

The correct action is to monitor, formally register the incident, and wait. The SLA requires redundancy, but does not require the company to execute an unapproved emergency change when the primary path is working and resolution is underway.

Alternative A is attractive because it seems responsible, but it opens an emergency change process for an action (failover) that is unnecessary while the primary path is stable. Alternative C is the most dangerous: executing an unauthorized production failover can violate governance policies and introduce unnecessary additional risk. Alternative D is technically incorrect and potentially destructive: disabling and re-enabling the peering affects both paths, including the healthy primary, and can cause a complete interruption while BGP renegotiates.


Answer Key: Scenario 4

Answer: C

The correct ExpressRoute diagnostic sequence follows layer logic, from general state to detail: confirm the circuit state first, rule out provider-side layer 1 and 2 failures early, then move up through ARP and BGP to the gateway.

Step 5 confirms that the circuit is provisioned correctly on both sides. Without this, any subsequent investigation may be useless.

Step 4 engages the provider early, in parallel with the remaining checks, because layer 1 and 2 failures on the provider side are invisible to the Azure portal and can be quickly confirmed or ruled out with a call.

Step 1 checks the ARP table to identify layer 2 failures on the Microsoft side.

Step 3 checks the BGP control plane to identify if correct routes are being exchanged (layer 3).

Step 2 checks the gateway last, since the alert already indicates that the circuit is Enabled and Provisioned. The gateway is the highest layer and should be verified after confirming that lower layers are healthy.

Alternative A starts with the gateway, which is a high-layer hypothesis without evidence. Alternative B contacts the provider only after the ARP and BGP checks, delaying confirmation of failures that only the provider can see. Alternative D starts with the ARP table before confirming circuit state, which is a premature diagnostic jump. The most common error among the incorrect alternatives is going directly to the component most familiar to the engineer instead of following the logical layer progression.
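The sequence from answer C can be expressed as an ordered list of checks that stops at the first unhealthy result. The Python sketch below only models the ordering. The `first_failure` helper and the sample findings are assumptions for illustration, and each boolean stands in for a real check made in the portal, via PowerShell, or with the provider.

```python
# Step numbers follow the list in Scenario 4; the order follows answer C.
SEQUENCE = [
    (5, "circuit and provider provisioning state"),
    (4, "provider-side degradation"),
    (1, "ARP tables (layer 2)"),
    (3, "BGP route table (layer 3)"),
    (2, "virtual network gateway"),
]


def first_failure(results, sequence=SEQUENCE):
    """Walk the checks in order and return the first step that is not healthy.

    results maps step number -> True (healthy) or False (unhealthy).
    A step missing from results counts as not yet verified and stops the walk.
    """
    for step, description in sequence:
        if not results.get(step, False):
            return step, description
    return None


# Hypothetical findings: everything is healthy except the BGP route table.
findings = {5: True, 4: True, 1: True, 3: False, 2: True}
print(first_failure(findings))
```

Stopping at the first unhealthy step keeps the investigation on the lowest failing layer instead of jumping ahead to the gateway.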


Troubleshooting Tree: Diagnose and Resolve ExpressRoute Connection Issues

Color Legend:

Color      Node Type
Dark blue  Initial symptom (entry point)
Blue       Diagnostic question (binary decision or observable state)
Red        Identified cause
Green      Recommended action or resolution
Orange     Intermediate verification or validation

When facing a real problem, start at the root node and answer each question based on what is directly observable, whether in the Azure portal, via PowerShell, or with provider information. Follow the path corresponding to the obtained answer. Orange nodes indicate that additional verification is necessary before concluding the diagnosis. When reaching a red node, the cause is identified and corrective action can be safely initiated.