
Troubleshooting Lab: Configure encryption over ExpressRoute

Diagnostic Scenarios​

Scenario 1: Root Cause

A network team reports that an IPsec tunnel over ExpressRoute Private Peering was configured two days ago and shows Connected status in the Azure portal. However, on-premises applications cannot reach VMs in the Azure VNet, and the team opens a ticket describing the problem as "VPN works but traffic doesn't pass."

The responsible engineer collects the following information:

# Output from Get-AzVirtualNetworkGatewayConnection command on Azure side
ConnectionStatus : Connected
EgressBytesTransferred : 0
IngressBytesTransferred : 0
RoutingWeight : 10
SharedKey : (configured)

# Route verification on on-premises device
Destination Gateway Flags Metric Interface
10.10.0.0/16 192.168.1.1 UG 100 eth0
172.16.0.0/12 0.0.0.0 U 0 lo

The ExpressRoute circuit is Provisioned and BGP peering is established with prefixes being advertised in both directions. The team also reports that they replaced the distribution switch uplink cable yesterday, but connectivity to other Azure services via ExpressRoute (without IPsec) returned to normal operation after the replacement.

What is the root cause of the problem?

A) The uplink cable replacement caused a BGP interruption that hasn't fully recovered yet, preventing traffic from flowing through the tunnel.

B) The IKE/IPsec policy (algorithms, DH group, or lifetime) is incompatible between the on-premises device and the Azure VPN Gateway, resulting in a connected tunnel without traffic.

C) The route to the Azure VNet on the on-premises device is pointing to the wrong gateway, causing traffic destined for the VNet to not enter the IPsec tunnel.

D) The SharedKey was configured incorrectly; when there's a pre-shared key mismatch, Azure reports Connected status but doesn't forward traffic.


Scenario 2: Action Decision

The cause of the problem has been identified: the MACsec cipher suite configured on the on-premises device is GCM-AES-256, but the Microsoft side is provisioned with GCM-AES-128. The team needs to resolve the problem. The operational context is as follows:

  • The ExpressRoute circuit is in production and carries critical traffic from other applications that don't use MACsec
  • The scheduled maintenance window is in 48 hours
  • The engineer has permission to change configurations on the on-premises device immediately
  • Changing the cipher suite on the Microsoft side requires opening a ticket with the connectivity provider and takes between 24 and 72 hours
  • The security team confirms that GCM-AES-128 meets the current corporate policy

What is the correct action to take at this moment?

A) Open a ticket with the provider to change the Microsoft cipher suite to GCM-AES-256, as stronger algorithms should be prioritized, and wait for the maintenance window.

B) Change the cipher suite on the on-premises device to GCM-AES-128 immediately, without waiting for the maintenance window, as the change is on the local equipment and the impact is limited to the MACsec link that's not yet operational.

C) Temporarily disable MACsec on the on-premises device and wait for the maintenance window to re-enable it with the correct cipher suite, as any change in production requires a formal window.

D) Escalate to the security team the need to review the corporate policy and authorize GCM-AES-256 on the Microsoft side before any changes.


Scenario 3: Root Cause

An architect configured IPsec over ExpressRoute in an environment with the following connectivity diagram:

On-premises (10.0.0.0/8)
|
[VPN Device - IKEv2]
|
[ExpressRoute Circuit - Private Peering]
|
[VPN Gateway - active-active - VpnGw2]
|
VNet: 172.16.0.0/16

After configuration, traffic flows normally. Two days later, the operations team enables ExpressRoute Global Reach to connect this circuit to a second ExpressRoute circuit from another branch. Immediately after activation, the team reports that VMs in the Azure VNet became reachable from the second branch without passing through the IPsec tunnel, and the traffic appears in logs without encryption.

The security team reports the failure as critical. What is the root cause?

A) ExpressRoute Global Reach created a direct routing path between the two circuits that bypasses the VPN Gateway, causing traffic from the second branch to reach the VNet without passing through the IPsec tunnel.

B) Global Reach activation restarted the VPN Gateway in active-active mode, and during restart traffic was forwarded without encryption via the native ExpressRoute route.

C) The VPN Gateway in VpnGw2 SKU doesn't support multiple ExpressRoute circuits simultaneously, and the second circuit was treated as a direct connection without IPsec policy.

D) BGP from the second circuit advertised routes with shorter AS Path to the VNet, making the VPN Gateway prefer the unencrypted path in the routing table.


Scenario 4: Diagnostic Sequence

An engineer receives the following report: "The IPsec tunnel over ExpressRoute was configured yesterday. The status in the portal is Connected, but on-premises applications cannot reach VMs in the Azure VNet. ExpressRoute itself works normally for other resources."

The available investigation steps are:

  1. Verify if routes to the VNet are being learned via BGP on the on-premises device and point to the VPN Gateway IP as next-hop
  2. Confirm that the IPsec tunnel status is Connected and check byte transfer counters (EgressBytes and IngressBytes)
  3. Compare IKE/IPsec parameters configured on the on-premises device with the policy configured on the Azure VPN Gateway
  4. Verify basic ExpressRoute circuit connectivity and confirm that BGP peering is established
  5. Execute a traceroute from on-premises to a VM IP in the VNet and observe where traffic stops

What is the correct diagnostic sequence?

A) 4 → 2 → 1 → 5 → 3

B) 2 → 4 → 3 → 1 → 5

C) 1 → 3 → 4 → 2 → 5

D) 4 → 1 → 2 → 3 → 5


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: C

The decisive clue is in the on-premises device routing table: the route to 10.10.0.0/16 (the Azure VNet) points to 192.168.1.1, which is the local network's default gateway, and not to the IPsec tunnel endpoint. For traffic to enter the tunnel, the destination route must have the IPsec tunnel IP or VPN interface as next-hop, not the LAN gateway. With this configuration, packets destined for the VNet exit through the wrong interface and never reach the Azure VPN Gateway.
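The lookup described above can be reproduced with a minimal sketch using Python's standard `ipaddress` module. The routing table is copied from the scenario; the VM address `10.10.1.4` and the tunnel interface name `vti0` are hypothetical values chosen for illustration, not details from the scenario.

```python
import ipaddress

# Routing table from the scenario: (destination, gateway, interface).
ROUTES = [
    ("10.10.0.0/16", "192.168.1.1", "eth0"),
    ("172.16.0.0/12", "0.0.0.0", "lo"),
]

def lookup(dest_ip, routes):
    """Return the longest-prefix-match route for dest_ip, or None."""
    addr = ipaddress.ip_address(dest_ip)
    matches = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)

def enters_tunnel(dest_ip, routes, tunnel_interfaces):
    """True if the selected route leaves through one of the tunnel interfaces."""
    route = lookup(dest_ip, routes)
    return route is not None and route[2] in tunnel_interfaces

# 10.10.1.4 is a hypothetical VM inside the VNet; vti0 is a hypothetical
# name for the on-premises IPsec tunnel interface.
print(lookup("10.10.1.4", ROUTES))
print(enters_tunnel("10.10.1.4", ROUTES, {"vti0"}))
```

With the table as collected, the selected route leaves through `eth0` toward 192.168.1.1, so the check for the tunnel interface fails, which matches the zero byte counters on the Azure side.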

The uplink cable replacement is a deliberate distractor. The fact that BGP is established and other services work via ExpressRoute confirms that the transport infrastructure is healthy; the cable replacement has no bearing on how traffic is routed into the IPsec tunnel.

Alternative B is a plausible distractor because IKE/IPsec policy problems are a common cause of tunnel trouble. However, the tunnel was configured two days ago and shows Connected status; a mismatch in algorithms or DH group would normally prevent negotiation from completing, so the tunnel would not have reached Connected state. Alternative D is incorrect for the same reason: a SharedKey mismatch prevents the tunnel from reaching Connected state. The most dangerous distractor is alternative A, as it would lead the engineer to investigate BGP and the circuit when the problem is in the local routing table.


Answer Key: Scenario 2

Answer: B

The scenario constraints define the decision space: MACsec is not yet operational (therefore there's no impact on existing traffic when changing the link), the security team confirmed that GCM-AES-128 is acceptable, and the engineer has immediate permission to act on the on-premises device. Changing the cipher suite on the local equipment to GCM-AES-128 solves the problem without affecting any production traffic, as the MACsec link with the wrong cipher is simply not operating.
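The reasoning above can be expressed as a small decision sketch: among the changes that end on a policy-compliant cipher suite, pick the one with the shortest lead time. The `Side` type, the `plan_fix` helper, and the lead-time numbers are illustrative assumptions (0 hours for the on-premises device, 24 hours as the lower bound of the provider ticket). The scenario only confirms that GCM-AES-128 meets policy; treating GCM-AES-256 as also allowed is an assumption made so that both options can be compared.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Side:
    name: str
    cipher: str
    lead_time_hours: int  # earliest a change on this side can take effect

def plan_fix(local, remote, allowed):
    """Return (side_to_change, new_cipher, lead_time_hours), or None if the ciphers already match."""
    if local.cipher == remote.cipher:
        return None
    options = []
    if remote.cipher in allowed:   # align the local side to the remote cipher
        options.append((local.lead_time_hours, local.name, remote.cipher))
    if local.cipher in allowed:    # align the remote side to the local cipher
        options.append((remote.lead_time_hours, remote.name, local.cipher))
    if not options:
        raise ValueError("no policy-compliant cipher suite in common")
    hours, side, cipher = min(options)  # shortest lead time wins
    return side, cipher, hours

# Assumption: both suites are treated as policy-compliant for the comparison.
ALLOWED = {"GCM-AES-128", "GCM-AES-256"}
on_prem = Side("on-premises", "GCM-AES-256", 0)
microsoft = Side("Microsoft", "GCM-AES-128", 24)

print(plan_fix(on_prem, microsoft, ALLOWED))
```

Under these assumptions the sketch selects the on-premises change to GCM-AES-128, the same outcome as alternative B.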

Alternative A ignores two critical constraints: it prioritizes a stronger algorithm when corporate policy already accepts GCM-AES-128, and it triggers a 24-to-72-hour process with the provider when the fix can be applied immediately on the on-premises side. Alternative C would be reasonable if the change on the on-premises device carried a real risk of production impact, but because the MACsec link is not yet operational, the change disrupts no traffic. Alternative D turns the problem into a policy review when the security team has already made the decision. The most dangerous distractor is C: the prudence of waiting for the window is valid in many contexts, but here it amounts to unnecessary delay.


Answer Key: Scenario 3

Answer: A

ExpressRoute Global Reach connects two ExpressRoute circuits directly in Microsoft's backbone network, creating a routing path between the two on-premises networks that doesn't pass through the Azure VNet or VPN Gateway. When the second branch sends traffic to the VNet (172.16.0.0/16), Global Reach isn't involved in that path; but when the VNet tries to reach the second branch, routing can take the direct path via Global Reach, bypassing the VPN Gateway and therefore bypassing IPsec.

More critically, traffic originating from the second branch that eventually needs to reach the VNet may find a path via Global Reach to the first circuit and then to the VNet without passing through the configured IPsec tunnel. Global Reach doesn't inherit security policies from the VPN Gateway; it's an independent routing connection.
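A toy reachability model can make the bypass concrete. The edge lists below are an assumption that mirrors the topology as this explanation describes it (the first circuit can hand traffic either to the VPN Gateway or directly to the VNet over private peering). They are not a statement of how Azure actually programs routes, and all node names are hypothetical.

```python
from collections import deque

def reachable_avoiding(graph, src, dst, avoid):
    """Breadth-first search: True if dst can be reached from src without visiting `avoid`."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt != avoid and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Assumed topology before Global Reach: the second circuit is isolated.
before = {
    "branch1": ["circuit1"],
    "circuit1": ["vpn_gateway", "vnet"],  # tunnel path and native private peering path
    "vpn_gateway": ["vnet"],
    "branch2": ["circuit2"],
    "circuit2": [],
}

# Assumed topology after Global Reach: circuit2 gains a link to circuit1.
after = dict(before, circuit2=["circuit1"])

print(reachable_avoiding(before, "branch2", "vnet", "vpn_gateway"))
print(reachable_avoiding(after, "branch2", "vnet", "vpn_gateway"))
```

In this model the second branch has no path to the VNet before the change, and afterwards it has one that never touches the VPN Gateway node, which is the architectural point of alternative A.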

Alternative B is incorrect because Global Reach doesn't restart the VPN Gateway. Alternative C is wrong; VpnGw2 supports multiple circuits and the described limitation doesn't exist. Alternative D is a plausible technical distractor, but shorter AS Path via BGP doesn't explain VPN Gateway bypass; the problem is architectural, not routing preference. The most dangerous distractor is D, as it would lead the engineer to investigate BGP tables when the cause is the topology itself introduced by Global Reach.


Answer Key: Scenario 4

Answer: A

The correct sequence is 4 → 2 → 1 → 5 → 3, which follows progressive diagnostic logic from simplest to most specific.

Step 4 validates the foundation: if ExpressRoute and BGP aren't healthy, nothing else works and the other steps are irrelevant. Step 2 checks tunnel state and, importantly, byte counters; zeros in both directions confirm the problem is forwarding, not tunnel establishment. Step 1 investigates on-premises routing to determine if traffic is even being directed to the tunnel. Step 5 empirically confirms where traffic stops, corroborating or refuting the routing hypothesis. Step 3 is left for last because IKE/IPsec parameter analysis is the deepest and most time-consuming investigation; it's only necessary if previous steps don't reveal the cause.

Alternative B starts with the tunnel before validating the underlying infrastructure, which can lead to false conclusions. Alternative C starts with routing before confirming that the tunnel exists, inverting the logical order. Alternative D is closest to correct but swaps steps 1 and 2: it investigates routing before confirming tunnel state and byte counters, so the engineer examines routes without yet knowing whether any traffic is entering the tunnel.
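The ordering argument can be checked mechanically. The sketch below encodes the precedence rules stated in this answer key as (earlier, later) pairs and lists which rules each alternative breaks; the `CONSTRAINTS` encoding and the `violations` helper are one illustrative reading of the explanation, not part of the lab.

```python
# (earlier, later) pairs read from the explanation:
# 4 before 2 (validate the foundation first), 2 before 1 (counters before routing),
# 1 before 5 (traceroute tests the routing hypothesis), 5 before 3 (IKE/IPsec analysis last).
CONSTRAINTS = [(4, 2), (2, 1), (1, 5), (5, 3)]

ALTERNATIVES = {
    "A": [4, 2, 1, 5, 3],
    "B": [2, 4, 3, 1, 5],
    "C": [1, 3, 4, 2, 5],
    "D": [4, 1, 2, 3, 5],
}

def violations(order):
    """Return the precedence pairs that the given step order breaks."""
    position = {step: index for index, step in enumerate(order)}
    return [(a, b) for a, b in CONSTRAINTS if position[a] > position[b]]

for label, order in ALTERNATIVES.items():
    print(label, violations(order))
```

Only alternative A produces an empty list; each of the others breaks at least the rule that step 3 comes last.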


Troubleshooting Tree: Configure encryption over ExpressRoute​


Color Legend:

Color       Node Type
Dark Blue   Initial symptom (entry point)
Blue        Diagnostic question (binary decision)
Red         Identified cause
Green       Recommended action or resolution
Orange      Intermediate validation or verification

To use this tree when facing a real problem, start at the root node, which describes the general symptom, and follow the branches by answering each question based on what you can observe in the environment. Each answer eliminates a set of hypotheses and directs you to the next check. When you reach a red node, the cause is identified; apply the corrective action, then return to the root node to confirm resolution.