Troubleshooting Lab: Create and configure an ExpressRoute gateway

Diagnostic Scenarios​

Scenario 1: Root Cause

An operations team reports that provisioning an ExpressRoute gateway failed after running for 45 minutes. The target environment is a newly created production VNet in the East US 2 region. According to the team, the subscription has available quota, the chosen SKU was ErGw2AZ, and the operation was executed by a user with the Network Contributor role on the VNet.

The portal output at the time of failure was:

Deployment failed
Resource: Microsoft.Network/virtualNetworkGateways
Error code: GatewaySubnetNotFound
Message: The GatewaySubnet was not found in the virtual network
'vnet-prod-eastus2'. A subnet named 'GatewaySubnet' is
required to deploy a virtual network gateway.

The team verifies that there is a subnet called gateway-subnet with a /27 prefix and that it has no Network Security Group associated. The security team confirms that no blocking policy was applied to this subnet.

What is the root cause of the provisioning failure?

A) The /27 prefix is insufficient for the ErGw2AZ SKU, which requires at least /26
B) The subnet name is incorrect; Azure requires the exact name GatewaySubnet
C) The Network Contributor role does not have permission to provision virtual network gateways
D) Zonal SKUs like ErGw2AZ require the subnet to be associated with a specific availability zone before provisioning


Scenario 2: Action Decision

The root cause has been identified: the ExpressRoute gateway of a production VNet is configured with the Standard SKU, which does not support ExpressRoute FastPath. The architecture team approved the migration to the UltraPerformance SKU to enable this feature. The gateway currently has two active connections to ExpressRoute circuits used by critical workloads with a 99.9% SLA.

Current restrictions are:

  • The scheduled maintenance window starts in 6 hours
  • There is no approval for production impact outside the window
  • Gateway SKU upgrade in Azure causes temporary interruption of connections

What is the correct action to take at this moment?

A) Start the SKU upgrade immediately to take advantage of available time before the window and complete before peak usage
B) Wait for the approved maintenance window and execute the SKU upgrade within the authorized period
C) Create a new gateway with UltraPerformance SKU in parallel and migrate connections now, avoiding in-place upgrade
D) Remove active connections, upgrade SKU immediately and recreate connections before the maintenance window


Scenario 3: Root Cause

A network engineer is investigating why a VNet cannot receive routes advertised by a newly associated ExpressRoute circuit. The gateway is already in Succeeded state and the connection between the gateway and circuit also appears as Succeeded in the portal. The circuit was confirmed by the provider as provisioned and active.

The engineer executes the following command to check learned routes:

az network vnet-gateway list-learned-routes \
--name ergw-prod \
--resource-group rg-network \
--output table

The output returns an empty table, with no routes listed. The engineer verifies that the gateway SKU is ErGw1AZ, that the GatewaySubnet has a /27 prefix, and that there is no Route Table associated with the GatewaySubnet. The provider team confirms that the circuit's private peering is configured, but that the BGP session has not yet been established with the Azure side.

What is the root cause of the absence of learned routes?

A) The ErGw1AZ SKU does not support route learning via BGP and must be upgraded to ErGw2AZ
B) The absence of a Route Table on the GatewaySubnet prevents the gateway from propagating received routes to the VNet
C) The BGP session on the circuit's private peering has not yet been established, so no routes have been exchanged with the gateway
D) The list-learned-routes command requires the gateway to be in active-active mode to return results


Scenario 4: Diagnostic Sequence

An operator receives the following report: "After a network reconfiguration performed last night, VMs in a production VNet stopped accessing on-premises resources via ExpressRoute. The gateway is in Succeeded state and the circuit is Enabled."

The following investigation steps are available, out of order:

  1. Verify if the private peering BGP session is in Connected state in the circuit portal
  2. Confirm if the connection between gateway and circuit is in Succeeded state
  3. Execute az network vnet-gateway list-learned-routes to verify if the gateway is receiving routes from on-premises
  4. Verify if learned routes are being propagated to the VNet subnet route tables
  5. Test connectivity from a specific VM using Test-NetConnection to the on-premises destination IP

Which sequence represents the correct order of progressive diagnosis?

A) 5, 1, 2, 3, 4
B) 2, 1, 3, 4, 5
C) 1, 3, 2, 4, 5
D) 3, 2, 1, 5, 4


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: B

The error message is direct: GatewaySubnetNotFound. Azure requires that the subnet used by any virtual network gateway, including the ExpressRoute gateway, have the exact name GatewaySubnet, with this capitalization. The name gateway-subnet is not recognized by the platform as the reserved subnet for this purpose, regardless of prefix, NSG, or permissions.

The confirming clue is in the error message itself, which explicitly names what was not found.

The information about the absence of an NSG and the available quota is irrelevant to this diagnosis and was included deliberately to steer the reader in the wrong direction. The /27 prefix is adequate for this deployment, so alternative A is a distractor based on a misapplied requirement. Alternative C is incorrect because Network Contributor has permission to create gateways. Alternative D does not correspond to any real behavior of the Azure platform.

The most dangerous distractor is C: attributing the error to insufficient permissions can lead to unnecessary escalation while the real problem remains unsolved.
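The name check described above can be run as a pre-flight step before a long deployment. The following is a minimal sketch, not an Azure API: `check_gateway_subnet` is a hypothetical helper, the input is assumed to be a plain mapping of subnet names to CIDR prefixes, and the /27 threshold is taken from this scenario rather than from a platform limit.

```python
import ipaddress

# Exact, case-sensitive name, following the error message and explanation above.
REQUIRED_NAME = "GatewaySubnet"


def check_gateway_subnet(subnets):
    """Return a list of findings for a mapping {subnet_name: cidr_prefix}.

    An empty list means the pre-flight check found nothing to report.
    """
    findings = []
    if REQUIRED_NAME in subnets:
        prefix_len = ipaddress.ip_network(subnets[REQUIRED_NAME]).prefixlen
        # Assumption for this sketch: /27 or larger, as in the scenario.
        if prefix_len > 27:
            findings.append(
                f"{REQUIRED_NAME} is /{prefix_len}; this sketch expects /27 or larger"
            )
        return findings

    # Look for names that differ only by case, hyphens, or underscores.
    def normalize(name):
        return name.lower().replace("-", "").replace("_", "")

    near = [n for n in subnets if normalize(n) == normalize(REQUIRED_NAME)]
    if near:
        findings.append(
            f"no subnet named {REQUIRED_NAME!r}; found near miss {near[0]!r}"
        )
    else:
        findings.append(f"no subnet named {REQUIRED_NAME!r}")
    return findings


# The subnet from Scenario 1 (the address range is illustrative).
print(check_gateway_subnet({"gateway-subnet": "10.0.255.0/27"}))
```

Run against the scenario's VNet, the helper reports the near miss gateway-subnet instead of leaving the team to discover it after the deployment fails.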


Answer Key: Scenario 2

Answer: B

The critical restriction of the scenario is that there is no approval for production impact outside the maintenance window. Upgrading the SKU of an ExpressRoute gateway causes temporary interruption of active connections, which constitutes direct impact on workloads with 99.9% SLA. Executing the operation before the authorized window violates the approved process and exposes the environment to risk without authorization.

Alternative A ignores the approval restriction, which is a process restriction, not just technical. Alternative C is technically valid as a zero-downtime migration strategy, but the question does not state that this approach was approved or planned, and creating a new gateway in parallel can also impact the environment without prior authorization. Alternative D combines removing active connections with immediate upgrade, which represents the highest possible risk to the SLA.

The discipline of not acting before the approved window is the core skill tested in this scenario.
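The decision rule in this scenario can be written as a small gate that a change script evaluates before touching the gateway. This is a sketch under stated assumptions: `may_start_change` is a hypothetical helper, and the two-hour window length is invented for illustration because the scenario only gives the start time.

```python
from datetime import datetime, timedelta, timezone


def may_start_change(now, window_start, window_end, disruptive,
                     approved_outside_window=False):
    """Return True if the change may start at `now`.

    Non-disruptive changes are always allowed in this sketch; disruptive
    ones need either explicit approval or an open maintenance window.
    """
    if not disruptive or approved_outside_window:
        return True
    return window_start <= now < window_end


# Scenario 2: the window opens in 6 hours (the 2-hour duration is assumed).
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window_start = now + timedelta(hours=6)
window_end = window_start + timedelta(hours=2)

print(may_start_change(now, window_start, window_end, disruptive=True))
print(may_start_change(window_start, window_start, window_end, disruptive=True))
```

The first call models alternatives A and D (acting now) and is rejected; the second models alternative B and is allowed.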


Answer Key: Scenario 3

Answer: C

The Succeeded state of the connection in the portal indicates that the connection resource was successfully created in the Azure control plane, but does not guarantee that the BGP session has been established in the data plane. Without an active BGP session on private peering, no routes are exchanged between the on-premises side and the gateway, which explains the empty table returned by the command.

The confirming clue is in the provider's statement: "the BGP session has not yet been established with the Azure side." This information directly explains the observed symptom.

The absence of a Route Table on the GatewaySubnet is irrelevant to BGP session establishment and does not prevent route learning. The ErGw1AZ SKU supports BGP normally. The list-learned-routes command works regardless of active-active mode. These three points were included as plausible but irrelevant information, to test whether the reader can filter out noise.

The most dangerous distractor is B: focusing on the absence of a Route Table can lead the engineer to create unnecessary configurations without solving the real problem, which is the BGP session on the circuit's private peering.
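The distinction between a Succeeded connection resource and an established BGP session can be captured in a short triage function. This is an illustrative sketch: `explain_empty_routes` and its inputs are hypothetical, and the values would have to be collected separately from the portal, the CLI, or the provider.

```python
def explain_empty_routes(connection_state, bgp_established, learned_routes):
    """Suggest the most likely reason for an empty learned-routes table."""
    if learned_routes:
        return "routes are being learned"
    if connection_state != "Succeeded":
        return "connection resource is not provisioned"
    if not bgp_established:
        # Scenario 3: the control plane looks healthy, but no BGP session exists.
        return "BGP session on private peering is not established"
    # Fallback assumed for this sketch, not covered by the scenario.
    return "BGP is up but no prefixes were received"


# Values reported in Scenario 3.
print(explain_empty_routes("Succeeded", bgp_established=False, learned_routes=[]))
```

The order of the checks matters: the connection state is examined first only to rule it out, and the BGP state is what explains the empty table.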


Answer Key: Scenario 4

Answer: B

The correct progressive diagnostic sequence is: 2, 1, 3, 4, 5.

The reasoning moves from the component closest to the gateway toward the final destination:

| Step | Action | Logic |
| --- | --- | --- |
| 1 | Check connection state (Succeeded) | Confirms if gateway-circuit link exists in control plane |
| 2 | Check private peering BGP session | Confirms if data plane is operational |
| 3 | List routes learned by gateway | Confirms if on-premises routes are reaching Azure |
| 4 | Check propagation to subnet route tables | Confirms if routes reach VMs |
| 5 | Test connectivity from specific VM | Validates real end-to-end behavior |

Starting with VM connectivity testing (alternative A) is the most common error: testing the symptom without any intermediate diagnosis does not indicate where the failure is. Alternative C reverses connection and BGP verification, which prevents quickly isolating whether the problem is in the control plane or data plane. Alternative D starts with learned routes without first confirming that previous components are operational.
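The progressive sequence can be expressed as an ordered list of checks where the diagnosis stops at the first one that fails. The sketch below assumes the result of each check has already been gathered manually; `first_failing_step` and the step labels are hypothetical names, not part of any Azure tool.

```python
# Checks in the order of the correct answer (2, 1, 3, 4, 5).
SEQUENCE = [
    "connection state is Succeeded",
    "private peering BGP session is established",
    "gateway has learned on-premises routes",
    "routes are propagated to subnet route tables",
    "VM reaches the on-premises IP",
]


def first_failing_step(results):
    """Return (position, label) of the first failed check, or None.

    `results` maps a label from SEQUENCE to True/False. A missing result
    is treated as a failure so that no level is skipped silently.
    """
    for position, label in enumerate(SEQUENCE, start=1):
        if not results.get(label, False):
            return position, label
    return None


# Example: the connection is fine, but the BGP session is down.
observed = {
    "connection state is Succeeded": True,
    "private peering BGP session is established": False,
}
print(first_failing_step(observed))
```

Because later checks depend on earlier ones, the function reports only the earliest failure; anything after it is not yet meaningful to test.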


Troubleshooting Tree: Create and configure an ExpressRoute gateway​

Color Legend:

  • Dark blue: initial symptom or diagnostic entry point
  • Medium blue: objective diagnostic question with yes/no branching
  • Red: identified cause requiring external investigation or deeper analysis
  • Green: recommended action or resolution state
  • Orange: validation or intermediate verification node

To use this tree when facing a real problem, start with the root node representing the observed symptom and answer each diagnostic question based on what is directly verifiable in the portal or via CLI. Follow the path corresponding to the answer until reaching an identified cause or recommended action node. Do not skip levels: the order of questions reflects the dependency between ExpressRoute gateway components, from control plane to data plane.
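The traversal rule described above (answer each question, follow the matching branch, never skip a level) can be sketched as a walk over a small dictionary. The tree itself is not reproduced in the text, so the nodes below are illustrative and drawn only from the four scenarios; `TREE` and `walk` are hypothetical names.

```python
# Question nodes: node_id -> (question, next_if_yes, next_if_no).
# Any value that is not a key of TREE is a leaf (cause, action, or validation).
TREE = {
    "start": ("Does a subnet named exactly 'GatewaySubnet' exist?",
              "connection", "Action: create a subnet named GatewaySubnet"),
    "connection": ("Is the gateway-to-circuit connection in Succeeded state?",
                   "bgp", "Cause: connection resource not provisioned"),
    "bgp": ("Is the private peering BGP session established?",
            "routes", "Cause: BGP session down; involve the provider"),
    "routes": ("Does the gateway list learned routes?",
               "Validation: test connectivity from a VM",
               "Cause: no prefixes received from on-premises"),
}


def walk(tree, answers, node="start"):
    """Follow yes/no answers from `node` until a leaf is reached.

    `answers` maps node_id to True (yes) or False (no). Returns the leaf
    and the list of (question, answer) pairs visited on the way.
    """
    path = []
    while node in tree:
        question, yes_node, no_node = tree[node]
        answer = answers[node]
        path.append((question, answer))
        node = yes_node if answer else no_node
    return node, path


leaf, path = walk(TREE, {"start": True, "connection": True, "bgp": False})
print(leaf)
```

A missing answer raises a KeyError rather than defaulting, which mirrors the instruction not to skip levels.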