Troubleshooting Lab: Deploy a gateway into a virtual hub

Diagnostic Scenarios

Scenario 1: Root Cause

A network team successfully deployed a site-to-site VPN gateway in a virtual hub in the East US region. The hub belongs to a Standard-type Virtual WAN created six months ago. Three spokes are connected to the hub, and all of them can reach one another.

After the gateway deployment, the team attempted to create a connection to a corporate branch. The connection was created in the portal but remains in a Failed state. The branch uses a Microsoft-certified VPN device, and the device's public IP address was provided correctly.

The team checks the gateway logs and observes the following:

IKE diagnostic log:
[2026-03-20 14:32:11] Initiating IKEv2 SA negotiation with 203.0.113.45
[2026-03-20 14:32:11] Sending IKE_SA_INIT request
[2026-03-20 14:32:41] Retransmit 1 of IKE_SA_INIT request
[2026-03-20 14:33:11] Retransmit 2 of IKE_SA_INIT request
[2026-03-20 14:33:41] ERROR: No response received from peer. IKE negotiation timed out.

The team also confirms that BGP is enabled on the connection and that the pre-shared key was configured on both sides. The Virtual WAN has a custom routing policy applied to the hub, redirecting management traffic to an NVA.

What is the most likely root cause of the failure?

A) The hub's custom routing policy is intercepting IKE traffic before it reaches the gateway.

B) The remote VPN device is not responding to IKE requests, indicating UDP 500/4500 traffic blocking on the branch side.

C) BGP enabled on the connection is conflicting with IKEv2 negotiation, as Virtual WAN does not support BGP together with IKEv2.

D) The pre-shared key configured in the portal was not propagated to the gateway because provisioning has not yet completed.


Scenario 2: Action Decision

The platform team identified that an ExpressRoute gateway in a production virtual hub is undersized. The gateway was provisioned with scale unit 1 (1 Gbps), but traffic measurements from the last 30 days show consistent peaks above 900 Mbps, with recorded packet drops.

The environment has the following constraints:

  • The ExpressRoute circuit connects the main corporate headquarters and supports real-time financial operations
  • The approved maintenance window starts in 72 hours
  • A backup path over site-to-site VPN is configured but has not been tested in the last 4 months
  • The team has permission to execute the change immediately if necessary

What is the correct action to take at this moment?

A) Increase the gateway scale unit immediately, as the resizing operation is non-disruptive and can be done outside the maintenance window.

B) Wait for the maintenance window, test the VPN backup circuit before the change, and then increase the scale unit within the approved window.

C) Activate the VPN backup circuit immediately to free load from ExpressRoute and then resize the gateway outside the window, without need for additional testing.

D) Open a Microsoft support case to request emergency capacity increase, as scale unit changes in production gateways require support approval.


Scenario 3: Root Cause

An organization maintains two virtual hubs in the same Standard Virtual WAN: hub-eastus and hub-westus. Each hub has a site-to-site VPN gateway provisioned and multiple spoke VNets attached to it.

A new business requirement demands that branches connected to hub-eastus reach workloads in spokes of hub-westus. The team enabled interhub routing in the Virtual WAN configuration and waited 20 minutes.

After the waiting period, branches from hub-eastus still cannot reach spokes from hub-westus. The team checks the effective routes of a VM in a spoke of hub-westus and finds no routes referring to the subnets of branches from hub-eastus.

The configuration of both gateways is shown below:

Configuration      hub-eastus   hub-westus
SKU                VpnGw1       VpnGw1
Scale unit         1            1
BGP enabled        Yes          No
Branch-to-branch   Enabled      Enabled

The team mentions that hub-westus was created recently and does not yet have any spoke VNet formally connected through the portal, although the VNets have manually configured peering.

What is the root cause of the routing problem?

A) BGP asymmetry between the two gateways prevents interhub route propagation, as both need BGP enabled for transitive routing to work.

B) The VNets in hub-westus were connected via manual peering instead of managed spoke connection by Virtual WAN, preventing the Virtual Hub Router from learning and advertising these routes.

C) Scale unit 1 in both gateways is insufficient to support interhub routing; it's necessary to scale to at least 2 in both hubs.

D) The 20-minute interval was not sufficient; interhub routing in Virtual WAN can take up to 4 hours to propagate routes after activation.


Scenario 4: Diagnostic Sequence

An administrator receives the following report: "The virtual hub's point-to-site VPN gateway stopped accepting new connections from remote clients. Already connected clients continue working normally."

The available investigation steps are:

  1. Check the number of active connections on the gateway and compare it with the maximum limit of the configured scale unit
  2. Confirm whether the root authentication certificate configured on the gateway is valid and not expired
  3. Check whether the scale unit or gateway SKU was changed recently
  4. Check whether new connection requests appear in the gateway diagnostic logs with a specific error
  5. Validate whether the VPN client IP address pool still has available addresses or is exhausted

What is the correct diagnostic sequence for this symptom?

A) 3 → 1 → 5 → 4 → 2

B) 4 → 1 → 5 → 2 → 3

C) 2 → 3 → 4 → 1 → 5

D) 4 → 5 → 1 → 2 → 3


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The IKE diagnostic log is the central clue. The Azure gateway correctly sent the IKE_SA_INIT request and performed retransmissions, but never received a response from the remote peer at address 203.0.113.45. This specific pattern of timeout without response indicates that traffic is not reaching the remote device or is being blocked before returning. UDP ports 500 (IKE) and 4500 (NAT-T) must be open in the branch firewall for IKEv2 negotiation to occur.
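The timeout pattern described above can also be recognized mechanically. The following is a minimal sketch, assuming the log line format shown in Scenario 1; the function name `classify_ike_log` and the labels it returns are hypothetical and not part of any Azure tooling.

```python
import re

# Log excerpt copied from Scenario 1.
IKE_LOG = """\
[2026-03-20 14:32:11] Initiating IKEv2 SA negotiation with 203.0.113.45
[2026-03-20 14:32:11] Sending IKE_SA_INIT request
[2026-03-20 14:32:41] Retransmit 1 of IKE_SA_INIT request
[2026-03-20 14:33:11] Retransmit 2 of IKE_SA_INIT request
[2026-03-20 14:33:41] ERROR: No response received from peer. IKE negotiation timed out.
"""

# "[timestamp] message" is the only structure assumed here.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def classify_ike_log(text: str) -> str:
    """Return a coarse label for an IKE negotiation log (hypothetical helper)."""
    messages = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            messages.append(match.group("msg"))

    sent = any("Sending IKE_SA_INIT request" in msg for msg in messages)
    retransmits = sum(msg.startswith("Retransmit") for msg in messages)
    timed_out = any("No response received from peer" in msg for msg in messages)

    # Request sent, retransmitted, and never answered: the peer did not reply.
    if sent and retransmits > 0 and timed_out:
        return "peer-unreachable"
    if sent and not timed_out:
        return "no-timeout-observed"
    return "unknown"


print(classify_ike_log(IKE_LOG))  # peer-unreachable
```

A "peer-unreachable" result only says that no reply arrived. Whether the cause is a firewall on the branch side or something else on the path still has to be confirmed at the branch.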

The custom routing policy for management traffic is a distractor in this scenario. Routing policies in Virtual WAN affect data traffic after the tunnel is established; they do not interfere with the control plane of IKE negotiation, which occurs before any tunnel exists.

Alternative C reflects a serious technical misconception: Virtual WAN fully supports BGP together with IKEv2. The two operate at different layers, since the BGP session runs inside the tunnel only after IKE has established it. Acting on this alternative would lead the team to disable BGP unnecessarily, breaking dynamic routing once the tunnel finally comes up.


Answer Key: Scenario 2

Answer: B

The set of scenario constraints defines the correct answer by elimination. The scale unit increase operation on an ExpressRoute gateway in a virtual hub is disruptive: it causes temporary connectivity interruption during resizing. Therefore, executing immediately (alternative A) would put financial operations at risk without any validated contingency.

The VPN backup circuit has not been tested in 4 months, which means activating it directly without prior validation (alternative C) would be equally risky. The correct alternative combines the two necessary safeguards: validate the backup before needing it and execute the change within the approved window, which exists precisely to protect the production environment.
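The capacity pressure behind this decision can be quantified from the figures given in the scenario. The sketch below is illustrative only: it takes the 1 Gbps figure for scale unit 1 from the scenario text, treats 900 Mbps as the peak even though the scenario says peaks exceed it, and uses a hypothetical helper name `utilization`.

```python
# Figures taken from Scenario 2. The scenario states peaks "above 900 Mbps",
# so the real headroom is at most what this sketch reports.
PROVISIONED_MBPS = 1000
OBSERVED_PEAK_MBPS = 900


def utilization(peak_mbps: float, capacity_mbps: float) -> float:
    """Fraction of provisioned capacity consumed at peak (hypothetical helper)."""
    if capacity_mbps <= 0:
        raise ValueError("capacity must be positive")
    return peak_mbps / capacity_mbps


peak_ratio = utilization(OBSERVED_PEAK_MBPS, PROVISIONED_MBPS)
headroom_mbps = PROVISIONED_MBPS - OBSERVED_PEAK_MBPS
print(f"peak utilization: {peak_ratio:.0%}, headroom: {headroom_mbps} Mbps")
```

A gateway running at or above 90% of its provisioned throughput explains the packet drops, but it does not by itself justify acting outside the window: the drops are episodic, and the resize is the riskier event.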

Alternative D represents a process misconception: scale unit changes are self-service operations and do not require Microsoft support involvement, except in cases of technical failure.


Answer Key: Scenario 3

Answer: B

The decisive clue is in the observation that the VNets in hub-westus were connected via manual peering instead of managed spoke connection by Virtual WAN. In the Virtual WAN model, the Virtual Hub Router learns and advertises routes only from connections registered as spoke connections within Virtual WAN. VNets with manually configured peering, outside the Virtual WAN spoke model, are invisible to the hub's routing plane and therefore their routes are not propagated interhub or to connected gateways.
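The behavior described above can be pictured with a toy model. This is a sketch of the reasoning only, not of how the hub router is implemented: the class `HubRoutingModel`, the spoke names, and the address prefixes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HubRoutingModel:
    """Toy model of which spoke prefixes a hub advertises (illustrative only)."""
    name: str
    # VNets attached through a Virtual WAN hub connection: spoke name -> prefix.
    hub_connections: Dict[str, str] = field(default_factory=dict)
    # VNets peered manually, outside the Virtual WAN spoke model.
    manual_peerings: Dict[str, str] = field(default_factory=dict)

    def advertised_prefixes(self) -> Set[str]:
        # In this model only registered hub connections feed the routing plane.
        return set(self.hub_connections.values())


westus = HubRoutingModel(
    name="hub-westus",
    manual_peerings={"spoke-a": "10.2.1.0/24", "spoke-b": "10.2.2.0/24"},
)
before = westus.advertised_prefixes()

# Re-attach one spoke as a managed hub connection instead of a manual peering.
westus.hub_connections["spoke-a"] = westus.manual_peerings.pop("spoke-a")
after = westus.advertised_prefixes()

print(before, after)  # set() {'10.2.1.0/24'}
```

With only manual peerings the model advertises nothing, which matches the symptom in the scenario: no routes are exchanged for the hub-westus spokes.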

The BGP asymmetry described in alternative A is a true but irrelevant detail for this specific problem: interhub routing in Virtual WAN uses the Virtual Hub Router as the primary mechanism, regardless of BGP state in branch gateways. The absence of BGP in a gateway affects branch route propagation, but does not prevent routing between spokes of different hubs when spoke connections are correctly registered.

The most dangerous distractor is alternative D: waiting longer without investigating the real root cause would delay resolution indefinitely, as the problem is not convergence, but connection architecture.


Answer Key: Scenario 4

Answer: B

The central symptom is objective: new connections fail, but existing connections remain active. This behavior points to a capacity limit or exhausted resource, not to an authentication or certificate failure, which would affect all connections indiscriminately.

The correct sequence starts at step 4 (check the logs for the specific error) because the diagnostic log often states the refusal reason directly, which avoids unnecessary investigation. With the error in hand, proceed to step 1 (check whether the scale unit's concurrent connection limit has been reached) and step 5 (check whether the client IP address pool is exhausted), the two most common causes of this exact symptom. Step 2 (check the root certificate) comes later because, if the certificate had expired, no connection would work. Step 3 (check recent changes) provides historical context and is useful at the end to correlate the cause with an event.
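Steps 1 and 5 both compare a current count against a ceiling, so they can be sketched as one check. The example below is a minimal illustration under stated assumptions: the function names, the connection limit of 500, and the pool 10.10.0.0/24 are hypothetical values, not figures from the scenario or from any Azure scale unit table. It also treats every address in the CIDR block as assignable, which is an upper bound.

```python
import ipaddress
from typing import List


def pool_addresses(cidr: str) -> int:
    """Total addresses in the client pool block (upper bound on assignable ones)."""
    return ipaddress.ip_network(cidr).num_addresses


def capacity_findings(active: int, max_connections: int, pool_cidr: str) -> List[str]:
    """Return the capacity ceilings that the current connection count has hit."""
    findings = []
    if active >= max_connections:      # step 1: scale unit connection limit
        findings.append("connection limit reached")
    if active >= pool_addresses(pool_cidr):  # step 5: client address pool
        findings.append("address pool exhausted")
    return findings


# Hypothetical readings: 256 active clients, a limit of 500, and a /24 pool.
print(capacity_findings(256, 500, "10.10.0.0/24"))  # ['address pool exhausted']
```

In both cases existing sessions keep their slot or address, which is why already connected clients are unaffected while new ones are refused.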

Sequence C makes the classic mistake of starting with the hypothesis most familiar to the investigator (check certificate) instead of the most objective data available (the error log).


Troubleshooting Tree: Deploy a gateway into a virtual hub

Color Legend:

Color         Node type
Dark blue     Initial symptom (tree root)
Medium blue   Diagnostic question
Red           Identified cause
Green         Recommended action or resolution
Orange        Intermediate validation or verification

To use this tree when facing a real problem, start with the root node describing the observed symptom and answer each diagnostic question based on what can be verified directly in the portal, logs, or via CLI. Follow the path indicated by the answer until reaching a cause node (red) or action node (green). Never skip an intermediate validation node (orange), as they protect against premature diagnoses that lead to corrective actions applied to the wrong problem.
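The traversal rule above can be expressed as a small walk over a tree of typed nodes. The sketch below is illustrative: the `Node` class, the `walk` function, and the sample tree (loosely based on Scenario 1) are hypothetical and do not reproduce the actual tree for this lab.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    kind: str  # "symptom", "question", "validation", "cause", or "action"
    text: str
    branches: Dict[str, "Node"] = field(default_factory=dict)  # answer -> next node


def walk(root: Node, answers: List[str]) -> List[Node]:
    """Follow the answers from the root until a node without branches is reached."""
    path = [root]
    node = root
    remaining = list(answers)
    while node.branches:
        if not remaining:
            raise ValueError(f"no answer supplied for: {node.text}")
        answer = remaining.pop(0)
        if answer not in node.branches:
            raise ValueError(f"unexpected answer {answer!r} for: {node.text}")
        node = node.branches[answer]
        path.append(node)
    return path


# Hypothetical sample tree.
cause = Node("cause", "Branch firewall blocks UDP 500/4500")
validation = Node(
    "validation",
    "Is UDP 500/4500 open on the branch firewall?",
    {"no": cause, "yes": Node("action", "Capture traffic on the branch device")},
)
question = Node(
    "question",
    "Does the IKE log show a response from the peer?",
    {"no": validation, "yes": Node("action", "Check the pre-shared key")},
)
root = Node("symptom", "Branch connection stays in Failed state", {"start": question})

for step in walk(root, ["start", "no", "no"]):
    print(f"{step.kind}: {step.text}")
```

Because each step can only move to a child of the current node, the walk cannot jump past a validation node, which mirrors the rule that intermediate validations must never be skipped.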