Troubleshooting Lab: Diagnose and Resolve Virtual Network Gateway Connectivity Issues

Diagnostic Scenarios​

Scenario 1: Root Cause

A company has operated a Site-to-Site VPN connection between the on-premises network and an Azure VNet for six months without issues. After scheduled maintenance on the on-premises firewall last Friday, the connection stopped working. The network team confirmed that no changes were made on the Azure side. The VPN Gateway uses SKU VpnGw2 and is in Active-Passive mode.

The Network Watcher VPN Diagnostics output returned the following:

Initiating IKEv2 Main Mode...
Sending SA proposal to peer: 52.x.x.x
No response received from peer after 3 retries
IKE negotiation failed: timeout waiting for peer response

Additional information collected by the team:

  • The public IP address configured on the Local Network Gateway in Azure is 198.51.100.10
  • The on-premises firewall was updated to a newer firmware version
  • The Azure portal TLS certificate was automatically renewed this week
  • The public IP assigned to the on-premises gateway remains 198.51.100.10
  • UDP 500 and UDP 4500 are allowed in the gateway subnet NSG

What is the root cause of the observed failure?

A) The VpnGw2 SKU does not support IKEv2 after firmware updates on third-party devices
B) The firewall firmware update changed the IKE Phase 1 policy, making it incompatible with the Azure gateway configuration
C) The automatically renewed TLS certificate invalidated the VPN tunnel authentication credentials
D) The gateway subnet NSG started blocking UDP 4500 after maintenance, preventing NAT-T


Scenario 2: Action Decision

The operations team identified that the cause of a failure in an ExpressRoute connection is the expiration of the authentication certificate configured in private peering. The environment has the following constraints:

  • The ExpressRoute circuit is used by production payment systems with a 99.95% SLA
  • There is a Site-to-Site VPN backup connection already configured and tested, currently in standby
  • The approved maintenance window for this circuit is Saturdays between 02:00 and 04:00
  • It is 15:00 on a Thursday
  • Certificate renewal requires between 20 and 40 minutes and causes complete peering interruption

What is the correct action to take at this moment?

A) Immediately start certificate renewal, since the cause has been identified and each minute represents a risk of total failure
B) Manually activate failover to the backup VPN connection and execute certificate renewal within the approved maintenance window
C) Open a ticket with the connectivity provider to have the renewal done without impact on the Azure side
D) Monitor the circuit until the maintenance window without any preventive action, since the certificate is still active


Scenario 3: Root Cause

An organization expanded its topology by adding a new Spoke VNet (10.2.0.0/16) connected via peering to an existing Hub VNet (10.0.0.0/16). The Hub has a VPN Gateway connected to the on-premises network (192.168.0.0/16). The peering configurations were applied according to the table below:

Peering        | Allow Gateway Transit | Use Remote Gateways | Allow Forwarded Traffic
Hub to Spoke   | Enabled               | Disabled            | Enabled
Spoke to Hub   | Disabled              | Disabled            | Enabled

After configuration, the team reported that machines on the on-premises network can reach resources in the Hub VNet normally, but cannot reach any VM in the Spoke VNet. Machines in the Spoke can communicate with the Hub without problems. The team verified that the NSGs of all Spoke subnets allow inbound traffic from 192.168.0.0/16.

The Azure gateway was queried with the command below, and the prefix 10.2.0.0/16 does not appear in the gateway's learned routes:

Get-AzVirtualNetworkGatewayLearnedRoute `
-VirtualNetworkGatewayName "gw-hub" `
-ResourceGroupName "rg-network"

What is the root cause of the problem?

A) The Hub VNet NSG is blocking return traffic from 10.2.0.0/16 to 192.168.0.0/16
B) The "Use Remote Gateways" flag in the Spoke to Hub peering is enabled, creating a routing conflict when BGP is not active
C) The "Allow Gateway Transit" flag in the Hub to Spoke peering is enabled, but the "Use Remote Gateways" flag in the Spoke to Hub peering is disabled, preventing the gateway from learning and advertising the Spoke prefix
D) The prefix 10.2.0.0/16 overlaps internal Hub VNet routes, causing silent packet dropping


Scenario 4: Diagnostic Sequence

An engineer receives the following report: "The Site-to-Site VPN connection was recreated yesterday after a subscription migration. Now the status in the portal shows as 'Connected', but no traffic flows between networks."

The following investigation steps are available, out of order:

  1. Verify if the PSK configured on the Azure connection resource matches the one configured on the on-premises device
  2. Execute Network Watcher VPN Diagnostics to capture IKE negotiation logs
  3. Confirm if the address prefixes in the Local Network Gateway exactly match the on-premises network prefixes
  4. Check if there is prefix overlap between the VNet address space and the prefixes declared in the Local Network Gateway
  5. Analyze the effective route tables on destination VM NICs to confirm if the route to 192.168.x.x is present

What is the correct investigation sequence, from broadest to most specific?

A) 2, 1, 3, 4, 5
B) 4, 3, 1, 2, 5
C) 1, 2, 3, 5, 4
D) 2, 4, 3, 1, 5


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: B

Explanation:

  • The log clearly shows that IKE negotiation fails due to timeout: the Azure gateway sends proposals and receives no response from the peer. This indicates that the on-premises device is ignoring or rejecting the received proposals, which is typical behavior of an IKE Phase 1 policy incompatibility. After a firmware update, devices frequently reset or change default encryption, integrity, and DH group algorithms, making proposals sent by Azure unacceptable.
  • The information about the Azure portal TLS certificate is deliberately irrelevant: portal TLS certificates have no relationship with VPN tunnel authentication, which uses PSK or separate tunnel certificates. Including this data simulates real diagnostic pressure where recent but causally unrelated information diverts attention.
  • Alternative A is incorrect because the SKU does not impose restrictions based on third-party firmware. Alternative D is ruled out by the statement itself, which confirms that UDP 500 and 4500 are allowed and that the NSG was not changed on the Azure side. Alternative C confuses management TLS certificates with VPN authentication credentials.
  • Acting based on alternative D would lead the team to unnecessarily review NSGs, consuming time while the IKE Phase 1 incompatibility remains undiagnosed.
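
The Phase 1 incompatibility described above can be illustrated with a minimal sketch that compares the parameters each side expects. The `IkePhase1Policy` class, the `phase1_mismatches` helper, and the algorithm values are all hypothetical; they are not the actual Azure default proposal set or the firewall's real configuration. The sketch also simplifies a real negotiation, in which a responder may accept any one of several offered proposals.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IkePhase1Policy:
    """One IKE Phase 1 proposal: the parameters both peers must agree on."""
    encryption: str
    integrity: str
    dh_group: str


def phase1_mismatches(azure: IkePhase1Policy, on_prem: IkePhase1Policy) -> dict:
    """Return {parameter: (azure_value, on_prem_value)} for every parameter that differs."""
    return {
        f.name: (getattr(azure, f.name), getattr(on_prem, f.name))
        for f in fields(azure)
        if getattr(azure, f.name) != getattr(on_prem, f.name)
    }


# Hypothetical values: what Azure proposes vs. what the firewall accepts after the update.
azure_policy = IkePhase1Policy(encryption="AES256", integrity="SHA256", dh_group="DHGroup2")
firewall_policy = IkePhase1Policy(encryption="AES256", integrity="SHA256", dh_group="DHGroup14")

print(phase1_mismatches(azure_policy, firewall_policy))
```

With these illustrative values, only the DH group differs. A single differing parameter is enough for the peer to reject or ignore the proposal, which the initiator then observes as a timeout.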

Answer Key: Scenario 2

Answer: B

Explanation:

  • The cause is identified and the necessary action is clear, but the constraint context determines the correct path. It is 15:00 on a Thursday: executing renewal immediately (alternative A) would violate the payment system SLA and the approved maintenance window, as the operation causes complete interruption of 20 to 40 minutes in production.
  • The correct action is to activate failover to the backup VPN now, protecting operational continuity, and execute renewal within the approved window on Saturday. This simultaneously respects the SLA, the change management process, and operational security.
  • Alternative C is incorrect because private peering certificate expiration on the Azure side is the Azure team's responsibility, not the connectivity provider's. Alternative D represents the most dangerous error: waiting without preventive action with an expired or about-to-expire certificate on a critical circuit is accepting risk of total failure without control.
  • The most tempting distractor is alternative A, because it carries apparent technical urgency. Correct reasoning requires recognizing that "cause identified" does not mean "act immediately without considering constraints."
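
The reasoning above can be sketched as a small decision helper. This is an illustrative sketch only: the function names are hypothetical, the date is an arbitrary Thursday, and the downtime budget assumes the 99.95% SLA is measured over a 30-day period, which the scenario does not state.

```python
from datetime import datetime, timedelta

WINDOW_WEEKDAY = 5          # Saturday (Monday is 0)
WINDOW_START_HOUR = 2       # 02:00
WINDOW_END_HOUR = 4         # 04:00


def in_maintenance_window(now: datetime) -> bool:
    """True when `now` falls inside the approved Saturday 02:00-04:00 window."""
    return now.weekday() == WINDOW_WEEKDAY and WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR


def next_window_start(now: datetime) -> datetime:
    """Start of the first maintenance window that begins at or after `now`."""
    days_ahead = (WINDOW_WEEKDAY - now.weekday()) % 7
    candidate = now.replace(hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=days_ahead)
    if candidate < now:
        candidate += timedelta(days=7)
    return candidate


def downtime_budget_minutes(sla: float, days: int = 30) -> float:
    """Minutes of downtime an availability target allows over `days` days (assumed period)."""
    return (1 - sla) * days * 24 * 60


def decide(now: datetime) -> str:
    """Renew only inside the window; otherwise fail over and schedule the renewal."""
    if in_maintenance_window(now):
        return "renew certificate now"
    return f"fail over to backup VPN; renew at {next_window_start(now):%A %H:%M}"


thursday = datetime(2024, 5, 16, 15, 0)   # an arbitrary Thursday at 15:00
print(decide(thursday))
print(round(downtime_budget_minutes(0.9995), 1))
```

Under the 30-day assumption, a 99.95% target allows about 21.6 minutes of downtime, so a renewal lasting 20 to 40 minutes would consume nearly all of that budget or exceed it if performed on the live circuit without failover.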

Answer Key: Scenario 3

Answer: C

Explanation:

  • The configuration table reveals the problem: in the Spoke to Hub peering, the "Use Remote Gateways" flag is disabled. This flag is the mechanism by which the Spoke instructs Azure to use the Hub gateway to route external traffic. Without it, the Spoke does not delegate its routing to the gateway, so the gateway never learns the prefix 10.2.0.0/16 as a reachable destination and therefore never advertises it to the on-premises network, whether via BGP or static routes.
  • The result of Get-AzVirtualNetworkGatewayLearnedRoute directly confirms this hypothesis: the Spoke prefix simply does not exist in the gateway's learned route table.
  • The information about Spoke NSGs allowing 192.168.0.0/16 is deliberately irrelevant to this diagnosis: the problem is in the control plane (routing), not the security plane. Traffic never reaches the Spoke to be filtered by the NSG.
  • Alternative B represents a common misconception: "Use Remote Gateways" enabled on the Spoke does not create conflict when "Allow Gateway Transit" is active on the Hub; this is exactly the correct and necessary combination. Alternative D is ruled out because 10.2.0.0/16 and 10.0.0.0/16 are distinct prefixes and do not overlap.
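
The flag pairing described above can be expressed as a minimal check. The `PeeringFlags` class and both helper functions are hypothetical names used for illustration; the flag values model the state the explanation describes, not output read from a real subscription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeeringFlags:
    allow_gateway_transit: bool
    use_remote_gateways: bool
    allow_forwarded_traffic: bool


def spoke_uses_hub_gateway(hub_to_spoke: PeeringFlags, spoke_to_hub: PeeringFlags) -> bool:
    """Gateway transit takes effect only when both halves of the pair are set."""
    return hub_to_spoke.allow_gateway_transit and spoke_to_hub.use_remote_gateways


def diagnose(hub_to_spoke: PeeringFlags, spoke_to_hub: PeeringFlags) -> list:
    """List the flag changes needed for the Spoke to route through the Hub gateway."""
    findings = []
    if not hub_to_spoke.allow_gateway_transit:
        findings.append("Hub to Spoke: enable Allow Gateway Transit")
    if not spoke_to_hub.use_remote_gateways:
        findings.append("Spoke to Hub: enable Use Remote Gateways")
    return findings


# State described in the explanation: transit allowed on the Hub, not consumed by the Spoke.
hub_to_spoke = PeeringFlags(allow_gateway_transit=True, use_remote_gateways=False, allow_forwarded_traffic=True)
spoke_to_hub = PeeringFlags(allow_gateway_transit=False, use_remote_gateways=False, allow_forwarded_traffic=True)

print(spoke_uses_hub_gateway(hub_to_spoke, spoke_to_hub))
print(diagnose(hub_to_spoke, spoke_to_hub))
```

The check treats the two flags as a pair: enabling only the Hub side has no effect until the Spoke side opts in.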

Answer Key: Scenario 4

Answer: D

Explanation:

  • The correct sequence is: 2, 4, 3, 1, 5.
  • The starting point should always be the comprehensive diagnostic tool (step 2: Network Watcher VPN Diagnostics), as IKE logs reveal in which phase negotiation fails and direct the next steps, avoiding blind verifications.
  • Next, checking prefix overlap (step 4) is critical because it is a cause of silent dropping that does not generate IKE errors: the tunnel can be "Connected" while Azure drops packets whose destination conflicts with the VNet address space.
  • Step 3 (Local Network Gateway prefixes) comes next because incorrect or incomplete prefixes explain why traffic to certain destinations does not flow even with an active tunnel.
  • Step 1 (PSK) comes after that: a PSK mismatch would prevent the tunnel from being established at all, and since the portal shows "Connected," the PSK is probably correct, though it should still be confirmed.
  • Step 5 (effective routes on NICs) is last as it is the most granular: it only makes sense to investigate a specific VM's route table after confirming that the gateway is receiving and advertising prefixes correctly.
  • Alternative A makes the error of verifying the PSK before checking for overlapping or incorrect prefixes, skipping causes that are more likely given the symptom. Alternative B starts with the overlap check without first consulting the diagnostic logs, which could eliminate entire hypotheses before any manual verification.
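
The overlap check in step 4 can be done offline before touching any device. The sketch below uses Python's standard `ipaddress` module; the `find_overlaps` helper and the example prefixes are hypothetical and stand in for the VNet address space and the prefixes declared in the Local Network Gateway.

```python
import ipaddress


def find_overlaps(vnet_prefixes, local_gateway_prefixes):
    """Return every (vnet_prefix, on_prem_prefix) pair whose address ranges overlap."""
    vnets = [ipaddress.ip_network(p) for p in vnet_prefixes]
    on_prem = [ipaddress.ip_network(p) for p in local_gateway_prefixes]
    return [
        (str(v), str(o))
        for v in vnets
        for o in on_prem
        if v.overlaps(o)
    ]


# Hypothetical example: the second on-premises prefix falls inside the VNet range.
vnet_address_space = ["10.0.0.0/16"]
declared_on_prem = ["192.168.0.0/16", "10.0.128.0/17"]

print(find_overlaps(vnet_address_space, declared_on_prem))
```

An empty result rules out overlap and lets the investigation move on to step 3, comparing the declared prefixes with the networks that actually exist on-premises.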

Troubleshooting Tree: Diagnose and Resolve Virtual Network Gateway Connectivity Issues​


Legend:

Color        | Node Type
Dark blue    | Initial symptom (entry point)
Medium blue  | Diagnostic question (decision)
Red          | Identified cause
Green        | Recommended action or resolution
Orange       | Intermediate verification or validation

To use this tree when facing a real problem, start with the root node describing the observed symptom and answer each question based on what is verifiable in the environment at that moment. Follow the path that corresponds to the actual observed state, without skipping levels. Each branch eliminates a class of causes and narrows the diagnosis until the cause is identified or the resolution action is found. If a resolution action is applied and the symptom persists, return to the intermediate verification node immediately above and follow the alternative path.
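
The traversal described above can be sketched as a walk over a small tree structure. The `Node` class, the `walk` function, and the three example nodes are hypothetical and far smaller than the actual tree; they only illustrate following one branch per observed answer from the symptom down to a cause.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                      # "symptom", "question", "verification", "cause", or "action"
    text: str
    branches: dict = field(default_factory=dict)   # answer -> next Node


def walk(root: Node, answers: dict) -> list:
    """Follow one path from the root; `answers` maps question text to the observed answer."""
    path, node = [], root
    while node is not None:
        path.append((node.kind, node.text))
        if not node.branches:
            break
        # Nodes without a recorded answer continue through the unconditional "next" branch, if any.
        key = answers.get(node.text, "next")
        node = node.branches.get(key)
    return path


# Hypothetical miniature tree, not the tree from this lab.
cause_ike = Node("cause", "IKE negotiation is failing")
cause_prefix = Node("cause", "Prefixes are wrong or overlapping")
question = Node("question", "Is the connection status Connected?", {"yes": cause_prefix, "no": cause_ike})
root = Node("symptom", "No traffic over the Site-to-Site VPN", {"next": question})

for kind, text in walk(root, {"Is the connection status Connected?": "yes"}):
    print(f"{kind}: {text}")
```

The walk stops at the first unanswered question, which mirrors the rule of not skipping levels: a branch is followed only when its answer is verifiable in the environment.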