Troubleshooting Lab: Identify when to use a policy-based VPN versus a route-based VPN connection

Diagnostic Scenarios

Scenario 1: Root Cause

A company's infrastructure team reports that VPN connectivity between the on-premises headquarters and an Azure VNet stopped working after a planned update to the local firewall. The Azure VPN gateway was provisioned two years ago and has never had problems. The administrator collected the following information from the environment:

Gateway Name : vpn-gw-sede
VPN Type : PolicyBased
SKU : Basic
Status : Connected (reported by portal)
BGP : Disabled
IKE Version : IKEv1

Firewall update : Firmware v4.2 -> v4.8
Firewall vendor : Check Point (legacy model)
Change applied : NAT rules updated for new DMZ segment

The Azure portal indicates the connection status as Connected, but users report that no application traffic can pass through the tunnel. The network team confirms that the physical link between headquarters and the internet provider is operational with normal latency. The security team reports that no ACL rules were changed in Azure during the maintenance window.

What is the most likely root cause of the observed problem?

A) The Basic SKU gateway no longer supports IKEv1 after platform updates, causing a silent tunnel failure
B) The firewall firmware update changed the traffic selectors configured on the local device, making them incompatible with the policy defined on the Azure policy-based gateway
C) The "Connected" status in the portal indicates the control plane is active, but disabled BGP prevents route propagation, blocking traffic
D) The NAT rules update on the firewall caused an addressing conflict with the Azure VNet prefix, blocking layer 3 routing


Scenario 2: Action Decision

A manufacturing company uses a route-based VPN gateway with VpnGw2 SKU to connect three branches to the Azure environment. During a capacity audit, it was identified that the environment needs to support a fourth branch with a legacy VPN device that operates exclusively with IKEv1 and requires static traffic selectors based on prefix.

The cause is identified: the current route-based gateway supports IKEv1 on specific Site-to-Site connections, but the fourth branch device requires a pure policy-based configuration that the route-based gateway cannot fully emulate for this equipment model.

The environment is in production. The three existing connections are active and critical. The team has a 4-hour maintenance window next Saturday.

What is the correct action to take at this moment?

A) Provision a second policy-based VPN gateway on a dedicated new GatewaySubnet to exclusively serve the fourth branch, keeping the existing route-based gateway intact
B) Recreate the existing gateway as policy-based to unify all connections under a single IKEv1-compatible gateway
C) Configure a custom IPsec policy on the existing route-based gateway with forced IKEv1 and manual traffic selectors for the fourth branch
D) Replace the fourth branch VPN device with an IKEv2-compatible model before the maintenance window


Scenario 3: Root Cause

An administrator reports attempting to create a second Site-to-Site connection on an existing VPN gateway in Azure, but the portal returned the following error:

Error Code : GatewayConnectionCountExceeded
Message : The gateway does not support more than 1 Site-to-Site connection.
Resource : /subscriptions/.../vpnGateways/vpn-gw-legacy
Operation : CreateOrUpdateVirtualNetworkGatewayConnection

The administrator verified the following environment details before opening a ticket:

Gateway Name : vpn-gw-legacy
Location : East US
SKU : Basic
VPN Type : PolicyBased
Connections : vpn-conn-matriz (Site-to-Site, IKEv1, Status: Connected)
GatewaySubnet : 10.0.255.0/27
VNet Address : 10.0.0.0/16
Uptime : 847 days

The administrator suspects the problem is related to the Basic SKU, which has lower throughput and simultaneous connection limits compared to higher SKUs. He opens a ticket requesting an upgrade to VpnGw1 SKU.

What is the actual root cause of the observed error?

A) The Basic SKU limits the number of Site-to-Site connections to 1; upgrading to VpnGw1 would solve the problem
B) The GatewaySubnet size (/27) is insufficient to support multiple connections and is causing the allocation error
C) The gateway's PolicyBased type imposes the limit of a single Site-to-Site connection, regardless of SKU
D) The gateway has reached the continuous uptime limit and needs to be restarted before accepting new connections


Scenario 4: Diagnostic Sequence

An administrator receives the following report: "The VPN tunnel between Azure and the remote office appears as connected in the portal, but no traffic passes between the networks."

The configured gateway is policy-based type. The administrator has access to the Azure portal, the on-premises VPN device, and platform diagnostic tools.

The available investigation steps are:

  • Step P: Verify that the traffic selectors configured on the on-premises device exactly match the prefixes defined in the Azure gateway policy
  • Step Q: Execute VPN Diagnostics in the Azure portal to capture data plane logs
  • Step R: Confirm that IKE Phase 1 tunnel status is established on the on-premises device
  • Step S: Test end-to-end connectivity with ping from a VM in the VNet to an on-premises host
  • Step T: Check if there's a static route or UDR in the VNet that might be redirecting traffic to another next hop

What is the correct investigation sequence?

A) S, R, P, Q, T
B) R, Q, P, T, S
C) Q, S, R, P, T
D) R, P, T, Q, S


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The central clue lies in the combination of two facts: the portal reports Connected (indicating that IKE Phase 1, the control plane, is active) and no traffic passes (pointing to a data plane failure, specifically in IKE Phase 2). In a policy-based gateway, Phase 2 depends on negotiated traffic selectors, which are exact combinations of source and destination prefixes. The firmware update that included changes to the firewall's NAT rules may have altered the prefixes declared in the local selectors, making them incompatible with those configured on the Azure side.
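The exact-match requirement can be illustrated with a minimal Python sketch. It is not an Azure tool. The helper names (`normalize`, `selector_mismatches`) and all on-premises prefixes are hypothetical, chosen only to mimic a firewall change that starts declaring a new DMZ segment instead of the original LAN prefix.

```python
from ipaddress import ip_network


def normalize(selectors):
    """Parse (local, remote) prefix strings into a set of network pairs."""
    return {(ip_network(local), ip_network(remote)) for local, remote in selectors}


def selector_mismatches(azure_side, onprem_side):
    """Return the selector pairs that exist on only one side of the tunnel.

    Each side lists pairs as (local prefix, remote prefix) from its own point
    of view, so the on-premises pairs are mirrored before comparing.
    """
    azure = normalize(azure_side)
    onprem = {(remote, local) for local, remote in normalize(onprem_side)}
    return {
        "only_on_azure": sorted(azure - onprem, key=str),
        "only_on_onprem": sorted(onprem - azure, key=str),
    }


# Hypothetical prefixes: VNet 10.0.0.0/16, headquarters LAN 192.168.10.0/24,
# and a new DMZ segment 192.168.20.0/24 introduced by the firewall change.
azure = [("10.0.0.0/16", "192.168.10.0/24")]
onprem_after_change = [("192.168.20.0/24", "10.0.0.0/16")]

for side, pairs in selector_mismatches(azure, onprem_after_change).items():
    for local, remote in pairs:
        print(f"{side}: {local} <-> {remote}")
```

Any pair reported on only one side is a candidate explanation for a tunnel that looks connected but carries no traffic. An empty result on both sides would point the investigation elsewhere.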

The details about normal physical link latency and unchanged Azure ACLs are deliberately irrelevant: they only confirm that the problem is not at the physical layer or in Azure access control, without pointing to the actual cause.

Alternative A is incorrect: the Basic SKU natively supports IKEv1 and isn't affected by platform updates as described. Alternative C is the most dangerous distractor: disabled BGP is the default and expected configuration for a policy-based gateway, and it doesn't block traffic when the selectors are correct. Alternative D confuses the problem layer: a NAT conflict would affect tunnel establishment in Phase 1, in which case the portal would not report Connected.


Answer Key: Scenario 2

Answer: A

The critical constraint of the scenario is that the three existing connections are in production and are critical. Alternative B would require recreating the existing gateway as policy-based, which would destroy the three active connections and limit the environment to a single Site-to-Site connection, making it impossible to keep the existing branches connected. Alternative C is technically attractive, but the statement explicitly says that the fourth branch device requires a pure policy-based configuration that the route-based gateway cannot fully emulate for this specific model. Alternative D ignores the time constraint and transfers the problem to the branch team without solving the immediate need.

Alternative A solves the problem without affecting the existing environment: a second, policy-based gateway serves the legacy branch exclusively, with no impact on the production connections. Because a VNet can contain only one GatewaySubnet and therefore only one VPN gateway, the dedicated GatewaySubnet has to live in a separate VNet, which can then be integrated with the existing one through peering or a hub-spoke architecture. This is the correct action given the set of constraints.
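The reasoning behind this choice can be sketched as a small decision helper. This is an illustration under stated assumptions, not an official selection rule: the `Requirements` fields, the `recommend_gateway_type` function, and the `SplitGateways` label are all hypothetical, and defaulting to route-based when nothing forces policy-based is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    s2s_connections: int                # Site-to-Site connections the gateway must carry
    needs_bgp: bool                     # dynamic routing is required
    device_requires_policy_based: bool  # legacy device with static, prefix-based selectors


def recommend_gateway_type(req: Requirements) -> str:
    """Return 'RouteBased', 'PolicyBased', or 'SplitGateways' (hypothetical labels)."""
    needs_route_based = req.s2s_connections > 1 or req.needs_bgp
    if needs_route_based and req.device_requires_policy_based:
        # One gateway cannot satisfy both sets of constraints.
        return "SplitGateways"
    if req.device_requires_policy_based:
        return "PolicyBased"
    # Assumption of this sketch: route-based is the default when nothing
    # forces a policy-based configuration.
    return "RouteBased"


three_branches = Requirements(s2s_connections=3, needs_bgp=False,
                              device_requires_policy_based=False)
fourth_branch = Requirements(s2s_connections=1, needs_bgp=False,
                             device_requires_policy_based=True)
all_on_one_gateway = Requirements(s2s_connections=4, needs_bgp=False,
                                  device_requires_policy_based=True)

for name, req in [("three branches", three_branches),
                  ("fourth branch", fourth_branch),
                  ("all on one gateway", all_on_one_gateway)]:
    print(f"{name}: {recommend_gateway_type(req)}")
```

Evaluated separately, the three existing branches and the legacy branch each map cleanly to one gateway type. Forcing all four onto a single gateway produces the conflict that alternative A avoids.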


Answer Key: Scenario 3

Answer: C

The GatewayConnectionCountExceeded error with the explicit message of 1 connection limit, combined with the PolicyBased type visible in the configuration, points directly to the gateway type's architectural limitation, not the SKU. Policy-based gateways in Azure support a maximum of one Site-to-Site connection, regardless of the SKU used.

The most dangerous distractor here is alternative A: the administrator's diagnosis is plausible because the Basic SKU does indeed have limits, but these limits relate to throughput and features (like BGP and P2S), not the number of Site-to-Site connections. An upgrade to VpnGw1 would not solve the problem, as the VPN Type would remain PolicyBased. The result would be wasted time and cost, with the error persisting after migration. The correct solution would be migrating to a route-based gateway, which requires recreating the resource.

Alternative B is irrelevant: GatewaySubnet size doesn't impact the number of allowed connections. Alternative D is technically invalid: uptime is not a limiting factor for creating connections.
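As a rough illustration of why the VPN type, not the SKU, decides the outcome, the check below encodes only the one limit stated in the scenario. The function name, the dictionary keys, and the return labels are hypothetical. Route-based limits are deliberately not modeled.

```python
POLICY_BASED_MAX_S2S = 1  # limit stated in the scenario's error message


def explain_connection_error(gateway: dict, existing_s2s: int) -> str:
    """Classify why adding one more Site-to-Site connection would fail."""
    if gateway["vpn_type"] == "PolicyBased" and existing_s2s >= POLICY_BASED_MAX_S2S:
        return "vpn_type_limit"  # the SKU is not the deciding factor here
    return "not_explained_by_vpn_type"


# Values taken from the environment details in Scenario 3.
gateway = {"name": "vpn-gw-legacy", "sku": "Basic", "vpn_type": "PolicyBased"}
print(explain_connection_error(gateway, existing_s2s=1))
```

Note that the `sku` field is never read: changing it in the input leaves the result unchanged, which mirrors the argument against alternative A.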


Answer Key: Scenario 4

Answer: B

The correct sequence is R, Q, P, T, S, which follows the logic of progressive diagnosis from most fundamental to most specific:

R confirms that IKE Phase 1 is established on the on-premises device. If Phase 1 failed, the rest of the investigation is irrelevant. This is always the first step in any VPN diagnosis.

Q executes VPN Diagnostics in the Azure portal to capture data plane evidence from the Azure side, identifying if the problem is in the tunnel itself or beyond it.

P investigates traffic selectors, which are the most critical component of a policy-based gateway. An incompatibility here explains exactly the symptom: tunnel reported as active, but no traffic passing.

T checks if there's a route or UDR redirecting traffic before it reaches the gateway, which is a common collateral cause in environments with complex routing.

S is the final validation step: only after eliminating the above hypotheses does it make sense to test end-to-end connectivity, as the positive or negative ping result will have sufficient diagnostic context to be correctly interpreted.

Alternatives A and C make the mistake of starting with end-to-end tests before validating tunnel state, which generates diagnostic noise without eliminating hypotheses. Alternative D reverses P and T relative to the ideal order, investigating routes before validating selectors, which are the most likely cause in a policy-based scenario.
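The ordering logic can be sketched as a runner that walks the steps in sequence and stops at the first check that fails. This is a minimal illustration: the `observed` values are hypothetical stand-ins for what an administrator would read from the on-premises device and the Azure portal, and no real diagnostic API is called.

```python
def first_failing_step(sequence, checks):
    """Run checks in order and return (step, description) for the first failure."""
    for step in sequence:
        description, check = checks[step]
        if not check():
            return step, description
    return None  # every check passed


SEQUENCE = ["R", "Q", "P", "T", "S"]

# Hypothetical observations for a tunnel that is up but passes no traffic.
observed = {
    "phase1_up": True,
    "diagnostics_clean": True,
    "selectors_match": False,
    "no_conflicting_udr": True,
    "ping_ok": False,
}

checks = {
    "R": ("IKE Phase 1 established on the on-premises device", lambda: observed["phase1_up"]),
    "Q": ("VPN Diagnostics shows no data plane errors", lambda: observed["diagnostics_clean"]),
    "P": ("Traffic selectors match on both sides", lambda: observed["selectors_match"]),
    "T": ("No static route or UDR redirects the traffic", lambda: observed["no_conflicting_udr"]),
    "S": ("End-to-end ping succeeds", lambda: observed["ping_ok"]),
}

print(first_failing_step(SEQUENCE, checks))
```

With these observations the runner stops at step P, before the ping in step S is ever attempted, so the failed ping never has to be interpreted without context.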


Troubleshooting Tree: Identify when to use a policy-based VPN versus a route-based VPN connection

Color Legend:

Color : Node Type
Dark blue : Initial symptom (entry point)
Blue : Diagnostic question
Red : Identified cause
Green : Recommended action or resolution
Orange : Intermediate verification or validation

To use this tree when facing a real problem, start at the root node by identifying the observed symptom. Follow the branching by matching each question to what you can confirm in the environment: always start with the most fundamental question (gateway type, connection state, IKE Phase 1 state) before advancing to more specific checks like traffic selectors or routes. Each path ends with a named cause or concrete action, progressively eliminating hypotheses without jumping to premature conclusions.