Troubleshooting Lab: Design a site-to-site VPN connection, including for high availability

Diagnostic Scenarios​

Scenario 1: Root Cause

An operations team reports that the site-to-site VPN between the branch office and Azure stopped working after an early-morning maintenance window. The on-call engineer checks the portal and sees that the connection status is "Not connected". Two activities were performed during the maintenance: a firmware update on the on-premises VPN device and a renewal of the TLS certificate on the branch office's internal web server.

During investigation, the engineer collects the following log from the on-premises device:

[IKEv2] SA negotiation failed
Peer: 52.10.4.1 (Azure VPN Gateway)
Reason: TS_UNACCEPTABLE
Local proposal: 10.1.0.0/24
Remote proposal: 10.2.0.0/24
Phase 1: SUCCESS
Phase 2: FAILED

The engineer also notes that the Azure gateway's public IP has not changed and that the on-premises device's uptime shows a restart at 02:14.

What is the root cause of the failure?

A) The TLS certificate renewal corrupted the VPN device's key store, invalidating IKEv2 authentication.

B) The updated firmware reset the phase 2 traffic selectors to values incompatible with those configured in Azure.

C) The device restarted during maintenance and lost active IKEv2 SAs, requiring manual renegotiation by the Azure administrator.

D) The Azure gateway's public IP underwent a silent change after maintenance, and the local device still points to the old address.


Scenario 2: Action Decision

The networking team has identified that VPN connectivity between the branch office and Azure failed completely because the single on-premises VPN device physically burned out. The cause is confirmed. The Azure VPN Gateway is in Active-Active mode with two public IPs provisioned and operational. No backup device is immediately available on-site.

The critical application that depends on the VPN serves external customers, and its SLA is already compromised. Management authorizes any emergency action that restores connectivity within 30 minutes. The team has full administrative access to Azure and to the branch office's Internet provider.

What is the correct action to take at this time?

A) Reconfigure the Azure VPN Gateway to Active-Standby mode, reducing costs while awaiting new hardware arrival.

B) Provision a VM in Azure with Network Virtual Appliance (NVA) role as temporary VPN gateway, replacing the Azure VPN Gateway.

C) Provision an Azure Point-to-Site VPN for local administrators as contingency solution while hardware is replaced.

D) Provision a new virtual VPN device at the branch office (e.g., cloud appliance or VM with VPN router role) and establish tunnels with both public IPs of the existing Azure gateway.


Scenario 3: Root Cause

An architect configured a site-to-site VPN solution with BGP enabled to connect the branch office to Azure. After deployment, VMs in the Azure VNet can ping on-premises resources normally. However, when testing communication between the branch office and a second VNet connected via VNet Peering to the main VNet, packets do not reach the destination.

The architect verifies configurations and collects the following information:

Main VNet (hub): 10.0.0.0/16
Secondary VNet (spoke): 10.1.0.0/16
VNet Peering: hub <-> spoke (configured and status: Connected)
BGP: enabled on gateway and connection
Routes advertised by branch via BGP: 192.168.10.0/24

The architect also verifies that peering between the two VNets was created three weeks ago and worked correctly before the VPN was configured. The Azure subscription used has sufficient public IP quota and the gateway is on VpnGw2 SKU.

What is the root cause of the problem?

A) The VpnGw2 SKU does not support BGP route propagation to VNets connected via peering.

B) The VNet Peering between hub and spoke does not have the "Use Remote Gateway" option enabled on the spoke side, preventing use of the hub VNet gateway.

C) BGP is only advertising routes from the branch to Azure, but not redistributing secondary VNet routes back to the on-premises device.

D) The public IP quota consumed by the gateway prevents route propagation to additional VNets.


Scenario 4: Diagnostic Sequence

An engineer receives the following ticket: "The site-to-site VPN was working yesterday. This morning users at the branch office cannot access any resources in Azure. The connection status in the Azure portal shows as Connected."

The engineer has the following investigation steps available, presented out of order:

  1. Check the Effective Routes of VM NICs in Azure to confirm if the route for the on-premises prefix is present.
  2. Execute Connection troubleshoot from Network Watcher between an Azure VM and an on-premises IP.
  3. Confirm if the VPN connection status in the portal actually reflects data connectivity by testing with ICMP ping between endpoints.
  4. Check NSGs (Network Security Groups) applied to subnets and NICs of Azure VMs that users are trying to access.
  5. Confirm if there were recent changes to route tables (UDR) associated with VNet subnets.

What is the correct investigation sequence for this scenario?

A) 1 -> 2 -> 3 -> 5 -> 4

B) 3 -> 1 -> 5 -> 4 -> 2

C) 2 -> 4 -> 1 -> 3 -> 5

D) 3 -> 5 -> 1 -> 4 -> 2


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: B

The log shows a Phase 2 failure with TS_UNACCEPTABLE, the IKEv2 notification a peer returns when none of the traffic selectors proposed to it are acceptable. Phase 1 completed successfully, which rules out authentication, pre-shared key, and IP reachability problems. This directly eliminates alternatives A and D.

The device restart at 02:14 does not explain the failure on its own: after restarting, the device attempted to renegotiate the SA normally, and Phase 1 succeeded, confirming that the device is functional. The problem is in the proposal sent in Phase 2.

Alternative C represents the most common reasoning error: confusing the loss of existing SAs (transient symptom resolved automatically by renegotiation) with the cause of persistent failure. SA renegotiation does not require manual intervention in Azure. If the cause were only SA loss, the connection would reestablish itself within minutes.

The real cause is that the firmware update reset the Phase 2 traffic selectors to values that do not match the prefixes configured in Azure, so every negotiation attempt is rejected. The renewal of the internal web server's TLS certificate is unrelated to the VPN and was included as a distractor.
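The selector comparison behind this diagnosis can be sketched in Python. This is a minimal illustration, not the matching logic of any real gateway: it checks for strict equality only, whereas real IKEv2 peers may also narrow selectors. The proposed values come from the log in Scenario 1; the expected values are hypothetical, since the scenario does not state what is configured in Azure, and the function name `selector_mismatches` is likewise an assumption.

```python
import ipaddress


def selector_mismatches(proposed, expected):
    """Compare proposed and expected traffic selectors side by side.

    Both arguments are dicts with "local" and "remote" CIDR strings.
    Returns a list of human-readable mismatches; an empty list means
    the two selector pairs are identical.
    """
    mismatches = []
    for side in ("local", "remote"):
        got = ipaddress.ip_network(proposed[side])
        want = ipaddress.ip_network(expected[side])
        if got != want:
            mismatches.append(f"{side}: proposed {got}, expected {want}")
    return mismatches


# Values taken from the on-premises device log in Scenario 1.
proposed = {"local": "10.1.0.0/24", "remote": "10.2.0.0/24"}

# Hypothetical Azure-side configuration; the scenario does not state it.
expected = {"local": "10.1.0.0/16", "remote": "10.2.0.0/16"}

for line in selector_mismatches(proposed, expected):
    print(line)
```

Running a comparison like this against the device configuration exported before and after the firmware update would show whether the selectors changed during maintenance.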


Answer Key: Scenario 2

Answer: D

The cause is confirmed: the on-premises device was physically destroyed. The Azure VPN Gateway is operational with two public IPs. The single point of failure is on the on-premises side. The critical constraint is restoring connectivity within 30 minutes.

Alternative D is the only one that solves the problem within the constraint: provision a virtual VPN appliance or VM with VPN router role at the branch office and connect it to the two existing Azure gateway IPs, leveraging the already provisioned Azure infrastructure.

Alternative A does not restore connectivity; it only changes the mode of a gateway that is already operational. The failure is on the on-premises side, which this action does not address.

Alternative B is technically valid in other contexts, but deploying an NVA in Azure to replace the existing Azure gateway within 30 minutes is infeasible and unnecessary, since the Azure gateway is working.

Alternative C solves the access problem for individual administrators, but doesn't restore connectivity for the application serving external customers, ignoring the SLA constraint stated in the scenario.


Answer Key: Scenario 3

Answer: B

The symptom is clear: communication works between the branch office and hub VNet, but fails between the branch office and spoke VNet connected by peering. This indicates the problem is in route propagation through peering, not in the VPN tunnel itself.

For a spoke VNet to use a hub VNet's gateway through peering, two configurations are necessary: on the hub side, enable "Allow Gateway Transit"; on the spoke side, enable "Use Remote Gateway". Without the second option active on the spoke, routes learned via BGP by the hub gateway are not propagated to the spoke VNet.
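The two-sided requirement can be expressed as a small check. The sketch below uses a hypothetical `Peering` data class whose field names are loosely modeled on the two settings named above; it does not call any Azure API, and the peering names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peering:
    """Hypothetical model of one direction of a VNet peering."""
    name: str
    allow_gateway_transit: bool = False  # relevant on the hub side
    use_remote_gateways: bool = False    # relevant on the spoke side


def spoke_can_use_hub_gateway(hub_to_spoke, spoke_to_hub):
    """Gateway transit works only when both sides opt in."""
    return hub_to_spoke.allow_gateway_transit and spoke_to_hub.use_remote_gateways


# State described in Scenario 3: the hub allows transit (assumed here),
# but the spoke never enabled "Use Remote Gateway".
hub_side = Peering("hub-to-spoke", allow_gateway_transit=True)
spoke_side = Peering("spoke-to-hub", use_remote_gateways=False)

print(spoke_can_use_hub_gateway(hub_side, spoke_side))
```

With `use_remote_gateways` set to `True` on the spoke side, the same check passes, which corresponds to the fix for this scenario.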

The clue in the scenario is that peering worked before the VPN was configured: this means connectivity within Azure between hub and spoke was working, but gateway transit configuration hadn't been necessary until then, and wasn't enabled when the VPN was created.

Alternative A is false; the VpnGw2 SKU supports BGP route propagation via peering. Alternative C describes a reverse routing problem (from Azure to branch), but the described symptom is failure from branch to spoke, not lack of routes on the on-premises device. Alternative D is technically incoherent; public IP quota doesn't affect route propagation.


Answer Key: Scenario 4

Answer: B

The correct sequence is: 3 -> 1 -> 5 -> 4 -> 2.

The mandatory starting point is step 3: confirm with a real test (ICMP ping) if the "Connected" status in the portal corresponds to effective data connectivity. The Azure portal can display "Connected" even when data transfer is blocked by other controls, such as NSGs or UDRs. Proceeding to any analysis before this confirmation is a methodological error.

With data failure confirmed, the next step is step 1: check effective routes of VM NICs. If the route for the on-premises prefix is not present, the problem is at the routing layer.

Step 5 comes next: check if there were recent changes to UDRs, which can override routes learned via BGP and redirect traffic to a wrong destination or drop it.
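How a UDR can override a BGP-learned route can be illustrated with a simplified route selection function: the longest matching prefix wins, and when prefixes are identical a user-defined route takes precedence over a BGP route, which takes precedence over a system route. The route entries, the prefix 192.168.10.0/24 as the on-premises range, and the function name `select_route` are assumptions made for this sketch.

```python
import ipaddress

# Tie-break order when several routes share the same prefix.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


def select_route(routes, destination):
    """Pick the effective route for a destination IP, or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    # Longest prefix first, then route source priority.
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r["prefix"]).prefixlen,
            SOURCE_PRIORITY[r["source"]],
        ),
    )


# Hypothetical effective routes on a VM NIC after a recent UDR change.
routes = [
    {"prefix": "192.168.10.0/24", "source": "bgp", "next_hop": "VirtualNetworkGateway"},
    {"prefix": "192.168.10.0/24", "source": "user", "next_hop": "None"},
    {"prefix": "0.0.0.0/0", "source": "system", "next_hop": "Internet"},
]

chosen = select_route(routes, "192.168.10.5")
print(chosen["source"], chosen["next_hop"])
```

In this example the user-defined route with next hop `None` wins over the BGP route for the same prefix, so return traffic to the branch is dropped even though the tunnel itself is up.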

Step 4 investigates NSGs, which act at the filtering layer after routing is correct. Checking NSGs before confirming routing inverts the diagnostic logic.
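For the filtering layer, NSG rules are evaluated in priority order (lowest number first) and the first matching rule decides the outcome. The sketch below is a deliberately reduced model: it matches only on source prefix and destination port, and it collapses the default rules into a single final deny, which is a simplification. The rule names, priorities, and the 192.168.10.0/24 branch prefix are hypothetical.

```python
import ipaddress


def evaluate_inbound(rules, source_ip, dest_port):
    """Return (access, rule_name) for the first rule that matches, by priority."""
    src = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["priority"]):
        prefix_matches = src in ipaddress.ip_network(rule["source_prefix"])
        port_matches = rule["port"] == "*" or rule["port"] == dest_port
        if prefix_matches and port_matches:
            return rule["access"], rule["name"]
    # Simplification: stands in for the default rules at the end of the list.
    return "Deny", "implicit-deny"


# Hypothetical rules: a broad deny was added with a lower priority number
# than the existing allow rule, so it is evaluated first.
rules = [
    {"name": "allow-branch-https", "priority": 200,
     "source_prefix": "192.168.10.0/24", "port": 443, "access": "Allow"},
    {"name": "deny-branch", "priority": 100,
     "source_prefix": "192.168.10.0/24", "port": "*", "access": "Deny"},
]

print(evaluate_inbound(rules, "192.168.10.5", 443))
```

Here the deny rule at priority 100 shadows the allow rule at priority 200, which is the kind of ordering problem worth looking for once routing has been confirmed.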

Finally, step 2 (Network Watcher Connection troubleshoot) is useful to confirm and document the problem after more common hypotheses have been eliminated, as it consumes additional time and resources.

The most dangerous alternative is C, which starts with Connection troubleshoot without first establishing whether the problem lies in the data plane or the control plane, wasting critical time.


Troubleshooting Tree: Design a site-to-site VPN connection, including for high availability​


Color Legend:

Color | Node Type
Dark blue (almost black) | Initial symptom (tree root)
Blue | Diagnostic question
Orange | Intermediate validation or verification
Red | Identified cause
Green | Recommended action or resolution

To use this tree when facing a real problem, start at the root node and answer each question based on what can be directly observed in the Azure portal, on-premises device logs, or connectivity tests. Follow the branch corresponding to the observed answer until reaching an identified cause node. From the cause, execute the recommended action in the corresponding green node and validate that connectivity has been restored before closing the ticket.