Troubleshooting Lab: Implement a site-to-site VPN connection

Diagnostic Scenarios

Scenario 1: Root Cause

An operations team reports that the site-to-site VPN connection between the main data center and the Azure VNet went down after maintenance performed during the night. The on-call engineer checks the Azure portal and observes the "Not Connected" status on the connection. The tunnel does not re-establish automatically after 40 minutes.

During the investigation, the engineer collects the following information:

Virtual Network Gateway
SKU: VpnGw2
VPN Type: Route-Based
BGP: Disabled
Active-Active: Disabled

Local Network Gateway
IP Address: 198.51.100.45
Address Space: 172.16.0.0/12

Connection
Shared Key (PSK): Az700Lab#2024
Status: Not Connected
Last connected: 03/26/2026 02:14 UTC

The engineer confirms with the on-premises infrastructure team that, during maintenance, the edge firewall was replaced with a new appliance. The new equipment is online and responds to pings on the internal network. The gateway SKU was not changed. The Azure portal TLS certificate was renewed on the same night.

What is the root cause of the VPN connection failure?

A) The Azure portal TLS certificate renewal corrupted the existing connection configuration.

B) The new firewall appliance uses a different public IP than the one configured in the Local Network Gateway.

C) BGP deactivation prevented automatic tunnel re-establishment after the interruption.

D) The VpnGw2 SKU lost connection state due to gateway restart during the Azure maintenance window.


Scenario 2: Action Decision

The network team identified that a branch office's site-to-site VPN connection is down because the Shared Key (PSK) configured on the on-premises VPN device does not match the PSK registered in Azure. The cause is confirmed by logs from the local VPN device.

The operational context is as follows:

  • The connection is the only access path from the branch to ERP systems hosted in the Azure VNet
  • It's 2:30 PM on a Friday and the ERP is actively being used by 37 branch users
  • Changing the PSK in Azure requires the connection to be dropped and re-established, causing approximately 3 to 5 minutes of interruption
  • The branch's local team has access to the VPN device panel and can apply the change on the on-premises side immediately
  • There is a scheduled maintenance window for Sunday at 2:00 AM

What is the correct action to take at this time?

A) Immediately correct the PSK in Azure and request the branch team to update the VPN device in parallel, accepting the 3 to 5 minute interruption.

B) Wait for Sunday's maintenance window to correct the PSK, as the interruption during production hours will impact the 37 active users.

C) Request the branch team to correct only the on-premises side to try to re-establish the connection without changing Azure.

D) Create a new VPN connection with the correct PSK in parallel and remove the old connection after synchronization.


Scenario 3: Root Cause

An engineer receives a ticket reporting that VMs in the Azure VNet can ping on-premises servers, but TCP connections on port 1433 to the on-premises database server consistently fail. The engineer checks the VPN connection status in the portal:

Connection Status: Connected
Data In: 142.7 MB
Data Out: 98.3 MB

The engineer then executes a connectivity test from a VM in the VNet:

# ICMP Test
ping 10.20.5.10
# Result: 4 packets sent, 4 received, 0% loss

# TCP Test
Test-NetConnection -ComputerName 10.20.5.10 -Port 1433
# TcpTestSucceeded : False
# PingSucceeded : True

The engineer verifies that the NSG associated with the VM subnet in Azure has no rules blocking outbound traffic to port 1433. The on-premises database server is running and accepts connections from other hosts on the same local network. The VPN gateway was provisioned 6 months ago without changes.

What is the root cause of the TCP connection failures on port 1433?

A) The Azure VM subnet NSG is blocking outbound traffic to port 1433, but the rule doesn't appear in the portal due to propagation delay.

B) The firewall or ACL on the on-premises side is blocking TCP connections on port 1433 originating from the Azure VNet IP range.

C) The VPN tunnel does not support TCP traffic on ports above 1024 without additional policy configuration.

D) The VPN gateway is operating in Policy-Based mode, which restricts protocols allowed through the tunnel.


Scenario 4: Diagnostic Sequence

An engineer receives the following report: the site-to-site VPN connection shows "Connected" status in the Azure portal, but no application traffic flows between the on-premises network and VMs in the VNet. The issue started after an IP addressing reconfiguration performed at the branch office.

The following investigation steps are available, out of order:

  1. Verify if the Local Network Gateway address space reflects the new IP ranges of the on-premises network
  2. Confirm that the connection status in the Azure portal is "Connected"
  3. Execute a ping test from an Azure VM to an on-premises host using the new IP
  4. Verify if effective routes on the VM NICs include the on-premises network prefixes
  5. Confirm with the branch team which IP ranges were changed and what the new values are

What is the correct investigation sequence for this scenario?

A) 2 -> 1 -> 5 -> 4 -> 3

B) 2 -> 5 -> 1 -> 4 -> 3

C) 5 -> 2 -> 1 -> 3 -> 4

D) 1 -> 5 -> 2 -> 4 -> 3


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The determining clue in the statement is the replacement of the on-premises firewall appliance. Edge devices frequently end up with a new public IP when replaced, either through different manual configuration or a new assignment from the internet provider. The Local Network Gateway in Azure stores the on-premises VPN device's public IP as a fixed reference. If this IP changed and the Local Network Gateway wasn't updated, Azure keeps trying to establish the IKE tunnel to a non-existent or incorrect endpoint, and the connection stays in "Not Connected" status until the configuration is corrected.
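
The comparison behind this explanation can be sketched in Python. This is a minimal illustration, not an Azure API call: the function name `lng_ip_matches` and the observed address 203.0.113.10 (a documentation address) are hypothetical, while 198.51.100.45 is the Local Network Gateway IP given in the scenario.

```python
import ipaddress


def lng_ip_matches(configured_ip: str, observed_ip: str) -> bool:
    """Return True when the IP stored in the Local Network Gateway equals
    the public IP the on-premises VPN device actually uses."""
    # ip_address() normalizes the text form and rejects malformed input.
    return ipaddress.ip_address(configured_ip) == ipaddress.ip_address(observed_ip)


configured = "198.51.100.45"  # value from the scenario's Local Network Gateway
observed = "203.0.113.10"     # hypothetical public IP of the new appliance

if not lng_ip_matches(configured, observed):
    print(f"Mismatch: Azure targets {configured}, device uses {observed}")
```

In practice the observed value would come from the on-premises team, since only they can confirm which public IP the new appliance presents.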

The information about the Azure portal TLS certificate renewal is deliberately irrelevant: portal TLS certificates have no relationship with the VPN data plane. Disabled BGP and gateway SKU are also irrelevant, as neither of these factors changed and neither causes tunnel failure by itself in this context.

The most dangerous distractor is alternative D, as gateway restarts are real events, but Azure automatically re-establishes connections after restarts and the statement describes a persistent 40-minute failure with a clear operational cause on the on-premises side.


Answer Key: Scenario 2

Answer: B

The cause is confirmed and the technical solution is known. The scenario constraints determine the correct answer: 37 active users on a critical ERP system during production hours, with a scheduled maintenance window available in less than 48 hours.
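
The "less than 48 hours" figure can be checked with a short calculation. The calendar dates below are hypothetical, chosen only because they fall on a Friday and the following Sunday; the weekdays and times are the ones given in the scenario.

```python
from datetime import datetime

# Hypothetical calendar dates; only the weekdays and times come from the scenario.
now = datetime(2026, 3, 27, 14, 30)   # a Friday, 2:30 PM
window = datetime(2026, 3, 29, 2, 0)  # the following Sunday, 2:00 AM

hours_until_window = (window - now).total_seconds() / 3600
print(hours_until_window)  # 35.5
```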

A 3 to 5 minute interruption in an actively used ERP can cause loss of in-flight transactions, corruption of uncommitted data, and direct business impact. The Sunday 2:00 AM window exists precisely to absorb this type of intervention.

Alternative C represents the most common mistake: changing only the on-premises side doesn't solve the problem because the PSK is compared bilaterally during the IKE handshake. If both sides don't agree, the tunnel won't come up, regardless of which side was corrected.

Alternative D is technically unfeasible in this context: it's not possible to have two simultaneous active connections to the same on-premises endpoint on the same gateway without specific configuration, and the process wouldn't eliminate the interruption.


Answer Key: Scenario 3

Answer: B

The set of evidence points directly to a firewall rule or ACL on the on-premises side. The elimination reasoning is as follows: the VPN tunnel is active and transporting data (evidenced by Data In/Out), ICMP works in both directions, and the Azure NSG was verified and doesn't block port 1433 outbound. The database server is operational and accepts local connections.

The only element not explicitly verified is the on-premises firewall. It's common for on-premises firewalls to allow traffic between internal hosts but block connections originating from external ranges, even when those ranges arrive through the VPN tunnel. To the on-premises firewall, traffic coming from Azure appears with VNet range IPs, which may not be on the allowed sources list for port 1433.
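
The behavior described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the configuration of any real firewall: the allowed network 10.20.0.0/16 is an assumed on-premises LAN containing the database server 10.20.5.10, and 10.1.0.4 is a hypothetical Azure VM address, since the scenario does not give the VNet range.

```python
import ipaddress

# Hypothetical rule set: sources the on-premises firewall allows to reach TCP 1433.
ALLOWED_SQL_SOURCES = [ipaddress.ip_network("10.20.0.0/16")]  # assumed on-premises LAN


def sql_allowed(source_ip: str) -> bool:
    """Return True if the source address falls inside an allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_SQL_SOURCES)


print(sql_allowed("10.20.7.15"))  # host on the assumed LAN -> True
print(sql_allowed("10.1.0.4"))    # hypothetical Azure VM address -> False
```

A rule set like this matches the symptoms: local hosts reach the database, while traffic arriving from the VNet range is dropped on port 1433 even though ICMP is permitted separately.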

Alternative A is ruled out by the statement itself: the NSG was verified and contains no blocking rules. Alternative C is technically incorrect: Azure VPN tunnels are transparent to layer 4 protocols and don't impose port restrictions. Alternative D is ruled out because nothing in the statement indicates a Policy-Based gateway, and the successful ping between the same two hosts shows the tunnel already carries traffic between those address ranges; the tunnel does not filter by TCP port.


Answer Key: Scenario 4

Answer: B

The correct sequence follows progressive diagnostic logic: start from the current observable state, collect reference information, compare configuration with reality, verify propagation, and validate with functional testing.

Step 2 comes first because it confirms the tunnel itself is active, which separates an application connectivity problem from a tunnel-level VPN problem. Step 5 comes next because without knowing the actual new ranges, it's impossible to assess whether any configuration is correct. Step 1 compares the Local Network Gateway address space with the values obtained in step 5. Step 4 verifies whether the effective routes on the VM NICs reflect the correct prefixes. Step 3 is the final functional test, executed only after confirming the configuration is correct.

Alternative A errs by placing Local Network Gateway verification before confirming new ranges with the branch team, which may lead the engineer to incorrectly validate a configuration based on outdated information. Alternative D errs by starting with the Local Network Gateway without first confirming connection state and actual values of the new IPs.
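
The comparison made in step 1, once step 5 has supplied the real ranges, can be sketched in Python. All prefixes in the example are hypothetical because the scenario does not list them, and `uncovered_prefixes` is an illustrative helper, not part of any Azure SDK.

```python
import ipaddress


def uncovered_prefixes(lng_address_space, on_prem_ranges):
    """Return the on-premises ranges that are not contained in any
    Local Network Gateway prefix."""
    lng_nets = [ipaddress.ip_network(p) for p in lng_address_space]
    missing = []
    for r in on_prem_ranges:
        net = ipaddress.ip_network(r)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        if not any(net.version == l.version and net.subnet_of(l) for l in lng_nets):
            missing.append(r)
    return missing


# Hypothetical values: one range still covered, one introduced by the readdressing.
print(uncovered_prefixes(["172.16.0.0/12"], ["172.20.0.0/16", "192.168.50.0/24"]))
# ['192.168.50.0/24']
```

Any prefix returned here is a range Azure has no route for through the tunnel, which would explain a "Connected" status with no application traffic.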


Troubleshooting Tree: Implement a site-to-site VPN connection

Color Legend:

Color | Node Type
Dark blue | Initial symptom (entry point)
Blue | Diagnostic question
Orange | Intermediate verification or validation
Red | Identified cause
Green | Recommended action or resolution

When facing a real problem, start at the root node and answer each question based on what is directly observable: portal status, connectivity tests, confirmations with the on-premises team. At each branch, choose only the path supported by collected evidence, not assumptions. Intermediate validation nodes indicate that more information must be collected before advancing to a cause. Upon reaching an identified cause node, the corresponding recommended action resolves that specific cause without side effects on other parts of the configuration.
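
The evidence-first approach can be illustrated with a small Python function. It is not a transcription of the tree diagram: the branches are a simplified sketch derived only from the four scenarios in this lab, and the function name and return strings are hypothetical.

```python
def next_check(status: str, ping_ok: bool, tcp_ok: bool) -> str:
    """Map directly observable evidence to the next check to perform."""
    if status != "Connected":
        # Scenarios 1 and 2: the tunnel itself is down.
        return ("Compare the Local Network Gateway IP and the shared key "
                "with the on-premises VPN device")
    if not ping_ok:
        # Scenario 4: tunnel up, but no traffic reaches the on-premises hosts.
        return ("Compare the Local Network Gateway address space and the "
                "VM effective routes with the current on-premises ranges")
    if not tcp_ok:
        # Scenario 3: ICMP passes while a specific TCP port fails.
        return "Review on-premises firewall or ACL rules for the Azure VNet range"
    return "No fault indicated by these observations"


print(next_check("Connected", ping_ok=True, tcp_ok=False))
```

Each branch depends only on something the engineer has observed, which mirrors the rule of following the path supported by evidence rather than assumptions.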