Troubleshooting Lab: Configure Forced Tunneling

Diagnostic Scenarios

Scenario 1: Root Cause

The operations team reports that after a maintenance window the previous night, VMs in the AppSubnet of a production VNet lost internet access. The environment uses forced tunneling: a route table with the route 0.0.0.0/0 → VirtualNetworkGateway is associated with the AppSubnet. The VPN gateway (SKU VpnGw2) shows a Succeeded state in the portal, and the Site-to-Site connection with the on-premises datacenter appears as Connected.

During maintenance, the following actions were performed:

  • Firmware update of the on-premises VPN appliance
  • Gateway resizing from VpnGw1 to VpnGw2
  • Addition of a new DbSubnet to the VNet

The on-premises team confirms that the corporate firewall is operational and allowing internet egress normally. An engineer executes the following command from one of the affected VMs:

curl -I https://www.microsoft.com
# Result: curl: (6) Could not resolve host: www.microsoft.com

Additionally, the following test is performed:

ping 8.8.8.8
# Result: Request timeout for icmp_seq 0

What is the root cause of internet access loss in the VMs of the AppSubnet?

A) The VPN gateway resizing recreated the resource, disassociating the route table from the AppSubnet during the process.
B) The on-premises firewall is blocking outbound traffic despite reporting operational status, as the firmware update may have reset the ACL rules.
C) The addition of the DbSubnet changed the VNet's address space and invalidated existing routes in the route table associated with the AppSubnet.
D) The VPN gateway resizing temporarily interrupted the tunnel and, after reconnection, BGP did not re-announce the 0.0.0.0/0 route to on-premises, causing return traffic to not find a path back.


Scenario 2: Action Decision

The cause has been identified: the route table with the forced tunneling route was inadvertently disassociated from the AppSubnet during an automation run via Terraform. The Terraform state file does not reflect the current Azure state because the pipeline was executed with an outdated version of the network module.

The environment is critical production. VMs in the AppSubnet are actively processing financial transactions. Security policy requires that no VM in this subnet have direct internet access without passing through the on-premises firewall. The security team is monitoring the incident in real time.

Currently, without the associated route table, VMs are using Azure's default system route and accessing the internet directly, violating the security policy.

What is the correct action to take at this moment?

A) Execute terraform apply with the updated module to restore the desired state and automatically reassociate the route table.
B) Manually reassociate the route table to the AppSubnet via portal or CLI immediately, without waiting for the Terraform correction, to restore security compliance.
C) Remove all default system routes from the AppSubnet to block internet access while the Terraform correction is prepared.
D) Isolate the AppSubnet by applying a total outbound blocking NSG while the Terraform pipeline is corrected and re-executed.


Scenario 3: Root Cause

An engineer deployed forced tunneling in a new VNet following exactly the company's internal documentation. The configuration was applied via CLI as shown below:

# Create the route table
az network route-table create \
  --name ForcedTunnelRT \
  --resource-group NetRG \
  --location eastus

# Add the default route that sends all unmatched traffic to the VPN gateway
az network route-table route create \
  --route-table-name ForcedTunnelRT \
  --resource-group NetRG \
  --name DefaultToGateway \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualNetworkGateway

# Associate the route table with the subnet
az network vnet subnet update \
  --name WorkloadSubnet \
  --vnet-name ProdVNet \
  --resource-group NetRG \
  --route-table ForcedTunnelRT

The VPN gateway was provisioned three weeks ago and is in Succeeded state. The Site-to-Site connection with on-premises is also Connected. The on-premises team confirms that return routing is correct.

After deployment, the engineer tests connectivity from a VM in the WorkloadSubnet:

# Test 1: Access to on-premises resource
ping 192.168.10.5
# Result: 64 bytes from 192.168.10.5: icmp_seq=1 ttl=62 time=18.4 ms

# Test 2: Internet access through expected path (via on-premises)
curl -I https://ifconfig.me
# Result: HTTP/1.1 200 OK
# The returned IP is the datacenter's on-premises public IP (correct)

# Test 3: Access to Azure Storage via public endpoint
curl -I https://mystorageaccount.blob.core.windows.net/container/file.txt
# Result: curl: (7) Failed to connect to mystorageaccount.blob.core.windows.net

The application team reports that Test 3 worked perfectly before configuring forced tunneling. The VPN gateway SKU is VpnGw1. The VNet has no Service Endpoints or Private Endpoints configured for the Storage Account.

What is the root cause of the failure in Test 3?

A) The VpnGw1 SKU does not support the traffic volume needed to simultaneously route on-premises traffic and PaaS service traffic, causing packet drops.
B) The on-premises firewall is blocking traffic destined for the Storage Account public endpoint, as this traffic now passes through the VPN tunnel before exiting to the internet.
C) The 0.0.0.0/0 route configured in the route table is redirecting all traffic destined for the Storage Account public IP through the VPN gateway, which previously went directly via Azure's default system route.
D) The Storage Account has a network firewall configured that started blocking the VM source IPs after the routing change.


Scenario 4: Diagnostic Sequence

An engineer receives the following ticket: "VMs in the FrontendSubnet cannot access the internet. The environment should have active forced tunneling, but traffic is not reaching on-premises."

The following investigation steps are available but out of order:

  • Step P: Check in the portal whether the route table with the route 0.0.0.0/0 → VirtualNetworkGateway is associated with the FrontendSubnet
  • Step Q: Access a VM in the FrontendSubnet and execute curl https://ifconfig.me to observe if the returned IP is Azure's public IP or the on-premises datacenter IP
  • Step R: Check the Site-to-Site connection state in the VPN gateway (status Connected or Disconnected)
  • Step S: Confirm with the on-premises team if the firewall is receiving traffic from the VNet and allowing internet egress
  • Step T: Check in the portal if the VPN gateway is provisioned and in Succeeded state

What is the correct investigation sequence to diagnose this problem progressively and efficiently?

A) Q → P → T → R → S
B) T → R → P → Q → S
C) P → T → Q → S → R
D) R → T → S → P → Q


Answer Key and Explanations

Answer Key: Scenario 1

Answer: A

The intended answer is that the gateway resize left the route table disassociated from the AppSubnet. A resize does not alter the routes inside the route table: entries whose next hop is VirtualNetworkGateway stay intact. Note also that a resize between SKUs of the same generation, such as VpnGw1 to VpnGw2, is normally performed in place and does not by itself touch subnet associations. The disassociation becomes plausible when the resize is carried out by deleting and redeploying the gateway, or by automation that reapplies the subnet configuration without the route table reference. Of the three maintenance actions, it is the only one that could have affected that association.

However, the determining clue in the scenario is that the gateway and connection states appear as Succeeded and Connected, respectively. This eliminates hypotheses related to the tunnel itself. The symptom of DNS resolution failing (Could not resolve host) indicates that traffic is not reaching any resolver, whether on-premises or internet, suggesting that the route is not being applied and VMs were left without a functional egress route.

The irrelevant information in the scenario is the addition of the DbSubnet: adding a subnet to a VNet does not affect route tables associated with other subnets nor invalidate existing routes.

The most dangerous distractor is B. Blaming the on-premises firewall is a classic diagnostic error that shifts investigation outside the Azure environment, delaying identification of the real cause. If traffic doesn't leave the VM, the on-premises firewall would never see it, regardless of its state.


Answer Key: Scenario 2

Answer: B

The critical constraint of the scenario is twofold: production in operation and active security policy violation. Manual reassociation of the route table via portal or CLI is the only action that resolves the violation immediately, without risk of side effects and without depending on tools with outdated state.
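The manual reassociation can be sketched with the Azure CLI as follows. This is a minimal sketch rather than an actual runbook: the function name reassociate_route_table is hypothetical, and the resource group, VNet, and route table names are supplied by the caller because the scenario does not name them. It uses az network vnet subnet update with --route-table and then reads the association back to confirm it.

```shell
# Sketch: reassociate a route table with a subnet and confirm the result.
# All resource names are caller-supplied placeholders (assumptions).

reassociate_route_table() {
  local rg="$1" vnet="$2" subnet="$3" route_table="$4"
  local rt_id

  # Reattach the route table to the subnet.
  az network vnet subnet update \
    --resource-group "$rg" \
    --vnet-name "$vnet" \
    --name "$subnet" \
    --route-table "$route_table" \
    --output none || return 1

  # Read the association back instead of trusting the exit code alone.
  rt_id=$(az network vnet subnet show \
    --resource-group "$rg" \
    --vnet-name "$vnet" \
    --name "$subnet" \
    --query "routeTable.id" \
    --output tsv)

  case "$rt_id" in
    */routeTables/"$route_table") echo "associated: $rt_id" ;;
    *) echo "not associated" >&2; return 1 ;;
  esac
}
```

A call might look like reassociate_route_table <resource-group> <vnet> AppSubnet <route-table>. The Terraform state still has to be reconciled afterwards, outside the incident window.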

Alternative A is technically correct under normal conditions, but is dangerous here: executing terraform apply with an outdated module and a divergent state file in a critical production environment can cause destruction or modification of resources beyond the route table, expanding the incident.

Alternative D represents a common reasoning error: applying a total outbound blocking NSG would interrupt ongoing financial transactions, causing direct business impact that may be greater than the temporary security violation being monitored.

Alternative C is technically invalid: Azure system routes cannot be directly removed; they are only overridden by UDRs, which is exactly the mechanism that failed.


Answer Key: Scenario 3

Answer: C

The forced tunneling configuration is correct and working as expected. Tests 1 and 2 confirm that on-premises traffic and internet traffic are being routed correctly through the VPN tunnel. The problem with Test 3 is a direct and expected consequence of forced tunneling: the 0.0.0.0/0 route captures all traffic that doesn't match a more specific route, including traffic destined for the Storage Account public IP.

Before forced tunneling, this traffic used the default system route 0.0.0.0/0 β†’ Internet and exited directly to the Microsoft network via Azure's backbone. After forced tunneling, traffic is sent to on-premises, and the corporate firewall, depending on its configuration, may be blocking requests to the Storage Account endpoint, or the additional latency causes timeout.
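One way to confirm this from the Azure side is to inspect the effective routes on the VM's network interface. The sketch below assumes the VM is running and that its NIC name is known. Both function names are hypothetical, and the JMESPath query assumes the effective route output lists routes under value with addressPrefix, nextHopType, and state fields.

```shell
# Sketch: report which next hop currently wins for 0.0.0.0/0 on a NIC.
# Resource group and NIC name are caller-supplied placeholders (assumptions).

default_route_next_hop() {
  local rg="$1" nic="$2"
  # Keep only the active default route; an overridden system route is not Active.
  az network nic show-effective-route-table \
    --resource-group "$rg" \
    --name "$nic" \
    --query "value[?state=='Active' && contains(addressPrefix, '0.0.0.0/0')].nextHopType | [0]" \
    --output tsv
}

explain_default_route() {
  case "$1" in
    VirtualNetworkGateway)
      echo "forced tunneling active: unmatched traffic, including public PaaS endpoints, goes to the VPN gateway" ;;
    Internet)
      echo "system default route active: unmatched traffic exits directly to the internet" ;;
    *)
      echo "no recognized active 0.0.0.0/0 route: '$1'" ;;
  esac
}
```

Running explain_default_route "$(default_route_next_hop <resource-group> <nic-name>)" for a VM in the WorkloadSubnet would be expected to report the VirtualNetworkGateway case, which matches the path now taken by Test 3.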

The irrelevant information purposely included is the VpnGw1 SKU: the gateway throughput limitation has no relation to the specific symptom observed, which is a connection failure, not performance degradation.

The most dangerous distractor is D. While network firewalls in Storage Accounts are a real cause of blocking, the scenario states that Test 3 worked before the routing change, which eliminates the Storage firewall as a new cause, since nothing in it was changed.


Answer Key: Scenario 4

Answer: B

The correct sequence is T → R → P → Q → S, which follows the logic of progressive diagnosis from the most fundamental to the most peripheral component.

The first mandatory step is to verify if the VPN gateway exists and is provisioned (T), because without it no next hop VirtualNetworkGateway in a UDR has a resolvable destination. With the gateway confirmed, check the tunnel state (R): if the connection is Disconnected, traffic will reach the gateway but have no path to on-premises. Confirming that the infrastructure is up, verify if the route table is correctly associated with the subnet (P), because without this association forced tunneling simply doesn't apply. Step Q (VM egress test) validates in practice the behavior observed by the user after confirming that the configuration should be correct. Step S (confirmation with on-premises team) is last because it only makes sense to check what happens in the corporate firewall after confirming that traffic is effectively reaching there.
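The first three steps of this sequence can be scripted, stopping at the first check that fails. The sketch below is illustrative only: diagnose_forced_tunnel is a hypothetical function name, all resource names are passed in by the caller, and step P checks only that some route table is associated with the subnet, not that it contains the 0.0.0.0/0 route. Steps Q and S stay manual.

```shell
# Sketch: run checks T, R, and P in order and stop at the first failure.
# All resource names are caller-supplied placeholders (assumptions).

diagnose_forced_tunnel() {
  local rg="$1" gateway="$2" connection="$3" vnet="$4" subnet="$5"
  local state status rt_id

  # Step T: the VPN gateway exists and is provisioned.
  state=$(az network vnet-gateway show \
    --resource-group "$rg" --name "$gateway" \
    --query provisioningState --output tsv)
  if [ "$state" != "Succeeded" ]; then
    echo "T failed: gateway state is '${state:-not found}'"
    return 1
  fi

  # Step R: the Site-to-Site connection is up.
  status=$(az network vpn-connection show \
    --resource-group "$rg" --name "$connection" \
    --query connectionStatus --output tsv)
  if [ "$status" != "Connected" ]; then
    echo "R failed: connection status is '${status:-unknown}'"
    return 1
  fi

  # Step P: a route table is associated with the subnet.
  rt_id=$(az network vnet subnet show \
    --resource-group "$rg" --vnet-name "$vnet" --name "$subnet" \
    --query "routeTable.id" --output tsv)
  if [ -z "$rt_id" ]; then
    echo "P failed: no route table associated with $subnet"
    return 1
  fi

  echo "T, R, P passed: continue with Q (curl https://ifconfig.me from a VM), then S (on-premises firewall)"
}
```

Stopping at the first failure mirrors the reasoning above: a later check says little until the component it depends on has been confirmed.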

Sequence A starts with the symptom (Q) before any infrastructure diagnosis, which can confirm the problem but doesn't point to the cause. Sequence D starts with the tunnel state (R) without first verifying if the gateway exists, which is a shortcut that can confuse a Disconnected state with gateway absence.


Troubleshooting Tree: Configure Forced Tunneling

Color Legend:

  • Dark Blue: Initial symptom (entry point)
  • Blue: Diagnostic question (binary decision or verification)
  • Red: Identified cause
  • Green: Recommended action or resolution
  • Orange: Validation or intermediate verification

When facing a real problem, start with the dark blue root node and answer each diagnostic question based on what you can observe in the portal, CLI, or logs. Follow the branch corresponding to the observed state. Orange nodes indicate that more information needs to be collected before proceeding. When reaching a red node, the cause is identified; the green node immediately below indicates the corrective action. Don't skip questions: the tree was designed to eliminate hypotheses in the most efficient order, from the most fundamental to the most peripheral component.