Troubleshooting Lab: Create and configure a virtual network gateway
Diagnostic Scenarios
Scenario 1: Root Cause
A network team has just completed the deployment of a VPN Gateway with SKU VpnGw2 in Active-Active mode for a Site-to-Site connection with the on-premises datacenter. The on-premises VPN device is a Microsoft-certified appliance, with IKEv2 enabled and pre-shared key configured correctly on both sides.
After deployment, both tunnels appear as Connected in the Azure portal. However, virtual machines in the VNet cannot reach on-premises hosts by IP, and on-premises hosts cannot reach the VMs.
The team collects the following additional information:
- The Local Network Gateway was created with the prefix `10.10.0.0/16`
- The VNet address space is `10.20.0.0/16`
- The VMs are in the subnet `10.20.1.0/24`
- The on-premises hosts that need to be reached are in `10.10.5.0/24`
- The SKU `VpnGw2` was chosen for throughput requirements; the SKU `VpnGw1` was discarded
- The gateway was provisioned 40 minutes ago and the status of both tunnels has been `Connected` for 35 minutes
- There is no Network Security Group associated with the GatewaySubnet
What is the root cause of the connectivity failure?
A) Active-Active mode requires the on-premises device to support BGP. Since the configuration uses static routing, traffic is not forwarded correctly through both tunnels.
B) The prefix configured in the Local Network Gateway does not cover the target on-premises hosts, preventing the gateway from knowing where to forward return traffic.
C) The absence of an NSG on the GatewaySubnet prevents IPSec traffic from being inspected, causing packets to be silently dropped.
D) The gateway is still in the route convergence process. The 35 minutes of Connected status is still not sufficient for routing tables to stabilize with SKU VpnGw2.
Scenario 2: Action Decision
The cause of a failure has been identified: the production Virtual Network Gateway is operating with SKU VpnGw1, and the aggregate throughput of S2S connections has consistently reached 95% of the SKU limit during business hours, causing packet loss and high latency. The team needs to upgrade to VpnGw2.
The operational context is as follows:
- The gateway serves 3 active S2S connections to critical branch offices
- There is a scheduled maintenance window for this Friday at 11 PM, lasting 2 hours, approved by the Change Advisory Board (CAB)
- The on-premises operations team at the branches has been notified and will be on standby during the window
- The Azure network team has full access to the portal and PowerShell
- An engineer suggests executing the upgrade immediately to avoid further degradation during the day
- The staging environment does not have an equivalent gateway to test the procedure
What is the correct action to take at this time?
A) Execute the upgrade immediately via PowerShell to minimize the impact of ongoing degradation, since the cause is confirmed and the procedure is documented.
B) Wait for the approved maintenance window and execute the upgrade on Friday at 11 PM, following the established change management process.
C) Create a new gateway with SKU VpnGw2 in parallel, migrate the connections and delete the old gateway, all today, to avoid planned interruption.
D) Open a support ticket with Microsoft requesting that the upgrade be done without interruption, since the environment is production and critical.
Scenario 3: Root Cause
A company uses Point-to-Site (P2S) with certificate authentication for remote access by developers. The environment has been working for six months without incidents. In the last week, three developers reported that they can no longer connect the VPN client. The other 14 developers continue to connect normally.
The VPN client output on affected computers is:
Connecting to VPN gateway...
Authentication failed: certificate validation error (code 0x80090326)
The certificate chain was issued by an authority that is not trusted.
Information collected by the team:
- The three affected developers received new corporate laptops last week
- The root certificate in the P2S gateway has not been changed recently
- The other 14 developers use older machines and continue to work
- The P2S gateway configuration has not been modified in the last 30 days
- The gateway SKU is `VpnGw1`, with capacity available for new connections
- Client certificates were exported from the same template used by other developers
What is the root cause of the problem?
A) The SKU VpnGw1 has reached the limit of simultaneous P2S connections, and new connection attempts are rejected with certificate error as a generic failure message.
B) The root certificate configured in the gateway expired recently, but since existing sessions use authentication cache, only new connections are affected.
C) The new laptops do not have the chain root certificate installed in the operating system trusted certificates repository, causing chain validation to fail on the client side.
D) Client certificates were exported without the private key, which is detected only at connection time on machines where the VPN profile is imported for the first time.
Scenario 4: Diagnostic Sequence
An engineer receives a ticket: the ExpressRoute connection between Azure and the on-premises datacenter has status Not Connected in the portal, after working normally for months. No changes were made on the Azure side in the last 48 hours.
The available investigation steps are:
1. Verify that the Virtual Network Gateway of type ExpressRoute has status `Succeeded` in provisioning
2. Contact the connectivity provider to verify if there was an event in the physical circuit or BGP peering on their side
3. Verify the status of the ExpressRoute Circuit in the Azure portal (`Provider status` and `Circuit status` fields)
4. Execute the command `Get-AzVirtualNetworkGatewayConnection` to verify the state of the logical connection between the gateway and the circuit
5. Review change logs in the Azure Activity Log for the last 48 hours to confirm if there was any unrecorded modification
What is the correct investigation sequence?
A) 1 → 3 → 4 → 2 → 5
B) 5 → 1 → 3 → 2 → 4
C) 3 → 1 → 4 → 5 → 2
D) 5 → 3 → 1 → 4 → 2
Answer Key and Explanations
Answer Key: Scenario 1
Answer: A
The central clue is the combination of Active-Active mode and the absence of BGP. In Active-Active mode, the gateway creates two IPSec tunnels, each with a distinct public IP. For traffic to be balanced and forwarded correctly between the two tunnels, the on-premises device needs to support BGP. With static routing, the on-premises device has no mechanism to distribute traffic between the two gateway endpoints, and return packets may be sent to the wrong tunnel or dropped, even with both tunnels appearing as Connected.
The Connected status of the tunnels confirms that the IPSec control plane is functional. This eliminates pre-shared key or IKE negotiation issues.
The information about the SKU VpnGw2 and provisioning time is irrelevant to the diagnosis and was included intentionally. The SKU does not influence routing behavior in Active-Active mode, and 35 minutes of established connection is more than sufficient for convergence.
Alternative B would be valid in another context, but the prefix 10.10.0.0/16 covers the destination 10.10.5.0/24, so it's not the cause. Alternative C reverses the logic: the absence of NSG on the GatewaySubnet is the recommended state, not a problem. Alternative D has no technical foundation.
The most dangerous distractor is B, as it induces the engineer to review the Local Network Gateway unnecessarily, wasting time in production.
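The prefix check that rules out alternative B can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch using the addresses from the scenario; the variable names are illustrative and not part of any Azure tooling.

```python
import ipaddress

# Prefix configured on the Local Network Gateway (from the scenario)
lng_prefix = ipaddress.ip_network("10.10.0.0/16")
# Subnet of the on-premises hosts that must be reachable
onprem_hosts = ipaddress.ip_network("10.10.5.0/24")
# VNet address space, checked for overlap with the on-premises prefix
vnet_space = ipaddress.ip_network("10.20.0.0/16")

# True when every address in the target subnet falls inside the LNG prefix
covers_target = onprem_hosts.subnet_of(lng_prefix)
# True when the on-premises prefix collides with the VNet address space
overlaps_vnet = lng_prefix.overlaps(vnet_space)

print(f"LNG prefix covers on-premises hosts: {covers_target}")
print(f"LNG prefix overlaps VNet space: {overlaps_vnet}")
```

A covering prefix with no overlap is the expected healthy result, which is why reviewing the Local Network Gateway would not resolve this scenario.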
Answer Key: Scenario 2
Answer: B
The cause is identified and the solution is known, but the operational context imposes a critical restriction: there is a maintenance window approved by the CAB for Friday, and the production environment serves critical branches. The SKU upgrade causes connectivity interruption. Executing this type of change outside an approved window, without coordination with branch teams and without prior validation, violates the change management process and exposes the organization to uncontrolled impact.
The ongoing degradation is real, but does not constitute an emergency that justifies bypassing the established process. Alternative A represents the technically correct decision applied at the wrong time, ignoring the process restriction.
Alternative C is technically impossible in the current Virtual Network Gateway model: you cannot have two gateways of the same type in the same VNet simultaneously for in-place migration purposes. This would eliminate this option even without the process restriction.
Alternative D does not correspond to any real Microsoft support capability for this type of operation; SKU upgrades always cause interruption regardless of the execution channel.
Acting on alternative A would be the most costly error: it would cause immediate, unplanned interruption of all 3 branches, without the support of on-premises teams on standby.
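To see why the upgrade to VpnGw2 addresses the saturation, the utilization arithmetic can be sketched as follows. The throughput figures (650 Mbps for VpnGw1, 1000 Mbps for VpnGw2) are assumptions that do not appear in the scenario; they reflect commonly published aggregate benchmarks and should be checked against current Azure documentation.

```python
# Aggregate throughput benchmarks in Mbps. ASSUMED values, not taken from
# the scenario; verify against current Azure SKU documentation.
ASSUMED_SKU_THROUGHPUT_MBPS = {"VpnGw1": 650, "VpnGw2": 1000}


def utilization(load_mbps: float, sku: str) -> float:
    """Return the load as a fraction of the SKU's assumed aggregate throughput."""
    return load_mbps / ASSUMED_SKU_THROUGHPUT_MBPS[sku]


# The scenario reports 95% utilization on VpnGw1 during business hours.
peak_load_mbps = 0.95 * ASSUMED_SKU_THROUGHPUT_MBPS["VpnGw1"]

print(f"Peak load: {peak_load_mbps:.1f} Mbps")
print(f"Utilization on VpnGw1: {utilization(peak_load_mbps, 'VpnGw1'):.1%}")
print(f"Utilization on VpnGw2: {utilization(peak_load_mbps, 'VpnGw2'):.1%}")
```

Under these assumed figures, the same peak load would occupy roughly three fifths of the larger SKU, leaving headroom. The arithmetic supports the upgrade itself, not its timing, which is still governed by the approved window.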
Answer Key: Scenario 3
Answer: C
The determining clue is the pattern of affected users: only developers with new laptops cannot connect, while the others, with older machines, work normally. The error message "The certificate chain was issued by an authority that is not trusted" is specific: it indicates that the client operating system does not recognize the root certificate authority, not that the client certificate is invalid.
In new corporate laptops, the trusted certificates repository may not have been populated with the company's internal root certificate, which is the same one used to issue P2S client certificates. The old laptops already had this root certificate installed, possibly via Group Policy or previous manual configuration.
The information about SKU and available capacity is irrelevant and was included intentionally. The certificate error is not a generic capacity rejection message; the SKU VpnGw1 supports up to 250 simultaneous P2S connections, and 17 connections are far from this limit.
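The capacity argument is simple arithmetic, sketched here with the numbers stated in the scenario and in the paragraph above. The constant and variable names are illustrative.

```python
# P2S connection limit for VpnGw1 as stated in the explanation above
VPNGW1_P2S_LIMIT = 250

working_developers = 14   # still connecting normally
affected_developers = 3   # received new laptops last week

# Worst case: every developer connected at the same time
total_clients = working_developers + affected_developers
headroom = VPNGW1_P2S_LIMIT - total_clients

print(f"Clients: {total_clients}, limit: {VPNGW1_P2S_LIMIT}, headroom: {headroom}")
```

Even with all developers connected at once, the gateway stays far below its limit, so capacity can be ruled out before looking at certificates.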
Alternative B is plausible but contradicts the facts: if the root certificate in the gateway had expired, no client would be able to connect, not just the new ones. Alternative D would be detected before connection, during profile import, and would affect any new machine regardless of the trust repository.
The most dangerous distractor is D, as it may lead the team to unnecessarily revoke and reissue certificates.
Answer Key: Scenario 4
Answer: D
The correct sequence is 5 → 3 → 1 → 4 → 2. The reasoning follows progressive elimination from simplest to most complex, starting with what can be verified without external dependency.
Step 5 (Activity Log) comes first because the ticket states that no changes were made in Azure. Confirming or refuting this is the starting point: if there is a recorded change, the diagnostic path changes completely.
Step 3 (ExpressRoute Circuit status) comes next because the circuit is the highest-level resource in the chain. If the Circuit status or Provider status indicates a problem, the cause is in the circuit or provider, and subsequent steps lose immediate relevance.
Step 1 (gateway status) verifies that the Azure resource is healthy before inspecting the logical connection that depends on it.
Step 4 (Get-AzVirtualNetworkGatewayConnection) inspects the logical layer of the connection between gateway and circuit, useful after confirming that both underlying resources are provisioned correctly.
Step 2 (provider contact) comes last because it depends on an external agent and should only be triggered after exhausting internal verifications, avoiding unnecessary handoffs that delay diagnosis.
Alternative A errs by starting with the gateway before checking the circuit and Activity Log. Alternative B starts with Activity Log correctly, but places provider contact before logical connection inspection. Alternative C starts with the circuit without checking Activity Log, missing the change confirmation step.
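The ordering logic can be sketched as a list of checks that run in sequence and stop at the first one that isolates a cause. This is an illustrative model only: the check functions are hypothetical stubs, and the finding text is invented for the example; real checks would query the portal, PowerShell, or the provider.

```python
from typing import Callable, Optional

# A check returns a finding (str) when it isolates a cause, otherwise None.
Check = Callable[[], Optional[str]]


def run_in_order(steps: list[tuple[int, str, Check]]) -> list[str]:
    """Run checks in the given order and stop at the first finding."""
    log = []
    for number, name, check in steps:
        finding = check()
        log.append(f"Step {number} ({name}): {finding or 'no issue found'}")
        if finding:
            break  # later steps lose immediate relevance
    return log


# Hypothetical stubs standing in for real verifications
def nothing_found() -> Optional[str]:
    return None


def circuit_reports_problem() -> Optional[str]:
    return "Provider status is not Provisioned"  # invented example finding


# Internal checks first (5, 3, 1, 4); the external dependency (2) comes last
sequence = [
    (5, "Activity Log review", nothing_found),
    (3, "ExpressRoute circuit status", circuit_reports_problem),
    (1, "Gateway provisioning state", nothing_found),
    (4, "Get-AzVirtualNetworkGatewayConnection", nothing_found),
    (2, "Contact connectivity provider", nothing_found),
]

log = run_in_order(sequence)
for line in log:
    print(line)
```

In this run the circuit check reports a problem at the second position, so the gateway and logical connection checks are never executed.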
Troubleshooting Tree: Create and configure a virtual network gateway
Color legend:
| Color | Node type |
|---|---|
| Dark blue (navy) | Initial symptom, entry point |
| Blue | Diagnostic question, decision point |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |
To use this tree when facing a real problem, start with the root node identifying the type of failure observed (S2S, ExpressRoute, P2S or performance degradation) and follow the branches answering each question based on what is directly verifiable in the portal, PowerShell commands or logs. Each bifurcation eliminates a class of cause and directs the diagnosis to the next level. Red nodes indicate that the cause has been isolated; green ones indicate the corresponding corrective action. Never skip an intermediate verification step (orange) before advancing to corrective action.
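The traversal rules above can be modeled with a small data structure. The sketch below is hypothetical: the tree content is a single invented branch loosely based on Scenario 3, not a reproduction of the full tree, and the `Node`, `walk`, and `skips_verification` names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str   # "symptom", "question", "verification", "cause", or "action"
    label: str
    children: dict[str, "Node"] = field(default_factory=dict)  # answer -> next node


def walk(root: Node, answers: list[str]) -> list[Node]:
    """Follow one answer per level and return the visited nodes in order."""
    path = [root]
    node = root
    for answer in answers:
        node = node.children[answer]
        path.append(node)
    return path


def skips_verification(path: list[Node]) -> bool:
    """True if the path reaches an action without a prior verification node."""
    kinds = [n.kind for n in path]
    if "action" not in kinds:
        return False
    return "verification" not in kinds[: kinds.index("action")]


# One invented branch: symptom -> question -> verification -> cause -> action
action = Node("action", "Install the root certificate on the client")
cause = Node("cause", "Root certificate not trusted on the client", {"next": action})
verify = Node("verification", "Inspect the client trust store", {"missing": cause})
question = Node("question", "Are only some clients affected?", {"yes": verify})
tree = Node("symptom", "P2S client cannot connect", {"yes": question})

path = walk(tree, ["yes", "yes", "missing", "next"])
for n in path:
    print(f"[{n.kind}] {n.label}")
print("Skips verification:", skips_verification(path))
```

Encoding the node type explicitly makes the rule about not skipping verification nodes checkable rather than a convention.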