Troubleshooting Lab: Select an appropriate virtual network gateway SKU for site-to-site VPN requirements
Diagnostic Scenarios
Scenario 1: Root Cause
A company's infrastructure team reports that a site-to-site VPN connection between the São Paulo branch and Azure was successfully established three weeks ago. The gateway is on VpnGw1 SKU with VPN Type RouteBased. In the last week, the company acquired two new partners and tried to add site-to-site connections for each of them. Both attempts failed with the same error.
The responsible engineer checks the current gateway state:
```shell
$ az network vnet-gateway show \
    --name gw-hub-prod \
    --resource-group rg-networking \
    --query "{SKU:sku.name, VpnType:vpnType, ActiveActive:activeActive, Connections:ipConfigurations}" \
    -o json
{
  "SKU": "VpnGw1",
  "VpnType": "RouteBased",
  "ActiveActive": false,
  "Connections": [
    { "id": "/subscriptions/.../connections/conn-matriz" },
    { "id": "/subscriptions/.../connections/conn-parceiro1" },
    { "id": "/subscriptions/.../connections/conn-parceiro2" }
  ]
}
```
The engineer observes that the partner connections appear as created in the Azure portal, but both show Not Connected status. The original connection with the headquarters remains stable. The gateway was moved from the Basic SKU to VpnGw1 two months ago. The region used is Brazil South, which supports all SKUs in the VpnGw family.
What is the root cause of the problem?
A) The VpnGw1 SKU doesn't support more than one simultaneous connection; it's necessary to migrate to VpnGw2.
B) The site-to-site tunnel limit of the VpnGw1 SKU has been reached, as it supports a maximum of 30 tunnels, and the third connection exceeded the instance's simultaneous processing capacity.
C) The connections with the partners have Not Connected status because the corresponding Local Network Gateways were created with incorrect or missing on-premises gateway IP addresses.
D) The ActiveActive: false configuration prevents the gateway from accepting more than one simultaneous connection in environments with multiple partners.
Scenario 2: Action Decision
The cause of the problem has been identified: the production VPN Gateway is on Basic SKU with VPN Type PolicyBased, and the company needs to enable BGP to support the dynamic routing required by a new strategic partner. The approved maintenance window is 4 hours, scheduled for tonight.
The environment has the following constraints:
- The gateway is in production and currently supports an active site-to-site connection with headquarters
- The connection with headquarters uses static routes and cannot be interrupted for more than 30 minutes
- The team has Contributor permissions on the network resource group
- There is no backup gateway available
What is the correct action to take within the maintenance window?
A) Change the VPN Type of the existing gateway from PolicyBased to RouteBased directly in the Azure portal, without deleting the resource, and then enable BGP in advanced settings.
B) Delete the current gateway, provision a new gateway with VpnGw1 SKU and RouteBased VPN Type, recreate the connection with headquarters using static routes, and configure the new connection with BGP for the partner.
C) Provision a second gateway in the same VNet with VpnGw1 SKU and RouteBased VPN Type in parallel, migrate the headquarters connection to the new gateway, and then delete the Basic gateway.
D) Resize the gateway SKU from Basic to VpnGw1 using the az network vnet-gateway update command without deleting the resource, which preserves existing connections and allows enabling BGP afterwards.
Scenario 3: Root Cause
A network administrator reports that the throughput of the site-to-site VPN connection between the on-premises datacenter and Azure is consistently limited to approximately 100 Mbps, even during low-usage hours. The gateway was provisioned six months ago by the previous team.
The administrator collects the following information:
```shell
$ az network vnet-gateway show \
    --name gw-corp-eastus \
    --resource-group rg-network-prod \
    --query "{SKU:sku.name, Tier:sku.tier, VpnType:vpnType, Generation:vpnGatewayGeneration}" \
    -o json
{
  "SKU": "VpnGw1",
  "Tier": "VpnGw1",
  "VpnType": "RouteBased",
  "Generation": "Generation1"
}
```
Latency tests between the datacenter and Azure show normal values (12 ms). The MPLS link contracted by the datacenter has a capacity of 500 Mbps and is operating at 40% utilization. The ISP confirmed that there is no throttling applied to outbound traffic. The connection uses IKEv2 and the tunnel is in Connected state.
What is the root cause of the throughput limitation?
A) The IKEv2 protocol has higher encryption overhead than IKEv1, reducing the tunnel's effective throughput to approximately 100 Mbps on the VpnGw1 SKU.
B) The VpnGw1 SKU in Generation1 has a maximum aggregate throughput of 650 Mbps, but this value is distributed among all active tunnels; with multiple tunnels, the available bandwidth per tunnel may be lower than expected.
C) The maximum throughput of a single site-to-site tunnel on the VpnGw1 SKU, regardless of generation, is limited to 100 Mbps per tunnel; the 650 Mbps limit is the gateway's total aggregate.
D) The 500 Mbps MPLS link is the real bottleneck, as 40% utilization of a 500 Mbps link represents 200 Mbps consumed, leaving only 100 Mbps available for the VPN tunnel.
Scenario 4: Diagnostic Sequence
An engineer receives the following report: "The VPN Gateway was migrated from VpnGw2 to VpnGw3 SKU yesterday afternoon to support more tunnels. Since then, three branches report that connectivity to Azure is intermittent, dropping and reconnecting every 5 to 10 minutes."
The available investigation steps are:
1. Check the gateway diagnostic logs in Log Analytics to identify IKE renegotiation events or authentication errors after the resize.
2. Confirm that the SKU resize was completed successfully and that the gateway is in Succeeded state in Azure.
3. Verify if the affected connections have BGP enabled and if the BGP peers have Connected status on the resized gateway.
4. Compare the Dead Peer Detection (DPD) configurations and IKE timers between the on-premises devices of the branches and the new gateway SKU.
5. Check if the gateway's public IP address was changed during the resize and if the on-premises devices of the branches were updated with the new IP.
What is the correct investigation sequence?
A) 2, 5, 1, 3, 4
B) 1, 2, 4, 5, 3
C) 3, 1, 5, 2, 4
D) 2, 1, 5, 4, 3
Answer Key and Explanations
Answer Key: Scenario 1
Answer: C
The decisive clue is in the engineer's observation: the partner connections exist in Azure (they appear in the portal and in the command output) but show Not Connected status, while the headquarters connection on the same gateway remains stable. This means the problem is not with the SKU capacity or the gateway configuration itself, but with tunnel negotiation with the remote endpoint. The Not Connected status indicates that the IKE handshake was not completed, which typically happens when the Local Network Gateway associated with each connection contains incorrect information about the partner's on-premises gateway, especially the public IP address.
Distractor A is incorrect: VpnGw1 supports up to 30 site-to-site tunnels, and only three connections were created. Distractor B cites that same 30-tunnel limit but then claims the third connection exceeded it, applying saturation logic where there is no evidence of overload. Distractor D is the most dangerous: ActiveActive: false only means the gateway runs in active-standby mode rather than with two active instances and two public IP addresses; it does not limit the number of supported connections. Acting on distractor D would lead to unnecessary gateway reconfiguration without solving the real problem.
The information about the gateway being updated from Basic to VpnGw1 two months ago is irrelevant to the diagnosis: it's a historical detail that has no relation to the status of the new connections.
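The two checks behind this diagnosis (capacity first, then the remote endpoint definition) can be sketched with the Azure CLI. This is a minimal sketch under stated assumptions, not the lab's official procedure: the function names are hypothetical, the Local Network Gateway name is a placeholder because the scenario does not give one, and the tunnel limits are taken from Microsoft's published SKU table and should be verified against the current documentation.

```shell
#!/bin/sh
# Hypothetical helper: maximum site-to-site tunnels per gateway SKU.
# Values are assumed from Microsoft's published SKU table; verify before use.
sku_s2s_tunnel_limit() {
  case "$1" in
    Basic) echo 10 ;;
    VpnGw1|VpnGw2|VpnGw3) echo 30 ;;
    VpnGw4|VpnGw5) echo 100 ;;
    *) echo "unknown SKU: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeeds when the tunnel count fits within the SKU limit.
# $1 = SKU name, $2 = number of site-to-site connections
within_tunnel_limit() {
  limit=$(sku_s2s_tunnel_limit "$1") || return 2
  [ "$2" -le "$limit" ]
}

# Wraps real az commands; requires an authenticated Azure CLI session.
# Prints the connection status, then the on-premises IP the connection targets.
# $1 = resource group, $2 = connection name, $3 = Local Network Gateway name
check_partner_connection() {
  az network vpn-connection show --resource-group "$1" --name "$2" \
    --query connectionStatus -o tsv
  az network local-gateway show --resource-group "$1" --name "$3" \
    --query gatewayIpAddress -o tsv
}
```

With the scenario's values, `within_tunnel_limit VpnGw1 3` succeeds, which rules out capacity. Calling `check_partner_connection rg-networking conn-parceiro1 <local-gateway-name>` would then show whether the address stored in the Local Network Gateway matches the partner's actual public IP.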
Answer Key: Scenario 2
Answer: B
The VPN Type of a gateway cannot be changed after provisioning, which directly eliminates alternative A. The fundamental technical constraint is that migrating from PolicyBased to RouteBased requires deleting and recreating the gateway. Alternative B is therefore the only option that reaches the goal: provision a new VpnGw1 RouteBased gateway and recreate the connection with headquarters within the 4-hour window. Because creating a gateway can take 45 minutes or more, the Local Network Gateway, shared key, and connection parameters should be prepared in advance so that the headquarters outage stays as close as possible to the 30-minute tolerance.
Alternative C seems attractive but is technically unfeasible: a VNet can contain only one VPN gateway at a time. Alternative D is the most dangerous distractor. Microsoft's documentation has historically required deleting and recreating a Basic gateway to move to a VpnGw SKU, and even where a SKU change with az network vnet-gateway update succeeds, it never changes the VPN Type. The gateway would remain PolicyBased, making it impossible to enable BGP. Attempting this action and only then discovering that the VPN Type did not change consumes critical time from the maintenance window.
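The eligibility rule and the delete-and-recreate path can be sketched as follows. This is an illustrative sketch, not a tested runbook: the function names are hypothetical, and the VNet name, public IP name, and ASN are placeholders the scenario does not provide. It assumes that existing connections have already been removed (a gateway with connections cannot be deleted) and that the public IP resource already exists and is compatible with the target SKU.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when BGP can be enabled on the gateway.
# BGP needs a RouteBased gateway, and the Basic SKU does not support it.
# $1 = SKU name, $2 = VPN type
supports_bgp() {
  [ "$2" = "RouteBased" ] && [ "$1" != "Basic" ]
}

# Wraps real az commands; requires an authenticated Azure CLI session.
# Assumes the connections on the old gateway were deleted beforehand.
# $1 = resource group, $2 = gateway name, $3 = VNet name (placeholder),
# $4 = public IP name (placeholder), $5 = local ASN for BGP (placeholder)
recreate_gateway_routebased() {
  az network vnet-gateway delete --resource-group "$1" --name "$2"
  az network vnet-gateway create --resource-group "$1" --name "$2" \
    --vnet "$3" --public-ip-addresses "$4" \
    --gateway-type Vpn --vpn-type RouteBased --sku VpnGw1 --asn "$5"
}
```

`supports_bgp Basic PolicyBased` fails, which matches the starting state in the scenario, while `supports_bgp VpnGw1 RouteBased` succeeds for the target state. After the gateway is recreated, the headquarters connection is created again with static routes and the partner connection with BGP enabled.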
Answer Key: Scenario 3
Answer: C
The scenario turns on the per-tunnel ceiling: on the VpnGw1 SKU, a single site-to-site tunnel is limited to roughly 100 Mbps. The frequently quoted 650 Mbps figure is the gateway's aggregate throughput, reached only when several tunnels together add up to that capacity. For a single site-to-site tunnel, the ceiling is significantly lower. This distinction between per-tunnel throughput and aggregate throughput is a common point of confusion in gateway sizing.
Distractor A is factually incorrect: IKEv2 does not reduce throughput to 100 Mbps by design; encryption overhead exists but is not the limiting factor in this context. Distractor B correctly describes how aggregate bandwidth is shared among tunnels but does not identify the per-tunnel limit as the root cause. Distractor D is the most dangerous, as it borrows real numbers from the scenario (40% of 500 Mbps = 200 Mbps) to build a convincing narrative that blames the MPLS link. Its arithmetic does not hold: with 200 Mbps in use, about 300 Mbps of the link remains free, so the link cannot explain a 100 Mbps ceiling, and the SKU limitation is ignored entirely. The 12 ms latency and the Connected tunnel state are deliberately irrelevant: they confirm that connectivity exists but say nothing about throughput.
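The arithmetic that rules out distractor D can be checked in a few lines of shell. This is a small sketch with hypothetical function names; it uses integer Mbps values and the figures given in the scenario.

```shell
#!/bin/sh
# Hypothetical helper: unused capacity of a link, in Mbps.
# $1 = link capacity in Mbps, $2 = current utilization in percent
link_headroom_mbps() {
  echo $(( $1 * (100 - $2) / 100 ))
}

# Hypothetical helper: succeeds when the observed throughput has reached the
# link's free capacity, meaning the link itself could be the bottleneck.
# $1 = observed throughput in Mbps, $2 = link capacity, $3 = utilization in percent
link_is_bottleneck() {
  headroom=$(link_headroom_mbps "$2" "$3")
  [ "$1" -ge "$headroom" ]
}
```

For the scenario, `link_headroom_mbps 500 40` prints 300, and `link_is_bottleneck 100 500 40` fails: the tunnel stalls at 100 Mbps while 300 Mbps of the link is still free, so the limit has to come from the gateway side.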
Answer Key: Scenario 4
Answer: A
The correct sequence is 2, 5, 1, 3, 4.
The mandatory starting point is confirming whether the resize itself was completed successfully (step 2): a gateway in partial provisioning state or with failure can exhibit unpredictable behavior. Next, checking if the gateway's public IP address was changed (step 5) is critical, as SKU resizes can, in some scenarios, result in a new public IP. If on-premises devices still point to the old IP, tunnels will drop cyclically when trying to renegotiate. Then, examining diagnostic logs (step 1) provides concrete evidence about what's happening during the drops. BGP verification (step 3) comes next, as it's specific to connections using dynamic routing. Lastly, DPD timer analysis (step 4) is the most granular level of investigation, applicable when more likely causes have already been ruled out.
Sequence B starts with the logs before confirming that the gateway is even operational, which inverts the triage logic. Sequence C starts with BGP, a check specific to dynamically routed connections and not the right starting point when the failure is generalized and affects multiple branches. Sequence D is the second most plausible but defers the public IP check until after the logs, postponing the most common and most quickly verifiable cause in post-resize scenarios, and it places DPD analysis ahead of the BGP check.
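The investigation order can be sketched as a set of small Azure CLI wrappers, one per step. This is a sketch under stated assumptions: the function names are hypothetical, the public IP resource name is a placeholder the scenario does not give, and the log review (step 1) is left to Log Analytics rather than scripted.

```shell
#!/bin/sh
# Order of the investigation steps, using the numbering from the scenario.
triage_order() { echo "2 5 1 3 4"; }

# Hypothetical helper: succeeds when the address configured on the branch
# devices differs from the gateway's current public IP.
# $1 = IP configured on-premises, $2 = IP currently assigned in Azure
public_ip_changed() {
  [ "$1" != "$2" ]
}

# The functions below wrap real az commands and need an authenticated session.

# Step 2. $1 = resource group, $2 = gateway name
step2_provisioning_state() {
  az network vnet-gateway show --resource-group "$1" --name "$2" \
    --query provisioningState -o tsv
}

# Step 5. $1 = resource group, $2 = public IP resource name (placeholder)
step5_current_public_ip() {
  az network public-ip show --resource-group "$1" --name "$2" \
    --query ipAddress -o tsv
}

# Step 1 (diagnostic logs) is reviewed in Log Analytics and is not scripted here.

# Step 3. $1 = resource group, $2 = gateway name
step3_bgp_peers() {
  az network vnet-gateway list-bgp-peer-status \
    --resource-group "$1" --name "$2" -o table
}

# Step 4. $1 = resource group, $2 = connection name
step4_dpd_timeout() {
  az network vpn-connection show --resource-group "$1" --name "$2" \
    --query dpdTimeoutSeconds -o tsv
}
```

A typical use is to capture the output of `step5_current_public_ip` and compare it with the address each branch device targets, for example `public_ip_changed 203.0.113.10 "$current_ip"` (the address shown is a documentation-range placeholder). Only if that check and the logs come back clean do the BGP and DPD steps become worth the time.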
Troubleshooting Tree: Select an appropriate virtual network gateway SKU for site-to-site VPN requirements
Legend:
| Color | Meaning |
|---|---|
| Dark blue | Initial symptom or entry point |
| Blue | Objective diagnostic question |
| Red | Identified cause requiring recreation or provisioning |
| Green | Corrective action or applicable resolution |
| Orange | Validation or intermediate verification before deciding |
To use this tree when facing a real problem, start with the root node describing the general symptom and follow the branches by answering each question with what you observe in the environment. The blue questions are verifiable directly in the Azure portal or via CLI. When you reach a red node, the cause requires structural replanning; when you reach a green node, the action can be executed within normal operational scope. Orange nodes indicate that more evidence needs to be collected before deciding which path to follow.