Troubleshooting Lab: Choose an appropriate scale unit for each gateway type

Diagnostic Scenarios

Scenario 1: Root Cause

An operations team reports that the aggregate throughput of a production VPN Gateway never exceeds 1 Gbps, even during peak demand hours, when expected traffic is 3 Gbps. The gateway was deployed six months ago and has never been resized. The environment operates in active-active mode with BGP enabled. The team has already verified that the internet links from the on-premises branches have full capacity available and that there is no packet loss on the physical connections. The Azure portal shows the gateway with status Running, and all site-to-site connections appear as Connected.

The diagnostic command output returns:

GatewayName       : vpngw-prod-001
Sku               : VpnGw2
ActiveActive      : True
BgpEnabled        : True
ConnectionsCount  : 8
ProvisioningState : Succeeded

What is the root cause of the observed problem?

A) Active-active mode is consuming half of the gateway's available capacity, limiting the effective throughput to 50% of nominal
B) The VpnGw2 SKU has a throughput ceiling of 1 Gbps, insufficient for the 3 Gbps requirement
C) The number of simultaneous site-to-site connections is close to the SKU limit, causing performance degradation
D) BGP enabled in active-active mode introduces routing overhead that reduces the throughput available for data


Scenario 2: Action Decision

The network team has identified that an ExpressRoute Gateway with SKU ErGw1AZ needs to be upgraded to ErGw3AZ to enable the FastPath feature. The gateway is in production, with three active ExpressRoute circuits carrying critical database traffic. Resizing an ExpressRoute gateway in Azure interrupts connectivity for the duration of the operation, which can last between 20 and 30 minutes. The maintenance window approved by the business area starts in 4 hours. The team has Owner permission on the subscription.

The cause of the problem has already been confirmed: the current SKU does not support FastPath.

What is the correct action to take at this time?

A) Start the resize immediately, as the cause is confirmed and Owner permission allows execution without a formal window
B) Create a new ErGw3AZ gateway in parallel, migrate connections and remove the old one before the maintenance window
C) Wait for the start of the approved maintenance window and execute the gateway resize within the authorized period
D) Open a support ticket for Microsoft to execute the resize without interruption using live migration


Scenario 3: Root Cause

An architect configured a Virtual WAN hub with a VPN gateway scaled to 4 scale units to meet a throughput requirement of 2 Gbps and to support 60 branches connected via site-to-site VPN. After three weeks in production, the monitoring team detects that several branch tunnels are intermittently dropping and reconnecting, while measured throughput is below 500 Mbps. The portal shows that the gateway is healthy and that the total number of active connections is 58. The team verified that the CPE devices in the branches have up-to-date firmware and that IKE parameters are correct on all sides.

The gateway logs repeatedly show:

[WARN] Tunnel renegotiation triggered: peer=branch-site-43
[WARN] Tunnel renegotiation triggered: peer=branch-site-17
[INFO] Gateway health: Healthy
[INFO] Scale units configured: 4
[INFO] Active tunnels: 58
[INFO] BGP sessions: 58/58 established

The architect suspects that the number of scale units is incorrect for the tunnel volume. A colleague suggests that the problem might be in the BGP configuration of the branches. Another team member points out that the hub region was recently changed from East US to Brazil South.

What is the most likely root cause for the intermittent tunnel drops and low throughput?

A) The hub region migration from East US to Brazil South introduced configuration inconsistencies in existing tunnels
B) 4 scale units are insufficient for 58 simultaneous tunnels, as the tunnel limit per scale unit has been exceeded
C) BGP sessions are competing with data traffic for gateway processing resources, causing instability
D) Throughput below 500 Mbps indicates that only 1 scale unit is effectively active, suggesting partial provisioning failure


Scenario 4: Diagnostic Sequence

An engineer receives a call reporting that a VPN Gateway in a Virtual WAN hub-and-spoke architecture is delivering throughput much lower than expected. The engineer has the following investigation steps available:

P: Check tunnel logs to identify frequent renegotiations or drops
Q: Confirm the SKU and number of scale units configured on the gateway
R: Measure actual throughput using tools like iPerf between source and destination
S: Verify if the simultaneous tunnel limit of the current configuration has been reached
T: Compare the nominal throughput of the current SKU with the documented environment requirement

What is the correct investigation sequence for this symptom?

A) P, Q, R, S, T
B) R, Q, T, S, P
C) Q, T, R, S, P
D) S, R, Q, T, P


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The most relevant data point in the scenario is the command output, which reveals the VpnGw2 SKU. This SKU has an aggregate throughput ceiling of 1 Gbps, which precisely explains the observed behavior: traffic never exceeds this value, regardless of actual demand. The symptom is exactly what is expected from a gateway that is correctly provisioned but undersized for the requirement.
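The comparison behind this answer can be sketched in a few lines of Python. This is a minimal illustration, not an Azure API call: the table holds only the VpnGw2 ceiling stated in the scenario, and the function names and the 5% tolerance are hypothetical choices made for the example.

```python
# Aggregate throughput ceilings in Gbps. Only the VpnGw2 value is taken from
# the scenario; any other SKU should be filled in from current Azure
# documentation before real use.
SKU_CEILING_GBPS = {
    "VpnGw2": 1.0,
}


def is_undersized(sku: str, required_gbps: float) -> bool:
    """Return True when the SKU ceiling is below the required throughput."""
    return SKU_CEILING_GBPS[sku] < required_gbps


def looks_capped(observed_peak_gbps: float, sku: str, tolerance: float = 0.05) -> bool:
    """Return True when the observed peak sits at the SKU ceiling.

    The tolerance (a hypothetical 5% by default) absorbs measurement noise.
    """
    ceiling = SKU_CEILING_GBPS[sku]
    return abs(observed_peak_gbps - ceiling) <= tolerance * ceiling


print(is_undersized("VpnGw2", 3.0))  # True: 1 Gbps ceiling vs 3 Gbps requirement
print(looks_capped(1.0, "VpnGw2"))   # True: the peak matches the ceiling
```

A peak that sits exactly at the ceiling while the requirement is higher is the signature of an undersized SKU rather than of a fault.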

The details about branch internet links and the absence of packet loss are deliberately included as distractions. They indicate that the problem is not in the physical layer or external connectivity, but they do not help identify the root cause.

Distractor A represents a common misconception: active-active mode does not reduce available throughput; it distributes connections between two instances and increases resilience without penalizing nominal capacity. Distractor C attributes the degradation to the number of connections, which is not the documented behavior for 8 connections on a VpnGw2 SKU. Distractor D is incorrect because BGP overhead is negligible relative to data throughput. The most dangerous distractor is A, as it would lead the team to reconfigure the operation mode without solving the real problem.


Answer Key: Scenario 2

Answer: C

The cause has already been confirmed and the technical solution is clear. The only determining factor in this scenario is the operational constraint: the approved maintenance window starts in 4 hours. Executing the resize before this window would violate the approved process and interrupt critical traffic outside the authorized period.
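The decision rule reduces to a time check, sketched below. This is an illustration only: the function name, the example timestamps, and the two-hour window length are hypothetical, while the 4-hour lead time and the 30-minute upper bound for the interruption come from the scenario.

```python
from datetime import datetime, timedelta


def may_start_change(now: datetime, window_start: datetime,
                     window_end: datetime, expected_outage: timedelta) -> bool:
    """Allow the change only if it starts and finishes inside the approved window."""
    return window_start <= now and now + expected_outage <= window_end


now = datetime(2024, 1, 10, 18, 0)              # hypothetical current time
window_start = now + timedelta(hours=4)         # window opens in 4 hours (scenario)
window_end = window_start + timedelta(hours=2)  # hypothetical window length
outage = timedelta(minutes=30)                  # upper bound from the scenario

print(may_start_change(now, window_start, window_end, outage))           # False
print(may_start_change(window_start, window_start, window_end, outage))  # True
```

Note that holding Owner permission does not appear anywhere in the check: permission determines who can run the resize, not when it is authorized.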

Distractor A is technically possible, as the permission exists, but it ignores the process constraint and would cause unauthorized impact in production. Distractor B describes an approach that is valid for zero-downtime migrations in other contexts. However, ExpressRoute gateways do not support parallel operation with live connection migration without interruption, and creating a parallel gateway within 4 hours with three active circuits is not safely feasible. Distractor D is incorrect because Azure does not offer an interruption-free live resize for ExpressRoute gateways through support; this capability does not exist for this resource type. The most dangerous distractor is A, as the logic of "confirmed cause plus permission equals immediate action" is tempting and ignores change control.


Answer Key: Scenario 3

Answer: B

In Virtual WAN, each VPN gateway scale unit supports a maximum number of tunnels. With 4 scale units, the documented limit is 200 tunnels (50 per scale unit), so 58 active connections do not reach this limit in absolute terms. However, intermittent drops combined with throughput below 500 Mbps point to pressure on per-tunnel processing resources, not necessarily to the total tunnel count.

More precisely, 4 scale units deliver 2 Gbps of throughput, yet measured throughput is below 500 Mbps with 58 active tunnels. This indicates that load is unevenly distributed between units or that there is a per-tunnel processing bottleneck. The root cause identifiable from the available data is that the ratio of scale units to tunnels is generating instability.
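The arithmetic in this explanation can be reproduced with a short sketch. The per-scale-unit figures are the ones used in this lab (50 tunnels per unit, and 500 Mbps per unit as implied by 4 units delivering 2 Gbps); they are treated here as assumptions and should be checked against current Azure documentation. The function name is hypothetical.

```python
MBPS_PER_SCALE_UNIT = 500     # implied by the scenario: 4 units for 2 Gbps
TUNNELS_PER_SCALE_UNIT = 50   # figure quoted in this explanation


def capacity(scale_units: int) -> dict:
    """Return the nominal throughput and tunnel ceiling for a gateway."""
    return {
        "throughput_mbps": scale_units * MBPS_PER_SCALE_UNIT,
        "max_tunnels": scale_units * TUNNELS_PER_SCALE_UNIT,
    }


cap = capacity(4)
active_tunnels = 58
measured_mbps = 500  # upper bound of what the monitoring team measured

print(cap)  # {'throughput_mbps': 2000, 'max_tunnels': 200}
print(f"tunnel utilization: {active_tunnels / cap['max_tunnels']:.0%}")        # 29%
print(f"throughput utilization: {measured_mbps / cap['throughput_mbps']:.0%}")  # 25%
```

Both ratios are well below 100%, which is why the explanation looks at per-tunnel processing rather than at the absolute limits.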

The region change and the BGP suggestion are deliberately included as irrelevant details. Changing the region of a Virtual WAN hub does not silently affect existing tunnels in isolation, and the logs confirm that all 58 BGP sessions are established. Distractor A leads the reader to investigate the region change, which is the most visible irrelevant detail. Distractor C uses plausible logic but has no documented technical support for this specific behavior. Distractor D describes a provisioning failure that the portal would have flagged as an error, not as Healthy. The most dangerous distractor is A, as it represents a recent and visible change that tends to attract undue diagnostic attention.
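The evidence read from the logs can be extracted mechanically. The sketch below assumes log lines in exactly the format shown in Scenario 3; the regular expressions and the function name are hypothetical and would need adjusting for real gateway diagnostics.

```python
import re
from collections import Counter

# Log excerpt copied from Scenario 3.
LOG = """\
[WARN] Tunnel renegotiation triggered: peer=branch-site-43
[WARN] Tunnel renegotiation triggered: peer=branch-site-17
[INFO] Gateway health: Healthy
[INFO] Scale units configured: 4
[INFO] Active tunnels: 58
[INFO] BGP sessions: 58/58 established
"""

RENEGOTIATION = re.compile(r"Tunnel renegotiation triggered: peer=(\S+)")
BGP_SESSIONS = re.compile(r"BGP sessions: (\d+)/(\d+) established")


def summarize(log_text: str):
    """Count renegotiations per peer and read the BGP session ratio."""
    renegotiations = Counter()
    bgp = None
    for line in log_text.splitlines():
        match = RENEGOTIATION.search(line)
        if match:
            renegotiations[match.group(1)] += 1
            continue
        match = BGP_SESSIONS.search(line)
        if match:
            bgp = (int(match.group(1)), int(match.group(2)))
    return renegotiations, bgp


renegotiations, bgp = summarize(LOG)
print(dict(renegotiations))  # {'branch-site-43': 1, 'branch-site-17': 1}
print(bgp)                   # (58, 58)
```

With all BGP sessions established, the BGP hypothesis loses support, while the per-peer renegotiation counts show which branches to examine first.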


Answer Key: Scenario 4

Answer: B

The correct sequence is R, Q, T, S, P because it reflects progressive diagnostic reasoning for a throughput problem:

R comes first because it confirms and quantifies the symptom with objective data, separating perception from measured reality. Without measuring actual throughput, any subsequent hypothesis is speculation.

Q comes next to identify what is provisioned, as the SKU and scale units define the theoretical ceiling of the environment.

T compares the theoretical ceiling with the documented requirement, determining if the current sizing is physically capable of meeting the requirement before investigating operational causes.

S verifies if the simultaneous tunnel limit has been reached, which could explain degradation even with adequate SKU.

P comes last because tunnel logs are useful for confirming instabilities, but are not the first step when the symptom is low throughput without explicitly reported drops in the call.

Sequence A (P, Q, R, S, T) starts with logs before measuring the actual problem, which inverts the priority. Sequence C (Q, T, R, S, P) is close to correct, but it checks the SKU before measuring actual throughput, which can lead to premature conclusions about resizing. Sequence D starts with tunnel limits without any measured baseline data.
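The ordering R, Q, T, S, P can be expressed as a small checklist function. This is a sketch under stated assumptions: the `Observation` fields and the `diagnose` function are hypothetical, and the per-scale-unit defaults (500 Mbps, 50 tunnels) are the figures used in Scenario 3 of this lab, not values read from Azure.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    measured_mbps: float   # R: result of an iPerf-style measurement
    scale_units: int       # Q: what is provisioned
    required_mbps: float   # T: documented requirement
    active_tunnels: int    # S: current tunnel count
    renegotiations: int    # P: renegotiation events found in tunnel logs


def diagnose(obs: Observation, mbps_per_unit: int = 500,
             tunnels_per_unit: int = 50) -> list:
    """Run the checks in the order R, Q, T, S, P and return the findings."""
    findings = [f"R: measured {obs.measured_mbps:.0f} Mbps"]

    ceiling = obs.scale_units * mbps_per_unit
    findings.append(f"Q: {obs.scale_units} scale units, ceiling {ceiling} Mbps")

    if ceiling < obs.required_mbps:
        findings.append("T: sizing is below the requirement")
    else:
        findings.append("T: sizing meets the requirement")

    max_tunnels = obs.scale_units * tunnels_per_unit
    if obs.active_tunnels >= max_tunnels:
        findings.append("S: tunnel limit reached")
    else:
        findings.append(f"S: {obs.active_tunnels}/{max_tunnels} tunnels in use")

    if obs.renegotiations > 0:
        findings.append(f"P: {obs.renegotiations} renegotiations in logs")
    else:
        findings.append("P: no renegotiations in logs")
    return findings


for finding in diagnose(Observation(480, 4, 2000, 58, 2)):
    print(finding)
```

Each step narrows the search using the result of the previous one, so the logs are read last, with a measured baseline and the provisioned limits already established.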


Troubleshooting Tree: Choose an appropriate scale unit for each gateway type

Color Legend:

  • Dark blue: initial symptom, investigation entry point
  • Medium blue: diagnostic question, decision node verifiable in practice
  • Green: recommended action or identified resolution
  • Orange: intermediate validation or state requiring additional investigation
  • Red: not used in this tree, reserved for identified cause without immediate resolution

To use this tree when facing a real problem, start at the root node, which describes the symptom of lower-than-expected throughput. The first branch identifies the type of gateway involved, as each type has distinct limits and scaling behaviors. From there, follow the yes/no questions, answering only with what is directly observable in the portal or logs, without skipping steps. Each path ends in a concrete action or a validation that confirms whether the problem persists or has been resolved.