Troubleshooting Lab: Choose between manual and autoscale

Diagnostic Scenarios

Scenario 1 — Root Cause

An Azure Application Gateway v2 is running in production with autoscale configured as shown below. The environment was recently migrated from v1 to v2, and the operations team is monitoring its behavior during the first few weeks.

{
  "sku": {
    "name": "WAF_v2",
    "tier": "WAF_v2"
  },
  "autoscaleConfiguration": {
    "minCapacity": 2,
    "maxCapacity": 5
  }
}

During a high-load event on a Friday afternoon, the monitoring team observes the following data in Azure Monitor:

Metric: ComputeUnits
Timestamp        Value
14:32:00         4.8
14:33:00         4.9
14:34:00         5.0
14:35:00         5.0
14:36:00         5.0

Metric: FailedRequests
Timestamp        Value
14:35:00         312
14:36:00         489
14:37:00         601

The team notes that the region in use reports normal availability and that the TLS listener certificate was successfully renewed two days earlier. The gateway is associated with a WAF policy in Prevention mode using the OWASP 3.2 ruleset, with no recent rule changes.

What is the root cause of the increase in FailedRequests observed starting at 14:35?

A) The WAF policy in Prevention mode is blocking legitimate requests after an automatic update of the OWASP ruleset

B) The maxCapacity of 5 was reached, so the gateway cannot provision new instances to absorb the additional load

C) The newly renewed TLS certificate has an algorithm incompatible with some clients, generating handshake failures

D) The region is experiencing silent degradation not reflected in the availability dashboard

Scenario 2 — Action Decision

The infrastructure team has identified that a production Azure Application Gateway v2 uses manual scaling with capacity: 3. It fronts a financial application with a 99.9% SLA that processes transactions during business hours, Monday to Friday. At 5:45 PM on a Thursday, the on-call engineer receives alerts indicating that the gateway is close to the maximum capacity of its 3 configured instances, with the peak expected within the next 20 minutes.

The cause has been confirmed: the transaction volume is above expected due to an early month-end closure requested by the business. The change was not communicated to the infrastructure team in advance.

There is no approved change window for this afternoon. Internal policy requires prior approval for production changes, but allows documented exceptions when there is an imminent SLA risk.

What is the correct action to take at this moment?

A) Wait for formal approval of a change window before any modification, as changing production without a window violates policy regardless of risk

B) Increase capacity from 3 to 8 immediately via the portal, without recording the change, to restore capacity margin before the peak

C) Trigger the documented exception process, record the imminent SLA risk, and execute the capacity increase with emergency approval

D) Migrate the gateway from manual scaling to autoscale immediately to solve the problem definitively

Scenario 3 — Root Cause

An engineer is reviewing the logs of an Azure Application Gateway v2 recently provisioned in a development environment. The team reported that, after a period without traffic during the weekend, the first requests on Monday morning have response times above 30 seconds, normalizing after a few minutes.

The engineer collects the following information from the environment:

SKU: Standard_v2
autoscaleConfiguration:
  minCapacity: 0
  maxCapacity: 4

Backend pool: 2 VMs (status: healthy)
Health probe interval: 30s
Health probe threshold: 3

Last active instance log:
  Friday 18:47:22 - Instance count: 1
  Friday 20:03:11 - Instance count: 0
  Monday  08:01:44 - Instance count: 0
  Monday  08:02:31 - Instance count: 1
  Monday  08:02:58 - Request served successfully

The engineer additionally verifies that the backend pool is healthy, that there was no configuration change over the weekend, and that the NSG associated with the gateway subnet was not modified.

What is the root cause of the high latency on the first Monday requests?

A) The health probe with 30-second interval and threshold of 3 is marking backends as unhealthy after the weekend

B) The backend pool VMs enter hibernation mode during periods without traffic, increasing initial response time

C) With minCapacity: 0, the gateway scales to zero instances and needs time to provision when receiving the first requests

D) The NSG configuration on the subnet temporarily blocks traffic after prolonged periods of inactivity

Scenario 4 — Side Impact

An operations team identified that an Azure Application Gateway v2 configured with minCapacity: 0 and maxCapacity: 8 was causing elevated latency on the first requests after periods of inactivity. To solve the problem, the team increased minCapacity from 0 to 2, ensuring that at least two instances always remain active.

The action solved the initial latency problem and was considered successful by the team.

What secondary consequence does this change introduce to the environment?

A) The gateway starts rejecting TLS connections when operating with exactly 2 instances, requiring a minimum of 3 for stability

B) The gateway's fixed monthly cost increases, as the 2 instances from minCapacity are charged continuously even without traffic

C) The maxCapacity: 8 is no longer respected, as autoscale does not operate correctly when minCapacity is greater than zero

D) Azure Monitor stops issuing scaling alerts, as it interprets minCapacity: 2 as sufficient permanent capacity

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The decisive clue is in the ComputeUnits metric, which reaches exactly 5.0 and remains at that value in the following minutes, coinciding with the start of the increase in FailedRequests. The value 5.0 corresponds to the configured maxCapacity. When the gateway reaches the maximum instance limit, it cannot scale beyond that point and starts rejecting or failing requests that exceed available capacity.

The information about the TLS certificate renewal and the WAF policy is included deliberately as a distraction. The certificate was renewed two days earlier without incident, and the WAF policy had no recent changes. A hasty diagnosis would most likely blame the WAF (alternative A), because it is in Prevention mode. However, the ComputeUnits metric saturates at the configured ceiling one minute before the failures begin, which is the determining evidence.

The most dangerous distractor is alternative A: acting on the WAF policy based on this hypothesis could result in improper deactivation of security rules without solving the real problem.
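
The correlation described above can be checked mechanically. The following is a minimal Python sketch that uses only the samples from the scenario tables and follows the scenario's own reading of the metric, in which the plateau value equals maxCapacity. The function name first_saturation and the min_consecutive parameter are hypothetical, introduced only for illustration.

```python
# Samples copied from the Scenario 1 tables: (timestamp, value).
compute_units = [
    ("14:32:00", 4.8),
    ("14:33:00", 4.9),
    ("14:34:00", 5.0),
    ("14:35:00", 5.0),
    ("14:36:00", 5.0),
]
failed_requests = [
    ("14:35:00", 312),
    ("14:36:00", 489),
    ("14:37:00", 601),
]

MAX_CAPACITY = 5  # maxCapacity from the scenario's autoscaleConfiguration


def first_saturation(samples, ceiling, min_consecutive=2):
    """Return the timestamp at which the series first stays at or above
    `ceiling` for `min_consecutive` samples in a row, or None.
    (Hypothetical helper, not part of any Azure SDK.)"""
    run_start = None
    run_length = 0
    for timestamp, value in samples:
        if value >= ceiling:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= min_consecutive:
                return run_start
        else:
            run_length = 0
    return None


saturated_at = first_saturation(compute_units, MAX_CAPACITY)
first_failure_at = failed_requests[0][0]

# Zero-padded HH:MM:SS strings compare correctly as plain strings.
saturation_precedes_failures = (
    saturated_at is not None and saturated_at <= first_failure_at
)
print(saturated_at, first_failure_at, saturation_precedes_failures)
# prints: 14:34:00 14:35:00 True
```

Requiring at least two consecutive samples at the ceiling is a design choice that avoids flagging a single momentary spike as saturation.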

Answer Key — Scenario 2

Answer: C

The cause is already identified and confirmed in the statement. What is at stake is the correct decision within the described constraints: there is no approved window, the SLA risk is imminent, and the policy allows documented exceptions.

Alternative C is the only one that simultaneously respects both dimensions of the problem: the real technical risk (SLA at risk) and the organizational constraint (change policy). The exception process exists precisely for these cases.

Alternative A ignores the imminent SLA risk and applies the policy rigidly where the policy itself provides flexibility. Alternative B takes the right technical action in the wrong way: it bypasses the policy, leaves no record of the change, and creates audit risk. Alternative D is the worst possible choice at this moment: migrating from manual scaling to autoscale in production without a window, during an active crisis, is an architectural change that can introduce new problems at a critical time. Even if autoscale is the right long-term solution, both the timing and the method are wrong.

Answer Key — Scenario 3

Answer: C

The instance log is direct and sufficient evidence for the diagnosis. It shows that the gateway dropped to Instance count: 0 on Friday night and was still at zero on Monday at 08:01:44. An instance came up at 08:02:31 and the first request was served at 08:02:58, so the roughly 74 seconds between the 08:01:44 entry and the first successful response is time spent provisioning an instance from zero.

The information about the healthy backend pool and absence of NSG changes is irrelevant for this diagnosis and was included to simulate the real informational noise of an investigation. The most tempting distractor is alternative A, as the health probe with 30-second interval and threshold of 3 could, in theory, mark backends as unhealthy. However, the statement explicitly declares that the backend pool is healthy, eliminating this hypothesis. Alternative B describes behavior that does not exist in Application Gateway: the gateway does not control the power state of backend VMs.
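
The timing argument can be reproduced from the instance log itself. This is a minimal Python sketch using only the log entries from the scenario; the day offsets are an assumption made to order the entries (the log gives weekday names, not dates), and the helper to_offset is hypothetical.

```python
from datetime import timedelta

# Day offsets relative to the Monday of the week in which the log starts;
# the following Monday is day 7. Assumed only to order the entries.
DAY_OFFSET = {"Friday": 4, "Monday": 7}

# Entries copied from the "Last active instance log" in Scenario 3.
log = [
    ("Friday", "18:47:22", "Instance count: 1"),
    ("Friday", "20:03:11", "Instance count: 0"),
    ("Monday", "08:01:44", "Instance count: 0"),
    ("Monday", "08:02:31", "Instance count: 1"),
    ("Monday", "08:02:58", "Request served successfully"),
]


def to_offset(day, clock):
    """Convert a weekday name and HH:MM:SS string to a timedelta."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(
        days=DAY_OFFSET[day], hours=hours, minutes=minutes, seconds=seconds
    )


events = [(to_offset(day, clock), message) for day, clock, message in log]

zero_entries = [t for t, m in events if m == "Instance count: 0"]
first_zero, last_zero = min(zero_entries), max(zero_entries)
instance_ready = min(
    t for t, m in events if m == "Instance count: 1" and t > last_zero
)
first_served = min(t for t, m in events if m.startswith("Request served"))

idle_period = instance_ready - first_zero
print(idle_period > timedelta(days=2))               # True: idle all weekend
print((instance_ready - last_zero).total_seconds())  # 47.0
print((first_served - last_zero).total_seconds())    # 74.0
```

The 74-second gap matches the "above 30 seconds" response times reported by the team, which is what ties the symptom to the scale-from-zero event rather than to the backends.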

Answer Key — Scenario 4

Answer: B

Increasing minCapacity from 0 to 2 solves the cold start problem because it ensures two instances are always ready. However, this decision has a direct and immediate cost: the instances corresponding to minCapacity are always provisioned and charged, regardless of traffic volume. In an environment with long periods of inactivity, such as the one described, this represents a significant change in the monthly cost profile.

The other alternatives describe behaviors that do not exist in Application Gateway. TLS has no minimum instance restriction (alternative A). The maxCapacity continues to be respected for any minCapacity value below it (alternative C). Azure Monitor does not change its alert behavior based on the minCapacity value (alternative D).

The central point of this scenario is to develop awareness that every corrective action in scaling has an impact on the cost model, and this impact should be evaluated before implementation, not after.
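
One way to evaluate that impact before the change is to estimate how many instance-hours the new minimum keeps reserved each month. The Python sketch below is illustrative only: the function names are hypothetical, the 730-hour month is a common approximation, and price_per_instance_hour is a placeholder value rather than an Azure rate. Real figures should come from the Azure pricing page for the SKU and region in use.

```python
HOURS_PER_MONTH = 730  # approximation: 8760 hours per year / 12 months


def reserved_instance_hours(min_capacity, hours=HOURS_PER_MONTH):
    """Instance-hours kept provisioned by minCapacity, regardless of traffic."""
    return min_capacity * hours


def reserved_cost_delta(old_min, new_min, price_per_instance_hour):
    """Change in the always-on portion of the monthly bill.
    `price_per_instance_hour` is a hypothetical placeholder rate."""
    delta_hours = reserved_instance_hours(new_min) - reserved_instance_hours(old_min)
    return delta_hours * price_per_instance_hour


# Scenario 4: minCapacity raised from 0 to 2.
print(reserved_instance_hours(0))  # 0
print(reserved_instance_hours(2))  # 1460

# Placeholder rate of 0.25 currency units per instance-hour (not a real price).
print(reserved_cost_delta(0, 2, 0.25))  # 365.0
```

Separating the instance-hour calculation from the rate keeps the reasoning valid even when the actual pricing model or rate differs from the placeholder.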

Troubleshooting Tree: Choose between manual and autoscale


Color Legend:

Color	Node Type
Dark Blue	Initial symptom (entry point)
Blue	Diagnostic question
Red	Identified cause
Green	Recommended action or resolution decision
Orange	Validation or intermediate verification

To use this tree when facing a real problem, start with the root node describing the observed symptom and follow the branches by answering each question based on what you can verify in Azure Monitor, gateway logs, or current configuration. Each path ends in a named cause or a concrete action. If the diagnosis reaches an orange validation node, collect the indicated evidence before advancing to the next branch.
