Troubleshooting Lab: Identify appropriate use cases for Azure Application Gateway
Diagnostic Scenarios
Scenario 1: Root Cause
A team reports that, after migrating an application to Azure, requests from external clients are intermittently returning HTTP 502 Bad Gateway. The issue occurs for approximately 30% of requests throughout the day.
The environment is configured as follows:
Application Gateway (SKU Standard_v2)
|-- Backend Pool: 4 VMs running IIS (port 80)
|-- Health Probe: HTTP, path "/", port 80, interval 30s, threshold 3
|-- Listener: HTTP, port 80
|-- NSG on Application Gateway subnet: no inbound restrictions
The team investigated and collected the following additional information:
- IIS logs on the VMs show HTTP 200 responses for the requests that reach them
- The VM subnet has an NSG with the following manually configured inbound rules:

| Priority | Name | Port | Source | Action |
|---|---|---|---|---|
| 100 | AllowHTTPFromLB | 80 | 168.63.129.16 | Allow |
| 200 | DenyAll | * | * | Deny |

- The Application Gateway is on the latest firmware version
- The security team approved the network topology two weeks ago
- The Standard_v2 SKU was chosen for its autoscaling support
What is the root cause of the intermittent 502 error?
A. The 30-second health probe interval is too long, causing VMs that become temporarily slow to be maintained in the active pool when they no longer respond correctly
B. The NSG on the VM subnet is blocking health probes from the Application Gateway, as they originate from a different IP range than 168.63.129.16
C. The Standard_v2 SKU requires the health probe to use HTTPS, and the current configuration using HTTP causes the probe to return false positive health status
D. The threshold of 3 consecutive failures on the health probe is insufficient to detect instability, causing VMs with intermittent failures to remain active in the pool
Scenario 2: Action Decision
The operations team identified that the WAF on the Azure Application Gateway is in Detection mode in a production environment that processes financial transactions. After analyzing the logs, it was confirmed that real SQL injection attacks are being logged in the WAF logs but are not being blocked.
The operational context is as follows:
- The application has a 99.9% SLA and any unavailability requires opening a critical incident
- The development team identified that two legacy application endpoints generate false positives in WAF rules, but the endpoints are rarely used (less than 1% of traffic)
- There is a scheduled maintenance window in 72 hours
- The team has permission to create rule exclusions in the WAF
- Changing the Application Gateway SKU is not authorized at this time
What is the correct action to take now?
A. Wait for the maintenance window to change the WAF to Prevention mode, as production changes outside the window violate the change process
B. Create rule exclusions for the two legacy endpoints immediately and then change the WAF to Prevention mode without waiting for the window, given the active attack risk
C. Change the WAF to Prevention mode immediately without creating exclusions, accepting that false positives on legacy endpoints will generate blocks while exclusions are prepared
D. Escalate to the security team and wait for formal approval before any changes, as the WAF involves corporate security policy
Scenario 3: Root Cause
An architect configured an Application Gateway to route traffic between two environments: production and staging. The intention was to separate traffic by URL path:
app.contoso.com/prod/* --> Backend Pool: Production VMs
app.contoso.com/hml/* --> Backend Pool: Staging VMs
After deployment, the QA team reports that all requests sent to /hml/api/test are served by the production VMs instead of the staging VMs. Requests to /prod/ work correctly.
The current rule configuration in the Application Gateway is as follows:
Rule 1 (Priority 100): PathBasedRouting
Path: /prod/* --> BackendPool-Production
Rule 2 (Priority 200): PathBasedRouting
Path: /hml/* --> BackendPool-Staging
Default Rule: BasicRouting
--> BackendPool-Production
The team verified that:
- Staging VMs are healthy and responding on port 80
- The health probe returns HTTP 200 for all backends
- The listener's SSL certificate was renewed last week and is valid
- The DNS for app.contoso.com correctly points to the Application Gateway's public IP
What is the root cause of the problem?
A. The health probe is configured for HTTP while staging VMs use HTTPS, causing false positive health status and silent removal from the active pool
B. Requests to /hml/api/test do not match the path /hml/* due to a case sensitivity issue in the routing rules
C. The path configuration /hml/* is correct, but the Default Rule redirects requests that arrive before the priority 200 rule is evaluated, as rule processing is not sequential in this scenario
D. The path configured as /hml/* does not match URLs that contain subdirectories after the prefix, as the wildcard * in Application Gateway path rules does not cover additional URL segments beyond the immediately following one
Scenario 4: Diagnostic Sequence
A client reports that the web application, which was functioning normally, started returning HTTP 403 Forbidden for all requests after a configuration change made the previous afternoon. The change was vaguely described by the team as "gateway adjustments."
The available investigation steps are:
1. Check WAF diagnostic logs in Log Analytics to identify whether requests are being blocked by a specific rule
2. Confirm that the Application Gateway is operational by checking its health status in the Azure portal and whether the backend pool has healthy instances
3. Check the Application Gateway change history in the Activity Log to identify exactly what was changed the previous afternoon
4. Test a request directly against the backend (bypassing the Application Gateway) to determine whether the problem is in the gateway or the application
5. Check whether the WAF mode was changed from Detection to Prevention and whether new custom rules were added
What is the correct investigation sequence to reach the root cause with the fewest open hypotheses?
A. 2 → 4 → 3 → 5 → 1
B. 3 → 5 → 1 → 2 → 4
C. 2 → 3 → 5 → 1 → 4
D. 4 → 1 → 3 → 2 → 5
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The key clue is the NSG on the VM subnet: it allows inbound traffic only from 168.63.129.16, the Azure platform virtual IP (the source of Azure Load Balancer health probes, for example). Application Gateway health probes do not come from that address. For backends addressed by private IP, they originate from the Application Gateway instances' own addresses in the gateway subnet. The probes therefore fall through to the DenyAll rule, healthy backends are intermittently marked as unhealthy, and the gateway returns 502 errors.
The irrelevant information in the scenario is the Standard_v2 SKU and the fact that the firmware is updated. This information is true and plausible but has no relation to the described symptom.
The most dangerous distractor is alternative A: the 30-second probe interval is indeed an adjustable parameter, and thinking of it as a cause leads the diagnosis in the wrong direction. The consequence of acting based on this hypothesis would be to reduce the probe interval without solving the NSG blocking, and the problem would continue.
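The rule evaluation behind this answer can be sketched in a few lines of Python. This is a simplified model, not the NSG engine itself: it checks only source address and destination port, the helper name `evaluate` is made up for illustration, and `10.0.1.4` is an assumed gateway instance address because the scenario does not state the gateway subnet range.

```python
import ipaddress

# Inbound rules from the scenario's VM-subnet NSG (port None means "any port").
RULES = [
    {"priority": 100, "name": "AllowHTTPFromLB", "port": 80,
     "source": "168.63.129.16/32", "action": "Allow"},
    {"priority": 200, "name": "DenyAll", "port": None,
     "source": "0.0.0.0/0", "action": "Deny"},
]


def evaluate(source_ip, port, rules=RULES):
    """Return (rule name, action) for the first matching rule, lowest priority number first."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["priority"]):
        port_ok = rule["port"] is None or rule["port"] == port
        if port_ok and ip in ipaddress.ip_network(rule["source"]):
            return rule["name"], rule["action"]
    # Simplification: the real platform default rules are not modelled here.
    return "ImplicitDeny", "Deny"


# Assumed gateway instance address (hypothetical; not given in the scenario).
print(evaluate("10.0.1.4", 80))
print(evaluate("168.63.129.16", 80))
```

Under these assumptions, traffic from the gateway address lands on DenyAll while traffic from 168.63.129.16 is allowed, which matches the reasoning above: the allow rule targets the wrong source.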
Answer Key: Scenario 2
Answer: B
The cause is identified and confirmed: real attacks are happening now and the WAF is not blocking because it's in Detection mode. The critical constraint is that false positives exist on two endpoints, but this has a known and authorized solution: create rule exclusions. The scenario explicitly states that the team has permission to create exclusions.
The correct sequence is: create exclusions first to eliminate the risk of legitimate interruption, then change to Prevention. This can be done outside the window because the impact of not acting (active attacks) is greater than the risk of the controlled change.
Alternative A is the most dangerous distractor: waiting 72 hours with confirmed active attacks in financial production is indefensible. The maintenance window is for changes requiring planning, not for responses to active security incidents. Alternative C ignores a concrete constraint (false positives would cause legitimate traffic blocking), violating the SLA. Alternative D uses the correct process in the wrong context.
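The difference between the alternatives can be made concrete with a toy model of the WAF decision. The sketch below is illustrative only: `waf_decision` is a hypothetical helper, the endpoint paths are invented, and treating exclusions as a set of paths is a simplification of how exclusions are scoped in a real WAF policy.

```python
def waf_decision(mode, path, rule_matched, excluded_paths=()):
    """Return 'allow', 'log', or 'block' for a single request."""
    if not rule_matched or path in excluded_paths:
        return "allow"
    # Detection only records the match; Prevention stops the request.
    return "block" if mode == "Prevention" else "log"


# Hypothetical endpoint names standing in for the two legacy endpoints.
LEGACY = {"/legacy/report", "/legacy/export"}

# Current state: a real SQL injection attempt is logged but still reaches the app.
print(waf_decision("Detection", "/payments", rule_matched=True))
# Alternative C: Prevention without exclusions blocks the legacy false positives.
print(waf_decision("Prevention", "/legacy/report", rule_matched=True))
# Alternative B: exclusions first, then Prevention.
print(waf_decision("Prevention", "/legacy/report", True, excluded_paths=LEGACY))
print(waf_decision("Prevention", "/payments", True, excluded_paths=LEGACY))
```

In this model, only the exclusions-then-Prevention order both blocks the attack traffic and keeps the legacy endpoints available.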
Answer Key: Scenario 3
Answer: D
The answer key attributes the fault to how the /hml/* path rule handles /hml/api/test, a URL with more than one segment after the prefix. The wildcard * in a path rule is intended to cover everything after the prefix, so /hml/* should match paths such as /hml/anything. The observed behavior shows that the rule is not matching, or is not being evaluated, for the multi-segment URL. When no path rule matches, the Default Rule handles the request, and in this configuration the Default Rule points to BackendPool-Production. That is why staging requests are served by production VMs.
The irrelevant information in the scenario is the SSL certificate renewal. The problem is HTTP path routing, and the certificate state has no influence on which backend pool receives the request.
The most dangerous distractor is alternative C, which proposes non-sequential processing logic for rules. In reality, path rules are evaluated sequentially by priority, and the Default Rule is only triggered when no path rule matches. Believing alternative C would lead the team to investigate priority order instead of checking wildcard behavior.
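When checking wildcard behavior, it helps to have a reference for what the intended semantics would return. The sketch below models the intended behavior only (trailing `*` as a prefix match, case-insensitive comparison, fall-through to the default pool); `pick_backend` is a hypothetical helper and not the gateway's actual matching code.

```python
def pick_backend(path, path_rules, default_pool):
    """Return the pool of the first matching path rule, else the default pool."""
    path = path.lower()  # this sketch compares paths case-insensitively
    for pattern, pool in path_rules:
        pattern = pattern.lower()
        if pattern.endswith("/*"):
            # Intended semantics: "/hml/*" matches anything that starts with "/hml/".
            if path.startswith(pattern[:-1]):
                return pool
        elif path == pattern:
            return pool
    return default_pool


# Path rules and default pool taken from the scenario's configuration.
PATH_RULES = [
    ("/prod/*", "BackendPool-Production"),
    ("/hml/*", "BackendPool-Staging"),
]

print(pick_backend("/hml/api/test", PATH_RULES, "BackendPool-Production"))
print(pick_backend("/unknown", PATH_RULES, "BackendPool-Production"))
```

If the real gateway sends /hml/api/test to production while this reference returns the staging pool, the request is reaching the Default Rule, which narrows the investigation to why the path rule is not being applied.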
Answer Key: Scenario 4
Answer: C
The correct sequence is: 2 → 3 → 5 → 1 → 4.
The progressive diagnostic reasoning follows this logic:
- Step 2 first: confirm that the Application Gateway is operational and backends are healthy. If the gateway has a health problem, subsequent steps are irrelevant.
- Step 3 next: use the Activity Log to identify what was changed. The statement indicates there was a vague change the previous afternoon. Knowing exactly what changed directs all subsequent steps.
- Step 5 in sequence: with Activity Log information in hand, specifically check if the WAF was changed to Prevention or if new rules were added. HTTP 403 is consistent with WAF blocking.
- Step 1 right after: confirm in Log Analytics which specific rules are blocking requests.
- Step 4 last: test the backend directly only if previous steps don't identify the problem in the gateway, to rule out that the cause is in the application.
Alternative A (2 → 4 → 3 → 5 → 1) is tempting because testing the backend seems logical early, but it bypasses the Application Gateway before checking what changed in it, wasting time and creating unnecessary open hypotheses. Alternative B starts with Activity Log without checking gateway health, which is an error when the symptom could be total unavailability.
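The log check in step 1 amounts to grouping blocked requests by rule. The sketch below shows that idea on exported records. The record shape is a simplified assumption (real firewall log entries carry more fields and may name them differently), and the rule IDs and URIs are placeholder sample data, not findings from this scenario.

```python
from collections import Counter

# Hypothetical, simplified WAF log records used only to illustrate the grouping.
SAMPLE_RECORDS = [
    {"requestUri": "/login", "ruleId": "942100", "action": "Blocked"},
    {"requestUri": "/login", "ruleId": "942100", "action": "Blocked"},
    {"requestUri": "/search", "ruleId": "949110", "action": "Blocked"},
    {"requestUri": "/health", "ruleId": "920350", "action": "Detected"},
]


def blocked_rule_counts(records):
    """Count blocked requests per rule ID, ignoring entries that were only detected."""
    return Counter(r["ruleId"] for r in records if r.get("action") == "Blocked")


for rule_id, count in blocked_rule_counts(SAMPLE_RECORDS).most_common():
    print(rule_id, count)
```

A single rule dominating the counts right after the change window is a strong hint that the change introduced or activated that rule.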
Troubleshooting Tree: Identify appropriate use cases for Azure Application Gateway
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (verifiable decision) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or additional investigation |
To use this tree when facing a real problem, start with the root node by identifying the error code or observed symptom. At each question node, answer based on what you can directly verify in the Azure portal, diagnostic logs, or through direct testing against the backend. Follow the path corresponding to your answer until you reach a red node (identified cause) or green node (resolution action). Orange nodes indicate that more information needs to be collected before taking action.
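The traversal described above can be sketched as a small data structure and a walk function. The tree below is a hypothetical fragment assembled from the scenarios in this lab, not a reproduction of the full diagram; node names, questions, and the `walk` helper are illustrative assumptions.

```python
# Node types mirror the color legend: symptom, question, cause, action, validation.
TREE = {
    "symptom": {"type": "symptom", "text": "Clients receive HTTP errors", "next": "status"},
    "status": {"type": "question", "text": "Which status code is returned?",
               "answers": {"502": "nsg_check", "403": "waf_check"}},
    "nsg_check": {"type": "question", "text": "Does the backend NSG allow the gateway subnet?",
                  "answers": {"no": "cause_nsg", "yes": "collect_more"}},
    "waf_check": {"type": "question", "text": "Is the WAF blocking requests in Prevention mode?",
                  "answers": {"yes": "cause_waf", "no": "collect_more"}},
    "cause_nsg": {"type": "cause", "text": "NSG blocks gateway traffic to the backend"},
    "cause_waf": {"type": "cause", "text": "WAF rule blocks legitimate requests"},
    "collect_more": {"type": "validation", "text": "Collect more information before acting"},
}


def walk(tree, answers, start="symptom"):
    """Follow the tree using the given answers; return the list of visited node IDs."""
    node_id, visited = start, [start]
    answers = iter(answers)
    while True:
        node = tree[node_id]
        if node["type"] == "symptom":
            node_id = node["next"]
        elif node["type"] == "question":
            node_id = node["answers"][next(answers)]
        else:
            return visited  # cause, action, or validation node ends the walk
        visited.append(node_id)


print(walk(TREE, ["502", "no"]))
```

Each answer passed to `walk` should come from something directly verified in the portal, the logs, or a backend test, in line with the guidance above.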