Troubleshooting Lab: Map requirements to features and capabilities of Azure Front Door
Diagnostic Scenarios
Scenario 1: Root Cause
A global web application is configured in Azure Front Door Premium with two origin groups: prod-eastus and prod-westeu. The operations team reports that users in the United States are seeing latency consistently above 400 ms, while users in Europe report normal performance. The configured routing method is Latency.
During the investigation, the engineer collects the following information:
Origin Group: prod-eastus
Origin: app-eastus.azurewebsites.net
Health Probe Protocol: HTTP
Health Probe Path: /healthz
Health Probe Interval: 30s
Status reported by Front Door: Degraded
Origin Group: prod-westeu
Origin: app-westeu.azurewebsites.net
Health Probe Protocol: HTTP
Health Probe Path: /healthz
Health Probe Interval: 30s
Status reported by Front Door: Healthy
The engineer also verifies that the /healthz endpoint on the app-eastus server responds with HTTP 200 when accessed directly via curl from a VM in the same region. The TLS certificate for the custom domain was renewed two days ago and is valid. WAF is enabled on the profile with a policy in Detection mode.
What is the root cause of the elevated latency for US users?
A) The TLS certificate was recently renewed and Front Door is still propagating the new certificate chain to all PoPs, causing temporary slowness in connections for American users.
B) The WAF policy in Detection mode is inspecting and logging all requests without blocking them, which adds processing overhead sufficient to elevate perceived latency.
C) The app-eastus origin has Degraded status in Front Door, causing traffic from American users to be redirected to prod-westeu in Europe, which increases latency for this user group.
D) The health probe is configured with a 30-second interval, which is too long. Front Door doesn't have updated information about the backend state and is making incorrect routing decisions.
Scenario 2: Action Decision
The security team identified that the WAF policy associated with the Azure Front Door Premium profile is in Detection mode in production. The cause was confirmed: an engineer changed the policy mode from Prevention to Detection during a maintenance window three weeks ago to investigate false positives and forgot to revert it. Since then, malicious requests have been logged but not blocked.
The current context is:
- The application is in production with active traffic of approximately 8,000 requests per minute
- The next maintenance window is scheduled in 72 hours
- Changing the WAF policy mode neither requires a Front Door restart nor interrupts traffic
- The security team has Contributor permission on the WAF policy resource
- Logs show that in the last 3 days there were 47 SQL Injection attempts that were only logged
What is the correct action to take at this moment?
A) Wait for the maintenance window in 72 hours to revert the policy mode to Prevention, ensuring the change is made in a controlled and documented manner.
B) Immediately revert the WAF policy mode from Detection to Prevention, as the change does not cause traffic interruption and active exposure to attacks does not justify waiting for the maintenance window.
C) Create a new WAF policy in Prevention mode and associate it with the Front Door profile to replace the current one, preserving the old policy in Detection for audit purposes.
D) Add custom blocking rules for the SQL Injection patterns identified in logs and maintain Detection mode until the maintenance window, reducing risk without changing the global mode.
Scenario 3: Root Cause
A development team published a new version of an API managed by Azure Front Door Standard. After publication, clients consuming the /api/v2/orders endpoint start receiving outdated responses, with data reflecting the application state from hours ago. The /api/v2/products endpoint works correctly and returns real-time data.
The engineer verifies the profile configuration:
Route: api-v2-route
Patterns to match: /api/v2/*
Origin Group: orders-backend
Caching: Enabled
Query string caching behavior: Ignore query strings
Cache duration: 4 hours
Compression: Enabled
Backend response headers observed in /api/v2/orders:
Cache-Control: no-store
Content-Type: application/json
X-Request-ID: 7f3a2b1c
Vary: Accept-Encoding
The /api/v2/products endpoint is on a separate route with caching disabled. The orders-backend origin group is healthy. The application was deployed successfully and responds correctly when accessed directly via the backend URL.
What is the root cause of the observed behavior?
A) The Vary: Accept-Encoding header in the backend response is causing conflict with Front Door's cache mechanism, causing old response versions to be delivered to clients with different accepted encodings.
B) Caching is enabled on the /api/v2/* route with 4-hour duration. Even though the backend returns Cache-Control: no-store, Front Door cached the response from the previous version before deployment and is serving this cached content until the TTL expires.
C) The query string behavior configured as Ignore query strings is causing requests with different parameters to be treated as the same cache object, always returning the first stored response.
D) The new deployment changed the X-Request-ID of responses. Front Door uses this header as a cache key, and the divergence between the old and new values is causing inconsistency in delivered responses.
Scenario 4: Diagnostic Sequence
An engineer receives the following alert at 2:32 PM:
"Users report HTTP 503 error when accessing the application via Azure Front Door. The issue affects all regions."
The engineer has access to the Azure portal, Front Door logs via Azure Monitor, the application backend, and profile configurations. The following investigation steps are available, but out of order:
1. Check the health status of origins in the origin group within the Front Door profile
2. Access request logs in Azure Monitor to identify the error code returned by the origin
3. Confirm if Front Door is returning the 503 directly (without reaching the backend) or if the error comes from the origin itself
4. Check if there are any misconfigured routing rules or missing routes that could be causing the error before reaching the origin group
5. Access the backend URL directly (outside Front Door) to verify if the service responds correctly
What is the correct investigation sequence for this symptom?
A) 1 → 5 → 2 → 4 → 3
B) 3 → 1 → 5 → 4 → 2
C) 4 → 1 → 3 → 2 → 5
D) 3 → 4 → 1 → 5 → 2
Answer Key and Explanations
Answer Key: Scenario 1
Answer: C
The determining clue in the statement is the Degraded status reported by Front Door for the app-eastus origin. When an origin is degraded, Front Door considers it less preferable or unavailable for routing, even when the configured method is Latency. Traffic from American users, which would normally be routed to the lowest-latency origin (app-eastus), is redirected to prod-westeu in Europe, which explains the elevated latency.
The detail that the /healthz endpoint responds with HTTP 200 when accessed directly via curl is a deliberate distractor. A direct request from a VM in the same region does not follow the path that Front Door's health probes take, so the probes may be failing for a different reason, such as a firewall blocking requests that originate from Front Door PoP IPs, or an HTTP redirect that the probe doesn't follow.
Distractors A and B are technically plausible in other contexts, but don't explain why the problem is geographically restricted to American users. Distractor D confuses the probe interval (which influences status update frequency) with the cause of incorrect routing; Front Door already has a status reading showing Degraded, regardless of the interval.
The most dangerous error would be acting based on distractor D and increasing probe frequency without investigating why the origin is being marked as Degraded, which wouldn't solve the real problem.
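The routing effect described above can be sketched with a small model. This is a simplified illustration, not Front Door's actual selection algorithm: it assumes both origins are candidates for the same request, and the `Origin` class, the `pick_origin` function, and the latency values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    name: str
    healthy: bool
    latency_ms: float  # latency as measured from the edge location serving the user


def pick_origin(origins):
    """Return the lowest-latency healthy origin, or None if none is healthy."""
    candidates = [o for o in origins if o.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.latency_ms)


# Illustrative latencies as seen from a hypothetical US edge location.
us_view_healthy = [
    Origin("app-eastus.azurewebsites.net", True, 20.0),
    Origin("app-westeu.azurewebsites.net", True, 110.0),
]
us_view_degraded = [
    Origin("app-eastus.azurewebsites.net", False, 20.0),  # probes failing
    Origin("app-westeu.azurewebsites.net", True, 110.0),
]

chosen_before = pick_origin(us_view_healthy)
chosen_after = pick_origin(us_view_degraded)
print(chosen_before.name)
print(chosen_after.name)
```

In this model the probe interval never enters the decision: once the nearer origin is excluded for health reasons, the farther one wins regardless of how often status is refreshed.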
Answer Key: Scenario 2
Answer: B
The cause is already stated in the scenario: the policy is in Detection mode when it should be in Prevention. The correct reasoning here depends on analyzing constraints, not identifying the cause.
The critical constraints are: the mode change doesn't cause traffic interruption, the team has adequate permission (Contributor on the resource), and there's active evidence of attacks being only logged. None of these conditions justify waiting 72 hours. The maintenance window exists for changes that cause interruption risk or require broad coordination; reverting a WAF policy mode doesn't fit this criterion.
Alternative A is the most dangerous distractor: it sounds like good governance practice, but ignores that active exposure to SQL Injection for an additional 72 hours represents a concrete and unnecessary security risk when the fix is low-risk and can be done immediately.
Alternative C creates unnecessary work and configuration inconsistency risk. Alternative D is a partial solution that doesn't address the structural problem, which is the incorrect policy mode.
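The difference between the two policy modes can be illustrated with a toy evaluator. This is a minimal sketch under stated assumptions: the regular expression is a deliberately naive, hypothetical stand-in for a managed SQL injection rule, and `evaluate` is not part of any Azure SDK.

```python
import re

# Hypothetical, deliberately naive pattern for illustration only;
# real managed rule sets are far more thorough.
SQLI_PATTERN = re.compile(r"('|%27)\s*(or|union)\b", re.IGNORECASE)


def evaluate(query_string, mode):
    """Return (action, logged) for a request under the given policy mode."""
    matched = bool(SQLI_PATTERN.search(query_string))
    if not matched:
        return ("allow", False)
    if mode == "Prevention":
        return ("block", True)
    # Detection: the match is logged, but the request still reaches the origin.
    return ("allow", True)


attack = "id=1' OR 1=1"
detection_result = evaluate(attack, "Detection")
prevention_result = evaluate(attack, "Prevention")
benign_result = evaluate("id=42", "Prevention")
print(detection_result, prevention_result, benign_result)
```

The same request is logged in both modes; only the mode decides whether it is stopped, which is why the 47 logged attempts in the scenario all reached the application.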
Answer Key: Scenario 3
Answer: B
The described behavior (outdated responses after a deployment, while the backend responds correctly when accessed directly) is the classic symptom of an edge cache that was not invalidated. The route configuration confirms it: caching is enabled with a 4-hour duration on the /api/v2/* route.
The relevant technical point is that whether Azure Front Door honors Cache-Control: no-store depends on the caching configuration in effect. When caching is explicitly enabled with a defined TTL, Front Door may store the response even if the backend sends Cache-Control: no-store, because the configured cache behavior can take precedence over origin response headers in certain scenarios. The headers listed in the scenario were also collected during the investigation, after the deployment, whereas the object being served was cached before it. That cached copy will continue to be served until the 4-hour TTL expires or the cache is manually purged.
Alternative A is the distractor designed to attract those who focus on plausible but irrelevant technical details for the symptom. The Vary: Accept-Encoding header is standard for compression and doesn't cause the described behavior. Alternative C describes a real cache configuration problem, but one that would cause identical responses for different queries, not generalized outdated responses. Alternative D is incorrect: Front Door doesn't use X-Request-ID as a cache key.
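The stale-until-expiry behavior can be reproduced with a toy TTL cache. This is a minimal sketch, not Front Door's cache implementation: the `EdgeCache` class, its methods, and the response bodies are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class EdgeCache:
    """Toy TTL cache keyed by path; time is passed in explicitly for clarity."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # path -> (body, stored_at)

    def get(self, path, now, fetch_from_origin):
        entry = self.store.get(path)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # cache hit: the origin is not contacted
        body = fetch_from_origin()
        self.store[path] = (body, now)
        return body

    def purge(self, path):
        self.store.pop(path, None)


origin_state = {"body": "orders-v1"}


def fetch():
    return origin_state["body"]


cache = EdgeCache(ttl_seconds=4 * 3600)
first = cache.get("/api/v2/orders", now=0, fetch_from_origin=fetch)
origin_state["body"] = "orders-v2"  # the deployment happens here
stale = cache.get("/api/v2/orders", now=3600, fetch_from_origin=fetch)
cache.purge("/api/v2/orders")
fresh = cache.get("/api/v2/orders", now=3601, fetch_from_origin=fetch)
print(first, stale, fresh)
```

One hour after the deployment the cache still returns the old body because the origin is never consulted on a hit; a purge (or TTL expiry) is what forces the next request back to the origin.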
Answer Key: Scenario 4
Answer: D
The correct sequence is: 3 → 4 → 1 → 5 → 2
Progressive diagnostic reasoning moves from the outermost symptom toward the innermost causes:
Step 3 should be first because it determines where the error is being generated: in Front Door itself (configuration, route, or origin health issues) or in the backend. This distinction guides the entire subsequent investigation.
Step 4 comes next because route configuration errors are quickly verifiable in the control plane and can explain a 503 without traffic reaching the origin group.
Step 1 checks origin status, which is the most common cause of 503 when routes are correct.
Step 5 isolates whether the problem is in Front Door or the backend by accessing the backend directly.
Step 2, detailed log analysis, is the last step because it's the most time-consuming and should be used to confirm or detail an already formed hypothesis, not as a starting point.
Alternative B seems logical because it also starts with step 3, but it then checks origin health (step 1) and the backend directly (step 5) before checking route configuration (step 4), deferring a fast control-plane check that could end the investigation earlier.
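The ordering in answer D can be expressed as a short triage routine. This is a sketch under stated assumptions: the step numbers follow the list in Scenario 4, while the observation keys and the finding strings are hypothetical names chosen for illustration.

```python
def triage(observations):
    """Walk the checks in the order 3 -> 4 -> 1 -> 5 -> 2.

    `observations` maps hypothetical check names to True when that check
    reveals a problem. Returns (steps_run, conclusion).
    """
    steps_run = [3]  # always start by locating where the 503 is generated
    checks = [
        (4, "route_misconfigured", "fix the route or pattern"),
        (1, "origin_unhealthy", "investigate why health probes fail"),
        (5, "backend_fails_directly", "fix the backend service"),
    ]
    for step, key, finding in checks:
        steps_run.append(step)
        if observations.get(key, False):
            return steps_run, finding  # stop at the first confirmed cause
    steps_run.append(2)  # detailed log analysis comes last
    return steps_run, "use request logs to refine the hypothesis"


unhealthy_case = triage({"origin_unhealthy": True})
nothing_found = triage({})
print(unhealthy_case)
print(nothing_found)
```

The early return is the point of the ordering: cheaper checks run first, so the slower log analysis is reached only when nothing earlier explains the 503.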
Troubleshooting Tree: Map requirements to features and capabilities of Azure Front Door
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary decision or by state) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, identify the observed symptom (503 error, elevated latency, outdated content, or absence of threat blocking) and locate the corresponding entry node at the root. From there, answer each diagnostic question based on what is observable in the Azure portal or Azure Monitor logs, without presuming the cause. Each branch progressively leads to an identified cause (red) that directly guides corrective action (green). Validation nodes (orange) indicate points where evidence must be collected before confirming the hypothesis.