
Troubleshooting Lab: Configure traffic acceleration

Diagnostic Scenarios

Scenario 1: Root Cause

An operations team reports that users in Europe are experiencing response latency above 800ms when accessing a web application hosted in Azure, even though Azure Front Door was deployed three weeks ago. North American users report normal latency of around 60ms.

The responsible engineer checks the Front Door profile and finds the following state:

Profile: fd-app-prod
SKU: Standard
Route: /api/* -> origin app-service-eastus
Route: /* -> origin app-service-eastus
Cache: Disabled
Health probe: HTTP/80 - interval 30s - Status: Healthy

During investigation, the engineer also observes that the application's TLS certificate was renewed last week and that the App Service is running with 35% CPU consumption, within normal range.

Resolving the application's hostname from a European client returns the following:

nslookup app.contoso.com
Response: app.contoso.com -> app-service-eastus.azurewebsites.net
Direct CNAME to origin, without Front Door intermediary

What is the root cause of high latency for European users?

A) Cache is disabled in the Front Door profile, forcing all requests to hit the origin without optimization
B) Front Door is not being used in the traffic path; DNS points directly to the App Service, bypassing Front Door
C) The TLS certificate was renewed and Front Door is still using the old version, causing excessive renegotiation
D) The Front Door Standard SKU doesn't have points of presence in Europe, forcing traffic to route through North America


Scenario 2: Action Decision

The cause of a production incident has been identified: in the Azure Front Door origin group, both origins are configured with the same priority (priority 1), although the second origin was intended to serve exclusively as failover (priority 2). As a result, Front Door is actively distributing traffic between the two origins. The second origin, located in a secondary region, is not scaled for production load and is generating intermittent 503 errors for users.

The additional context is:

  • The incident is active and affecting 30% of requests
  • The second origin is responding with high latency and sporadic errors
  • Changing priority in the origin group is a configuration operation that, according to Microsoft, causes no downtime
  • The security team requires formal approval for any changes to production network resources, a process that takes an average of 4 hours
  • There's an option to temporarily disable the second origin in the group, an action that doesn't require security approval according to internal policy

What is the correct action to take at this moment?

A) Wait for formal security approval and only then correct the origin priority, as any change without approval violates internal policy
B) Immediately correct the origin priority to 2, without waiting for approval, as it's a configuration change without downtime
C) Temporarily disable the second origin in the origin group to contain immediate impact, then initiate the approval process for definitive priority correction
D) Permanently remove the second origin from the origin group to eliminate the problem immediately


Scenario 3: Root Cause

A developer configured Azure Front Door to serve an e-commerce application. After a few days in production, the QA team reports that authenticated users are occasionally receiving responses from other users, such as incorrect shopping carts and mixed session data.

The engineer investigates and finds the following route rule configuration:

Route: /* -> origin group ecommerce-origins
Cache: Enabled
Cache duration: 1 hour
Cache key: Full URL (default)
Query strings: Ignore all

The origin App Service is healthy. Front Door logs show cache HIT for authenticated requests. The application uses session cookies to identify the user, and the origin response headers include:

Set-Cookie: sessionid=abc123; Path=/; HttpOnly
Vary: Cookie

What is the root cause of the observed behavior?

A) The App Service has session affinity issues, routing requests from the same user to different instances
B) Front Door is serving cached responses from one user to another, as the cache key doesn't include the session cookie, making personalized responses indistinguishable between users
C) The Vary: Cookie header is corrupting Front Door's cache, causing premature invalidation and resending old responses
D) The 1-hour cache duration is excessive for session data; reducing it to 5 minutes would solve the data mixing problem


Scenario 4: Diagnostic Sequence

A user reports that when accessing https://app.contoso.com, they intermittently receive HTTP 503 errors. Azure Front Door is configured as the entry point. The team needs to diagnose the problem in a structured way.

The following investigation steps are available:

  1. Check the status of origins in the Front Door origin group (health probe results)
  2. Analyze Front Door diagnostic logs in Log Analytics to identify if the 503 is generated by Front Door or passed from the origin
  3. Confirm if the DNS for app.contoso.com resolves to the Front Door endpoint and not directly to the origin
  4. Verify health probe configuration: protocol, port, and path are correct and correspond to what the origin actually responds to
  5. Test the origin directly (without going through Front Door) to determine if it responds successfully on the same route

What is the correct investigation sequence?

A) 3 -> 2 -> 1 -> 4 -> 5
B) 1 -> 4 -> 5 -> 2 -> 3
C) 2 -> 1 -> 3 -> 5 -> 4
D) 3 -> 1 -> 4 -> 5 -> 2


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The determining clue is in the nslookup output: the DNS for app.contoso.com resolves directly to app-service-eastus.azurewebsites.net, without going through Front Door. This means the DNS record was configured incorrectly as a CNAME pointing to the origin, not to the Front Door endpoint. Front Door is simply not in the European traffic path.

When Front Door is not being used, all acceleration benefits disappear: no routing through Microsoft's backbone, no intermediate PoP, no connection optimization. European traffic traverses the public internet to the origin in the East US region, explaining the 800ms latency.
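The DNS check from the scenario can be scripted. The following is a minimal sketch, assuming the CNAME chain has already been collected (for example, parsed from nslookup or dig output) and only needs to be classified. The function name and the corrected endpoint hostname are hypothetical; `azurefd.net` and `azurewebsites.net` are the default domains for Front Door endpoints and App Service.

```python
# Minimal sketch: decide whether a resolved CNAME chain passes through Front Door.
# Assumption: the caller supplies the chain (e.g. parsed from nslookup/dig output).

FRONT_DOOR_SUFFIX = ".azurefd.net"          # default Front Door endpoint domain
APP_SERVICE_SUFFIX = ".azurewebsites.net"   # default App Service domain


def _normalize(host: str) -> str:
    """Lowercase the hostname and drop surrounding spaces and a trailing dot."""
    return host.strip().rstrip(".").lower()


def classify_cname_chain(chain: list[str]) -> str:
    """Return 'front-door', 'direct-to-origin', or 'unknown' for a CNAME chain."""
    hosts = [_normalize(h) for h in chain]
    if any(h.endswith(FRONT_DOOR_SUFFIX) for h in hosts):
        return "front-door"
    if any(h.endswith(APP_SERVICE_SUFFIX) for h in hosts):
        return "direct-to-origin"
    return "unknown"


# Chain observed in Scenario 1: the custom domain points straight at the origin.
observed = ["app.contoso.com", "app-service-eastus.azurewebsites.net"]
print(classify_cname_chain(observed))

# Hypothetical corrected chain (endpoint hostname invented for illustration).
corrected = ["app.contoso.com", "fd-app-prod-example.z01.azurefd.net"]
print(classify_cname_chain(corrected))
```

A check of this kind only inspects hostnames; it does not prove that requests are actually served by Front Door, so it complements rather than replaces log analysis.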

The information about TLS certificate renewal is irrelevant to this diagnosis. Alternative C uses it as a classic distractor: it connects a recent event (the renewal) to the observed symptom (latency) even though there is no causal relationship. Alternative A is technically plausible, as disabled cache increases load on the origin, but it doesn't explain 800ms latency for a specific region. Alternative D is factually false: the Front Door Standard SKU has global PoPs, including in Europe.

The most dangerous distractor is alternative A: an operator acting on it would enable cache without solving the real problem, and European users would continue with high latency.


Answer Key: Scenario 2

Answer: C

The scenario presents two simultaneous and conflicting constraints: an active incident affecting users (urgency) and a security policy requiring approval for network resource changes (formal restriction). The correct action balances both.

Disabling the second origin doesn't require approval according to the described internal policy, and resolves immediate impact by removing the undersized origin from load balancing. Then, the formal approval process can be initiated for definitive priority correction, which is the correct structural solution.
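The effect of each option can be illustrated with a simplified model of priority-based origin selection. This is a sketch under stated assumptions, not Front Door's actual algorithm: it considers only the enabled flag, health state, and priority value, and ignores weights and latency sensitivity. The `Origin` class, the `active_origins` function, and the origin names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    name: str
    priority: int          # lower value = preferred
    enabled: bool = True
    healthy: bool = True


def active_origins(origins: list[Origin]) -> list[str]:
    """Names of origins that receive traffic in this simplified model:
    enabled and healthy origins sharing the lowest priority value."""
    candidates = [o for o in origins if o.enabled and o.healthy]
    if not candidates:
        return []
    best = min(o.priority for o in candidates)
    return [o.name for o in candidates if o.priority == best]


# Incident state: both origins share priority 1, so both receive traffic.
incident = [Origin("primary", 1), Origin("secondary", 1)]
# Containment (alternative C): the second origin is temporarily disabled.
contained = [Origin("primary", 1), Origin("secondary", 1, enabled=False)]
# Definitive fix: the second origin becomes failover-only with priority 2.
fixed = [Origin("primary", 1), Origin("secondary", 2)]
# Failover case after the fix: the primary origin is unhealthy.
failover = [Origin("primary", 1, healthy=False), Origin("secondary", 2)]

for label, group in [("incident", incident), ("contained", contained),
                     ("fixed", fixed), ("failover", failover)]:
    print(label, active_origins(group))
```

In this model, containment and the definitive fix both route all traffic to the primary origin, but only the priority fix keeps the secondary origin available for failover.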

Alternative A ignores the active impact: waiting 4 hours with 30% of production requests failing is unacceptable when a permitted containment action exists. Alternative B violates security policy, even though it would be technically correct in another context. Alternative D is destructive and irreversible: permanently removing a valid failover origin to solve a priority configuration problem is an overreaction that compromises the resilience of the architecture.

The correct reasoning in decision scenarios with constraints is: contain impact within permitted limits first, then correct root cause through formal channels.


Answer Key: Scenario 3

Answer: B

The core problem is the cache key configuration. Front Door is using the full URL as the default cache key and ignoring all query strings. This means two requests to GET /cart from different users with distinct session cookies generate the same cache key, and Front Door serves the first user's stored response to the second.
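The collision can be reproduced with a toy cache. The sketch below is not Front Door's implementation: it assumes a cache key built only from scheme, host, and path, with query strings ignored and cookies never consulted. The hostname, query strings, second session value, and function names are invented for illustration, and the origin is simulated.

```python
from urllib.parse import urlsplit


def cache_key(url: str, ignore_query: bool = True) -> str:
    """Build a simplified cache key from the URL only; cookies are never consulted."""
    parts = urlsplit(url)
    key = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if not ignore_query and parts.query:
        key += "?" + parts.query
    return key


cache: dict[str, str] = {}


def fetch(url: str, session_cookie: str) -> tuple[str, str]:
    """Return (cache status, body). The simulated origin personalizes by cookie."""
    key = cache_key(url)
    if key in cache:
        return "HIT", cache[key]
    body = f"cart for session {session_cookie}"  # simulated origin response
    cache[key] = body
    return "MISS", body


# Two different users request the cart; only their cookies and query strings differ.
first = fetch("https://app.contoso.com/cart?ref=home", "abc123")
second = fetch("https://app.contoso.com/cart?ref=mail", "xyz789")
print(first)
print(second)
```

The second call returns a HIT containing the first user's cart, which is the behavior the QA team reported.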

The critical clue is in the logs: cache HIT for authenticated requests. A personalized response to an authenticated request should never be served from a shared cache unless the session identifier is part of the cache key; otherwise, the origin must mark the response as non-cacheable.

The Vary: Cookie header in the origin response is an important signal that was ignored in configuration. This header instructs intermediaries to vary the cache based on cookie content. Front Door, however, doesn't include cookies in the cache key by default, and the lab configuration didn't override this behavior.
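For comparison, a generic HTTP cache that honors Vary extends the cache key with the values of the request headers named in that header. The sketch below illustrates that general mechanism only; it is not a description of how Front Door behaves, the function name and the second session value are hypothetical, and the special case `Vary: *` is not handled.

```python
def vary_secondary_key(vary: str, request_headers: dict[str, str]) -> tuple:
    """Build a secondary cache key from the request headers named in Vary."""
    # Header names are case-insensitive, so compare them in lowercase.
    lowered = {k.lower(): v for k, v in request_headers.items()}
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    return tuple((n, lowered.get(n, "")) for n in names)


user_a = {"Cookie": "sessionid=abc123"}
user_b = {"Cookie": "sessionid=xyz789"}

# With Vary: Cookie, the same path yields distinct keys for distinct sessions.
key_a = ("/cart", vary_secondary_key("Cookie", user_a))
key_b = ("/cart", vary_secondary_key("Cookie", user_b))
print(key_a != key_b)
```

Keying on the cookie separates users but makes the cache nearly useless for personalized pages, which is why marking such responses as non-cacheable is usually the simpler choice.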

Alternative A (App Service problem) is a plausible distractor because session data mixing can be caused by poorly configured session affinity, but cache HIT logs rule out this hypothesis: the problem occurs before reaching the origin. Alternative D describes a partially reasonable solution, but reducing TTL doesn't solve the cause: the problem is the absence of user differentiation in the key, not the duration.


Answer Key: Scenario 4

Answer: A

The correct sequence is: 3 -> 2 -> 1 -> 4 -> 5

The diagnostic reasoning progresses from the outermost layer to the innermost:

Step 3 (DNS): Before any Front Door analysis, confirm that traffic actually goes through it. If DNS doesn't point to Front Door, all other steps are irrelevant to the service diagnosis.

Step 2 (logs): With DNS confirmed, analyzing Front Door logs in Log Analytics shows where the 503 error originates: is it generated by Front Door itself (e.g., all origins marked as unhealthy) or is it a response passed from the origin?

Step 1 (health probe results): Now that we know Front Door is in the path, checking origin status reveals whether Front Door considers them healthy or not.

Step 4 (health probe configuration): If origins appear as unhealthy, verify if the probe is configured correctly. A probe with incorrect path or port will mark a healthy origin as failed.

Step 5 (direct origin test): Once the probe configuration has been reviewed, testing the origin directly confirms whether it is actually capable of responding. This distinguishes an actual origin failure from a probe configuration failure.
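The sequence above can be sketched as an ordered decision procedure. This is an illustrative outline, assuming each observation has already been gathered manually; the function name, dictionary keys, and conclusion strings are hypothetical and do not correspond to real Front Door log fields.

```python
def diagnose(obs: dict) -> str:
    """Walk the steps in the order 3 -> 2 -> 1 -> 4 -> 5 and return the first conclusion."""
    # Step 3: confirm Front Door is in the traffic path at all.
    if not obs["dns_points_to_front_door"]:
        return "DNS bypasses Front Door; fix the DNS record first"

    # Step 2: use the logs to see who produced the 503.
    if obs["error_source"] == "origin":
        return "503 is passed from the origin; investigate the application"

    # Step 1: check how Front Door sees the origins.
    if obs["origins_healthy"]:
        return "503 generated by Front Door with healthy origins; keep analyzing logs"

    # Steps 4 and 5: interpret the direct test in light of the probe configuration.
    probe_ok = obs["probe_config_matches_origin"]
    direct_ok = obs["origin_responds_directly"]
    if not direct_ok:
        return "actual origin failure"
    if not probe_ok:
        return "probe configuration failure; a healthy origin is marked unhealthy"
    return "inconclusive; origin responds and probe looks correct"


example = {
    "dns_points_to_front_door": True,
    "error_source": "front-door",
    "origins_healthy": False,
    "probe_config_matches_origin": False,
    "origin_responds_directly": True,
}
print(diagnose(example))
```

Because each check returns early, later observations are only consulted once the earlier layers have been ruled out, mirroring the outermost-to-innermost order.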

Sequence B starts with origin status without confirming that Front Door is in the path, skipping the most basic validation step. Sequence C starts with logs without checking DNS, introducing unnecessary ambiguity. Sequence D starts correctly with DNS but postpones log analysis (step 2) to the end, so the engineer inspects probes and tests the origin before knowing whether the 503 is generated by Front Door or passed from the origin, losing important context to interpret those results.


Troubleshooting Tree: Configure traffic acceleration


Legend:

  • Dark blue: initial symptom that triggers diagnosis
  • Medium blue: diagnostic question with verifiable answer in practice
  • Red: identified root cause
  • Green: recommended action or resolution
  • Orange: validation or verification step after correction

To use this tree when facing a real problem, start at the root node and answer each question based on what is directly observable: DNS result, Front Door logs, health probe status. Each answer eliminates an entire branch of hypotheses. Never advance to the next step without having verified the previous one in practice. Most traffic acceleration problems are resolved in the first three levels of the tree, before reaching origin investigation itself.