Troubleshooting Lab: Implement Azure Traffic Manager

Diagnostic Scenarios​

Scenario 1: Root Cause

The operations team received an alert at 2:32 PM stating that the api-westeurope endpoint was marked as Degraded in the Traffic Manager profile. The profile uses the Performance routing method and has three active endpoints. The service SLA requires 99.9% availability.

The on-call engineer verified the following in the portal:

Profile: tm-api-global.trafficmanager.net
Method: Performance
Probe protocol: HTTPS
Port: 443
Path: /health
Probe interval: 30 seconds
Failure tolerance: 3
Timeout per probe: 10 seconds

Endpoint status:
api-eastus → Online
api-westeurope → Degraded
api-southeastasia → Online

The engineer accessed https://api-westeurope.azurewebsites.net/health directly through the browser and received HTTP 200. The endpoint responded in 180 ms. The infrastructure team confirmed there were no configuration changes to the App Service in the last 6 hours. The endpoint's TLS certificate is valid for 11 more months.

What is the most likely root cause for the endpoint being marked as Degraded?

A) The endpoint's TLS certificate is configured incorrectly and the HTTPS probe rejects the connection due to handshake failure
B) The 10-second timeout per probe is insufficient for the latency between Microsoft's probe servers and the West Europe region
C) The /health route returns HTTP 200 in the browser, but returns a different code when accessed by Traffic Manager probe agents
D) The Performance method is not compatible with HTTPS probes on App Service type endpoints


Scenario 2: Action Decision

The cause of an outage has been identified: the Traffic Manager profile has TTL configured at 300 seconds and a critical endpoint became unavailable. Traffic Manager correctly removed the endpoint from rotation, but users in various regions continued receiving errors for up to 5 minutes after removal, as their DNS resolvers still had the record cached.

The team concluded that the TTL needs to be reduced. The environment has the following constraints:

  • The profile is in production and serves 40,000 requests per minute
  • The corporate DNS team warned that very low TTL values significantly increase load on resolvers and on Traffic Manager itself
  • The goal is to reduce failover propagation time without compromising service stability
  • A maintenance window is available in 48 hours

What is the correct action to take at this time?

A) Reduce the TTL to 0 immediately, as it eliminates the DNS cache problem without needing a maintenance window
B) Wait for the maintenance window and reduce the TTL gradually, evaluating the impact on resolution load before setting the final value
C) Replace Traffic Manager with Azure Front Door, which doesn't depend on DNS TTL for routing
D) Keep the TTL at 300 seconds and compensate by reducing the probe interval to 10 seconds, accelerating failure detection


Scenario 3: Root Cause

A developer reported that when testing Traffic Manager failover behavior, they noticed that users in São Paulo continue being directed to the app-brazil endpoint even after it was manually disabled in the portal 20 minutes ago. The profile uses the Geographic method with Brazil mapped to app-brazil.

The following test was executed on a machine in the corporate network in São Paulo:

$ nslookup tm-app.trafficmanager.net
Server: 10.0.0.1
Address: 10.0.0.1#53

Non-authoritative answer:
Name: tm-app.trafficmanager.net
Address: 20.201.28.100

The address 20.201.28.100 corresponds to the public IP of the app-brazil endpoint. The developer confirms that the endpoint status in the portal shows as Disabled. The profile TTL is configured at 60 seconds. The network team reports that the internal DNS resolver (10.0.0.1) has a 1800-second fixed TTL cache, regardless of the value declared in the record.

What is the root cause of the observed behavior?

A) The Geographic method doesn't respect manual endpoint deactivations; it requires the endpoint to fail health checks to be removed from rotation
B) The internal DNS resolver is ignoring the TTL declared by Traffic Manager and keeping the record cached for 1800 seconds
C) The 60-second TTL in the profile is insufficient for the Geographic method and should be at least 300 seconds
D) Traffic Manager takes up to 30 minutes to propagate endpoint state changes to regions outside the United States


Scenario 4: Diagnostic Sequence

An engineer received the following report: users in Europe are being directed to an endpoint in the United States, even with an endpoint in West Europe configured in the profile and marked as Online. The profile uses the Performance method.

The available investigation steps are:

  1. Verify if the West Europe endpoint is responding to the probe path with HTTP 200
  2. Confirm the routing method configured in the Traffic Manager profile
  3. Compare Microsoft's latency table for the European users' source IPs
  4. Check the status of the West Europe endpoint in the portal (Online, Degraded, or Disabled)
  5. Test DNS resolution from a European client using nslookup or dig

What is the most logical and efficient diagnostic sequence?

A) 2 → 4 → 1 → 5 → 3
B) 5 → 4 → 1 → 2 → 3
C) 4 → 1 → 2 → 5 → 3
D) 2 → 5 → 4 → 1 → 3


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: C

The decisive clue in the scenario is the combination of two facts: the endpoint responds with HTTP 200 when accessed through the browser, but is marked as Degraded by Traffic Manager. This indicates that the problem isn't with service availability itself, but with the response perceived by Microsoft's probe agents, which operate from specific points of presence and use strict criteria.

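Traffic Manager probe agents evaluate the HTTP response code returned by the configured path. By default only HTTP 200 counts as healthy, unless expected status code ranges are configured on the profile. If the /health route redirects to a login page, returns any 3xx code, or rejects unauthenticated requests, the probe will interpret this as a failure, even though the service is operational for human users whose browsers follow redirects and send session cookies.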
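To reproduce what a probe sees rather than what a browser sees, a minimal Python sketch like the one below can help. It is not Traffic Manager's implementation. The function names `is_healthy_status` and `probe_once` are hypothetical, the default of "only 200 is healthy" mirrors the default behavior described above, and the host in the usage note is the one from the scenario.

```python
import http.client
from typing import Iterable, Tuple


def is_healthy_status(
    status: int, expected: Iterable[Tuple[int, int]] = ((200, 200),)
) -> bool:
    """Return True only if status falls inside one of the inclusive ranges."""
    return any(low <= status <= high for low, high in expected)


def probe_once(host: str, path: str = "/health", timeout: float = 10.0) -> Tuple[int, bool]:
    """Send one HTTPS GET without cookies and without following redirects.

    http.client never follows redirects on its own, so a 302 to a login
    page is reported as 302, which is what matters for this diagnosis.
    Note: unlike this sketch, Traffic Manager does not validate the
    server certificate, so a certificate error here is a separate issue.
    """
    conn = http.client.HTTPSConnection(host, 443, timeout=timeout)
    try:
        conn.request("GET", path, headers={"User-Agent": "probe-sketch"})
        status = conn.getresponse().status
    finally:
        conn.close()
    return status, is_healthy_status(status)
```

Calling `probe_once("api-westeurope.azurewebsites.net")` from a machine outside the corporate network and comparing the returned status with what the browser shows is one way to confirm or rule out alternative C.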

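Alternative B is plausible, but the 180 ms response time in the manual test indicates that latency would not be a problem with a 10-second limit. Alternative A is ruled out because the scenario states the certificate is valid; in addition, Traffic Manager HTTPS probes do not validate the server certificate, so a certificate problem alone would not mark the endpoint as Degraded. Alternative D is technically false: HTTPS probes are fully supported with App Service endpoints.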

The most dangerous distractor is B, as it leads the engineer to change the timeout without investigating the response content, which wouldn't solve the problem.


Answer Key: Scenario 2

Answer: B

The scenario presents explicit constraints: a production environment under high load and a warning from the DNS team about the impact of very low TTLs. The only action that respects all constraints is to wait for the maintenance window available in 48 hours and reduce the TTL gradually, monitoring resolution load at each step.

Alternative A violates two constraints: Traffic Manager does accept a TTL of 0 seconds, but that value causes a severe increase in DNS queries and can overload a production service handling 40,000 requests per minute, and it skips the available maintenance window. Alternative C is a valid architectural decision in other contexts, but it doesn't solve the immediate problem and is outside the scope of a targeted corrective action. Alternative D confuses two independent mechanisms: the probe interval affects how quickly Traffic Manager detects a failure, but it doesn't reduce the time clients keep using a cached record, which is controlled by the TTL.

The central reasoning error of the distractors is treating TTL and probe interval as equivalent, or ignoring environment constraints when choosing the action.
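The difference between detection time and cache time can be made concrete with a rough calculation. The sketch below is an approximation under stated assumptions, not an official formula: it assumes an endpoint is marked Degraded after `tolerated_failures + 1` consecutive failed probes, ignores the per-probe timeout, and borrows the probe settings from Scenario 1 (30-second interval, tolerance of 3), since Scenario 2 does not state them. The function names are hypothetical.

```python
def detection_seconds(probe_interval: int, tolerated_failures: int) -> int:
    """Approximate time for Traffic Manager to mark an endpoint Degraded.

    Assumption: the endpoint is marked unhealthy after
    (tolerated_failures + 1) consecutive failed probes.
    """
    return probe_interval * (tolerated_failures + 1)


def worst_case_client_impact(ttl: int, probe_interval: int, tolerated_failures: int) -> int:
    """Approximate worst case before a well-behaved resolver stops
    handing out the failed endpoint: detection time plus one full TTL."""
    return detection_seconds(probe_interval, tolerated_failures) + ttl


if __name__ == "__main__":
    baseline = worst_case_client_impact(ttl=300, probe_interval=30, tolerated_failures=3)
    option_d = worst_case_client_impact(ttl=300, probe_interval=10, tolerated_failures=3)
    lower_ttl = worst_case_client_impact(ttl=60, probe_interval=30, tolerated_failures=3)
    print(f"baseline: {baseline}s, faster probes only: {option_d}s, TTL 60: {lower_ttl}s")
```

Under these assumptions, faster probing alone (alternative D) shortens detection but leaves the full 300 seconds of cache time in place, while lowering the TTL targets the part of the delay that the outage actually exposed.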


Answer Key: Scenario 3

Answer: B

The decisive information is explicit in the scenario and shouldn't be ignored: the internal DNS resolver caches records for a fixed 1800 seconds, regardless of the TTL declared by Traffic Manager. This means that even if Traffic Manager stops returning the disabled endpoint's IP immediately, the corporate resolver can continue answering with the cached value for up to 30 minutes. The test ran 20 minutes after the endpoint was disabled, which is still inside that window.

The 60-second TTL in the profile is irrelevant in this scenario because the intermediate resolver doesn't respect it. This is exactly the information that should be filtered as irrelevant for diagnosis.

Alternative A is false: a manually disabled endpoint is excluded from DNS responses without having to fail any health check. Alternative C inverts the logic: a lower TTL would be better for failover, not a higher value. Alternative D is a distractor without technical basis; Traffic Manager propagates changes globally in seconds, not minutes or hours.

The real risk here is the engineer focusing on the profile TTL (60 seconds) as the cause, without realizing that the corporate resolver is the real obstacle.
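The effect of a resolver that overrides the declared TTL can be simulated in a few lines. The sketch below is a toy model, not real resolver code: `CachingResolver` and `simulate` are hypothetical names, time is passed in explicitly instead of read from a clock, and `203.0.113.10` is a placeholder address from a documentation range standing in for whatever Traffic Manager would answer after app-brazil is disabled.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CachingResolver:
    """Toy recursive resolver that may override the TTL of what it caches."""

    upstream: Callable[[], Tuple[str, int]]  # returns (ip, declared_ttl)
    fixed_ttl: Optional[int] = None          # None means "honor declared TTL"
    _cached_ip: Optional[str] = None
    _expires_at: float = 0.0

    def resolve(self, now: float) -> str:
        # Refresh from upstream only when the cache is empty or expired.
        if self._cached_ip is None or now >= self._expires_at:
            ip, declared_ttl = self.upstream()
            ttl = self.fixed_ttl if self.fixed_ttl is not None else declared_ttl
            self._cached_ip = ip
            self._expires_at = now + ttl
        return self._cached_ip


def simulate(fixed_ttl: Optional[int]) -> str:
    """Return what a client sees 20 minutes after the endpoint is disabled."""
    answer = {"ip": "20.201.28.100"}  # app-brazil, as in the scenario
    resolver = CachingResolver(upstream=lambda: (answer["ip"], 60), fixed_ttl=fixed_ttl)
    resolver.resolve(now=0)           # record cached just before the change
    answer["ip"] = "203.0.113.10"     # hypothetical post-failover answer
    return resolver.resolve(now=20 * 60)


if __name__ == "__main__":
    print("resolver with fixed 1800 s cache:", simulate(1800))
    print("resolver honoring the 60 s TTL:  ", simulate(None))
```

In practice, the same comparison can be made by querying the corporate resolver and a public or authoritative resolver side by side and checking whether the answers differ.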


Answer Key: Scenario 4

Answer: A

The correct sequence is 2 → 4 → 1 → 5 → 3.

Efficient diagnostic reasoning starts from the broadest to the most specific:

  1. Confirm routing method (step 2): Before any endpoint investigation, it's necessary to confirm the method is actually Performance. A misconfiguration here would explain everything immediately.
  2. Check endpoint status (step 4): If the West Europe endpoint is Degraded or Disabled, Traffic Manager correctly directs to another endpoint. This step eliminates or confirms the most common hypothesis.
  3. Check probe response (step 1): If status is Online but behavior is unexpected, investigate if the probe is actually receiving valid responses.
  4. Test DNS resolution (step 5): Verify what the European client actually receives, confirming that Traffic Manager is returning the American endpoint.
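  5. Compare latency table (step 3): Only if all previous steps don't reveal the cause, investigate whether Microsoft's latency table classifies the European source as closer to the US. This is unusual but possible in corporate networks whose DNS resolver or proxy egress is located in the US, because Traffic Manager sees the IP of the recursive DNS resolver that sends the query, not the IP of the end user.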

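Alternative B starts with DNS before understanding the profile configuration and endpoint state, which can lead to hasty conclusions. Alternative C checks endpoint status and probe response before confirming the routing method, so a simple misconfiguration of the method would be found late. Alternative D is similar to the correct one but tests DNS resolution before checking endpoint status, spending effort on client-side tests when the portal could already explain the behavior.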
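The ordering logic of sequence A can be expressed as a list of checks that stops at the first conclusive finding. This is only an illustrative sketch: the observation keys, the step functions, and the wording of the findings are hypothetical, and in a real investigation each value would come from the portal, a probe test, or a DNS query.

```python
from typing import Callable, Dict, List, Optional

Observations = Dict[str, object]
Step = Callable[[Observations], Optional[str]]


def check_routing_method(obs: Observations) -> Optional[str]:
    if obs["routing_method"] != "Performance":
        return "step 2: profile is not using Performance routing"
    return None


def check_endpoint_status(obs: Observations) -> Optional[str]:
    if obs["endpoint_status"] != "Online":
        return f"step 4: West Europe endpoint is {obs['endpoint_status']}"
    return None


def check_probe_response(obs: Observations) -> Optional[str]:
    if obs["probe_status_code"] != 200:
        return f"step 1: probe path returns {obs['probe_status_code']}"
    return None


def check_dns_resolution(obs: Observations) -> Optional[str]:
    # If the client already resolves to West Europe, the symptom is not
    # reproduced; otherwise the symptom is confirmed and we continue.
    if obs["resolved_region"] == "westeurope":
        return "step 5: client resolves to West Europe; symptom not reproduced"
    return None


def check_latency_table(obs: Observations) -> Optional[str]:
    if obs["lowest_latency_region"] != "westeurope":
        return "step 3: latency table favors another region for this source"
    return None


# Sequence A from the answer key: 2, 4, 1, 5, 3.
ORDERED_STEPS: List[Step] = [
    check_routing_method,
    check_endpoint_status,
    check_probe_response,
    check_dns_resolution,
    check_latency_table,
]


def diagnose(obs: Observations) -> Optional[str]:
    """Run the checks in order and return the first conclusive finding."""
    for step in ORDERED_STEPS:
        finding = step(obs)
        if finding is not None:
            return finding
    return None
```

Putting the cheapest, broadest checks first means the more expensive latency comparison only runs when configuration, status, probe, and DNS checks have all come back clean.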


Troubleshooting Tree: Implement Azure Traffic Manager​


Color legend:

  • Dark blue: initial symptom, diagnostic entry point
  • Blue: diagnostic question, objective verification to be done
  • Red: identified cause
  • Green: recommended action or resolution

To use this tree on a real problem, start at the root node, which describes the unexpected routing symptom, and follow the branches by answering each question based on what you can observe directly in the portal, DNS logs, or probe tests. Each answer eliminates hypotheses and narrows the path to the cause. When you reach a red node, the cause is identified; the green node attached to it indicates the corresponding corrective action.