Troubleshooting Lab: Configure deployment slots for an App Service

Diagnostic Scenarios​

Scenario 1: Root Cause

The operations team performs a swap between the staging slot and the production slot of an App Service hosted on a Standard S2 plan. The swap completes without errors in the Azure portal. Minutes later, the database team reports that the production application is trying to connect to the staging database, causing end-user transaction failures.

The administrator checks the deployment history and confirms that no deployment was made directly to the production slot in the last 24 hours. The App Service has Always On enabled and Health Check configured to verify /health every 30 seconds. The database connection string is stored as an Application Setting called DB_CONNECTION_STRING.

Configuration state before the swap:

Slot: production
DB_CONNECTION_STRING = Server=prod-db.database.windows.net;...
Slot Setting: false

Slot: staging
DB_CONNECTION_STRING = Server=staging-db.database.windows.net;...
Slot Setting: false

What is the root cause of the observed problem?

A. The Health Check did not detect the connection failure in time to automatically revert the swap.

B. The DB_CONNECTION_STRING setting was not marked as a slot setting, so it migrated along with the code during the swap.

C. The Standard S2 plan does not support configuration isolation between slots, requiring the Premium plan.

D. Always On kept the staging database connection active in cache before the swap could update the configurations.


Scenario 2: Action Decision

The cause has already been identified: the administrator intended to promote the code in a hotfix slot to production, but executed the swap against the wrong target and swapped hotfix with staging instead. The error was identified 4 minutes after the swap. The production slot was not affected. The staging slot now contains the hotfix code, and the original staging code is in the hotfix slot.

The QA team needs the staging environment for a test suite scheduled to start within the next 20 minutes. The App Service plan is Premium P1v3. There is no open maintenance window for production at the moment.

What is the correct action to take at this time?

A. Perform a new deployment of the original staging code directly to the staging slot via CI/CD pipeline.

B. Execute a swap between the hotfix slot and the staging slot to restore the original state of both slots.

C. Delete the staging slot and recreate it from the hotfix slot using the slot cloning option.

D. Execute a swap between the hotfix slot and the production slot to ensure the hotfix reaches the correct environment before fixing staging.


Scenario 3: Root Cause

A developer reports that after enabling Auto Swap on the staging slot pointing to production, every commit that arrives at the staging slot triggers an immediate swap, as expected. However, after each swap, users report about 40 seconds of HTTP 503 responses before the application stabilizes.

The administrator checks the App Service logs and confirms that the production slot returns 503 during this interval. The application uses a framework that performs heavy initialization on the first request, including loading approximately 800 MB of in-memory cache. The staging slot is configured with 1 instance and the production slot with 3 instances. The plan is Premium P2v3. Health Check is enabled and configured correctly.

App Service Logs - slot: production
[INFO] 2026-03-15T14:02:11Z - Swap initiated from staging to production
[WARN] 2026-03-15T14:02:13Z - Instance prod-1: HTTP 503 on route /
[WARN] 2026-03-15T14:02:15Z - Instance prod-2: HTTP 503 on route /
[WARN] 2026-03-15T14:02:18Z - Instance prod-3: HTTP 503 on route /
[INFO] 2026-03-15T14:02:52Z - All instances healthy

What is the root cause of the observed unavailability period?

A. The Health Check is taking too long to detect that instances are ready because the verification interval is too high.

B. Auto Swap does not support applications with more than one instance in production, causing conflict during the swap.

C. The staging slot has only 1 warmed instance, and the 3 production instances receive cold code without prior warm-up, as Auto Swap does not perform warm-up per target instance.

D. The application framework has an initialization bug that only manifests when code is promoted via swap, not in direct deployments.


Scenario 4: Diagnostic Sequence

An administrator receives the following alert: the production slot of an App Service is returning HTTP 500 for all requests after a swap performed 15 minutes ago. The swap was done via Azure CLI by the DevOps team. The administrator needs to diagnose and resolve the problem with the least possible impact on users.

The following investigation steps are available, but were listed out of order:

[P] Check the production slot application logs to identify the exception being thrown
[Q] Execute a reverse swap between production and staging to restore the previous state
[R] Confirm that the staging slot was healthy before the swap by checking Health Check history
[S] Identify if any mandatory Application Setting is missing or has incorrect value in the production slot
[T] Compare slot settings configurations between production and staging at the current time

What is the correct investigation sequence?

A. R, P, T, S, Q

B. P, S, T, R, Q

C. T, P, R, S, Q

D. P, T, S, Q, R


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: B

The decisive clue is in the configuration state before the swap: DB_CONNECTION_STRING has Slot Setting: false in both slots. A setting that is not marked as a slot setting is not pinned to its slot, so it travels with the application content during the swap.

Before the swap, the staging slot had the value pointing to staging-db. After the swap, this value went to the production slot, replacing the production connection string.

Identifying irrelevant information: Always On and Health Check are real operational details, but they have no bearing on how configuration migrates during a swap. They were included to draw the reader toward application health mechanisms and away from the configuration mechanism.

The main reasoning error in the distractors is confusing the monitoring layer (Health Check) with the configuration layer (slot settings), or attributing the cause to a plan limitation that does not exist.

Acting on distractor A would mean adjusting the Health Check interval without fixing the real problem, so the connection errors would continue indefinitely.
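The migration behavior described above can be illustrated with a small model. This is a minimal sketch, not Azure code: `swap_settings` and its `sticky` parameter are hypothetical names, and the model assumes a swap simply exchanges every setting that is not marked as a slot setting.

```python
def swap_settings(source, target, sticky):
    """Simplified model of a slot swap (an assumption, not the Azure implementation).

    source/target: dicts mapping setting name -> value.
    sticky: names marked as slot settings; these stay with their slot.
    Returns the settings each slot ends up with: (new_source, new_target).
    """
    def merged(own, other):
        kept = {k: v for k, v in own.items() if k in sticky}
        moved = {k: v for k, v in other.items() if k not in sticky}
        return {**moved, **kept}

    return merged(source, target), merged(target, source)


staging = {"DB_CONNECTION_STRING": "Server=staging-db.database.windows.net;..."}
production = {"DB_CONNECTION_STRING": "Server=prod-db.database.windows.net;..."}

# Scenario 1 as configured: nothing is sticky, so the value travels with the code.
_, prod_after = swap_settings(staging, production, sticky=set())
print(prod_after["DB_CONNECTION_STRING"])

# Same swap with the setting marked as a slot setting: production keeps its value.
_, prod_fixed = swap_settings(staging, production, sticky={"DB_CONNECTION_STRING"})
print(prod_fixed["DB_CONNECTION_STRING"])
```

In a real environment, the fix corresponds to marking the setting as a slot setting, for example in the portal or with the `--slot-settings` option of `az webapp config appsettings set`, and then restoring the production value.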


Answer Key: Scenario 2

Answer: B

The cause is known: the swap was executed between hotfix and staging instead of between hotfix and production. The original staging code went to hotfix, and the hotfix code went to staging. The production slot was not affected.

The correct action is to execute a new swap between hotfix and staging, reversing exactly what was done. This swap restores the original staging code to the staging slot and returns the hotfix code to its original slot, without any impact on production.

The scenario constraints eliminate the other options:

  • A is technically valid, but it depends on pipeline run time and may not finish before the 20-minute QA window.
  • C is destructive and unnecessary; deleting and recreating a slot when a simple swap solves the problem adds operational cost with no benefit.
  • D would promote the hotfix to production without an open maintenance window, violating an explicit scenario constraint and creating a second problem.

The reverse swap between hotfix and staging is the fastest and safest action, and it satisfies every constraint presented.
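The reason a second swap restores the original state is that a swap between two slots is its own inverse. The sketch below models slots as a plain dictionary; the `swap` helper and the placeholder code labels are hypothetical and only illustrate the reasoning.

```python
def swap(slots, a, b):
    """Return a copy of `slots` with the content of slots a and b exchanged."""
    result = dict(slots)
    result[a], result[b] = slots[b], slots[a]
    return result


# Placeholder labels standing in for the code deployed to each slot.
before = {
    "production": "production-code",
    "staging": "staging-code",
    "hotfix": "hotfix-code",
}

after_mistake = swap(before, "hotfix", "staging")      # the accidental swap
restored = swap(after_mistake, "hotfix", "staging")    # option B: swap again

print(after_mistake["staging"])   # hotfix-code
print(restored == before)         # True
```

Production never appears in either call, which matches the scenario: the slot stays untouched throughout.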


Answer Key: Scenario 3

Answer: C

The observed behavior, about 40 seconds of HTTP 503 after each Auto Swap, is consistent with a missing warm-up on the target instances.
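The "about 40 seconds" figure can be checked against the log excerpt in the scenario. The sketch below parses the five log lines exactly as given; the `parse` helper and the variable names are illustrative, and the line format is assumed to be `[LEVEL] timestamp - message`.

```python
from datetime import datetime

LOG = """\
[INFO] 2026-03-15T14:02:11Z - Swap initiated from staging to production
[WARN] 2026-03-15T14:02:13Z - Instance prod-1: HTTP 503 on route /
[WARN] 2026-03-15T14:02:15Z - Instance prod-2: HTTP 503 on route /
[WARN] 2026-03-15T14:02:18Z - Instance prod-3: HTTP 503 on route /
[INFO] 2026-03-15T14:02:52Z - All instances healthy
"""


def parse(line):
    """Split one log line into (level, timestamp, message)."""
    level, rest = line.split("] ", 1)
    stamp, message = rest.split(" - ", 1)
    return level.lstrip("["), datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"), message


entries = [parse(line) for line in LOG.splitlines()]
swap_start = entries[0][1]
first_503 = min(ts for _, ts, msg in entries if "HTTP 503" in msg)
healthy = next(ts for _, ts, msg in entries if msg == "All instances healthy")

outage_seconds = (healthy - first_503).total_seconds()
since_swap_seconds = (healthy - swap_start).total_seconds()
print(outage_seconds, since_swap_seconds)  # 39.0 41.0
```

All three production instances report 503 within a few seconds of the swap, and none recovers until the final line, which fits an initialization delay on every target instance rather than a single faulty instance.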

The App Service warm-up mechanism ensures that the source slot is warmed up before the swap. However, when the staging slot has 1 instance and the production slot has 3 instances, only the staging instance was warmed up. The 3 production slot instances receive the code and need to perform heavy initialization (loading 800 MB cache) when receiving the first real requests, causing the 503s during this period.

The irrelevant information in this scenario is the Premium P2v3 plan. The plan tier does not influence the per-instance warm-up behavior.

Distractor A is the most dangerous: adjusting the Health Check does not solve the problem because the 503 is caused by application initialization, not by a health detection problem. Increasing the Health Check interval would only delay detection of actually unhealthy instances in the future.

The real solution would be to configure applicationInitialization in web.config to force proper warm-up before the swap completes, or to use a manual Swap with Preview instead of Auto Swap to control the process.


Answer Key: Scenario 4

Answer: A (R, P, T, S, Q)

The correct sequence follows progressive diagnostic logic: first eliminate hypotheses before acting.

| Order | Step | Justification |
|-------|------|---------------|
| 1 | R | Confirming that staging was healthy before the swap eliminates the hypothesis that the problem existed before promotion |
| 2 | P | Checking the production slot logs reveals the actual exception being thrown, which directs the diagnosis |
| 3 | T | Comparing slot settings between the slots identifies whether any critical configuration has the wrong value after the swap |
| 4 | S | Verifying whether any mandatory Application Setting is missing confirms or discards the missing-configuration hypothesis |
| 5 | Q | Only after confirming the cause and concluding that there is no quick fix is the reverse swap executed to restore production |

The distractors fail in two ways. Sequence D executes the reverse swap (Q) before the investigation is complete and leaves R for last; reverting without understanding the cause can mask the problem and lead to the same failure on the next deployment. Sequences B and C begin with P or T before R, so they analyze logs and configuration without first establishing the reference state.
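Steps T and S amount to a diff between the settings of the two slots. The sketch below assumes each slot's settings have been exported as a list of objects with `name`, `value`, and `slotSetting` keys, which is the shape `az webapp config appsettings list` is expected to return. The `diff_settings` helper, the `API_KEY` setting, and the sample values are hypothetical.

```python
def diff_settings(production, staging):
    """Report settings that are missing from one slot or differ in value."""
    prod = {s["name"]: s["value"] for s in production}
    stag = {s["name"]: s["value"] for s in staging}
    return {
        "missing_in_production": sorted(set(stag) - set(prod)),
        "missing_in_staging": sorted(set(prod) - set(stag)),
        "value_differs": sorted(
            name for name in set(prod) & set(stag) if prod[name] != stag[name]
        ),
    }


# Hypothetical exports taken after the swap.
production = [
    {"name": "DB_CONNECTION_STRING", "value": "Server=staging-db...", "slotSetting": False},
]
staging = [
    {"name": "DB_CONNECTION_STRING", "value": "Server=prod-db...", "slotSetting": False},
    {"name": "API_KEY", "value": "example", "slotSetting": True},
]

report = diff_settings(production, staging)
print(report)
```

A non-empty `missing_in_production` list points at step S (a mandatory setting is absent), while `value_differs` points at step T (a setting carries the wrong value after the swap).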


Troubleshooting Tree: Configure deployment slots for an App Service​


Color Legend:

| Color | Node Type |
|-------|-----------|
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary decision or verification) |
| Orange | Validation or intermediate verification |
| Red | Identified cause |
| Green | Recommended action or resolution |

To use this tree when facing a real problem, start at the root node, which describes the symptom observed after the swap. At each question node, answer based on what you can observe directly in the portal, in the logs, or via the CLI. Follow the path that matches your answer until you reach a red cause node. The adjacent green node then gives the corrective action, with no need for trial and error.
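The traversal described above can be sketched as a walk over a small dictionary. The diagram itself is not reproduced in this text, so every node below is hypothetical and loosely based on Scenario 1; only the node types (symptom, question, cause, action) come from the legend.

```python
# Hypothetical nodes for illustration; they are not taken from the actual diagram.
TREE = {
    "symptom": {
        "type": "symptom",
        "text": "Production uses the staging database after a swap",
        "next": "q_sticky",
    },
    "q_sticky": {
        "type": "question",
        "text": "Is DB_CONNECTION_STRING marked as a slot setting?",
        "yes": "cause_bad_value",
        "no": "cause_not_sticky",
    },
    "cause_not_sticky": {
        "type": "cause",
        "text": "The setting migrated with the code",
        "action": "Mark it as a slot setting and restore the production value",
    },
    "cause_bad_value": {
        "type": "cause",
        "text": "The production slot holds a wrong value",
        "action": "Correct the value in the production slot",
    },
}


def walk(tree, answers, start="symptom"):
    """Follow the tree from the symptom to a cause; answers maps question id -> bool."""
    path = [start]
    node = tree[start]
    while node["type"] != "cause":
        if node["type"] == "symptom":
            key = node["next"]
        else:
            key = node["yes"] if answers[path[-1]] else node["no"]
        path.append(key)
        node = tree[key]
    return path, node["text"], node["action"]


path, cause, action = walk(TREE, {"q_sticky": False})
print(" -> ".join(path))
print(cause, "|", action)
```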