Troubleshooting Lab: Create an App Service

Diagnostic Scenarios

Scenario 1: Root Cause

A development team deployed a Node.js 18 application to a newly created App Service in Azure. The App Service Plan is on the Standard S2 tier, running on Linux. The application was published successfully via GitHub Actions, and the pipeline reported a 200 status in the deploy stage.

When accessing the application's public URL, the browser consistently returns the following error:

Application Error
An error occurred while starting the application.

The team checks the diagnostic logs in the portal and finds:

[2025-10-14 13:42:01] INFO  Deployment successful
[2025-10-14 13:42:03] ERROR Failed to find a matching page for route: /
[2025-10-14 13:42:03] ERROR npm ERR! missing script: start
[2025-10-14 13:42:04] INFO Container exited with code 1

Additional information collected by the team:

  • The App Service Plan has 3 active instances configured for autoscale
  • Application Insights is enabled and showing 0ms latency (no requests reaching the app)
  • The GitHub repository is public
  • The application's package.json file does not contain the scripts.start property

What is the root cause of the observed error?

A) The autoscale with 3 instances is causing state conflicts during application initialization.
B) The App Service cannot start the application process because the entry script is not defined in package.json.
C) Application Insights is intercepting requests before they reach the application, causing silent failure.
D) The public GitHub repository is exposing sensitive environment variables that the App Service rejects due to security policy.


Scenario 2: Action Decision

A critical e-commerce application is running in the production slot of an App Service (Premium P2v3 tier, Windows). The development team performed a swap between the staging slot and production slot at 2 PM. At 2:08 PM, alerts started arriving that the HTTP 500 error rate increased from 0.2% to 38%.

The cause has already been identified: an environment variable called CONNECTION_STRING_DB was configured in the staging slot with a test value pointing to a development database. After the swap, this variable was promoted to production along with the code, replacing the real database connection string.

The current context is:

  • It's 2:11 PM and production traffic is degraded
  • The previous application (before the swap) is still available in the staging slot
  • The database team is in a meeting and cannot be contacted for the next 20 minutes
  • The correct CONNECTION_STRING_DB production value is not documented in any accessible runbook at the moment

What is the correct action to take at this moment?

A) Execute an immediate new swap to revert the staging slot back to production, restoring the previous environment including its environment variables.
B) Access the App Service configuration panel in production and manually update the CONNECTION_STRING_DB with the correct value by consulting the database team.
C) Scale the App Service horizontally to 10 instances to distribute the load and reduce the impact of errors while the fix is prepared.
D) Temporarily disable Application Insights to prevent alerts from continuing to fire during the investigation.


Scenario 3: Root Cause

An administrator created an App Service to host a REST API developed in .NET 8. The App Service Plan is Basic B2, running on Windows. The application works correctly in local development.

After deployment, the QA team reports that when accessing the /health endpoint, the response takes more than 30 seconds on the first call after a period of inactivity, but responds normally on subsequent calls.

The administrator accesses the settings and observes:

Always On: Off
ARR Affinity: On
HTTP version: 1.1
TLS minimum version: 1.2

Additional information:

  • The App Service is in a different region from the QA team (Brazil South vs East US)
  • QA tests are always done from an automated script that waits 45 minutes between each execution
  • The App Service Plan has 2 instances configured manually
  • The deployment was done via Visual Studio Publish and the final status was Publish succeeded

What is the root cause of the observed behavior?

A) The geographic latency between Brazil South and East US is causing timeout on the first request after inactivity.
B) The App Service is unloading the application process after inactivity and the configuration that would keep the process active is disabled.
C) The enabled ARR Affinity is redirecting the first request to a cold instance while the other instance is active.
D) The configured HTTP 1.1 is causing handshake overhead on the first connection after the load balancer idle timeout.


Scenario 4: Diagnostic Sequence

An App Service in production started returning HTTP 503 error for all requests after a configuration update performed by the operations team. The team needs to diagnose the cause before acting.

The following investigation steps are available, out of order:

  • Step P: Verify the App Service Plan status and confirm if the current tier supports the configured number of instances
  • Step Q: Access the App Service Kudu (SCM) and verify if the application process is running in Process Explorer
  • Step R: Confirm if recent configuration changes included changing the App Service Plan tier to a lower tier
  • Step S: Check application logs in Log Stream to identify if there are exceptions during application startup
  • Step T: Check the Health Check panel in the App Service to validate if the health endpoint is responding internally

Which sequence represents the correct diagnostic reasoning, going from broadest to most specific?

A) R, P, Q, T, S
B) Q, S, T, R, P
C) S, T, Q, P, R
D) T, Q, S, R, P


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The log is explicit about the causal sequence: npm ERR! missing script: start is followed by Container exited with code 1. In this scenario, the App Service runtime for Node.js on Linux attempts to execute npm start as the entry point. Because the package.json doesn't define scripts.start, the process cannot start and the container terminates immediately with exit code 1, resulting in the "Application Error" displayed to the user.
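The check that would have caught this before deployment can be sketched in a few lines. This is a minimal illustration, not part of the lab: the function name has_start_script and the two sample manifests are hypothetical, and the sketch only tests for the scripts.start property named in the scenario.

```python
import json


def has_start_script(package_json_text: str) -> bool:
    """Return True if the manifest defines a non-empty scripts.start entry."""
    manifest = json.loads(package_json_text)
    if not isinstance(manifest, dict):
        return False
    scripts = manifest.get("scripts")
    if not isinstance(scripts, dict):
        return False
    start = scripts.get("start")
    return isinstance(start, str) and start.strip() != ""


# Hypothetical manifests, for illustration only.
broken = '{"name": "demo-app", "version": "1.0.0", "scripts": {"test": "jest"}}'
fixed = '{"name": "demo-app", "version": "1.0.0", "scripts": {"start": "node server.js"}}'

print(has_start_script(broken))  # False: npm start would fail with "missing script: start"
print(has_start_script(fixed))   # True
```

A check like this could run as an early pipeline step, so a missing entry script fails the build instead of surfacing as a container exit after a "successful" deployment.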

Irrelevant information: The autoscale configuration (3 active instances) has no bearing on the startup failure. Autoscale doesn't interfere with the application runtime's initialization process. This detail was included to steer the diagnosis toward infrastructure issues when the actual problem is in the application's configuration.

The most dangerous distractor is A, because state conflicts across multiple instances are a real problem in App Service, but they would manifest as inconsistent behavior between requests, not as a total and immediate startup failure. Distractor C is implausible: Application Insights operates as an SDK within the application or as a passive agent, never as an interceptor that prevents initialization. Distractor D describes behavior that does not exist in Azure.

Acting based on A would lead the team to reduce instances or disable autoscale, which would solve nothing and consume incident time unnecessarily.


Answer Key: Scenario 2

Answer: A

The reverse swap is the only action that restores the production environment to the previous functional state in less than 2 minutes, without needing to know the correct connection string value. When a swap is executed, Azure exchanges the slots including their environment variables not marked as slot settings. The staging slot still contains the previous production environment (with the correct connection string), as the swap doesn't destroy the origin slot content.
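The swap behavior described above can be modeled with a small simulation. This is a deliberately simplified sketch: the Slot class, the swap function, and the values "prod-db" and "dev-db" are hypothetical stand-ins, and the real platform does more during a swap than this model captures. It only shows why a second swap restores the previous state, and how a setting marked as a slot setting would have stayed with its slot.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    code_version: str
    settings: dict = field(default_factory=dict)
    sticky: set = field(default_factory=set)  # names marked as slot settings


def swap(a: Slot, b: Slot) -> None:
    """Exchange code and non-sticky settings; sticky settings stay with their slot."""
    a.code_version, b.code_version = b.code_version, a.code_version
    a_sticky = {k: v for k, v in a.settings.items() if k in a.sticky}
    b_sticky = {k: v for k, v in b.settings.items() if k in b.sticky}
    a_moving = {k: v for k, v in a.settings.items() if k not in a.sticky}
    b_moving = {k: v for k, v in b.settings.items() if k not in b.sticky}
    a.settings = {**b_moving, **a_sticky}
    b.settings = {**a_moving, **b_sticky}


# The scenario: the variable is NOT marked as a slot setting, so it travels with the code.
production = Slot("v1", {"CONNECTION_STRING_DB": "prod-db"})
staging = Slot("v2", {"CONNECTION_STRING_DB": "dev-db"})

swap(production, staging)
print(production.code_version, production.settings)  # v2 {'CONNECTION_STRING_DB': 'dev-db'}

swap(production, staging)  # the reverse swap from answer A
print(production.code_version, production.settings)  # v1 {'CONNECTION_STRING_DB': 'prod-db'}

# Same swap, but with the variable marked as a slot setting on both slots.
production = Slot("v1", {"CONNECTION_STRING_DB": "prod-db"}, {"CONNECTION_STRING_DB"})
staging = Slot("v2", {"CONNECTION_STRING_DB": "dev-db"}, {"CONNECTION_STRING_DB"})
swap(production, staging)
print(production.code_version, production.settings)  # v2 {'CONNECTION_STRING_DB': 'prod-db'}
```

In this model, the reverse swap needs no knowledge of the correct connection string, which is the property that makes it usable while the database team is unreachable.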

Distractor B is correct as a definitive solution, but is unfeasible in the given context: the database team is unavailable for 20 minutes and the correct value is not documented. Executing B would mean leaving 38% HTTP 500 errors in production for at least 20 minutes without resolution guarantee.

Distractor C is the most dangerous: horizontally scaling an App Service that is failing due to incorrect configuration only multiplies the failing instances, increasing resource consumption without reducing the failure rate. Distractor D is operationally incorrect: disabling Application Insights doesn't solve the underlying problem and removes visibility into the incident.

The central lesson of this scenario is that reverse swap is a real and immediate rollback mechanism, and should be the first action when the problem cause is the swap itself.


Answer Key: Scenario 3

Answer: B

The described symptom (first response slow after inactivity period, subsequent responses normal) is the classic cold start pattern caused by application process unloading. The configuration that prevents this behavior is Always On, which appears explicitly as Off in the settings shown in the prompt. With Always On disabled, the App Service unloads the worker process after a period without requests and needs to restart it on the next call.

Irrelevant information: The geographic distance between Brazil South and East US was inserted deliberately to pull the diagnosis toward network latency. However, geographic latency would be roughly constant across all calls, not limited to the first call after inactivity. The 45-minute interval between QA script executions confirms that the idle period is the trigger, not the distance.
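The distinction between a cold start and constant network latency can be expressed as a simple comparison of response times. The sketch below is illustrative only: the function looks_like_cold_start, the idle threshold, the slow factor, and all sample measurements are assumptions chosen for the example, not values taken from the platform or from the scenario's data.

```python
from statistics import median


def looks_like_cold_start(samples, idle_threshold_s=1200.0, slow_factor=5.0):
    """Compare latency after long idle gaps with latency on warm requests.

    samples: list of (idle_seconds_before_request, latency_ms) pairs.
    Both thresholds are illustrative assumptions, not platform values.
    """
    after_idle = [lat for idle, lat in samples if idle >= idle_threshold_s]
    warm = [lat for idle, lat in samples if idle < idle_threshold_s]
    if not after_idle or not warm:
        return False  # not enough evidence either way
    return median(after_idle) >= slow_factor * median(warm)


# Hypothetical measurements: slow only after a 45-minute (2700 s) gap.
cold_pattern = [(2700, 31000), (5, 180), (5, 175), (2700, 32500), (4, 190)]

# Hypothetical measurements: uniformly slower, as distance alone would produce.
distance_pattern = [(2700, 400), (5, 390), (5, 410)]

print(looks_like_cold_start(cold_pattern))      # True
print(looks_like_cold_start(distance_pattern))  # False
```

The second data set is the shape that answer A would predict: every call pays the same cost regardless of how long the app sat idle.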

Distractor C (ARR Affinity) is the most sophisticated: ARR Affinity can indeed cause inconsistent routing between instances, but its effect would be directing different users to different instances persistently, not causing slowness on the first request reproducibly. Distractor D (HTTP 1.1) describes real overhead in other contexts, but is not specific to the "slow only after inactivity" pattern.


Answer Key: Scenario 4

Answer: A

The correct sequence is R, P, Q, T, S, which follows diagnostic logic from broadest (infrastructure) to most specific (internal application process):

  1. R: Confirming if there was a tier change is the first step because the configuration update mentioned in the prompt is the triggering event. Verifying what changed is always the starting point in post-change incidents.
  2. P: Validating if the current tier supports the configured instances eliminates or confirms an infrastructure cause before investigating the application.
  3. Q: Verifying if the process is running in Kudu confirms whether the problem is platform or application related.
  4. T: Health Check validates if the application responds internally, separating routing failure from application failure.
  5. S: Log Stream is the most time-consuming and specific step; it should be used to confirm the cause after eliminating infrastructure hypotheses.
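The broad-to-specific ordering above can be sketched as an ordered checklist that stops at the first step revealing a problem. This is a hypothetical illustration: the function first_failing_step and the lambda outcomes are invented stand-ins for checks an operator would perform manually in the portal, Kudu, or Log Stream.

```python
from typing import Callable, List, Optional, Tuple

# (step id, description, callable that returns True when the check finds no problem)
Check = Tuple[str, str, Callable[[], bool]]


def first_failing_step(checks: List[Check]) -> Optional[str]:
    """Run checks in the given order and return the id of the first one that fails."""
    for step_id, _description, is_healthy in checks:
        if not is_healthy():
            return step_id
    return None


# Hypothetical outcomes for one incident; each lambda stands in for a manual check.
checks: List[Check] = [
    ("R", "Recent changes did not lower the plan tier", lambda: True),
    ("P", "Current tier supports the configured instance count", lambda: False),
    ("Q", "Application process is running in Kudu", lambda: True),
    ("T", "Health Check endpoint responds internally", lambda: True),
    ("S", "Log Stream shows no startup exceptions", lambda: True),
]

print(first_failing_step(checks))  # P
```

With these assumed outcomes the investigation ends at step P, before any time is spent on process or log inspection.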

Sequence B starts with Kudu (Q), which is process diagnosis before validating infrastructure, inverting the correct order. Sequence C starts with application logs (S), which is the deepest and most costly investigation level, without first eliminating infrastructure causes. Sequence D starts with Health Check (T), which presupposes the application is running before confirming this.

The most common reasoning error in this type of scenario is going straight to Log Stream because it is the most familiar tool, ignoring that platform causes (incorrect tier, instances above limit) produce HTTP 503 without generating meaningful application logs.


Troubleshooting Tree: Create an App Service

Color Legend:

| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary or verifiable decision) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |

To use this tree when facing a real problem, start with the root node describing the observed symptom and follow the branches by answering each question based on what you can directly verify in the portal, Kudu, or logs. Always respond with what you observed, not what you suspect. Each orange node represents a point where you need to collect evidence before continuing. When reaching a red node, you've identified the cause; when reaching a green node, you have the action to execute.