Troubleshooting Lab: Provision an App Service plan
Diagnostic Scenarios
Scenario 1 - Root Cause
The operations team reports that a production Web App has been responding with increasing latency over the past few hours. The team confirms that no code changes have been made recently. The App Service plan being used is Standard S1, with two instances configured via manual scale-out. The application runs on a Windows plan in the East US region.
During investigation, the engineer runs the following command to check the plan status:
    az appservice plan show \
      --name plano-producao \
      --resource-group rg-app \
      --query "{sku:sku.name, workers:sku.capacity, status:status}" \
      --output table
The returned output is:
    Sku    Workers    Status
    -----  ---------  --------
    S1     2          Ready
Next, the engineer checks the plan metrics in the portal and observes the following:
| Metric | Observed Value |
|---|---|
| CPU Percentage | 94% |
| Memory Percentage | 41% |
| Disk Queue Length | 0 |
| HTTP Queue Length | 312 |
The engineer also notes that three other Web Apps were added to the same App Service plan two days ago, bringing the total to five applications on the plan. The development team mentions that a network instability was recorded in the East US region yesterday and has since been resolved.
What is the root cause of the observed performance degradation?
A) Residual network instability in the East US region affecting the application's connectivity.
B) Insufficient instances in the plan to absorb the total load generated by the five applications sharing the same resources.
C) Memory leak in the application, evidenced by the progressive growth in latency.
D) The S1 tier doesn't support more than two simultaneous instances, creating a capacity bottleneck.
Scenario 2 - Action Decision
The cause of the problem has already been identified: the production App Service plan is on the Free (F1) tier and has reached the daily CPU minutes limit allowed by the tier. The application returns HTTP 403 with the message below to all users until the counter resets the next day.
    The Free plan has reached its quota.
    Please upgrade your App Service plan.
The environment has the following constraints:
- The application is an order system for a retail chain, with critical operation throughout the day
- The infrastructure team has permission to modify resources in the production Resource Group
- There is no scheduled maintenance window available
- The development team is unavailable at this time
- The company has a service level agreement with customers that requires service restoration within 30 minutes
What is the correct action to take at this moment?
A) Wait for the automatic CPU limit reset the next day, as any plan changes outside the maintenance window represent operational risk.
B) Delete the current Web App and recreate the entire application on a new App Service plan with Basic tier or higher.
C) Upgrade the App Service plan to a paid tier (Basic or higher) directly through the portal or via CLI, without needing to recreate the application.
D) Create a new App Service plan on Standard tier and move the application via clone to the new plan, ensuring zero downtime.
Scenario 3 - Root Cause
An engineer tries to execute the following command to create a new Web App and associate it with an existing App Service plan:
    az webapp create \
      --name minha-app-linux \
      --resource-group rg-producao \
      --plan plano-existente \
      --runtime "NODE|18-lts"
The command returns the following error:
    The plan 'plano-existente' is not a valid option.
    WebApp 'minha-app-linux' requires a Linux App Service Plan.
    (WebSpacesClient.CreateOrUpdateWebspace) ErrorCode=InvalidWebSpaceRequest
The engineer checks the existing plan with the command below:
    az appservice plan show \
      --name plano-existente \
      --resource-group rg-producao \
      --query "{sku:sku.name, os:kind, region:location}" \
      --output table
Output:
    Sku    Os    Region
    -----  ----  --------
    B2     app   eastus
The engineer also reports that the rg-producao Resource Group already contains another Linux App Service plan in the same region, created three months ago, currently with no associated applications. The NODE|18-lts runtime is supported on B2 tier.
What is the root cause of the error?
A) The NODE|18-lts runtime requires Standard tier or higher and is not compatible with B2.
B) The plano-existente plan is a Windows plan and cannot host applications that require Linux runtime.
C) It's not possible to have two Linux App Service plans in the same Resource Group, causing a conflict.
D) The error occurs because the rg-producao Resource Group is in a different region from the application being created.
Scenario 4 - Collateral Impact
An administrator identifies that an application's App Service plan is on Standard S2 tier with 5 active instances, configured via autoscale with CPU-based rules. To reduce costs immediately, the administrator downgrades the plan to Basic B2 tier.
The operation completes successfully and the application continues responding normally in the first few minutes after the change.
What secondary consequence can this downgrade cause?
A) The 5 existing instances will be maintained, but the cost per instance for Basic tier will be higher than Standard, generating no real savings.
B) The autoscale rules configured in the plan will stop working, as Basic tier doesn't support metric-based autoscale.
C) The application will immediately lose all configured deployment slots, causing staging environment unavailability.
D) The custom SSL certificate associated with the application will be revoked, as custom domains with SSL require Standard tier or higher.
Answer Key and Explanations
Answer Key - Scenario 1
Answer: B
The decisive clue is the combination of two facts: three new Web Apps were added to the plan two days ago, and the metrics show CPU at 94% with an HTTP Queue Length of 312. All applications in an App Service plan share the same compute resources, so the additional apps raised demand on the same two S1 instances until CPU saturated and requests began to queue.
The information about network instability in the East US region is deliberately irrelevant: the event was already resolved and wouldn't explain persistently high CPU. Choosing A would be the mistake of diagnosing based on the most recent visible event, ignoring the metric data.
Alternative C is incorrect because memory is at 41%, with no sign of pressure. Alternative D is technically false: S1 tier supports up to 10 instances. The most dangerous distractor is A, as it diverts investigation to an external event already closed, delaying the real solution.
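
The elimination reasoning above can be sketched as a small shell helper. This is an illustrative sketch only: the function name `classify_bottleneck` and the thresholds (80% for CPU and memory, 100 for the HTTP queue) are assumptions chosen for the example, not limits defined by Azure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: name the saturated resource from plan-level metrics.
# Thresholds below are illustrative assumptions, not Azure-defined limits.
classify_bottleneck() {
  local cpu=$1 mem=$2 http_queue=$3
  if [ "$cpu" -ge 80 ]; then
    # A long HTTP queue corroborates that requests are waiting for CPU.
    if [ "$http_queue" -gt 100 ]; then
      echo "cpu-saturated (requests queuing)"
    else
      echo "cpu-saturated"
    fi
  elif [ "$mem" -ge 80 ]; then
    echo "memory-pressure"
  else
    echo "no-plan-level-saturation"
  fi
}

# Values from Scenario 1: CPU 94%, memory 41%, HTTP queue length 312.
classify_bottleneck 94 41 312
```

With the scenario's values the helper reports CPU saturation with queued requests, which matches answer B and rules out the memory-leak hypothesis. In a real investigation the inputs would come from the plan metrics in the portal or from Azure Monitor.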
Answer Key - Scenario 2
Answer: C
Upgrading an App Service plan is an in-place operation that requires neither recreating the application nor a planned downtime window. The Web App remains associated with the same plan and resumes serving once the tier change takes effect, because Basic and higher tiers run on dedicated instances and are not subject to the Free tier's daily CPU quota. Given the 30-minute SLA, this is the only option that restores service within the deadline without adding risk.
Alternative A completely ignores the SLA constraint, making it unacceptable. Alternative B describes a technically possible action, but unnecessarily destructive and slow, violating the SLA. Alternative D is incorrect because "cloning" an application is not equivalent to a move and introduces unnecessary complexity and risk when direct upgrade solves the problem. The most dangerous distractor is D, as it seems technical and careful, but ignores that simplicity and speed are determining factors given the active SLA context.
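
As a rough sketch of what option C looks like from the CLI, the helper below only prints the `az appservice plan update` command so it can be reviewed before being run against production. The function name `build_upgrade_cmd` and the plan name `plano-pedidos` are hypothetical placeholders, and `B1` is just one possible paid tier.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the in-place upgrade command
# instead of executing it, so it can be reviewed first.
build_upgrade_cmd() {
  local plan=$1 group=$2 sku=$3
  printf 'az appservice plan update --name %s --resource-group %s --sku %s\n' \
    "$plan" "$group" "$sku"
}

# plano-pedidos and rg-producao are placeholder names for this example.
build_upgrade_cmd plano-pedidos rg-producao B1
```

Running the printed command changes the tier of the existing plan; the Web App itself is not touched, which is why no redeployment or involvement from the development team is needed.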
Answer Key - Scenario 3
Answer: B
The output of the az appservice plan show command shows app in the Os column (the plan's kind property), which identifies a Windows plan. The NODE|18-lts runtime string selects a Linux runtime stack, which can only be hosted on a Linux App Service plan. This operating system mismatch between the plan and the requested runtime is exactly what the error message reports.
The existence of another Linux plan in the Resource Group is a deliberate distractor: it neither causes nor prevents the error, because the command targeted plano-existente, not that plan. Alternative C reverses the real Azure logic: multiple Linux plans can coexist in the same Resource Group. Alternative A is false because the scenario states that NODE|18-lts is supported on B2. Alternative D has no support in the data: nothing indicates a region mismatch, and the error message points to the plan's operating system. The most dangerous distractor is C, as the engineer might erroneously conclude that the existing Linux plan is blocking the operation and delete it unnecessarily.
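
The check the engineer performed by eye can be sketched as a helper that maps a plan's `kind` and `reserved` properties to an operating system. The function name `plan_os` is hypothetical; the sketch assumes that Linux plans report a `kind` containing `linux` or `reserved` set to `true`, and treats everything else as Windows.

```shell
#!/usr/bin/env bash
# Hypothetical helper: infer the plan OS from its kind and reserved properties.
# Assumption: Linux plans have kind containing "linux" or reserved=true;
# anything else (for example kind "app") is treated as Windows.
plan_os() {
  local kind=$1 reserved=$2
  case "$kind" in
    *linux*) echo "linux"; return 0 ;;
  esac
  if [ "$reserved" = "true" ]; then
    echo "linux"
  else
    echo "windows"
  fi
}

# Values matching plano-existente in Scenario 3 (kind "app").
plan_os app false
```

The two inputs could be read from the plan with a query such as `--query "{kind:kind, reserved:reserved}"` on `az appservice plan show`. For plano-existente the helper reports a Windows plan, so a Linux runtime needs a different target plan.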
Answer Key - Scenario 4
Answer: B
The Basic tier doesn't support metric-based autoscale. When downgrading from Standard to Basic, the autoscale rules configured in the plan stop being executed. The plan operates only with manual scale-out. In a future traffic spike scenario, the plan won't scale automatically, potentially causing degradation or unavailability that was previously handled automatically.
Alternative A is technically false: the cost per instance for Basic is lower than for Standard, and Basic allows at most three instances, so five could not be retained in any case. Alternative D is incorrect because custom domains with SSL continue to be supported in Basic tier. Alternative C is the most dangerous distractor: deployment slots are indeed removed when downgrading from Standard to Basic, but the scenario specifies that the application "continues responding normally in the first few minutes," indicating there were no active slots in use in this case. The real impact relevant to the scenario is the silent loss of autoscale, which will only manifest the next time the load increases.
Troubleshooting Tree: Provision an App Service plan
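
A pre-change check along these lines could surface the silent autoscale loss before the downgrade is applied. This is a minimal sketch: the function names are hypothetical, and the mapping assumes SKU names such as F1, D1, B2, S2, or P1v3, where only names starting with S, P, or I (Standard, Premium, Isolated) support metric-based autoscale.

```shell
#!/usr/bin/env bash
# Hypothetical check: does a SKU name belong to a tier with metric-based autoscale?
# Assumes SKU names like F1, D1, B2, S2, P1v3, I1 (not tier words like "Shared").
supports_autoscale() {
  case "$1" in
    S*|P*|I*) return 0 ;;  # Standard, Premium, Isolated
    *)        return 1 ;;  # Free (F1), Shared (D1), Basic (B1-B3)
  esac
}

# Warn when a tier change would drop autoscale support.
check_downgrade() {
  local from=$1 to=$2
  if supports_autoscale "$from" && ! supports_autoscale "$to"; then
    echo "warning: autoscale rules will stop applying on $to"
  else
    echo "ok: no autoscale capability lost"
  fi
}

# The change from Scenario 4: Standard S2 to Basic B2.
check_downgrade S2 B2
```

If the warning fires, listing the existing autoscale settings for the resource group (for example with `az monitor autoscale list`) shows which rules would be left without effect after the change.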
Color Legend:
| Color | Node Type |
|---|---|
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary or state decision) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, always start from the root node, which describes the observed symptom. At each question node, answer based on what you can verify directly in the portal, the CLI, or the plan metrics. Follow the path that matches your observation until you reach a red cause node, then execute the corresponding green action. If the problem doesn't fit the obvious path, return to the previous node and consider the alternative hypothesis before acting.