Troubleshooting Lab: Deploy and configure an Azure Virtual Machine Scale Set
Diagnostic Scenarios
Scenario 1: Root Cause
A production VMSS with 8 instances is configured with autoscaling enabled. The monitoring team opens a ticket reporting that over the past 6 hours, the number of instances never exceeded 4, even with average CPU consistently above 80% for periods exceeding 10 minutes.
The administrator checks the Activity Log and finds scale-out events being generated correctly by the autoscaler. The subscription has sufficient quota for the SKU used. The VMSS is deployed in the East US region with availability zones 1, 2, and 3 enabled.
When executing the command below, the administrator gets the following output:
az vmss show \
  --name vmss-prod-api \
  --resource-group rg-producao \
  --query "sku" \
  --output json
{
  "capacity": 4,
  "name": "Standard_D2s_v3",
  "tier": "Standard"
}
Next, they verify the autoscaling profile:
az monitor autoscale show \
  --name autoscale-vmss-prod \
  --resource-group rg-producao \
  --query "profiles[0].capacity" \
  --output json
{
  "default": "2",
  "maximum": "4",
  "minimum": "2"
}
The network team reports that during the same time window, there was a latency incident with the VPN gateway connecting the on-premises environment to Azure, but external access to the VMSS via Load Balancer was never interrupted.
What is the root cause of the instance number limitation?
A) The subscription quota for the Standard_D2s_v3 SKU in the East US region has been reached, preventing provisioning of new instances beyond 4.
B) The VPN gateway incident degraded communication between the monitoring agent and Azure Monitor, causing CPU metrics to arrive delayed, delaying scale-out decisions.
C) The maximum value in the autoscaling profile is set to 4, preventing the autoscaler from exceeding this limit even under high load.
D) The distribution across availability zones is forcing the VMSS to balance instances equally across the 3 zones, limiting total capacity to multiples of 3.
Scenario 2: Action Decision
The infrastructure team has identified that a staging VMSS performs aggressive scale-in during low-utilization hours, terminating instances that still have active user sessions. The investigation confirmed the cause: the instances being terminated have no Instance Protection configured.
The environment has the following restrictions:
- The VMSS uses Flexible orchestration mode
- There are currently 6 active instances
- 2 instances have active sessions with users performing manual load tests that cannot be interrupted
- The security team does not authorize changes to the autoscaling policy at this time
- The testing window ends in 4 hours
What is the correct action to take now?
A) Increase the minimum value in the autoscaling profile to 6, ensuring no instances are terminated during the testing window.
B) Apply Instance Protection of type protectFromScaleIn directly to the 2 instances with active sessions, without changing the autoscaling policy.
C) Change the upgradePolicy.mode to Manual to suspend any automatic VMSS actions until the end of the testing window.
D) Create a resource lock (ReadOnly) on the VMSS to prevent the autoscaler from terminating instances during the window.
Scenario 3: Root Cause
A VMSS was updated with a new base image that includes a newer version of the monitoring agent. The upgradePolicy.mode is configured as Automatic. After the update, the operations team observes that older VMSS instances continue showing the previous agent version, while instances created after the image update already have the new version.
The administrator executes:
az vmss get-instance-view \
  --name vmss-monit \
  --resource-group rg-monit \
  --instance-id 0 \
  --query "extensions[?name=='CustomScriptExtension'].statuses" \
  --output json
[
  [
    {
      "code": "ProvisioningState/succeeded",
      "displayStatus": "Provisioning succeeded",
      "level": "Info",
      "message": "Enable succeeded"
    }
  ]
]
The team reports that the new image was published to the shared gallery (Azure Compute Gallery) 3 days ago, and the VMSS was updated to reference the new image version on the same day. The Load Balancer associated with the VMSS reports that all instances are responding normally to health probes.
The VMSS has 10 instances. The output below shows the instance states:
az vmss list-instances \
--name vmss-monit \
--resource-group rg-monit \
--query "[].{id:instanceId, model:latestModelApplied}" \
--output table
InstanceId LatestModelApplied
------------ --------------------
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 True
8 False
9 False
What is the root cause of the observed behavior?
A) The CustomScriptExtension on old instances failed silently during automatic upgrade, preventing the new agent installation without impacting the health status reported to the Load Balancer.
B) The latestModelApplied: False field indicates that old instances have not yet been updated to the latest model, which contradicts the expected behavior of Automatic mode and points to a failure in the automatic upgrade process.
C) Automatic mode does not update instances that are responding to health probes, as the system interprets healthy instances as not candidates for restart.
D) The image referenced in the VMSS points to the new version in the gallery, but the VMSS is still using an image reference with a fixed version (1.0.0) instead of latest, causing only newly provisioned instances to receive the correct image.
Scenario 4: Diagnostic Sequence
An administrator receives the following alert at 2:32 PM:
"VMSS vmss-web-frontend: scale-out operation failed. 0 of 3 requested instances were provisioned."
The VMSS is in production with active traffic. The administrator needs to diagnose the failure and, if possible, provision the instances with minimal service impact.
The available investigation steps are:
- Step P: Check the VMSS Activity Log to identify the detailed error message from the failed scale-out operation.
- Step Q: Check the subscription vCPU quota for the SKU used in the VMSS region.
- Step R: Confirm whether the autoscaling profile has a maximum configured above the current number of instances.
- Step S: Check whether a resource lock (ReadOnly or Delete) is applied to the resource group or VMSS.
- Step T: Manually provision instances via az vmss scale after confirming the cause and correcting the impediment.
What is the correct diagnostic and action sequence?
A) R, P, Q, S, T
B) P, R, S, Q, T
C) S, Q, R, P, T
D) Q, P, S, R, T
Answer Key and Explanations
Answer Key: Scenario 1
Answer: C
The definitive clue is in the az monitor autoscale show command output, which displays "maximum": "4". The autoscaler respects this limit as an absolute ceiling, regardless of CPU demand or subscription quota availability. The scale-out events being generated correctly in the Activity Log confirm that the autoscaler is working but being blocked by its own configured limit.
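The ceiling behavior can be illustrated with a small Python sketch. It is a simplified model rather than Azure's actual autoscale engine: the function name `next_capacity` and the assumption that each rule firing requests one extra instance are hypothetical, while the profile values come from the command output in the scenario.

```python
def next_capacity(current: int, requested_increase: int, minimum: int, maximum: int) -> int:
    """Return the capacity after a scale-out request, clamped to the profile bounds."""
    desired = current + requested_increase
    return max(minimum, min(desired, maximum))


# Profile values from the scenario's `az monitor autoscale show` output.
profile = {"default": "2", "maximum": "4", "minimum": "2"}
minimum, maximum = int(profile["minimum"]), int(profile["maximum"])

capacity = int(profile["default"])
for _ in range(5):  # five consecutive high-CPU evaluations (assumed count)
    capacity = next_capacity(capacity, 1, minimum, maximum)

print(capacity)  # 4
```

However many times the scale-out rule fires, the result never passes the configured maximum, which matches the capacity of 4 reported by az vmss show.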
The information about the VPN gateway incident is intentionally irrelevant. Access via Load Balancer was never interrupted and the autoscaler operates with Azure Monitor metrics, which don't depend on VPN connectivity. Including this detail forces the reader to resist the temptation to attribute the cause to a more visible infrastructure incident.
Alternative A is a sophisticated distractor: an exhausted quota would surface as a provisioning error in the Activity Log, and the scenario reports no provisioning errors, only a capped instance count. The scenario also states that the subscription has sufficient quota for the SKU. Alternative D reflects a common misconception about zone balancing: the VMSS distributes instances across zones but does not limit total capacity to multiples of the number of zones.
The most dangerous distractor is A: an administrator who opens a quota increase request without checking the autoscaling profile loses time in production without solving the real problem.
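To see why alternative D does not hold, the sketch below models an even spread of instances across zones. It is a simplified illustration, assuming the scale set keeps zone counts within one instance of each other; the function name `spread_across_zones` is hypothetical and this is not the platform's placement algorithm.

```python
def spread_across_zones(instance_count: int, zones: list) -> dict:
    """Spread instances as evenly as possible; zone counts differ by at most one."""
    base, extra = divmod(instance_count, len(zones))
    return {zone: base + (1 if index < extra else 0) for index, zone in enumerate(zones)}


zones = ["1", "2", "3"]
for count in (4, 5, 7):
    print(count, spread_across_zones(count, zones))
# 4 {'1': 2, '2': 1, '3': 1}
# 5 {'1': 2, '2': 2, '3': 1}
# 7 {'1': 3, '2': 2, '3': 2}
```

Totals that are not multiples of 3 are still placed, so zone distribution does not explain a ceiling of 4.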
Answer Key: Scenario 2
Answer: B
The cause is stated explicitly in the scenario: the instances with active sessions have no Instance Protection. The critical restriction is that the security team does not authorize changes to the autoscaling policy, which directly eliminates alternative A, since raising the minimum modifies the autoscaling profile.
Alternative B applies targeted protection only to the 2 affected instances, without touching the autoscaling policy, and so respects every restriction in the scenario. In Flexible mode, protectFromScaleIn is applied directly to the instance via the protectionPolicy property, without requiring changes to the scale set configuration.
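The effect of per-instance protection can be sketched in Python as follows. This is a hypothetical model, not Azure's scale-in policy: the `Instance` class, the instance names, and the choice of which two instances hold sessions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    protect_from_scale_in: bool = False


def scale_in(instances: list, target_count: int) -> list:
    """Remove unprotected instances until target_count is reached or none remain."""
    remaining = list(instances)
    for vm in instances:
        if len(remaining) <= target_count:
            break
        if not vm.protect_from_scale_in:
            remaining.remove(vm)
    return remaining


# Six instances, as in the scenario; vm-1 and vm-4 are assumed to hold the sessions.
fleet = [Instance(f"vm-{i}") for i in range(6)]
fleet[1].protect_from_scale_in = True
fleet[4].protect_from_scale_in = True

survivors = scale_in(fleet, target_count=2)
print([vm.name for vm in survivors])  # ['vm-1', 'vm-4']
```

The autoscaling target itself is unchanged in this model; only the set of instances eligible for removal shrinks, which is why the approach fits the restriction on policy changes.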
Alternative D represents a critical reasoning error: a ReadOnly lock would prevent not only scale-in but any write operation on the VMSS, including legitimate management operations and potentially the autoscaling scale-out itself, causing much greater impact than the problem it aims to solve.
Alternative C confuses upgradePolicy.mode with autoscaling control. Changing the upgrade policy to Manual doesn't suspend autoscaler actions; they are independent configurations.
Answer Key: Scenario 3
Answer: D
The central clue is in the LatestModelApplied: False column for 7 of the 10 instances, combined with the fact that instances created after the update already have the correct model. This indicates that the VMSS is applying the new image only to new instances but not updating existing ones.
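The count can be checked with a few lines of Python. The values are copied from the table in Scenario 3; the variable names are illustrative only.

```python
# LatestModelApplied values for instances 0-9, copied from the Scenario 3 table.
latest_model_applied = [False, False, False, True, True, False, False, True, False, False]

outdated_ids = [i for i, applied in enumerate(latest_model_applied) if not applied]
current_ids = [i for i, applied in enumerate(latest_model_applied) if applied]

print(outdated_ids)  # [0, 1, 2, 5, 6, 8, 9]
print(current_ids)   # [3, 4, 7]
print(f"{len(outdated_ids)} of {len(latest_model_applied)} instances are behind the scale set model")
```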
Automatic mode should update all instances progressively without manual intervention. If 7 instances remain with latestModelApplied: False after 3 days, the automatic upgrade is not occurring for them. The most direct cause of this behavior is an image reference with a fixed version: when the reference uses "version": "1.0.0" instead of "latest", the VMSS doesn't automatically detect that a new version is available. New instances receive the latest image because they're provisioned from scratch with the current scale set model, which was manually updated to point to the new version, but the automatic upgrade trigger for existing instances isn't fired.
Alternative B correctly describes the symptom but isn't a root cause. It's a reformulation of the problem. Alternative C is technically false: Automatic mode updates healthy instances normally; unhealthy instances may be blocked in some scenarios.
The information about the CustomScriptExtension status is irrelevant to the diagnosis and was included to distract the reader toward alternative A.
Answer Key: Scenario 4
Answer: B
The correct sequence is P, R, S, Q, T, which follows progressive diagnostic logic from simplest to most specific, without executing corrective actions before confirming the cause.
The first step is always P (Activity Log), as the detailed error message from the failed operation may already reveal the cause without additional investigation. The Activity Log is the diagnostic entry point for any failed operation in Azure.
With the error message in hand, subsequent steps filter hypotheses in order of probability and ease of verification: R (autoscaling maximum limit) is quickly verified via CLI and resolves most silently blocked scale-out cases. S (resource lock) is less common but critical, as a ReadOnly lock would prevent any write operation. Q (vCPU quota) is checked last among hypotheses, as it requires access to the subscription quota panel and is a slower resolution problem.
Only after confirming and correcting the impediment is T (manual provisioning) executed.
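The ordering P, R, S, Q, then T can be sketched as a short Python routine. Everything here is hypothetical scaffolding: the check functions are stubs with made-up results, and a real investigation would read the Activity Log, autoscale profile, locks, and quota instead.

```python
from typing import Callable, List, Optional, Tuple

Check = Callable[[], Optional[str]]


def diagnose(checks: List[Tuple[str, Check]]) -> Optional[Tuple[str, str]]:
    """Run checks in order and return the first step that confirms a cause."""
    for step, check in checks:
        finding = check()
        if finding is not None:
            return step, finding
    return None


# Stubbed results, assumed for illustration only.
def check_activity_log() -> Optional[str]:
    return None  # error message was inconclusive in this made-up run


def check_autoscale_maximum() -> Optional[str]:
    return None  # maximum is above the current instance count


def check_resource_locks() -> Optional[str]:
    return "ReadOnly lock on resource group"


def check_vcpu_quota() -> Optional[str]:
    return None


ordered_checks = [
    ("P", check_activity_log),
    ("R", check_autoscale_maximum),
    ("S", check_resource_locks),
    ("Q", check_vcpu_quota),
]

result = diagnose(ordered_checks)
if result is not None:
    step, cause = result
    print(f"Step {step} confirmed the cause: {cause}")
    print("Correct the impediment, then run step T (manual provisioning).")
```

Step T is deliberately kept outside the list of checks, so it can only run after a cause has been confirmed.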
Sequence A starts with R, which may lead the administrator to unnecessarily change the maximum before even reading the error message. Sequence C starts with S (lock), which is the least likely cause and hardest to verify quickly. Sequence D starts with Q (quota), which is the most time-consuming and least likely investigation for a sudden failure in a previously working VMSS.
Troubleshooting Tree: Deploy and configure an Azure Virtual Machine Scale Set
Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question or decision |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Validation or intermediate verification |
To use this tree for a real problem, start with the root node identifying the type of symptom observed: blocked scaling, outdated model, improper instance termination, or stuck upgrade. Follow the branches answering each question based on what is directly observable via CLI or portal. Always go through validation nodes before executing corrective actions, as they ensure the diagnosis is correct before any environment changes. When reaching a cause node, the corresponding recommended action indicates the next operational step.