Troubleshooting Lab: Manage Virtual Machine Sizes
Diagnostic Scenarios
Scenario 1: Root Cause
An administrator attempts to resize a production VM called vm-api-prod from Standard_D4s_v3 to Standard_D16s_v3 through the Azure portal. The VM is running, serves a critical API, and is located in the East US region. The associated storage account was created two years ago and uses LRS replication. The OS disk is a 128 GB managed disk.
When trying to confirm the resize, the administrator receives the following message:
Resize virtual machine
The following VM sizes are not available for this VM.
Standard_D16s_v3 - Not available in current allocation cluster
The administrator verifies in the portal that the Standard_D16s_v3 size appears in the list of available sizes for the East US region. The managed disk is healthy and error-free. The LRS storage account shows no active alerts.
What is the root cause of the problem?
A) The Standard_D16s_v3 size has been discontinued and is no longer available for new allocations in the East US region
B) The VM is allocated in a hardware cluster that doesn't have hosts supporting the target size
C) The 128 GB managed disk is incompatible with the Standard_D16s_v3 size
D) The LRS storage account prevents resizing to sizes with more than 8 vCPUs
Scenario 2: Action Decision
The operations team has identified that a critical database VM needs to be migrated from the Standard_D family to the Standard_E family to meet new memory requirements. The cause has been diagnosed and confirmed: the target size Standard_E8s_v3 is not available in the current VM cluster, and the operation can only be completed after deallocation.
The VM is in active production, with a maintenance window scheduled for Saturday at 02:00 and approved by the business team. It is Thursday, 14:00. The administrator has the Contributor role on the resource group. The VM is not part of an Availability Set.
What is the correct action to take at this moment?
A) Deallocate the VM immediately, perform the resize and restart, since the diagnosis has been confirmed
B) Wait for the approved maintenance window, deallocate the VM, apply the resize and restart
C) Create a new VM with the correct size now and migrate data without waiting for the window
D) Resize the VM without deallocation using Azure CLI, bypassing the portal restriction
Scenario 3: Root Cause
A developer reports that VM vm-render-01, used for image processing, has CPU performance well below expected during load spikes. The VM's current size is Standard_B4ms. Over the last 7 days, Azure Monitor has recorded CPU utilization consistently above 90% during business hours.
The team reviewed the configuration and verified the following data:
VM Size: Standard_B4ms
vCPUs: 4
RAM: 16 GiB
CPU Credits: Remaining: 0 / Max: 1296
Baseline CPU %: 90%
Disk IOPS: No throttling detected
Network: No packet loss
The VM was created three weeks ago. The infrastructure team mentions that the premium disk is correctly configured and the network shows no issues. Recently, the team also updated the monitoring agent on the VM.
What is the root cause of the observed performance degradation?
A) The updated monitoring agent is consuming CPU resources and causing the bottleneck
B) The premium disk, despite appearing correct, has insufficient IOPS for the workload
C) The VM has exhausted accumulated CPU credits and is operating at the B-series baseline limit
D) The Standard_B4ms size has continuous usage restrictions and enters throttling mode after 21 days
Scenario 4: Diagnostic Sequence
An administrator receives the following alert at 08:15:
[ALERT] vm-web-prod: resize operation failed
Time: 08:12
Requested size: Standard_F8s_v2
Current size: Standard_F2s_v2
Error: OperationNotAllowed - QuotaExceeded
Region: Brazil South
The administrator has never investigated this type of failure before and needs to resolve the problem. Below are the available investigation steps, out of order:
[P1] Check current vCPU usage in the subscription for the Brazil South region
[P2] Open a support ticket requesting a vCPU quota increase for the F family
[P3] Confirm that OperationNotAllowed with QuotaExceeded error indicates quota exhaustion
[P4] Validate that the quota increase was approved and try the resize again
[P5] Identify how many additional vCPUs will be consumed by the new size
Which diagnostic and action sequence represents the correct approach?
A) P3 -> P1 -> P5 -> P2 -> P4
B) P1 -> P3 -> P2 -> P5 -> P4
C) P5 -> P1 -> P3 -> P4 -> P2
D) P3 -> P5 -> P1 -> P4 -> P2
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The error message accurately describes the problem: the requested size is not available in the current allocation cluster of the VM. This occurs because Azure allocates VMs in physical hardware clusters within a region, and not all sizes exist in all clusters. The regional availability listed in the portal indicates that the size exists in the region, not that it's accessible in the specific cluster where the VM resides.
The determining clue in the statement is exactly the error text: "Not available in current allocation cluster", which directly names the cluster as the limiting factor, not the region.
The information about the LRS storage account and the 128 GB disk is intentionally irrelevant and was included to induce incorrect diagnosis. Storage replication type and disk size do not influence VM size availability in clusters.
The most dangerous distractor is D, which could convince someone to try migrating the storage account before identifying the real cause. Acting based on this hypothesis would waste time and not solve the problem.
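The correct solution is to deallocate the VM, which releases its binding to the current cluster, then resize it and start it again so that Azure can place it on a host that supports the desired size. Deallocation makes the VM unavailable, so for a production VM it has to be planned.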
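The deallocate, resize, start sequence described above can be sketched as a small helper that builds the Azure CLI commands in order. This is a minimal sketch rather than a definitive runbook: the resource group name `rg-api-prod` is hypothetical, and the helper only builds command strings, it does not run them. The first command, `az vm list-vm-resize-options`, lists the sizes the VM can currently be resized to, which is a useful check before planning downtime.

```python
def resize_plan(resource_group: str, vm_name: str, target_size: str) -> list[str]:
    """Return the Azure CLI commands for a resize that requires deallocation."""
    scope = f"--resource-group {resource_group} --name {vm_name}"
    return [
        # Sizes the VM can take where it is currently allocated.
        f"az vm list-vm-resize-options {scope} --output table",
        # Deallocating releases the VM from its current cluster (causes downtime).
        f"az vm deallocate {scope}",
        f"az vm resize {scope} --size {target_size}",
        f"az vm start {scope}",
    ]


# "rg-api-prod" is a hypothetical resource group name used for illustration.
for command in resize_plan("rg-api-prod", "vm-api-prod", "Standard_D16s_v3"):
    print(command)
```

Building the plan as data first makes it easy to review the exact commands before the downtime starts.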
Answer Key: Scenario 2
Answer: B
The cause is already confirmed and the technical solution is known: deallocate and resize. What decides the correct answer in this scenario is the operational constraint: the business team approved the maintenance window for Saturday at 02:00, which is still 36 hours away (it is Thursday, 14:00).
Performing the deallocation immediately (alternative A) is technically sound but violates the operational constraint. The VM is in active production, and deallocation makes it unavailable. Bringing forward an impactful action without authorization is a governance error, not a technical one.
Alternative C ignores the window by creating a new VM now, and risks problems from a data migration done without adequate planning. Alternative D is factually incorrect: Azure CLI also requires deallocation when the target size is not available in the current cluster. The limitation comes from Azure's behavior, not from the portal.
The discipline of waiting for the approved window demonstrates that correct diagnosis doesn't authorize immediate action when operational restrictions are in effect.
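The timing constraint can be checked with a few lines of date arithmetic. The sketch below assumes hypothetical calendar dates (Thursday 2024-01-04 and Saturday 2024-01-06) and a hypothetical two-hour window length, since the scenario gives only weekdays and a start time.

```python
from datetime import datetime, timedelta


def time_until_window(now: datetime, window_start: datetime) -> timedelta:
    """Time remaining before the approved window opens (zero once it has opened)."""
    return max(window_start - now, timedelta(0))


def may_deallocate(now: datetime, window_start: datetime,
                   window_length: timedelta = timedelta(hours=2)) -> bool:
    """True only while the approved maintenance window is open."""
    return window_start <= now < window_start + window_length


# Hypothetical dates matching the scenario's weekdays and times.
now = datetime(2024, 1, 4, 14, 0)           # Thursday, 14:00
window_start = datetime(2024, 1, 6, 2, 0)   # Saturday, 02:00

remaining = time_until_window(now, window_start)
print(f"Hours until window: {remaining.total_seconds() / 3600}")
print(f"Deallocation allowed now: {may_deallocate(now, window_start)}")
```

A gate like `may_deallocate` could sit in front of an automation script so that a confirmed diagnosis alone never triggers the disruptive step.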
Answer Key: Scenario 3
Answer: C
The determining data is in the metrics block: CPU Credits Remaining: 0. The B-series uses a CPU credit model: the VM banks credits while it runs below its baseline and spends them while it runs above it. Once the credits are exhausted, CPU performance is capped at the baseline defined for that size. For Standard_B4ms, the baseline is 90% for the VM as a whole, on a scale where the four vCPUs together can reach 400% (roughly 22.5% per vCPU). Without credits, the VM cannot burst beyond this level, which produces exactly the reported behavior during load spikes.
The information about the monitoring agent update is intentionally irrelevant and was included to divert diagnosis toward a software path. Monitoring agents consume marginal CPU and don't explain sustained utilization above 90%.
The most dangerous distractor is A, since the recent software update is a visible event in history and naturally attracts suspicion. Acting based on this hypothesis would lead to agent rollback with no performance gain.
Distractor D is technically invalid: the B-series doesn't have usage limits based on VM uptime.
The correct resolution is to migrate to a size with constant dedicated CPU, like the D or F series, suitable for workloads with continuous and predictable utilization.
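The credit model can be made concrete with a little arithmetic. The sketch below assumes the convention in which baseline and usage are expressed as a percentage of one vCPU summed over the VM (so a 4-vCPU size tops out at 400%), and that one credit equals one vCPU running at 100% for one minute. The banked amount of 324 credits and the 360% load are hypothetical illustration values, not data from the scenario.

```python
def net_credits_per_hour(baseline_pct: float, usage_pct: float) -> float:
    """Credits banked (positive) or spent (negative) during one hour.

    Both percentages are summed over all vCPUs, e.g. 0-400 for a 4-vCPU VM.
    """
    return (baseline_pct - usage_pct) * 60 / 100


def hours_until_exhausted(banked: float, baseline_pct: float, usage_pct: float) -> float:
    """Hours a credit bank lasts at a steady load; infinite if the load is at or below baseline."""
    net = net_credits_per_hour(baseline_pct, usage_pct)
    if net >= 0:
        return float("inf")
    return banked / -net


BASELINE = 90      # baseline from the scenario's data block
HEAVY_LOAD = 360   # hypothetical: all four vCPUs at 90%

print(net_credits_per_hour(BASELINE, 0))           # idle VM banks credits
print(net_credits_per_hour(BASELINE, HEAVY_LOAD))  # heavy load spends credits
print(hours_until_exhausted(324, BASELINE, HEAVY_LOAD))
```

Under these assumptions a sustained heavy load drains the bank within hours, which is why burstable sizes suit spiky workloads and not continuous ones.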
Answer Key: Scenario 4
Answer: A
The correct sequence is P3 -> P1 -> P5 -> P2 -> P4.
Progressive diagnostic reasoning requires:
- P3: Confirm what the error means before any action. The QuotaExceeded error indicates vCPU quota exhaustion, not cluster failure or regional unavailability.
- P1: Check current vCPU usage in the subscription for the affected region, establishing the consumption baseline.
- P5: Calculate how many additional vCPUs the new size requires, to know exactly how much additional quota to request. Opening a ticket without this number results in an imprecise request.
- P2: With the necessary information in hand, open the support ticket with the correct increase value.
- P4: Validate that the increase was approved before trying again.
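Alternative B looks reasonable but reverses P3 and P1 (checking usage before understanding what the error means is investigating without a diagnosis) and opens the ticket (P2) before calculating the required vCPUs (P5), which risks an undersized request. Alternative C starts with P5 without first confirming what the error represents, and it validates the approval (P4) before the ticket is opened (P2). Alternative D confirms the error first but has the same ordering flaw: P4 cannot come before P2, because there is no quota increase to validate until the request has been made.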
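Steps P1 and P5 come down to simple arithmetic, sketched below. The vCPU counts follow from the size names (Standard_F2s_v2 has 2 vCPUs, Standard_F8s_v2 has 8). The usage and limit figures are hypothetical, and the sketch assumes, as P5 is worded, that the resize is charged only the difference between the two sizes. In practice both the per-family quota and the total regional vCPU quota are worth checking.

```python
def additional_vcpus(current_vcpus: int, target_vcpus: int) -> int:
    """Extra vCPUs the resize consumes (P5)."""
    return max(target_vcpus - current_vcpus, 0)


def quota_increase_needed(usage: int, limit: int,
                          current_vcpus: int, target_vcpus: int) -> int:
    """How far the limit must rise for the resize to fit (0 if it already fits)."""
    projected_usage = usage + additional_vcpus(current_vcpus, target_vcpus)
    return max(projected_usage - limit, 0)


# Hypothetical F-family quota figures for Brazil South (P1).
current_usage, current_limit = 22, 24

extra = additional_vcpus(current_vcpus=2, target_vcpus=8)
increase = quota_increase_needed(current_usage, current_limit, 2, 8)

print(f"Additional vCPUs consumed by the resize: {extra}")
print(f"Minimum new limit to request: {current_limit + increase}")
```

Having both numbers before opening the ticket (P2) is what makes the request precise.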
Troubleshooting Tree: Manage Virtual Machine Sizes
Color Legend:
- Dark blue: symptom or entry point
- Blue: diagnostic question or investigation decision
- Red: identified cause
- Green: recommended action or resolution
- Orange: intermediate validation or verification
To use this tree when facing a real problem, start at the root node and answer each question based on what you observed in the environment. If the operation generated an error, identify the error type before any action. If the problem is silent, like performance degradation without explicit error, follow the series and credits investigation path. Each branch ends in a precise cause or concrete action, preventing you from acting on unconfirmed hypotheses.
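The branching described above can be sketched as a small triage function. This is an illustrative sketch limited to the three causes covered by the scenarios in this lab, not a reproduction of the full tree. The function name, its parameters, and the returned labels are hypothetical.

```python
from typing import Optional, Tuple


def triage(error_text: str = "", vm_size: str = "",
           credits_remaining: Optional[int] = None) -> Tuple[str, str]:
    """Map an observation to a (cause, action) pair, identifying the error type first."""
    text = error_text.lower()
    if "quotaexceeded" in text:
        return ("vCPU quota exhausted",
                "check usage, calculate the extra vCPUs, request a quota increase")
    if "allocation cluster" in text:
        return ("size not available in the current cluster",
                "deallocate in an approved window, resize, start")
    # Silent problem: no explicit error, so follow the series and credits path.
    if not text and vm_size.startswith("Standard_B") and credits_remaining == 0:
        return ("B-series CPU credits exhausted",
                "resize to a size with constant CPU performance")
    return ("unconfirmed", "gather more evidence before acting")


print(triage(error_text="OperationNotAllowed - QuotaExceeded"))
print(triage(error_text="Not available in current allocation cluster"))
print(triage(vm_size="Standard_B4ms", credits_remaining=0))
```

The fallback branch returns no action on purpose: an unmatched symptom is a reason to keep investigating, not to guess.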