Troubleshooting Lab: Manage Virtual Machine Disks
Diagnostic Scenarios
Scenario 1: Root Cause
A production VM of size Standard_DS2_v2 running Windows Server executes a file processing application. The operations team reports that since last night, read operations on the data disk are significantly slower than expected. No code changes were made to the application.
The administrator collects the following information:
Data disk: Premium SSD LRS | 512 GB | P20
  Cache configured: None
  Provisioned IOPS on disk: 2,300
  IOPS consumed (portal metric): 2,290 (99.5% of limit)
OS disk: Premium SSD LRS | 128 GB | P10
  Cache configured: Read/Write
  IOPS consumed: 340
VM size: Standard_DS2_v2
  Max IOPS allowed by VM: 6,400
  Max throughput allowed by VM: 51,200 KB/s
The infrastructure team mentions that on the same night, a second 512 GB data disk of the same type was created and attached to the VM for a backup task, but this disk is not being used by the application.
What is the root cause of the read performance degradation on the data disk?
A) The cache configured as None on the data disk prevents recent reads from being served from the host cache, causing elevated latency.
B) The data disk reached the IOPS limit provisioned by the P20 SKU, causing throttling at the disk level.
C) The VM's IOPS limit was reached due to the sum of IOPS from both attached disks, causing throttling at the VM level.
D) The OS disk with Read/Write cache is consuming most of the VM's throughput bandwidth, leaving fewer resources for the data disk.
Scenario 2: Action Decision
The cause of the problem has been identified: a production Linux VM's OS disk is completely full, preventing log writes and causing intermittent application failures. The VM is running and serving production requests.
The administrator has the following constraints:
- No approved maintenance window for the next 6 hours
- The security team requires formal approval for any operation causing unplanned downtime
- The current OS disk size is 64 GB (SKU P6) and there is still unpartitioned allocated space on the disk
- The application tolerates a service restart without data loss, but does not tolerate VM deallocation
- A recent snapshot of the disk already exists
What is the correct action to take at this moment?
A) Expand the file system to utilize the existing unpartitioned space on the disk, without deallocating the VM.
B) Create a new larger data disk, copy log files to it, and redirect the application's log directory.
C) Deallocate the VM, resize the OS disk to 128 GB in the portal, and start the VM again.
D) Create a new snapshot of the current disk and restore the VM from the most recent snapshot to free up space.
Scenario 3: Root Cause
An administrator tries to attach an existing managed disk to a new VM and receives the following error message in the Azure portal:
The disk 'disk-prod-data-01' is already attached to a VM.
Operation not allowed.
The administrator verifies in the portal that the original VM to which the disk was linked was deleted two days ago. The resource group still exists and contains other resources. The disk appears in the managed disks listing with the state Attached.
Information collected:
Disk: disk-prod-data-01
Reported state: Attached
Original VM: vm-prod-app-01 (deleted)
Resource group: rg-producao
Subscription: sub-prod-001
Region: East US
Type: Premium SSD LRS | 256 GB
The network team reports there was a temporary connectivity failure with the East US region at the time the VM was deleted, but connectivity was restored minutes later.
What is the root cause of the disk's inconsistent state?
A) The connectivity failure with East US region corrupted the disk metadata during VM deletion, requiring restoration from backup.
B) The VM deletion was completed, but the disk detachment process was not properly finalized, leaving the disk in an inconsistent Attached state on the platform.
C) The disk is protected by a Resource Lock of type ReadOnly applied to the resource group, preventing any state modification.
D) The VM was deleted, but the Delete disk with VM option was unchecked, and this behavior prevents the disk from automatically returning to Unattached state.
Scenario 4: Diagnostic Sequence
A Windows VM reports that a newly attached data disk does not appear in File Explorer, although the disk is visible in the Azure portal as Attached and healthy.
The following investigation steps are available, out of order:
- Step A: Check in Windows Disk Management if the disk appears as offline, uninitialized, or without partition
- Step B: Confirm in the Azure portal that the disk is indeed in Attached state to the correct VM
- Step C: Initialize the disk, create a partition, and format with the desired file system
- Step D: Assign a drive letter to the volume so it appears in File Explorer
- Step E: Check Windows event logs for disk driver-related errors
What is the correct diagnostic and resolution sequence?
A) B → A → C → D → E
B) B → E → A → D → C
C) B → A → E → C → D
D) A → B → C → E → D
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive indicator is in the collected metrics: the P20 data disk is operating at 99.5% of its provisioned IOPS limit (2,290 out of 2,300). This value characterizes disk-level throttling, which is the mechanism by which Azure limits additional operations when the disk SKU reaches its performance ceiling. The practical result is read latency degradation, exactly the reported symptom.
The second attached data disk is an intentional distractor. The statement makes clear that this disk is not being used by the application, and the VM-level numbers rule out alternative C: the reported consumption on the active disks (2,290 + 340 = 2,630 IOPS) is well below the VM ceiling of 6,400 IOPS.
Alternative A describes real cache behavior: with the cache set to None, reads are not served from the host cache. However, None is a valid configuration for a data disk, and nothing in the scenario indicates the setting changed, so it does not explain a degradation that began last night. Alternative D is refuted by the data itself: the OS disk consumes only 340 IOPS, far from saturating the VM's limits.
The most dangerous distractor is C. The second disk is a recent and visible change, which creates the illusion that the combined disk IOPS reached the VM limit, but the numbers don't support this conclusion.
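The comparison between disk-level and VM-level limits can be sketched in a few lines of Python. This is only an illustration using the scenario's numbers: the `Disk` class, the `diagnose` function, and the 95% threshold are hypothetical choices, and the backup disk is assumed to be idle because the scenario gives no metric for it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    name: str
    provisioned_iops: Optional[int]  # None when the scenario gives no limit
    consumed_iops: int


def diagnose(disks: List[Disk], vm_iops_limit: int, threshold: float = 0.95) -> List[str]:
    """Flag disks or the VM whose consumption is at or above `threshold` of the limit."""
    findings = []
    for disk in disks:
        if disk.provisioned_iops and disk.consumed_iops / disk.provisioned_iops >= threshold:
            findings.append(f"disk-level throttling likely on {disk.name}")
    total = sum(disk.consumed_iops for disk in disks)
    if total / vm_iops_limit >= threshold:
        findings.append("VM-level throttling likely")
    return findings


disks = [
    Disk("data-disk", provisioned_iops=2300, consumed_iops=2290),
    Disk("os-disk", provisioned_iops=None, consumed_iops=340),    # limit not stated in the scenario
    Disk("backup-disk", provisioned_iops=2300, consumed_iops=0),  # assumed idle
]
findings = diagnose(disks, vm_iops_limit=6400)
print(findings)
```

With these inputs only the data disk is flagged, which matches alternative B; the VM total stays far below its ceiling.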
Answer Key: Scenario 2
Answer: A
The statement provides critical technical information: there is still unpartitioned allocated space on the existing OS disk. This means it's possible to expand the file system to use this space without any downtime-causing operation. In Linux, this can be done with tools like growpart and resize2fs (or xfs_growfs for XFS) with the VM running, without deallocation.
Alternative C would be technically valid for expanding the disk beyond its current size, but it requires VM deallocation, which violates explicit scenario constraints: there is no approved maintenance window, unplanned downtime requires formal approval, and the application does not tolerate deallocation. Acting on this alternative would cause an unauthorized production interruption.
Alternative B is creative, but unnecessarily complex when the non-destructive, no-downtime solution is already available. Alternative D reveals a conceptual misunderstanding: restoring a snapshot doesn't free disk space; it restores the previous state, which by the problem context was also out of free space.
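As a rough illustration of the online expansion described for alternative A, the Python sketch below only builds the command lines; it does not run them. The helper names (`partition_path`, `build_grow_commands`), the device `/dev/sda`, and partition number 1 are assumptions for the example, and it assumes a plain partition without LVM. The real device and file system should be confirmed first with tools such as `lsblk` and `df -T`.

```python
from typing import List


def partition_path(device: str, number: int) -> str:
    """Return the partition device node, e.g. /dev/sda1 or /dev/nvme0n1p1."""
    separator = "p" if device[-1].isdigit() else ""
    return f"{device}{separator}{number}"


def build_grow_commands(device: str, partition_number: int, fs_type: str,
                        mount_point: str = "/") -> List[List[str]]:
    """Build (but do not run) the commands that grow a partition and its file system."""
    commands = [["growpart", device, str(partition_number)]]
    if fs_type in ("ext2", "ext3", "ext4"):
        # resize2fs operates on the partition device
        commands.append(["resize2fs", partition_path(device, partition_number)])
    elif fs_type == "xfs":
        # xfs_growfs operates on the mounted file system
        commands.append(["xfs_growfs", mount_point])
    else:
        raise ValueError(f"unsupported file system for this sketch: {fs_type}")
    return commands


for command in build_grow_commands("/dev/sda", 1, "ext4"):
    print(" ".join(command))
```

Keeping the commands as data makes it easy to review them before anything touches a production disk. Because the disk in this scenario is completely full, it may be necessary to free a small amount of space first if a tool needs temporary files.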
Answer Key: Scenario 3
Answer: B
An Attached state that persists after VM deletion is a metadata inconsistency on the Azure platform. When a VM is deleted, Azure initiates the process of detaching the associated disks and updating their state to Unattached. In rare situations, this process may not complete correctly, leaving the disk with an incorrect state recorded in its metadata even though the VM no longer exists.
The resolution suggested here is to use the Azure CLI to clear the stale reference and force a disk state update:
az disk update --name disk-prod-data-01 \
--resource-group rg-producao \
--set managedBy=null
The distractor here is the connectivity failure reported by the network team. It has no relation to the disk state and serves to divert the diagnosis toward alternative A, which proposes metadata corruption and backup restoration, a disproportionate and incorrect response to the symptom.
Alternative C is plausible in other contexts, but the statement doesn't mention any Resource Lock, and the Attached state without existing VM is not the expected behavior of a lock. Alternative D mixes two distinct behaviors: the Delete disk with VM option controls whether the disk is deleted along with the VM, not the disk's attachment state after deletion.
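Before changing anything, it can help to confirm that the disk really is in the stale state described above. The Python sketch below is one possible way to do that: `fetch_disk` shells out to `az disk show`, and `classify_attachment` is a hypothetical helper. It assumes the JSON output exposes `diskState` and `managedBy` fields, and that the caller already knows whether the referenced VM still exists.

```python
import json
import subprocess


def fetch_disk(name: str, resource_group: str) -> dict:
    """Return the disk resource as a dict, using the Azure CLI (must be logged in)."""
    result = subprocess.run(
        ["az", "disk", "show", "--name", name,
         "--resource-group", resource_group, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def classify_attachment(disk: dict, vm_exists: bool) -> str:
    """Classify a disk as 'unattached', 'attached', or 'stale-attachment'."""
    # Assumption: 'diskState' and 'managedBy' are present in the CLI output.
    attached = disk.get("diskState") == "Attached" or bool(disk.get("managedBy"))
    if not attached:
        return "unattached"
    return "attached" if vm_exists else "stale-attachment"


# Example with data shaped like the scenario (no Azure call is made here);
# the managedBy value is a placeholder, not a real resource ID.
sample = {"name": "disk-prod-data-01", "diskState": "Attached",
          "managedBy": "placeholder-id-of-vm-prod-app-01"}
print(classify_attachment(sample, vm_exists=False))
```

In a real check, `fetch_disk("disk-prod-data-01", "rg-producao")` would supply the dictionary instead of the hard-coded sample.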
Answer Key: Scenario 4
Answer: C
The correct sequence is: B → A → E → C → D
The reasoning follows progressive diagnostic logic:
- B: Confirming in the portal that the disk is indeed Attached to the correct VM is always the first step. Without this confirmation, any investigation within the OS may be wasted effort.
- A: With portal confirmation, the next step is to verify how the OS sees the disk. Disk Management will reveal if the disk is offline, uninitialized, or without partition, which is the expected state for a newly attached disk in Windows.
- E: Checking event logs before acting allows identifying driver errors or issues that would make the next step useless.
- C: With the disk visible and healthy in Disk Management, the next operational step is to initialize it, create a partition, and format.
- D: Only after formatting is it possible to assign a drive letter, which is what will make the disk appear in File Explorer.
Alternative A places step C before E, which means acting without first checking for driver errors. Alternative B reverses A and E, reading logs before confirming the disk state in the OS, and it also assigns a drive letter (D) before the volume has been created (C). Alternative D starts with step A without portal confirmation, skipping the most basic validation.
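The ordering argument above can be checked mechanically. The Python sketch below encodes the reasoning as precedence pairs (each step must come before the next one in the chain B, A, E, C, D) and reports which pairs every alternative violates. The names `PRECEDENCE`, `OPTIONS`, and `violations` are illustrative only.

```python
from typing import List, Tuple

# Each pair (x, y) means step x must happen before step y.
PRECEDENCE: List[Tuple[str, str]] = [("B", "A"), ("A", "E"), ("E", "C"), ("C", "D")]

# The four alternatives from the scenario, written as step strings.
OPTIONS = {
    "A": "BACDE",
    "B": "BEADC",
    "C": "BAECD",
    "D": "ABCED",
}


def violations(order: str) -> List[Tuple[str, str]]:
    """Return the precedence pairs that the given step order breaks."""
    position = {step: index for index, step in enumerate(order)}
    return [(x, y) for x, y in PRECEDENCE if position[x] > position[y]]


valid = [label for label, order in OPTIONS.items() if not violations(order)]

for label, order in OPTIONS.items():
    print(label, " -> ".join(order), violations(order))
print("valid:", valid)
```

Only alternative C comes back with no violations, and the reported pairs for the other alternatives line up with the explanations given for each.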
Troubleshooting Tree: Manage Virtual Machine Disks
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Medium blue | Diagnostic question |
| Orange | Intermediate validation or check |
| Green | Recommended action or resolution |
| Red | Identified cause requiring corrective intervention |
To use this tree when facing a real problem, start at the root node and answer each question based on what you observe in the portal, metrics, and operating system. Don't advance to a branch without confirming the answer to the previous node. Orange nodes indicate you need to gather one more piece of information before deciding. Green nodes indicate there's a direct action available. Red nodes indicate the problem requires a higher-impact decision, such as resizing, migration, or maintenance window planning.