Troubleshooting Lab: Deploy virtual machines to availability zones and availability sets

Diagnostic Scenarios

Scenario 1: Root Cause

The operations team reports that after a planned Azure maintenance window last night, three of the four VMs became unavailable simultaneously for approximately 12 minutes. The four VMs belong to the Availability Set as-api-prod, which was created six months ago. According to the team, the Availability Set was configured correctly and is associated with an internal Load Balancer.

The administrator checks the Availability Set configuration and obtains the following output:

az vm availability-set show \
  --name as-api-prod \
  --resource-group rg-producao \
  --query "{faultDomains:platformFaultDomainCount, updateDomains:platformUpdateDomainCount}"

{
  "faultDomains": 2,
  "updateDomains": 2
}

The administrator also verifies that all four VMs have the Azure agent updated and that all disks are Premium SSD type. The Load Balancer has health probes configured correctly and is responding normally now.

What is the root cause of the behavior observed during maintenance?

A) The Load Balancer did not redistribute traffic correctly during maintenance, dropping VM connections

B) The Availability Set was configured with only 2 Update Domains, which allowed three of the four VMs to be restarted in the same update group

C) Premium SSD disks are not supported in Availability Sets, causing failure during reboot

D) The outdated Azure agent on VMs prevented Azure from correctly coordinating the maintenance sequence


Scenario 2: Diagnostic Sequence

An administrator tries to create a VM in the Azure portal and, when looking for the availability zone option, notices that the zone selection field is completely missing from the interface, as if the feature didn't exist for that configuration.

The VM is being created with the following characteristics:

Configuration | Value
Region | East US
Size | Standard_B2s
Operating system | Windows Server 2022
OS disk | Standard HDD
Virtual network | vnet-prod (existing)
Availability Set | as-frontend (existing)

The administrator confirms that the subscription is active, that they have Contributor permission on the resource group, and that the East US region supports Availability Zones.

What is the correct investigation sequence to identify why the zone option is missing?

A) Check subscription permissions > Check zone support in region > Check VM size > Remove Availability Set association

B) Check if Availability Set is selected > Confirm that Availability Set and Availability Zone are mutually exclusive > Remove Availability Set association > Try selecting zone again

C) Check disk type > Change to Premium SSD > Check zone support in region > Remove Availability Set association

D) Check resource group permissions > Check operating system > Check zone support in region > Remove Availability Set association


Scenario 3: Root Cause

A company deployed its application tier on three VMs distributed across Availability Zones 1, 2, and 3 in the Brazil South region. The architect declared that the solution is protected against zone failure and that the 99.99% SLA applies to the complete solution.

Two months after deployment, a failure occurs in Zone 2 of the region. Monitoring indicates that the VM in Zone 2 became inaccessible, but the rest of the application also experienced severe degradation and partial data loss in active sessions.

The environment information is:

Zone 1 VM: vm-app-z1 | Public IP zone-redundant | OS Disk: Zone 1
Zone 2 VM: vm-app-z2 | Public IP zone-redundant | OS Disk: Zone 2
Zone 3 VM: vm-app-z3 | Public IP zone-redundant | OS Disk: Zone 3
Load Balancer: Standard SKU | Frontend IP: zone-redundant
Storage Account (sessions): Standard_LRS | Location: Brazil South

The network team confirms that the Standard Load Balancer with zone-redundant IP worked correctly and diverted traffic to zones 1 and 3 within seconds.

What is the root cause of the degradation and data loss in active sessions?

A) Standard Load Balancer doesn't support traffic distribution between zones during real failures, only in tests

B) Zone-redundant public IPs introduce additional latency that causes session timeouts during failover

C) The Storage Account with LRS replication keeps all copies of the data in a single datacenter; because that datacenter was in the failed zone, session data became inaccessible or was lost when Zone 2 failed

D) The 99.99% SLA applies only to individual VMs, not to the complete solution, so protection was not effective


Scenario 4: Action Decision

The cause has been identified: a critical production VM called vm-db-primary is deployed without Availability Zone and without Availability Set. It runs a relational database with data that cannot be lost. The business team confirmed that a maintenance window is available next weekend, with tolerance for up to 4 hours of unavailability for migration.

The administrator knows that:

  • It's not possible to add an existing VM to an Availability Set or Availability Zone after creation
  • The database has a fully validated backup and restore mechanism
  • There's a documented failover process to a read replica on another VM
  • The window starts in 5 days

What is the correct action to take now?

A) Recreate the VM immediately in an Availability Zone, taking advantage of the fact that the window hasn't started yet, to minimize the risk of failure in the next 5 days

B) Plan VM recreation during the maintenance window, using the validated backup to restore data after deployment in an Availability Zone, following the documented process

C) Add the current VM to an Availability Set via CLI with the --availability-set parameter, since the database can keep running and no downtime window is needed

D) Wait for a real failure of the current VM to justify recreation downtime outside the planned window


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

Explanations:

  • The decisive clue is in the command output: "updateDomains": 2. With only 2 Update Domains and 4 VMs, Azure assigns VMs to Update Domains in round-robin order, which places 2 VMs in UD 0 and 2 VMs in UD 1. During planned maintenance, Azure restarts one Update Domain at a time, so every maintenance pass takes at least half of the set offline at once, and the Load Balancer can only route traffic to whatever remains. Keeping the default of 5 Update Domains would have placed each of the 4 VMs in its own Update Domain, so at most 1 VM would restart at a time.
  • The Premium SSD disk type and the current state of the Load Balancer health probes are irrelevant here. Both work correctly and have no causal relationship with the event during maintenance.
  • The most dangerous distractor is alternative A, which blames the Load Balancer. The Load Balancer is operational now and health probes are correct; the problem was the number of VMs affected simultaneously, not traffic redistribution.
  • Acting on distractor A would lead the administrator to reconfigure a Load Balancer that has no problem, leaving the real cause (insufficient Update Domains) uncorrected.
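
The round-robin reasoning above can be checked with a short sketch. It assumes a simplified placement model (VM i goes to update domain i modulo the domain count, in creation order); the VM names are hypothetical and not taken from the scenario.

```python
from collections import Counter

def assign_update_domains(vm_names, update_domain_count):
    """Assumed model: VMs are placed round-robin in creation order."""
    return {name: index % update_domain_count
            for index, name in enumerate(vm_names)}

def max_vms_per_update_domain(vm_names, update_domain_count):
    """Largest number of VMs that share one update domain."""
    placement = assign_update_domains(vm_names, update_domain_count)
    return max(Counter(placement.values()).values())

vms = ["vm-api-1", "vm-api-2", "vm-api-3", "vm-api-4"]  # hypothetical names
for ud_count in (2, 5):
    worst = max_vms_per_update_domain(vms, ud_count)
    print(f"{ud_count} update domains: up to {worst} of {len(vms)} VMs restart together")
```

Under this model, 2 update domains put two VMs in the same restart group, while 5 update domains leave each VM in a group of its own.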

Answer Key: Scenario 2

Answer: B

Explanations:

  • The correct sequence starts with the most direct observation available in the interface: the Availability Set is already selected. Confirming that the two mechanisms are mutually exclusive is the next reasoning step, and removing the Availability Set is the action that will unlock zone selection.
  • Alternatives C and D divert diagnosis to disk type and operating system, which are completely irrelevant to the absence of the zone option in this context.
  • Alternative A includes the correct action at the end (remove Availability Set), but inserts unnecessary checks before it: permissions and regional zone support have already been confirmed by the statement, and VM size does not explain a zone field that is missing entirely. An efficient diagnostic sequence starts from the most specific symptom available, not from generic checks.
  • The central reasoning error in distractors is ignoring the most explicit information in the scenario (presence of Availability Set) and seeking causes in peripheral elements.
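
As an illustration of starting from the most specific evidence, the sketch below orders the checks so the Availability Set is examined first. The function `zone_field_blockers` is hypothetical and does not call any Azure API; it only encodes the rule stated above that Availability Sets and Availability Zones are mutually exclusive.

```python
from typing import List, Optional

def zone_field_blockers(availability_set: Optional[str],
                        region_supports_zones: bool) -> List[str]:
    """Return likely reasons the zone field is missing, most specific first."""
    blockers = []
    if availability_set is not None:
        # Most specific evidence: a set is already selected for this VM.
        blockers.append(
            f"Availability Set '{availability_set}' is selected; "
            "sets and zones are mutually exclusive"
        )
    if not region_supports_zones:
        blockers.append("Region does not support Availability Zones")
    return blockers

# Values from Scenario 2: as-frontend is selected, East US supports zones.
findings = zone_field_blockers(availability_set="as-frontend",
                               region_supports_zones=True)
print(findings[0])
```

With the scenario's values, the only finding is the Availability Set association, which matches the sequence in alternative B.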

Answer Key: Scenario 3

Answer: C

Explanations:

  • The critical element is in the environment information: Storage Account (sessions): Standard_LRS. LRS (Locally Redundant Storage) keeps all copies of the data within a single datacenter in the region. When Zone 2 failed, session data being written to that Storage Account became inaccessible or was lost, as there was no copy in another zone.
  • The Standard Load Balancer with zone-redundant frontend worked correctly, as confirmed by the network team. This information eliminates alternative A from diagnosis.
  • Alternative B is a plausible but incorrect technical distractor: zone-redundant IPs don't introduce perceptible latency and don't cause session timeouts during failover.
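  • Information about the 99.99% SLA mentioned by the architect is irrelevant to root cause diagnosis. That SLA covers connectivity to at least one VM when two or more instances are deployed across Availability Zones; it says nothing about the availability or integrity of data in a Storage Account with LRS.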
  • The most dangerous distractor is alternative D, which leads the administrator to question the SLA instead of investigating storage replication configuration. This would delay identification and correction of the real problem, which is replacing LRS with ZRS (Zone-Redundant Storage) in the sessions Storage Account.
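
A quick way to spot this kind of gap is to list every component with a flag for zone redundancy and look for the exceptions. The sketch below is illustrative only: the component flags are copied from the scenario's environment information, and the helper `is_zone_redundant_storage` is a hypothetical name that checks the SKU against the storage redundancy options that replicate across zones in the primary region.

```python
# Storage SKUs that replicate synchronously across availability zones
# in the primary region.
ZONE_REDUNDANT_STORAGE_SKUS = {
    "Standard_ZRS", "Premium_ZRS", "Standard_GZRS", "Standard_RAGZRS",
}

def is_zone_redundant_storage(sku_name: str) -> bool:
    """True when the SKU keeps copies in more than one zone."""
    return sku_name in ZONE_REDUNDANT_STORAGE_SKUS

# Flags taken from the Scenario 3 environment information.
environment = {
    "VM public IPs": True,
    "Load Balancer frontend IP": True,
    "Storage Account (sessions)": is_zone_redundant_storage("Standard_LRS"),
}

weak_points = [name for name, redundant in environment.items() if not redundant]
print(weak_points)
```

In this scenario the only component left in the list is the sessions Storage Account; with Standard_ZRS the list would be empty.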

Answer Key: Scenario 4

Answer: B

Explanations:

  • The central constraint of the scenario is the existence of an already approved maintenance window, with defined tolerance and documented process. Acting within this window is the correct decision because it protects data integrity and respects the validated backup and restore process.
  • Alternative A seems prudent, but ignores a critical constraint: recreating the VM now means executing a production database migration outside the approved maintenance window, without business alignment and with risk of unplanned impact. Five more days without zone protection do not justify this risk.
  • Alternative C describes an impossible operation. There's no way to add an existing VM to an Availability Set after creation, whether via CLI, portal, or API. Executing this command would result in an error.
  • Alternative D is the most dangerous: waiting for a real failure means accepting unplanned downtime and potential data loss in a database without zone protection, instead of acting within a controlled window already available.
  • The discipline of not acting outside the planned window, even when risk seems urgent, is the correct behavior for production environments with critical data.

Troubleshooting Tree: Deploy virtual machines to availability zones and availability sets


Color legend:

Color | Node type
Dark blue | Initial symptom (entry point)
Medium blue | Diagnostic question
Red | Identified cause
Green | Recommended action or resolution
Orange | Intermediate verification or ambiguous state

To use this tree when facing a real problem, start with the root node describing the observed symptom and answer each question based on what you can directly verify in the environment. Follow the path corresponding to your answer until you reach a red node (identified cause) or green node (recommended action). If you reach an orange node, more verification is needed before concluding the diagnosis. The goal is always to traverse the fewest branches possible based on available evidence, without investigating paths that the evidence already rules out.
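
The traversal described above can be sketched in a few lines. The tree below is a hypothetical fragment built from Scenario 2, not a copy of the page's diagram; the `Node` class, the `walk` function, and the node texts are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Node:
    kind: str  # "symptom", "question", "cause", "action", or "verify"
    text: str
    branches: Dict[str, "Node"] = field(default_factory=dict)

def walk(root: Node, answers: List[str]) -> Tuple[str, List[str]]:
    """Follow the answers from the root until a terminal node is reached."""
    node, path = root, [root.text]
    remaining = iter(answers)
    while node.kind in ("symptom", "question"):
        # A symptom has a single outgoing edge; a question branches on the answer.
        key = "next" if node.kind == "symptom" else next(remaining)
        node = node.branches[key]
        path.append(node.text)
    return node.kind, path

tree = Node("symptom", "Zone selection field is missing", {
    "next": Node("question", "Is an Availability Set selected?", {
        "yes": Node("cause", "Availability Set and Availability Zone are mutually exclusive"),
        "no": Node("verify", "Confirm zone support for the region and VM size"),
    }),
})

kind, path = walk(tree, ["yes"])
print(kind, "->", path[-1])
```

Answering "yes" ends at a cause node after a single question, while "no" ends at a verification node, the equivalent of an orange node in the legend.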