Theoretical Foundation: Deploy Virtual Machines to Availability Zones and Availability Sets


1. Initial Intuition​

Imagine you're responsible for maintaining three ATMs in a city. If you place all three in the same bank branch, any problem at that branch (power outage, flooding, maintenance) makes all three unavailable. The obvious solution is to distribute the ATMs across different locations: if one location fails, the other two continue working.

In Azure, the same principle applies. When you have multiple VMs serving the same application, you need to ensure they don't all end up on the same physical hardware or in the same datacenter. If they do, a single failure can bring everything down simultaneously.

Availability Sets and Availability Zones are Azure's two mechanisms to intelligently distribute VMs, ensuring that physical failures or planned maintenance don't bring down your entire application at once.

The essential difference between the two lies in the scale of failure they protect against:

  • Availability Sets protect against hardware failures within a single datacenter
  • Availability Zones protect against failures of entire datacenters within a region

2. Context​

The problem these mechanisms solve​

Important: no mechanism protects against application-level failures or operating system errors. These mechanisms specifically protect against physical infrastructure failures.

SLA (Service Level Agreement) and its dependency​

Azure's SLA for a single VM (no zone, no availability set) is 99.9%, and that figure applies only when all of the VM's disks are Premium SSD or Ultra Disk. With the right mechanisms, the SLA improves:

| Configuration | SLA |
| --- | --- |
| Single VM without AZ or AS | 99.9% |
| 2+ VMs in Availability Set | 99.95% |
| 2+ VMs in different Availability Zones | 99.99% |

The difference between 99.95% and 99.99% may seem small, but it is the difference between roughly 4.4 hours of permitted downtime per year and about 53 minutes.
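
The downtime figures follow directly from the SLA percentages. The short Python sketch below shows the arithmetic; it assumes a 365-day year, and the function name `max_downtime_hours` is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(sla_percent: float) -> float:
    """Return the yearly downtime (in hours) that an SLA percentage still permits."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    hours = max_downtime_hours(sla)
    print(f"{sla}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

The same function can be used to compare any two SLA tiers before deciding whether the extra availability justifies an architecture change.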


3. Building the Concepts​

3.1 Availability Sets: protection within the datacenter​

An Availability Set is a logical grouping of VMs that instructs Azure to distribute these VMs across separate physical hardware within a single datacenter, using two concepts: Fault Domains and Update Domains.

Fault Domains (FDs)​

A Fault Domain is a group of hardware that shares a common power source and network switch. It's essentially a physical rack.

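For example, suppose four VMs are spread over three Fault Domains so that VM1 and VM4 share Rack A (FD0). If Rack A loses power, only VM1 and VM4 are affected; VM2 and VM3 continue operating. An Availability Set can have at most 3 Fault Domains, and some regions support only 2.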

Update Domains (UDs)​

An Update Domain represents a group of VMs that can be restarted simultaneously during a planned host update (hypervisor maintenance, platform updates).

Azure updates one Update Domain at a time and gives each rebooted domain 30 minutes to recover before starting the next. With 5 VMs distributed across 5 Update Domains, no more than 1 VM is restarted at a time during maintenance. An Availability Set can have up to 20 Update Domains.

How FDs and UDs interact​

Azure automatically distributes VMs across FDs and UDs when adding VMs to the Availability Set. You don't choose which FD or UD each VM goes to.
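
To build intuition for how FDs and UDs combine, here is a small Python sketch. It assumes a simple round-robin assignment (VM number i goes to FD `i mod FDs` and UD `i mod UDs`); this is a simplified model for illustration, not a statement of how the platform is guaranteed to place VMs. The names `place_vms` and `vms_lost_if_fd_fails` are hypothetical.

```python
def place_vms(vm_count: int, fault_domains: int = 3, update_domains: int = 5):
    """Assign each VM an FD and a UD round-robin (illustrative model only)."""
    return [
        {"vm": f"vm-{i + 1}", "fd": i % fault_domains, "ud": i % update_domains}
        for i in range(vm_count)
    ]


def vms_lost_if_fd_fails(placement, fd: int):
    """Return the names of the VMs that share the failed Fault Domain."""
    return [p["vm"] for p in placement if p["fd"] == fd]


placement = place_vms(4)
for p in placement:
    print(p)

# Under this model, the first and fourth VM share FD0,
# so a failure of that rack takes down exactly those two.
print(vms_lost_if_fd_fails(placement, 0))
```

Under this model a rack failure never removes more than about one third of the VMs when three Fault Domains are in use.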

3.2 Availability Zones: protection between datacenters​

An Availability Zone is a physically separate location within an Azure region, made up of one or more datacenters. Each zone has its own power, cooling, and networking, and the zones are linked by low-latency fiber connections.

If Zone 1 suffers a total failure (a building-wide power outage, for example), the VMs placed in Zones 2 and 3 (say, VM-Web-2 and VM-Web-3) continue operating.

Regional support for Availability Zones: Not all Azure regions have Availability Zones. Larger regions such as East US, West Europe, Southeast Asia, and Brazil South do. Check before architecting.

3.3 Flexible Orchestration and Virtual Machine Scale Sets​

In addition to traditional Availability Sets, Azure offers Virtual Machine Scale Sets (VMSS) with two orchestration modes:

Uniform Orchestration: all VMs are identical, created from the same template. Focus on automatic scaling of identical instances.

Flexible Orchestration: allows heterogeneous VMs and is the functional equivalent of a modern Availability Set, with Availability Zone support.

For AZ-104, the focus is on Availability Sets and Availability Zones with individual VMs.


4. Structural View​

Structural comparison: Availability Set vs. Availability Zone​

Reference architecture: 3-tier application with high availability​

5. Practical Operation​

Availability Set lifecycle​

Non-obvious behaviors​

A VM cannot be added to an Availability Set after creation. The Availability Set is defined when creating the VM. If you want to put an existing VM in an Availability Set, you need to delete and recreate the VM. Disks can be preserved.

Availability Set and Availability Zone are mutually exclusive. It's not possible to place a VM simultaneously in an Availability Set and in a specific Availability Zone. They are alternative mechanisms. To use Availability Zones, you specify the zone when creating the VM, without an Availability Set.

VMs in different Availability Zones have 1-2ms network latency between them. Zones are different datacenters connected by fiber optic. The latency between zones in the same region is low enough for most applications, but extremely latency-sensitive applications (high-frequency trading) should consider this.

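Managed disks in Availability Zones are zone-scoped. A VM in Zone 1 can only use locally redundant (LRS) managed disks that are also in Zone 1. When you create a VM in a specific zone, its disks are automatically created in the same zone. Zone-redundant (ZRS) disks are the exception, since they are replicated across zones.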

Availability Set does not guarantee distribution across zones. An Availability Set protects against hardware failures within a datacenter. If the entire datacenter fails, all VMs in the Availability Set are affected. For datacenter-level protection, use Availability Zones.

Planned maintenance is announced in advance. Azure sends notifications before planned maintenance that will affect VMs. With Update Domains, maintenance is staggered, but every UD still goes through it in turn.


6. Implementation Methods​

Azure Portal​

To create Availability Set:

  1. Portal > Availability sets > + Create
  2. Define name, region, subscription, RG
  3. Configure: Fault domains (1-3) and Update domains (1-20)
  4. Select: Use managed disks (Yes - Aligned recommended)
  5. Create

To create VM in Availability Set:

  1. Portal > Virtual machines > + Create
  2. Availability tab > Availability set
  3. Select existing Availability Set
  4. Complete creation normally

To create VM in Availability Zone:

  1. Portal > Virtual machines > + Create
  2. Availability tab > Availability zone
  3. Select Zone 1, Zone 2 or Zone 3
  4. Complete creation normally

Azure CLI​

# Create Availability Set with 3 fault domains and 5 update domains
# (the maximum fault domain count varies by region; some regions allow only 2)
az vm availability-set create \
--resource-group "rg-producao" \
--name "as-web-tier" \
--location "brazilsouth" \
--platform-fault-domain-count 3 \
--platform-update-domain-count 5

# View Availability Set details
az vm availability-set show \
--resource-group "rg-producao" \
--name "as-web-tier" \
--output json

# Create VM within an Availability Set
az vm create \
--resource-group "rg-producao" \
--name "vm-web-01" \
--availability-set "as-web-tier" \
--image "Ubuntu2204" \
--size "Standard_D2s_v5" \
--admin-username "azureadmin" \
--ssh-key-values ~/.ssh/id_rsa.pub

# Create second VM in the same Availability Set
az vm create \
--resource-group "rg-producao" \
--name "vm-web-02" \
--availability-set "as-web-tier" \
--image "Ubuntu2204" \
--size "Standard_D2s_v5" \
--admin-username "azureadmin" \
--ssh-key-values ~/.ssh/id_rsa.pub

# Create third VM in the same Availability Set
az vm create \
--resource-group "rg-producao" \
--name "vm-web-03" \
--availability-set "as-web-tier" \
--image "Ubuntu2204" \
--size "Standard_D2s_v5" \
--admin-username "azureadmin" \
--ssh-key-values ~/.ssh/id_rsa.pub

# Check which FD and UD a VM was placed in (reported in the VM's instance view)
az vm get-instance-view \
--resource-group "rg-producao" \
--name "vm-web-01" \
--query "{Name: name, FD: instanceView.platformFaultDomain, UD: instanceView.platformUpdateDomain}" \
--output json

# Create VM in specific Availability Zone
az vm create \
--resource-group "rg-producao" \
--name "vm-web-z1" \
--zone 1 \
--image "Ubuntu2204" \
--size "Standard_D2s_v5" \
--admin-username "azureadmin" \
--ssh-key-values ~/.ssh/id_rsa.pub \
--location "brazilsouth"

# Create VM in Zone 2
az vm create \
--resource-group "rg-producao" \
--name "vm-web-z2" \
--zone 2 \
--image "Ubuntu2204" \
--size "Standard_D2s_v5" \
--admin-username "azureadmin" \
--ssh-key-values ~/.ssh/id_rsa.pub \
--location "brazilsouth"

# Create VM in Zone 3
az vm create \
--resource-group "rg-producao" \
--name "vm-web-z3" \
--zone 3 \
--image "Ubuntu2204" \
--size "Standard_D2s_v5" \
--admin-username "azureadmin" \
--ssh-key-values ~/.ssh/id_rsa.pub \
--location "brazilsouth"

# List VMs with their zones
az vm list \
--resource-group "rg-producao" \
--query "[].{Name: name, Zone: zones[0], Size: hardwareProfile.vmSize}" \
--output table

# Check which VM sizes are offered in Availability Zones in a region
# (an empty result suggests the region has no zone support)
az vm list-skus \
--location "brazilsouth" \
--resource-type virtualMachines \
--zone \
--output table

# List the regions for which the Compute provider reports zone mappings
az provider show \
--namespace Microsoft.Compute \
--query "resourceTypes[?resourceType=='virtualMachines'].zoneMappings[].location" \
--output table

Azure PowerShell​

# Create Availability Set
New-AzAvailabilitySet `
-ResourceGroupName "rg-producao" `
-Name "as-web-tier" `
-Location "brazilsouth" `
-PlatformFaultDomainCount 3 `
-PlatformUpdateDomainCount 5 `
-Sku "Aligned" # Aligned = Managed Disks

# Create VM in Availability Set
$avSet = Get-AzAvailabilitySet -ResourceGroupName "rg-producao" -Name "as-web-tier"
$cred = Get-Credential -Message "Admin credentials"

$vmConfig = New-AzVMConfig -VMName "vm-web-01" -VMSize "Standard_D2s_v5" `
-AvailabilitySetId $avSet.Id

$vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Linux -ComputerName "vm-web-01" -Credential $cred
$vmConfig = Set-AzVMSourceImage -VM $vmConfig -PublisherName "Canonical" -Offer "0001-com-ubuntu-server-jammy" -Skus "22_04-lts-gen2" -Version "latest"
# $nicId is assumed to hold the resource ID of an existing network interface
$vmConfig = Add-AzVMNetworkInterface -VM $vmConfig -Id $nicId

New-AzVM -ResourceGroupName "rg-producao" -Location "brazilsouth" -VM $vmConfig

# Create VM in Availability Zone
$vmConfig = New-AzVMConfig -VMName "vm-web-z1" -VMSize "Standard_D2s_v5" -Zone "1"

# ... complete configuration
New-AzVM -ResourceGroupName "rg-producao" -Location "brazilsouth" -VM $vmConfig -Zone "1"

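# View VM distribution across zones, FDs, and UDs
# FD and UD come from the instance view, so each VM is queried again with -Status
Get-AzVM -ResourceGroupName "rg-producao" | ForEach-Object {
$status = Get-AzVM -ResourceGroupName $_.ResourceGroupName -Name $_.Name -Status
[PSCustomObject]@{
Name = $_.Name
Zone = $_.Zones -join ","
FD = $status.PlatformFaultDomain
UD = $status.PlatformUpdateDomain
}
} | Format-Table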

Bicep​

// Availability Set
resource availabilitySet 'Microsoft.Compute/availabilitySets@2023-03-01' = {
  name: 'as-web-tier'
  location: 'brazilsouth'
  sku: {
    name: 'Aligned' // Aligned = Managed Disks
  }
  properties: {
    platformFaultDomainCount: 3
    platformUpdateDomainCount: 5
  }
}

// VM in Availability Set
resource vmInAvSet 'Microsoft.Compute/virtualMachines@2023-03-01' = {
  name: 'vm-web-01'
  location: 'brazilsouth'
  properties: {
    availabilitySet: {
      id: availabilitySet.id
    }
    hardwareProfile: {
      vmSize: 'Standard_D2s_v5'
    }
    // ... rest of properties
  }
}

// VM in Availability Zone
resource vmInZone 'Microsoft.Compute/virtualMachines@2023-03-01' = {
  name: 'vm-web-z1'
  location: 'brazilsouth'
  zones: ['1'] // Zone 1
  properties: {
    hardwareProfile: {
      vmSize: 'Standard_D2s_v5'
    }
    // ... rest of properties
  }
}

// Disk in Availability Zone (must be same zone as VM)
resource diskInZone 'Microsoft.Compute/disks@2022-07-02' = {
  name: 'vm-web-z1-datadisk'
  location: 'brazilsouth'
  zones: ['1'] // Same zone as VM
  sku: {
    name: 'Premium_LRS'
  }
  properties: {
    creationData: {
      createOption: 'Empty'
    }
    diskSizeGB: 128
  }
}

7. Control and Security​

Azure Policy to enforce high availability​

# Audit VMs without Availability Zone or Availability Set
az graph query -q "
Resources
| where type == 'microsoft.compute/virtualmachines'
| where isnull(zones) and isnull(properties.availabilitySet)
| project name, resourceGroup, location, subscriptionId
| order by location"

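# Policy that audits VMs without an HA mechanism
# Custom policy that checks whether a VM has a zone or an Availability Set.
# Policy rules address resource properties through aliases; to list them, run:
# az provider show --namespace Microsoft.Compute --expand "resourceTypes/aliases"
az policy definition create \
--name "audit-vm-ha-config" \
--display-name "VMs must have Availability Zone or Availability Set" \
--rules '{
"if": {
"allOf": [
{"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
{"field": "Microsoft.Compute/virtualMachines/zones", "exists": "false"},
{"field": "Microsoft.Compute/virtualMachines/availabilitySet.id", "exists": "false"}
]
},
"then": {"effect": "Audit"}
}' \
--mode "All"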

Network considerations for high availability​

Load Balancers and Application Gateways also need to be configured for high availability:

  • Standard Load Balancer is compatible with Availability Zones and can be configured as zone-redundant
  • Basic Load Balancer does not support Availability Zones (use Standard)
  • For VMs in Availability Zones, the Load Balancer must be zone-redundant or span multiple zones

8. Decision Making​

Availability Set vs. Availability Zone​

| Situation | Choice | Reason |
| --- | --- | --- |
| Region without AZ support (e.g., some smaller regions) | Availability Set | Only available option |
| Maximum SLA of 99.99% required | Availability Zone | Only way to achieve this SLA |
| Critical application in region with AZ support | Availability Zone | Protection against complete datacenter failure |
| Cost is a constraint and region has AZ support | Availability Zone | Same cost as a single VM, better SLA than AS |
| Already have VMs in AS and moving is costly | Keep AS | Migration to AZ requires recreating VMs |
| Database with Always On Availability Group | VMs in different zones | Replicas in separate zones survive a zone failure |
| Stateless application with many instances | Availability Zone | Distributes VMs across 3 separate datacenters |

Number of instances and FDs/UDs​

| Number of VMs | Recommended FDs | Recommended UDs | Reason |
| --- | --- | --- | --- |
| 2 VMs | 2 | 2 | Ensures the 2 are in different FDs |
| 3 VMs | 3 | 3 | One per FD, one per UD |
| 5 VMs | 3 | 5 | Optimal distribution for staged maintenance |
| 10 VMs | 3 | 10 | 3 FDs maximum, 10 UDs for finer-grained maintenance staging |
| 20+ VMs | 3 | 20 | Azure maximums |
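
The pattern in the table above amounts to "one domain per VM, capped at the Azure maximums". A minimal Python sketch of that rule follows; the function name `recommended_domains` is hypothetical, and the caps of 3 and 20 are the limits stated in this document (some regions allow only 2 Fault Domains).

```python
MAX_FAULT_DOMAINS = 3    # stated maximum; some regions allow only 2
MAX_UPDATE_DOMAINS = 20  # stated maximum


def recommended_domains(vm_count: int) -> tuple[int, int]:
    """Return (fault_domains, update_domains) for an Availability Set of vm_count VMs."""
    if vm_count < 2:
        raise ValueError("An Availability Set only helps with 2 or more VMs")
    return min(vm_count, MAX_FAULT_DOMAINS), min(vm_count, MAX_UPDATE_DOMAINS)


for n in (2, 3, 5, 10, 25):
    fds, uds = recommended_domains(n)
    print(f"{n} VMs -> {fds} FDs, {uds} UDs")
```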

9. Best Practices​

Prefer Availability Zones over Availability Sets for new architectures. Availability Zones offer superior protection (datacenter level vs. rack level) at the same cost. Availability Sets exist primarily for compatibility with regions without Zone support and for legacy workloads.

Distribute VMs from each tier across all available zones. For an application with 3 web server instances, place one in each zone (Z1, Z2, Z3). Don't place 2 in Z1 and 1 in Z2, because if Z1 fails you lose 2/3 of capacity.
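
The capacity argument can be checked with a few lines of Python. This sketch computes the fraction of instances that survive when the most heavily loaded zone fails; the function name `worst_case_capacity` is illustrative, and zones are represented simply as strings.

```python
from collections import Counter


def worst_case_capacity(zones_of_vms: list[str]) -> float:
    """Fraction of VMs still running after the busiest zone is lost."""
    counts = Counter(zones_of_vms)
    total = len(zones_of_vms)
    return (total - max(counts.values())) / total


balanced = worst_case_capacity(["1", "2", "3"])  # one VM per zone
skewed = worst_case_capacity(["1", "1", "2"])    # two VMs in zone 1

print(f"balanced: {balanced:.0%} of capacity survives")
print(f"skewed:   {skewed:.0%} of capacity survives")
```

With one VM per zone, two thirds of capacity survives any single zone failure; with the skewed layout only one third survives the loss of Zone 1.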

Use Standard Load Balancer, not Basic, for architectures with zones. The Standard LB can be configured as zone-redundant, ensuring load balancing continues even if a zone fails. The Basic LB doesn't support Availability Zones.

Configure Azure Monitor health alerts for VMs in HA. Even with multiple instances, monitor the health of each one. A failed instance can go unnoticed if others are absorbing the load, creating an "available but vulnerable" scenario.

Availability Sets should use Sku: Aligned for Managed Disks. The Aligned SKU ensures that a VM's managed disks are in the same Fault Domain as the VM. With the Classic SKU (legacy), disks and VMs can be in different FDs, compromising failure protection.

Plan maintenance based on Update Domains. For scheduled maintenance windows, know how many UDs you have: Azure works through them one at a time and allows 30 minutes of recovery after each. An AS with 5 UDs and 5 VMs therefore needs at least 2 hours (four 30-minute pauses between the five UDs), plus the time the updates themselves take.
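
As a rough planning aid, the waiting time between Update Domains can be estimated as follows. This is a lower bound only: it counts the 30-minute pauses mentioned in the text and ignores how long each update takes. The function name `min_maintenance_minutes` is hypothetical.

```python
def min_maintenance_minutes(update_domains: int, recovery_minutes: int = 30) -> int:
    """Lower bound on maintenance duration: pauses between consecutive UDs only."""
    pauses = max(update_domains - 1, 0)
    return pauses * recovery_minutes


for uds in (2, 5, 20):
    minutes = min_maintenance_minutes(uds)
    print(f"{uds} UDs -> at least {minutes} minutes ({minutes / 60:.1f} hours)")
```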


10. Common Errors​

| Error | Why it happens | How to avoid |
| --- | --- | --- |
| Trying to add a VM to an AS after creation | Not knowing AS membership is fixed at creation | Plan and create with AS from the start |
| VM in AS and Zone at the same time | Mutually exclusive | Choose one or the other at creation |
| All 3 VMs in the same FD due to lack of AS | VMs created individually without AS or Zone | Always use AS or Zone in production |
| AS with Classic SKU and Managed Disks | Incompatible configuration | Use Aligned SKU for AS with Managed Disks |
| Using Basic LB with VMs in Availability Zones | Basic LB doesn't support AZ | Always use Standard LB |
| One zone with 2 VMs and another with 1 | Unbalanced distribution | Distribute equally: 1 per zone |
| Not checking AZ support in region before architecting | Assuming all regions have AZ | Check support before defining architecture |
| Disk in Zone 1, VM in Zone 2 | Attempted cross-zone usage | Disks must be in the same zone as the VM |

The most critical error​

Not using any HA mechanism on production VMs, relying only on the 99.9% SLA of a single VM. Mathematically, this allows up to 8.76 hours of downtime per year (0.1% of 8,760 hours). For an application with 5 VMs and no distribution mechanism, if they all sit in the same rack and the rack fails, the application goes down completely. Adding Availability Zones costs nothing extra: VMs in Availability Zones cost the same as VMs without zones.


11. Operation and Maintenance​

Check VM distribution in FDs and UDs​

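# View FD/UD placement of all VMs in an Availability Set
# (FD and UD are reported in each VM's instance view)
for id in $(az vm availability-set show \
--resource-group "rg-producao" \
--name "as-web-tier" \
--query "virtualMachines[].id" \
--output tsv); do
az vm get-instance-view \
--ids "$id" \
--query "{Name: name, FD: instanceView.platformFaultDomain, UD: instanceView.platformUpdateDomain}" \
--output json
done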

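# Via Resource Graph: all VMs with zone and AS
# (the faultDomain/updateDomain columns may be empty for Availability Set VMs,
# whose FD/UD placement is reported in the instance view)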
az graph query -q "
Resources
| where type == 'microsoft.compute/virtualmachines'
| project
name,
resourceGroup,
location,
zone=tostring(zones[0]),
availabilitySet=tostring(properties.availabilitySet.id),
faultDomain=tostring(properties.platformFaultDomain),
updateDomain=tostring(properties.platformUpdateDomain)
| order by location, zone"

# Count VMs per zone in a region (to check distribution)
az graph query -q "
Resources
| where type == 'microsoft.compute/virtualmachines'
| where location == 'brazilsouth'
| summarize count() by zone=tostring(zones[0])
| order by zone"

Monitor instance health​

# View status of each VM in an Availability Set
for vm in vm-web-01 vm-web-02 vm-web-03; do
echo "=== $vm ==="
az vm get-instance-view \
--resource-group "rg-producao" \
--name "$vm" \
--query "instanceView.statuses[].{Code: code, DisplayStatus: displayStatus}" \
--output table
done

# Configure alert for when a VM becomes unavailable
az monitor activity-log alert create \
--name "alerta-vm-indisponivel" \
--resource-group "rg-monitoramento" \
--condition \
category=ResourceHealth \
--scope "/subscriptions/<sub-id>/resourceGroups/rg-producao"

Important limits​

| Resource | Limit |
| --- | --- |
| Fault Domains per Availability Set | 3 (maximum; 2 in some regions) |
| Update Domains per Availability Set | 20 (maximum) |
| VMs per Availability Set | 200 |
| Availability Zones per region | At least 3 in zone-enabled regions, exposed as logical zones 1, 2, and 3 |
| VMs per Availability Zone | Limited by vCPU quota |

12. Integration and Automation​

Deploy VMs distributed across zones via Terraform​

variable "vm_zones" {
  default = ["1", "2", "3"]
}

resource "azurerm_linux_virtual_machine" "web" {
  count = 3

  name                = "vm-web-z${var.vm_zones[count.index]}"
  resource_group_name = azurerm_resource_group.prod.name
  location            = azurerm_resource_group.prod.location
  size                = "Standard_D2s_v5"
  zone                = var.vm_zones[count.index] # Z1, Z2, Z3

  # ... rest of configurations

  tags = {
    Zone = "zone-${var.vm_zones[count.index]}"
  }
}

# Zone-redundant Load Balancer
resource "azurerm_lb" "main" {
  name                = "lb-web"
  resource_group_name = azurerm_resource_group.prod.name
  location            = azurerm_resource_group.prod.location
  sku                 = "Standard" # Standard = AZ support

  frontend_ip_configuration {
    name                 = "frontend"
    public_ip_address_id = azurerm_public_ip.lb.id
    zones                = ["1", "2", "3"] # Zone-redundant
  }
}

Azure Policy to ensure zone distribution​

# Custom Policy: Production VMs must be in Availability Zones
# (the zones property is addressed through its policy alias)
az policy definition create \
--name "require-vm-availability-zone" \
--display-name "Production VMs must use Availability Zones" \
--rules '{
"if": {
"allOf": [
{"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
{"field": "tags.Environment", "equals": "Production"},
{"field": "Microsoft.Compute/virtualMachines/zones", "exists": "false"}
]
},
"then": {"effect": "Deny"}
}' \
--mode "All"

# Assign policy to production RG
az policy assignment create \
--name "enforce-az-producao" \
--policy "require-vm-availability-zone" \
--scope "/subscriptions/<sub-id>/resourceGroups/rg-producao"

13. Final Summary​

Essential points:

  • Availability Set distributes VMs between Fault Domains (physical racks) and Update Domains (maintenance groups) within a single datacenter. Protects against hardware failures and planned maintenance.
  • Availability Zone distributes VMs between physically separate datacenters within a region. Protects against complete datacenter failures.
  • Maximum of 3 Fault Domains and 20 Update Domains per Availability Set
  • Azure automatically distributes VMs between FDs and UDs when adding to AS; you don't control which FD each VM goes to
  • A VM cannot be added to an AS after its creation; AS membership is fixed when the VM is deployed
  • AS and Availability Zone are mutually exclusive: choose one or the other at VM creation
  • Managed disks in AZ must be in the same zone as the VM

Critical differences:

  • Fault Domain vs. Update Domain: FD protects against physical failures (rack/power); UD controls planned maintenance (sequencing of reboots)
  • Availability Set vs. Availability Zone: AS protects within datacenter (rack); AZ protects between datacenters (building)
  • SLA: Single VM = 99.9%; 2+ VMs in AS = 99.95%; 2+ VMs in AZ = 99.99%
  • SKU Aligned vs. Classic for AS: Aligned is necessary for Managed Disks; Classic is legacy

What needs to be remembered for AZ-104:

  • To create VM in AS: --availability-set <name> in CLI
  • To create VM in Zone: --zone <1|2|3> in CLI
  • Default number of FDs in portal: 2 (configurable up to 3)
  • Default number of UDs in portal: 5 (configurable up to 20)
  • Basic Load Balancer doesn't support Availability Zones; use Standard LB
  • Availability Zones aren't available in all regions; check before architecting
  • VMs in AS receive 99.95% SLA; VMs in different AZ receive 99.99%
  • The combined SLA formula for N independent VMs: 1 - (1 - individual_SLA)^N
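
The combined-availability formula in the last bullet can be worked through in Python. This is a theoretical calculation that assumes the VMs fail independently; it is not the contractual SLA that Azure publishes for Availability Sets or Zones. The function name `combined_availability` is illustrative.

```python
def combined_availability(individual_sla: float, n: int) -> float:
    """Probability that at least one of n independent instances is up.

    individual_sla is a fraction (0.999 for 99.9%), not a percentage.
    """
    return 1 - (1 - individual_sla) ** n


for n in (1, 2, 3):
    availability = combined_availability(0.999, n)
    print(f"{n} VM(s) at 99.9% each -> {availability * 100:.7f}% combined")
```

In practice failures are rarely fully independent (shared racks, shared datacenters), which is exactly why fault domains and zones matter: they push real deployments closer to the independence this formula assumes.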