Troubleshooting Lab: Choose an Appropriate Tier
Diagnostic Scenarios
Scenario 1: Root Cause
The operations team reports that a newly created virtual machine is not receiving external traffic, even though the Load Balancer responds normally for the other VMs in the same backend pool. The environment was configured two days ago and was working correctly until the Load Balancer migration performed yesterday.
The responsible engineer collects the following information:
Load Balancer SKU: Basic
VM added to backend pool: vm-prod-03
VM Availability Zone: Zone 2
Load balancing rule: port 80, TCP protocol
NSG associated with vm-prod-03 NIC: allows HTTP inbound
Health probe: HTTP on port 80, status: Degraded
During the investigation, the engineer observes that the other VMs in the pool are not in availability zones. vm-prod-03 was the only one migrated to a zone during yesterday's update.
What is the root cause of the observed problem?
A) The health probe is configured with the wrong protocol; it should be TCP instead of HTTP for the Basic SKU.
B) The Basic Load Balancer does not support backends in Availability Zones, so vm-prod-03 cannot be an active member of the pool.
C) The NIC's NSG is not allowing probe traffic originating from address 168.63.129.16.
D) The load balancing rule is configured for TCP, but the health probe uses HTTP, which creates a conflict in the Basic Load Balancer.
Scenario 2: Action Decision
The team identified that a 1 Gbps ExpressRoute circuit is consistently utilized above 90% of capacity during peak hours, causing packet drops and degradation of critical ERP applications. The circuit is provider-based, and the provider confirms that an upgrade to 2 Gbps is available at the same peering location without BGP session interruption.
The operational context is:
Peak hours: 08:00 to 18:00, weekdays
Approved maintenance window: Sunday 02:00 to 06:00
Current day: Thursday, 14:30
Critical applications: in full production
Contractual SLA: 99.95% monthly availability
Change approval already obtained for bandwidth upgrade
The cause of the problem is identified: the circuit is undersized for current demand.
What is the correct action to take at this moment?
A) Start the bandwidth upgrade immediately with the provider, since approval has already been obtained and the provider confirmed that BGP will not be interrupted.
B) Wait for the approved maintenance window on Sunday to execute the upgrade, even though the provider can do it without interruption.
C) Create a second parallel ExpressRoute circuit and redistribute load via BGP as a short-term solution until Sunday.
D) Reduce non-critical traffic via QoS on the ExpressRoute gateway to alleviate saturation until the maintenance window.
Scenario 3: Root Cause
A company operates a Standard-type Virtual WAN with two regional hubs: one in East US and another in Brazil South. The security team recently configured Routing Intent in both hubs with Azure Firewall as next hop for private and internet traffic. After activation, VMs in the spoke connected to the Brazil South hub lost connectivity with an on-premises server connected via site-to-site VPN to the East US hub.
The engineer collects the data below:
Routing Intent: enabled on both hubs
Azure Firewall Brazil South: provisioned, active policies
Azure Firewall East US: provisioned, active policies
Effective routes on VM (Brazil South spoke):
10.0.0.0/8 -> Next hop: Azure Firewall Brazil South
0.0.0.0/0 -> Next hop: Azure Firewall Brazil South
VPN Gateway East US: status Connected
BGP session with on-premises: Established
Brazil South Spoke VNet: active peering with hub
The team mentions that last week, a new service tag policy was added to Azure Firewall Brazil South blocking the 10.200.0.0/16 prefix, which is exactly the on-premises server range.
What is the root cause of the connectivity loss?
A) Routing Intent is forcing inter-hub traffic through Azure Firewall, and the firewall policy in Brazil South is blocking the on-premises prefix.
B) The BGP session between the VPN Gateway and on-premises server was interrupted by Routing Intent activation, which changes routes advertised by the hub.
C) The peering between the spoke VNet and Brazil South hub was invalidated by Routing Intent activation, preventing on-premises routes from being propagated.
D) Azure Firewall East US does not have a network rule allowing traffic originated from the Brazil South spoke range, blocking the flow at the destination hub.
Scenario 4: Diagnostic Sequence
An engineer receives a report that VMs in a spoke VNet cannot resolve names of other private resources in Azure after a DNS topology reorganization. The environment uses Azure Private DNS Zone without custom DNS servers.
The following investigation steps are available, out of order:
Step P: Verify if the Private DNS Zone has A records for target resources
Step Q: Confirm if the spoke VNet has a resolution Virtual Network Link with the Private DNS Zone
Step R: Test DNS resolution with nslookup from a VM in the spoke
Step S: Verify if the DNS server configured in the spoke VNet points to 168.63.129.16
Step T: Check if there is namespace overlap between the Private DNS Zone and a public zone of the same name
What is the correct progressive diagnostic sequence?
A) R -> S -> Q -> P -> T
B) S -> Q -> R -> P -> T
C) Q -> P -> S -> R -> T
D) P -> Q -> S -> T -> R
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The determining clue in the statement is that vm-prod-03 was placed in an Availability Zone during yesterday's migration, while the other VMs are not in zones. The Basic Load Balancer does not support backends that use Availability Zones. This is a structural product requirement: any zoned backend requires the Standard SKU.
The information about NSG allowing HTTP is irrelevant for this diagnosis and was purposely included to divert attention. The health probe in Degraded state is a consequence of the problem, not the cause.
Distractor C is the most dangerous: address 168.63.129.16 is indeed the origin of Azure probes and a misconfigured NSG can block probes, but the statement explicitly says the NSG allows HTTP inbound. Distractor A represents a misconception about protocol compatibility between probe and rule, which is not a real Basic Load Balancer limitation. Distractor D incorrectly mixes two true facts to create a false cause.
Acting based on distractor C would lead the engineer to modify the NSG without solving the real problem, potentially opening unnecessary security gaps.
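The compatibility rule behind answer B can be expressed as a small pre-check. The following is a minimal sketch, not an Azure SDK call: the `BackendVM` class, the `incompatible_backends` function, and the names vm-prod-01 and vm-prod-02 are hypothetical and exist only for illustration. It assumes, as the explanation states, that any zonal backend requires the Standard SKU.

```python
from dataclasses import dataclass, field


@dataclass
class BackendVM:
    """Hypothetical record describing a backend pool member."""
    name: str
    zones: list[str] = field(default_factory=list)  # empty list = non-zonal VM


def incompatible_backends(lb_sku: str, backends: list[BackendVM]) -> list[str]:
    """Return the names of zonal VMs that a Basic SKU load balancer cannot serve."""
    if lb_sku.lower() != "basic":
        return []  # only the Basic SKU carries this restriction in the scenario
    return [vm.name for vm in backends if vm.zones]


# vm-prod-01 and vm-prod-02 are assumed names for the non-zonal pool members.
pool = [
    BackendVM("vm-prod-01"),
    BackendVM("vm-prod-02"),
    BackendVM("vm-prod-03", zones=["2"]),
]

print(incompatible_backends("Basic", pool))     # ['vm-prod-03']
print(incompatible_backends("Standard", pool))  # []
```

Running a check like this before touching the NSG keeps the investigation on the structural cause rather than on the symptom reported by the probe.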
Answer Key: Scenario 2
Answer: B
The cause is identified and the technical solution is approved. The critical point of the scenario is the process restriction: there is an approved maintenance window for Sunday. Even though the provider confirms the upgrade can be done without BGP interruption, executing a critical infrastructure change outside the approved window violates the change management process and exposes the company to SLA and accountability risks.
Distractor A is the most dangerous: the technical logic is correct, but it completely ignores the process restriction. In environments with a contractual SLA and formal change management, change approval is tied to the maintenance window, not just to the nature of the change. Distractor C introduces unnecessary operational complexity and additional risk without real gain. Distractor D is a valid short-term mitigation, but the statement does not ask for mitigation; it asks for the correct action given the complete context.
Waiting for the window is the technically responsible and procedurally correct decision.
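The decision reduces to two facts that can be checked mechanically: whether the current moment falls inside the approved window, and what the upgrade does to utilization. The sketch below is illustrative only. The calendar dates are hypothetical (chosen because they fall on a Thursday and a Sunday), and the 900 Mbps peak is an assumed figure standing in for "above 90%" of a 1 Gbps circuit.

```python
from datetime import datetime, time

# Approved window from the scenario: Sunday 02:00 to 06:00.
WINDOW_WEEKDAY = 6          # datetime.weekday(): Monday = 0 ... Sunday = 6
WINDOW_START = time(2, 0)
WINDOW_END = time(6, 0)


def in_maintenance_window(now: datetime) -> bool:
    """True only when `now` falls inside the approved maintenance window."""
    return now.weekday() == WINDOW_WEEKDAY and WINDOW_START <= now.time() < WINDOW_END


def utilization(traffic_mbps: float, circuit_mbps: float) -> float:
    """Fraction of the circuit bandwidth in use."""
    return traffic_mbps / circuit_mbps


thursday_afternoon = datetime(2024, 5, 16, 14, 30)  # hypothetical date, a Thursday
sunday_night = datetime(2024, 5, 19, 3, 0)          # hypothetical date, a Sunday

print(in_maintenance_window(thursday_afternoon))  # False
print(in_maintenance_window(sunday_night))        # True
print(utilization(900, 1000))  # 0.9  (assumed peak on the 1 Gbps circuit)
print(utilization(900, 2000))  # 0.45 (same peak after the 2 Gbps upgrade)
```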
Answer Key: Scenario 3
Answer: A
The effective routes confirm that traffic from the VM destined for any private prefix (10.0.0.0/8) is being forwarded to Azure Firewall Brazil South. Routing Intent, when enabled for private traffic, forces all inter-hub and on-premises flow to pass through the origin hub's firewall before being routed. The recently added policy explicitly blocks prefix 10.200.0.0/16, which is exactly the on-premises server range. The combination of these two conditions is the root cause.
The Established BGP session is not the cause and was included on purpose. It does carry diagnostic value, however: because BGP is up, the problem is not control plane connectivity but data plane filtering by the firewall.
Distractor B is the most superficially plausible: Routing Intent indeed changes advertised routes, but the Established BGP in the statement eliminates this hypothesis. Distractor D points to the topologically correct firewall (flow passes through East US afterward), but the blocking happens earlier, at the spoke's egress firewall, not at the destination. Distractor C confuses the Routing Intent effect with peering invalidation, which doesn't occur.
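The reasoning for answer A combines a route lookup with a policy lookup. The sketch below reproduces both with Python's standard `ipaddress` module, using the effective routes and the blocked prefix from the scenario. The server address 10.200.1.10 and the function names are assumptions made for illustration; this models the logic only and does not query Azure.

```python
import ipaddress

# Effective routes observed on the VM in the Brazil South spoke (prefix -> next hop).
EFFECTIVE_ROUTES = {
    "10.0.0.0/8": "Azure Firewall Brazil South",
    "0.0.0.0/0": "Azure Firewall Brazil South",
}
# Prefix denied by the policy recently added to Azure Firewall Brazil South.
BLOCKED_PREFIXES = ["10.200.0.0/16"]


def next_hop(destination: str) -> str:
    """Select the next hop by longest-prefix match over the effective routes."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in EFFECTIVE_ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    best = max(matches, key=lambda net: net.prefixlen)
    return EFFECTIVE_ROUTES[str(best)]


def is_blocked(destination: str) -> bool:
    """True when the destination falls inside a prefix the firewall policy denies."""
    addr = ipaddress.ip_address(destination)
    return any(addr in ipaddress.ip_network(p) for p in BLOCKED_PREFIXES)


server = "10.200.1.10"  # hypothetical address inside the on-premises range

print(next_hop(server))    # Azure Firewall Brazil South
print(is_blocked(server))  # True
```

The first result shows where the packet goes, and the second shows why it never leaves: the flow is dropped at the origin hub's firewall before it can reach East US.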
Answer Key: Scenario 4
Answer: A
The correct sequence is R -> S -> Q -> P -> T.
The progressive diagnostic reasoning always starts from the observable symptom and eliminates layers:
R confirms the symptom precisely: nslookup fails, and the output reveals whether the resolution failure is total or partial.
S validates the most basic configuration layer: if the VNet DNS does not point to 168.63.129.16, no Private DNS Zone will be consulted, regardless of any other adjustment.
Q verifies that the resolution link between the spoke VNet and the private zone exists, which is the binding requirement.
P confirms that the records exist in the zone.
T investigates the possibility of a namespace conflict, which is the least common and hardest-to-detect cause and should therefore be checked last.
Distractor B reverses S and Q, which may seem reasonable, but validating the link before the DNS server means investigating the binding layer before confirming the correct resolver is being used. Distractor C starts with the binding layer without confirming the symptom or resolver. Distractor D starts with records, which is the innermost layer, without validating the outer layers that may be blocking first.
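The layered order R -> S -> Q -> P -> T can be sketched as a short routine that confirms the symptom and then stops at the first unhealthy layer. This is a minimal illustration under stated assumptions: the `env` dictionary and its keys are hypothetical stand-ins for observations an engineer would collect by hand or with tooling, and no real DNS query is performed.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[Dict[str, bool]], bool]

# Layers in diagnostic order, outermost first. Each check returns True when healthy.
LAYERS: List[Tuple[str, str, Check]] = [
    ("S", "VNet DNS server does not point to 168.63.129.16", lambda e: e["uses_azure_dns"]),
    ("Q", "spoke VNet has no virtual network link to the zone", lambda e: e["vnet_linked"]),
    ("P", "zone has no A record for the target", lambda e: e["record_exists"]),
    ("T", "namespace overlaps with a public zone", lambda e: not e["namespace_overlap"]),
]


def diagnose(env: Dict[str, bool]) -> str:
    """Walk the layers in order and report the first one that is unhealthy."""
    # Step R: reproduce the symptom before inspecting any configuration.
    if env["nslookup_ok"]:
        return "R: resolution works, nothing to diagnose"
    for step, finding, healthy in LAYERS:
        if not healthy(env):
            return f"{step}: {finding}"
    return "No cause found in the checked layers"


# Hypothetical observations: the resolver is correct but the VNet link is missing.
env = {
    "nslookup_ok": False,
    "uses_azure_dns": True,
    "vnet_linked": False,
    "record_exists": True,
    "namespace_overlap": False,
}

print(diagnose(env))  # Q: spoke VNet has no virtual network link to the zone
```

Stopping at the first failing layer mirrors the reasoning in the explanation: an inner layer such as the zone records is not worth inspecting while an outer layer such as the resolver or the link is still unverified.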
Troubleshooting Tree: Choose an Appropriate Tier
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary decision or by state) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start at the root node by identifying the affected service and answer each diagnostic question based on what can be directly observed or measured in the environment. Follow the path corresponding to the answer until reaching a red node of identified cause, then advance to the green node of recommended action. Orange nodes indicate that the investigation needs additional data before concluding the diagnosis.