Troubleshooting Lab: Design name resolution inside a VNet
Diagnostic Scenarios
Scenario 1: Root Cause
A development team reports that VMs in vnet-app intermittently fail to resolve internal names during peak hours. The infrastructure team checks the environment and collects the following information:
- VNet vnet-app uses two custom DNS servers: 10.2.0.4 and 10.2.0.5
- Both servers are Windows Server DNS VMs configured as forwarders to 168.63.129.16
- Private DNS zone app.internal is linked to the VNet with auto-registration enabled
- Affected VMs have accelerated networking enabled
- Subnet NSG allows unrestricted outbound traffic
- 10.2.0.5 was provisioned two weeks ago to increase resilience
Output collected from an affected VM during a failure:
C:\> nslookup svc-orders.app.internal
Server: 10.2.0.5
Address: 10.2.0.5
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to 10.2.0.5 timed-out
The VM then retries and resolution works using 10.2.0.4.
What is the root cause of the intermittent behavior?
A) The NSG is selectively blocking UDP queries on port 53 to 10.2.0.5 during traffic peaks.
B) Server 10.2.0.5 is not properly configured as a forwarder to 168.63.129.16, causing failures only for queries that reach it.
C) Auto-registration enabled on the private zone is generating record conflicts during peak hours, making records temporarily inconsistent.
D) Accelerated networking on affected VMs alters DNS client load balancing behavior, causing 10.2.0.5 to receive more requests than it can process.
Scenario 2: Diagnostic Sequence
A VM in vnet-hub cannot resolve on-premises host names. The environment uses:
- A DNS forwarder at 10.0.0.10 (Linux VM with Bind9) configured in the VNet
- Peering between vnet-hub and vnet-spoke
- A private DNS zone azure.corp linked to vnet-hub
- On-premises DNS servers at 192.168.1.10 and 192.168.1.11, accessible via ExpressRoute
The responsible engineer has the following investigation steps available:
1. Verify that 10.0.0.10 has connectivity to 192.168.1.10 on port 53
2. Verify that VNet vnet-hub is configured to use 10.0.0.10 as custom DNS
3. Run nslookup <on-premises-host> 10.0.0.10 from the affected VM
4. Verify that Bind9 on 10.0.0.10 has a forwarding zone configured for the on-premises domain
5. Verify that the affected VM can ping 10.0.0.10
What is the correct investigation sequence?
A) 5, 2, 3, 4, 1
B) 2, 5, 3, 4, 1
C) 3, 1, 4, 2, 5
D) 5, 3, 2, 4, 1
Scenario 3: Root Cause
An architect receives a ticket: VMs in vnet-spoke cannot resolve records from private DNS zone corp.internal. She accesses the portal and documents the current state:
| Resource | Configuration |
|---|---|
| Private DNS zone | corp.internal |
| Link with vnet-hub | Present, auto-registration enabled |
| Link with vnet-spoke | Present, auto-registration disabled |
| Peering vnet-hub / vnet-spoke | Active, bidirectional |
| Custom DNS in vnet-spoke | None (using Azure default) |
| Custom DNS in vnet-hub | None (using Azure default) |
The architect tests from a VM in vnet-hub and resolution works perfectly. The peering was configured three days ago with no changes since.
# Test from VM in vnet-spoke
$ nslookup svc-auth.corp.internal
Server: 168.63.129.16
Address: 168.63.129.16#53
** server can't find svc-auth.corp.internal: NXDOMAIN
What is the root cause?
A) Bidirectional peering between vnet-hub and vnet-spoke does not automatically propagate access to private DNS zones linked to the partner VNet.
B) Auto-registration disabled on the link with vnet-spoke prevents any zone records from being resolved from that VNet.
C) Resolver 168.63.129.16 returns NXDOMAIN for VNets that don't have custom DNS configured, requiring a forwarder to be added.
D) The link with vnet-spoke was created after the peering, and Azure requires private DNS zone links to be created before peering to work correctly.
Scenario 4: Action Decision
The cause has been identified: VNet vnet-prod is configured with a custom DNS server (10.1.0.20) that does not have a conditional forwarding rule for private zone payments.internal. As a result, all VMs in vnet-prod resolve external names correctly but fail to resolve any records in payments.internal.
The current context is:
- The environment is in production with dozens of active VMs
- Changing the VNet's custom DNS forces a DHCP lease renewal on all VMs
- The scheduled maintenance window occurs in 6 hours
- Server 10.1.0.20 is managed by the security team, which confirms immediate access to apply configurations
- An open incident classifies the impact as moderate, without total service interruption
What is the correct action to take at this moment?
A) Remove the custom DNS from the VNet immediately, reverting to Azure's default resolver, to restore private zone resolution without waiting for the maintenance window.
B) Request the security team to add conditional forwarding for payments.internal pointing to 168.63.129.16 on server 10.1.0.20, without changing the VNet configuration.
C) Create a second link of zone payments.internal with vnet-prod with auto-registration enabled, fixing resolution without needing to change the DNS server.
D) Wait for the maintenance window and, during it, replace the custom DNS server with a new VM with the correct forwarding configuration.
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive clue in the scenario is that 10.2.0.5 was recently provisioned and that failures occur specifically when queries reach this server. The DNS client tries 10.2.0.5, waits for the 2-second timeout, and on the next attempt uses 10.2.0.4, where resolution works normally.
The intermittent behavior is not random: it is the predictable result of two DNS servers with different configurations, only one of which is correctly configured as a forwarder to 168.63.129.16. Because the client alternates between the two servers listed in the VNet configuration, roughly half of the queries fail, and the failures are more visible during peak hours simply because query volume is higher.
The irrelevant information is accelerated networking. This feature affects VM network performance at virtualization layers but does not interfere with DNS server selection logic by the client.
The most dangerous distractor is option A: attributing the problem to the NSG can lead the engineer to create unnecessary permissive rules, consuming time without solving the real cause. Checking the forwarding configuration on the newly added server is the correct first step.
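One way to confirm that only one of the two servers is failing is to query each server directly instead of letting the client choose. The following is a minimal sketch using only the Python standard library. The function names build_query and probe are illustrative and not part of any Azure or Windows tooling, and the 2-second timeout simply mirrors the nslookup output in the scenario.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query (type A, class IN, recursion desired)."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, terminated by a zero byte
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Send one UDP query to a single server and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    if len(reply) < 4:
        return "malformed reply"
    rcode = reply[3] & 0x0F  # RCODE is the low 4 bits of the second flags byte
    return {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}.get(rcode, f"rcode {rcode}")
```

Calling probe("10.2.0.4", "svc-orders.app.internal") and probe("10.2.0.5", "svc-orders.app.internal") from an affected VM would show whether the timeouts follow one specific server, which is the pattern the scenario describes.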
Answer Key: Scenario 2
Answer: B
The correct sequence is: 2, 5, 3, 4, 1.
The diagnostic reasoning moves from broadest to most specific, eliminating layers progressively:
- Step 2 first: confirm that the VNet is actually using 10.0.0.10 as custom DNS. If the VNet still uses Azure's default resolver, no forwarder will work and the remaining steps are irrelevant.
- Step 5 next: verify basic connectivity to the forwarder. If the VM cannot reach 10.0.0.10, the problem is routing, not DNS.
- Step 3: test resolution directly against the forwarder to isolate whether the problem is in the client/forwarder chain or forwarder/upstream.
- Step 4: verify that Bind9 has the correct forwarding zone for the on-premises domain. This is the most likely cause given the scenario.
- Step 1 last: confirm that the forwarder can reach on-premises servers on port 53. This step only makes sense after confirming the previous chain is intact.
Option A seems logical by starting with connectivity, but testing ping before confirming that custom DNS is configured in the VNet inverts the diagnostic priority.
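The layered reasoning behind the sequence 2, 5, 3, 4, 1 can be sketched as a small decision function. This is an illustration only: diagnose is a hypothetical helper, the boolean inputs stand for what the engineer observed at each numbered step, and the conclusion strings paraphrase the explanations above rather than any tool output.

```python
from typing import Dict


def diagnose(passed: Dict[int, bool]) -> str:
    """Walk the checks in the order 2, 5, 3, 4, 1 and name the layer that fails.

    `passed` maps each step number from the scenario to True (check succeeded)
    or False (check failed).
    """
    if not passed[2]:
        return "step 2: vnet-hub is not using 10.0.0.10 as custom DNS"
    if not passed[5]:
        return "step 5: the VM cannot reach 10.0.0.10 (routing, not DNS)"
    if passed[3]:
        # A direct query to the forwarder works, so the forwarder/upstream
        # chain is fine and the remaining suspects are on the client side.
        return "step 3: the forwarder answers when queried directly; look at the client side"
    if not passed[4]:
        return "step 4: Bind9 has no forwarding zone for the on-premises domain"
    if not passed[1]:
        return "step 1: 10.0.0.10 cannot reach 192.168.1.10 on port 53"
    return "all checks passed: cause not covered by these five steps"
```

For example, diagnose({1: True, 2: True, 3: False, 4: False, 5: True}) points at step 4, the cause the answer key considers most likely for this scenario.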
Answer Key: Scenario 3
Answer: A
The private DNS zone corp.internal link with vnet-spoke exists and is active. VNet Peering is also active and bidirectional. Still, the VM in vnet-spoke queries 168.63.129.16 and receives NXDOMAIN.
The cause is that the private DNS zone link is the mechanism by which resolver 168.63.129.16 knows it should respond to queries for that zone from a specific VNet. The fact that vnet-hub and vnet-spoke are peered does not change this logic: each VNet needs its own link with the zone for the resolver to answer its queries.
The information about peering being configured three days ago is irrelevant and serves as a temporal distractor, suggesting the problem may have emerged over time. In practice, the behavior hasn't changed, because the cause is unrelated to how long the peering has existed.
Option B is the most common distractor: it confuses auto-registration (which controls whether VMs are registered in the zone) with resolution capability (which depends only on the link existing). Disabling auto-registration on the vnet-spoke link therefore does not block resolution; VMs in that VNet can resolve zone records as long as a functional link is present on the correct VNet.
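The rule that resolution depends on the zone link and not on peering or auto-registration can be illustrated with a toy model. This sketch does not call any Azure API: PrivateZone and resolve are hypothetical names, and the record address 10.0.1.4 is invented for the example. The peerings argument is accepted but deliberately never consulted, to mirror the behavior described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class PrivateZone:
    name: str
    linked_vnets: Set[str]  # VNets that have a virtual network link to the zone
    records: Dict[str, str] = field(default_factory=dict)  # host -> IP


def resolve(fqdn: str, source_vnet: str, zones: List[PrivateZone],
            peerings: Set[Tuple[str, str]]) -> str:
    """Toy model of the default resolver: only zone links matter.

    `peerings` is intentionally ignored: being peered with a linked VNet
    does not make a zone visible to the source VNet.
    """
    for zone in zones:
        if fqdn.endswith("." + zone.name) and source_vnet in zone.linked_vnets:
            host = fqdn[: -len(zone.name) - 1]
            return zone.records.get(host, "NXDOMAIN")
    return "NXDOMAIN"


# Hypothetical data: the zone is linked to vnet-hub only, and the VNets are peered.
zone = PrivateZone("corp.internal", {"vnet-hub"}, {"svc-auth": "10.0.1.4"})
peerings = {("vnet-hub", "vnet-spoke"), ("vnet-spoke", "vnet-hub")}

from_hub = resolve("svc-auth.corp.internal", "vnet-hub", [zone], peerings)
from_spoke = resolve("svc-auth.corp.internal", "vnet-spoke", [zone], peerings)

# Adding a link for vnet-spoke (auto-registration is irrelevant here) changes the result.
zone.linked_vnets.add("vnet-spoke")
from_spoke_linked = resolve("svc-auth.corp.internal", "vnet-spoke", [zone], peerings)
```

In this model the query from vnet-spoke fails until the spoke has its own link, regardless of the peering state.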
Answer Key: Scenario 4
Answer: B
The cause has already been identified: custom DNS server 10.1.0.20 does not have conditional forwarding for payments.internal. The correct solution, given the constraint set, is to fix the server configuration without changing the VNet configuration.
Adding conditional forwarding directly on server 10.1.0.20 solves the problem without impact on production VMs: no VNet DHCP changes are needed, no lease renewal is forced, and the security team confirmed immediate availability to apply the change.
Option A is technically valid in another context, but here it's destructive: removing custom DNS not only causes DHCP renewal on all VMs, but also undoes any other resolution configuration that server 10.1.0.20 might be providing correctly for other domains.
Option C represents a frequent conceptual mistake: creating a second zone link with the VNet doesn't solve the problem, because the problem isn't the link, it's the forwarding on the custom DNS server. With custom DNS configured, VMs don't query 168.63.129.16 directly; they query 10.1.0.20.
Option D ignores that the security team has immediate access and that the fix can be made without a maintenance window, unnecessarily postponing resolution of an active incident.
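The forwarding behavior behind answer B can be sketched as a suffix lookup: the custom DNS server sends a query to the conditional forwarder whose domain is the longest matching suffix, and otherwise to its default upstream. This is a simplified model, not the configuration of any specific DNS product. The function pick_upstream, the placeholder "default-upstream", and the host name api.payments.internal are assumptions introduced for illustration.

```python
from typing import Dict

AZURE_RESOLVER = "168.63.129.16"


def pick_upstream(fqdn: str, conditional: Dict[str, str], default: str) -> str:
    """Return the upstream for `fqdn`: longest matching conditional suffix, else default."""
    name = fqdn.rstrip(".").lower()
    # Normalize rule keys so trailing dots and letter case do not affect matching
    rules = {suffix.rstrip(".").lower(): target for suffix, target in conditional.items()}
    # A rule matches the domain itself or any name below it (on a label boundary)
    matches = [s for s in rules if name == s or name.endswith("." + s)]
    if not matches:
        return default
    return rules[max(matches, key=len)]


# Before the fix: no rule for payments.internal, so the query goes to the
# default upstream, which knows nothing about the private zone.
before = pick_upstream("api.payments.internal", {}, "default-upstream")

# After the fix: the conditional forwarder sends the zone to Azure's resolver.
fixed_rules = {"payments.internal": AZURE_RESOLVER}
after = pick_upstream("api.payments.internal", fixed_rules, "default-upstream")
external = pick_upstream("www.example.com", fixed_rules, "default-upstream")
```

The model shows why the fix is local to 10.1.0.20: external names keep using the default upstream, and only payments.internal is redirected, so nothing changes on the VNet or on the VMs.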
Troubleshooting Tree: Design name resolution inside a VNet
Legend:
- Dark blue: initial symptom, diagnostic entry point
- Blue: diagnostic question, state verification or condition
- Red: identified cause
- Green: recommended action or resolution
- Orange: validation or intermediate verification before concluding diagnosis
To use this tree when facing a real problem, start at the root node describing the symptom and follow the branches by answering each question based on what is observable in the environment. Each answer eliminates a set of hypotheses and directs to the next more specific verification. Do not advance to a green action without having gone through the complete diagnostic path: acting on the wrong cause in a DNS environment can introduce cascading failures that obscure the original cause.