Troubleshooting Lab: Design name resolution inside a VNet

Diagnostic Scenarios

Scenario 1: Root Cause

A development team reports that VMs in vnet-app intermittently fail to resolve internal names during peak hours. The infrastructure team checks the environment and collects the following information:

  • VNet vnet-app uses two custom DNS servers: 10.2.0.4 and 10.2.0.5
  • Both servers are Windows Server DNS VMs configured as forwarders to 168.63.129.16
  • Private DNS zone app.internal is linked to the VNet with auto-registration enabled
  • Affected VMs have accelerated networking enabled
  • Subnet NSG allows unrestricted outbound traffic
  • 10.2.0.5 was provisioned two weeks ago to increase resilience

Output collected from an affected VM during a failure:

C:\> nslookup svc-orders.app.internal
Server: 10.2.0.5
Address: 10.2.0.5

DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to 10.2.0.5 timed-out

The VM then retries and resolution works using 10.2.0.4.

What is the root cause of the intermittent behavior?

A) The NSG is selectively blocking UDP queries on port 53 to 10.2.0.5 during traffic peaks.

B) Server 10.2.0.5 is not properly configured as a forwarder to 168.63.129.16, causing failures only for queries that reach it.

C) Auto-registration enabled on the private zone is generating record conflicts during peak hours, making records temporarily inconsistent.

D) Accelerated networking on affected VMs alters DNS client load balancing behavior, causing 10.2.0.5 to receive more requests than it can process.


Scenario 2: Diagnostic Sequence

A VM in vnet-hub cannot resolve on-premises host names. The environment uses:

  • A DNS forwarder at 10.0.0.10 (Linux VM with Bind9) configured in the VNet
  • Peering between vnet-hub and vnet-spoke
  • A private DNS zone azure.corp linked to vnet-hub
  • On-premises DNS servers at 192.168.1.10 and 192.168.1.11, accessible via ExpressRoute

The responsible engineer has the following investigation steps available:

  1. Verify that 10.0.0.10 has connectivity to 192.168.1.10 on port 53
  2. Verify that VNet vnet-hub is configured to use 10.0.0.10 as custom DNS
  3. Run nslookup <on-premises-host> 10.0.0.10 from the affected VM
  4. Verify that Bind9 on 10.0.0.10 has a forwarding zone configured for the on-premises domain
  5. Verify that the affected VM can ping 10.0.0.10

What is the correct investigation sequence?

A) 5, 2, 3, 4, 1

B) 2, 5, 3, 4, 1

C) 3, 1, 4, 2, 5

D) 5, 3, 2, 4, 1


Scenario 3: Root Cause

An architect receives a ticket: VMs in vnet-spoke cannot resolve records from private DNS zone corp.internal. She accesses the portal and documents the current state:

Resource                         Configuration
Private DNS zone                 corp.internal
Link with vnet-hub               Present, auto-registration enabled
Link with vnet-spoke             Present, auto-registration disabled
Peering vnet-hub / vnet-spoke    Active, bidirectional
Custom DNS in vnet-spoke         None (using Azure default)
Custom DNS in vnet-hub           None (using Azure default)

The architect tests from a VM in vnet-hub and resolution works perfectly. The peering was configured three days ago with no changes since.

# Test from VM in vnet-spoke
$ nslookup svc-auth.corp.internal
Server: 168.63.129.16
Address: 168.63.129.16#53

** server can't find svc-auth.corp.internal: NXDOMAIN

What is the root cause?

A) Bidirectional peering between vnet-hub and vnet-spoke does not automatically propagate access to private DNS zones linked to the partner VNet.

B) Auto-registration disabled on the link with vnet-spoke prevents any zone records from being resolved from that VNet.

C) Resolver 168.63.129.16 returns NXDOMAIN for VNets that don't have custom DNS configured, requiring a forwarder to be added.

D) The link with vnet-spoke was created after the peering, and Azure requires private DNS zone links to be created before peering to work correctly.


Scenario 4: Action Decision

The cause has been identified: VNet vnet-prod is configured with a custom DNS server (10.1.0.20) that does not have a conditional forwarding rule for private zone payments.internal. As a result, all VMs in vnet-prod resolve external names correctly but fail to resolve any records in payments.internal.

The current context is:

  • The environment is in production with dozens of active VMs
  • Changing the VNet's custom DNS forces a DHCP lease renewal on all VMs
  • The scheduled maintenance window occurs in 6 hours
  • Server 10.1.0.20 is managed by the security team, which confirms immediate access to apply configurations
  • An open incident classifies the impact as moderate, without total service interruption

What is the correct action to take at this moment?

A) Remove the custom DNS from the VNet immediately, reverting to Azure's default resolver, to restore private zone resolution without waiting for the maintenance window.

B) Request the security team to add conditional forwarding for payments.internal pointing to 168.63.129.16 on server 10.1.0.20, without changing the VNet configuration.

C) Create a second link of zone payments.internal with vnet-prod with auto-registration enabled, fixing resolution without needing to change the DNS server.

D) Wait for the maintenance window and, during it, replace the custom DNS server with a new VM with the correct forwarding configuration.


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The decisive clue in the scenario is that 10.2.0.5 was recently provisioned and that failures occur specifically when queries reach this server. The DNS client tries 10.2.0.5, waits for the 2-second timeout, and on the next attempt uses 10.2.0.4, where resolution works normally.

The intermittent behavior is not randomness: it's the predictable result of two DNS servers with different configurations, where only one was correctly configured as a forwarder to 168.63.129.16. Since the client alternates between the two servers listed in the VNet configuration, failures appear in approximately half of the queries, more visible during peak hours due to volume.

The irrelevant detail is accelerated networking. This feature affects VM network performance at the virtualization layer but does not change how the client selects a DNS server.

The most dangerous distractor is option A: attributing the problem to the NSG can lead the engineer to create unnecessary permissive rules, consuming time without solving the real cause. Checking the forwarding configuration on the newly added server is the correct first step.
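Before inspecting the forwarder settings, it can help to query each DNS server separately so that the client's fallback does not mask the failing one. The following is a minimal sketch using only the Python standard library; the function names `build_query` and `probe` are illustrative, and the 2-second timeout simply mirrors the nslookup output above.

```python
import socket
import struct


def build_query(name: str, txid: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a minimal DNS query packet (one question, recursion desired)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A record) and QCLASS (1 = IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Send one UDP query to a specific server and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    if len(reply) < 4:
        return "malformed reply"
    # RCODE is the low 4 bits of the fourth header byte
    # (0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN)
    return f"rcode={reply[3] & 0x0F}"
```

Run from an affected VM, `probe("10.2.0.4", "svc-orders.app.internal")` and `probe("10.2.0.5", "svc-orders.app.internal")` can be compared side by side. If only the second call reports a timeout or a failure code, that points at the configuration of 10.2.0.5 rather than at the network path shared by both servers.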


Answer Key: Scenario 2

Answer: B

The correct sequence is: 2, 5, 3, 4, 1.

The diagnostic reasoning moves from broadest to most specific, eliminating layers progressively:

  • Step 2 first: confirm that the VNet is actually using 10.0.0.10 as custom DNS. If the VNet still uses Azure's default resolver, no forwarder will work and the remaining steps are irrelevant.
  • Step 5 next: verify basic connectivity to the forwarder. If the VM cannot reach 10.0.0.10, the problem is routing, not DNS.
  • Step 3: test resolution directly against the forwarder to isolate whether the problem is in the client/forwarder chain or forwarder/upstream.
  • Step 4: verify that Bind9 has the correct forwarding zone for the on-premises domain. This is the most likely cause given the scenario.
  • Step 1 last: confirm that the forwarder can reach on-premises servers on port 53. This step only makes sense after confirming the previous chain is intact.

Option A seems logical by starting with connectivity, but testing ping before confirming that custom DNS is configured in the VNet inverts the diagnostic priority.
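The ordering 2, 5, 3, 4, 1 can be read as a chain of conditions in which each check is only meaningful once the earlier ones have passed. The sketch below encodes that chain; it assumes the engineer has already recorded a pass/fail result for each numbered step, and the function name and conclusion strings are illustrative rather than part of any tool.

```python
from typing import Dict

VNET_NOT_USING_FORWARDER = "VNet is not using 10.0.0.10 as custom DNS"
FORWARDER_UNREACHABLE = "VM cannot reach 10.0.0.10 (routing, not DNS)"
CLIENT_SIDE = "Forwarder answers direct queries; recheck the client's DNS settings"
MISSING_FORWARD_ZONE = "Bind9 has no forwarding zone for the on-premises domain"
UPSTREAM_UNREACHABLE = "Forwarder cannot reach on-premises DNS on port 53"
INCONCLUSIVE = "All checks passed; widen the investigation"


def diagnose(results: Dict[int, bool]) -> str:
    """Interpret step results in the order 2, 5, 3, 4, 1.

    Keys are the step numbers from the scenario; values are True when
    the check passed.
    """
    if not results[2]:
        return VNET_NOT_USING_FORWARDER
    if not results[5]:
        return FORWARDER_UNREACHABLE
    if results[3]:
        # A direct query to the forwarder works, so the forwarder chain is fine.
        return CLIENT_SIDE
    if not results[4]:
        return MISSING_FORWARD_ZONE
    if not results[1]:
        return UPSTREAM_UNREACHABLE
    return INCONCLUSIVE
```

For example, `diagnose({2: True, 5: True, 3: False, 4: False, 1: True})` returns the missing forwarding zone conclusion, which matches the most likely cause named in step 4.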


Answer Key: Scenario 3

Answer: A

The private DNS zone corp.internal link with vnet-spoke exists and is active. VNet Peering is also active and bidirectional. Still, the VM in vnet-spoke queries 168.63.129.16 and receives NXDOMAIN.

The cause is that the private DNS zone link is the mechanism by which resolver 168.63.129.16 knows it should respond to queries for that zone from a specific VNet. The fact that vnet-hub and vnet-spoke are peered does not change this logic: each VNet needs its own link with the zone for the resolver to answer its queries.

The detail that peering was configured three days ago is irrelevant and serves as a temporal distractor, suggesting the problem may have emerged over time. In practice, the behavior has not changed, because the cause is unrelated to how long the peering has existed.

Option B is the most common distractor: it confuses the function of auto-registration (which controls only whether VMs in the VNet are registered in the zone) with resolution capability (which depends only on a working link between the zone and the VNet that issues the query). Disabling auto-registration on the vnet-spoke link therefore cannot, by itself, prevent VMs in that VNet from resolving zone records.
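The distinction between resolution, auto-registration, and peering can be captured in a small model. This is a simplified sketch, not Azure's implementation: the `ZoneLink` type, the function names, and the VNet names `vnet-a`, `vnet-b`, and `vnet-c` are hypothetical and deliberately differ from the scenario.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ZoneLink:
    """A link between one private DNS zone and one VNet (hypothetical model)."""
    vnet: str
    registration_enabled: bool


def can_resolve(vnet: str, links: List[ZoneLink]) -> bool:
    # Resolution depends only on a link existing for the querying VNet.
    # Peering is intentionally not an input to this function.
    return any(link.vnet == vnet for link in links)


def can_auto_register(vnet: str, links: List[ZoneLink]) -> bool:
    # Registration additionally requires the flag on that VNet's link.
    return any(link.vnet == vnet and link.registration_enabled for link in links)


# Hypothetical topology: vnet-c is peered with vnet-a but has no link of its own.
links = [
    ZoneLink("vnet-a", registration_enabled=True),
    ZoneLink("vnet-b", registration_enabled=False),
]
```

In this model `vnet-b` can resolve the zone even though it cannot register records in it, while `vnet-c` cannot resolve the zone regardless of its peering with `vnet-a`.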


Answer Key: Scenario 4

Answer: B

The cause has already been identified: custom DNS server 10.1.0.20 does not have conditional forwarding for payments.internal. Given the constraints, the correct solution is to fix the server configuration without changing the VNet configuration.

Adding conditional forwarding directly on server 10.1.0.20 solves the problem without impact on production VMs: no VNet DHCP changes are needed, no lease renewal is forced, and the security team confirmed immediate availability to apply the change.
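To illustrate why a single rule on the server is enough, the sketch below models how a DNS server with conditional forwarders might choose an upstream: the most specific matching suffix wins, and everything else goes to the default. This is an assumption-laden model, not the behavior of any particular DNS product; `pick_forwarder`, the record name `api.payments.internal`, and the default upstream `203.0.113.53` are hypothetical.

```python
from typing import Dict

AZURE_RESOLVER = "168.63.129.16"
DEFAULT_UPSTREAM = "203.0.113.53"  # hypothetical upstream for external names


def pick_forwarder(qname: str, rules: Dict[str, str], default: str) -> str:
    """Return the upstream for qname using longest-suffix matching."""
    labels = qname.rstrip(".").lower().split(".")
    # Walk from the full name toward the TLD so the most specific rule wins.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in rules:
            return rules[suffix]
    return default


# Before the fix: no rule for the private zone, so queries go to the default.
rules_before: Dict[str, str] = {}
# After the fix: one conditional forwarder pointing at the Azure resolver.
rules_after = {"payments.internal": AZURE_RESOLVER}
```

With `rules_before`, a query for `api.payments.internal` is sent to the default upstream, which has no knowledge of the private zone. With `rules_after`, the same query goes to 168.63.129.16, while external names keep using the default upstream, so nothing changes for the VMs or the VNet.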

Option A is technically valid in another context, but here it's destructive: removing custom DNS not only causes DHCP renewal on all VMs, but also undoes any other resolution configuration that server 10.1.0.20 might be providing correctly for other domains.

Option C represents a frequent conceptual mistake: creating a second zone link with the VNet doesn't solve the problem, because the problem isn't the link, it's the forwarding on the custom DNS server. With custom DNS configured, VMs don't query 168.63.129.16 directly; they query 10.1.0.20.

Option D ignores that the security team has immediate access and that the fix can be made without a maintenance window, unnecessarily postponing resolution of an active incident.


Troubleshooting Tree: Design name resolution inside a VNet

Legend:

  • Dark blue: initial symptom, diagnostic entry point
  • Blue: diagnostic question, state verification or condition
  • Red: identified cause
  • Green: recommended action or resolution
  • Orange: validation or intermediate verification before concluding diagnosis

To use this tree when facing a real problem, start at the root node that describes the symptom and follow the branches, answering each question based on what is observable in the environment. Each answer eliminates a set of hypotheses and directs you to the next, more specific check. Do not advance to a green action without having followed the complete diagnostic path: acting on the wrong cause in a DNS environment can introduce cascading failures that obscure the original cause.