Troubleshooting Lab: Configure DNS settings for a VNet
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that VMs in a specific VNet cannot resolve names of internal resources from the private DNS zone apps.internal, but they resolve public names like microsoft.com and Azure endpoints like *.blob.core.windows.net normally.
The environment is configured as follows:
VNet: vnet-prod-eastus
DNS Servers: 10.1.0.4 (custom DNS server, Windows Server 2022)
Private DNS zone: apps.internal
Link: vnet-prod-eastus | Auto-registration: enabled | Status: Provisioned
Custom DNS server (10.1.0.4):
Forwarders configured: 8.8.8.8, 8.8.4.4
Local zone: (none)
The team reports that the custom DNS server was deployed three weeks ago and is healthy. The private zone apps.internal was created two days ago. The link between the zone and the VNet shows Provisioned status in the portal. A connectivity test with ping 10.1.0.4 from the VMs receives replies normally.
# Executed from vm-prod-01
nslookup api.apps.internal
Server: 10.1.0.4
Address: 10.1.0.4
*** 10.1.0.4 can't find api.apps.internal: Non-existent domain
What is the root cause of the failure in resolving apps.internal?
A) The virtual network link was created with auto-registration enabled, which prevents resolution of manually created records in the private zone.
B) The custom DNS server is not forwarding queries to 168.63.129.16, so queries for apps.internal never reach the Azure resolver.
C) The private DNS zone apps.internal has not yet completed global replication after its creation two days ago.
D) Auto-registration is enabled on the link, but VMs were not rebooted after linking, so no A records were created in the zone.
Scenario 2: Action Decision
The network team identified that VMs in vnet-hub-westus fail to resolve names from a newly linked private DNS zone. The cause has been confirmed: the VNet's custom DNS server forwards all queries to external public servers, with no conditional forwarding rule that sends the private zone to 168.63.129.16.
The operational context is as follows:
- The custom DNS server is shared by four production VNets
- There is a maintenance window available in 48 hours
- The security team has blocked configuration changes on production servers outside the maintenance window
- The affected VMs are for development and are not in production
- The custom DNS server is outside the scope of immediate rollback
What is the correct action to take at this moment?
A) Immediately change the custom DNS server forwarders to include 168.63.129.16 as a destination for apps.internal, as it is a low-risk change.
B) Remove the custom DNS server from the development VNet configuration and replace it with Azure default DNS until the maintenance window.
C) Create manual A records in the private DNS zone for the necessary endpoints as a temporary solution while waiting for the maintenance window to fix the forwarder.
D) Unlink the private DNS zone from the affected VNet and recreate the link after the maintenance window, when the forwarder is fixed.
Scenario 3: Root Cause
A company operates a hybrid environment with centralized on-premises DNS resolution. After a partial migration to Azure, VMs in VNet vnet-spoke-01 started intermittently failing to resolve hostnames of other Azure resources, including names like vm-backend.internal.cloudapp.net.
The current configuration is:
VNet: vnet-spoke-01
DNS Servers:
- 10.0.0.10 (primary on-premises DNS, via ExpressRoute)
- 10.0.0.11 (secondary on-premises DNS, via ExpressRoute)
On-premises DNS server:
Forwarders: 168.63.129.16
Local zones: corp.local, internal.corp.com
ExpressRoute Circuit: Status = Provisioned, BGP = Connected
The failures are intermittent and occur mainly during the period between 6 PM and 10 PM. The network team confirms that the ExpressRoute circuit shows elevated latency during this interval, reaching 380ms at peaks. Outside this timeframe, DNS resolution works normally. There are no private DNS zones configured for internal.cloudapp.net.
# Log collected during failure (8:14 PM)
Event: DNS query timeout
Query: vm-backend.internal.cloudapp.net
Resolver: 10.0.0.10
Elapsed: 5001ms
Result: SERVFAIL
What is the root cause of the intermittent resolution failures?
A) The on-premises DNS server has no authority over internal.cloudapp.net and returns SERVFAIL because it cannot find the zone locally.
B) The 168.63.129.16 forwarder configured on-premises is unavailable during peak hours, causing timeout on forwarded queries.
C) The elevated latency on the ExpressRoute circuit during peak periods causes timeout on DNS queries that traverse the on-premises path before reaching the Azure resolver.
D) Azure does not allow resolution of internal.cloudapp.net through external DNS servers, requiring that the query originates directly from the VNet resolver.
Scenario 4: Diagnostic Sequence
A newly provisioned VM in vnet-dev-brazilsouth cannot resolve any names, either public or private. The team needs to diagnose the problem and performs the following investigation steps, presented here out of order:
[P] Verify if the VM's NIC received a valid IP address via DHCP
[Q] Execute nslookup microsoft.com from the VM and observe the returned DNS server
[R] Check the DNS Servers configuration in the VNet in the Azure portal
[S] Test connectivity with ping 168.63.129.16 from the VM
[T] Confirm if the custom DNS server listed in the VNet is accessible and responding
Which investigation sequence follows the correct logic of progressive diagnosis?
A) R, Q, S, T, P
B) P, R, Q, S, T
C) Q, R, P, T, S
D) S, P, R, Q, T
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The central symptom is the failure to resolve apps.internal even with the private zone link active and the custom DNS server accessible. The decisive clue is in the DNS server forwarders configuration: they point to 8.8.8.8 and 8.8.4.4, Google's public DNS servers that have no knowledge about Azure private DNS zones.
When a VM sends a query for api.apps.internal, it reaches the custom server 10.1.0.4. Since this server doesn't have the zone locally and its forwarders are public, the query goes to Google, which returns NXDOMAIN. The Azure resolver at 168.63.129.16 is never consulted, and therefore the link with the private zone is completely irrelevant in this flow.
The information about ping working for 10.1.0.4 is deliberately irrelevant: it only confirms that the server is accessible at the network layer, not that it forwards DNS correctly. The Provisioned status of the link is also irrelevant for this diagnosis.
Alternative D represents the most common reasoning error: confusing auto-registration (record creation) with name resolution. Even without records created via auto-registration, manually added records in the zone should be resolvable, which doesn't happen here. The most dangerous distractor is A: acting based on it would lead to disabling auto-registration without fixing anything, keeping the failure intact.
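The forwarding path described in this answer can be sketched in Python. This is a minimal model, not Windows DNS Server configuration: the `DnsServer` class, its fields, and the helper names are hypothetical, and the conditional forwarder shown as a fix is only one possible correction.

```python
from dataclasses import dataclass, field

AZURE_RESOLVER = "168.63.129.16"  # Azure-provided DNS virtual IP


@dataclass
class DnsServer:
    """Hypothetical model of a custom DNS server's forwarding configuration."""
    default_forwarders: list[str]
    # Maps a zone suffix to the forwarder used for names in that zone.
    conditional_forwarders: dict[str, str] = field(default_factory=dict)

    def upstream_for(self, name: str) -> list[str]:
        """Return the upstream servers that would receive a query for `name`."""
        for zone, target in self.conditional_forwarders.items():
            if name == zone or name.endswith("." + zone):
                return [target]
        return self.default_forwarders


def reaches_azure_resolver(server: DnsServer, name: str) -> bool:
    """True when a query for `name` is forwarded to the Azure resolver."""
    return AZURE_RESOLVER in server.upstream_for(name)


# Configuration from Scenario 1: only public forwarders.
broken = DnsServer(default_forwarders=["8.8.8.8", "8.8.4.4"])

# Assumed fix: a conditional forwarder for the private zone.
fixed = DnsServer(
    default_forwarders=["8.8.8.8", "8.8.4.4"],
    conditional_forwarders={"apps.internal": AZURE_RESOLVER},
)

print(reaches_azure_resolver(broken, "api.apps.internal"))  # False
print(reaches_azure_resolver(fixed, "api.apps.internal"))   # True
print(reaches_azure_resolver(fixed, "microsoft.com"))       # False
```

In the broken configuration the private zone link is never exercised, because no query for apps.internal is ever sent to 168.63.129.16.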
Answer Key: Scenario 2
Answer: B
The cause is already identified: the custom DNS server doesn't forward to 168.63.129.16. The critical restriction is that changes to the shared production server are blocked outside the maintenance window. The affected VMs are development, not production.
The correct action is to temporarily replace the custom DNS with the Azure default DNS in the development VNet. This change is made to the VNet configuration (not to the DNS server), is within the network team's scope, does not affect the four production VNets, and restores resolution for the development environment without violating any restriction. The VMs pick up the new DNS setting once they renew their DHCP lease or are restarted.
Alternative A explicitly ignores the security block on changes to production servers outside the window. Alternative C does not work even as a workaround: records in the private zone, whether manual or auto-registered, are answered only by the Azure resolver, which the custom DNS server never queries. Alternative D is unnecessarily destructive: unlinking and recreating the link does not fix the forwarder problem and creates avoidable rework.
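The elimination reasoning can also be written as a small filter over the four alternatives. This is only a sketch: the `Action` class and the two boolean attributes are this example's own reading of the scenario, not part of any Azure tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    label: str
    changes_production_server: bool  # touches the shared custom DNS server
    restores_resolution_now: bool    # dev VMs can resolve the zone before the window


# Assumed classification of the alternatives in Scenario 2.
ACTIONS = [
    Action("A", changes_production_server=True, restores_resolution_now=True),
    Action("B", changes_production_server=False, restores_resolution_now=True),
    # C and D leave the VMs pointed at a server that never queries 168.63.129.16.
    Action("C", changes_production_server=False, restores_resolution_now=False),
    Action("D", changes_production_server=False, restores_resolution_now=False),
]


def viable(action: Action, in_maintenance_window: bool) -> bool:
    """An action is viable if it respects the change freeze and restores resolution."""
    if action.changes_production_server and not in_maintenance_window:
        return False
    return action.restores_resolution_now


print([a.label for a in ACTIONS if viable(a, in_maintenance_window=False)])  # ['B']
```

Passing `in_maintenance_window=True` also admits alternative A, which matches the plan to fix the forwarder once the window opens.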
Answer Key: Scenario 3
Answer: C
The symptom is intermittency correlated with a specific timeframe (6 PM to 10 PM), and the log shows a 5001 ms timeout. The decisive operational data point is the ExpressRoute latency, which peaks at 380 ms. A DNS query in this environment follows this path: the VM sends the query to 10.0.0.10 over ExpressRoute, the on-premises server forwards it to 168.63.129.16, and the response returns along the same path. The query therefore crosses the circuit on two round trips, so the peak latency is paid twice, on top of processing time at the on-premises server and the Azure resolver. Under congestion, queuing and retransmission of lost UDP packets add further delay, and the default DNS timeout (usually 5 seconds) is exceeded.
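A rough latency budget makes the reasoning concrete. The sketch below assumes the query crosses the circuit on two round trips (VM to on-premises server, then on-premises server to the Azure resolver); the processing time and the wait added per lost packet are hypothetical values chosen for illustration, not measurements from this environment.

```python
CLIENT_TIMEOUT_MS = 5000  # timeout implied by the log (Elapsed: 5001ms)
CIRCUIT_ROUND_TRIPS = 2   # VM <-> on-premises, on-premises <-> Azure resolver


def resolution_time_ms(circuit_rtt_ms: int,
                       processing_ms: int = 50,      # assumed server processing time
                       lost_packets: int = 0,
                       retry_wait_ms: int = 2000) -> int:  # assumed wait per lost packet
    """Estimate the time to resolve a name along the hairpinned path."""
    latency = CIRCUIT_ROUND_TRIPS * circuit_rtt_ms
    return latency + processing_ms + lost_packets * retry_wait_ms


for lost in range(4):
    total = resolution_time_ms(380, lost_packets=lost)
    status = "timeout" if total >= CLIENT_TIMEOUT_MS else "ok"
    print(f"lost_packets={lost} total={total}ms {status}")
```

Under these assumptions, latency alone consumes well under one second of the budget, so the timeouts depend on the additional delay that congestion introduces. The same model shows why removing the on-premises hop from the path for Azure names shrinks the exposure to the circuit.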
Alternative A is technically correct as an observation (the on-premises server is indeed not authoritative for internal.cloudapp.net), but doesn't explain the intermittency: if this were the cause, the failure would be constant and predictable, not correlated to time. This alternative represents the classic error of confusing a true fact with the root cause.
Alternative B would be valid if 168.63.129.16 were an external IP subject to unavailability, but it's Azure's magic resolver, available in every Azure datacenter regardless of load. The most dangerous distractor is A: acting based on it would lead to creating an unnecessary private DNS zone for internal.cloudapp.net, without solving the real latency bottleneck.
Answer Key: Scenario 4
Answer: B
The correct sequence is P, R, Q, S, T, which follows the logic of progressive diagnosis from the most fundamental layer to the most specific.
- P (check IP via DHCP) is the starting point: without a valid IP address, all other tests are useless.
- R (check DNS Servers in VNet) identifies which server the VM should be using, as provided by the VNet's DHCP.
- Q (execute nslookup and observe returned server) validates whether the VM actually received and is using the correct server.
- S (test connectivity with 168.63.129.16) confirms whether the Azure resolution plane is accessible from the VM.
- T (check if custom server is accessible and responding) is the last step because it only makes sense to test it after confirming that the VM knows which server to use and has network connectivity.
Sequence A starts with the portal (R) before validating whether the VM even has an IP address, so the portal configuration can look correct while the actual problem on the VM goes unnoticed. Sequence D starts by testing 168.63.129.16 directly, which skips fundamental steps and can lead to false conclusions about the cause.
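The progressive order can be expressed as a list of checks that stops at the first failure. The probes below are stubs backed by a hypothetical `observed` dictionary; in practice each one would be replaced by the real verification (portal lookup, nslookup, and so on).

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first one that fails."""
    for label, description, probe in checks:
        passed = probe()
        print(f"[{label}] {description}: {'ok' if passed else 'FAILED'}")
        if not passed:
            return label
    return None


# Hypothetical observations for one VM; replace with real probes.
observed = {"P": True, "R": True, "Q": True, "S": True, "T": False}

checks: List[Check] = [
    ("P", "NIC received a valid IP address via DHCP", lambda: observed["P"]),
    ("R", "DNS Servers setting on the VNet is as expected", lambda: observed["R"]),
    ("Q", "nslookup reports the expected DNS server", lambda: observed["Q"]),
    ("S", "168.63.129.16 is reachable from the VM", lambda: observed["S"]),
    ("T", "custom DNS server is accessible and responding", lambda: observed["T"]),
]

print(first_failure(checks))
```

Stopping at the first failed check keeps later results from being misread: a failure at T only means something once P through S have passed.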
Troubleshooting Tree: Configure DNS settings for a VNet
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |
To use this tree on a real problem, start at the root node and answer each diagnostic question based on what you observe in the environment, not what you assume. Follow the path indicated by your answer (yes or no) until you reach a red node with an identified cause; the adjacent green node gives the corrective action. If the symptom persists after you apply the action, return to the immediately preceding verification node and reassess whether your answer to it was correct.
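The traversal procedure can be sketched as a walk over yes/no nodes. The three-node tree below is hypothetical and far smaller than a real troubleshooting tree; its questions and actions are drawn loosely from the scenarios in this lab and are assumptions, not the tree's actual content.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Cause:
    text: str
    action: str  # the adjacent "green node"


@dataclass
class Question:
    text: str
    yes: Union["Question", Cause]
    no: Union["Question", Cause]


def walk(node: Union[Question, Cause],
         answers: Dict[str, bool]) -> Tuple[Cause, List[Tuple[str, bool]]]:
    """Follow observed answers to a cause; also return the path taken."""
    path: List[Tuple[str, bool]] = []
    while isinstance(node, Question):
        answer = answers[node.text]
        path.append((node.text, answer))
        node = node.yes if answer else node.no
    return node, path


# Hypothetical miniature tree.
tree = Question(
    "VM resolves public names?",
    yes=Question(
        "Custom DNS server forwards the private zone to 168.63.129.16?",
        yes=Cause("Zone link or record problem", "Check the virtual network link and records"),
        no=Cause("Queries never reach the Azure resolver", "Add a forwarder to 168.63.129.16"),
    ),
    no=Cause("VM cannot reach any DNS server", "Verify the NIC IP and the VNet DNS Servers setting"),
)

cause, path = walk(tree, {
    "VM resolves public names?": True,
    "Custom DNS server forwards the private zone to 168.63.129.16?": False,
})
print(cause.text, "->", cause.action)
```

The returned `path` records each question and the answer given, which is what you revisit when the symptom persists after applying the action.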