Troubleshooting Lab: Plan private endpoints

Diagnostic Scenarios

Scenario 1 — Root Cause

A team reports that an application hosted on a VM in spoke-vnet-prod cannot connect to an Azure Key Vault. The Key Vault private endpoint was provisioned in hub-vnet three days ago and worked normally during initial testing performed from VMs in the hub itself.

The environment has the following characteristics:

Peering between hub-vnet and spoke-vnet-prod is active, bidirectional and with "Allow forwarded traffic" enabled
The VM in the spoke uses custom DNS server 10.1.0.4, hosted in the hub
The custom DNS server is configured with forwarder pointing to 168.63.129.16
The NSG of the private endpoint subnet has no blocking rules for port 443
The application's service account has read permission on Key Vault via RBAC
The private DNS zone privatelink.vaultcore.azure.net is linked only to hub-vnet

The team executed the following diagnostic from the VM in the spoke:

curl -v https://contoso-kv.vault.azure.net/secrets/dbpassword
* Could not resolve host: contoso-kv.vault.azure.net
* Closing connection 0
curl: (6) Could not resolve host: contoso-kv.vault.azure.net

Then, they tested direct connectivity by IP:

curl -v https://10.1.2.5/secrets/dbpassword
* Connected to 10.1.2.5 (10.1.2.5) port 443 (#0)
* SSL certificate subject: CN=contoso-kv.vault.azure.net
* SSL certificate verify ok.
< HTTP/2 200

What is the root cause of the observed failure?

A) The peering between hub and spoke is not propagating the private endpoint routes, preventing traffic from reaching private IP 10.1.2.5
B) The private DNS zone is not linked to hub-vnet, causing the custom DNS server to be unable to resolve the FQDN
C) The private DNS zone is linked only to hub-vnet, and since the custom DNS server forwards to Azure DNS in the hub context, the spoke doesn't receive private resolution due to lack of zone link with spoke-vnet-prod
D) The NSG of the VM subnet in the spoke is blocking DNS queries on port 53 toward server 10.1.0.4

Scenario 2 — Action Decision

The platform team identified that a private endpoint provisioned in production for an Azure Service Bus namespace is inaccessible from all VMs in the VNet. The diagnosis confirmed that the cause is the absence of any record in the private DNS zone privatelink.servicebus.windows.net. The private endpoint was created manually via ARM template by an engineer who did not enable automatic private DNS integration during provisioning.

The environment has the following restrictions:

The private endpoint is in production and receiving traffic from three critical applications that use the hardcoded private IP temporarily as a workaround
The team has permission to modify private DNS zones and create links, but does not have permission to recreate or delete private endpoints in the production environment
A maintenance window is available in 48 hours
The private DNS zone privatelink.servicebus.windows.net already exists and is linked to the correct VNet

What is the correct action to take at this moment?

A) Recreate the private endpoint with DNS integration option enabled, replacing the existing endpoint during the maintenance window
B) Manually create an A record in the existing private DNS zone pointing to the private endpoint's private IP, without waiting for the maintenance window
C) Wait for the maintenance window and then reconfigure the ARM template to enable DNS integration, reapplying the complete template
D) Create a second identical private endpoint with DNS integration enabled and redirect application traffic to the new endpoint during the maintenance window

Scenario 3 — Root Cause

A security team applied a more restrictive NSG policy to the pe-subnet subnet of a VNet, with the goal of controlling inbound and outbound traffic from private endpoints hosted there. After the change, application teams reported that private endpoints remain normally accessible, without any blocking, even for connections that the new rules should block.

The environment:

pe-subnet subnet created six months ago with storage and SQL private endpoints
nsg-pe-prod NSG newly created, with explicit deny rules for traffic coming from unauthorized subnets
NSG correctly associated to pe-subnet via portal
NSG logs were enabled after the change and show zero hits on deny rules
The network team confirmed there is no Azure Firewall or NVA intercepting traffic before the NSG

The engineer responsible for the change assumes that the NSG is defective or was overridden by an Azure Policy.

What is the root cause of the observed behavior?

A) An Azure Policy with Audit effect is preventing the NSG deny rules from being evaluated correctly
B) The subnet's PrivateEndpointNetworkPolicies property is disabled, causing NSGs not to be applied to private endpoint traffic
C) NSG logs have up to 30 minutes latency, and the deny rule hits haven't appeared in the Log Analytics workspace yet
D) The NSG was associated to the subnet, but the rules were created with priority above 65000, colliding with Azure default rules

Scenario 4 — Diagnostic Sequence

A production VM starts failing to connect to a storage account via private endpoint after a network change performed by the platform team. The symptom is: timeout on port 443 when trying to access https://storaccprod.blob.core.windows.net.

The following investigation steps are available, but out of order:

1. Verify if the private DNS zone privatelink.blob.core.windows.net returns the correct private IP for the storage account FQDN when queried from the VM's VNet
2. Confirm if the IP returned by DNS corresponds to the private IP configured in the private endpoint's NIC in the Azure portal
3. Execute nslookup storaccprod.blob.core.windows.net from the affected VM to verify which IP address is being resolved
4. Try direct connection via curl https://[PRIVATE_IP]:443 to isolate if the problem is DNS or network connectivity
5. Verify if there was a recent change in VNet association to the private DNS zone or in the VNet's custom DNS server configuration

What is the correct investigation sequence?

A) 3 → 4 → 1 → 2 → 5
B) 1 → 3 → 2 → 4 → 5
C) 3 → 1 → 5 → 2 → 4
D) 5 → 3 → 1 → 2 → 4

Answer Key and Explanations

Answer Key — Scenario 1

Answer: C

The decisive clue is in the two tests performed by the team: access by FQDN fails with "could not resolve host", but access by direct private IP works perfectly, including SSL certificate validation. This eliminates any hypothesis of routing or NSG problems, because private IP 10.1.2.5 is reachable. The problem is exclusively DNS resolution.
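
The same elimination can be written down as a small lookup. The sketch below is illustrative only: it maps curl exit codes (6 for "could not resolve host", 7 for a failed connection, 28 for a timeout) to the layer that most likely failed, then combines the by-name test and the by-IP test into one verdict. The function name and the verdict labels are assumptions made for this example, not part of any tool.

```python
# Map curl exit codes to the layer that most likely failed.
# The exit codes are curl's documented values; the labels are illustrative.
CURL_EXIT_LAYER = {
    0: "ok",
    6: "dns",       # could not resolve host
    7: "network",   # failed to connect to host
    28: "network",  # operation timed out
}


def failing_layer(fqdn_exit: int, ip_exit: int) -> str:
    """Combine a by-name curl test and a by-IP curl test into one verdict."""
    by_name = CURL_EXIT_LAYER.get(fqdn_exit, "other")
    by_ip = CURL_EXIT_LAYER.get(ip_exit, "other")
    if by_ip == "network":
        # The private IP itself is unreachable: routing, NSG, or firewall.
        return "network"
    if by_name == "dns" and by_ip == "ok":
        # The IP answers but the name does not resolve: resolution only.
        return "dns only"
    if by_name == "ok" and by_ip == "ok":
        return "healthy"
    return "inconclusive"


# Scenario 1: curl by FQDN exits with 6, curl by private IP succeeds.
print(failing_layer(6, 0))
```

With the two results from Scenario 1 the verdict is "dns only", which is the reason routing and NSG hypotheses can be dropped before any further investigation.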

The correct reasoning begins with the resolution chain: the VM in the spoke queries 10.1.0.4 (the custom DNS server in the hub), which forwards to 168.63.129.16 (Azure DNS). Azure DNS returns the private record from zone privatelink.vaultcore.azure.net only when the query originates from a VNet that is linked to the zone. Queries forwarded by 10.1.0.4 reach Azure DNS in the context of hub-vnet, which is linked, and that is why the initial tests from the hub succeeded. spoke-vnet-prod has no link of its own, so the spoke depends entirely on the forwarding chain being configured completely. The hub link alone might seem sufficient, but linking the zone to spoke-vnet-prod is also necessary for VMs in that VNet to obtain private resolution directly.

Alternative A is the most dangerous distractor: the team could waste time investigating routes when the IP test itself proves that routing works. Alternative B is factually wrong because the zone is linked to the hub. Alternative D has no support in the collected data; the DNS query fails with "could not resolve", not with connection timeout.

The risk of acting based on distractor A would be opening an unnecessary network ticket, delaying the real resolution by hours.

Answer Key — Scenario 2

Answer: B

The cause is already identified in the statement: absence of DNS record. The correct private zone already exists and is linked to the VNet. The private endpoint is functional and receiving traffic (via hardcoded IP). The critical restriction is that the team does not have permission to recreate or delete the endpoint in production.

Manually creating an A record in the existing private DNS zone is the correct action: it is technically valid and requires no modification to the private endpoint. It can be executed immediately, without a maintenance window, because it causes no interruption, and it lets the applications drop their dependency on the hardcoded IP.
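
As a rough illustration of what that record contains, the sketch below derives the relative record name from the public FQDN and the zone name. The namespace name contoso-sb and the IP 10.2.0.5 are hypothetical placeholders, since the scenario does not state them, and the helper a_record_for is invented for this example. The suffix rule used here fits Service Bus; it is not universal (Key Vault, for instance, is reached at vault.azure.net while its zone is privatelink.vaultcore.azure.net).

```python
def a_record_for(public_fqdn: str, private_zone: str, private_ip: str) -> dict:
    """Describe the A record a private DNS zone needs for one private endpoint."""
    prefix = "privatelink."
    if not private_zone.startswith(prefix):
        raise ValueError("expected a privatelink.* zone name")
    # Assumption: the public suffix is the zone name without "privatelink.".
    public_suffix = "." + private_zone[len(prefix):]
    if not public_fqdn.endswith(public_suffix):
        raise ValueError("FQDN does not match the zone's public suffix")
    return {
        "zone": private_zone,
        "name": public_fqdn[: -len(public_suffix)],  # relative record name
        "type": "A",
        "ipv4": private_ip,
    }


record = a_record_for(
    "contoso-sb.servicebus.windows.net",  # hypothetical namespace FQDN
    "privatelink.servicebus.windows.net",
    "10.2.0.5",                           # hypothetical private endpoint IP
)
print(record["name"], record["ipv4"])
```

The IP to use is the one shown on the private endpoint's NIC, which is the same address the applications currently hardcode.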

Alternative A explicitly violates the permissions restriction stated in the scenario. Alternative C waits for the window and recreates the endpoint via ARM template, which would also require permission to delete the existing endpoint. Alternative D creates a second endpoint, which again requires permission to create endpoints in production and introduces unnecessary complexity when the existing endpoint already works.

The most dangerous distractor is A: it seems like the "clean" and technically correct solution, but ignores the stated permissions restriction, which in production would cause a blocked attempt and potential incident.

Answer Key — Scenario 3

Answer: B

The statement contains a deliberately irrelevant piece of information: the engineer's assumption that the NSG is defective or overridden by Azure Policy. This hypothesis is plausible enough to distract, but has no support in the collected data. The NSG logs showing zero hits on deny rules don't indicate a defective NSG; they indicate that traffic never got evaluated by the rules.

The real cause is Azure's default behavior for private endpoints: a subnet's PrivateEndpointNetworkPolicies property is disabled by default, which means NSGs and UDRs associated to the subnet are not applied to private endpoint NIC traffic. The NSG was associated correctly, but the subnet is not configured to honor network policies for private endpoints.
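
A minimal model of that behavior, for illustration only: the Subnet class and the helper below are invented for this sketch, and the set of property values treated as "NSG enforcing" is an assumption that should be checked against the current Azure documentation.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values of PrivateEndpointNetworkPolicies under which NSG rules
# are evaluated for private endpoint traffic.
NSG_ENFORCING = {"Enabled", "NetworkSecurityGroupEnabled"}


@dataclass
class Subnet:
    name: str
    private_endpoint_network_policies: str = "Disabled"  # default, per the text
    nsg: Optional[str] = None


def nsg_filters_private_endpoints(subnet: Subnet) -> bool:
    """True only if an NSG is associated AND the subnet honors network policies."""
    return (
        subnet.nsg is not None
        and subnet.private_endpoint_network_policies in NSG_ENFORCING
    )


pe_subnet = Subnet(name="pe-subnet", nsg="nsg-pe-prod")
print(nsg_filters_private_endpoints(pe_subnet))  # associated, but not enforced

pe_subnet.private_endpoint_network_policies = "Enabled"
print(nsg_filters_private_endpoints(pe_subnet))  # rules are now evaluated
```

The first call models the situation in the scenario: the association is correct, yet the deny rules register zero hits because private endpoint traffic is never evaluated against them.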

Alternative A is the most tempting distractor: an Azure Policy with Audit effect only reports non-compliance; it doesn't block or interfere with NSG evaluation. Alternative C describes a real Log Analytics phenomenon, but log latency cannot explain why the connections themselves keep succeeding when the rules should block them. Alternative D rests on a misconception about NSG priorities: custom rules are limited to priorities 100 to 4096, and priorities 65000 and above are reserved for Azure's default rules, which cannot be deleted. In any case, a priority conflict would not explain the total absence of evaluation.

Acting based on distractor A would lead to a long and fruitless Azure Policy investigation, while the real problem is solved with a single change to the subnet property.

Answer Key — Scenario 4

Answer: A

The correct sequence is: 3 → 4 → 1 → 2 → 5

The correct diagnostic reasoning always starts from the symptom closest to the user and advances toward the cause, discarding layers progressively:

Step 3 (nslookup): immediately determines whether the problem is DNS or connectivity. If the returned IP is public, the problem is DNS. If it is private and the timeout persists, the problem is network.
Step 4 (curl by IP): isolates the network layer. If the private IP responds, the network is working and the problem is exclusively resolution.
Step 1: investigates the private DNS zone to understand what should be returned.
Step 2: compares the zone record with the actual private endpoint NIC IP, checking consistency.
Step 5: investigates the recent change as the likely root cause of any discrepancy found.
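
The branching in the steps above can be sketched as a single function. This is a simplified illustration, not a diagnostic tool: the function name, the verdict strings, and the sample public address 52.239.0.1 are all assumptions made for the example, and the inputs are expected to come from the nslookup, the curl-by-IP test, and the private endpoint NIC.

```python
import ipaddress
from typing import Optional


def triage(resolved_ip: Optional[str], endpoint_ip: str, ip_connect_ok: bool) -> str:
    """Turn the observations from steps 3, 4 and 2 into the next thing to check."""
    if resolved_ip is None:
        return "dns: name does not resolve"
    if not ipaddress.ip_address(resolved_ip).is_private:
        # Step 3 returned a public address: the private zone is not being used.
        return "dns: public address returned"
    if resolved_ip != endpoint_ip:
        # Step 2: the zone record disagrees with the private endpoint NIC.
        return "dns: record does not match endpoint"
    if not ip_connect_ok:
        # Step 4: the name is right, but the private IP is unreachable.
        return "network: private IP unreachable"
    return "healthy"


# Hypothetical observations: nslookup returns a public address.
print(triage("52.239.0.1", "10.1.2.5", ip_connect_ok=True))
```

Any verdict that starts with "dns" leads on to steps 1 and 5 (zone contents, VNet links, custom DNS configuration), while a "network" verdict points at routing, NSGs, or firewalls instead.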

Alternative B starts by investigating the DNS zone before knowing whether the problem is DNS at all, which is inefficient. Alternative C jumps to root cause validation (step 5) before the symptom has been isolated with the direct IP test. Alternative D starts with the recent change (step 5), which is a hypothesis, not an observed fact; starting from a hypothesis instead of the symptom is the most common and most costly diagnostic error in production environments.

Troubleshooting Tree: Plan private endpoints

Color Legend:

Color	Node Type
Dark blue	Initial symptom (entry point)
Blue	Diagnostic question
Red	Identified cause
Green	Recommended action or resolution
Orange	Intermediate validation or verification

When facing a real failure, always start from the root node in dark blue and answer each diagnostic question based on what was observed or measured, never based on assumption. The first bifurcation separates DNS problems from network connectivity problems, which eliminates half of the hypotheses with a single nslookup command. From there, each question node progressively reduces the space of possible causes until reaching the corresponding red node and then the green action that solves the problem. Orange nodes indicate that the action was applied and the result should be verified before closing the diagnosis.
