Troubleshooting Lab: Choose when to use a service endpoint
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reported that a production application stopped writing logs to an Azure Storage Account after an infrastructure change made the previous day. The application runs on VMs within the snet-app (10.0.2.0/24) subnet of the vnet-prod VNet, located in East US.
The engineer responsible for the change confirms that they enabled the Microsoft.Storage Service Endpoint on the subnet and updated the Storage Account firewall as recorded below:
Storage Account: stlogsprod001
Firewall and virtual networks:
Allow access from: Selected networks
Virtual Networks:
- vnet-prod / snet-app [STATUS: Provisioning]
Firewall (IP Rules): no rules
Exceptions: Allow trusted Microsoft services: ON
The engineer also mentions that the subnet's NSG was updated hours earlier to block outbound traffic on port 445, as part of a security policy to disable SMB. The application uses the Azure SDK to write blobs via HTTPS on port 443.
What is the root cause of the write failure?
A) The NSG is blocking port 445, which prevents communication with the Storage Account even via SDK.
B) The VNet entry in the Storage Account firewall is still in provisioning state and hasn't taken effect yet.
C) The Microsoft.Storage Service Endpoint doesn't cover blob write operations, only read operations.
D) The "Allow trusted Microsoft services" exception is overriding the VNet rule and causing routing conflicts.
Scenario 2: Action Decision
The cause of a production incident has been identified: the Microsoft.Sql Service Endpoint was enabled on the wrong subnet. The application resides in the snet-api subnet, but the endpoint was enabled in snet-mgmt. The Azure SQL Server firewall is configured to accept only snet-mgmt, which causes all application connections to be rejected with error 40615.
The environment has the following restrictions at the moment:
- It's 2 PM on Friday and there's an active change freeze until Monday
- The application is down for end users
- The security team needs to approve any changes to NSGs and Service Endpoints
- The DBA team has portal access and can modify the SQL Server firewall without security approval, as this is within their scope of permissions
- Enabling the Service Endpoint on the correct subnet would require security team approval
What is the correct action to take at this moment?
A) Request emergency approval from the security team to enable the Service Endpoint in snet-api and update the SQL Server firewall.
B) Ask the DBA team to add the public IP of VMs from snet-api as a temporary firewall rule in SQL Server, restoring access while the definitive fix awaits approval.
C) Move the application VMs to snet-mgmt temporarily, since the endpoint is enabled in that subnet.
D) Remove the VNet restriction from the SQL Server firewall and open to "All networks" temporarily, to restore access with the fewest approvals possible.
Scenario 3: Root Cause
An analyst is investigating why a VM in snet-backend (10.1.1.0/24) of the vnet-hub VNet cannot access an Azure Key Vault. When executing the command below, they observe the following:
$ curl -I https://kv-producao.vault.azure.net/
curl: (6) Could not resolve host: kv-producao.vault.azure.net
The analyst checks the VNet configuration and finds:
VNet: vnet-hub (10.1.0.0/16) -- East US
Subnet: snet-backend (10.1.1.0/24)
Service Endpoints: Microsoft.KeyVault
DNS Servers: 10.1.0.4 (internal DNS server)
Key Vault: kv-producao
Firewall:
Allow access from: Selected networks
Virtual Networks:
- vnet-hub / snet-backend [STATUS: Succeeded]
Public network access: Enabled
The analyst also notes that the VNet has peering configured with vnet-spoke, and that this spoke VNet doesn't have Service Endpoints enabled. They suspect the problem is related to peering.
What is the root cause of the observed error?
A) The Service Endpoint isn't working correctly because the VNet has active peering, which creates routing conflicts.
B) The internal DNS server in the VNet cannot resolve the Key Vault hostname, preventing the connection before traffic is even routed.
C) The Key Vault has Public network access: Enabled, which disables Service Endpoint support.
D) The Microsoft.KeyVault Service Endpoint was enabled on the subnet, but the Succeeded status in the Key Vault firewall indicates the configuration was overwritten by a network policy.
Scenario 4: Diagnostic Sequence
A team received the following report: "The production application returns error 403 when trying to access the Azure Storage Account. The problem started after a subnet reorganization performed during last night's maintenance window."
The following investigation steps are available, out of order:
1. Verify if the Microsoft.Storage Service Endpoint is enabled on the application's current subnet
2. Confirm if the application is actually running on the subnet it's supposed to be on
3. Check which VNets and subnets are authorized in the Storage Account firewall
4. Test the application's connectivity to the Storage Account using the SDK and collect the exact error
5. Verify if there was a change to the subnet's NSG that might be blocking outbound traffic on port 443
What diagnostic sequence is most appropriate for this scenario?
A) 4 → 2 → 3 → 1 → 5
B) 1 → 3 → 2 → 5 → 4
C) 5 → 1 → 4 → 3 → 2
D) 2 → 4 → 3 → 1 → 5
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The decisive clue is visible in the statement: the VNet entry status in the Storage Account firewall is Provisioning, not Succeeded. Until provisioning completes, the rule is not active and the Storage Account continues to reject connections from that subnet, even though the Service Endpoint is already enabled on the subnet side. Both configurations need to be in place and active for access to work.
The information about port 445 blocking is the scenario's distractor. The application uses HTTPS on port 443 for blob operations via the SDK, so blocking SMB is irrelevant to the symptom. A reader who focused on this information probably chose A, making the classic mistake of associating the most visible detail with the problem without checking the actual state of the network configuration.
Choosing A would be dangerous in production: the engineer would remove a legitimate security policy without solving the real problem.
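The check described above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the dictionary is sample data shaped like a storage account's network rule set, and the field names (`virtualNetworkRules`, `virtualNetworkResourceId`, `state`) and the shortened subnet identifier are assumptions made for the example.

```python
# Sample data only: field names and the shortened subnet id are assumptions
# for this sketch, modeled on the firewall record shown in Scenario 1.
network_rule_set = {
    "defaultAction": "Deny",
    "virtualNetworkRules": [
        {
            "virtualNetworkResourceId": "vnet-prod/snet-app",
            "action": "Allow",
            "state": "Provisioning",
        }
    ],
    "ipRules": [],
}


def inactive_vnet_rules(rule_set):
    """Return (subnet, state) pairs for VNet rules whose state is not Succeeded."""
    return [
        (rule["virtualNetworkResourceId"], rule["state"])
        for rule in rule_set.get("virtualNetworkRules", [])
        if rule.get("state") != "Succeeded"
    ]


for subnet, state in inactive_vnet_rules(network_rule_set):
    print(f"{subnet}: rule state is {state}, so the rule is not active yet")
```

An empty result means every VNet rule is active, and the investigation can move on to other hypotheses.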
Answer Key: Scenario 2
Answer: B
The central restriction of the scenario is the change freeze combined with the requirement for security team approval for Service Endpoint changes. The definitive fix (enabling the endpoint on the correct subnet) is blocked by process. However, the statement explicitly says that the DBA team can modify the SQL Server firewall within their own scope of permissions, without additional approval.
Adding the VMs' public IP as a temporary rule in the SQL Server firewall is the only action that restores the service without violating the environment's restrictions. Because no Microsoft.Sql Service Endpoint is enabled on snet-api, traffic from those VMs reaches the server from their public IP, so an IP rule matches it. It's a controlled workaround, within available permissions, that doesn't unnecessarily expose the database.
Alternative D is the most dangerous: opening SQL Server to "All networks" exposes the database to the public internet, violates security policies, and goes far beyond what's necessary to restore access for a single application.
Alternative C is not viable: moving VMs between subnets in production during a freeze is an infrastructure change of greater risk and impact than the original problem.
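To make the scope of the temporary rule in alternative B concrete, here is a small Python sketch of how a start/end IP rule matches a client address. The rule name and the address 203.0.113.10 are hypothetical placeholders (the address comes from a documentation-only range) standing in for the real public IP of the snet-api VMs.

```python
import ipaddress


def matching_rule(source_ip, rules):
    """Return the name of the first rule whose start-end range contains source_ip, else None."""
    addr = ipaddress.ip_address(source_ip)
    for name, start, end in rules:
        if ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end):
            return name
    return None


# Hypothetical rule: a single-address range for the application's public IP.
temporary_rules = [("tmp-snet-api-freeze", "203.0.113.10", "203.0.113.10")]

print(matching_rule("203.0.113.10", temporary_rules))  # the application's address
print(matching_rule("198.51.100.7", temporary_rules))  # any other address
```

A single-address range admits only the application's source IP, which is what separates this workaround from opening the server to all networks as in alternative D.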
Answer Key: Scenario 3
Answer: B
The error Could not resolve host occurs at the DNS layer, before any TCP connection attempt or routing via Service Endpoint. This means the problem is not in the Service Endpoint configuration or the Key Vault firewall: traffic was never routed because name resolution failed first.
The internal DNS server (10.1.0.4) is responsible for resolution in the VNet. If this server doesn't have a forwarder configured correctly for public Azure names (like *.vault.azure.net), resolution fails. Service Endpoints don't alter DNS resolution; they only influence traffic path after resolution is successful.
The analyst's suspicion about peering is the scenario's red herring. The peering with vnet-spoke has no relation to the DNS failure in snet-backend. The analyst was led to look for a complex cause when the error already clearly indicated where the failure was.
Alternative C is a common distractor: Public network access: Enabled doesn't disable Service Endpoints; both configurations coexist normally.
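The layering described in this answer (resolve the name first, only then consider routing and firewall rules) can be sketched with Python's standard `socket` module. The helper name `check_name_resolution` and the stub resolver are assumptions for illustration; the stub simulates the failure from the scenario so the example does not depend on a live network.

```python
import socket


def check_name_resolution(hostname, port=443, resolver=socket.getaddrinfo):
    """Return ("resolved", addresses) or ("dns_failure", reason) for hostname."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Resolution failed: no packet was sent to the service, so
        # Service Endpoints and service firewalls are not involved yet.
        return ("dns_failure", str(exc))
    addresses = sorted({info[4][0] for info in results})
    return ("resolved", addresses)


def failing_resolver(host, port, **kwargs):
    """Stub that simulates a DNS server unable to resolve the name."""
    raise socket.gaierror(-2, "Name or service not known")


status, detail = check_name_resolution(
    "kv-producao.vault.azure.net", resolver=failing_resolver
)
print(status, detail)
```

On the affected VM, the same function would be called without the `resolver` argument so that it uses the DNS servers configured for the VNet.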
Answer Key: Scenario 4
Answer: A
The sequence 4 → 2 → 3 → 1 → 5 represents the correct diagnostic reasoning for a 403 error after subnet reorganization.
The first step (4) is to confirm the actual symptom with precision: collecting the exact error from the application ensures we're diagnosing the correct problem before any investigation. Next (2), verifying where the application is actually running is essential, as subnet reorganization may have moved resources unexpectedly. With the real subnet confirmed, the next step (3) is to check the Storage Account firewall to know which subnets are authorized. Then (1) confirms if the Service Endpoint is enabled on the current subnet. Finally (5), the NSG is investigated as a secondary hypothesis, since a 403 error from Storage Account indicates rejection by the network authorization layer, not silent port blocking.
Sequence B starts with network controls before confirming where the application is, which can lead to investigating the wrong subnet. Sequence D has acceptable logic, but postponing the collection of the exact error (step 4) until after checking the current subnet is less efficient than sequence A, because without the confirmed error any subsequent findings might be interpreted imprecisely.
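The order in sequence A can be sketched as an ordered list of checks over facts an engineer has already collected. Everything in the `facts` dictionary is hypothetical sample data (the scenario does not name the subnets), and the function only illustrates the ordering, not an actual Azure query.

```python
def failed_checks(facts):
    """Evaluate the checks in the order 4, 2, 3, 1, 5 and return the failed ones, in that order."""
    actual = facts["actual_subnet"]
    ordered_checks = [
        ("step 4: exact error collected", facts["error_code"] is not None),
        ("step 2: application runs on the expected subnet", actual == facts["expected_subnet"]),
        ("step 3: current subnet is authorized in the Storage Account firewall",
         actual in facts["authorized_subnets"]),
        ("step 1: Microsoft.Storage Service Endpoint enabled on the current subnet",
         "Microsoft.Storage" in facts["endpoints_on_actual_subnet"]),
        ("step 5: NSG allows outbound traffic on port 443", facts["nsg_allows_443"]),
    ]
    return [label for label, passed in ordered_checks if not passed]


# Hypothetical facts: the reorganization moved the application to a new subnet.
facts = {
    "error_code": 403,
    "expected_subnet": "snet-app",
    "actual_subnet": "snet-app-new",
    "authorized_subnets": ["snet-app"],
    "endpoints_on_actual_subnet": [],
    "nsg_allows_443": True,
}

for finding in failed_checks(facts):
    print("FAILED", finding)
```

With this sample data, the findings begin with the subnet mismatch from step 2, which explains the two failures that follow it.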
Troubleshooting Tree: Choose when to use a service endpoint
Color Legend:
| Color | Meaning |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Diagnostic question (decision to verify) |
| Orange | Intermediate validation or verification |
| Red | Identified cause |
| Green | Recommended action or resolution |
To use this tree when facing a real problem, start with the root node by identifying the type of observed symptom: authorization error (403) or name resolution failure. From there, answer each diagnostic question based on what can be verified directly in the portal or via CLI, advancing through the corresponding path until reaching the cause or corrective action. Avoid skipping levels: each branch eliminates a class of hypotheses before advancing to the next.