Troubleshooting Lab: Configure service endpoint policies
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that VMs in a subnet called snet-app-prod stopped accessing a storage account called storagecontosoapp after a maintenance window last Friday. The VNet is in the eastus2 region. The infrastructure team confirms that no NSG rules were changed during maintenance. The service endpoint for Microsoft.Storage remains enabled on the subnet.
During investigation, an engineer runs the following command and gets the output below:
az network vnet subnet show \
  --resource-group rg-prod \
  --vnet-name vnet-prod \
  --name snet-app-prod \
  --query "{endpoints:serviceEndpoints, policies:serviceEndpointPolicies}"
{
  "endpoints": [
    {
      "provisioningState": "Succeeded",
      "service": "Microsoft.Storage",
      "locations": ["eastus2", "eastus"]
    }
  ],
  "policies": [
    {
      "id": "/subscriptions/aaaa-bbbb/resourceGroups/rg-network/providers/Microsoft.Network/serviceEndpointPolicies/policy-storage-prod"
    }
  ]
}
Next, the engineer inspects the policy and finds:
{
  "serviceEndpointPolicyDefinitions": [
    {
      "service": "Microsoft.Storage",
      "serviceResources": [
        "/subscriptions/aaaa-bbbb/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/storagecontosobackup"
      ]
    }
  ]
}
The team also mentions that during maintenance, a new storage account for backup was created and that the access policy of the storagecontosoapp account was not modified.
What is the root cause of the access failure to the storagecontosoapp account?
A) The service endpoint on the subnet is not configured for the correct region, as eastus2 and eastus are treated separately
B) The service endpoint policy associated with the subnet lists only the backup account storagecontosobackup, excluding the storagecontosoapp account
C) The policy was created in the rg-network resource group, different from the rg-prod resource group where the subnet is located, causing scope incompatibility
D) The access policy of the storagecontosoapp account was revoked during the creation of the new backup account, blocking traffic
Scenario 2: Action Decision
The cause of a connectivity failure has been identified: a service endpoint policy in production was incorrectly updated by an operator and now references only a development resource group, blocking access from all production subnets to the correct storage accounts. There are 14 critical applications affected. The production environment cannot be shut down. The security team requires that any policy changes go through formal approval before being applied, but there is an emergency process that allows immediate changes with subsequent logging.
What is the correct action to take at this moment?
A) Remove the association between the service endpoint policy and all production subnets immediately, restoring unrestricted access temporarily while the formal fix is prepared
B) Trigger the emergency process, fix the policy's serviceResources to reference the correct production resource groups, and log the change after application
C) Create a new service endpoint policy with the correct configuration and wait for formal approval before associating it with the subnets, keeping the incorrect policy in place
D) Revert the policy to the previous version via Azure Resource Manager by exporting the template before the incorrect change and redeploying via CI/CD pipeline
Scenario 3: Root Cause
A developer reports that an application in the snet-analytics subnet can access Azure Blob Storage normally but cannot access an Azure Data Lake Storage Gen2 account called dlscontosoanalytics in the same subscription. Both accounts are in the westus2 region, the same region as the VNet. The developer mentions that the Data Lake was created three weeks ago and that everything worked normally before that date. He also mentions that the network team added a service endpoint policy to the subnet four weeks ago as part of a hardening project.
The subnet has the following state:
az network vnet subnet show \
  --resource-group rg-analytics \
  --vnet-name vnet-analytics \
  --name snet-analytics \
  --query "serviceEndpointPolicies"
[
  {
    "id": "/subscriptions/cccc-dddd/resourceGroups/rg-network/providers/Microsoft.Network/serviceEndpointPolicies/policy-hardening-v1"
  }
]
Policy inspection reveals:
{
  "serviceEndpointPolicyDefinitions": [
    {
      "service": "Microsoft.Storage",
      "serviceResources": [
        "/subscriptions/cccc-dddd/resourceGroups/rg-analytics/providers/Microsoft.Storage/storageAccounts/blobcontosoanalytics"
      ]
    }
  ]
}
The developer believes the problem started when the Data Lake was created, and that maybe the Microsoft.Storage service type doesn't cover Data Lake Storage Gen2.
What is the root cause of the problem?
A) The Microsoft.Storage service type doesn't cover Azure Data Lake Storage Gen2, requiring a separate type in the policy
B) The service endpoint policy was added four weeks ago and is blocking all storage traffic, including Blob Storage, contradicting the developer's report
C) The policy explicitly lists only the blobcontosoanalytics account, and the Data Lake Storage Gen2 dlscontosoanalytics is a distinct storage account not included in the allowlist
D) Data Lake Storage Gen2 requires the service endpoint to be specifically enabled for Microsoft.DataLake, not for Microsoft.Storage
Scenario 4: Collateral Impact
A team identifies that a service endpoint policy on a subnet is more restrictive than necessary. To resolve the issue quickly, an administrator removes the policy association from the subnet via the Azure portal. Access to storage accounts is immediately restored for all applications on the subnet.
What secondary consequence can this action cause?
A) The service endpoint for Microsoft.Storage is automatically disabled from the subnet when the policy is disassociated, interrupting traffic again after a few minutes
B) The subnet gains unrestricted access to any Azure storage account via service endpoint, including accounts from other organizations and other tenants
C) All other service endpoint policies associated with other subnets in the same VNet are automatically disassociated as a cascading effect
D) Policy removal invalidates the service endpoint routes in the subnet's effective route table, forcing storage traffic to exit through the internet
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The policy inspection output shows that serviceResources contains only the storagecontosobackup account, which was created during maintenance. Because a service endpoint policy functions as an explicit allowlist, service endpoint traffic from the subnet to any storage account not listed is denied. The storagecontosoapp account, the original traffic destination, is not in the list, so access to it fails.
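The allowlist behavior described above can be sketched in a few lines of Python. This is a simplified model for reasoning about the scenario, not Azure's actual enforcement logic: it assumes that a serviceResources entry may name a subscription, a resource group, or a single storage account, and that an entry covers every resource ID beneath it. The function names are hypothetical, and the resource group of storagecontosoapp (rg-prod) is an assumption, since the scenario does not state it.

```python
def is_covered(scope: str, target: str) -> bool:
    """Return True if target equals scope or sits beneath it in the ID hierarchy."""
    scope = scope.rstrip("/").lower()
    target = target.rstrip("/").lower()
    # Compare on a path-segment boundary so "rg-prod" does not match "rg-prod2".
    return target == scope or target.startswith(scope + "/")


def policy_allows(service_resources: list, target: str) -> bool:
    """Allowlist check: the target must be covered by at least one entry."""
    return any(is_covered(scope, target) for scope in service_resources)


# Assumption: both accounts live in rg-prod; only the backup account's ID appears in the source.
PREFIX = (
    "/subscriptions/aaaa-bbbb/resourceGroups/rg-prod"
    "/providers/Microsoft.Storage/storageAccounts/"
)
policy = [PREFIX + "storagecontosobackup"]

print(policy_allows(policy, PREFIX + "storagecontosobackup"))  # True
print(policy_allows(policy, PREFIX + "storagecontosoapp"))     # False
```

Under this model, the only ways to restore access are to add storagecontosoapp to the list or to widen an entry to a resource group or subscription that contains it.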
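The decisive clue in the scenario is the temporal sequence: the failure began right after a maintenance window in which the backup account was created, and the policy now lists only that account. The most plausible explanation is that the policy was updated during maintenance to include the new backup account and the original account was not kept in the list. This is a classic operational error of replacement instead of addition.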
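One way to guard against the replacement-instead-of-addition error is to compute the new list from the existing one instead of typing it from scratch. The helper below is a minimal sketch with a hypothetical name. It assumes the policy originally listed storagecontosoapp in rg-prod, which the scenario implies but does not show.

```python
def add_service_resources(existing: list, additions: list) -> list:
    """Return the existing entries plus new ones, keeping order and
    skipping duplicates (resource IDs are compared case-insensitively)."""
    merged = list(existing)
    seen = {entry.lower() for entry in existing}
    for entry in additions:
        if entry.lower() not in seen:
            merged.append(entry)
            seen.add(entry.lower())
    return merged


# Assumption: the pre-maintenance policy contained only the app account.
PREFIX = (
    "/subscriptions/aaaa-bbbb/resourceGroups/rg-prod"
    "/providers/Microsoft.Storage/storageAccounts/"
)
before = [PREFIX + "storagecontosoapp"]
after = add_service_resources(before, [PREFIX + "storagecontosobackup"])

for entry in after:
    print(entry.rsplit("/", 1)[-1])
# storagecontosoapp
# storagecontosobackup
```

The merged list would then be submitted as the full serviceResources value, so the original destination survives the update.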
The irrelevant information in the scenario is the mention that the access policy of the storagecontosoapp account was not modified. This information leads the reader to discard this vector, but the correct focus is on the service endpoint policy, not the account's access policy itself.
Distractor C exploits a common misconception: the location of the policy resource in a different resource group from the subnet does not cause incompatibility. The policy is associated with the subnet by resource ID, regardless of the resource group where it was created. The most dangerous distractor is D, as it directs diagnosis toward the storage account instead of the network policy, which would lead to a completely misguided investigation.
Answer Key: Scenario 2
Answer: B
The scenario presents explicit constraints: production cannot be shut down, security requires formal approval, but there is an emergency process for critical situations. With 14 applications affected, the situation qualifies as an emergency. The correct action is to use the available process for this: fix the policy immediately and log the change afterward.
Distractor A is technically valid as a contingency measure, but it removes the access control entirely and exposes the production subnets to unrestricted storage access over the service endpoint. It is a more drastic action than necessary.
Distractor C respects the formal process but ignores the urgency of the situation and the fact that an emergency process exists exactly for this type of case. Keeping the incorrect policy in place while awaiting approval unnecessarily prolongs the impact.
Distractor D would be valid if a previous template existed and was accessible, but introduces unnecessary operational latency via pipeline when direct correction of the serviceResources is faster and equally effective.
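The emergency path in answer B ends with logging the change after it is applied. A minimal sketch of what such a record could contain is shown below: the entries removed from and added to serviceResources, plus a timestamp and a reason. The function name, record fields, subscription ID, policy name, and resource group names are all hypothetical, since the scenario names none of them.

```python
from datetime import datetime, timezone


def build_change_record(policy_name, before, after, reason):
    """Summarize a serviceResources change as removed and added entries."""
    before_set, after_set = set(before), set(after)
    return {
        "policy": policy_name,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
        "removed": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


SUB = "/subscriptions/0000-1111"  # hypothetical subscription ID
record = build_change_record(
    policy_name="policy-example",  # hypothetical policy name
    before=[f"{SUB}/resourceGroups/rg-dev"],
    after=[
        f"{SUB}/resourceGroups/rg-prod-data",
        f"{SUB}/resourceGroups/rg-prod-apps",
    ],
    reason="Emergency fix: policy referenced only a development resource group",
)
print(record["removed"])
print(record["added"])
```

Capturing the before and after lists at the moment of the fix gives the security team the same information a formal approval would have reviewed.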
Answer Key: Scenario 3
Answer: C
Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage, and Azure treats a Data Lake Storage Gen2 account as a regular storage account under the Microsoft.Storage resource provider. The Microsoft.Storage service type in the policy therefore already covers the Data Lake, which makes the developer's hypothesis incorrect.
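A quick way to test the developer's hypothesis is to compare the resource types embedded in the two resource IDs. The sketch below extracts the provider namespace and type from an ID. The helper name is hypothetical, and the Data Lake's resource ID is an assumption: the scenario says only that the account is in the same subscription, so the rg-analytics resource group is a guess.

```python
def resource_type(resource_id: str) -> str:
    """Return 'Namespace/type' for a top-level ARM resource ID."""
    parts = resource_id.strip("/").split("/")
    lowered = [part.lower() for part in parts]
    i = lowered.index("providers")
    return f"{parts[i + 1]}/{parts[i + 2]}"


BASE = "/subscriptions/cccc-dddd/resourceGroups/rg-analytics/providers"
blob_id = BASE + "/Microsoft.Storage/storageAccounts/blobcontosoanalytics"
# Assumption: the Data Lake's resource group is not given in the scenario.
lake_id = BASE + "/Microsoft.Storage/storageAccounts/dlscontosoanalytics"

print(resource_type(blob_id))  # Microsoft.Storage/storageAccounts
print(resource_type(lake_id))  # Microsoft.Storage/storageAccounts
```

Because both IDs resolve to the same type, the policy's service field cannot be what distinguishes the two accounts; only the serviceResources list can.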
The real problem lies in the policy structure: it lists only the blobcontosoanalytics account in serviceResources. When the Data Lake dlscontosoanalytics was created three weeks ago, no one updated the policy to include it. Since the policy functions as an allowlist, traffic to the Data Lake is blocked, while traffic to the listed Blob Storage continues working normally.
The irrelevant information is the developer's belief that Microsoft.Storage doesn't cover Data Lake Storage Gen2. This reasoning is plausible for someone unfamiliar with the internal implementation, but leads to distractor A, which would be the most common and most dangerous diagnostic error: if the team acts based on this hypothesis, they will try to add a non-existent service type and won't solve the real problem.
Distractor B can be eliminated by the developer's own report, which confirms that Blob Storage still works. If the policy blocked everything, both would be inaccessible.
Answer Key: Scenario 4
Answer: B
When a service endpoint policy is disassociated from a subnet, the service endpoint itself remains enabled. What changes is that the destination restriction is removed. Without the policy, traffic from the subnet via service endpoint can reach any Azure storage account, including accounts from other tenants and other organizations. This is exactly the risk that service endpoint policies were created to mitigate.
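The difference between a restrictive policy and no policy at all can be modeled as follows. This is a simplified sketch, not the platform's implementation: it assumes that a subnet with no associated policy applies no destination filtering to service endpoint traffic, and that multiple associated policies combine as a union. The function name and the external account's resource ID are hypothetical. The model covers only the network-side filter; the target account's own firewall and authorization still apply.

```python
def subnet_permits(associated_policies: list, target: str) -> bool:
    """associated_policies holds one serviceResources list per policy."""
    if not associated_policies:
        # No policy associated: destinations are not filtered at all.
        return True
    target = target.rstrip("/").lower()
    scopes = [
        scope.rstrip("/").lower()
        for policy in associated_policies
        for scope in policy
    ]
    return any(target == s or target.startswith(s + "/") for s in scopes)


own_rg = "/subscriptions/aaaa-bbbb/resourceGroups/rg-prod"
# Hypothetical account in a subscription the organization does not own.
external = (
    "/subscriptions/ffff-0000/resourceGroups/rg-external"
    "/providers/Microsoft.Storage/storageAccounts/unknownaccount"
)

print(subnet_permits([[own_rg]], external))  # False
print(subnet_permits([], external))          # True
```

The second call is the state the subnet is in after the administrator's quick fix: nothing fails, and nothing is restricted.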
Distractor A represents a misconception about the coupling between service endpoints and policies. They are independent resources: removing the policy doesn't affect the service endpoint. Distractor C describes a cascading effect behavior that doesn't exist in the platform; policies are individually associated with each subnet. Distractor D is also incorrect: service endpoint routes are determined by the service endpoint enabled on the subnet, not by the policy associated with it.
The correct collateral impact is also the most silent and dangerous one: there are no error messages and no service interruption, but the subnet remains exposed until a corrected policy is reassociated.
Troubleshooting Tree: Configure service endpoint policies
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom or reported failure |
| Blue | Objective diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start with the root node describing the observed failure. At each decision node, verify the current state of the environment before proceeding; never assume the answer. Follow the path that corresponds to what you observed, without skipping levels. When you reach a red node, you have identified the root cause. The green node immediately connected to it indicates the correct action. If you reach the orange validation node without resolution, the problem is outside the scope of service endpoint policies and requires investigation in another network layer.