Troubleshooting Lab: Configure access to service endpoints
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that VMs in the subnet snet-app-prod stopped accessing an Azure Storage Account after a maintenance window the night before. The responsible engineer confirms that no changes were made directly to the Storage Account. Three activities were performed during maintenance: updating the NSG associated with the subnet, resizing two VMs, and resizing the subnet to a smaller CIDR block (10.1.2.0/26, previously 10.1.2.0/24).
When checking the current configuration, the engineer observes:
# Output from az network vnet subnet show command
{
  "addressPrefix": "10.1.2.0/26",
  "serviceEndpoints": [],
  "networkSecurityGroup": {
    "id": "/subscriptions/.../nsg-app-prod"
  }
}
The Storage Account is configured with the following network rule:
{
  "virtualNetworkRules": [
    {
      "virtualNetworkResourceId": "/subscriptions/.../snet-app-prod",
      "action": "Allow",
      "state": "networkSourceDeleted"
    }
  ],
  "defaultAction": "Deny"
}
What is the root cause of the access failure?
A) The NSG was updated during maintenance and is blocking outbound traffic to the Storage Account on port 443.
B) The subnet CIDR resizing operation automatically removed the configured Service Endpoints, and the Storage Account still requires that the authorized subnet have an active Service Endpoint.
C) The Storage Account detected that the source subnet was modified and entered quarantine state, requiring manual reauthorization through the Azure portal.
D) The virtual network rule in the Storage Account became invalid because the subnet Resource ID was changed when the CIDR was modified, and the Service Endpoint no longer exists on the subnet.
Scenario 2: Action Decision
The cause of the problem has been identified: the Service Endpoint for Microsoft.Storage was removed from subnet snet-data-01 during a network reconfiguration performed by another team member. The Storage Account in question is configured with defaultAction: Deny and has only one virtual network rule, which references snet-data-01. The environment is production, and the Storage Account is used by a critical batch process that runs every 30 minutes. The next cycle starts in 12 minutes.
The team has the necessary permissions to change subnet and Storage Account configurations. There is no open maintenance window. The company's internal policy requires opening a change request for production changes, except in cases of active incidents with confirmed production impact.
What is the correct action to take at this moment?
A) Open a formal change request before any action, as the internal policy does not allow production changes outside maintenance windows, even with imminent impact.
B) Re-enable the Service Endpoint for Microsoft.Storage on subnet snet-data-01 immediately, registering the action as an active incident given the confirmed production impact.
C) Temporarily change the Storage Account to defaultAction: Allow until the change request is approved, restoring the Service Endpoint later.
D) Create a new subnet with the Service Endpoint configured and update the Storage Account's virtual network rule to reference the new subnet before the next batch cycle.
Scenario 3: Root Cause
A developer reports that an application hosted on VMs in subnet snet-backend can access Azure Key Vault and Azure Service Bus normally, both of which have Service Endpoints and corresponding virtual network rules configured. However, the same application fails when trying to access an Azure SQL Database whose Service Endpoint and virtual network rule appear to be configured in the same way.
The developer adds that the database was migrated from brazilsouth to eastus2 three days ago, right after the last successful application deployment. The database team confirms that they updated the application's connection string with the new FQDN. The network team confirms that the subnet NSG allows outbound traffic on port 1433 to any destination.
The Service Endpoint configuration on the subnet is:
{
  "serviceEndpoints": [
    { "service": "Microsoft.Sql", "locations": ["brazilsouth"] },
    { "service": "Microsoft.KeyVault", "locations": ["*"] },
    { "service": "Microsoft.ServiceBus", "locations": ["*"] }
  ]
}
What is the root cause of the SQL Database access failure?
A) Azure SQL Database in eastus2 requires that the virtual network rule be recreated after a region migration, as the server Resource ID changes with migration.
B) The NSG only allows outbound traffic on port 1433, but Azure SQL Database also requires communication on ports 11000-11999 for gateway routing, which is blocked.
C) The Service Endpoint for Microsoft.Sql is configured only for the brazilsouth region, not covering the new database location in eastus2.
D) The connection string was updated but the VMs' DNS cache still resolves the old FQDN, redirecting connections to the original server that no longer exists.
Scenario 4: Collateral Impact
To resolve an access problem, an engineer modifies the Service Endpoint configuration on subnet snet-shared, changing the locations field from ["brazilsouth"] to ["*"] on the Microsoft.Storage endpoint. The change resolves the problem immediately and access to the Storage Account in the new region is restored.
Two days later, the security team raises an alert reporting that data from an application in snet-shared is being accessed from an unexpected region and that the regional data confinement model may have been compromised.
What consequence did changing locations to "*" directly cause?
A) The Storage Account began accepting connections from any internet origin, as the * value in the Service Endpoint removes virtual network restrictions configured in the service firewall.
B) The Service Endpoint began routing traffic through Microsoft's backbone to Storage instances in any region, potentially allowing the application to access Storage Accounts in regions outside the data residency policy.
C) The * value in the locations field disables backbone routing optimization, making traffic go through the public internet again and exposing data in transit.
D) The change to * automatically replicated the Storage Account's virtual network rules to all Azure regions, creating unauthorized access points in other geographies.
Answer Key and Explanations
Answer Key: Scenario 1
Answer: D
Explanation:
The decisive clue is in the command output showing "serviceEndpoints": [] and the "networkSourceDeleted" state in the Storage Account rule. These two signals together confirm that the Service Endpoint was removed from the subnet and that the Storage Account recognized the subnet as a source that no longer exists in a valid state.
Resizing a subnet's CIDR is a reconfiguration of the subnet, and that reconfiguration can remove its Service Endpoints. When the Service Endpoint no longer exists on the subnet, the Storage Account marks the corresponding virtual network rule as networkSourceDeleted, a state that effectively blocks traffic even though the rule is still listed.
The information about resizing two VMs is deliberately irrelevant: it has no relation to Service Endpoints or network rules. Focusing on the NSG is a classic diagnostic error. The NSG can block traffic, but the serviceEndpoints: [] output from subnet show, together with the rule state in the Storage Account, is sufficient evidence to rule out the NSG as the root cause before investigating it.
The most dangerous distractor is alternative A: focusing on the NSG without examining the Service Endpoint state would lead the engineer to create unnecessary NSG rules while the real problem remains uncorrected.
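The two signals this explanation relies on can be checked mechanically before anyone edits an NSG. The Python sketch below parses the abbreviated JSON shown in the scenario (real `az network vnet subnet show` output and a real storage network rule set carry more fields) and reports each finding. The function name `endpoint_signals` is hypothetical, and matching the rule state case-insensitively is an assumption about how that value may be cased in different outputs.

```python
import json

# Abbreviated outputs copied from the scenario; the "..." paths are elided as in the source.
SUBNET_JSON = """
{
  "addressPrefix": "10.1.2.0/26",
  "serviceEndpoints": [],
  "networkSecurityGroup": {"id": "/subscriptions/.../nsg-app-prod"}
}
"""

RULES_JSON = """
{
  "virtualNetworkRules": [
    {
      "virtualNetworkResourceId": "/subscriptions/.../snet-app-prod",
      "action": "Allow",
      "state": "networkSourceDeleted"
    }
  ],
  "defaultAction": "Deny"
}
"""


def endpoint_signals(subnet: dict, rules: dict, service: str = "Microsoft.Storage") -> list:
    """Return readable findings for the endpoint and rule-state signals (hypothetical helper)."""
    findings = []
    # Signal 1: is there a Service Endpoint for the service on the subnet?
    services = {ep.get("service") for ep in subnet.get("serviceEndpoints") or []}
    if service not in services:
        findings.append(f"subnet has no {service} service endpoint")
    # Signal 2: does any virtual network rule report a deleted network source?
    for rule in rules.get("virtualNetworkRules", []):
        # Case-insensitive comparison is an assumption about how the state is cased.
        if rule.get("state", "").lower() == "networksourcedeleted":
            findings.append(f"rule for {rule['virtualNetworkResourceId']} is networkSourceDeleted")
    # With defaultAction Deny, traffic that matches no valid rule is rejected.
    if findings and rules.get("defaultAction") == "Deny":
        findings.append("defaultAction is Deny, so unmatched traffic is rejected")
    return findings


if __name__ == "__main__":
    for finding in endpoint_signals(json.loads(SUBNET_JSON), json.loads(RULES_JSON)):
        print("-", finding)
```

An empty result means neither signal is present, which is the point at which looking at the NSG becomes worthwhile.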
Answer Key: Scenario 2
Answer: B
Explanation:
The scenario presents clear constraints: confirmed production impact, a critical process due to run in 12 minutes, and an internal policy that explicitly exempts active incidents with confirmed impact. Together, these facts indicate that the policy exception applies and that the correct action is to restore the Service Endpoint immediately and document it as an incident.
Alternative A ignores the explicit policy exception for active incidents. Alternative C treats the symptom in a risky way: changing defaultAction to Allow opens the Storage Account to traffic from any network source while the change request is processed, which is a disproportionate security risk. Alternative D could be technically viable in another context, but creating a new subnet, moving the workload's VMs into it, and updating the rules in less than 12 minutes in production, without a maintenance window, is riskier and slower than correcting the existing subnet directly.
The correct decision balances urgency, impact, and policy compliance: the exception exists exactly for this type of situation.
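Restoring the endpoint is a single subnet update. The sketch below only builds the Azure CLI command as an argument list; it does not run it. The resource group and virtual network names are hypothetical, because the scenario does not give them. It also assumes that `--service-endpoints` on `az network vnet subnet update` sets the subnet's complete endpoint list, so any endpoints still present are carried over instead of being dropped by the fix.

```python
import shlex

# Hypothetical names: the scenario does not state the resource group or the VNet.
RESOURCE_GROUP = "rg-data-prod"
VNET_NAME = "vnet-data-prod"


def restore_endpoint_command(current: list, service: str = "Microsoft.Storage",
                             subnet: str = "snet-data-01") -> list:
    """Build an `az network vnet subnet update` call that re-adds `service`.

    Assumes --service-endpoints replaces the whole list, so the endpoints that
    are still configured are included alongside the one being restored.
    """
    wanted = list(dict.fromkeys([*current, service]))  # keep order, drop duplicates
    return [
        "az", "network", "vnet", "subnet", "update",
        "--resource-group", RESOURCE_GROUP,
        "--vnet-name", VNET_NAME,
        "--name", subnet,
        "--service-endpoints", *wanted,
    ]


if __name__ == "__main__":
    # Example input: assume the subnet still has a Microsoft.KeyVault endpoint.
    print(shlex.join(restore_endpoint_command(["Microsoft.KeyVault"])))
```

Keeping the command as a list makes it easy to review in the incident record before executing it; it is worth confirming the flag's replace semantics against the installed CLI version.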
Answer Key: Scenario 3
Answer: C
Explanation:
The subnet configuration explicitly shows that Microsoft.Sql is enabled only for "locations": ["brazilsouth"], while the database was migrated to eastus2. Traffic destined for SQL in eastus2 is therefore not covered by the Service Endpoint: it neither follows Microsoft backbone routing nor is recognized by the database firewall as originating from an authorized subnet.
The contrast with Key Vault and Service Bus, which use "*" and work normally, is the structural clue that confirms the diagnosis: the problem is not NSG, DNS, or virtual network rule, but Service Endpoint regional coverage.
The information about the updated connection string is deliberately included as a distractor. It would matter for a name resolution problem, but the scenario presents no symptoms compatible with a DNS failure (such as a resolution timeout or a host-not-found error). The most dangerous distractor is alternative D: DNS cache problems are common after migrations, and because the statement explicitly mentions the connection string update, the reader may conclude that the cache has not caught up yet and ignore the direct evidence in the Service Endpoint configuration.
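The regional-coverage check can be expressed as a small predicate. This sketch uses the subnet configuration from the scenario and treats each `locations` entry literally: a region counts as covered only if it is listed or if the list contains `"*"`. Service-specific behavior, such as paired-region coverage for `Microsoft.Storage`, is deliberately not modeled, and the function name is hypothetical.

```python
# Service Endpoint configuration copied from the scenario.
SUBNET_ENDPOINTS = [
    {"service": "Microsoft.Sql", "locations": ["brazilsouth"]},
    {"service": "Microsoft.KeyVault", "locations": ["*"]},
    {"service": "Microsoft.ServiceBus", "locations": ["*"]},
]


def endpoint_covers(endpoints: list, service: str, region: str) -> bool:
    """True if `service` has an endpoint whose locations include `region` or "*".

    Simplification: regions are matched literally; paired-region rules are ignored.
    """
    for ep in endpoints:
        if ep.get("service") == service:
            locations = ep.get("locations", [])
            return "*" in locations or region in locations
    return False  # no endpoint for this service at all


if __name__ == "__main__":
    for svc in ("Microsoft.Sql", "Microsoft.KeyVault", "Microsoft.ServiceBus"):
        print(svc, "covers eastus2:", endpoint_covers(SUBNET_ENDPOINTS, svc, "eastus2"))
```

Run against the scenario's data, it reports eastus2 as uncovered for Microsoft.Sql and covered for the two `"*"` endpoints, which is the same contrast the explanation uses as its structural clue.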
Answer Key: Scenario 4
Answer: B
Explanation:
The locations field in a Service Endpoint defines the Azure regions in which Microsoft backbone routing is activated for that service. With "*", the Service Endpoint covers all regions, so the application can now route traffic optimally to Storage Accounts in any Azure region.
This does not change the original Storage Account's firewall rules, but it lets the application reach Storage Accounts in other regions that are also configured to accept the subnet. In environments with data residency policies, this can violate regulatory or contractual restrictions that require data to be accessed only within a certain geography.
Alternative A is the most dangerous distractor because it mixes two concepts: the Service Endpoint's locations and the destination service's firewall rules. These are independent controls. Setting locations to * expands backbone routing coverage; it does not modify Storage Account access rules. Alternative C reverses the logic of how the locations field works. Alternative D describes behavior that does not exist: Service Endpoints do not replicate firewall rules between regions.
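A security team could catch this kind of change with a simple audit over the subnet's `serviceEndpoints`. The sketch below flags every `locations` entry outside an allowed set and always flags `"*"`, since it extends coverage to every region. The allowed regions are a hypothetical residency policy, not something stated in the scenario, and the function name is illustrative.

```python
def residency_violations(endpoints: list, allowed_regions: set) -> list:
    """List (service, location) pairs that fall outside `allowed_regions`.

    A "*" entry is always reported because it covers every Azure region.
    """
    violations = []
    for ep in endpoints:
        for location in ep.get("locations", []):
            if location == "*" or location not in allowed_regions:
                violations.append((ep["service"], location))
    return violations


if __name__ == "__main__":
    # Hypothetical residency policy for illustration only.
    allowed = {"brazilsouth", "brazilsoutheast"}
    before = [{"service": "Microsoft.Storage", "locations": ["brazilsouth"]}]
    after = [{"service": "Microsoft.Storage", "locations": ["*"]}]
    print("before:", residency_violations(before, allowed))  # []
    print("after:", residency_violations(after, allowed))    # [('Microsoft.Storage', '*')]
```

The audit looks only at the subnet side; it says nothing about which destination accounts accept the subnet, which is the separate control described in the explanation of alternative A.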
Troubleshooting Tree: Configure access to service endpoints
Color Legend:
| Color | Meaning |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Blue | Objective diagnostic question |
| Orange | Intermediate verification or validation |
| Red | Identified cause |
| Green | Recommended action or resolution |
To use this tree on a real problem, start at the root node, which describes the symptom: loss of access. Answer each question from what is directly observable in the Azure configuration (portal, CLI, or ARM) and follow the path that matches the environment's actual state. Each branch eliminates a hypothesis and moves the investigation toward the most likely cause, which avoids premature corrective actions before the diagnosis is complete.
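The checks that the four scenarios walk through can be condensed into a code sketch that follows the same principle of answering one observable question at a time. This is a hypothetical condensation, not a transcription of the tree's nodes: the `Observed` fields are invented names meant to be filled from portal, CLI, or ARM output, and the walk stops at the first failed check instead of jumping to a fix.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observed:
    """Facts read directly from the Azure configuration (hypothetical field names)."""
    service: str                     # destination service, e.g. "Microsoft.Storage"
    target_region: str               # region of the destination resource
    endpoints: list = field(default_factory=list)  # subnet "serviceEndpoints" entries
    rule_state: Optional[str] = None  # destination's rule state for this subnet, if any
    nsg_allows_outbound: bool = True  # outcome of checking the subnet NSG


def diagnose(obs: Observed) -> str:
    """Answer one observable question at a time and stop at the first failed check."""
    # 1. Does the subnet have a Service Endpoint for the service? (Scenarios 1 and 2)
    ep = next((e for e in obs.endpoints if e.get("service") == obs.service), None)
    if ep is None:
        return f"Cause: no {obs.service} service endpoint on the subnet"
    # 2. Does the endpoint's locations list cover the destination region? (Scenario 3)
    locations = ep.get("locations", [])
    if "*" not in locations and obs.target_region not in locations:
        return f"Cause: endpoint does not cover {obs.target_region}"
    # 3. Does the destination hold a usable virtual network rule for the subnet?
    if obs.rule_state is None:
        return "Cause: no virtual network rule for this subnet on the destination"
    if obs.rule_state.lower() == "networksourcedeleted":
        return "Cause: virtual network rule is in networkSourceDeleted state"
    # 4. Only now look at the NSG, as the Scenario 1 explanation recommends.
    if not obs.nsg_allows_outbound:
        return "Cause: NSG blocks outbound traffic to the service"
    return "No endpoint-related cause found; investigate further"
```

For Scenario 3's data, `diagnose(Observed("Microsoft.Sql", "eastus2", [{"service": "Microsoft.Sql", "locations": ["brazilsouth"]}], "Ready"))` returns `"Cause: endpoint does not cover eastus2"`.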