Troubleshooting Lab: Create service endpoints

Diagnostic Scenarios

Scenario 1 — Root Cause

An operations team reports that an application hosted on a VM in subnet-api stopped being able to write data to a storage account called stgprodeastus. The environment was modified yesterday during a maintenance window that involved NSG adjustments and a corporate network policy update.

The administrator collects the following information:

# Service Endpoint verification on subnet-api
az network vnet subnet show \
  --vnet-name vnet-prod \
  --name subnet-api \
  --resource-group rg-network \
  --query "serviceEndpoints"

# Output:
[
  {
    "locations": ["eastus"],
    "provisioningState": "Succeeded",
    "service": "Microsoft.Storage"
  }
]

# Storage account network rules verification
az storage account network-rule list \
  --account-name stgprodeastus \
  --resource-group rg-app \
  --query "virtualNetworkRules"

# Output:
[
  {
    "action": "Allow",
    "state": "networkSourceDeleted",
    "virtualNetworkResourceId": "/subscriptions/.../vnet-prod/subnets/subnet-api"
  }
]

The NSG associated with subnet-api was reviewed and has no outbound blocking rules for storage. The network administrator confirms that no changes were made to the Service Endpoint during maintenance.

What is the root cause of the access failure?

A. The Service Endpoint was disabled from the subnet during maintenance and the command output is cached.

B. The virtual network rule in the storage account has the state networkSourceDeleted, indicating that the referenced subnet or VNet was deleted or recreated, making the rule invalid.

C. The NSG is blocking outbound traffic to storage despite the review, as implicit lower-priority rules may be active.

D. The Service Endpoint is provisioned only for the eastus region, but the storage account uses geo-redundant replication that routes traffic to another region.

Scenario 2 — Root Cause

A VM in subnet-data normally accesses an Azure Key Vault protected by a network firewall. An engineer adds a second subnet, subnet-mgmt, to the same VNet and enables the Service Endpoint for Microsoft.KeyVault on this new subnet. Minutes later, the original VM in subnet-data begins reporting intermittent failures when trying to retrieve secrets from the Key Vault.

The engineer verifies the following:

# Service Endpoints state on subnet-data
az network vnet subnet show \
  --vnet-name vnet-corp \
  --name subnet-data \
  --resource-group rg-net \
  --query "serviceEndpoints"

# Output:
[
  {
    "locations": ["brazilsouth"],
    "provisioningState": "Succeeded",
    "service": "Microsoft.KeyVault"
  }
]

# Key Vault network rules
Allowed virtual networks:
  - vnet-corp / subnet-data   -> state: Succeeded
  - vnet-corp / subnet-mgmt   -> state: Succeeded

The Key Vault has public access enabled with "Selected networks" configuration. There were no changes to the Key Vault network rules. The security team reports no changes in Key Vault access policies or RBAC during this period.

What is the root cause of the intermittent failures in subnet-data?

A. Adding subnet-mgmt to the Key Vault rules generated a routing conflict between the two subnets, causing duplicate packets.

B. Enabling the Service Endpoint on subnet-mgmt triggered a brief network reconfiguration on the host, which momentarily affected the network programming of neighboring subnets in the same VNet, causing temporary interruption of active TCP connections from subnet-data.

C. The Key Vault reached the limit of simultaneous virtual network rules, rejecting connections from the older subnet as a form of load balancing.

D. The security team's information is incomplete; the real cause is a silent change in a Key Vault access policy automatically triggered by adding a new subnet.

Scenario 3 — Action Decision

The cause has been identified: the virtual network rule of a critical storage account (stg-finance) references a subnet that was deleted and recreated during an infrastructure migration performed two weeks ago. The rule state is networkSourceDeleted. No application has been able to write to the storage account since then.

The environment has the following constraints:

- The storage account is actively used by a financial processing pipeline that runs every 15 minutes.
- The pipeline has retry logic and tolerates failures of up to 5 minutes without data impact.
- The recreated subnet already has the Service Endpoint for Microsoft.Storage enabled and successfully provisioned.
- The administrator has Contributor permission on the storage account resource group.
- A second contingency storage account (stg-finance-bkp) is available but has never been tested in production.

What is the correct action to take at this time?

A. Remove the invalid rule with state networkSourceDeleted and immediately add a new virtual network rule pointing to the recreated subnet, taking advantage of the pipeline's retry window.

B. Redirect the pipeline to the contingency storage account stg-finance-bkp before any changes to the primary account, as any network rule modification may cause additional unavailability.

C. Open a support ticket for Microsoft to restore the rule state to Succeeded, as changes to storage account network rules in active environments require support approval.

D. Wait for the next scheduled maintenance window to remove the invalid rule and add the new one, as modifications to storage account network rules always cause prolonged unavailability.

Scenario 4 — Diagnostic Sequence

A production application reports the error 403 "This request is not authorized to perform this operation" when trying to access Azure Blob Storage. The administrator knows that the environment uses Service Endpoints and that the storage account firewall is configured for "Selected networks".

The following investigation steps are available, out of order:

[P] Verify if the Service Endpoint for Microsoft.Storage is enabled
    on the source subnet of the VM

[Q] Verify if the source subnet is listed in the storage account's
    virtual network rules and if the state is "Succeeded"

[R] Verify if the application's credentials or Managed Identity
    have the correct role (e.g., Storage Blob Data Contributor) on the account

[S] Verify if the VM's public IP or NAT Gateway is in the storage
    account firewall's allowed IP list

[T] Confirm if the storage account is configured as "Selected networks"
    or "Enabled from all networks"

What is the correct investigation sequence?

A. T -> P -> Q -> R -> S

B. P -> T -> S -> Q -> R

C. R -> T -> P -> Q -> S

D. S -> P -> T -> Q -> R

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The decisive clue is in the output of the command that lists the storage account's virtual network rules: the state field shows the value networkSourceDeleted. This state indicates that the resource referenced by the rule, in this case the VNet and subnet combination, was deleted or recreated after the rule was created. When a subnet is deleted and recreated, even with the same name and CIDR, Azure treats it as a new resource, so the existing rule no longer matches it and becomes invalid. The Service Endpoint may be correctly provisioned on the new subnet, but the storage account firewall rule still refers to the deleted one.
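The check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lab: it assumes the JSON printed by the az storage account network-rule list call has been captured as text, and find_invalid_rules is a hypothetical helper name. The comparison ignores letter case so that it does not depend on how the state value is capitalized.

```python
import json

# Sample output copied from the scenario's
# `az storage account network-rule list --query "virtualNetworkRules"` call.
RULES_JSON = """
[
  {
    "action": "Allow",
    "state": "networkSourceDeleted",
    "virtualNetworkResourceId": "/subscriptions/.../vnet-prod/subnets/subnet-api"
  }
]
"""


def find_invalid_rules(rules):
    """Return every rule whose state is anything other than Succeeded."""
    return [r for r in rules if r.get("state", "").lower() != "succeeded"]


invalid = find_invalid_rules(json.loads(RULES_JSON))
for rule in invalid:
    # Each line names the state and the subnet the stale rule refers to.
    print(f"{rule['state']}: {rule['virtualNetworkResourceId']}")
```

Any rule this returns is a candidate for removal and re-creation, regardless of what the subnet's own serviceEndpoints output shows.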

The NSG adjustments made during maintenance are a distraction, deliberately included to pull the diagnosis toward alternative C. An NSG with no blocking rules cannot explain the failure, whereas the output of the network rules command points directly to the cause.

The most dangerous distractor is alternative C. Focusing on the NSG is the most common diagnostic error after a recent infrastructure change, but the networkSourceDeleted state is objective evidence that outweighs any suspicion about the NSG. Changing the NSG without correcting the invalid rule would not restore access.

Answer Key — Scenario 2

Answer: B

The root cause is the network reprogramming that occurs when a Service Endpoint is enabled on a subnet. Azure documents that this switch reprograms networking at the virtual host level and can briefly disconnect open TCP connections from VMs in the modified subnet. In this scenario, especially in a VNet with multiple subnets carrying heavy traffic, the same reconfiguration momentarily interrupted active TCP connections from VMs in the neighboring subnet-data as well.

Temporal correlation is the central clue: failures in subnet-data started minutes after enabling the endpoint on subnet-mgmt. There were no changes to Key Vault rules or access policies.
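One way to make the temporal correlation concrete is to compare failure timestamps with the time of the change. The sketch below is only an illustration: the timestamps are invented, the 10-minute window is an arbitrary assumption, and failures_after_change is a hypothetical helper, not an Azure API.

```python
from datetime import datetime, timedelta


def failures_after_change(change_time, failure_times, window=timedelta(minutes=10)):
    """Split failures into those within `window` after the change and all others."""
    inside = [t for t in failure_times if change_time <= t <= change_time + window]
    outside = [t for t in failure_times if t not in inside]
    return inside, outside


# Hypothetical timestamps, invented for illustration only.
change = datetime(2024, 1, 1, 10, 0)  # endpoint enabled on subnet-mgmt
failures = [
    datetime(2024, 1, 1, 10, 2),
    datetime(2024, 1, 1, 10, 4),
    datetime(2024, 1, 1, 10, 7),
]

inside, outside = failures_after_change(change, failures)
print(f"{len(inside)} of {len(failures)} failures fall within 10 minutes of the change")
```

If most failures cluster right after the change and then stop, a transient cause is more plausible than a persistent misconfiguration, which would keep failing.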

The information about the Key Vault rules state (Succeeded for both subnets) and the security team's confirmation of no RBAC changes are relevant details that eliminate authorization causes, confirming the problem is temporary connectivity, not permission.

The most dangerous distractor is alternative D. It induces the reader to distrust a reliable information source (the security team) and seek an invisible cause, when the real cause is a known and documented technical behavior.

Answer Key — Scenario 3

Answer: A

The correct action is to remove the invalid rule and add the new one immediately, utilizing the 5-minute failure tolerance window that the pipeline already has by design. All prerequisites for the correction are satisfied: the Service Endpoint is already provisioned and in Succeeded state on the recreated subnet, and the administrator has the necessary permission (Contributor) to modify storage account network rules.
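The remove-then-add correction can be sketched as two Azure CLI calls. The Python below only builds and prints the commands, it does not run them. The resource group and both subnet IDs are placeholders because the scenario does not state them, and build_rule_fix_commands is a hypothetical helper; the az storage account network-rule remove and add subcommands and their --subnet parameter are the only real CLI surface assumed.

```python
import shlex


def build_rule_fix_commands(account, resource_group, stale_subnet_id, new_subnet_id):
    """Return the az CLI argument lists: remove the stale rule first, then add the new one."""
    common = ["--account-name", account, "--resource-group", resource_group]
    remove = ["az", "storage", "account", "network-rule", "remove",
              *common, "--subnet", stale_subnet_id]
    add = ["az", "storage", "account", "network-rule", "add",
           *common, "--subnet", new_subnet_id]
    return [remove, add]


# Placeholder values: the scenario does not give the resource group or subnet IDs.
commands = build_rule_fix_commands(
    account="stg-finance",
    resource_group="<resource-group>",
    stale_subnet_id="<subnet-id-from-stale-rule>",
    new_subnet_id="<recreated-subnet-id>",
)

for cmd in commands:
    # shlex.join quotes the placeholders so the printed line is safe to paste into a shell.
    print(shlex.join(cmd))
```

Issuing the two commands back to back keeps the gap between removal and re-creation short, which is what lets the change fit inside the pipeline's 5-minute tolerance.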

Alternative B represents the correct action applied at the wrong time. Redirecting to an untested contingency account in production introduces greater risk than direct correction, especially when existing failure tolerance already covers the time needed for the operation.

Alternative D is the most dangerous distractor because it sounds prudent. However, modifying virtual network rules on a storage account is a control plane operation that doesn't cause prolonged unavailability. Waiting for a maintenance window out of caution would unnecessarily prolong an active failure that has already been causing impact for two weeks.

Alternative C is false: there is no support approval requirement for storage account network rule modifications through the portal, CLI, or ARM.

Answer Key — Scenario 4

Answer: A

The correct sequence is T -> P -> Q -> R -> S.

The correct diagnostic reasoning follows a progression from general to specific, eliminating hypotheses at the outermost layer before going deeper:

T confirms the account's base configuration: if it's "Enabled from all networks," the entire Service Endpoint investigation is unnecessary and the problem is in another layer. This step filters the scenario most quickly.

P verifies if the Service Endpoint mechanism is active at the source. Without the endpoint enabled on the subnet, traffic reaches the storage account from a public source IP instead of carrying the subnet's VNet identity, so no virtual network rule can match it.

Q verifies if the network rule is present, valid, and in Succeeded state. Even with P confirmed, an absent rule or one with networkSourceDeleted state blocks access.

R investigates the authorization layer. Error 403 can originate from insufficient RBAC permission (e.g., a Managed Identity without the correct role), not only from the network firewall.

S is the last step because, when a Service Endpoint is in use, traffic does not present a public source IP to the storage account. Checking allowed IP lists only makes sense after confirming the Service Endpoint path is correctly configured or ruled out.
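The ordering T -> P -> Q -> R -> S can be sketched as a small function that evaluates the checks in that sequence. This is a sketch under stated assumptions: the env dictionary and its keys are hypothetical, standing in for values an administrator would collect with the CLI or portal, and "Selected networks" is represented by a firewall default action of Deny.

```python
def run_checks(env):
    """Evaluate the checks in the order T -> P -> Q -> R -> S and return (step, status) pairs."""
    # T: "Selected networks" corresponds to a default action of Deny on the account firewall.
    firewall_on = env["default_action"].lower() == "deny"
    results = [("T", "selected networks" if firewall_on else "all networks")]

    if firewall_on:
        # P: is the Microsoft.Storage endpoint enabled on the source subnet?
        endpoint_ok = "Microsoft.Storage" in env["subnet_endpoints"]
        results.append(("P", "ok" if endpoint_ok else "problem"))
        # Q: is the subnet listed in the rules (state is not None) with state Succeeded?
        state = env["rule_state"]
        rule_ok = state is not None and state.lower() == "succeeded"
        results.append(("Q", "ok" if rule_ok else "problem"))
    else:
        # With the firewall open to all networks, the endpoint checks are unnecessary.
        results += [("P", "skipped"), ("Q", "skipped")]

    # R: authorization layer (e.g., Storage Blob Data Contributor on the account).
    results.append(("R", "ok" if env["has_data_role"] else "problem"))

    # S: allowed IP list, only meaningful when the firewall is filtering.
    if firewall_on:
        results.append(("S", "ok" if env["public_ip_allowed"] else "not allowed"))
    else:
        results.append(("S", "skipped"))
    return results


# Hypothetical environment resembling Scenario 1: endpoint enabled, rule stale.
example = {
    "default_action": "Deny",
    "subnet_endpoints": ["Microsoft.Storage"],
    "rule_state": "networkSourceDeleted",
    "has_data_role": True,
    "public_ip_allowed": False,
}

for step, status in run_checks(example):
    print(step, status)
```

Reading the output top to bottom mirrors the reasoning in the answer key: the first "problem" entry is the layer to fix before looking any further down the list.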

Alternative C is the most common error: starting with the authentication layer (R) before validating network infrastructure. A 403 error induces the reflex to check credentials first, but when the environment uses a network firewall, the most likely cause is a missing Service Endpoint or a missing or invalid network rule, not permission.

Troubleshooting Tree: Create service endpoints

Color legend:

Color	Node type
Dark blue	Initial symptom (entry point)
Blue	Diagnostic question (yes/no or observable state)
Red	Identified cause
Green	Recommended action or resolution
Orange	Intermediate verification or alternative path

To use this tree when facing a real problem, always start from the root node (access failure symptom) and answer each question based on what is directly observable in the environment, without assuming states. Each branch eliminates a hypothesis and directs to the next verification, until a cause is named and corrective action is prescribed. When the failure is intermittent and recent, also check the lateral path that branches from the root node to diagnose transient network reconfiguration.
