Troubleshooting Lab: Plan and configure shared or dedicated subnets

Diagnostic Scenarios

Scenario 1 — Root Cause

A team deployed an Azure API Management instance (external mode) into an existing VNet. The chosen subnet is a /27 and already hosts two monitoring virtual machines with statically assigned IPs. The subnet's NSG allows inbound traffic on ports 80 and 443 from any source.

After the deployment completes successfully in the portal, the API Management internal health checks start failing intermittently. The operations team opens a diagnostic session and collects the following events from the activity log:

[APIM] Health probe to backend pool: TIMEOUT
[APIM] Management endpoint 3443 unreachable from control plane
[APIM] Gateway status: Degraded
Source IP of probe: 20.37.158.0/23 (tag: ApiManagement)

The team identifies that the monitoring VMs are operating normally and that user traffic on port 443 is reaching the gateway. The team suspects VM overload as the cause of the timeout.

What is the root cause of the observed behavior?

A) The /27 subnet size exhausted available IP addresses after VM allocation, preventing APIM from creating its internal health probe instances.

B) The subnet's NSG does not allow inbound traffic on port 3443 originating from the ApiManagement service tag, blocking the service control plane.

C) The monitoring VMs with static IPs conflict with addresses reserved by Azure API Management, causing address collision.

D) External mode API Management requires a subnet without any pre-existing resources, and the presence of VMs invalidates the gateway configuration.

Scenario 2 — Action Decision

The infrastructure team identified that an Azure SQL Managed Instance in production is experiencing intermittent connectivity failures for clients within the same VNet. The cause has been confirmed: a User Defined Route (UDR) associated with the Managed Instance subnet redirects outbound traffic through an NVA (Network Virtual Appliance) that has high latency and is occasionally dropping packets.

The environment has the following constraints:

The Managed Instance is in continuous use by a financial application with a 99.9% SLA
The UDR was applied 3 days ago by another team as part of a traffic auditing project
Removing the UDR requires security committee approval, which only meets weekly
The NVA belongs to a different subscription managed by the security team
There is a scheduled maintenance window in 72 hours

What is the correct action to take at this moment?

A) Immediately remove the UDR from the Managed Instance subnet to restore default routing, since the cause is confirmed and production impact justifies the action.

B) Escalate the incident to the security team with root cause evidence and request that the NVA be stabilized or that a temporary bypass be approved under an emergency procedure, leaving the UDR unchanged until authorization is granted.

C) Wait for the maintenance window in 72 hours to apply the correction in a controlled manner, documenting the impact in the interim period.

D) Recreate the Managed Instance in a new subnet without the problematic UDR, given that the cause is identified and the environment allows failover.

Scenario 3 — Root Cause

An engineer tries to deploy an Azure Container Apps Environment in an existing subnet of a corporate VNet. The deployment fails with the following message:

Error: InsufficientSubnetSize
Message: "The subnet '/subscriptions/.../subnets/aca-subnet' does not have 
enough available IP addresses. Required: 27, Available: 14."

The engineer inspects the subnet and collects the following information:

Subnet CIDR:     10.1.4.0/27
Total addresses: 32
Reserved by Azure: 5
Currently allocated:
  - 10.1.4.5  (VM jumpbox)
  - 10.1.4.6  (VM jumpbox replica)
  - 10.1.4.7  (Private Endpoint - Storage Account)
  - 10.1.4.8  (Private Endpoint - Key Vault)
Total used:      9
Available:       18 (32 - 5 - 9 = 18)

The engineer concludes that there are 18 available addresses and that the "14 available" error is incorrect, interpreting the problem as a portal update bug. He decides to try the deployment again without changes.

What is the actual root cause of the failure?

A) Azure Container Apps Environment requires subnet delegation to Microsoft.App/environments, and the absence of this delegation makes the service erroneously report a lower IP count than the actual count.

B) Private Endpoints consume an additional block of IPs invisible in the portal for their internal network interfaces, reducing the actual available space below what the manual calculation indicates.

C) The /27 subnet has only 32 total addresses, and Azure Container Apps Environment requires a minimum of /27 with at least 27 free addresses, making any previous subnet use incompatible with the deployment.

D) The engineer's calculation is incorrect: Private Endpoints consume additional addresses for their NICs that don't appear in the portal's allocated IPs listing, and the actual available space is lower than manually calculated.

Scenario 4 — Diagnostic Sequence

A team receives the following alert in a production environment:

"VMs in app-subnet lost connectivity to VMs in db-subnet within the same VNet. External connectivity from VMs in app-subnet is normal."

The available investigation steps are:

1. Check if there's a UDR associated with one of the subnets that redirects traffic to an incorrect or non-existent next hop
2. Confirm that both subnets belong to the same VNet and that there's no misconfigured VNet peering between them
3. Check NSG rules associated with db-subnet to identify if there's an explicit deny rule for the app-subnet IP range
4. Use Network Watcher IP Flow Verify to test connectivity from source IP to destination IP on the affected port
5. Check if Azure Firewall or NVA is intercepting traffic between subnets and dropping packets without visible logs

What is the correct investigation sequence?

A) 2 -> 3 -> 1 -> 4 -> 5

B) 4 -> 2 -> 3 -> 1 -> 5

C) 1 -> 5 -> 2 -> 3 -> 4

D) 2 -> 4 -> 3 -> 1 -> 5

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The decisive clue is in the collected log: port 3443 is the Azure API Management management port, used by Microsoft's control plane to check health and configure the gateway. The ApiManagement service tag represents Azure infrastructure IPs that need to reach the service on this port. The subnet's NSG was configured only for ports 80 and 443, leaving 3443 blocked.
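The rule evaluation behind this answer can be sketched in a few lines of Python. This is a deliberately simplified model, not the Azure SDK: the NsgRule class, the inbound_allowed function, the rule names, and the priorities are all hypothetical, and a "no match" result stands in for the default deny that applies when no allow rule covers the traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NsgRule:
    """Simplified inbound NSG rule (hypothetical model, not an Azure SDK type)."""
    name: str
    priority: int      # lower number is evaluated first
    access: str        # "Allow" or "Deny"
    source: str        # service tag, or "*" for any source
    ports: frozenset   # destination ports; an empty set means any port


def inbound_allowed(rules, source_tag, port):
    """Return True if the first matching rule (by priority) allows the traffic."""
    for rule in sorted(rules, key=lambda r: r.priority):
        source_matches = rule.source in ("*", source_tag)
        port_matches = not rule.ports or port in rule.ports
        if source_matches and port_matches:
            return rule.access == "Allow"
    # Nothing matched: treat as denied, standing in for the default deny rule.
    return False


# Rule set from the scenario: only ports 80 and 443 are open.
scenario_rules = [
    NsgRule("allow-web", 100, "Allow", "*", frozenset({80, 443})),
]

# Same rule set plus the missing management rule (name and priority assumed).
fixed_rules = scenario_rules + [
    NsgRule("allow-apim-management", 110, "Allow", "ApiManagement", frozenset({3443})),
]

print(inbound_allowed(scenario_rules, "ApiManagement", 3443))  # False
print(inbound_allowed(scenario_rules, "Internet", 443))        # True
print(inbound_allowed(fixed_rules, "ApiManagement", 3443))     # True
```

The first two results mirror the scenario: user traffic on 443 gets through while the control plane is blocked on 3443.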

The symptom "degraded gateway with failing health probes" with user traffic working normally is exactly the pattern of a control plane failure, not a data plane failure. This eliminates alternatives A and C.

The information about monitoring VMs operating normally is irrelevant and was inserted to induce the wrong diagnosis of alternative C. The fact that deployment "completed successfully" is also a trap: APIM can provision before validating NSG rules, making post-deployment failure the expected behavior when network configuration is incomplete.

The most dangerous distractor is A: it focuses on subnet size even though the logs show no evidence of IP exhaustion and the instance IPs were allocated successfully during deployment.

Answer Key — Scenario 2

Answer: B

The cause is confirmed, but the critical constraint is that removing the UDR requires security committee approval. Alternative A ignores this governance constraint and represents a technically correct action applied without authorization in an environment with formal controls. In corporate environments, acting outside the approval process can create more serious compliance problems than the original failure.

Alternative C is incorrect because the 99.9% SLA does not tolerate 72 hours of degradation without action, and waiting passively contradicts any incident response obligation.
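A rough calculation shows the scale of the mismatch. The sketch below assumes a 30-day month for the SLA period, which the scenario does not state, and it compares the full 72-hour wait against the downtime budget even though the failures are intermittent, so it is an upper bound on exposure rather than a measured outage.

```python
SLA = 0.999                        # availability target from the scenario
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 30-day measurement period

# Downtime the SLA tolerates over the whole period.
allowed_downtime_min = (1 - SLA) * MINUTES_PER_MONTH

# Length of the proposed wait until the maintenance window.
wait_min = 72 * 60

print(round(allowed_downtime_min, 1))           # 43.2
print(wait_min)                                 # 4320
print(round(wait_min / allowed_downtime_min))   # 100
```

Even if only a small fraction of the 72 hours is actual unavailability, the wait is about 100 times longer than the monthly budget of roughly 43 minutes.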

Alternative D is technically unfeasible in production with active SLA: recreating a Managed Instance is a time-consuming and high-risk operation that could cause a larger downtime window than the current failure.

The correct action is to escalate with evidence and request emergency approval from the team responsible for the UDR, which has autonomy over the NVA. This respects governance controls without ignoring production impact.

Answer Key — Scenario 3

Answer: D

The engineer's error was trusting the manual calculation based on the portal's visible IP listing. Private Endpoints allocate a NIC (Network Interface Card) for each endpoint, and this NIC consumes additional IP addresses that frequently don't appear in the subnet's consolidated allocated IPs view in the portal. The actual count of available addresses is lower than manually calculated.
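The gap between the engineer's figure and the platform's figure can be made explicit with the numbers from the scenario. This is only a sketch of the arithmetic; the variable names are illustrative, and the values are copied from the engineer's notes and the error message.

```python
import ipaddress

subnet = ipaddress.ip_network("10.1.4.0/27")

AZURE_RESERVED = 5        # per subnet, as listed in the engineer's notes
counted_in_use = 9        # "Total used" from the engineer's notes
reported_available = 14   # "Available" from the InsufficientSubnetSize error

total = subnet.num_addresses
manual_available = total - AZURE_RESERVED - counted_in_use
unaccounted = manual_available - reported_available

print(total)             # 32
print(manual_available)  # 18
print(unaccounted)       # 4
```

The four unaccounted addresses are the ones the manual count missed. When the platform's number and a hand count disagree, the platform's number is the one the deployment is validated against.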

Alternative B describes the same phenomenon imprecisely by mentioning "additional block of invisible IPs," which is not technically correct. The Private Endpoint's NIC is a real and visible resource if inspected individually; the problem is that the portal's subnet view doesn't consolidate it correctly in all contexts.

Alternative A is a plausible distractor because delegation is indeed necessary for some Container Apps services, but it's not the cause of the InsufficientSubnetSize error, which is explicitly an addressing problem.

The information about the engineer's decision to "retry deployment without changes" is irrelevant to identifying the root cause, but was included to test whether the reader focuses on the actual technical symptom, not the engineer's behavior.

Answer Key — Scenario 4

Answer: A

The correct sequence is 2 -> 3 -> 1 -> 4 -> 5, which follows the logic of progressive elimination from simplest to most complex.

Step 2 first: confirming that subnets are in the same VNet immediately eliminates external routing causes. Traffic within the same VNet doesn't need peering, and validating this takes seconds.

Step 3 second: with network scope confirmed, checking NSGs on the destination subnet is the most common cause of intra-VNet blocking. It's simple, direct, and resolves most cases.

Step 1 third: UDRs are the second most common cause of route deviation, but they require more analysis. Investigating them only makes sense after NSGs have been ruled out.

Step 4 fourth: IP Flow Verify validates the hypothesis from previous steps with an active test. Using it before steps 3 and 1 inverts the logic: it confirms the blocking but doesn't say where it's configured.

Step 5 last: investigating Firewall or NVA is the most complex and invasive path, appropriate only when more direct causes have been eliminated.

Alternative B makes the classic mistake of starting with the diagnostic tool (IP Flow Verify) before inspecting configurations, which can confirm the symptom without getting any closer to its cause.
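For Step 1, it helps to recall how a UDR can break intra-VNet traffic while leaving external traffic untouched. Azure selects the route with the longest matching prefix and, when prefixes are equal, prefers user-defined routes over BGP routes and BGP routes over system routes. The sketch below models that selection; the Route class, the effective_route function, the address plan (VNet 10.1.0.0/16, db-subnet 10.1.2.0/24), and the appliance address are hypothetical and are not taken from the scenario.

```python
import ipaddress
from dataclasses import dataclass

# Tie-break order when two routes share the same prefix length.
SOURCE_RANK = {"User": 0, "BGP": 1, "Default": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str   # "User", "BGP", or "Default" (system route)


def effective_route(routes, destination):
    """Pick the matching route: longest prefix first, then source precedence."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, SOURCE_RANK[r.source]),
    )


# Hypothetical route table as seen from app-subnet.
routes = [
    Route("10.1.0.0/16", "VnetLocal", "Default"),
    Route("0.0.0.0/0", "Internet", "Default"),
    Route("10.1.2.0/24", "VirtualAppliance 10.1.9.4", "User"),  # UDR toward db-subnet
]

print(effective_route(routes, "10.1.2.10").next_hop)  # VirtualAppliance 10.1.9.4
print(effective_route(routes, "8.8.8.8").next_hop)    # Internet
```

In this model the UDR captures only traffic bound for db-subnet, so a faulty next hop reproduces the alert exactly: app-subnet loses db-subnet while external connectivity stays normal.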

Troubleshooting Tree: Plan and configure shared or dedicated subnets

Color Legend:

Color	Node Type
Dark Blue	Initial symptom (entry point)
Blue	Diagnostic question (binary decision or by state)
Red	Identified cause
Green	Recommended action or resolution
Orange	Validation or intermediate verification

To use this tree when facing a real problem, start at the root node by identifying whether the failure occurred at deployment time or after it. Follow the branches by answering each question based on what is directly observable: error message, NSG state, UDR presence, control plane behavior. Resist the impulse to jump to more complex questions like UDRs or NVAs before eliminating simple causes like NSG and delegation. The orange nodes mark points where active validation with a tool is necessary before acting, avoiding premature corrections based on assumptions.
