
Troubleshooting Lab: Choose between public and internal load balancers

Diagnostic Scenarios​

Scenario 1: Root Cause

The operations team receives the following alert at 2:32 PM: instances in an internal application tier have health probe failures from the load balancer, and traffic from the frontend has stopped being distributed among them. The environment is described below:

Load Balancer SKU: Standard (Internal)
Frontend IP: 10.1.2.10 (subnet: app-subnet, 10.1.2.0/24)
Backend pool: vm-app-01 (10.1.3.4), vm-app-02 (10.1.3.5)
Health Probe: HTTP, port 8080, path /health
Probe interval: 5s, unhealthy threshold: 2
NSG on VM NICs: allows TCP 443 inbound from any source
Last change: new application version deployed at 2:20 PM
Azure Monitor Metric: DipAvailability = 0 for both VMs

The engineer verifies that the VMs are running, respond to internal pings, and the operating system is stable. The application service started without apparent errors in the OS log.

What is the root cause of the observed failure?

A) The Internal Load Balancer frontend IP is in a different subnet from the backend pool VMs' subnet, which prevents proper health probe routing.

B) The NSG associated with the VM NICs blocks health probe traffic, as it only allows TCP 443 and the probe is configured on port 8080.

C) The 5-second health probe interval is insufficient for applications that take more than 10 seconds to respond, generating false negatives.

D) The Standard Load Balancer health probe requires the /health path to return HTTP 200; since the deployment occurred minutes before, the new version probably changed this endpoint's behavior.


Scenario 2: Action Decision

The network team confirmed why the VMs behind a production Standard SKU Public Load Balancer cannot establish outbound connections to the internet: the load balancer has no Outbound Rules configured, and the VMs in the backend pool have no public IPs assigned directly.

The environment has the following restrictions:

  • The load balancer serves external client requests 24/7, with no agreed maintenance window
  • Any changes to the frontend IP or load balancing rules require change management approval with 48 hours advance notice
  • The team has permission to add and modify resources in the subnet without prior approval
  • There is a NAT Gateway already provisioned in the subscription, currently unassociated with any subnet

What is the correct action to take at this moment?

A) Create an Outbound Rule in the Public Load Balancer, allocating a new public IP for outbound traffic.

B) Associate the already provisioned NAT Gateway to the subnet where the VMs are located, solving the outbound problem without changing the load balancer.

C) Assign a public IP directly to each VM's NIC so they can make outbound connections independent of the load balancer.

D) Wait for an approved maintenance window before any intervention, as any change may impact the frontend.


Scenario 3: Root Cause

A developer reports that a client application, running on a VM in vnet-prod (10.0.0.0/16), cannot reach the internal endpoint of an API balanced by an Internal Load Balancer. The ILB frontend IP is 10.0.5.100.

Client VM test result:
$ curl -v http://10.0.5.100:80/api/test
* Trying 10.0.5.100:80...
* connect to 10.0.5.100 port 80 failed: Connection timed out

ILB Configuration:
Frontend IP: 10.0.5.100 (subnet: ilb-frontend-subnet, 10.0.5.0/24)
Backend pool: vm-api-01 (10.0.6.4), vm-api-02 (10.0.6.5)
LB Rule: TCP 80 -> TCP 80
Health probe: TCP 80, Status: Healthy (both VMs)
Floating IP: Disabled

Client VM: 10.0.1.15 (subnet: client-subnet, 10.0.1.0/24)
NSG on client subnet: allows all outbound traffic
NSG on ilb-frontend-subnet: allows inbound TCP 80 from 10.0.0.0/16

The engineer confirms that both backend VMs respond correctly to direct requests on port 80. The health probe is marking both as healthy. The VNet has no peering with other networks.

What is the root cause of the problem?

A) The NSG on subnet ilb-frontend-subnet is blocking traffic, as the rule only allows range 10.0.0.0/16 and the client VM is outside this range.

B) The Internal Load Balancer doesn't route traffic from clients in different subnets from the frontend subnet, requiring client and frontend to be in the same subnet.

C) The NSG associated with the backend pool VMs' subnet is blocking traffic forwarded by the load balancer, preventing requests from reaching the instances.

D) The client VM is trying to reach the ILB frontend IP directly, but there's an NSG on subnet client-subnet or on the client VM's NIC blocking outbound traffic to port 80 toward range 10.0.5.0/24.


Scenario 4: Collateral Impact

An engineer identified that the VMs behind a Standard SKU Public Load Balancer had no outbound internet access. To solve this, he left the VMs in the Standard Load Balancer backend pool and associated a public IP directly with each VM's NIC, immediately restoring outbound connectivity.

The outbound problem was solved. The load balancer continues operating and the inbound load balancing rules remain active.

What secondary consequence can this action cause?

A) The VMs start responding to inbound requests directly through the NIC's public IP, bypassing the load balancer and exposing them without frontend rule protections.

B) The Standard Load Balancer loses the ability to health probe the VMs removed from the backend pool, making availability metrics inaccurate in Azure Monitor.

C) By associating a public IP to a VM's NIC that's still in a Standard Load Balancer backend pool, outbound traffic uses the NIC's IP, but inbound traffic continues being routed by the load balancer, which can create route asymmetry for established connections.

D) The public IP associated with the VM's NIC is automatically promoted to the load balancer frontend, overwriting existing rules and causing failure in other clients' inbound connections.


Answer Key and Explanations​

Answer Key: Scenario 1

Answer: B

The central clue is in the combination of two pieces of data: the health probe is configured on port 8080 via HTTP, and the NSG on the VM NICs only allows TCP 443 inbound. The Standard Load Balancer sends health probes directly to the VM NICs, and this traffic must be explicitly allowed by the NSG. Since port 8080 isn't allowed, the probes fail, and the load balancer marks both instances as unhealthy, stopping traffic distribution.
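The port mismatch can be reproduced with a small rule-matching sketch. It models only what the scenario states (a single inbound allow for TCP 443) plus an assumed catch-all deny beneath it. The `Rule` type, the priority numbers, and the deny rule are illustrative assumptions, and the default rules that a real NSG carries are not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    protocol: str          # "TCP", "UDP", or "*" for any
    port: Optional[int]    # None matches any destination port
    action: str            # "Allow" or "Deny"


def evaluate(rules, protocol, port):
    """Return the action of the first rule, by priority, that matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.protocol in ("*", protocol) and rule.port in (None, port):
            return rule.action
    return "Deny"  # nothing matched: treat the packet as dropped


nic_nsg = [
    Rule(100, "TCP", 443, "Allow"),   # the only allow stated in the scenario
    Rule(4096, "*", None, "Deny"),    # assumed catch-all deny (hypothetical)
]

print(evaluate(nic_nsg, "TCP", 443))   # application traffic -> Allow
print(evaluate(nic_nsg, "TCP", 8080))  # health probe port   -> Deny
```

Under these assumptions the probe on port 8080 never matches the allow rule, which is consistent with `DipAvailability = 0` on both VMs at once.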

The distracting information in this scenario is the VMs' operational state: responding to ping and having a stable OS says nothing about whether an HTTP health probe on port 8080 can reach the VM.

Alternative A reflects a common misconception: an ILB frontend can legitimately sit in a different subnet from the backend VMs. C is theoretically plausible, but a 5-second interval is a common setting and nothing in the scenario suggests a slow application. D describes a real deployment risk, but the statement says the service started without errors and gives no indication that the /health endpoint changed; the NSG rule that omits port 8080 better explains why both VMs failed at the same time.

The most dangerous distractor is D, as the deployment timing creates a tempting correlation that can divert diagnosis from the real cause.


Answer Key: Scenario 2

Answer: B

The critical restriction is that changes to the load balancer's frontend IP or load balancing rules require approval with 48 hours' advance notice. Creating an Outbound Rule with a new public IP (alternative A) changes the load balancer resource itself, since the new address is added as a frontend IP configuration. Assigning public IPs to the VM NICs (alternative C) would restore outbound access, but it creates unnecessary public exposure and ignores the NAT Gateway that is already available.

The statement says explicitly that the team has permission to modify subnet resources without prior approval. Associating the existing NAT Gateway with the subnet is exactly this type of operation: a subnet change, not a load balancer change, that solves the outbound problem without touching the frontend or the load balancing rules.
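The elimination above can be written as a simple filter over the four alternatives. This is only a sketch of the reasoning: the `Option` type and its boolean flags are a hypothetical encoding of how the answer key characterizes each alternative, not properties read from Azure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    label: str
    restores_outbound: bool
    changes_load_balancer: bool   # would trigger the 48-hour approval
    adds_public_exposure: bool


# Flags follow the answer key's description of each alternative.
options = [
    Option("A: outbound rule with new public IP", True, True, False),
    Option("B: associate NAT Gateway with subnet", True, False, False),
    Option("C: public IP on each VM NIC", True, False, True),
    Option("D: wait for a maintenance window", False, False, False),
]


def actionable_now(option):
    """Keep options that fix the problem within the stated restrictions."""
    return (
        option.restores_outbound
        and not option.changes_load_balancer
        and not option.adds_public_exposure
    )


chosen = [o.label for o in options if actionable_now(o)]
print(chosen)  # ['B: associate NAT Gateway with subnet']
```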

Alternative D is the most dangerous distractor in corporate environments: the tendency to wait for approval even when a permitted action is available can unnecessarily prolong a production incident.


Answer Key: Scenario 3

Answer: D

The correct diagnosis requires eliminating apparent causes and focusing on the layer that wasn't verified. The health probe is healthy, VMs respond directly, the frontend subnet NSG allows the correct range, and there's no peering with external networks. The client VM is at 10.0.1.15, which belongs to 10.0.0.0/16, so it's within the range allowed by the frontend NSG.
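The range check is easy to confirm with Python's standard `ipaddress` module. The addresses and prefixes below are taken from the scenario; only the variable names are introduced here.

```python
import ipaddress

client = ipaddress.ip_address("10.0.1.15")
allowed_source = ipaddress.ip_network("10.0.0.0/16")   # frontend NSG rule
frontend_subnet = ipaddress.ip_network("10.0.5.0/24")
frontend_ip = ipaddress.ip_address("10.0.5.100")

print(client in allowed_source)        # True: the frontend NSG admits the client
print(frontend_ip in frontend_subnet)  # True: the frontend IP is in its subnet
print(client in frontend_subnet)       # False: client and frontend subnets differ
```

The first result rules out alternative A, and the last one shows that the client sits in a different subnet from the frontend, which an ILB supports.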

What the statement does not verify is outbound filtering on the client side toward subnet 10.0.5.0/24 on port 80. The client subnet NSG is described as allowing all outbound traffic, but that covers only the subnet level. Azure evaluates outbound traffic against the NSG on the VM's NIC first and then against the subnet NSG, and both must allow it, so a more restrictive NIC-level NSG drops the traffic regardless of the subnet's permissive rule.
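A minimal sketch of that two-level outbound check follows. The subnet rule reflects the scenario; the NIC-level deny rule is hypothetical, added only to show how a restrictive NIC NSG produces the observed timeout. Rule tuples and function names are illustrative assumptions.

```python
import ipaddress


def allows(nsg_rules, dest_ip, dest_port):
    """First matching rule wins; rules are (priority, cidr, port, action)."""
    dest = ipaddress.ip_address(dest_ip)
    for _, cidr, port, action in sorted(nsg_rules, key=lambda r: r[0]):
        if dest in ipaddress.ip_network(cidr) and port in (None, dest_port):
            return action == "Allow"
    return False


# Stated in the scenario: the client subnet allows all outbound traffic.
subnet_nsg = [(100, "0.0.0.0/0", None, "Allow")]

# Hypothetical NIC-level NSG that the statement never rules out.
nic_nsg = [
    (100, "10.0.5.0/24", 80, "Deny"),
    (200, "0.0.0.0/0", None, "Allow"),
]


def outbound_permitted(dest_ip, dest_port):
    """Outbound traffic must pass the NIC NSG and then the subnet NSG."""
    return allows(nic_nsg, dest_ip, dest_port) and allows(subnet_nsg, dest_ip, dest_port)


print(outbound_permitted("10.0.5.100", 80))  # False: blocked at the NIC
print(outbound_permitted("10.0.6.4", 80))    # True: direct backend access still works
```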

Alternative A is the most attractive distractor, but fails the math: 10.0.1.15 is within 10.0.0.0/16. B describes behavior that doesn't exist: the ILB routes traffic from any source within the VNet (or connected networks), not just from the same subnet as the frontend. C is plausible, but the statement affirms that probes are healthy, meaning traffic from the load balancer reaches the VMs.


Answer Key: Scenario 4

Answer: C

When a VM is in a Standard Load Balancer backend pool and also has a public IP associated with its NIC, Azure applies the following precedence rule: outbound traffic uses the NIC's public IP, while inbound traffic is still routed by the load balancer. This creates route asymmetry for established TCP connections: the inbound packet arrives via the load balancer (frontend IP), but the response exits through the NIC's public IP. For stateless exchanges, this behavior may be imperceptible, but for long-lived TCP connections or in scenarios with stateful inspection, the asymmetry can cause drops and intermittent failures.
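The precedence rule can be sketched as a short selection function. The text above states only that the NIC's public IP takes precedence over the load balancer for outbound traffic; placing a subnet NAT Gateway ahead of both is an assumption based on Azure's documented outbound ordering, and the function name and return strings are hypothetical.

```python
def outbound_source(nat_gateway_on_subnet, public_ip_on_nic, lb_outbound_rule):
    """Pick the outbound path for a VM in a Standard Load Balancer pool.

    Assumed ordering: NAT Gateway, then NIC public IP, then outbound rule.
    """
    if nat_gateway_on_subnet:
        return "NAT Gateway"
    if public_ip_on_nic:
        return "NIC public IP"
    if lb_outbound_rule:
        return "load balancer frontend IP"
    return "no outbound path"


# Scenario 4 before the change: no outbound rule, no NIC public IP.
print(outbound_source(False, False, False))  # no outbound path

# Scenario 4 after the change: the NIC public IP carries outbound traffic.
print(outbound_source(False, True, False))   # NIC public IP

# Scenario 2's fix: the subnet NAT Gateway carries outbound traffic.
print(outbound_source(True, False, False))   # NAT Gateway
```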

Alternative A describes a real but incomplete risk: the VMs remain in the backend pool, so inbound traffic to the frontend still goes through the load balancer. B is false on its premise: the VMs stay in the backend pool, so the load balancer keeps probing them and the availability metrics remain accurate. D describes behavior that doesn't exist in the Azure platform: a NIC's public IP is never automatically promoted to a load balancer frontend.


Troubleshooting Tree: Choose between public and internal load balancers​


Color Legend:

Color       Node Type
Dark Blue   Initial symptom (entry point)
Blue        Diagnostic question (binary decision or observable state)
Red         Identified cause
Green       Recommended action or resolution
Orange      Intermediate validation or verification

To use this tree when facing a real problem, start with the root node by identifying whether the failure is inbound or outbound traffic. Follow each branch by objectively answering the node's question based on what you can observe or measure in the environment, such as health probe status, NSG rules, presence of Outbound Rules, and load balancer type. Each path ends in a named cause or a concrete action, preventing diagnosis from drifting toward unverified hypotheses.
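The walk described above can be sketched as nested checks. This is not a transcription of the tree: the branches, observation keys, and leaf texts are hypothetical, assembled only from the causes and actions that appear in the four scenarios.

```python
def diagnose(obs):
    """Walk a tiny decision tree; `obs` maps observation names to values."""
    # Root node: is the failure on inbound or outbound traffic?
    if obs["direction"] == "outbound":
        has_path = (
            obs.get("nat_gateway_on_subnet")
            or obs.get("public_ip_on_nic")
            or obs.get("outbound_rule")
        )
        if not has_path:
            return "Cause: no outbound path. Action: associate a NAT Gateway or add an Outbound Rule"
        return "Outbound path exists: verify NSG outbound rules"

    # Inbound branch: start from the health probe status.
    if not obs.get("probe_healthy"):
        if not obs.get("nsg_allows_probe_port"):
            return "Cause: NSG blocks the health probe port"
        return "Verify the application's probe endpoint"
    if not obs.get("client_nsg_allows_outbound"):
        return "Cause: client-side NSG blocks traffic to the frontend"
    return "Verify the load balancing rule and backend NSG"


# Observations from Scenario 1, encoded under the assumptions above.
print(diagnose({
    "direction": "inbound",
    "probe_healthy": False,
    "nsg_allows_probe_port": False,
}))
```

Each return value plays the role of a red (cause) or green (action) leaf, and every `if` corresponds to a blue question answered from something observable.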